# Question:

My question: what is the most efficient way of removing a predictor with `NA`s and considering the complete cases excluding that predictor?

The question arises from the following regression situation with `NA`s, in which there are missing values in `Ozone` (mostly) and `Solar.R`.

``````
data(airquality)
summary(airquality)
#     Ozone           Solar.R           Wind             Temp           Month
# Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
# 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
# Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
# Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
# 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
# Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
# NA's   :37       NA's   :7
#      Day
# Min.   : 1.0
# 1st Qu.: 8.0
# Median :16.0
# Mean   :15.8
# 3rd Qu.:23.0
# Max.   :31.0
``````

Regressing `Wind` on the remaining variables uses only the complete cases:

``````
summary(lm(Wind ~ ., data = airquality))
#
# Call:
# lm(formula = Wind ~ ., data = airquality)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -4.3908 -2.2800 -0.3078  1.4132  9.6501
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) 15.519460   2.918393   5.318 5.96e-07 ***
# Ozone       -0.060746   0.011798  -5.149 1.23e-06 ***
# Solar.R      0.003791   0.003216   1.179    0.241
# Temp        -0.036604   0.044576  -0.821    0.413
# Month       -0.159671   0.208082  -0.767    0.445
# Day          0.017353   0.031238   0.556    0.580
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.822 on 105 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.3994,  Adjusted R-squared:  0.3708
# F-statistic: 13.96 on 5 and 105 DF,  p-value: 1.857e-10
``````

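If `Ozone` is removed with `. - Ozone`, the fit still uses only the cases that are complete when `Ozone` is included (first fit below). This differs from removing `Ozone` by listing the remaining predictors manually (second fit below), which drops only the rows with missing `Solar.R`.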

``````
summary(lm(Wind ~ . - Ozone, data = airquality))
#
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -6.012 -2.323 -0.361  1.493  9.605
#
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) 24.3159074  2.6354288   9.227 3.09e-15 ***
# Solar.R      0.0009228  0.0035281   0.262    0.794
# Temp        -0.1900820  0.0369159  -5.149 1.21e-06 ***
# Month        0.0313046  0.2280600   0.137    0.891
# Day          0.0008969  0.0346116   0.026    0.979
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.143 on 106 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.2477,  Adjusted R-squared:  0.2193
# F-statistic: 8.727 on 4 and 106 DF,  p-value: 3.961e-06

summary(lm(Wind ~ Solar.R + Temp + Month + Day, data = airquality))
#
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Month + Day, data = airquality)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -8.1779 -2.2063 -0.2757  1.9448  9.3510
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.660271   2.416766   9.790  < 2e-16 ***
# Solar.R      0.002980   0.003113   0.957    0.340
# Temp        -0.186386   0.032725  -5.695 6.89e-08 ***
# Month        0.074952   0.206334   0.363    0.717
# Day         -0.011028   0.030304  -0.364    0.716
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.158 on 141 degrees of freedom
#   (7 observations deleted due to missingness)
# Multiple R-squared:  0.2125,  Adjusted R-squared:  0.1901
# F-statistic: 9.511 on 4 and 141 DF,  p-value: 7.761e-07
``````
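
The difference in the number of deleted observations can be checked directly with `complete.cases`. This is a minimal sketch, assuming the built-in `airquality` data; the names `n_all`, `no_ozone`, and `n_no_ozone` are only illustrative.

```r
data(airquality)

# Rows with no NA in any column (the rows used by `Wind ~ .` and `Wind ~ . - Ozone`)
n_all <- sum(complete.cases(airquality))

# Rows with no NA once the Ozone column is dropped from the data
no_ozone <- airquality[, setdiff(names(airquality), "Ozone")]
n_no_ozone <- sum(complete.cases(no_ozone))

# Total rows, complete rows with Ozone, complete rows without Ozone
c(rows = nrow(airquality), all_columns = n_all, without_ozone = n_no_ozone)
```

The gaps between the total row count and the two complete-case counts should match the 42 and 7 deleted observations reported in the summaries above.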

## Comments on the question:

In your first case `Ozone` is removed from the analysis but still appears in the formula, so rows where `Ozone` is `NA` are still dropped. In the second case `Ozone` is not in the formula, so its `NA`s are ignored entirely. The two methods give exactly the same result if the variable you want to remove has no `NA` values.
Yes, I agree. But how could `Ozone` be removed more efficiently than by manually specifying the variables to include? I like the `~ . -` approach in the formula; it is somewhat striking that it does not work as expected in this case.
– epsilone
3 hours ago
Simply create another dataset that has no `Ozone` and use that with `y ~ .`
I really like this question. Debugging through `stats::model.frame.default` (which is where the complete-cases calculation is done), we find that the result of `terms(Wind~.-Ozone)` does indeed contain `Ozone` in its list of variables. This seems wrong to me, but so deeply embedded that I don’t think it’s going to get changed …
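
The observation in the last comment can be reproduced without a debugger. This is a minimal sketch, assuming base R's `terms`, `all.vars`, and `model.frame` and the built-in `airquality` data; the object names `tt` and `mf` are illustrative.

```r
data(airquality)

# Expand the dot against the data, then inspect the resulting terms object
tt <- terms(Wind ~ . - Ozone, data = airquality)

attr(tt, "term.labels")  # terms that enter the fit
all.vars(tt)             # every variable named in the expanded formula

# model.frame() evaluates all variables named in the formula and only then
# applies the NA handling, so a missing Ozone can still remove a row
mf <- model.frame(Wind ~ . - Ozone, data = airquality)
nrow(mf)
"Ozone" %in% names(mf)
```

The term labels should not list `Ozone`, while the variable list and the model frame should, which is consistent with the 42 deleted observations in the question.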

# Answers:

## Answer 1:

It is indeed unfortunate and surprising that `Wind ~ . - Ozone` considers `Ozone` when finding complete cases; seems worth discussion on the `r-devel@r-project.org` mailing list, if you want to pursue it. In the meantime, how about

``````
summary(lm(Wind ~ ., data = subset(airquality, select = -Ozone)))
``````

?
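
If several predictors need to be excluded, another option is to build the formula from the column names so that `Ozone` never appears in it. This is a sketch using base R's `reformulate` and `setdiff`, not something proposed in the thread; `predictors`, `f`, `fit`, and `fit_subset` are illustrative names.

```r
data(airquality)

# Every column except the response and the predictor to drop
predictors <- setdiff(names(airquality), c("Wind", "Ozone"))

# Builds the formula Wind ~ Solar.R + Temp + Month + Day
f <- reformulate(predictors, response = "Wind")

fit <- lm(f, data = airquality)
nobs(fit)  # number of rows actually used in the fit

# For comparison: the subset() approach from the answer
fit_subset <- lm(Wind ~ ., data = subset(airquality, select = -Ozone))
all.equal(coef(fit), coef(fit_subset))
```

Because `Ozone` is absent from the formula, only the 7 rows with a missing `Solar.R` should be dropped, as in the manually specified model in the question.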

## Comments on the answer:

Good approach and thanks for the suggestion on the `r-devel@r-project.org` list. I am going to send the question there as well.
– epsilone
2 hours ago

## Original URL:

https://stackoverflow.com/questions/47753117/missing-data-behaviour-in-lm-complete-cases-used-even-with-predictors-without-m
