Missing data behaviour in lm: complete cases used even with predictors without missing data

Question:

My question: what is the most efficient way of removing a predictor with NAs and then considering the complete cases excluding that predictor?

The question arises from the following regression situation with NAs, in which there are missing values in Ozone (mostly) and Solar.R.

data(airquality)
summary(airquality)
#     Ozone           Solar.R           Wind             Temp           Month      
# Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
# 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
# Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
# Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
# 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
# Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
# NA's   :37       NA's   :7                                                       
#      Day      
# Min.   : 1.0  
# 1st Qu.: 8.0  
# Median :16.0  
# Mean   :15.8  
# 3rd Qu.:23.0  
# Max.   :31.0  

Regression of Wind on the remaining variables. lm considers only the complete cases, i.e. the rows with no NA in any variable of the formula.

summary(lm(Wind ~ ., data = airquality))
# 
# Call:
# lm(formula = Wind ~ ., data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.3908 -2.2800 -0.3078  1.4132  9.6501 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 15.519460   2.918393   5.318 5.96e-07 ***
# Ozone       -0.060746   0.011798  -5.149 1.23e-06 ***
# Solar.R      0.003791   0.003216   1.179    0.241    
# Temp        -0.036604   0.044576  -0.821    0.413    
# Month       -0.159671   0.208082  -0.767    0.445    
# Day          0.017353   0.031238   0.556    0.580    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.822 on 105 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.3994,  Adjusted R-squared:  0.3708 
# F-statistic: 13.96 on 5 and 105 DF,  p-value: 1.857e-10

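If Ozone is removed with - Ozone in the formula, lm still uses only the rows that are complete when Ozone is included (42 observations deleted). This differs from manually listing the predictors without Ozone, where only the 7 rows with missing Solar.R are dropped.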

summary(lm(Wind ~ . - Ozone, data = airquality))
# 
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -6.012 -2.323 -0.361  1.493  9.605 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 24.3159074  2.6354288   9.227 3.09e-15 ***
# Solar.R      0.0009228  0.0035281   0.262    0.794    
# Temp        -0.1900820  0.0369159  -5.149 1.21e-06 ***
# Month        0.0313046  0.2280600   0.137    0.891    
# Day          0.0008969  0.0346116   0.026    0.979    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.143 on 106 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.2477,  Adjusted R-squared:  0.2193 
# F-statistic: 8.727 on 4 and 106 DF,  p-value: 3.961e-06

summary(lm(Wind ~ Solar.R + Temp + Month + Day, data = airquality))
# 
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Month + Day, data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -8.1779 -2.2063 -0.2757  1.9448  9.3510 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 23.660271   2.416766   9.790  < 2e-16 ***
# Solar.R      0.002980   0.003113   0.957    0.340    
# Temp        -0.186386   0.032725  -5.695 6.89e-08 ***
# Month        0.074952   0.206334   0.363    0.717    
# Day         -0.011028   0.030304  -0.364    0.716    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.158 on 141 degrees of freedom
#   (7 observations deleted due to missingness)
# Multiple R-squared:  0.2125,  Adjusted R-squared:  0.1901 
# F-statistic: 9.511 on 4 and 141 DF,  p-value: 7.761e-07
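
The difference between the two fits comes down to which rows survive NA handling. A minimal sketch of how this could be checked without fitting anything, assuming the default na.action (na.omit) is in effect; the variable names n_dot_minus, n_explicit and friends are illustrative only:

```r
data(airquality)

# Rows kept when Ozone is subtracted inside the formula
n_dot_minus <- nrow(model.frame(Wind ~ . - Ozone, data = airquality))

# Rows kept when the predictors are listed explicitly, without Ozone
n_explicit <- nrow(model.frame(Wind ~ Solar.R + Temp + Month + Day,
                               data = airquality))

# Complete cases of the full data set, and of the data set without Ozone
n_complete_all      <- sum(complete.cases(airquality))
n_complete_no_ozone <- sum(complete.cases(airquality[, names(airquality) != "Ozone"]))

c(dot_minus = n_dot_minus, explicit = n_explicit,
  complete_all = n_complete_all, complete_no_ozone = n_complete_no_ozone)
```

The first and third counts should agree (153 rows minus the 42 deleted above), as should the second and fourth (153 minus 7).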

Comments on the question:

1  
In the . - Ozone case, Ozone is excluded from the fitted model but still appears in the formula, so rows with NA in Ozone are still removed. In the case where the predictors are listed explicitly, Ozone is not in the formula, so its NAs are ignored entirely. The two methods give exactly the same result if there are no NA values in the variable you want to remove.
1  
Yes, I agree. But how could Ozone be removed more efficiently than by manually specifying the variables to be included? I like the ~ . - approach in the formula, and it is somewhat surprising that it does not work as expected in this case.
– epsilone
3 hours ago
    
Simply create another dataset that has no Ozone and use that with y ~ .
3  
I really like this question. Debugging through stats::model.frame.default (which is where the complete-cases calculation gets done), we find that the result of terms(Wind~.-Ozone) does indeed contain Ozone in its list of variables. This seems wrong to me, but so deeply embedded that I don’t think it’s going to get changed …
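
The observation about terms() can be reproduced without a debugger. A small sketch, assuming the "variables" attribute of the terms object is what model.frame evaluates before applying the NA action; tt, model_terms and frame_vars are illustrative names:

```r
data(airquality)

# Expand the dot against the data, then subtract Ozone
tt <- terms(Wind ~ . - Ozone, data = airquality)

# Terms that end up in the model matrix: Ozone is gone
model_terms <- attr(tt, "term.labels")

# Variables evaluated to build the model frame: Ozone is still listed
frame_vars <- all.vars(attr(tt, "variables"))

model_terms
frame_vars
```

Because Ozone is still among the variables placed in the model frame, its NAs take part in the complete-cases filtering even though it has no column in the model matrix.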

Answers:

Answer 1:

It is indeed unfortunate and surprising that Wind ~ . - Ozone considers Ozone when finding complete cases; seems worth discussion on the r-devel@r-project.org mailing list, if you want to pursue it. In the meantime, how about

 summary(lm(Wind ~ ., data = subset(airquality, select = -Ozone)))

?
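
If dropping the column from the data is inconvenient, another option is to build the formula programmatically so that Ozone never enters it. This is only a sketch of one possible alternative, not something proposed in the thread; predictors, f, fit_formula and fit_subset are illustrative names:

```r
data(airquality)

# All columns except the response and the predictor to drop
predictors <- setdiff(names(airquality), c("Wind", "Ozone"))

# Builds Wind ~ Solar.R + Temp + Month + Day
f <- reformulate(predictors, response = "Wind")
fit_formula <- lm(f, data = airquality)

# The subset() approach from the answer, for comparison
fit_subset <- lm(Wind ~ ., data = subset(airquality, select = -Ozone))

nobs(fit_formula)                                   # rows actually used
all.equal(coef(fit_formula), coef(fit_subset))      # same estimates
```

Both fits use only the rows that are complete in the remaining variables, so they should match the third summary in the question rather than the . - Ozone fit.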

Comments on the answer:

    
Good approach and thanks for the suggestion on the r-devel@r-project.org list. I am going to send the question there as well.
– epsilone
2 hours ago

Original source:

https://stackoverflow.com/questions/47753117/missing-data-behaviour-in-lm-complete-cases-used-even-with-predictors-without-m
