Erh-Chi Hsu - Lecture 3 Design Based Regression Analysis

Course Information

Course: PFRH 712 - Methods in Analysis of Large Population Surveys
Instructor: Dr. Saifuddin Ahmed

Overview

This lecture covers the principles and applications of design-based regression analysis of complex survey data in R. It introduces when to use lm(), glm(), and svyglm(), and how to pair svyglm() with a survey design object built with the survey package. It emphasizes best practices in model fitting, diagnostics, and interpretation, and highlights common pitfalls such as stepwise regression and reading pseudo R² as if it were R².

4 Steps for Fitting a Model

1. Specify

Use literature and a conceptual framework to guide covariate selection
Avoid stepwise regression
> “Stepwise variable selection is one of the most widely used and abused of all data analysis techniques.” – Frank Harrell
Use theory-driven or evidence-based model building

2. Fit

📘 Linear regression with survey weights

library(survey)
# Define the survey design
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = df)

# Linear regression
model_lm <- svyglm(gfr ~ cpr, design = des)
summary(model_lm)
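To see what the design object changes, the following is a minimal sketch that fits the same linear model twice. It assumes the survey package is installed and uses its bundled api example data (apiclus1, with dnum, pw, and fpc), not the lecture's dataset; the object names des_demo, fit_svy, and fit_lm are illustrative only.

```r
library(survey)

# Example data shipped with the survey package (not the lecture's dataset)
data(api)

# One-stage cluster sample: districts (dnum) are the PSUs, pw is the sampling weight
des_demo <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Design-based fit vs. a weighted lm() that ignores the clustering
fit_svy <- svyglm(api00 ~ ell + meals, design = des_demo)
fit_lm  <- lm(api00 ~ ell + meals, data = apiclus1, weights = pw)

# Point estimates agree; the standard errors are computed differently
comparison <- data.frame(
  estimate  = unname(coef(fit_svy)),
  se_design = unname(sqrt(diag(vcov(fit_svy)))),
  se_naive  = unname(sqrt(diag(vcov(fit_lm)))),
  row.names = names(coef(fit_lm))
)
print(comparison)
```

The coefficients match because both fits solve the same weighted least-squares problem. Only svyglm() accounts for clustering and sampling weights when estimating the standard errors.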

📘 Logistic regression with design

# Binary outcome model
model_logit <- svyglm(mcu ~ age + education, design = des, family = quasibinomial())
summary(model_logit)
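The choice of quasibinomial() rather than binomial() above can be checked directly. This is a sketch under the assumption that the survey package's bundled api data stands in for the lecture's dataset; des_demo, fit_quasi, and fit_binom are illustrative names.

```r
library(survey)
data(api)

des_demo <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# sch.wide is a two-level factor ("No"/"Yes") in the example data
fit_quasi <- svyglm(sch.wide ~ ell + meals, design = des_demo,
                    family = quasibinomial())

# binomial() warns about non-integer successes once weights are applied;
# the warning is suppressed here only to keep the comparison quiet
fit_binom <- suppressWarnings(
  svyglm(sch.wide ~ ell + meals, design = des_demo, family = binomial())
)

# The two families give the same coefficients
same_coef <- isTRUE(all.equal(unname(coef(fit_quasi)),
                              unname(coef(fit_binom)),
                              tolerance = 1e-6))
print(same_coef)
```

With survey weights, quasibinomial() gives the same estimates without the spurious warning, which is why it is the usual choice in svyglm().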

📘 Log-binomial (risk ratio)

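# quasibinomial() avoids the non-integer-successes warning that binomial() raises with survey weights.
# Log-binomial fits can fail to converge; if so, supply starting values or use the Poisson alternative below.
model_logbin <- svyglm(mcu ~ age + education, design = des, family = quasibinomial(link = "log"))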
summary(model_logbin)

📘 Poisson alternative for estimating RR

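# The outcome must be numeric 0/1 here; quasipoisson() avoids warnings caused by non-integer weights.
# svyglm() standard errors are design-based, so no separate robust-SE step is needed.
model_poisson <- svyglm(mcu ~ age + education, design = des, family = quasipoisson(link = "log"))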
summary(model_poisson)
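A runnable version of the Poisson approach might look like the sketch below. It assumes the survey package's bundled api data in place of the lecture's dataset; met_target is a hypothetical 0/1 recode of sch.wide created only for this illustration.

```r
library(survey)
data(api)

# Poisson-type families need a numeric 0/1 outcome, so recode the factor first
apiclus1$met_target <- as.integer(apiclus1$sch.wide == "Yes")

des_demo <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Log link + Poisson-type variance: exponentiated coefficients are risk ratios
fit_rr <- svyglm(met_target ~ ell + meals, design = des_demo,
                 family = quasipoisson(link = "log"))

rr <- exp(coef(fit_rr))
print(round(rr, 3))
```

The recode has to happen before svydesign() is called, because the design object stores its own copy of the data.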

3. Check / Test

📘 Model diagnostics for linear models

Residual vs Fitted Plot

plot(model_lm, which = 1)

Interpretation:

✅ Good fit if points scatter randomly around the horizontal line.

❌ Poor fit if there’s a curve or funnel shape (non-linearity or heteroscedasticity).

Q-Q Plot (Normality of Residuals)

plot(model_lm, which = 2)

Interpretation:

✅ Good fit if points lie approximately on the diagonal line.

❌ Poor fit if points deviate at the ends (skewed distribution).

📘 Heteroscedasticity (Breusch-Pagan test)

library(lmtest)  # provides bptest()
bptest(model_lm)


    studentized Breusch-Pagan test

data:  model_lm
BP = 0.040438, df = 1, p-value = 0.8406

Interpretation:

p > 0.05 → No evidence against homoscedasticity (good!)
p < 0.05 → Evidence of heteroscedasticity (bad: use robust SEs; svyglm() standard errors are already design-based and robust)
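To make the test statistic less of a black box, here is a sketch of the studentized (Koenker) Breusch-Pagan statistic computed by hand in base R. It uses an ordinary lm() on the built-in mtcars data, which is an assumption for illustration; it is not the lecture's model_lm and does not reproduce the output above.

```r
# Ordinary (unweighted) linear model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

# Auxiliary regression: squared residuals on the same regressors
aux <- lm(residuals(fit)^2 ~ wt, data = mtcars)

# Studentized BP statistic = n * R-squared of the auxiliary regression
bp_stat <- nobs(fit) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of regressors, excluding the intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```

If the regressors explain little of the variation in the squared residuals, the statistic is small and the p-value large, which is the homoscedastic case.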

📘 Cook’s Distance (outlier detection)

plot(model_lm, which = 4)

Interpretation:

Points with Cook’s D > 1 may be influential and deserve review.
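The plot can be complemented by listing the flagged observations by name. The sketch below assumes an ordinary lm() on the built-in mtcars data for illustration; the 4/n screening cutoff is a common rule of thumb added here, not something stated in the lecture.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One Cook's distance per observation
d <- cooks.distance(fit)
n <- nobs(fit)

# Lecture threshold (D > 1) and a looser screening rule (D > 4/n, an assumption)
flag_strict <- names(d)[d > 1]
flag_screen <- names(d)[d > 4 / n]

# Largest distances first, for manual review
print(head(sort(d, decreasing = TRUE), 3))
print(flag_screen)
```

A flagged point is a prompt to check the data for that observation, not an automatic reason to drop it.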

📘 Goodness-of-fit for logistic models

# Unweighted illustration on the built-in mtcars data (this reassigns model_logit)
model_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(model_logit)


Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
hp           0.03626    0.01773   2.044  0.04091 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

# Run Hosmer-Lemeshow test
library(ResourceSelection)  # provides hoslem.test()
hoslem.test(mtcars$am, fitted(model_logit))


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  mtcars$am, fitted(model_logit)
X-squared = 4.9517, df = 8, p-value = 0.7627

Interpretation:

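✅ p > 0.05 = no evidence of poor fit

❌ p < 0.05 = evidence of poor fit

Note: hoslem.test() ignores the survey design, so treat it as a rough check when the model comes from svyglm().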

Summary Table

Diagnostic	Good Fit	Poor Fit
Residual Plot	Random scatter	Curved / funnel shape
Q-Q Plot	Points near line	Points deviate
Breusch-Pagan Test	p > 0.05	p < 0.05
Cook’s Distance	All < 1	Some > 1
Hosmer-Lemeshow Test	p > 0.05	p < 0.05

📘 Model comparison using AIC/BIC

AIC(model_logit)
BIC(model_logit)

Use AIC/BIC to compare models fitted to the same data and outcome: lower = better
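A side-by-side comparison could be sketched as follows, using the same unweighted mtcars logistic model as above plus a smaller candidate. The names m_small and m_large are illustrative.

```r
# Two candidate models for the same outcome on the same data
m_small <- glm(am ~ wt,      data = mtcars, family = binomial)
m_large <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Collect both criteria in one table
ic <- data.frame(
  model = c("am ~ wt", "am ~ wt + hp"),
  AIC   = c(AIC(m_small), AIC(m_large)),
  BIC   = c(BIC(m_small), BIC(m_large))
)
print(ic)

# Model with the lowest AIC
best_by_aic <- ic$model[which.min(ic$AIC)]
print(best_by_aic)
```

BIC penalizes extra parameters more heavily than AIC whenever log(n) > 2, so the two criteria can disagree about which model to prefer.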

4. Interpret

R² in linear regression: Proportion of variance explained
Pseudo R² in logistic regression: Not equivalent to R² and should not be interpreted as such
Odds Ratio (OR): exp(b) from logistic regression

exp(coef(model_logit))

Risk Ratio (RR): Use log-binomial or Poisson models with robust SEs

exp(coef(model_poisson))
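Exponentiated coefficients are easier to interpret alongside confidence intervals. The sketch below assumes the unweighted mtcars logistic model from the goodness-of-fit section and uses Wald intervals from confint.default(); for a svyglm() fit, confint() on the model object plays the same role.

```r
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Odds ratios with 95% Wald confidence limits, all on the exponentiated scale
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 3))
```

Each row is read as the multiplicative change in the odds of the outcome per one-unit increase in that covariate, holding the others fixed. The intercept row is the baseline odds, not an odds ratio.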

Key Takeaways

Always use svydesign() and svyglm() when analyzing complex survey data in R
Avoid stepwise regression—prefer theory-driven model building
Validate model assumptions with residual plots and goodness-of-fit tests
Use log-binomial or Poisson models to estimate risk ratios when outcomes are common
AIC and BIC help compare model fit, especially when choosing between linear vs. polynomial or nested models

References

Lecture slides by Dr. Saifuddin Ahmed, PFRH 712 (2025)
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
Harrell, F. (2015). Regression Modeling Strategies. Springer.
Hosmer, D., Lemeshow, S., & Sturdivant, R. (2013). Applied Logistic Regression. Wiley.
R packages: survey, lmtest, ResourceSelection, ggplot2, srvyr