Topic: Methods in Analysis of Large Population
Surveys
Instructor: Dr. Saifuddin Ahmed
Overview
This course focuses on the analysis of data from complex surveys,
whose designs differ from simple random sampling (SRS). Most
national surveys use multistage, stratified, and
clustered sampling designs, which require special statistical
approaches.
Survey Design Types
Simple Random Sampling (SRS): With or without
replacement (SRSWR/SRSWOR)
- In practice, SRSWR is unattractive because we do not want to interview
the same individuals more than once. Mathematically, however, it is much
simpler to relate the sample to the population under SRSWR.
- SRSWOR provides two advantages:
  - elements are not repeated
  - the variance of the estimate is smaller (than under SRSWR with the
  same sample size).
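As a worked illustration of the last point, the variances of the sample mean under the two schemes can be compared directly. The notation here is an assumption, not from the notes: \(N\) is the population size, \(n\) the sample size, and \(\sigma^2\) the population variance (with divisor \(N\)).

\[
\mathrm{Var}_{\text{SRSWR}}(\bar{y}) = \frac{\sigma^2}{n},
\qquad
\mathrm{Var}_{\text{SRSWOR}}(\bar{y}) = \frac{N - n}{N - 1}\cdot\frac{\sigma^2}{n}
\]

Because \((N - n)/(N - 1) \le 1\) for any \(n \ge 1\), the SRSWOR variance is never larger. For example, with \(N = 1000\) and \(n = 100\) the factor is \(900/999 \approx 0.90\), so the variance is about 10% smaller.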
Systematic Sampling: Only the first unit is
selected at random, the rest being selected according to a predetermined
pattern
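- It behaves much like SRS when the list order is unrelated to the
variable of interest.
- Intraclass correlation among units in the same systematic sample (for
example, a periodic pattern in the list) can affect its precision.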
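The selection rule can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the function name, the frame of 100 units, and the seed are all hypothetical, and the interval is taken as the integer part of N / n.

```python
import random

def systematic_sample(population, n, seed=None):
    """Return a 1-in-k systematic sample: random start, then every k-th unit."""
    rng = random.Random(seed)
    k = len(population) // n      # sampling interval (integer part of N / n)
    start = rng.randrange(k)      # the only random choice in the whole procedure
    return [population[start + j * k] for j in range(n)]

frame = list(range(1, 101))       # hypothetical sampling frame of 100 units
sample = systematic_sample(frame, n=10, seed=1)
print(sample)
```

Only `start` is random; every other selected unit follows from it, which is why the ordering of the frame matters.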
Stratified Sampling: Population divided into
mutually exclusive strata (e.g., region, race) – increases precision
- Allows study of variation across strata: estimates can be
made for each stratum
- Total variance = between-stratum variance + within-stratum
variance
- In a stratified sampling design, only the within-stratum variance
contributes to the sampling variance
- Variance within < total variance: the variance of an estimate from
stratified sampling (with proportional allocation) is lower than that
from SRS of equal sample size whenever the stratum means differ
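The decomposition above can be checked numerically. The sketch below uses made-up values for two hypothetical strata and population variances (divisor N), with each stratum weighted by its share of the population.

```python
from statistics import fmean, pvariance

# Hypothetical population values grouped into two strata (made-up numbers).
strata = {
    "urban": [10, 12, 14, 16],
    "rural": [20, 22, 24, 26],
}

all_values = [y for ys in strata.values() for y in ys]
N = len(all_values)
grand_mean = fmean(all_values)

# Within-stratum part: stratum variances weighted by the stratum share N_h / N.
within = sum(len(ys) / N * pvariance(ys) for ys in strata.values())
# Between-stratum part: spread of the stratum means around the grand mean.
between = sum(len(ys) / N * (fmean(ys) - grand_mean) ** 2 for ys in strata.values())
total = pvariance(all_values)

print(total, within, between)
```

In this toy population most of the total variance lies between the strata, which is the situation where stratification helps the most.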
Cluster Sampling: Population divided into
natural groups (e.g., villages); cost-effective but increases
variance
- Cluster sampling is less efficient than SRS of the same sample size
- The total variance is the sum of within-cluster and
between-cluster variances: \(\sigma^2 =
\sigma_w^2 + \sigma_b^2\)
- \(\text{Deff} = 1 + (M - 1)\rho\)
  - \(M\) is the average number of
  sampled individuals per cluster
  - \(\rho\) is the intra-class
  correlation coefficient (ICC)
- In cluster sampling, the variance depends on cluster
size and ICC
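A small numeric sketch of the Deff formula follows. The values of M and the ICC are hypothetical, and the "effective sample size" helper is a standard companion quantity added here for illustration; it does not appear in the notes.

```python
def design_effect(avg_cluster_size, icc):
    """Deff = 1 + (M - 1) * rho for a cluster sample."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_sample_size(n, deff):
    """Size of an SRS that would give the same variance as n clustered elements."""
    return n / deff

deff = design_effect(avg_cluster_size=20, icc=0.05)   # hypothetical M and rho
print(deff)
print(effective_sample_size(2000, deff))
```

Even a small ICC inflates the variance noticeably when clusters are large: here 20 individuals per cluster with an ICC of 0.05 roughly doubles it.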
Two Methods of Conducting Surveys
Single-stage sampling design
Multi-stage sampling design: The variance estimation
formula becomes “too complicated” or is no longer available in
“closed form”, so the variance must be estimated by one of the following
- “Approximate methods” (e.g., Taylor linearization method), or
- “Replicated sampling” (e.g., jackknife method)
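To make the replication idea concrete, here is a minimal sketch of a delete-one-cluster jackknife for the variance of a sample mean. It assumes a single stratum and equal weights, and the data are invented; survey software handles strata, weights, and other estimators.

```python
def jackknife_variance(clusters):
    """Delete-one-cluster jackknife variance of the overall sample mean.

    `clusters` is a list of lists, one inner list of observations per cluster.
    Assumes a single stratum and equal weights (a sketch, not a full estimator).
    """
    def mean_of(groups):
        values = [y for g in groups for y in g]
        return sum(values) / len(values)

    k = len(clusters)
    full = mean_of(clusters)
    # Re-estimate the mean k times, each time leaving one whole cluster out.
    replicates = [mean_of(clusters[:i] + clusters[i + 1:]) for i in range(k)]
    return (k - 1) / k * sum((r - full) ** 2 for r in replicates)

toy_clusters = [[1, 1], [2, 2], [3, 3]]   # hypothetical data: 3 clusters of 2
print(jackknife_variance(toy_clusters))
```

Whole clusters are dropped, not individual observations, so the replicate-to-replicate variation reflects the clustering in the design.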
Key Concepts
- Design Effect (deff): Ratio of the variance under the
complex design to the variance under SRS of the same sample size
- \(deff = \frac{\text{Variance estimate
from the designed / implemented survey}}{\text{Variance estimate from
SRS with same number of sample elements}}\)
- \(deft = \sqrt{deff} =
\frac{\text{SE(estimate from the designed / implemented
survey)}}{\text{SE(estimate from SRS with same number of sample
elements)}}\)
- deff > 1 → Less efficient → more uncertainty in
estimates (clustering or unequal weights inflate variance)
- deff < 1 → More efficient → better
precision
- We want deff to be SMALLER!
- Intraclass Correlation (ICC): Measures similarity
within clusters
- Weights: Adjust for unequal sampling
probability
- Variance Estimation: Must account for design;
ignoring it typically underestimates standard errors (SEs)
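The role of weights can be illustrated with a weighted mean in which each unit's weight is the inverse of its selection probability. The function name, values, and probabilities below are hypothetical.

```python
def weighted_mean(values, probs):
    """Weighted mean with weights equal to the inverse selection probability."""
    weights = [1 / p for p in probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

# Hypothetical sample: the last two units were oversampled
# (higher selection probability), so they receive smaller weights.
y = [10, 10, 30, 30]
p = [0.1, 0.1, 0.4, 0.4]

unweighted = sum(y) / len(y)
weighted = weighted_mean(y, p)
print(unweighted, weighted)
```

The unweighted mean is pulled toward the oversampled units; weighting restores each unit's share of the population.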
Why Variance Estimation Matters
- Ignoring design leads to:
- Underestimated standard errors
- Incorrect confidence intervals
- Misleading p-values
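A rough sketch of the consequence for confidence intervals: the SRS standard error of a proportion is scaled by deft = sqrt(deff). The proportion, sample size, and deff used here are hypothetical, and the normal-approximation interval is an assumption made for simplicity.

```python
import math

def proportion_ci(p_hat, n, deff=1.0, z=1.96):
    """Approximate 95% CI for a proportion; deff=1 reproduces the SRS interval."""
    se_srs = math.sqrt(p_hat * (1 - p_hat) / n)
    se = se_srs * math.sqrt(deff)      # deft = sqrt(deff) scales the standard error
    return p_hat - z * se, p_hat + z * se

naive = proportion_ci(0.30, 1000)                # ignores the design
adjusted = proportion_ci(0.30, 1000, deff=2.0)   # hypothetical deff of 2
print(naive)
print(adjusted)
```

The design-adjusted interval is wider by a factor of deft, so the naive interval overstates precision.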
Three Major Tasks in Complex Survey Analysis
- Account for survey design: strata, clusters,
weights
- Decide on weight usage: when and how to apply
- Handle missing values: apply imputation or
adjustment strategies
Outcome Types & Models
- Continuous: Linear regression
- Binary: Logistic / Log-binomial
- Multinomial: Multinomial logit
- Ordered: Ordered logistic regression
- Count: Poisson
- Time-to-event: Survival models
Stata Codes
- Declare the design with svyset before using the svy: prefix
- Use the svy: prefix to account for design (e.g.,
svy: mean, svy: reg)
- estat effects gives DEFF and DEFT
- Standard (SRS): reg, logit
- Design-based: svy: reg, svy: logit
- Robust SE: , cluster(id)
- GEE models: xtgee
- Multilevel models:
  - Random intercept: xtlogit, re; xtmelogit
  - Random slope: xtmelogit ... || cluster: variable
  - Fixed effects: xtlogit, fe
End of Summary