Lecture 1: Complex Survey Data 101

Topic: Methods in Analysis of Large Population Surveys
Instructor: Dr. Saifuddin Ahmed

Overview

This course focuses on the analysis of complex survey data, which differ from data collected by simple random sampling (SRS). Most national surveys use multistage, stratified, and clustered sampling designs, which require special statistical approaches.

Survey Design Types

Simple Random Sampling (SRS): With or without replacement (SRSWR/SRSWOR)
- In practice SRSWR is not attractive: we do not want to interview the same individuals more than once. Mathematically, however, it is much simpler to relate the sample to the population under SRSWR.
- SRSWOR provides two additional advantages:
  - elements are not repeated
  - the variance of the estimate is smaller than under SRSWR with the same sample size.
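The variance advantage of SRSWOR can be checked numerically. The sketch below uses the standard textbook formulas for the variance of a sample mean, assuming \(\sigma^2\) is the population variance with divisor \(N\); the function names and the example numbers are illustrative assumptions, not part of the lecture.

```python
def var_mean_srswr(sigma2, n):
    """Variance of the sample mean under SRS with replacement."""
    return sigma2 / n


def var_mean_srswor(sigma2, n, N):
    """Variance of the sample mean under SRS without replacement.

    sigma2 is the population variance with divisor N, so the finite
    population correction is (N - n) / (N - 1).
    """
    return (sigma2 / n) * (N - n) / (N - 1)


# Hypothetical population: N = 1000 units, variance 25, sample of n = 100.
wr = var_mean_srswr(25.0, 100)
wor = var_mean_srswor(25.0, 100, 1000)
print(f"SRSWR: {wr:.4f}  SRSWOR: {wor:.4f}")
```

The gap between the two grows with the sampling fraction n/N; when n is a tiny share of N the two designs are nearly equivalent.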
Systematic Sampling: Only the first unit is selected at random, the rest being selected according to a predetermined pattern
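- It is close to SRS when the ordering of the list is unrelated to the outcome.
- Intraclass correlation among units selected by the same pattern (for example, periodicity in the list) can affect the precision of systematic sampling.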
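A minimal sketch of systematic selection, assuming the population size is an exact multiple of the sample size so the sampling interval is an integer. The function name and the `seed` parameter are hypothetical, chosen only for illustration.

```python
import random


def systematic_sample(N, n, seed=None):
    """Return 0-based list positions of a 1-in-k systematic sample.

    Assumes N is an exact multiple of n, so the interval k = N / n is
    an integer. Only the starting position is chosen at random.
    """
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is a positive multiple of n")
    k = N // n
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start in 0 .. k-1
    return [start + i * k for i in range(n)]


# Hypothetical frame of 100 units, sample of 10: every 10th unit is taken.
print(systematic_sample(100, 10, seed=1))
```

Because every selected unit is determined by the start, there are only k possible samples, which is why the ordering of the list matters for precision.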
Stratified Sampling: Population divided into mutually exclusive strata (e.g., region, race) – increases precision
- Provides an opportunity to study variation across strata: estimates can be made for each stratum
- Total variance = between-stratum variance + within-stratum variance
- In a stratified sampling design, only the within-stratum variance contributes to the variance of the estimate
- Within-stratum variance < total variance, so the variance under stratified sampling is generally lower than the variance under SRS of equal sample size
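The decomposition of total variance into between-stratum and within-stratum parts can be verified on a toy population. This is a sketch under stated assumptions: variances use divisor N, strata are weighted by their population share, and the data and stratum labels are made up for illustration.

```python
from statistics import fmean, pvariance


def decompose_variance(strata):
    """Split the population variance into within- and between-stratum parts.

    strata maps a stratum label to the list of values in that stratum.
    Returns (total, within, between), all with divisor N.
    """
    all_values = [y for values in strata.values() for y in values]
    N = len(all_values)
    grand_mean = fmean(all_values)
    # Within: population-share-weighted average of stratum variances.
    within = sum(len(v) / N * pvariance(v) for v in strata.values())
    # Between: population-share-weighted squared distance of stratum means.
    between = sum(len(v) / N * (fmean(v) - grand_mean) ** 2
                  for v in strata.values())
    return pvariance(all_values), within, between


# Hypothetical population with two strata that differ strongly in level.
total, within, between = decompose_variance(
    {"urban": [10.0, 12.0, 14.0], "rural": [2.0, 4.0, 6.0]}
)
print(round(total, 3), round(within, 3), round(between, 3))
```

The more the stratum means differ, the larger the between-stratum share that stratification removes from the variance of the estimate.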
Cluster Sampling: Population divided into natural groups (e.g., villages); cost-effective but increases variance
- Cluster sampling is less efficient than SRS with the same number of sample elements
- The total variance is the sum of within-cluster and between-cluster variances: \(\sigma^2 = \sigma_w^2 + \sigma_b^2\)
- \(\text{Deff} = 1 + (M - 1)\rho\)
  - \(M\) is the average number of individuals per cluster
  - \(\rho\) is the intra-class correlation coefficient (ICC)
- In cluster sampling, the variance depends on the cluster size and the ICC
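The design effect formula for cluster sampling is easy to evaluate directly. The sketch below also reports an effective sample size, n divided by the design effect, which is a common companion quantity and an addition here rather than something stated in the lecture; the function names and inputs are hypothetical.

```python
def design_effect(avg_cluster_size, icc):
    """Design effect for cluster sampling: 1 + (M - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_sample_size(n, deff):
    """Size of an SRS that would give the same precision (assumed helper)."""
    return n / deff


# Hypothetical survey: 2000 respondents, 20 per cluster, ICC of 0.05.
deff = design_effect(20, 0.05)
print(round(deff, 3), round(effective_sample_size(2000, deff), 1))
```

Even a small ICC inflates the variance noticeably when clusters are large, and the design effect returns to 1 when either the ICC is 0 or each cluster holds a single element.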

Two Methods of Conducting Surveys

Single-stage sampling design
Multi-stage sampling design: The variance formula becomes “too complicated” or is no longer available in “closed form”, so the variance must be estimated with one of the following
- “Approximate methods” (e.g., Taylor linearization method), or
- “Replicated sampling” (e.g., jackknife method)
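To make the replication idea concrete, here is a minimal sketch of a delete-one-cluster jackknife for the variance of a sample mean. It assumes a single stratum and equal weights, which is a simplification of what survey software does; the function name and data are hypothetical.

```python
def jackknife_variance(clusters):
    """Delete-one-cluster jackknife variance of the overall mean.

    clusters is a list of lists, one inner list of values per cluster.
    Assumes a single stratum and equal weights (a simplification).
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    replicates = []
    for i in range(k):
        # Re-estimate the mean with cluster i left out.
        kept = [y for j, c in enumerate(clusters) if j != i for y in c]
        replicates.append(sum(kept) / len(kept))
    rep_mean = sum(replicates) / k
    return (k - 1) / k * sum((r - rep_mean) ** 2 for r in replicates)


# Hypothetical data: three clusters of two respondents each.
print(jackknife_variance([[1.0, 2.0], [4.0, 5.0], [7.0, 9.0]]))
```

When every cluster holds exactly one element, this reduces to the familiar s²/n for the variance of a mean, which is a useful sanity check.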

Key Concepts

Design Effect (deff): Ratio of the variance under the complex design to the variance under SRS with the same number of sample elements
- \(deff = \frac{\text{Variance estimate from the designed / implemented survey}}{\text{Variance estimate from SRS with same number of sample elements}}\)
- \(deft = \sqrt{deff} = \frac{\text{SE(estimate from the designed / implemented survey)}}{\text{SE(estimate from SRS with same number of sample elements)}}\)
- deff > 1 → Less efficient → more uncertainty in estimates (clustering or unequal weights inflate variance)
- deff < 1 → More efficient → better precision
- We want deff to be SMALLER!
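As a small worked example of the two ratios, the sketch below computes deff and deft from a pair of variance estimates. The function name and the numbers are hypothetical and only illustrate the definitions.

```python
import math


def deff_deft(var_design, var_srs):
    """Return (deff, deft) from design-based and SRS variance estimates."""
    deff = var_design / var_srs
    return deff, math.sqrt(deff)


# Hypothetical estimates for a proportion: design-based variance 0.0009,
# SRS variance 0.0004 for the same number of sample elements.
deff, deft = deff_deft(0.0009, 0.0004)
print(round(deff, 2), round(deft, 2))
```

Here the standard error under the implemented design is one and a half times the SRS standard error, so the survey is less efficient than an SRS of the same size.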
Intraclass Correlation (ICC): Measures similarity within clusters
Weights: Adjust for unequal sampling probability
Variance Estimation: Must account for the design; ignoring it typically underestimates standard errors (SEs)
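The role of weights can be illustrated with a weighted mean in which each respondent's weight is the inverse of their selection probability. This is a minimal sketch: the function name, the data, and the selection probabilities are all made up for illustration.

```python
def weighted_mean(values, selection_probs):
    """Mean with each value weighted by 1 / its selection probability."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical sample: the two respondents with outcome 1 were oversampled
# (probability 0.5) relative to the two with outcome 0 (probability 0.1).
values = [1, 1, 0, 0]
probs = [0.5, 0.5, 0.1, 0.1]
print(sum(values) / len(values), round(weighted_mean(values, probs), 3))
```

The unweighted mean overstates the outcome because the oversampled group is counted as if it were as common in the population as in the sample.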

Why Variance Estimation Matters

Ignoring design leads to:
- Underestimated standard errors
- Incorrect confidence intervals
- Misleading p-values
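The consequence for confidence intervals can be sketched numerically. The example assumes a normal-approximation 95% interval and a hypothetical deft of 1.5; the estimate, the naive SE, and the function name are illustrative only.

```python
def ci95(estimate, se):
    """Normal-approximation 95% confidence interval."""
    half_width = 1.96 * se
    return estimate - half_width, estimate + half_width


# Hypothetical prevalence estimate of 0.30 with a naive (SRS) SE of 0.010.
estimate, naive_se, deft = 0.30, 0.010, 1.5
design_se = naive_se * deft  # deft is the ratio of design SE to SRS SE

print("naive :", ci95(estimate, naive_se))
print("design:", ci95(estimate, design_se))
```

The naive interval is too narrow by a factor of deft, so tests based on it reject too often.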

Three Major Tasks in Complex Survey Analysis

Account for survey design: strata, clusters, weights
Decide on weight usage: when and how to apply
Handle missing values: apply imputation or adjustment strategies

Outcome Types & Models

Continuous: Linear regression
Binary: Logistic / Log-binomial
Multinomial: Multinomial logit
Ordered: Ordered logistic regression
Count: Poisson
Time-to-event: Survival models

Stata Codes

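Declare the design once with svyset (strata, clusters, weights), then use the svy: prefix so estimation accounts for it (e.g., svy: mean, svy: reg)
After a svy: estimation command, estat effects reports DEFF and DEFT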
Standard (SRS): reg, logit
Design-based: svy: reg, svy: logit
Robust SE: , vce(cluster id) (older syntax: , cluster(id))
GEE models: xtgee
Multilevel models:
- Random intercept: xtlogit, re or xtmelogit (melogit in Stata 13 and later)
- Random slope: xtmelogit ... || cluster: variable
Fixed effects: xtlogit, fe
