Interactive Stata Output Lab - Learn to Read Decomposition Results
Values shown are for demonstration purposes only.
Command History
Oaxaca-Blinder Commands
A. Three-fold decomposition (default)
   oaxaca SBP BMI age, by(pop)
B. Two-fold with pooled (RECOMMENDED)
   oaxaca SBP BMI age, by(pop) pooled
C. With bootstrap standard errors
   oaxaca ... vce(bootstrap, reps(100))
D. Grouped variables
   oaxaca ... (functional: adl iadl)
E. Survey-weighted decomposition
   oaxaca ... pooled svy
Fairlie Commands
A. Basic Fairlie decomposition
   fairlie hypertension BMI age, by(pop)
B. Preferred: pooled + RO + groups
   fairlie ... pooled(pop) ro reps(300)
C. Bootstrap SEs for publication
   bootstrap, reps(100) seed(12345): fairlie ...
Stata Results Window

. oaxaca SBP BMI age, by(population)

Blinder-Oaxaca decomposition                    Number of obs     =      2,000
                                                Model             =     linear
Group 1: population = 0                         N of obs 1        =      1,000
Group 2: population = 1                         N of obs 2        =      1,000

    endowments: (X1 - X2) * b2
  coefficients: X2 * (b1 - b2)
   interaction: (X1 - X2) * (b1 - b2)

------------------------------------------------------------------------------
         SBP | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   175.6579   .3630455   483.85   0.000     174.9464    176.3695
     group_2 |   148.3537   .3165138   468.71   0.000     147.7334    148.9741
  difference |    27.3042   .4816461    56.69   0.000      26.3602    28.24821
  endowments |   11.45145   .6372081    17.97   0.000     10.20255    12.70036
coefficients |   14.32706   .6640514    21.58   0.000     13.02554    15.62858
 interaction |   1.525692   .7903181     1.93   0.054    -.0233032    3.074687
-------------+----------------------------------------------------------------
endowments   |
         BMI |   7.825849   .5644088    13.87   0.000     6.719628     8.93207
         age |   3.625604   .2893155    12.53   0.000     3.058556    4.192653
-------------+----------------------------------------------------------------
coefficients |
         BMI |   4.497143    2.22776     2.02   0.044     .1308142    8.863471
         age |   .2580753    1.60092     0.16   0.872     -2.87967    3.395821
       _cons |   9.571841   3.155704     3.03   0.002     3.386775    15.75691
-------------+----------------------------------------------------------------
interaction  |
         BMI |   1.477729   .7328098     2.02   0.044      .041448     2.91401
         age |   .0479629   .2975408     0.16   0.872    -.5352063     .631132
------------------------------------------------------------------------------
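One way to sanity-check a three-fold table like the one above is to confirm that the endowments, coefficients, and interaction rows add up to the difference row, and then express each as a share of the gap. The Python sketch below does this with the demonstration values only; the variable names are illustrative and are not part of Stata's output.

```python
# Values copied from the "overall" block of the demonstration output.
difference = 27.3042
components = {
    "endowments": 11.45145,
    "coefficients": 14.32706,
    "interaction": 1.525692,
}

# In a three-fold decomposition the components sum to the raw gap
# (up to rounding in the displayed digits).
total = sum(components.values())

# Share of the gap attributed to each component, in percent.
shares = {name: 100 * value / difference for name, value in components.items()}

print(f"sum of components = {total:.4f} (difference row = {difference})")
for name, pct in shares.items():
    print(f"{name:<13}{pct:5.1f}%")
```

The same check applies to the detailed blocks: the BMI and age rows under each heading should sum to that heading's overall row (with _cons included for the coefficients block).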

Model Info
Formula
Overall Gap
Explained
Unexplained
Variables
Confidence Intervals
Result Text
Decomposition Results
. oaxaca Y X, by(group) weight(0.5)

Blinder-Oaxaca decomposition
Model = linear
Group 1: group = A          N of obs 1 = 1,000
Group 2: group = B          N of obs 2 = 1,000

Reference coefficients: b* = 0.5*b1 + 0.5*b2
    explained: (X1 - X2) * b*
  unexplained: X1*(b1-b*) + X2*(b*-b2) + (a1-a2)

------------------------------------------------------------------
          Y |     Coef.  Std.err.       z   P>|z|   [95% Conf.Int]
------------+-----------------------------------------------------
overall     |
    group_A |  175.0000    0.363   482.27   0.000   174.29  175.71
    group_B |  140.0000    0.317   442.34   0.000   139.38  140.62
 difference |   35.0000    0.482    72.68   0.000    34.06   35.94
  explained |   11.2500    0.492    22.86   0.000    10.29   12.21
unexplained |   23.7500    0.550    43.17   0.000    22.67   24.83
------------+-----------------------------------------------------
explained   |
          X |   11.2500    0.424    26.55   0.000    10.42   12.08
------------------------------------------------------------------
The mean outcome difference between groups was 0.0 units (95% CI, -0.8 to 0.8), with equal means in Group A (90.0) and Group B (90.0). Of this gap, 0.0 units (95% CI, -0.4 to 0.4; 0%) was statistically explained by differences in measured covariates (X), whereas 0.0 units (95% CI, -0.3 to 0.3; 0%) remained unexplained. The unexplained component may reflect unmeasured confounding, model misspecification, or true group differences in how X affects Y.
Oaxaca-Blinder Decomposition
Gap (ȲA − ȲB) = Explained (ΔX̄ × β*) + Unexplained
β* = (βA + βB) / 2 = (2.5 + 2.0) / 2 = 2.25  (Stata: weight(0.5))
175.0 − 140.0 = 35.0 = (30 − 25) × 2.25 + 23.75 = 11.25 + 23.75
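The arithmetic in this worked example can be reproduced in a few lines. The Python sketch below uses the group means and slopes shown in the formula; the two intercepts are an assumption, back-solved from the group means because the lab output does not print them.

```python
# Inputs taken from the worked example.
x_a, x_b = 30.0, 25.0      # mean of X in Group A and Group B
b_a, b_b = 2.5, 2.0        # group-specific slopes
y_a, y_b = 175.0, 140.0    # group means of Y

# ASSUMPTION: intercepts are back-solved from the means (a = Ybar - b * Xbar);
# they do not appear in the lab output.
a_a = y_a - b_a * x_a
a_b = y_b - b_b * x_b

# Reference coefficients with weight(0.5): the simple average of both slopes.
w = 0.5
b_star = w * b_a + (1 - w) * b_b

gap = y_a - y_b
explained = (x_a - x_b) * b_star
unexplained = x_a * (b_a - b_star) + x_b * (b_star - b_b) + (a_a - a_b)

print(f"b* = {b_star}, gap = {gap}")
print(f"explained   = {explained} ({100 * explained / gap:.0f}%)")
print(f"unexplained = {unexplained} ({100 * unexplained / gap:.0f}%)")
```

Changing `w` to 1 or 0 uses Group A's or Group B's slope as the reference instead, which shifts how the same 35.0 gap is split.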
Gap Decomposition Total: 0.0
Explained: 0%
Unexplained: 0%
GROUP A: X̄ = 20.0, α = 80, β = 0.5
GROUP B: X̄ = 20.0, α = 80, β = 0.5
Regression Lines (figure; legend: A, B)
Fairlie Decomposition Results
. fairlie hypertension age, by(population)

Logistic regression (reference model: population == 0)
                                        Number of obs =  1,000
                                        LR chi2(1)    = 436.22
                                        Prob > chi2   = 0.0000
Log likelihood = -321.05                Pseudo R2     = 0.4041
--------------------------------------------------------------
hypertension |   Coef.  Std.err.      z   P>|z|       [95% CI]
-------------+------------------------------------------------
         age |   0.120    0.011   10.91   0.000   0.099  0.141
       _cons |  -5.750    0.616   -9.34   0.000   -6.96  -4.54
--------------------------------------------------------------

Non-linear decomposition by population (G)
                                        Number of obs =  2,000
                                        N of obs G=0  =  1,000
                                        N of obs G=1  =  1,000
                                        Pr(Y!=0|G=0)  =   .934
                                        Pr(Y!=0|G=1)  =   .148
                                        Difference    =   .786
                                        Total expl.   =   .372
--------------------------------------------------------------
hypertension |   Coef.  Std.err.      z   P>|z|       [95% CI]
-------------+------------------------------------------------
         age |    .372     .032   11.63   0.000    .310   .434
--------------------------------------------------------------
The outcome prevalence difference between groups was 78.6 percentage points (95% CI, 73.0 to 84.2), with higher prevalence in Group 0 (93.4%) compared to Group 1 (14.8%). Using Fairlie nonlinear decomposition (Stata default, reference(0)), 37.2 pp (95% CI, 31.0 to 43.4; 47%) was statistically explained by differences in age, reflecting higher average age in Group 0. The remaining gap may reflect unmeasured confounding or true group differences in how age affects hypertension risk.
FAIRLIE DECOMPOSITION
Gap = Explained + Unexplained
β* = β0  (reference: Group 0, Stata default)
78.6 = 37.2 + 41.4
(47% explained, 53% unexplained)
Gap Decomposition Total: 50.0 pp
Explained Unexplained
Group 0 (Reference)
X̄ (Age) 55
β(Age) 0.080
Group 1
X̄ (Age) 55
β(Age) 0.080
Key Concepts (Fairlie 2005)
• Binary outcome decomposition via logistic regression
• Results in percentage points (pp)
• Variable ordering affects individual contributions; use ro to randomize
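The ordering point in the last bullet can be illustrated with a simplified Python sketch. It is not the `fairlie` command's algorithm: the coefficients, the tiny matched samples, and the pairing by list position are all hypothetical, chosen only to show that swapping variables one at a time gives order-dependent individual contributions with an unchanged total.

```python
import math

def logistic(z):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL reference-model coefficients and matched samples (age, comorbid);
# none of these numbers come from the lab output.
B0, B_AGE, B_COM = -6.0, 0.09, 0.8
group0 = [(72, 1), (68, 0), (75, 1), (64, 1)]   # higher-prevalence group
group1 = [(50, 0), (47, 0), (58, 1), (44, 0)]   # paired with group0 by position

age0, com0 = zip(*group0)
age1, com1 = zip(*group1)

def mean_prob(ages, coms):
    """Average predicted probability over paired age/comorbidity values."""
    probs = [logistic(B0 + B_AGE * a + B_COM * c) for a, c in zip(ages, coms)]
    return sum(probs) / len(probs)

def contributions(order):
    """Replace group 0's values with group 1's, one variable at a time."""
    current = {"age": age0, "com": com0}
    target = {"age": age1, "com": com1}
    result = {}
    for var in order:
        before = mean_prob(current["age"], current["com"])
        current[var] = target[var]
        after = mean_prob(current["age"], current["com"])
        result[var] = before - after   # drop in mean probability from this swap
    return result

age_first = contributions(["age", "com"])
com_first = contributions(["com", "age"])
for label, res in (("age first", age_first), ("com first", com_first)):
    print(label, {k: round(100 * v, 1) for k, v in res.items()},
          "total pp:", round(100 * sum(res.values()), 1))
```

Averaging the contributions over many random orderings is the idea behind randomizing the variable order, which is what the ro option in the bullet refers to.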
Logistic Probability Curves (figure; legend: G=0, G=1)
Fairlie Decomposition
Pr0 − Pr1 = Explained + Unexplained
82.1 = (23.7 + 2.7) + 55.7
Age + Comorbid = Σ contributions
Gap (Pr₀ − Pr₁): 82.1 pp
Age: 23.7 pp
Comorbid: 2.7 pp
Explained: 26.3 pp (32%)
Unexplained: 55.7 pp (68%)
PATH DEPENDENCE
Reordering the variables changes the individual contributions, but the total stays the same.
logit(p) = β0 + βCom·XCom + βAge·XAge
Comorbid: 0.0 pp
Age: 0.0 pp
✓ Total Explained: 0.0 pp (stable)
Group 0 (Reference)
X̄ (Age) 40
X̄ (Com) 0.0
Group 1
X̄ (Age) 40
X̄ (Com) 0.0
COEFFICIENTS (β) - Different β creates Unexplained
Group 0 (reference)
β₀(Age) 0.00
β₀(Com) 0.00
Group 1 (locked)
β₁(Age) 0.00
β₁(Com) 0.00

Functional Form & Model Misspecification

The true data-generating process is U-shaped. Pick different models to see how misspecification biases the decomposition; only the U-shaped specification is correct.
Model Specification
Distribution Parameters
Group A
Mean X: 75
Spread: 8
Group B
Mean X: 40
Spread: 8
True vs Fitted Relationship (figure; legend: true f(X), A fit, B fit, ȲA, ȲB)
X Distribution by Group
Decomposition Under Current Model
How much of the gap is explained vs. unexplained?
True Gap, ȲA − ȲB (DGP): 21.0
Explained Unexplained
Misspecification Bias
Bias in Explained
Bias Direction
Select a model
. * TRUE DGP (U-shaped):
. * Y = 0.04*(X-50)² + 110
. * Fitted: — (no model selected)
. oaxaca Y X, by(group) pooled
difference | 21.00
explained |
unexplained |
ℹ Pick a model above to fit it to the U-shaped data and see how misspecification biases the decomposition.
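To see why a straight-line fit can distort the split, the Python sketch below works through the lab's DGP in closed form. It assumes X is normally distributed within each group and that "Spread" is the standard deviation; both are assumptions, not stated in the lab. It also uses each group's own linear coefficients as the reference rather than the pooled option shown in the command, so it is an illustration of the mechanism, not a reproduction of the panel's numbers.

```python
# DGP from the lab: Y = 0.04 * (X - 50)^2 + 110, identical for both groups.
C, CENTER, BASE = 0.04, 50.0, 110.0

# Group settings from the panel. ASSUMPTION: X is normal and "Spread" is its SD.
MU_A, SD_A = 75.0, 8.0
MU_B, SD_B = 40.0, 8.0

def mean_y(mu, sd):
    """E[Y] under the DGP, using E[(X - 50)^2] = (mu - 50)^2 + sd^2."""
    return C * ((mu - CENTER) ** 2 + sd ** 2) + BASE

def linear_slope(mu):
    """Slope of the best straight-line fit within a group when X is normal:
    Cov(X, Y) / Var(X) = 2 * C * (mu - 50)."""
    return 2 * C * (mu - CENTER)

true_gap = mean_y(MU_A, SD_A) - mean_y(MU_B, SD_B)

# Misspecified linear fits: each group gets its own slope even though the
# true f(X) is shared, so the slopes differ only because of where X sits.
slope_a, slope_b = linear_slope(MU_A), linear_slope(MU_B)
explained_ref_a = (MU_A - MU_B) * slope_a   # Group A coefficients as reference
explained_ref_b = (MU_A - MU_B) * slope_b   # Group B coefficients as reference

print(f"true gap            = {true_gap:.1f}")
print(f"linear slopes       = A: {slope_a:.1f}, B: {slope_b:.1f}")
print(f"explained (A coefs) = {explained_ref_a:.1f}")
print(f"explained (B coefs) = {explained_ref_b:.1f}")
```

Because both groups share the same f(X), the whole gap comes from where each group sits on the curve. Under these assumptions the linear fits instead report different slopes for the two groups, and the "explained" part swings widely with the choice of reference coefficients.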

Common Support & Extrapolation

Distribution Parameters
Group A
Mean X: 55
Spread: 10
Range: 37 – 72
Group B
Mean X: 45
Spread: 10
Range: 27 – 62
Overlap Region: 37 – 62
✓ B has data at A's mean
Counterfactual Question
"What Y would B have at X = 55?"
Quick Scenarios
Where do the groups overlap?
Solid line = model fit on observed data  ·  Dashed line = model predicting beyond data
Legend: Group A (observed), Group B (observed), Fit (within data), Extrapolated (no data)
✓ Trustworthy: Group B has real observations near A's mean X. The counterfactual is answered by actual people — not by the model's assumptions.
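A rough version of this overlap check can be written directly from the ranges shown in the panel. The Python sketch below is only a coarse screen (a range can contain X = 55 while having few observations near it), and the helper names `overlap` and `supported` are illustrative.

```python
# Observed X ranges and Group A's mean, as shown in the panel.
range_a = (37, 72)
range_b = (27, 62)
mean_x_a = 55

def overlap(r1, r2):
    """Return the shared interval of two (low, high) ranges, or None if disjoint."""
    low, high = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (low, high) if low <= high else None

def supported(x, observed_range):
    """True if x lies inside the range where the group actually has data."""
    return observed_range[0] <= x <= observed_range[1]

common = overlap(range_a, range_b)
b_covers_a_mean = supported(mean_x_a, range_b)

print("overlap region:", common)
print("B has data at A's mean:", b_covers_a_mean)
```

If `b_covers_a_mean` were False, the counterfactual "What Y would B have at X = 55?" would rest on extrapolating B's fitted line beyond its data.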
Counterfactual: What would B have with A's characteristics?
Total Gap (ȲA − ȲB): 42.5
Endowments (X̄A − X̄B)·βB: 15.0 = (55 − 45) × 1.50
Coefficients X̄B·(βA − βB): 22.5 = 45 × (2.00 − 1.50)
Interaction (X̄A − X̄B)(βA − βB): 5.0
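The three components above can be recomputed from the panel's means and slopes. In the Python sketch below, the intercepts are assumed equal across groups, since the panel's total of 42.5 is exactly the sum of the three slope-based terms; that assumption is not stated in the lab.

```python
# Means and slopes from the panel.
x_a, x_b = 55.0, 45.0     # mean X in Group A and Group B
b_a, b_b = 2.00, 1.50     # group-specific slopes

# ASSUMPTION: equal intercepts, so no intercept term enters "coefficients".
endowments = (x_a - x_b) * b_b            # B's gain from having A's X
coefficients = x_b * (b_a - b_b)          # B's gain from having A's slope
interaction = (x_a - x_b) * (b_a - b_b)   # both differences acting together
total_gap = endowments + coefficients + interaction

print(f"endowments   = {endowments}")
print(f"coefficients = {coefficients}")
print(f"interaction  = {interaction}")
print(f"total gap    = {total_gap}")
```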
Visual Decomposition
With overlap → uses actual data; without overlap → extrapolation (unreliable)

Non-Additivity on the Probability Scale

Logistic regression vs. the linear-predictor (log-odds) scale — an interactive illustration for clinical researchers.
Clinical scenario. We are predicting 30-day post-operative mortality from age and number of comorbidities. The model has no interaction term. Still, the same comorbidity coefficient does not produce the same risk change at every age — that mismatch is what this figure shows.
log-odds at age 65, 0 comorbidities
OR per year of age =
OR per comorbidity =
curves drawn for 0 … comorbidities

Linear predictor (log-odds scale)

log-odds = β₀ + β₁·(age − 65) + β₂·comorbidities
Brackets = effect of +1 comorbidity at each age. All identical → additive.

Predicted probability (logistic)

odds = e^(log-odds)   →   P = odds / (1 + odds)
Brackets = risk difference (pp) for +1 comorbidity. All different → non-additive.
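The contrast between the two panels can be checked numerically. The Python sketch below uses hypothetical values for the baseline log-odds and the two odds ratios (the lab's slider settings are not shown in the text) and compares the effect of one extra comorbidity on each scale.

```python
import math

# HYPOTHETICAL parameter values; the lab's slider settings are not shown.
B0 = -3.0                # log-odds at age 65 with 0 comorbidities
B_AGE = math.log(1.08)   # odds ratio of 1.08 per year of age
B_COM = math.log(1.6)    # odds ratio of 1.6 per comorbidity

def log_odds(age, comorbidities):
    """Linear predictor with no interaction term."""
    return B0 + B_AGE * (age - 65) + B_COM * comorbidities

def probability(age, comorbidities):
    """Convert log-odds to probability via odds / (1 + odds)."""
    odds = math.exp(log_odds(age, comorbidities))
    return odds / (1 + odds)

ages = [55, 65, 75, 85]
# Effect of going from 0 to 1 comorbidity on each scale, by age.
logit_steps = [log_odds(a, 1) - log_odds(a, 0) for a in ages]
risk_steps = [probability(a, 1) - probability(a, 0) for a in ages]

for age, d_logit, d_risk in zip(ages, logit_steps, risk_steps):
    print(f"age {age}: +{d_logit:.3f} log-odds, +{100 * d_risk:.1f} pp risk")
```

The log-odds step is the same at every age, while the step in percentage points grows as baseline risk moves toward the middle of the probability range.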

Why Linear Models Fail for Binary Outcomes

Interactive illustration of two problems: a linear probability model (LPM) can predict probabilities outside [0, 1], and it assumes the effect of X is constant. Logistic regression handles both.
Clinical scenario. We want to predict 1-year mortality from age in an older-adult cohort. Both models share the same local slope at age 65 (the reference age). An LPM extrapolates that slope as a straight line; logistic regression bends it into an S-curve so P stays in [0, 1]. Move the sliders to see where and when each model breaks.
P(death within 1 yr) at reference age
LPM uses this slope everywhere; logistic only at age 65
LPM predictions stay within [0, 1] across the age range.

Linear Probability Model (LPM)

P(death) = p₀ + β · (age − 65)
Brackets = ΔP for +5 years — identical at every age (constant-effect assumption). Red: predicted P outside [0, 1] (impossible!).

Logistic Regression

P(death) = σ(logit(p₀) + β₁ · (age − 65))
Brackets = ΔP for +5 years — different at each age: bigger in the middle, smaller near 0 or 1. P is always in [0, 1].
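Both formulas can be compared side by side in a short Python sketch. The baseline risk and logistic slope below are hypothetical (the lab's slider values are not shown in the text), and the LPM slope is set to the logistic curve's derivative at age 65, β₁·p₀·(1 − p₀), so the two models share the same local slope as the scenario describes.

```python
import math

# HYPOTHETICAL settings; the lab's slider values are not shown.
P0 = 0.10    # P(death within 1 yr) at the reference age of 65
B1 = 0.12    # logistic slope per year of age, on the log-odds scale

# Match the local slope at age 65: dP/d(age) = B1 * p0 * (1 - p0).
B_LPM = B1 * P0 * (1 - P0)

def lpm(age):
    """Linear probability model: a straight line through (65, P0)."""
    return P0 + B_LPM * (age - 65)

def logistic_model(age):
    """Logistic model: sigma(logit(p0) + b1 * (age - 65))."""
    z = math.log(P0 / (1 - P0)) + B1 * (age - 65)
    return 1.0 / (1.0 + math.exp(-z))

ages = list(range(50, 96, 5))
# Change in predicted probability for +5 years, starting from each age.
lpm_steps = [lpm(a + 5) - lpm(a) for a in ages]
logit_steps = [logistic_model(a + 5) - logistic_model(a) for a in ages]

for a in ages:
    flag = "  <-- outside [0, 1]" if not 0 <= lpm(a) <= 1 else ""
    print(f"age {a}: LPM {lpm(a):6.3f}{flag}   logistic {logistic_model(a):.3f}")
```

With these settings the straight line dips below zero at the youngest ages, while the logistic predictions stay strictly between 0 and 1 and the 5-year change varies with age.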

How Splines Work: One Curve, Many Lines

The big idea: place knots to split a non-linear curve into short segments — each segment is approximately a straight line. Together they reproduce the curve.
The concept. A linear model can't bend. But if you split the data at knots, the pieces in between are roughly linear — and fitting a line to each piece recovers the curve. More knots = more pieces = closer to reality.
Creates segments
Spread of simulated points

Whole Curve

Non-linear data, split at the knots, with the spline fit on top.
Legend: data, one straight line (fails), spline fit, knots
Within each knot interval, a straight line fits well — slopes differ by segment.
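The "more knots, closer fit" idea can be illustrated without any fitting library. The Python sketch below is a simplification: it connects the values of a known curve at equally spaced knots with straight lines (interpolation on noise-free values, not a least-squares spline fit to simulated points), and the U-shaped curve is a hypothetical choice.

```python
def curve(x):
    """A non-linear 'truth' to approximate (hypothetical choice)."""
    return 0.04 * (x - 50) ** 2 + 110

def piecewise_linear(x, knots):
    """Connect the curve's values at consecutive knots with straight lines."""
    for left, right in zip(knots, knots[1:]):
        if left <= x <= right:
            t = (x - left) / (right - left)
            return (1 - t) * curve(left) + t * curve(right)
    raise ValueError("x lies outside the knot range")

def max_error(n_segments, lo=0.0, hi=100.0, grid=1000):
    """Largest gap between the curve and its piecewise-linear approximation."""
    knots = [lo + (hi - lo) * i / n_segments for i in range(n_segments + 1)]
    xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    return max(abs(curve(x) - piecewise_linear(x, knots)) for x in xs)

# One segment is the "one straight line" case; more segments add interior knots.
errors = {k: max_error(k) for k in (1, 2, 4, 8)}
for k, err in errors.items():
    print(f"{k} segment(s): max error = {err:.2f}")
```

Each doubling of the number of segments shrinks the worst-case gap, which is the sense in which short straight pieces reproduce the curve.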