Blinder-Oaxaca Decomposition in R
In this post, I will explain in simple terms the Blinder-Oaxaca decomposition and how to run it in R.
The two R
libraries needed if you want to run the code are
library(oaxaca)
install.packages(oaxaca)
and for tidy outputs
library(devtools)
install_github("giacomovagni/oaxacad")
library(oaxacad)
Learn more about the method here or here.
Introduction
The Blinder-Oaxaca decomposition (BO) is a method designed for decomposing gaps; health, wages, time-use, etc. The gap is decomposed into an “explained” part and an “unexplained” part.
The method is generally applied by comparing the gap between two groups, rich and poor, man and woman, young and old, etc.
Let’s take the example of height between rich and poor children. Let’s say that the average height of rich children is 70 cm, and the average height of poor children is 60 cm. The average gap is 10 cm.
Let’s call
You identify the main factors responsible for this gap: nutrition, sleeping time, and exposure to pollution. Let’s call the sum of these factors
You can think of two ways
The first way is that rich children have better nutrition, more sleep, and less pollution exposure. Let’s say that having a high value in
This is called “endowment” in the Oaxaca jargon.
This means that the two groups are different in their group characteristics. For instance, let’s imagine that rich kids have an average value of 20 for nutrition, sleep and pollution exposure (
Imagine that the effect of 1-unit of
[rich children equation]
[poor children equation]
This would look something like
(red is rich kids, group 0 or A; blue is poor kids, group 1 or B)

The intercepts are the same, the slope of the regression line is the same for both groups, and only the average characteristics
What would happen if we gave the poor kids the resources of the rich kids?
In other words, we keep the intercepts and coefficients of the poor kids (
[poor children equation with rich kids resources]
The gap would disappear completely! When the gap is completely reduced by plugging the characteristics of one group into the other group, we call this the “explained” part of the gap. The gap is purely compositional due to the resources or characteristics making up the groups.
The average gap is
10 points.
The “explained” (E) part of the gap is (from the point of view of group B)
The percentage explained is simply Explained/Gap.
Here we have
Therefore, 10/10 = 1, 100% of the gap is explained by differences in characteristics or resources.
Note, however, that in certain cases, the reference group can influence the share of the gap explained. Picking one reference group instead of another can explain more or explain less of the gap.
I mentioned above that there is another way to think about how
Imagine now that there is not only a difference in the characteristics or resources but also in the coefficients,
Imagine that the rich kids have higher coefficients (
The poor kids still have a beta of 1.
What does this mean? It means that even with the same characteristics, the rich give benefit more from the same resources. They have greater “returns” on the resources than poor kids. This could happen for many reasons. This could be due to unobserved factors. For instance, stress is not accounted for and plays a role in the quality of sleep. For the same amount of sleep time, the effect of sleep is less beneficial for poorer kids. The differences in the coefficients, or the size of the effects, are the “unexplained” part of the gap.
In the case of the wage gap between men and women, this unexplained part is interpreted as discrimination. For instance, if you have two individuals doing the same job in the same company with the same education, etc., but the man is paid more (i.e. the coefficients yield higher returns) than the woman, then these differences in the
Going back to our example, we now have a gap of 90, which is 150 (i.e. the average outcome for group 1) minus 60 (i.e. the average outcome for group 2).
[rich kids]
[poor kids]
The figure looks very different. The slopes are also different here.

We can still compute the “explained” part of the gap due to the characteristics or resources.
We have
With the
In this example, the intercepts were the same; therefore, the “unexplained” part is due to the differences in coefficients.
Common Formalisation
The average gap is
Can be re-written as (from the point of view of group B)
(from the point of view of group A)
Simulation in R
Let’s simulate an example. Group A has a higher intercept (55 vs 50 for group B), larger coefficients (beta = 2 vs beta = 1) and higher values for X (xA_avr = 20 vs xB_avr = 10 for group B). Group A is advantaged both in characteristics (X) and in the size, strength or returns in the coefficients and intercepts.

library(tidyverse)
library(broom)
library(gtools)
interceptA_avr = 55
interceptB_avr = 50
betaA = 2
betaB = 1
xA_avr = 20
xB_avr = 10
n=1000
# group A and B characteristics
XA = rnorm(n,mean = xA_avr, sd = 5) # you dont have to generate a normal distribution
XB = rnorm(n,mean = xB_avr, sd = 5) # you can just use the average xA_avr
# outcome for groups A and B
YA = interceptA_avr + betaA*XA # + rnorm(n)
YB = interceptB_avr + betaB*XB # + rnorm(n)
# GAP
mean(YA) - mean(YB)
35
# point of view of group B
(interceptA_avr-interceptB_avr) + betaB*(mean(XA)-mean(XB)) + mean(XB)*(betaA-betaB) + (mean(XA)-mean(XB))*(betaA-betaB)
35
# point of view of group B
(interceptB_avr-interceptA_avr) + betaA*(mean(XB)-mean(XA)) + mean(XA)*(betaB-betaA) + (mean(XB)-mean(XA))*(betaB-betaA)
-35
One thing to bear in mind is that the choice of reference group can change the % explained
betaB*(mean(XA)-mean(XB))
10/35
betaA*(mean(XB)-mean(XA))
-20/-35
We can check with the package oaxacad
library(oaxaca)
library(oaxacad)
oaxaca_sim(interceptA_avr = interceptA_avr, interceptB_avr = interceptB_avr,
betaA = betaA, betaB = betaB, xA_avr = xA_avr, xB_avr = xB_avr)
oaxaca_sim(interceptA_avr =interceptB_avr , interceptB_avr = interceptA_avr,
betaA = betaB, betaB = betaA, xA_avr = xB_avr, xB_avr = xA_avr)
Empirical Example in R
Let’s have a look at a simple example using the R swiss
dataset, which provides fertility and socio-economic indicators for the 47 French-speaking provinces of Switzerland around 1888.
Let’s take Fertility
as the outcome of interest (Y
) and Education
as the explanatory variable of interest (X
). Let’s take as a group variable whether the city is a Catholic
city or not (we define Catholic if more than 80% of people are Catholic).
Non-Catholic cities have, on average, a Fertility rate of 65.52% and Catholic cities have a fertility rate of 80%. The average gap is about 15%.
Fertility is on the y-axis, and Education is on the x-axis.

data(swiss)
swiss$catholic_city = ifelse(swiss$Catholic > 80, 1, 0)
swiss$name = rownames(swiss)
swiss %>%
ggplot(aes(Education, Fertility, colour = as.factor(catholic_city), label = name)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = F, fullrange = T) +
geom_text_repel() +
scale_color_manual("Catholic City", values = c('red3', 'blue4')) +
theme_minimal(base_size = 16) + ggtitle("Education and Fertility in French-speaking provinces of Switzerland in 1888")
m0 = lm(Fertility ~ Education, data = swiss)
xt1 = cbind(expand_grid(Education = c(0,10,20,50)),
predicted = predict(m0, newdata = expand_grid(Education = c(0,10,20,50))))
xt1
We see that there is a negative association between Education and Fertility. For each increase in education, fertility declines by about 8.6%.
Let’s simulate the following predictions from a simple OLS model. If we have a city with 10% of the population beyond primary school, we predict a fertility rate of 70.99%. If we have a city with 50% of the population beyond primary school, we predict a fertility rate of 36%.
Now let’s run two separate regressions for Catholic and Non-Catholic cities.
First, we see that the intercepts are different. The Catholic cities have higher fertility at the onset. This will contribute to the unexplained part of the decomposition. Then we see that the regression coefficients are very different; the effect of education is stronger in Non-Catholic cities, at
m1 = lm(Fertility ~ Education, data = swiss, subset = catholic_city == 0 )
m2 = lm(Fertility ~ Education, data = swiss, subset = catholic_city == 1 )
data.frame(coef(m1), coef(m2))
Let’s look at the difference in characteristics between the two types of cities. Let’s only consider Education and the level of Agriculture between Catholic and Non-Catholic cities.
We can simply compute the average differences in these characteristics.
out_X = swiss %>%
group_by(catholic_city) %>%
summarise(mean(Education), mean(Agriculture), mean(Fertility))
We see that Non-Catholic cities have a higher share of the population that is educated beyond primary school,
We have a Fertility gap of
How much is that gap due to observed characteristics such as education and agriculture?
Decomposing the Fertility Gap
The basic R command for the oaxaca decomposition is
library(oaxaca)
library(oaxacad)
# let's reverse the scale of agriculture
swiss$Agriculture_rev = rev(swiss$Agriculture)
# the oaxaca decomposition
ox = oaxaca(Fertility ~ Education + Agriculture_rev | catholic_city, data=swiss)
# with the package oaxacad, we can retrieve the table below
Decomp_Regression(ox)
We have the stronger effect of education for group in
The column
Group 1, Non-Catholic cities, have, on average, higher educated people and higher urban share.
The column
Threefold Decomposition
The average gap is composed in a traditional “threefold” decomposition of the Explained part
We saw this equation previously (for group B)
Which is the table is summarised like
Intercepts plus Explained plus Coefficients and Interaction.
The share of the explained part is
We pick group 2 as the reference group, and we calculate the decomposition based on the table above.
The explained part is the sum of all the explanatory variables
(education)
(agriculture)
(sum)
which can be found on line 4 of the
The total gap is
Therefore,
This is shown in the last row of the table entitled “decomposition %”.
Counterfactuals
Note that the decomposition gives us counterfactuals.
For Catholic cities, we have 6.625% of the population with education and 46.981% of urban development.
The OLS equation for Catholic cities was
This is
Now we can switch the values for agriculture and education for Catholic cities with the ones from Non-Catholic cities.
This
If Catholic cities had had the same characteristics as Non-Catholic cities, they would have about
Decomposition and Regression
The Oaxaca decomposition is similar to a regression with an interaction between
# R
m2 = lm(Fertility ~ catholic_city*Education, data = swiss)
ex = expand_grid(catholic_city = c(0, 1), Education = c(6,13))
cbind(ex, predicted = predict(m2, newdata = ex))
The predicted values from this regression shows the “counterfactual” values
We could develop a similar counterfactual notation will something along the lines
Limitations
Everything discussed above is predicated on the assumption that the regressions estimate the causal effect of
Generally, in an effort to address all possible confounders (“closing the back-door paths”), researchers dump as many variables as possible into the decomposition. The issue is that, with cross-sectional data, this can create more bias (“collider”).
Playing with simulation with oaxacad
The package oaxacad
provides a way to simulate Oaxaca Decomposition to test different specifications.
In oaxaca_sim
, the user can set the intercepts, betas and X for group A and group B, and inspect the results.
oaxaca_sim(interceptA_avr = 100, interceptB_avr = 100,
betaA = 5, betaB = 5,
xA_avr = 10, xB_avr = 5)