Simulate Logistic Regression Data in R
Introduction
Simulation can be a great way to understand an empirical quantitative problem. In this post, let’s see how we can generate binary (dummy) outcome variables for logistic regression. A binary variable consists only of zeros and ones.
To simulate binary data, we follow three main steps:
- Build a linear model
- Transform the outcome of the model into probabilities
- From the probabilities, draw binary data (from a binomial distribution)
The Model
Imagine that we want to simulate the probability of getting a job. This is our outcome variable.
We will build the following simple model: getting a job is affected by education (\(e\)) and discrimination (\(d\)). In this model, education and discrimination are uncorrelated.
We can translate this DAG into a simple linear model:
\begin{equation} y_{i} = \beta_{1} e_{i} + \beta_{2} d_{i} + \epsilon_{i} \end{equation}
Let Education (\(e\)) be a standardised normal variable and Discrimination (\(d\)) a binary variable indicating whether the person belongs to group \(D\). Let’s generate group membership with a binomial distribution, with a 50% chance of belonging to group \(D\).
library(tidyverse)
N = 500 # number of observations
education = rnorm(N, 0, 1) # standardised education
d = rbinom(N, 1, 0.5) # membership in group D
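As a quick sanity check (an optional addition to the walkthrough), we can verify that the simulated inputs look as intended:
summary(education) # roughly standard normal, centred on 0
mean(d) # share of the sample in group D, close to 0.5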
Let’s give the following values to \(\beta_{1}\) and \(\beta_{2}\)
\begin{equation} y_{i} = (1.5 \cdot e_{i}) - (2 \cdot d_{i}) + \epsilon_{i} \end{equation}
Education increases the chances of getting a job, while discrimination reduces the chances of getting a job.
Let’s add an error term to the equation:
error = rnorm(N) # random error term
y = 1.5*education + (-2*d) + error
We can now transform this outcome \(y\) (which we conceive of as being on a logit scale) into a probability, which we call \(\pi_{i}\), using the following transformation.
\begin{equation} \pi_{i} = \frac{1}{1 + e^{-y_{i}}} \end{equation}
This is equivalent to
\begin{equation} \pi_{i} = \frac{e^{y_{i}}}{1 + e^{y_{i}}} \end{equation}
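A quick numerical check (an addition for illustration, with made-up logit values) confirms that the two forms agree:
z = c(-2, 0, 1.5) # a few example logit values
all.equal(1 / (1 + exp(-z)), exp(z) / (1 + exp(z))) # TRUE: both forms are identical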
Note that this is the standard logistic model
\begin{equation} \pi_{i} = \frac{1}{1 + e^{- \mathbf{X\beta}}} \end{equation}
The logistic transformation is not the only way to turn real values into probabilities; the probit transformation (the standard normal CDF) is a common alternative.
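For instance, a probit version of the same step would pass the latent values through the standard normal CDF with pnorm() (a sketch, not part of the original code):
y_proba_probit = pnorm(y) # probit: probabilities via the standard normal CDF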
Generate Probabilities
The R function plogis() implements the logistic equation shown above:
y_proba = plogis(y)
We now have a probability for each row (each case/individual) that is a function of education and discrimination.
The logistic transformation turns the values we generated from our model (which we take as being log-odds) into probabilities. Note that probabilities are bounded to \([0,1]\), while our log-odds (“logit values”) range from minus infinity to infinity, \((- \infty, \infty)\).
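We can verify these bounds on our own simulated values (a small addition to the original code):
range(y) # the logit-scale values are unbounded
range(y_proba) # the probabilities stay strictly inside (0, 1)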
You can plot the S-shaped logistic curve with the following code:
logit_scale = seq(-6, 6, by = 0.1)
plot(plogis(logit_scale), logit_scale, xlab = 'probabilities', ylab = 'logit scale', type = 'n')
lines(plogis(logit_scale), logit_scale)
Generate Binary Outcome
Let’s now finally generate the binary outcome for the logistic regression.
We use the R function rbinom() to generate data from the binomial distribution.
y_dummy = rbinom(n = N, size = 1, prob = y_proba)
The function rbinom() generates 0s and 1s based on the probabilities that we supply (y_proba). For instance, a case/row/individual with a probability of 0.5 has a 50% chance of being drawn as a 1 (and a 50% chance of being drawn as a 0).
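A quick illustration of this behaviour with a made-up fixed probability (separate from our simulation):
mean(rbinom(1e5, 1, 0.3)) # the share of 1s converges to the input probability, ~0.3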
We now have a dummy variable we can use to test a logistic regression model. Let’s put everything into a dataset.
df_sim = data.frame(id = 1:N, education, d, y, y_proba, y_dummy)
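We can inspect the first rows with head():
head(df_sim)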
The dataset looks like this (you will have different values):
\[\begin{array}{rrrrrrr} \hline & id & education & d & y & y\_proba & y\_dummy \\ \hline 1 & 1 & -1.18 & 0 & -0.46 & 0.39 & 0 \\ 2 & 2 & 0.28 & 1 & -1.19 & 0.23 & 0 \\ 3 & 3 & 1.16 & 1 & 0.05 & 0.51 & 1 \\ 4 & 4 & 0.96 & 0 & -0.51 & 0.37 & 0 \\ 5 & 5 & -1.24 & 0 & -0.42 & 0.40 & 0 \\ 6 & 6 & 0.53 & 1 & -3.65 & 0.03 & 0 \\ ... & ... & ... & ... & ... & ... & ... \\ \hline \end{array}\]
Logistic Regression
Finally, we can run a logistic regression on the simulated dataset:
glm_model1 = glm(y_dummy ~ education + d, data = df_sim, family = 'binomial')
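One way to produce a tidy coefficient table like the one below is broom::tidy() (broom is installed with the tidyverse, though not attached by library(tidyverse)):
broom::tidy(glm_model1) # term, estimate, std.error, statistic, p.value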
\[\begin{array}{rlrrrr} \hline & \text{term} & \hat{\beta} & \text{std.error} & \text{statistic} & \text{p.value} \\ \hline 1 & \text{(Intercept)} & -0.02 & 0.15 & -0.14 & 0.89 \\ 2 & \text{education} & 1.39 & 0.15 & 9.44 & 0.00 \\ 3 & \text{discrimination} & -1.47 & 0.23 & -6.35 & 0.00 \\ \hline \end{array}\]
The estimates are close to the true values (1.5 and -2), though somewhat attenuated towards zero; this is expected, since the extra error term we added inflates the variance on the latent scale. The coefficients are expressed in log-odds; we can retrieve the odds by simply taking the exponential of the coefficients.
exp(coef(glm_model1))
We can convert the model’s predictions to probabilities with the predict() function:
pred_prob_response = predict(glm_model1, type = "response")
Let’s check whether the predicted probabilities from the model are close to the true probabilities we generated.
head(cbind(df_sim, pred_prob_response))
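Beyond the first few rows, we can also correlate or plot the true and predicted probabilities (a small addition to the original code):
cor(df_sim$y_proba, pred_prob_response) # should be close to 1
plot(df_sim$y_proba, pred_prob_response, xlab = 'true probability', ylab = 'predicted probability')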
Let’s plot the predicted probabilities with the ggeffects library.
library(ggeffects) # you need to install this library
# predicted probabilities
g_pred1 = ggpredict(glm_model1, terms = c('education[all]', 'd'))
g_pred1 = as.data.frame(g_pred1)
# ggplot
ggplot(g_pred1, aes(x, predicted, colour = group)) +
geom_point() + geom_line() + xlab('Education') +
scale_x_continuous(breaks = seq(-3,3,by=1)) + theme_minimal()
We can also compare the results to a simple OLS regression on the binary outcome (i.e. a linear probability model).
lm_model1 = lm(y_dummy ~ education + d, data = df_sim)
pred_lm_model = predict(lm_model1)
What do you see?
plot(pred_lm_model, pred_prob_response)
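One thing to look for: unlike the logistic predictions, the linear probability model is not bounded, so its fitted values can fall outside \([0,1]\):
range(pred_prob_response) # always strictly inside (0, 1)
range(pred_lm_model) # can dip below 0 or exceed 1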
The full code
library(tidyverse)
library(ggeffects)
N=500
education = rnorm(N,0,1)
d = rbinom(N, 1, 0.5)
error = rnorm(N) # random error term
# the model
y = 1.5*education + (-2*d) + error
# logistic transformation
y_proba = plogis(y)
# draw binary data from binomial distribution
y_dummy = rbinom(n = N, size = 1, prob = y_proba)
# dataframe
df_sim = data.frame(id = 1:N, education, d, y, y_proba, y_dummy)
# the logistic regression model
glm_model1 = glm(y_dummy ~ education + d, data = df_sim, family = 'binomial')
# summary, odds and confidence intervals
summary(glm_model1)
exp(coef(glm_model1))
confint(glm_model1)
# predicted probabilities
g_pred1 = ggpredict(glm_model1, terms = c('education[all]', 'd'))
g_pred1 = as.data.frame(g_pred1)
# ggplot
ggplot(g_pred1, aes(x, predicted, colour = group)) +
geom_point() + geom_line() + xlab('Education') +
scale_x_continuous(breaks = seq(-3,3,by=1)) + theme_minimal()