In this post, let’s have a look at odds-ratio. Odds-ratio plays an important part in social sciences, in particular in the sociological study of social mobility. It used to understand the strength of the association between parents’ social class and the social class of their children.

Odds-ratio in a measure of association between two categorical variables. A categorical variable is a variable made of discrete categories such as married, divorced, single, …

Odds

The idea of odds is very close to the more familiar notion of probability.

If I roll a dice I have 1 chance over 6 to get any number. My probability of getting a \(2\), is simply \(\frac{1}{6}\)

However, the odds to get a \(2\) is \(1:5\). In other words “the odds are against me”.

Let’s use the symbol \(\pi\) (pi) to denotes probabilities. We can define the odds simply as The probability of “success” against the probability of “failure”.

\begin{equation} \text{odds} = \frac{\pi}{1 - \pi} \end{equation}

We can retrieve a probability from odds by using the following equations.

\begin{equation} \begin{split} \pi = \frac{\text{odds}}{1 + \text{odds}} \end{split} \end{equation}

Empirical Example

Let’s illustrate the use of Odds-Ratio with an example.

We have a population of 1000, with 700 (70%) are poor and 300 (30%) are rich.

In this population, 200 have been to university (\(D=1\) is “yes”, \(D=0\) is “no”).

\[\begin{aligned}[] \begin{array}{r|ll|l} \text{university} & \text{poor} & \text{rich} & \\ \hline \text{no} & 650 & 150 & 800 \\ \text{yes} & 50 & 150 & 200 \\ \hline & 700 & 300 & 1000 \end{array} \end{aligned}\]

We see in this table that poor people have much less chance to enter university than rich people. Although, poor people make up 70% of the population (with 700 people), only 50 go to university.

Odds

We can use odds to describe in more detail the patterns of the table.

Let’s start with the poor. We have \(50\) people with a university degree and \(650\) without. The probability is \(\frac{50}{700} = 0.07143\), and the odds is \(\frac{50}{650} = 0.0769\). Let’s get a fractional form \(\frac{1}{0.0769} = \frac{1}{13}\).

This means that for every 1 poor person who goes to university 13 poor people do not go to university.

The odds of poor people going to university is

\begin{equation} \frac{1}{13} \end{equation}

For the rich, we have \(150\) people with a university degree and \(150\) without. The probability is \(\frac{150}{300} = 0.5\), and the odds is \(\frac{150}{150} = \frac{1}{1}\).

This means that for every 1 rich person who goes to university 1 rich person does not go to university.

The odds of rich people going to university is

\begin{equation} \frac{1}{1} \end{equation}

Odds-Ratio

The odds ratio puts these two odds together

\begin{equation} \text{OR} = \frac{\frac{50}{650}}{\frac{150}{150}} = \frac{\frac{1}{13}}{\frac{1}{1}} \end{equation}

Is this odds-ratio (\(\text{OR}\)) reflects a strong association? We can think about the strength of the association by calculating the odds-ratio (\(\text{OR}\)) is economic status and education were not related (i.e. independent).

Independance Table

Let’s put this odds-ratio (\(\text{OR}\)) in context by calculating the table we would observed if being rich or poor did not have any association with university access. We say that the variables are independent if they are not associated. Learning about one variable is not informative or predictive about the other variable.

How do we calculate an independent table?

Poor people make 70% of this population, then they should make 70% of the people who went to university, if being poor did not matter for access to university.

Let’s call \(\delta\) (delta) the “access to university”, with \(\delta_{\text{poor}} = \delta_{\text{rich}}\) denoting that access is the same whether poor or rich. If the access is the same, then economic status does not play a role in predicting university access, therefore the variables are independent.

This is the so-called NULL hypothesis \(\text{H0} : \delta_{\text{poor}} = \delta_{\text{rich}}\). The alternative hypothesis is \(\text{HA} : \delta_{\text{poor}} \ne \delta_{\text{rich}}\), i.e. access is conditional economic status. Economic status is associated with access to university.

Let’s calculate the number of people who went to university according to the population proportion of the poor and the rich. In other words, the number if \(\text{H0}\) was true (i.e. if access was independent of economic status).

\begin{equation} 0.7 \times 200 = 140 \end{equation}

\begin{equation} 0.3 \times 200 = 60 \end{equation}

These values would fill these two cells

\[\begin{aligned}[] \begin{array}{r|ll|l} \text{university} & \text{poor} & \text{rich} & \\ \hline \text{no} & & & 800 \\ \text{yes} & 140 & 60 & 200 \\ \hline & 700 & 300 & 1000 \end{array} \end{aligned}\]

We can also calculate the numbers for those who did not go to university (which is 800 people) according to the population proportion of economic status.

\begin{equation} 0.7 \times 800 = 560 \end{equation}

\begin{equation} 0.3 \times 800 = 60 \end{equation}

Now we have a table that reflects the proportion of the population of the poor and the rich in university attendance. The access to university simply reflects the size of each population.

This is the table we should observe if access to university was independent to economic status. In other words, if \(H0\) was true.

\[\begin{aligned}[] \begin{array}{r|ll|l} \text{university} & \text{poor} & \text{rich} & \\ \hline \text{no} & 560 & 240 & 800 \\ \text{yes} & 140 & 60 & 200 \\ \hline & 700 & 300 & 1000 \end{array} \end{aligned}\]

Another simple way to calculate this is to multiply the margin and divide by the total. For instance for cell number 1, \(\frac{700 \times 800} = 560\).

Expected Table \(\begin{aligned}[] \begin{array}{r|ll|l} \text{university} & \text{poor} & \text{rich} & \\ \hline \text{no} & 560 & 240 & 800 \\ \text{yes} & 140 & 60 & 200 \\ \hline & 700 & 300 & 1000 \end{array} \end{aligned}\)

Observed Table \(\begin{aligned}[] \begin{array}{r|ll|l} \text{university} & \text{poor} & \text{rich} & \\ \hline \text{no} & 650 & 150 & 800 \\ \text{yes} & 50 & 150 & 200 \\ \hline & 700 & 300 & 1000 \end{array} \end{aligned}\)

Odds-Ratio of the independence table

Let’s investigate the odds-ratio we would have given that economic status had no bearing on university access. In other words, if the two variables were not associated.

We first have our independence table, also called the “expected” table (expected given no association or expected if the NULL hypothesis was true).

If we repeated the calculation of odds for the poor and the rich for the independence/expected table we would have the following result

\begin{equation} \text{odds poor} = \frac{140}{560} = \frac{1}{4} = 0.25 \end{equation}

\begin{equation} \text{odds poor} = \frac{60}{240} = \frac{1}{4} = 0.25 \end{equation}

Now if we take again the ratio of these two odds we have

\begin{equation} \text{OR ind} = \frac{\frac{140}{560}}{\frac{60}{240}} = \frac{0.25}{0.25} = 1 \end{equation}

We have discovered that when two variables are independent (not associated, uncorrelated, etc) we have an odds-ratio of 1 (The odds-ratio of independence \(\text{OR ind} = 1\)).

How does our observed odds-ratio of \(\frac{1}{13}\) compares with the odds-ratio if the two variables were independent \(\text{OR ind} = 1\)?

In other words, how far is \(\frac{1}{13}\) from the expected value of 1?

How likely is it to get this odds-ratio by chance alone?

Statistical inference of Odds-Ratio

This bit is a bit more technical because it involves hypothesis testing, which is a another topic.

A common way for testing this is to use a log transformation because we can then works with a normal distribution.

\begin{equation} L = \ln \left( \text{OR} \right) \end{equation}

\begin{equation} L = \ln \left( \frac{1}{13} \right) = -2.56 \end{equation}

Note that the log value of the odds-ratio for the independant value is

\begin{equation} L = \ln \left( \frac{1}{1} \right) = 0 \end{equation}

So, we now ask the same question: how far is -2.56 from 0?

Is -2.56 statistically significantly different from 0?

To answer this question we need to calculate the standard error, \text{SE}, which is simply

\begin{equation} \text{SE} = \sqrt{ \frac{1}{650} + \frac{1}{50} + \frac{1}{150} + \frac{1}{150} } = 0.187 \end{equation}

We can derive a z-value by \(z = \frac{L}{\text{SE}}\) and from it derive a p-value.

In our case the z-value is

\begin{equation} z = \frac{-2.56}{0.187} = -13.68 \end{equation}

A simply way to compute the p.value for the odds-ratio is to do a poisson regression

\[\begin{aligned} \begin{array}{rllr} \hline & status & university & obs \\ \hline 1 & poor & no & 650.00 \\ 2 & rich & no & 150.00 \\ 3 & poor & yes & 50.00 \\ 4 & rich & yes & 150.00 \\ \end{array} \end{aligned}\]

In R this is straightforward

# the data
ex = expand.grid(status = c('poor','rich'), university = c('no', 'yes'))
ex$obs = c(650, 150, 50, 150)
ex$status = relevel(ex$status, 'rich')

We use the function glm and run a regression interacting university and economic status

glm(obs ~ university*status, data=ex, family = 'poisson')

\[\begin{aligned} \begin{array}{rlrrrr} \hline & term & estimate & std.error & statistic & p.value \\ \hline 1 & (Intercept) & 5.01 & 0.08 & 61.37 & 0.00 \\ 2 & \text{university : yes} & -0.00 & 0.12 & -0.00 & 1.00 \\ 3 & \text{status : poor} & 1.47 & 0.09 & 16.19 & 0.00 \\ 4 & \text{university-yes x status-poor} & -2.56 & 0.19 & -13.74 & 0.00 \\ \hline \end{array} \end{aligned}\]

We find the same value for the coefficient of \(-2.56\), which is the log of the odds, and the z-value of \(-13\). We find a p-value less than 0.00, which indicates a highly significant result

Remember that exponentiating the log returns the odds, and in this case the odds-ratio. \(\text{exp}(-2.56) = \frac{1}{13}\).

In R you can retrieve your table by predicted the values from the poisson regression

g1 = glm(obs ~ university*status, data=ex, family = 'poisson')
predict(g1, type = "response") # # same as below 
exp(predict(g1)) # same as above 

Neatly, you can generated the values under independence by simply running a regression without the interaction term.

# Independance model
g_indep = glm(obs ~ university + status, data=ex, family = 'poisson')
predict(g_indep, type = "response") # these are the values under undependance