Logistic Regression

1 Logistic Regression

Logistic regression (or logit) is a regression technique for when the outcome variable is binary.

In logistic regression the goal is to predict \(\text{P}(Y=1|X=x)\), where \(X\) is the predictor. Although the outcome variable is either 0 or 1, the predicted outcome will be a probability, so it can be anything between 0 and 1.

Complication: to get a linear model, we need to do a non-linear transformation of \(\text{P}(Y=1|X=x)\), using what’s known as a “logit” or “log-odds” function.

If we let \(p(x) = \text{P}(Y=1|X=x)\), then the logistic regression model has the following functional form:

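\[\ln \bigg( \frac{p(x_i)}{1-p(x_i)} \bigg) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}\]

where the LHS is the logit function, and the RHS is a linear function of the predictors, just like in OLS regression. Unlike OLS, there is no additive error term: the randomness in the model comes from \(Y_i\) itself, which equals 1 with probability \(p(x_i)\) and 0 otherwise.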
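To make the transformation concrete, here is a minimal sketch of the logit and its inverse using base R's qlogis() and plogis(). The probabilities in `p` are made-up illustration values, not data from any study.

```r
# Illustrative probabilities (hypothetical values chosen for this sketch)
p <- c(0.1, 0.5, 0.9)

# qlogis() is the logit: log(p / (1 - p))
log_odds <- qlogis(p)
manual_log_odds <- log(p / (1 - p))

# plogis() is the inverse logit: exp(z) / (1 + exp(z))
p_back <- plogis(log_odds)

print(log_odds)  # negative below p = 0.5, zero at 0.5, positive above
print(p_back)    # recovers the original probabilities
```

The logit maps probabilities in (0, 1) onto the whole real line, which is why the right-hand side of the model can be an unrestricted linear function.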

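Another complication: the regression coefficients. In logit models the regression coefficients represent log-odds ratios. If we exponentiate a coefficient (which we usually do), we get an odds ratio: the factor by which the odds of a success (\(Y=1\)) are multiplied when that predictor increases by one unit, holding the other predictors fixed. An odds ratio above 1 means the predictor raises the odds of success, and one below 1 means it lowers them.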
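The link between an exponentiated coefficient and an odds ratio can be checked on simulated data. The sketch below is illustrative only: the sample size, the binary predictor `x`, and the coefficients `beta0` and `beta1` are all hypothetical. With a single binary predictor, the exponentiated slope should match the odds ratio computed directly from the 2x2 table of counts.

```r
set.seed(1)

# Hypothetical setup: one binary predictor and made-up true coefficients
n     <- 5000
x     <- rbinom(n, size = 1, prob = 0.5)
beta0 <- -0.5
beta1 <- 0.7

# Simulate a binary outcome whose log-odds are beta0 + beta1 * x
y <- rbinom(n, size = 1, prob = plogis(beta0 + beta1 * x))

# Fit the logit model and exponentiate the slope
fit      <- glm(y ~ x, family = binomial(link = "logit"))
or_model <- exp(coef(fit))["x"]

# Odds ratio computed by hand from the 2x2 table of counts
tab      <- table(x, y)
odds0    <- tab["0", "1"] / tab["0", "0"]  # odds of y = 1 when x = 0
odds1    <- tab["1", "1"] / tab["1", "0"]  # odds of y = 1 when x = 1
or_table <- odds1 / odds0

print(c(model = unname(or_model), table = unname(or_table)))
```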

If you’re interested, the next section gives an example of a logit model used in a real study on the effects of language on economic decision-making.

2 Linguistic Determinism? A Study

A study by Keith Chen in 2013 tested a “linguistic-savings hypothesis”: whether grammatical differences between languages affect intertemporal choice and economic decision-making. [Read the original study here.] Chen investigated whether Strong FTR (future-time reference) languages, i.e. ones that grammatically ‘mark’ the future, cause speakers to discount the future more than Weak FTR languages, i.e. ones that do not grammatically mark the future, and whether this behavioural difference shows up in individual savings behaviour. His study concluded that the distancing mechanism inherent in Strong FTR languages does indeed significantly lower the probability that an individual will take future-oriented actions.

First, a brief note on languages:

A Strong FTR language is one that grammatically ‘marks’ the future (with a word, or a set of words, etc.). English and French are two examples.

Weak FTR languages do not grammatically mark the future in this manner; often in such languages the present tense form of the verb is used, and the notion of future time is inferred from context rather than from grammatical markers. German and Finnish are two examples.

Since saving money is a future-oriented activity, Chen claimed the distancing mechanism in Strong FTR languages causes people to (unwittingly) save less money. In his study, Chen found a significant difference in savings behaviour between speakers of Strong FTR and Weak FTR languages, even when controlling for a number of individual and country variables.

Chen used a logit model to estimate the probability that an individual will save money, based on whether they speak a Strong or Weak FTR language:

\[\text{P}(\text{save}_{it}) = \frac{\exp(Z_{it})}{1 + \exp(Z_{it})}\]

where \(\text{save}_{it}\) is a binary outcome variable (either the individual saved money (1) or did not save money (0) during a particular year). The linear predictor \(Z_{it}\), which is the log-odds of saving, is:

\[Z_{it} = \beta_{0} + \beta_{1} \text{StrongFTR} + \beta_{2} X_{it} + \beta_{3} X_{t} + \beta_{4} F_{it}^{ex} + \beta_{5} F_{t}^{c}\]

where \(\text{StrongFTR}\) is the variable of interest, and the other predictors serve as various individual and country-level control variables.

Below is a crude reimplementation of the main logit model in the study. [Download the data here.] We can use the glm() function from the stats package to perform logistic regression.

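library(stats)   # provides glm(); attached by default in R
library(alpaca)  # provides feglm(), used for the model with controls further down

## simple logit regression
## (ccdata is the study data, assumed to be already loaded as a data frame)
reg1 = glm(SavedThisYr ~ StrongFTR,
           data = ccdata, family = binomial(link = 'logit'))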

## model summary
summary(reg1)
## 
## Call:
## glm(formula = SavedThisYr ~ StrongFTR, family = binomial(link = "logit"), 
##     data = ccdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9504  -0.6797  -0.6797  -0.6797   1.7769  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.56054    0.01471  -38.11   <2e-16 ***
## StrongFTR   -0.78711    0.01623  -48.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 160446  on 149692  degrees of freedom
## Residual deviance: 158225  on 149691  degrees of freedom
##   (14928 observations deleted due to missingness)
## AIC: 158229
## 
## Number of Fisher Scoring iterations: 4

The code above performs a simple logistic regression of the binary outcome, SavedThisYr, on the main predictor of interest, StrongFTR (also a binary variable, coded 1 if the language is classified Strong FTR, 0 if the language is classified Weak FTR). The coefficient on StrongFTR is -0.7871. This is the log-odds ratio. To make this number more meaningful, we exponentiate the model coefficients, as follows:

## odds-ratio 
exp(coef(reg1))
## (Intercept)   StrongFTR 
##   0.5709008   0.4551603

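The exponentiated intercept, 0.571, is the odds of saving for speakers of Weak FTR languages (the baseline group). The exponentiated coefficient on StrongFTR, 0.455, is an odds ratio: based on the data in the study, the odds that a speaker of a Strong FTR language saved money that year are only about 45% of the odds for a speaker of a Weak FTR language. Note that this is a statement about odds, not probabilities. That Strong FTR speakers save less was indeed Chen’s main conclusion in the study.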
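To see what the fitted model implies on the probability scale, the coefficients can be pushed back through the inverse logit. The sketch below hard-codes the two estimates printed in the summary(reg1) output, so it runs without the data. The variable names are just for illustration.

```r
# Estimates copied from the summary(reg1) output
b0 <- -0.56054  # intercept: log-odds of saving for Weak FTR speakers
b1 <- -0.78711  # StrongFTR: log-odds ratio, Strong vs Weak FTR

# Predicted probabilities of saving for each group
p_weak   <- plogis(b0)       # roughly 0.36
p_strong <- plogis(b0 + b1)  # roughly 0.21

# The ratio of odds reproduces exp(b1), about 0.455 ...
odds_ratio <- (p_strong / (1 - p_strong)) / (p_weak / (1 - p_weak))

# ... but the ratio of probabilities is a different number, about 0.57
prob_ratio <- p_strong / p_weak

print(round(c(p_weak = p_weak, p_strong = p_strong,
              odds_ratio = odds_ratio, prob_ratio = prob_ratio), 3))
```

The gap between the last two numbers is why an odds ratio should not be read directly as "times as likely".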

It’s a bold claim. Many have contested the validity of Chen’s study, claiming (with good reason) that he did not properly isolate (or even identify) the causal effect in his study. Read a critical response to Chen’s study here: https://dlc.hypotheses.org/360.

Nevertheless, it’s still an interesting result. Chen showed that the coefficient on StrongFTR holds up even after adding a number of individual and country-level control variables.

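The following model adds control variables for a country’s legal system, its log GDP per capita, the GDP growth rate, the unemployment and interest rates, and measures of trust and family importance, along with fixed effects for age category, sex, and continent. It is fitted with feglm() from the alpaca package, where the terms after the | in the formula are the fixed effects: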

## logistic regression with controls
reg6 = feglm(SavedThisYr ~ StrongFTR + LegalFr + LegalGe + LegalSc + logPCGDP +
               Growth_PCGDP + Unemployed + RealIntRate + LegalRightsIndex +
               TrustMostPpl + FamilyImp + AvgTrust + AvgFamilyImp +
               LanguageShare + FTRShare |
               AgeCat + Sex + Continent,
             data = ccdata, family = binomial(link = 'logit'))

## model summary
summary(reg6)
## binomial - logit link
## 
## SavedThisYr ~ StrongFTR + LegalFr + LegalGe + LegalSc + logPCGDP + 
##     Growth_PCGDP + Unemployed + RealIntRate + LegalRightsIndex + 
##     TrustMostPpl + FamilyImp + AvgTrust + AvgFamilyImp + LanguageShare + 
##     FTRShare | AgeCat + Sex + Continent
## 
## Estimates:
##                    Estimate Std. error z value Pr(> |z|)    
## StrongFTR        -0.7291403  0.0247283 -29.486   < 2e-16 ***
## LegalFr          -0.5459611  0.0279293 -19.548   < 2e-16 ***
## LegalGe          -0.7240986  0.0382698 -18.921   < 2e-16 ***
## LegalSc          -0.3028061  0.0615508  -4.920  8.67e-07 ***
## logPCGDP          0.1404443  0.0074360  18.887   < 2e-16 ***
## Growth_PCGDP     -0.8701121  0.1600799  -5.435  5.46e-08 ***
## Unemployed       -0.6530556  0.0262043 -24.922   < 2e-16 ***
## RealIntRate      -0.0030567  0.0004945  -6.181  6.36e-10 ***
## LegalRightsIndex  0.0321659  0.0050492   6.370  1.88e-10 ***
## TrustMostPpl     -0.2055092  0.0157110 -13.081   < 2e-16 ***
## FamilyImp        -0.1196940  0.0204598  -5.850  4.91e-09 ***
## AvgTrust          0.1055689  0.0762300   1.385    0.1661    
## AvgFamilyImp      0.2907947  0.1121252   2.593    0.0095 ** 
## LanguageShare    -0.1227765  0.0275766  -4.452  8.50e-06 ***
## FTRShare         -0.2613427  0.0657343  -3.976  7.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## residual deviance= 134284.65,
## null deviance= 142139.96,
## n= 130623, l= [9, 2, 6]
## 
## ( 33998 observation(s) deleted due to missingness )
## 
## Number of Fisher Scoring Iterations: 4

## odds ratios
exp(coef(reg6))
##        StrongFTR          LegalFr          LegalGe          LegalSc 
##        0.4823235        0.5792848        0.4847613        0.7387424 
##         logPCGDP     Growth_PCGDP       Unemployed      RealIntRate 
##        1.1507850        0.4189046        0.5204531        0.9969479 
## LegalRightsIndex     TrustMostPpl        FamilyImp         AvgTrust 
##        1.0326888        0.8142326        0.8871918        1.1113426 
##     AvgFamilyImp    LanguageShare         FTRShare 
##        1.3374899        0.8844613        0.7700170

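As you can see, the coefficient on StrongFTR stays close to its original value even after adding these controls: it moves from -0.787 to -0.729 (an odds ratio of 0.455 versus 0.482) and remains highly significant. Note that the two models are fitted on different samples, since more observations are dropped for missingness once the controls are added.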
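One rough way to compare the two estimates is to put approximate 95% Wald confidence intervals around each odds ratio. This is a sketch that uses only the estimates and standard errors printed in the two summaries above and assumes the usual normal approximation. It is not part of Chen’s analysis, and the labels `simple` and `controls` are just for illustration.

```r
# StrongFTR estimates and standard errors copied from the two summaries
est <- c(simple = -0.78711, controls = -0.7291403)
se  <- c(simple = 0.01623,  controls = 0.0247283)

# Wald interval on the log-odds scale, then exponentiate to the odds-ratio scale
z  <- qnorm(0.975)
ci <- cbind(odds_ratio = exp(est),
            lower      = exp(est - z * se),
            upper      = exp(est + z * se))

print(round(ci, 3))
```

Both intervals sit well below 1, which is consistent with the negative, highly significant coefficients in the two summaries.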