Classical Hypothesis Testing
How to model and interpret interactions with dummies
Today we will be working with the example from the lecture, and we will explore the relationship between individuals’ education and earnings. The data we are using are simulated.
dat1 <- read.csv("raw-data/income_fake_data.csv")
glimpse(dat1)
## Rows: 100
## Columns: 3
## $ gender <chr> "male", "female", "female", "female", "female", "female", "f~
## $ education <int> 6, 10, 8, 4, 10, 8, 7, 9, 11, 8, 7, 9, 5, 11, 8, 9, 7, 6, 7,~
## $ income <dbl> 27964.16, 32660.45, 29997.91, 22471.18, 33713.58, 28609.15, ~
head(dat1)
## gender education income
## 1 male 6 27964.16
## 2 female 10 32660.45
## 3 female 8 29997.91
## 4 female 4 22471.18
## 5 female 10 33713.58
## 6 female 8 28609.15
The gender variable is not quite in the format we want to use. Let’s build the binary variable (aka dummy) female, which is 1 if gender is female and 0 if gender is male.
# we can do this with built-in ifelse statement, but we'll need to nest two
dat1$female <- ifelse(dat1$gender == "female", 1,
  ifelse(dat1$gender == "male", 0, NA)
)
# dplyr has a case_when function for such purposes
# you can use the function that is more intuitive
female <- case_when(
  dat1$gender == "female" ~ 1,
  dat1$gender == "male" ~ 0
  # all other values are NAs as not specified
)
# compare the result
all(dat1$female == female)
## [1] TRUE
Very good. Now we want to run a model to investigate the relationship between education and income. Our hypothesis is that education positively affects income. We will start with a very simple model:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_{1} \text{Education}_i \]
m1 <- lm(income ~ education, data = dat1)
Let’s have a look at the model summary:
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.6 -640.6 17.1 803.9 3250.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21286.61 640.69 33.23 <2e-16 ***
## education 1138.53 73.92 15.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1457 on 98 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7047
## F-statistic: 237.2 on 1 and 98 DF, p-value: < 2.2e-16
And since having a picture is almost always a good idea, let’s plot this:
plot(
x = jitter(dat1$education), # why use jitter here?
y = dat1$income,
ylim = c(20000, 40000),
bty = "n",
xlab = "Education (years)",
ylab = "Income (USD '000)", # original scale is in USD
main = "Relationship between Education and Income",
pch = 19,
axes = F, # we hide default axes
col = magma(1, alpha = 0.75) # another colorblind-friendly palette
)
# And add the regression line
abline(
a = (coef(m1)[1]),
b = (coef(m1)[2]),
col = magma(1, alpha = 0.75),
lwd = 2
)
axis(1) # we add back x-axis labels
axis(2, # add back y-axis
at = seq(20000, 40000, by = 5000),
labels = seq(20, 40, by = 5), # manually specify the labels
las = 1 # rotate the numbers
)
How would you interpret the coefficients?
The regression equation may help: \[ \begin{aligned} \operatorname{\widehat{income}} &= 21286.61 + 1138.53(\operatorname{education}) \end{aligned} \]
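To make the interpretation concrete, here is a quick check with the fitted coefficients (a minimal sketch; the value of 10 years is just an illustrative choice):

# predicted income for a person with 10 years of education:
# intercept + slope * 10
coef(m1)[1] + coef(m1)[2] * 10 # 21286.61 + 1138.53 * 10 = 32671.91

# one additional year of education shifts the prediction by the slope
coef(m1)[1] + coef(m1)[2] * 11 # exactly 1138.53 higher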
ggplot(data = dat1, aes(education, income)) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0), # only shift along x axis
    color = magma(1, alpha = 0.75)
  ) +
  geom_smooth(
    method = "lm", # add regression line
    se = FALSE,
    color = magma(1, alpha = 0.75)
  ) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income",
    title = "Relationship between Education and Income"
  ) +
  scale_y_continuous(
    labels = scales::dollar, # nice y labels
    limits = c(20000, 40000)
  )
Whichever approach to statistical inference we take, we will need the standard errors of our coefficients. We will work with the slope coefficient here. The formula below comes from the slides:
\[ SE(\hat\beta_1)= \sqrt{Var(\hat{\beta_1})}, \qquad Var(\hat{\beta_1}) = \frac{\overbrace{\hat\sigma^2}^{\text{estimated regression error}\\\text{a.k.a. residual variance}}}{\underbrace{\sum_{i=1}^{n}(x_i - \bar x)^2}_{\text{sum of squares of } x}} \]
And for the residual variance, we need the model residuals, i.e., estimated error terms, and the degrees of freedom:
\[ \hat\sigma^2 = \frac{\overbrace{\sum_{i=1}^{n}\hat{e}_i^2}^{\text{sum of squared residuals}}}{ \underbrace{ \underbrace{n}_{\text{number of}\\\text{observations}}-\underbrace{k}_{\text{number of}\\\text{covariates}} - 1 }_{\text{degrees of freedom}} } \]
Let’s implement these formulas in R:
# start with residual variance components:
# get n (sample size) and k (number of covariates + 1, i.e. slopes + intercept)
n <- nrow(dat1)
k <- length(coef(m1))

# calculate the estimate of sigma^2 hat
sigma_sq_hat <- sum(residuals(m1)^2) / (n - k) # k already includes (- 1) part!
sqrt(sigma_sq_hat)
## [1] 1457.272
# get the sum of squares of x
denominator <- sum((dat1$education - mean(dat1$education))^2)

# calculate the slope SE
se <- sqrt(sigma_sq_hat / denominator)
se
## [1] 73.92085
# compare to built-in function
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.6 -640.6 17.1 803.9 3250.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21286.61 640.69 33.23 <2e-16 ***
## education 1138.53 73.92 15.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1457 on 98 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7047
## F-statistic: 237.2 on 1 and 98 DF, p-value: < 2.2e-16
We can write our own function to calculate standard errors of regression coefficients.
se_slope <- function(lm, x) {
  n <- length(x)
  k <- length(coef(lm))
  sigma_est <- sqrt(sum(residuals(lm)^2) / (n - k)) # k includes covariates + 1
  x_se <- sqrt(sum((x - mean(x))^2))
  return(sigma_est / x_se)
}

se <- se_slope(m1, dat1$education)
se
## [1] 73.92085
Now that we have the standard error, we can move to calculating the confidence interval. As you learned in the lecture, we could assume a normal sampling distribution for the parameter, i.e., use the normal approximation. This works well when the number of observations is large. But if not, we may need to use a \(t\)-distribution. Let’s look at both of them.
This approach is exactly what we were doing before when calculating confidence intervals for the mean. Let’s get the 95% confidence interval for the slope in our model:
# point estimate +/- z-score * SE
coef(m1)[2] + qnorm(c(0.025, 0.975), 0, 1) * se
## [1] 993.651 1283.415
# FYI, we can also write it as follows:
qnorm(c(0.025, 0.975), coef(m1)[2], se)
## [1] 993.651 1283.415
# qnorm part gives us z-scores, which are quantiles from a standard normal
# we are basically reverting the standardization
qnorm(c(0.025, 0.975), 0, 1) * se
## [1] -144.8822 144.8822
qnorm(c(0.025, 0.975), 0, se)
## [1] -144.8822 144.8822
# and shift the distribution to one side
coef(m1)[2] + qnorm(c(0.025, 0.975), 0, 1)
## [1] 1136.573 1140.493
qnorm(c(0.025, 0.975), coef(m1)[2], 1)
## [1] 1136.573 1140.493
How would you interpret this?
Remember, XX% of random samples of size \(n\) are expected to produce XX% confidence intervals that contain the true population parameter.
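If you want to see this interpretation in action, a small simulation can help (a sketch with invented true parameters, purely for illustration):

# simulate 1000 samples from a known model and check how often
# the 95% CI for the slope covers the true value
set.seed(123)
true_b1 <- 1000 # assumed true slope
covered <- replicate(1000, {
  x <- rnorm(100, mean = 8, sd = 2)
  y <- 20000 + true_b1 * x + rnorm(100, sd = 1500)
  ci <- confint(lm(y ~ x))[2, ] # 95% CI for the slope
  ci[1] < true_b1 & true_b1 < ci[2]
})
mean(covered) # should be close to 0.95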
We cannot directly calculate the standard error of \(\hat \beta\), so we use the sample standard deviation \(\hat \sigma\) instead of the population standard deviation \(\sigma\). This strategy tends to work well when we have a lot of data and can estimate \(\sigma\) accurately using \(\hat \sigma\). However, the estimate is less precise in smaller samples, and this leads to problems when using the normal distribution. Instead, we can use the \(t\)-distribution.
The \(t\)-distribution is always centered at zero and has a single parameter: degrees of freedom. The degrees of freedom (\(df\)) describe the precise form of the bell-shaped \(t\)-distribution. When we have more observations, the degrees of freedom will be larger and the \(t\)-distribution will look more like the standard normal distribution \(\mathcal{N}(0,1)\); when the degrees of freedom is about 30 or more, the \(t\)-distribution is nearly indistinguishable from \(\mathcal{N}(0,1)\), as you can spot from the plot.
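In case the plot is not in front of you, here is a minimal sketch of how such a comparison could be drawn (the degrees of freedom shown are arbitrary illustrative choices):

# compare the standard normal with t-distributions of varying df
curve(dnorm(x),
  from = -4, to = 4, lwd = 2, bty = "n",
  xlab = "", ylab = "Density", las = 1
)
curve(dt(x, df = 2), add = TRUE, col = "grey30", lty = 2)
curve(dt(x, df = 10), add = TRUE, col = "grey50", lty = 3)
curve(dt(x, df = 30), add = TRUE, col = "grey70", lty = 4)
legend("topright",
  legend = c("N(0,1)", "t, df = 2", "t, df = 10", "t, df = 30"),
  lty = 1:4, col = c("black", "grey30", "grey50", "grey70"), bty = "n"
)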
What we see from this plot is that \(t\)-distribution with smaller degrees of freedom has thicker tails than a standard normal distribution. Thicker tails indicate that more data points lie further from the center and closer to more extreme values.
How should the fact that the \(t\)-distribution has thicker tails affect the width of the confidence interval when we use t-values instead of z-scores?
# degrees of freedom
df <- nrow(dat1) - length(coef(m1))

coef(m1)[2] + qt(c(0.025, 0.975), df = df) * se
## [1] 991.8397 1285.2267
We also use the \(t\)-distribution for hypothesis testing, even more explicitly. In this framework, instead of focusing on uncertainty in general, as with a confidence interval, we specify a null hypothesis and an alternative hypothesis. In other words, we compare the estimated coefficient to a specific null value. For regression coefficients, we usually want to know whether the estimated coefficient is different from zero, hence zero is the value \(\beta^{*}_{edu}\) for comparison:
One-sided (one tailed) alternatives: The parameter is hypothesized to be less than or greater than the null value, < or >
Two-sided (two tailed) alternatives: The parameter is hypothesized to be not equal to the null value, \(\neq\)
A test statistic is constructed as:
\[ t^* = \frac{\hat\beta_{edu} - \beta_{edu}^*}{SE(\hat\beta_{edu})} = \frac{\hat\beta_{edu} - 0}{SE(\hat\beta_{edu})} =\frac{\hat\beta_{edu}}{SE(\hat\beta_{edu})} \]
We can now translate it to R:
# we use our standard error function from above to get standard errors
se_educ <- se_slope(m1, dat1$education)
t_educ_0 <- coef(m1)[2] / se_educ
t_educ_0
## education
## 15.40206
To test the hypothesis using the \(t\)-value, we need to compare the \(t\)-value to a critical value. This critical value depends on our (predefined) significance level. If our significance level is \(\alpha = 0.05\), i.e., a 95% confidence level, here’s what we do:
# test statistic, t-value, should be either larger than:
qt(0.975, df = 98)
## [1] 1.984467
# or smaller than:
qt(0.025, df = 98)
## [1] -1.984467
# in short, we can reject H_0 if:
abs(t_educ_0) > abs(qt(0.025, df = 98))
## education
## TRUE
The null hypothesis here is that there is no effect of education. What would you have to do if you wanted to test against a different \(H_0\)?
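As a hint, only the numerator of the test statistic changes (a sketch; beta_star is a hypothetical null value chosen for illustration):

# testing against H_0: beta = beta_star instead of beta = 0
beta_star <- 1000 # hypothetical null value
t_educ_star <- (coef(m1)[2] - beta_star) / se_educ
abs(t_educ_star) > abs(qt(0.025, df = 98))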
You may recall that yet another way to do statistical inference is by computing p-values. The idea behind p-values is: Given the observed \(t^*\), what is the smallest significance level at which the null hypothesis would be rejected?
\[ p = Pr({|t_{n-k-1}|}>{|t^*|}) = 2 \times Pr({t_{n-k-1}}>{|t^*|}) \]
And how do we find the probability of the observed or a more extreme outcome, given that the null hypothesis is true? We do this with the cumulative distribution function, whose R versions are prefixed with p, for probability. The absolute value \(|t^*|\) will be a positive number, and we want to find the cumulative probability of \(t_{n-k-1}\) being more extreme. Knowing that distribution functions in R by default give \(Pr(X \leq x)\), we need to be a little creative when calculating the probabilities manually:
# what does this show?
pt(15.40206, df = 98)
## [1] 1
# what will this value give us?
pt(15.40206, df = 98, lower.tail = F)
## [1] 3.207699e-28
# what about this?
pt(abs(-15.40206), df = 98)
## [1] 1
# and this?
pt(-abs(15.40206), df = 98)
## [1] 3.207699e-28
# why are we multiplying by 2?
2 * pt(-15.40206, df = 98)
## [1] 6.415397e-28
# our p-values
p_m2 <- 2 * pt(-abs(15.40206), df = 98)
p_m2
## [1] 6.415397e-28
The interpretation is as follows:
If p-value \(<\alpha\), reject \(H_0\) in favor of \(H_A\): The data provide convincing evidence for the alternative hypothesis.
If p-value \(>\alpha\), fail to reject \(H_0\) in favor of \(H_A\): The data do not provide convincing evidence for the alternative hypothesis.
As usual, R has built-in functions for these essential calculations: summary will give you these numbers from an lm object. However, in case we need to extract certain values from the coefficients, we can do this without copying the numbers from the summary output by hand. Avoiding such manual copying is more flexible and less error-prone (thus, better coding practice). For instance, the tidy function from the broom package will generate a nice data.frame-like object, from which we can easily extract the numbers. Note that these p- and t-values only test against zero.
# summary will give us all these numbers
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.6 -640.6 17.1 803.9 3250.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21286.61 640.69 33.23 <2e-16 ***
## education 1138.53 73.92 15.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1457 on 98 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7047
## F-statistic: 237.2 on 1 and 98 DF, p-value: < 2.2e-16
# and tidy() from broom package will give a nice data.frame with these numbers
# and tidy() from broom package will give a nice data.frame with these numbers
coefs <- tidy(m1, conf.int = T, conf.level = 0.95)
coefs # does tidy() use normal approximation or t-values?
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 21287. 641. 33.2 3.80e-55 20015. 22558.
## 2 education 1139. 73.9 15.4 6.42e-28 992. 1285.
And this way, we can also refer to the objects we already created and say, for instance, that we have found the t-value for the education coefficient to be 15.4020587. Yet this number is very long, and we may want to cut it to 3 digits by using the format function: 15.4.
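For instance, a sketch of that inline extraction, using the coefs object created above (the name t_inline is just a suggestion):

# extract the t-value for education from the tidy() output
t_inline <- coefs$statistic[coefs$term == "education"]
t_inline
format(t_inline, digits = 3) # "15.4"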
We have already explored the relationship between education and earnings, but to better understand the idea behind interactions, let’s take a step back and estimate a regression with only one predictor: the female dummy variable.
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \text{Female}_i \]
# to make plotting easier, let's transform USD to USD ('000)
dat1$income <- dat1$income / 1000

m2 <- lm(income ~ female, dat1)
summary(m2)
##
## Call:
## lm(formula = income ~ female, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5990 -1.8133 0.0167 1.6234 5.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.0359 0.3874 82.692 <2e-16 ***
## female -1.9657 0.5087 -3.864 2e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.511 on 98 degrees of freedom
## Multiple R-squared: 0.1322, Adjusted R-squared: 0.1234
## F-statistic: 14.93 on 1 and 98 DF, p-value: 0.0002003
How would we interpret the coefficients?
\[ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Expected income}\\\text{of males}} + \underbrace{\hat\beta_1}_{\text{Difference}\\\text{to females}} \text{Female}_i \]
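We can verify this interpretation directly from the data (a quick sketch): with a single dummy, the OLS coefficients reproduce the two group means.

# group means of income by gender
mean(dat1$income[dat1$female == 0]) # = intercept: expected income of males
mean(dat1$income[dat1$female == 1]) # = intercept + coefficient on female
coef(m2)[1] + coef(m2)[2]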
As usual, let’s see what is going on. We’ll add the effects of gender onto the scatterplot of income and education. Though education is not in the model, this will help us understand the interactions later.
# specify the two colors we'll use in plots
col_vec <- case_when(
  dat1$female == 1 ~ magma(2, alpha = 0.75, end = 0.7)[1],
  dat1$female == 0 ~ magma(2, alpha = 0.75, end = 0.7)[2]
)
# main plot
plot(
x = jitter(dat1$education),
y = dat1$income,
ylim = c(20, 40),
bty = "n",
las = 1,
main = "The Effect of Gender on Income (Intercept Shift)",
xlab = "Education (years)",
ylab = "Income (USD '000)",
pch = 19,
col = col_vec
)
# add the regression lines
# male (baseline category)
abline(
  h = (coef(m2)[1]),
  col = col_vec[1],
  lwd = 3
)
# female
abline(
  h = (coef(m2)[1] + coef(m2)[2]),
  col = col_vec[2],
  lwd = 3
)
# add a legend
legend("topleft",
legend = c("Male", "Female"),
pch = 19,
lwd = 2,
col = col_vec,
bty = "n"
)
ggplot(
  data = dat1,
  aes(x = education, y = income, color = gender)
) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0) # only shift along x axis
  ) +
  geom_hline(
    yintercept = m2$coefficients[1],
    color = magma(2, end = 0.7)[1]
  ) +
  geom_hline(
    yintercept = sum(m2$coefficients),
    color = magma(2, end = 0.7)[2]
  ) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income (USD '000)",
    color = "",
    title = "The Effect of Gender on Income",
    subtitle = "Intercept Shift"
  ) +
  scale_y_continuous(
    limits = c(20, 40)
  ) +
  scale_color_manual(values = magma(2, direction = -1, end = 0.7)) +
  theme(legend.position = "top")
Now let’s add the education variable back in, but let’s assume that the effect of education is the same for both subgroups - males and females:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \text{Female}_i + \hat\beta_2 \text{Education}_i \]
We can run the model now:
m3 <- lm(income ~ education + female, dat1)
summary(m3)
##
## Call:
## lm(formula = income ~ education + female, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6461 -0.9260 0.1294 0.8374 2.3533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.46033 0.51975 43.214 < 2e-16 ***
## education 1.12654 0.05757 19.567 < 2e-16 ***
## female -1.84918 0.22996 -8.041 2.19e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.135 on 97 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.821
## F-statistic: 228 on 2 and 97 DF, p-value: < 2.2e-16
And plot the regression line, now with the slope coefficient:
plot(
x = jitter(dat1$education),
y = dat1$income,
ylim = c(20, 40),
bty = "n",
las = 1,
main = "The Effect of Education and Gender on Income\n(Intercept Shift with Same Slope)",
xlab = "Education (years)",
ylab = "Income (USD '000)",
pch = 19,
col = col_vec
)
# And finally add the regression lines.
# for the baseline category, males
abline(
  a = (coef(m3)[1]),
  b = (coef(m3)[2]),
  col = col_vec[1],
  lwd = 3
)
# for the other category, females
abline(
  a = (coef(m3)[1] + coef(m3)[3]),
  b = (coef(m3)[2]),
  col = col_vec[2],
  lwd = 3
)
# add a legend
legend("topleft",
legend = c("Male", "Female"),
pch = 19,
lwd = 2,
col = col_vec,
bty = "n"
)
How would we interpret the coefficients?
\[ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Expected income}\\\text{of males with no}\\\text{education}} + \underbrace{\hat\beta_1}_{\text{Difference}\\\text{in income}\\\text{to females}} \text{Female}_i + \underbrace{\hat\beta_2}_{\text{Effect of each}\\\text{additional year}\\\text{of education}\\\text{for both males}\\\text{and females}} \text{Education}_i \]
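To see the intercept shift in the predictions, a small sketch with predict() may help (the value of 10 years is an arbitrary illustration):

# predictions for a male and a female with the same education
predict(m3, newdata = data.frame(education = 10, female = c(0, 1)))
# the two predictions differ exactly by the coefficient on female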
ggplot(
  data = dat1,
  aes(x = education, y = income, color = gender)
) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0) # only shift along x axis
  ) +
  geom_line(mapping = aes(y = predict(m3))) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income (USD '000)",
    color = "",
    title = "The Effect of Education and Gender on Income",
    subtitle = "Intercept Shift with Same Slope"
  ) +
  scale_y_continuous(
    limits = c(20, 40)
  ) +
  scale_color_manual(values = magma(2, direction = -1, end = 0.7)) +
  theme(legend.position = "top")
What if females are more efficient in applying education? In other words, a shift in the intercept may not be sufficient to explain the differences between men’s and women’s incomes. Rather, we may want to model a different effect of education for these groups.
For heterogeneous effects, we specify our interaction model (Brambor, Clark, and Golder 2006). Note that a variable that alters the effect of another variable is also called a moderator (people sometimes mix up mediators (see previous lab) and moderators, so be careful with using the term).
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \text{Female}_i + \hat\beta_2 \text{Education}_i + \hat\beta_3 \text{Female}_i \times \text{Education}_i \]
The syntax is very straightforward:
m4 <- lm(income ~ female + education + education * female, # * includes interaction
  data = dat1
)
But interpretation of the coefficients may get a little tricky:
\[ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Expected income}\\\text{of males with no}\\\text{education}} + \underbrace{\hat\beta_1}_{\text{Difference}\\\text{in income}\\\text{to females}} \text{Female}_i + \underbrace{\hat\beta_2}_{\text{Effect of each}\\\text{additional year}\\\text{of education}\\\text{for males}} \text{Education}_i + \underbrace{\hat\beta_3}_{\text{Difference in}\\\text{effect of education}\\\text{between males}\\ \text{and females}} \text{Female}_i \times \text{Education}_i \]
We can thus see that if we are only interested in predicting income for males, the model reduces to:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \times 0 + \hat\beta_2 \text{Education}_i + \hat\beta_3 0 \times \text{Education}_i \\ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Male intercept}} + \underbrace{\hat\beta_2}_{\text{Effect of}\\\text{education}\\ \text{for men}} \text{Education}_i \]
And for females, the coefficients will be:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \times 1 + \hat\beta_2 \text{Education}_i + \hat\beta_3 1 \times \text{Education}_i\\ \text{Income}_i = \underbrace{\hat\beta_0 + \hat\beta_1}_{\text{Female intercept}} + \underbrace{(\hat\beta_2 +\hat\beta_3)}_{\text{Effect of}\\\text{education}\\ \text{for women}} \text{Education}_i \]
Let’s now look at the exact coefficients we estimated:
coef(m4)
## (Intercept) female education female:education
## 24.7354537 -5.7756535 0.8588811 0.4643318
How would we interpret the coefficients?
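One way to answer this is to recover the group-specific intercepts and slopes from the coefficients (a sketch):

# males (female = 0): intercept and slope straight from the output
coef(m4)["(Intercept)"] # about 24.74
coef(m4)["education"] # about 0.86: effect of education for males

# females (female = 1): shift both by the female terms
coef(m4)["(Intercept)"] + coef(m4)["female"] # about 18.96
coef(m4)["education"] + coef(m4)["female:education"] # about 1.32: effect for females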
Note that statistical models by themselves do not distinguish between these two cases: gender moderating the effect of education, or education moderating the effect of gender.
Mathematically, you can interpret the coefficients either way. However, the choice of control variables you include in the model would likely differ, depending on whether you are primarily interested in the effect of education or in the effect of gender on income. It thus makes sense to interpret the interaction coefficients in line with your causal story and your selected control variables. Keele and Stevenson (2021) discuss this issue in more detail.
Again, we want to add our predictions as separate regression lines to a plot:
# Scatterplot with dots colored according to female/male
plot(
x = jitter(dat1$education),
y = dat1$income,
ylim = c(20, 40),
bty = "n",
las = 1,
main = "Heterogeneous Effect of Education on Income",
xlab = "Education (years)",
ylab = "Income (USD '000)",
pch = 19,
col = col_vec
)
# Add a legend
legend("topleft",
legend = c("Male", "Female"),
pch = 19,
lwd = 2,
col = col_vec,
bty = "n"
)
# We use the clip command to restrict the plotting area
# to x-values supported by the data
# First, for men (the baseline category):
clip(
  x1 = min(jitter(dat1$education)[dat1$female == 0]),
  x2 = max(jitter(dat1$education)[dat1$female == 0]),
  y1 = -100,
  y2 = 100
)
abline(
  a = (coef(m4)[1]), # beta_0
  b = (coef(m4)[3]), # beta_2
  lwd = 3,
  col = col_vec[1]
)

# Second, for women (this will be a bit tricky):
# Only plot the line for x-values supported by the data
clip(
  x1 = min(jitter(dat1$education)[dat1$female == 1]),
  x2 = max(jitter(dat1$education)[dat1$female == 1]),
  y1 = -100,
  y2 = 100
)
abline(
  a = (coef(m4)[1] + coef(m4)[2]), # beta_0 + beta_1
  b = (coef(m4)[3] + coef(m4)[4]), # beta_2 + beta_3
  lwd = 3,
  col = col_vec[2]
)
ggplot(
  data = dat1,
  aes(x = education, y = income, color = gender)
) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0) # only shift along x axis
  ) +
  geom_line(mapping = aes(y = predict(m4))) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income (USD '000)",
    color = "",
    title = "The Effect of Education and Gender on Income",
    subtitle = "Heterogeneous Effect of Education"
  ) +
  scale_y_continuous(
    limits = c(20, 40)
  ) +
  scale_color_manual(values = magma(2, direction = -1, end = 0.7)) +
  theme(legend.position = "top")
Now you have seen how we can get different slopes for subgroups. We can have more subgroups than just two - in this case, once we add the interaction, each of the categories will have a separate slope.
We can also have interactions between two continuous variables. In this case, people often use marginal effect plots or, when plotting the predictions, select a few meaningful values in one variable (e.g., from some smaller and larger percentiles) and plot the predictions for the range of another variable in a similar manner to discrete case as we had here. You will learn about this in the next lab.
We will do some of these exercises in class, and the rest you should try doing at home, to better prepare yourself for the homework.
Assume that you enjoyed 10 years of education. Using our regression model with the interaction effect (m4), you want to find out:
Do this calculation for both genders. Which gender benefits more from one additional year of education?
Answer:
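One possible way to approach this (a sketch comparing predicted incomes at 10 and 11 years of education for each gender; pred is just a name we choose):

# predicted income at 10 vs 11 years of education, for both genders
pred <- predict(m4, newdata = data.frame(
  education = c(10, 11, 10, 11),
  female = c(0, 0, 1, 1)
))
pred[2] - pred[1] # extra year for males: the male slope
pred[4] - pred[3] # extra year for females: slope + interaction
# since the interaction coefficient is positive (about 0.46),
# females benefit more from an additional year here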
We’ve been using the years of schooling, assuming that each additional year of education has the same effect on income. But is this really the case? Can we expect the change from 1 to 2 years of education to have the same effect on income as an additional year of education when you have already studied for 10 years?
While one way to address this question would be with a transformation of the education variable, here we will use a different coding of the education variable to explore this difference. In this dataset, people with up to 6 years of education could be considered as having primary education, those with 7 to 10 years secondary, and those with 11 to 14 years higher education. So we will first need to create such a variable, edu_category.
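A sketch of one way to create it, following the thresholds above (the label names are just a suggestion):

# recode years of education into categories
dat1$edu_category <- case_when(
  dat1$education <= 6 ~ "primary",
  dat1$education <= 10 ~ "secondary",
  dat1$education <= 14 ~ "higher"
)
table(dat1$edu_category)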
Now you need to fit a regression to see what the effect of (recoded) education on earnings is. But before you fit this model, you may want to do another recoding, this time to create dummies. You will be particularly interested in the effect of having higher education in comparison to having secondary education. Decide on:
Now run the model and interpret the coefficients. Remember that your recoding matters!
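For instance, a sketch with secondary education as the reference category, so that the coefficient on higher education gives the comparison of interest (m_cat is a name we choose):

# turn the categories into a factor with secondary as the baseline
dat1$edu_category <- factor(dat1$edu_category,
  levels = c("secondary", "primary", "higher")
)
m_cat <- lm(income ~ edu_category, data = dat1)
summary(m_cat)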
Consider the simple model from earlier, with education in years, without an interaction term again:
\[ \operatorname{income} = \hat{\beta}_{0} + \hat{\beta}_{1}(\operatorname{education}) + \hat{\beta}_{2}(\operatorname{female}) + \epsilon \]
tidy(m3)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 22.5 0.520 43.2 3.59e-65
## 2 education 1.13 0.0576 19.6 1.90e-35
## 3 female -1.85 0.230 -8.04 2.19e-12
Someone claims that one additional year of education leads to an average increase in income of 500, holding gender constant. You make the case that this is not true. Your alternative hypothesis is that the effect of education on income is different from 500. Calculate the t- and p-value. Can you support your claim?
Answer:
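A sketch of the calculation, reusing the test-statistic formula with \(\beta^* = 500\) (note our assumption that the claim is in USD while income here is measured in USD ’000, so 500 corresponds to 0.5 on our scale):

# t-value against H_0: beta_education = 500 USD = 0.5 (in USD '000)
t_500 <- (coef(m3)["education"] - 0.5) / tidy(m3)$std.error[2]
t_500 # about 10.9

# two-sided p-value with n - k - 1 = 97 degrees of freedom
2 * pt(-abs(t_500), df = 97) # tiny: we can reject H_0 at conventional levels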
In this lab, you have learned to implement hypothesis testing in R and worked with categorical variables. As you have seen, the uncertainty around the coefficient estimates is expressed in standard errors, confidence intervals, t-values, and p-values (Pr(>|t|)). Yet essentially, all these numbers express exactly the same thing. And even more importantly, statistical significance is only one part of the story you will want to tell with your models; substantive significance is crucial, too. Plus, remember that the use of p-values is now heavily debated…
In your homework, we will ask you to:
You are very welcome to learn more about the handy things you may see in this lab and also use them in your homework assignments and data essay:
Here you can find a DAG for the fake dataset we created for this lab. In this case, we explicitly worked with a case of interactions. As you can see, we add the interaction explicitly as a variable into the DAG, and this variable affects income. Yet we also assume that there is a direct effect of both education and female on income separately as well.
dagitty('dag {
bb="0,0,1,1"
"Female * Education" [pos="0.3,0.45"]
Education [exposure,pos="0.2,0.5"]
Female [pos="0.25,0.4"]
Income [outcome,pos="0.6,0.5"]
"Female * Education" -> Income
Education -> "Female * Education"
Education -> Income
Female -> "Female * Education"
Female -> Income
}
') %>%
plot()