Classical Hypothesis Testing
How to model and interpret interactions with dummies
Today we will be working with the example from the lecture, and we will explore the relationship between individuals’ education and earnings. The data we are using are simulated.
dat1 <- read.csv("raw-data/income_fake_data.csv")
glimpse(dat1)
## Rows: 100
## Columns: 3
## $ gender <chr> "male", "female", "female", "female", "female", "female", "f~
## $ education <int> 6, 10, 8, 4, 10, 8, 7, 9, 11, 8, 7, 9, 5, 11, 8, 9, 7, 6, 7,~
## $ income <dbl> 27964.16, 32660.45, 29997.91, 22471.18, 33713.58, 28609.15, ~
head(dat1)
## gender education income
## 1 male 6 27964.16
## 2 female 10 32660.45
## 3 female 8 29997.91
## 4 female 4 22471.18
## 5 female 10 33713.58
## 6 female 8 28609.15
The gender variable is not quite in the format we want to use. Let’s build the binary variable (aka dummy) female, which is 1 if gender is female and 0 if gender is male.
# we can do this with built-in ifelse statement, but we'll need to nest two
dat1$female <- ifelse(dat1$gender == "female", 1,
  ifelse(dat1$gender == "male", 0, NA)
)
# dplyr has a case_when function for such purposes
# you can use the function that is more intuitive
female <- case_when(
  dat1$gender == "female" ~ 1,
  dat1$gender == "male" ~ 0
  # all other values are NAs as not specified
)
# compare the result
all(dat1$female == female)
## [1] TRUE
Very good. Now we want to run a model to investigate the relationship between education and income. Our hypothesis is that education positively affects income. We will start with a very simple model:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_{1} \text{Education}_i \]
m1 <- lm(income ~ education, data = dat1)
Let’s have a look at the model summary:
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.6 -640.6 17.1 803.9 3250.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21286.61 640.69 33.23 <2e-16 ***
## education 1138.53 73.92 15.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1457 on 98 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7047
## F-statistic: 237.2 on 1 and 98 DF, p-value: < 2.2e-16
And since having a picture is almost always a good idea, let’s plot this:
plot(
x = jitter(dat1$education), # why use jitter here?
y = dat1$income,
ylim = c(20000, 40000),
bty = "n",
xlab = "Education (years)",
ylab = "Income (USD '000)", # original scale is in USD
main = "Relationship between Education and Income",
pch = 19,
axes = F, # we hide default axes
col = magma(1, alpha = 0.75) # another colorblind-friendly palette
)
# And add the regression line
abline(
a = (coef(m1)[1]),
b = (coef(m1)[2]),
col = magma(1, alpha = 0.75),
lwd = 2
)
axis(1) # we add back x-axis labels
axis(2, # add back y-axis
at = seq(20000, 40000, by = 5000),
labels = seq(20, 40, by = 5), # manually specify the labels
las = 1 # rotate the numbers
)
How would you interpret the coefficients?
The regression equation may help: \[ \begin{aligned} \operatorname{\widehat{income}} &= 21286.61 + 1138.53(\operatorname{education}) \end{aligned} \]
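To make the interpretation concrete, here is a quick check with the fitted coefficients (a minimal sketch; the value of 10 years is just an illustrative choice):

# predicted income for a person with 10 years of education:
# intercept + slope * 10
coef(m1)[1] + coef(m1)[2] * 10 # 21286.61 + 1138.53 * 10 = 32671.91

# one additional year of education shifts the prediction by the slope
coef(m1)[1] + coef(m1)[2] * 11 # exactly 1138.53 higher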
ggplot(data = dat1, aes(education, income)) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0), # only shift along x axis
    color = magma(1, alpha = 0.75)
  ) +
  geom_smooth(
    method = "lm", # add regression line
    se = FALSE,
    color = magma(1, alpha = 0.75)
  ) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income",
    title = "Relationship between Education and Income"
  ) +
  scale_y_continuous(
    labels = scales::dollar, # nice y labels
    limits = c(20000, 40000)
  )
Whichever approach to statistical inference we take, we will need the standard errors of our coefficients. We will work with the slope coefficient here. The formula below comes from the slides:
\[ SE(\hat\beta_1)= \sqrt{Var(\hat{\beta_1})}, \qquad Var(\hat{\beta_1}) = \frac{\overbrace{\hat\sigma^2}^{\text{estimated regression error}\\\text{a.k.a. residual variance}}}{\underbrace{\sum_{i=1}^{n}(x_i - \bar x)^2}_{\text{sum of squares of } x}} \]
And for the residual variance, we need the model residuals, i.e., estimated error terms, and the degrees of freedom:
\[ \hat\sigma^2 = \frac{\overbrace{\sum_{i=1}^{n}\hat{e}_i^2}^{\text{sum of squared residuals}}}{ \underbrace{ \underbrace{n}_{\text{number of}\\\text{observations}}-\underbrace{k}_{\text{number of}\\\text{covariates}} - 1 }_{\text{degrees of freedom}} } \]
Let’s implement these formulas in R:
# start with residual variance components:
# get n (sample size) and k (number of covariates + 1, i.e. slopes + intercept)
n <- nrow(dat1)
k <- length(coef(m1))

# calculate the estimate of sigma^2 hat
sigma_sq_hat <- sum(residuals(m1)^2) / (n - k) # k already includes (- 1) part!
sqrt(sigma_sq_hat)
## [1] 1457.272
# get the sum of squares of x
denominator <- sum((dat1$education - mean(dat1$education))^2)

# calculate the slope SE
se <- sqrt(sigma_sq_hat / denominator)
se
## [1] 73.92085
# compare to built-in function
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.6 -640.6 17.1 803.9 3250.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21286.61 640.69 33.23 <2e-16 ***
## education 1138.53 73.92 15.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1457 on 98 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7047
## F-statistic: 237.2 on 1 and 98 DF, p-value: < 2.2e-16
We can write our own function to calculate standard errors of regression coefficients.
se_slope <- function(lm, x) {
  n <- length(x)
  k <- length(coef(lm))
  sigma_est <- sqrt(sum(residuals(lm)^2) / (n - k)) # k includes covariates + 1
  x_se <- sqrt(sum((x - mean(x))^2))
  return(sigma_est / x_se)
}

se <- se_slope(m1, dat1$education)
se
## [1] 73.92085
Now that we have the standard error, we can move to calculating the confidence interval. As you learned in the lecture, we could assume a normal sampling distribution for the parameter, i.e., use the normal approximation. This works well when the number of observations is large. But if not, we may need to use a \(t\)-distribution. Let’s look at both of them.
This approach is exactly what we were doing before when calculating confidence intervals for the mean. Let’s get the 95% confidence interval for the slope in our model:
# point estimate +/- z-score * SE
coef(m1)[2] + qnorm(c(0.025, 0.975), 0, 1) * se
## [1] 993.651 1283.415
# FYI, we can also write it as follows:
qnorm(c(0.025, 0.975), coef(m1)[2], se)
## [1] 993.651 1283.415
# qnorm part gives us z-scores, which are quantiles from a standard normal
# we are basically reverting the standardization
qnorm(c(0.025, 0.975), 0, 1) * se
## [1] -144.8822 144.8822
qnorm(c(0.025, 0.975), 0, se)
## [1] -144.8822 144.8822
# and shift the distribution to one side
coef(m1)[2] + qnorm(c(0.025, 0.975), 0, 1)
## [1] 1136.573 1140.493
qnorm(c(0.025, 0.975), coef(m1)[2], 1)
## [1] 1136.573 1140.493
How would you interpret this?
Remember, XX% of random samples of size \(n\) are expected to produce XX% confidence intervals that contain the true population parameter.
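If you want to see this interpretation in action, a small simulation can help (a sketch with invented true parameters, purely for illustration):

# simulate 1000 samples from a known model and check how often
# the 95% CI for the slope covers the true value
set.seed(123)
true_b1 <- 1000 # assumed true slope
covered <- replicate(1000, {
  x <- rnorm(100, mean = 8, sd = 2)
  y <- 20000 + true_b1 * x + rnorm(100, sd = 1500)
  ci <- confint(lm(y ~ x))[2, ] # 95% CI for the slope
  ci[1] < true_b1 & true_b1 < ci[2]
})
mean(covered) # should be close to 0.95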
We cannot directly calculate the standard error of \(\hat \beta\), so we use the sample standard deviation \(\hat \sigma\) instead of the population standard deviation \(\sigma\). This strategy tends to work well when we have a lot of data and can estimate \(\sigma\) accurately using \(\hat \sigma\). However, the estimate is less precise in smaller samples, and this leads to problems when using the normal distribution. Instead, we can use the \(t\)-distribution.
The \(t\)-distribution is always centered at zero and has a single parameter: degrees of freedom. The degrees of freedom (\(df\)) describe the precise form of the bell-shaped \(t\)-distribution. When we have more observations, the degrees of freedom will be larger and the \(t\)-distribution will look more like the standard normal distribution \(\mathcal{N}(0,1)\); when the degrees of freedom is about 30 or more, the \(t\)-distribution is nearly indistinguishable from \(\mathcal{N}(0,1)\), as you can spot from the plot.
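In case the plot is not in front of you, here is a minimal sketch of how such a comparison could be drawn (the degrees of freedom shown are arbitrary illustrative choices):

# compare the standard normal with t-distributions of varying df
curve(dnorm(x),
  from = -4, to = 4, lwd = 2, bty = "n",
  xlab = "", ylab = "Density", las = 1
)
curve(dt(x, df = 2), add = TRUE, col = "grey30", lty = 2)
curve(dt(x, df = 10), add = TRUE, col = "grey50", lty = 3)
curve(dt(x, df = 30), add = TRUE, col = "grey70", lty = 4)
legend("topright",
  legend = c("N(0,1)", "t, df = 2", "t, df = 10", "t, df = 30"),
  lty = 1:4, col = c("black", "grey30", "grey50", "grey70"), bty = "n"
)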
What we see from this plot is that \(t\)-distribution with smaller degrees of freedom has thicker tails than a standard normal distribution. Thicker tails indicate that more data points lie further from the center and closer to more extreme values.
How should the fact that the \(t\)-distribution has thicker tails affect the width of the confidence interval when we use t-values instead of z-scores?
# degrees of freedom
df <- nrow(dat1) - length(coef(m1))

coef(m1)[2] + qt(c(0.025, 0.975), df = df) * se
## [1] 991.8397 1285.2267
We also use the \(t\)-distribution for hypothesis testing, even more explicitly. In this framework, instead of focusing on uncertainty in general, as with a confidence interval, we specify a null hypothesis and an alternative hypothesis. In other words, we compare the estimated coefficient to a specific null value. For regression coefficients, we usually want to know whether the estimated coefficient is different from zero, hence zero is the value \(\beta^{*}_{edu}\) for comparison:
One-sided (one tailed) alternatives: The parameter is hypothesized to be less than or greater than the null value, < or >
Two-sided (two tailed) alternatives: The parameter is hypothesized to be not equal to the null value, \(\neq\)
A test statistic is constructed as:
\[ t^* = \frac{\hat\beta_{edu} - \beta_{edu}^*}{SE(\hat\beta_{edu})} = \frac{\hat\beta_{edu} - 0}{SE(\hat\beta_{edu})} =\frac{\hat\beta_{edu}}{SE(\hat\beta_{edu})} \]
We can now translate it to R:
# we use our standard error function from above to get standard errors
se_educ <- se_slope(m1, dat1$education)
t_educ_0 <- coef(m1)[2] / se_educ
t_educ_0
## education
## 15.40206
To test the hypothesis using the \(t\)-value, we need to compare the \(t\)-value to a critical value. This critical value depends on our (predefined) significance level. If our significance level is \(\alpha = 0.05\), i.e., a 95% confidence level, here’s what we do:
# test statistic, t-value, should be either larger than:
qt(0.975, df = 98)
## [1] 1.984467
# or smaller than:
qt(0.025, df = 98)
## [1] -1.984467
# in short, we can reject H_0 if:
abs(t_educ_0) > abs(qt(0.025, df = 98))
## education
## TRUE
The null hypothesis here is that there is no effect of education. What would you have to do if you wanted to test against a different \(H_0\)?
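As a hint, only the numerator of the test statistic changes (a sketch; beta_star is a hypothetical null value chosen for illustration):

# testing against H_0: beta = beta_star instead of beta = 0
beta_star <- 1000 # hypothetical null value
t_educ_star <- (coef(m1)[2] - beta_star) / se_educ
abs(t_educ_star) > abs(qt(0.025, df = 98))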
You may recall that yet another way to do statistical inference is by computing p-values. The idea behind p-values is: Given the observed \(t^*\), what is the smallest significance level at which the null hypothesis would be rejected?
\[ p = Pr({|t_{n-k-1}|}>{|t^*|}) = 2 \times Pr({t_{n-k-1}}>{|t^*|}) \]
And how do we find the probability of the observed or a more extreme outcome, given that the null hypothesis is true? We do this with the cumulative distribution function, whose R versions are prefixed with p, for probability. The absolute value \(|t^*|\) will be a positive number, and we want to find the cumulative probability of \(t_{n-k-1}\) being more extreme. Knowing that distribution functions in R by default give \(Pr(X \leq x)\), we need to be a little creative when calculating the probabilities manually:
# what does this show?
pt(15.40206, df = 98)
## [1] 1
# what will this value give us?
pt(15.40206, df = 98, lower.tail = F)
## [1] 3.207699e-28
# what about this?
pt(abs(-15.40206), df = 98)
## [1] 1
# and this?
pt(-abs(15.40206), df = 98)
## [1] 3.207699e-28
# why are we multiplying by 2?
2 * pt(-15.40206, df = 98)
## [1] 6.415397e-28
# our p-values
p_m2 <- 2 * pt(-abs(15.40206), df = 98)
p_m2
## [1] 6.415397e-28
The interpretation is as follows:
If p-value \(<\alpha\), reject \(H_0\) in favor of \(H_A\): The data provide convincing evidence for the alternative hypothesis.
If p-value \(>\alpha\), fail to reject \(H_0\) in favor of \(H_A\): The data do not provide convincing evidence for the alternative hypothesis.
As usual, R has built-in functions for these essential calculations: summary will give you these numbers from an lm object. However, in case we need to extract certain values from the coefficients, we can do this without copying the numbers from the summary output by hand. Avoiding such manual copying is more flexible and less error-prone (thus, better coding practice). For instance, the tidy function from the broom package will generate a nice data.frame-like object, from which we can easily extract the numbers. Note that these p- and t-values only test against zero.
# summary will give us all these numbers
summary(m1)
##
## Call:
## lm(formula = income ~ education, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3369.6 -640.6 17.1 803.9 3250.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21286.61 640.69 33.23 <2e-16 ***
## education 1138.53 73.92 15.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1457 on 98 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7047
## F-statistic: 237.2 on 1 and 98 DF, p-value: < 2.2e-16
# and tidy() from broom package will give a nice data.frame with these numbers
# and tidy() from broom package will give a nice data.frame with these numbers
coefs <- tidy(m1, conf.int = T, conf.level = 0.95)
coefs # does tidy() use normal approximation or t-values?
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 21287. 641. 33.2 3.80e-55 20015. 22558.
## 2 education 1139. 73.9 15.4 6.42e-28 992. 1285.
And this way, we can also refer to the objects we already created and say, for instance, that we have found the t-value for the education coefficient to be 15.4020587. Yet this number is very long, and we may want to cut it to 3 digits by using the format function: 15.4.
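For instance, a sketch of that inline extraction, using the coefs object created above (the name t_inline is just a suggestion):

# extract the t-value for education from the tidy() output
t_inline <- coefs$statistic[coefs$term == "education"]
t_inline
format(t_inline, digits = 3) # "15.4"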
We have already explored the relationship between education and earnings, but to better understand the idea behind interactions, let’s take a step back and estimate a regression with only one predictor: the female dummy variable.
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \text{Female}_i \]
# to make plotting easier, let's transform USD to USD ('000)
dat1$income <- dat1$income / 1000

m2 <- lm(income ~ female, dat1)
summary(m2)
##
## Call:
## lm(formula = income ~ female, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5990 -1.8133 0.0167 1.6234 5.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.0359 0.3874 82.692 <2e-16 ***
## female -1.9657 0.5087 -3.864 2e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.511 on 98 degrees of freedom
## Multiple R-squared: 0.1322, Adjusted R-squared: 0.1234
## F-statistic: 14.93 on 1 and 98 DF, p-value: 0.0002003
How would we interpret the coefficients?
\[ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Expected income}\\\text{of males}} + \underbrace{\hat\beta_1}_{\text{Difference}\\\text{to females}} \text{Female}_i \]
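We can verify this interpretation directly from the data (a quick sketch): with a single dummy, the OLS coefficients reproduce the two group means.

# group means of income by gender
mean(dat1$income[dat1$female == 0]) # = intercept: expected income of males
mean(dat1$income[dat1$female == 1]) # = intercept + coefficient on female
coef(m2)[1] + coef(m2)[2]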
As usual, let’s see what is going on. We’ll add the effects of gender onto the scatterplot of income and education. Though education is not in the model, this will help us understand the interactions later.
# specify the two colors we'll use in plots
col_vec <- case_when(
  dat1$female == 1 ~ magma(2, alpha = 0.75, end = 0.7)[1],
  dat1$female == 0 ~ magma(2, alpha = 0.75, end = 0.7)[2]
)
# main plot
plot(
x = jitter(dat1$education),
y = dat1$income,
ylim = c(20, 40),
bty = "n",
las = 1,
main = "The Effect of Gender on Income (Intercept Shift)",
xlab = "Education (years)",
ylab = "Income (USD '000)",
pch = 19,
col = col_vec
)
# add the regression lines
# male (baseline category)
abline(
  h = (coef(m2)[1]),
  col = col_vec[1],
  lwd = 3
)
# female
abline(
  h = (coef(m2)[1] + coef(m2)[2]),
  col = col_vec[2],
  lwd = 3
)
# add a legend
legend("topleft",
legend = c("Male", "Female"),
pch = 19,
lwd = 2,
col = col_vec,
bty = "n"
)
ggplot(
  data = dat1,
  aes(x = education, y = income, color = gender)
) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0) # only shift along x axis
  ) +
  geom_hline(
    yintercept = m2$coefficients[1],
    color = magma(2, end = 0.7)[1]
  ) +
  geom_hline(
    yintercept = sum(m2$coefficients),
    color = magma(2, end = 0.7)[2]
  ) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income (USD '000)",
    color = "",
    title = "The Effect of Gender on Income",
    subtitle = "Intercept Shift"
  ) +
  scale_y_continuous(
    limits = c(20, 40)
  ) +
  scale_color_manual(values = magma(2, direction = -1, end = 0.7)) +
  theme(legend.position = "top")
Now let’s add the education variable back in, but let’s assume that the effect of education is the same for both subgroups - males and females:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \text{Female}_i + \hat\beta_2 \text{Education}_i \]
We can run the model now:
m3 <- lm(income ~ education + female, dat1)
summary(m3)
##
## Call:
## lm(formula = income ~ education + female, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6461 -0.9260 0.1294 0.8374 2.3533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.46033 0.51975 43.214 < 2e-16 ***
## education 1.12654 0.05757 19.567 < 2e-16 ***
## female -1.84918 0.22996 -8.041 2.19e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.135 on 97 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.821
## F-statistic: 228 on 2 and 97 DF, p-value: < 2.2e-16
And plot the regression line, now with the slope coefficient:
plot(
x = jitter(dat1$education),
y = dat1$income,
ylim = c(20, 40),
bty = "n",
las = 1,
main = "The Effect of Education and Gender on Income\n(Intercept Shift with Same Slope)",
xlab = "Education (years)",
ylab = "Income (USD '000)",
pch = 19,
col = col_vec
)
# And finally add the regression lines.
# for the baseline category, males
abline(
  a = (coef(m3)[1]),
  b = (coef(m3)[2]),
  col = col_vec[1],
  lwd = 3
)
# for the other category, females
abline(
  a = (coef(m3)[1] + coef(m3)[3]),
  b = (coef(m3)[2]),
  col = col_vec[2],
  lwd = 3
)
# add a legend
legend("topleft",
legend = c("Male", "Female"),
pch = 19,
lwd = 2,
col = col_vec,
bty = "n"
)
How would we interpret the coefficients?
\[ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Expected income}\\\text{of males with no}\\\text{education}} + \underbrace{\hat\beta_1}_{\text{Difference}\\\text{in income}\\\text{to females}} \text{Female}_i + \underbrace{\hat\beta_2}_{\text{Effect of each}\\\text{additional year}\\\text{of education}\\\text{for both males}\\\text{and females}} \text{Education}_i \]
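To see the intercept shift in the predictions, a small sketch with predict() may help (the value of 10 years is an arbitrary illustration):

# predictions for a male and a female with the same education
predict(m3, newdata = data.frame(education = 10, female = c(0, 1)))
# the two predictions differ exactly by the coefficient on female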
ggplot(
  data = dat1,
  aes(x = education, y = income, color = gender)
) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0) # only shift along x axis
  ) +
  geom_line(mapping = aes(y = predict(m3))) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income (USD '000)",
    color = "",
    title = "The Effect of Education and Gender on Income",
    subtitle = "Intercept Shift with Same Slope"
  ) +
  scale_y_continuous(
    limits = c(20, 40)
  ) +
  scale_color_manual(values = magma(2, direction = -1, end = 0.7)) +
  theme(legend.position = "top")
What if females are more efficient in applying education? In other words, a shift in the intercept may not be sufficient to explain the differences between men’s and women’s incomes. Rather, we may want to model a different effect of education for these groups.
For heterogeneous effects, we specify our interaction model (Brambor, Clark, and Golder 2006). Note that a variable that alters the effect of another variable is also called a moderator (people sometimes mix up mediators (see previous lab) and moderators, so be careful with using the term).
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \text{Female}_i + \hat\beta_2 \text{Education}_i + \hat\beta_3 \text{Female}_i \times \text{Education}_i \]
The syntax is very straightforward:
m4 <- lm(income ~ female + education + education * female, # * includes interaction
  data = dat1
)
But interpretation of the coefficients may get a little tricky:
\[ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Expected income}\\\text{of males with no}\\\text{education}} + \underbrace{\hat\beta_1}_{\text{Difference}\\\text{in income}\\\text{to females}} \text{Female}_i + \underbrace{\hat\beta_2}_{\text{Effect of each}\\\text{additional year}\\\text{of education}\\\text{for males}} \text{Education}_i + \underbrace{\hat\beta_3}_{\text{Difference in}\\\text{effect of education}\\\text{between males}\\ \text{and females}} \text{Female}_i \times \text{Education}_i \]
We can thus see that if we are only interested in predicting income for males, the model reduces to:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \times 0 + \hat\beta_2 \text{Education}_i + \hat\beta_3 0 \times \text{Education}_i \\ \text{Income}_i = \underbrace{\hat\beta_0}_{\text{Male intercept}} + \underbrace{\hat\beta_2}_{\text{Effect of}\\\text{education}\\ \text{for men}} \text{Education}_i \]
And for females, the coefficients will be:
\[ \text{Income}_i = \hat\beta_0 + \hat\beta_1 \times 1 + \hat\beta_2 \text{Education}_i + \hat\beta_3 1 \times \text{Education}_i\\ \text{Income}_i = \underbrace{\hat\beta_0 + \hat\beta_1}_{\text{Female intercept}} + \underbrace{(\hat\beta_2 +\hat\beta_3)}_{\text{Effect of}\\\text{education}\\ \text{for women}} \text{Education}_i \]
Let’s now look at the exact coefficients we estimated:
coef(m4)
## (Intercept) female education female:education
## 24.7354537 -5.7756535 0.8588811 0.4643318
How would we interpret the coefficients?
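One way to answer this is to recover the group-specific intercepts and slopes from the coefficients (a sketch):

# males (female = 0): intercept and slope straight from the output
coef(m4)["(Intercept)"] # about 24.74
coef(m4)["education"] # about 0.86: effect of education for males

# females (female = 1): shift both by the female terms
coef(m4)["(Intercept)"] + coef(m4)["female"] # about 18.96
coef(m4)["education"] + coef(m4)["female:education"] # about 1.32: effect for females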
Note that statistical models by themselves do not distinguish between these two cases: gender moderating the effect of education, or education moderating the effect of gender.
Mathematically, you can interpret the coefficients either way. However, the choice of control variables you include in the model would likely differ, depending on whether you are primarily interested in the effect of education or in the effect of gender on income. It thus makes sense to interpret the interaction coefficients in line with your causal story and your selected control variables. Keele and Stevenson (2021) discuss this issue in more detail.
Again, we want to add our predictions as separate regression lines to a plot:
# Scatterplot with dots colored according to female/male
plot(
x = jitter(dat1$education),
y = dat1$income,
ylim = c(20, 40),
bty = "n",
las = 1,
main = "Heterogeneous Effect of Education on Income",
xlab = "Education (years)",
ylab = "Income (USD '000)",
pch = 19,
col = col_vec
)
# Add a legend
legend("topleft",
legend = c("Male", "Female"),
pch = 19,
lwd = 2,
col = col_vec,
bty = "n"
)
# We use the clip command to restrict the plotting area
# to x-values supported by the data
# First, for men (the baseline category):
clip(
  x1 = min(jitter(dat1$education)[dat1$female == 0]),
  x2 = max(jitter(dat1$education)[dat1$female == 0]),
  y1 = -100,
  y2 = 100
)
abline(
  a = (coef(m4)[1]), # beta_0
  b = (coef(m4)[3]), # beta_2
  lwd = 3,
  col = col_vec[1]
)

# Second, for women (this will be a bit tricky):
# Only plot the line for x-values supported by the data
clip(
  x1 = min(jitter(dat1$education)[dat1$female == 1]),
  x2 = max(jitter(dat1$education)[dat1$female == 1]),
  y1 = -100,
  y2 = 100
)
abline(
  a = (coef(m4)[1] + coef(m4)[2]), # beta_0 + beta_1
  b = (coef(m4)[3] + coef(m4)[4]), # beta_2 + beta_3
  lwd = 3,
  col = col_vec[2]
)
ggplot(
  data = dat1,
  aes(x = education, y = income, color = gender)
) +
  geom_point(
    position = position_jitter(w = 0.15, h = 0) # only shift along x axis
  ) +
  geom_line(mapping = aes(y = predict(m4))) +
  theme_classic() +
  labs(
    x = "Education (years)",
    y = "Income (USD '000)",
    color = "",
    title = "The Effect of Education and Gender on Income",
    subtitle = "Heterogeneous Effect of Education"
  ) +
  scale_y_continuous(
    limits = c(20, 40)
  ) +
  scale_color_manual(values = magma(2, direction = -1, end = 0.7)) +
  theme(legend.position = "top")
Now you have seen how we can get different slopes for subgroups. We can have more subgroups than just two - in this case, once we add the interaction, each of the categories will have a separate slope.
We can also have interactions between two continuous variables. In this case, people often use marginal effect plots or, when plotting the predictions, select a few meaningful values in one variable (e.g., from some smaller and larger percentiles) and plot the predictions for the range of another variable in a similar manner to discrete case as we had here. You will learn about this in the next lab.
We will do some of these exercises in class, and the rest you should try doing at home, to better prepare yourself for the homework.
Assume that you enjoyed 10 years of education. Using our regression model with the interaction effect (m4), you want to find out:
Do this calculation for both genders. Which gender benefits more from one additional year of education?
Answer:
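One possible way to approach this (a sketch comparing predicted incomes at 10 and 11 years of education for each gender; pred is just a name we choose):

# predicted income at 10 vs 11 years of education, for both genders
pred <- predict(m4, newdata = data.frame(
  education = c(10, 11, 10, 11),
  female = c(0, 0, 1, 1)
))
pred[2] - pred[1] # extra year for males: the male slope
pred[4] - pred[3] # extra year for females: slope + interaction
# since the interaction coefficient is positive (about 0.46),
# females benefit more from an additional year here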
We’ve been using the years of schooling, assuming that each additional year of education has the same effect on income. But is this really the case? Can we expect the change from 1 to 2 years of education to have the same effect on income as an additional year of education when you have already studied for 10 years?
While one way to address this question would be with a transformation of the education variable, here we will use a different coding of the education variable to explore this difference. In this dataset, people with up to 6 years of education could be considered as having primary education, those with 7 to 10 years secondary, and those with 11 to 14 years higher education. So we will first need to create such a variable, edu_category.
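A sketch of one way to create it, following the thresholds above (the label names are just a suggestion):

# recode years of education into categories
dat1$edu_category <- case_when(
  dat1$education <= 6 ~ "primary",
  dat1$education <= 10 ~ "secondary",
  dat1$education <= 14 ~ "higher"
)
table(dat1$edu_category)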
Now you need to fit a regression to see what the effect of (recoded) education on earnings is. But before you fit this model, you may want to do another recoding, this time to create dummies. You will be particularly interested in the effect of having higher education in comparison to having secondary education. Decide on:
Now run the model and interpret the coefficients. Remember that your recoding matters!
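For instance, a sketch with secondary education as the reference category, so that the coefficient on higher education gives the comparison of interest (m_cat is a name we choose):

# turn the categories into a factor with secondary as the baseline
dat1$edu_category <- factor(dat1$edu_category,
  levels = c("secondary", "primary", "higher")
)
m_cat <- lm(income ~ edu_category, data = dat1)
summary(m_cat)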
Consider the simple model from earlier, with education in years, without an interaction term again:
\[ \operatorname{income} = \hat{\beta}_{0} + \hat{\beta}_{1}(\operatorname{education}) + \hat{\beta}_{2}(\operatorname{female}) + \epsilon \]
tidy(m3)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 22.5 0.520 43.2 3.59e-65
## 2 education 1.13 0.0576 19.6 1.90e-35
## 3 female -1.85 0.230 -8.04 2.19e-12
Someone claims that one additional year of education leads to an average increase in income of 500, holding gender constant. You make the case that this is not true. Your alternative hypothesis is that the effect of education on income is different from 500. Calculate the t- and p-value. Can you support your claim?
Answer:
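A sketch of the calculation, reusing the test-statistic formula with \(\beta^* = 500\) (note our assumption that the claim is in USD while income here is measured in USD ’000, so 500 corresponds to 0.5 on our scale):

# t-value against H_0: beta_education = 500 USD = 0.5 (in USD '000)
t_500 <- (coef(m3)["education"] - 0.5) / tidy(m3)$std.error[2]
t_500 # about 10.9

# two-sided p-value with n - k - 1 = 97 degrees of freedom
2 * pt(-abs(t_500), df = 97) # tiny: we can reject H_0 at conventional levels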
In this lab, you have learned to implement hypothesis testing in R and worked with categorical variables. As you have seen, the uncertainty around the coefficient estimates is expressed in standard errors, confidence intervals, t-values, and p-values (Pr(>|t|)). Yet essentially, all these numbers express exactly the same thing. And even more importantly, statistical significance is only one part of the story you will want to tell with your models; substantive significance is crucial, too. Plus, remember that the use of p-values is now heavily debated…
In your homework, we will ask you to:
You are very welcome to learn more about the handy things you may see in this lab and also use them in your homework assignments and data essay:
Here you can find a DAG for the fake dataset we created for this lab. In this case, we explicitly worked with a case of interactions. As you can see, we add the interaction explicitly as a variable into the DAG, and this variable affects income. Yet we also assume that there is a direct effect of both education and female on income separately as well.
dagitty('dag {
bb="0,0,1,1"
"Female * Education" [pos="0.3,0.45"]
Education [exposure,pos="0.2,0.5"]
Female [pos="0.25,0.4"]
Income [outcome,pos="0.6,0.5"]
"Female * Education" -> Income
Education -> "Female * Education"
Education -> Income
Female -> "Female * Education"
Female -> Income
}
') %>%
plot()