Chapter 5 – Econometrics

The final chapter of this script gives a brief introduction to the realm of econometrics. Put simply, econometrics can be described as the study of methods that allow statistical analysis of economic issues. Such analyses are of central importance to the profession: no matter how sophisticated or convincing a theoretical model may be, the strongest proof of a theory is always its consistency with observations made in the real world. Therefore, economists need a set of tools allowing them to contrast their theoretical considerations with the real world. Conversely, econometric methods also allow for more exploratory analyses of empirical relationships, the results of which may themselves give rise to hypotheses that form the basis of interesting theoretical studies. Finally, independent of any economic theory, econometric methods can be used to assess the effectiveness of different political measures (policy evaluation), typically by treating the policy intervention as an exogenous shock to the economy, the causal effect of which is relatively easily identified through the use of appropriate methods.

The mathematics of econometrics is somewhat distinct from what we have seen thus far: it relies on new concepts, specifically random variables and their moments (expected values, variance, etc.). At the same time, the concepts of differentiation, integration and optimization remain of central importance, which should facilitate our familiarization with econometrics. This chapter serves as a refresher of/introduction to the very basics of econometrics, without going too deep into any specific direction. As such, it covers an introductory discussion of correlation and causation, the concept of random variables, and the linear regression model and its estimation.

Correlation does not imply Causation

When engaging in an empirical analysis, our aim is usually to identify causal relationships, that is, we seek to arrive at conclusions such as “X causes Y to increase” or “Productivity growth has slowed down because the rate of technology diffusion has decreased”. In other sciences such as medicine, such statements are readily proved or disproved using controlled experiments. Put simply, to understand the effect of some treatment, you split a test population into two identical groups, apply the treatment to one group but not to the other, and observe how outcomes differ on average between the groups. While the outcomes will never be exactly identical across groups, we can use statistical methods that allow us to quantify how likely it is to obtain the observed difference under the hypothesis that there is no treatment effect. If this likelihood is sufficiently small (usually: the difference of outcomes is sufficiently large), then we may claim to have identified a treatment effect.

In economics, however, things are rarely as simple. First, selecting two groups of identical economies/economic regions is already a highly challenging task, as there are a multitude of economic characteristics along which regions may differ, such as demographics, political orientation, trade integration, and many more. Especially if not all relevant characteristics are observable, defining two comparable groups may be difficult. Much more importantly, however, economic experiments would usually be (i) very costly, and (ii) potentially unethical. Consider the example of slower technology diffusion and lower productivity growth from before: if you artificially banned some firms from adopting technologies that could improve their performance, these firms may fall behind in competition, and you could cause, among other things, worker layoffs, firm bankruptcies, weakened competition, and lower tax revenue for the government. Further, if your hypothesis is indeed true (and existing research seems to show that it is), you would weaken aggregate productivity growth, a central determinant of overall economic well-being.

Sometimes, researchers find themselves in the situation that nature takes over the role of the unethical interventionist, e.g. when natural disasters destroy a relevant share of economic infrastructure in some regions, but not in other neighboring and thus comparable regions. Here, quasi-experimental methods can be applied to obtain causal conclusions. The more common scenario, however, is that economists use methods of correlation analysis, and need to argue that the identified correlational relationships are indeed causal. Verbally, correlation refers to the co-movement of certain quantities. A perfect correlation is when quantities co-move in a one-to-one fashion without any deviation; a perfect positive correlation would, for instance, be observed for the sales of left-hand and right-hand gloves, if gloves are sold in pairs, and a perfect negative correlation can be assumed between the inventory stock and the quantity of sales, so long as there is no restocking. Less obvious real-world correlations are rarely perfect. Coming back to medical treatments, some patients may not respond as much to a treatment as others, but as long as the treatment is effective in some way, we should observe a positive correlation between the intensity of the treatment and the health outcome.

For economists who mainly analyze correlations, a critical conceptual point is the following: correlation does not imply causation, or put differently, correlation is not a sufficient condition for causation. Two simple examples help illustrate this point.

    1. Over the past decades, in the US, energy consumption has increased while the marriage rate (the share of unmarried persons of legal marrying age getting married in a given year) has decreased.
    2. Over a given year, in months with higher ice cream sales, the number of shark attacks is higher.

Example 1 describes a negative correlation between energy consumption and the marriage rate, while example 2 describes a positive correlation between ice cream sales and shark attacks over time. So, does energy consumption decrease people’s willingness to get married, and do ice cream sales cause shark attacks? Clearly not. While the two quantities of example 1 are likely completely independent, in example 2, the quantities have a common determinant, but are not directly related. This common determinant is seasonality, or more precisely the weather. On warm, sunny days, more people buy ice cream, and more people go swimming at the beach, where they can be attacked by sharks. Therefore, the weather directly determines ice cream sales, and indirectly determines shark attacks.

A further limitation of correlations is that they only describe co-movement; they do not allow us to infer the direction of a potential causality between quantities. For instance, if employment and GDP are positively correlated, even if we are told there is a causal relationship between the two quantities, we don’t know whether higher GDP causes employment to increase, or whether increasing employment causes higher GDP. Crucially, as correlations may be in part driven by forces entirely unrelated to the causal relationship, it could even be the case that higher GDP causes lower employment, despite the quantities being positively correlated. That is, a positive correlation does not preclude a negative causal relationship, and vice versa; see also the example below. Such issues, unfortunately, can ex ante also be a concern in the more sophisticated models that we use. If we want to study how x affects y, but y may also affect x, we call this issue reverse causality.

The following example of a fictional economy describes how it may be very harmful to base economic actions on insights from correlations. Consider an economy with a “saturated labor market”, where strong population growth leads to only modest increases in the employment level:

 

\begin{tabular}{cccccc}
period & total pop. & $L$ & $UR$ & $Y$ & $\Delta Y$ \\ \hline
0 & 11 & 10 & 1/11 & 100 & - \\
1 & 14 & 12 & 1/7 & 120 & 1/6 \\
2 & 20 & 15 & 1/4 & 150 & 1/5 \\
3 & 30 & 20 & 1/3 & 200 & 1/4
\end{tabular}

 

For simplicity, production Y is assumed to follow Y = 10 L, where L is employment, and growth is measured relative to the current level, \Delta Y_t = (Y_t - Y_{t-1})/Y_t. Suppose we were interested in understanding the effect of the unemployment rate UR on GDP growth \Delta Y. In this example, the higher the unemployment rate, the higher GDP growth. If a policy maker inferred causality from this correlation, they could conclude that forcing workers into unemployment could boost GDP growth, seemingly a desirable policy target. However, in the example, the unemployment rate and GDP growth are jointly determined by the total population (or population growth) and the induced effect on employment L, similar to the example of the seasonality origin of the correlation between ice cream sales and shark attacks. Indeed, if you increased the unemployment rate at a given population level, i.e. holding this quantity constant, you would reduce employment, and thus Y and \Delta Y, so that the unemployment rate has a negative effect on GDP growth, despite the positive correlation.
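To make the correlation explicit, here is a minimal Python sketch (assuming numpy; the numbers are simply taken from the table, for the three periods in which growth is defined):

```python
import numpy as np

ur = np.array([1/7, 1/4, 1/3])  # unemployment rate UR, periods 1-3
dy = np.array([1/6, 1/5, 1/4])  # GDP growth, periods 1-3

print(np.corrcoef(ur, dy)[0, 1])  # approx. 0.98: a strongly positive correlation
```

Despite this strongly positive correlation, the causal effect of the unemployment rate on growth at a fixed population level is negative, as the next paragraph argues.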

In this example, the true effect of the unemployment rate on the outcome of interest, GDP growth, could be uncovered by considering the relationship while holding constant a third variable that drives the correlation but is unrelated to the causal relationship. This is precisely the idea of the methods we use in practice: they allow us to “hold constant”, or control for, a set of observable quantities, and thereby enable us to zoom in on the residual relationship between the quantities of interest that is left unexplained by the controlled quantities. By doing so, we are able to isolate the “direct” correlation between the two variables that abstracts from indirect dependence through third variables, and if we account for all such third variables, this direct correlation can be interpreted causally. The practical issue then is to know which third variables to account for, what to do if some of them are unobserved and/or unmeasurable, and how to identify the direction of the causality, which is not addressed by the mere correlation. To this end, a common tool, especially in macroeconometrics where we work with data that entail a time dimension, is to study dynamic relationships, i.e. the correlation between x_{t-1} and y_t, exploiting that causation works in only one direction through time.

To wrap up, you should take away that correlations are not always useful in arguing for causal relationships, as they can occur entirely at random (e.g. energy consumption and marriage rate), or due to joint dependence on a third variable (e.g. ice cream sales and shark attacks). Further, if third variables play a role, the sign of a correlation is not always informative about the sign of the causal relationship, even if there is one (e.g. unemployment rate and GDP growth). Even if a correlation does describe a causal relationship (i.e., there are no third variables, or all relevant third variables are “held constant” by an appropriate model), it does not tell us the direction of the causal relationship, which needs to be argued for separately.

Probability Spaces and Random Variables

Before moving to our methods of empirical analysis, as per usual, a key step is to get all ingredients and our vocabulary straight. In the world of statistics, of which econometrics is a sub-discipline, the foundation of everything is probability theory, or stochastics. We will touch on this topic very briefly, focusing on the core concepts relevant to economists. The concepts we discuss here, especially random variables and their moments, are also key in many macroeconomic models that feature some uncertainty, and where agents have to form expectations about the future and adapt their behavior accordingly.

Usually, when engaging in empirical analysis, we try to learn something about the future, either directly or indirectly. A direct empirical question related to the future could be: “Will increasing university budgets augment GDP growth, and if so, by how much?” More indirectly, one may ask “Has increasing university budgets augmented GDP growth over a recent period, and if so, by how much?” The relevance of this question, however, also lies in the future: by looking at the (recent) past, we want to know whether one could expect similar relationships also in the future. Generally, we can learn about the future from the past if conditions remain similar enough. In this case, we exploit the circumstance that the events in the past used to be the future at some point, and outcomes were yet to materialize. In other words, as the outcomes were associated with some uncertainty, they were, to some degree, random (or: unpredictable), just as the future’s outcomes are somewhat random from today’s perspective. So, if conditions do not change much, we should be able to learn from the realizations of uncertainty in the past about what to expect for uncertainty in the future, and we can also quantify how accurate these expectations are, based on how accurate they would have been for the period on which we already have data.

This describes the intuition for why probability theory is relevant to economics, and especially econometrics: it allows us to model the future as an environment associated with uncertainty, and we are able to describe this uncertainty and form expectations based on the assumption that the past was characterized by the same model, with the only difference that the uncertainty has already materialized. Therefore, a crucial thing to keep in mind is that all predictions for the future obtained from econometric analysis, including the effectiveness of policy measures and the identification of structural economic relationships, inform about the future only to the degree that framework conditions do not change significantly relative to the period covered by the analyzed data. This is why well-published empirical papers tend to use data that is as recent as possible, and it is also why economic crises such as the global financial crisis around 2008-2012, but also more recently the Covid pandemic and Russia’s war on Ukraine and the ensuing economic repercussions, constitute considerable challenges for economists: these times of exceptional economic conditions are largely unprecedented in history, and analyzing data from the past will only partially be able to guide effective policy in response to these crises.

Probability Space and Probability Measure

Definition: Probability Space.
A probability space \mathcal P is a triple \mathcal P := (\Omega, \mathcal A , P), where

    • \Omega is the sample space, the set of all possible outcomes,
    • \mathcal A is the event space, the set of all possible events, and
    • P: \mathcal A \to [0, 1] is the probability measure that assigns events A\in \mathcal A a probability.

 

The definition above formalizes the probability space. It is best illustrated with a simple example: consider the case of rolling a regular dice. Before the dice is rolled, the set of possible outcomes is \Omega = \{ 1, 2, 3, 4, 5, 6 \}. One example of an event that could occur is that a specific number is rolled, e.g. A = \{ 2 \} or A = \{ 5 \}. However, it could also be that an even number is rolled, A = \{2, 4, 6\}, or that a 6 is not rolled, A = \{1, 2, 3, 4, 5\}. So, any combination of elements in \Omega may constitute an event, and more generally, also beyond this simple example, we tend to consider \mathcal A = \mathcal P(\Omega), i.e. the power set of the sample space, as the set of possible events.

It remains to define the probability measure P: \mathcal A \to [0, 1] that assigns events A \in \mathcal A a probability between 0 and 1. Clearly, the measure should assign A = \Omega, the event that any number between 1 and 6 is rolled, a probability of 1. Further, for any specific number, i.e. A = \{a \}, a\in \Omega, the probability should be 1/6. More generally, as the dice is fair (i.e., all numbers have the same probability), we can define P(A) = |A| / 6, where |\cdot| is the cardinality of the set: the probability of an event is the number of outcomes it allows, divided by the number of all possible outcomes.
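As a minimal sketch, this measure is straightforward to implement in Python; the function below simply mirrors P(A) = |A|/6 (its name and the set representation of events are our own choices):

```python
def P(A):
    """Probability measure of a fair dice: P(A) = |A| / 6 for events A."""
    omega = {1, 2, 3, 4, 5, 6}
    assert A <= omega, "an event must be a subset of the sample space"
    return len(A) / 6

print(P({5}))              # 0.1666...: a specific number is rolled
print(P({2, 4, 6}))        # 0.5: an even number is rolled
print(P({1, 2, 3, 4, 5}))  # 0.8333...: a 6 is not rolled
```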

Definition: Event Space, \sigma-Algebra.
Consider a sample space \Omega. Then, a set \mathcal A is called an event space, or \sigma-algebra if

    • \varnothing \in \mathcal A and \Omega \in \mathcal A,
    • for any A\in \mathcal A, for A^c := \Omega \backslash A, it holds that A^c \in \mathcal A,
    • for any A, B \in \mathcal A, it holds that A \cup B \in \mathcal A.

 

Definition: Probability Measure.
Consider a sample space \Omega and an event space \mathcal A. Then, a function P: \mathcal A \to [0, 1] is called a probability measure on (\Omega, \mathcal A) if

    • P(\Omega) = 1 and P(\varnothing) = 0, and
    • if A, B\in \mathcal A are disjoint, i.e. A \cap B = \varnothing, then P(A \cup B) = P(A) + P(B).

 

While \Omega is always the “base set” of possible outcomes and requires little further characterization to allow generalization to richer stochastic environments, the other two components of the probability space do require such characterization. These characterizations are given in the definitions above. Verbally, the event space requires that we consider that “anything can happen” (\Omega \in \mathcal A), that for any event A, the opposite A^c can happen, and that for any two events A and B that can happen, the event that either one of them happens, A \cup B, is also accounted for. The simplest way to achieve this is \mathcal A = \mathcal P(\Omega), which is the default case for everything to follow.

The definition of the probability measure may seem surprisingly general. Next to the intuitive characteristic of taking only values between 0 and 1, we only require that “anything will happen” with probability 1 (P(\Omega) = 1) while “nothing happens” with probability zero (P(\varnothing) = 0), and that further, the probability that either one of two disjoint events occurs is the sum of their probabilities. To see the latter, imagine the dice roll with A = \{1 , 2\} and B = \{3\}, for which P(A) = 2/6, P(B) = 1/6 and P(A\cup B) = P(\{1,2,3\}) = 3/6. The focus on disjoint sets is crucial: if some elements \omega\in\Omega would realize both A and B, for instance modifying B to B = \{2,3\}, then additivity need not hold: here, P(A\cup B) = P(\{1,2,3\}) = 3/6, while P(A) + P(B) = 4/6. Note that by the second property, it is usually sufficient to know the probability of events referring to individual elements of \Omega, as any event A can be decomposed into a disjoint union of such events: A = \bigcup_{i\in I} \{\omega_i\}, \omega_i \in \Omega. For notational simplicity, we define P(\omega) := P(\{\omega\}) for \omega \in \Omega.

The reason that the definition of a probability measure is so broad is that we want to allow ourselves to adapt the concept flexibly to concrete contexts. While, in accordance with the definition, we are allowed to call many functions \mathcal P(\{1,2,3,4,5,6\})\to [0,1] a probability measure that describes the experiment of a dice roll, as we have seen above, only one of them suits the specific context of rolling a fair dice. Similar to the metric concept, the definition gives us a range of functions to which we may attribute the label of a probability measure, and our exercise in concrete contexts is to find the most useful one. As an exercise, try to define the probability measure that characterizes rolling a rigged dice for which it is twice as likely to roll a 6 as it is to roll a 1, but all other numbers remain as likely as with a fair dice.


We know that P(\omega)=1/6 for any \omega \in \Omega \backslash\{1,6\}, so that P(1) + P(6) = 2/6. By P(6) = 2 P(1), we get 3P(1) = 2/6, or P(1) = 1/9, and P(6) = 2 P(1) = 2/9.
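If you want to double-check this solution numerically, a quick Monte Carlo sketch (assuming numpy) confirms that the probabilities sum to one and that a 6 is indeed twice as likely as a 1:

```python
import numpy as np

probs = np.array([1/9, 1/6, 1/6, 1/6, 1/6, 2/9])  # P(1), ..., P(6) of the rigged dice
print(probs.sum())  # 1.0 (up to floating point): a valid probability measure

rng = np.random.default_rng(0)
rolls = rng.choice(np.arange(1, 7), size=1_000_000, p=probs)
print((rolls == 6).mean() / (rolls == 1).mean())  # approx. 2
```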

 

Random Variable

In most practical scenarios, we care less about the realized events themselves, but rather about (numeric) outcomes that they imply. For instance, when you plan on going to the beach, you do not want to know all the meteorological conditions \omega \in \Omega in the space of possible conditions, but you care about the function x(\omega) = \mathds 1[\omega \text{ leads to rain}] and about the function c(\omega) that measures the temperature in degrees Celsius implied by the conditions \omega. In other words, you care about two random variables: variables on the real line that are determined by (possibly much richer) outcomes in the sample space \Omega of a probability space. As the beach example illustrates, the appeal of random variables is two-fold: on the one hand, they can be defined to address directly what we are interested in, and on the other, they allow us to reduce dimensionality and thereby possibly simplify the prediction problem. To see the latter point, if we build a model to forecast exact meteorological conditions, this model will need a lot of data and may even then perform poorly, and if different conditions imply a similar temperature, we may be better off forecasting the temperature directly.

Definition: Random Variable, Random Vector.
Consider a probability space (\Omega, \mathcal A , P). Then, a function X:\Omega \mapsto \mathbb R is called a random variable, and a vector \mathbf X = (X_1,X_2,\ldots, X_n)' where X_i, i\in\{1,\ldots,n\} are random variables, is called a random vector.

 

The above gives a broad characterization of the random variable concept. Those with a stronger background in statistics or econometrics may excuse that it is rather vague and simplistic; the concept is mathematically not straightforward to define, and for our purposes, it suffices to focus on the characteristics stated in the definition.

Continuing the example of the fair dice roll, let’s consider a game where I give you 100 Euros if you roll a 6, but you get nothing otherwise. Can you define the function \pi(\omega) that describes your profit from this game, assuming for now I let you play for free?


The function is \pi: \{1,\ldots, 6\}\mapsto\mathbb R, \omega \mapsto \pi(\omega) with \pi(\omega) = 0 for \omega \in \{1,2,3,4,5\} and \pi(\omega) = 100 for \omega = 6.

 

A key concept related to random variables X is their distribution, i.e. the description of probabilities characterizing the possible realizations of X. Our leading example of the dice roll, and the game where you can win 100 Euros, is an example of a discrete random variable, characterized by a discrete probability distribution. For such random variables, there is a finite (or countably infinite) number of possible realizations (in our case: two, either 0 Euros or 100 Euros), and every realization may occur with a strictly positive probability (5/6 vs. 1/6). The other case is a continuous probability distribution, with an infinite number of realizations none of which can be attributed positive probability: P(X = x) = 0 for any x\in\mathbb R. One such example is a different game of dice: say I generate a random real number X between 0 and 25, and then I let you roll a dice with result D, giving you a profit \pi = D \cdot X, so that the result of the dice roll is multiplied by this random number. The first stage of this game, i.e. randomly choosing a number from [0,25], is characterized by the probability distribution P(X\in [a,b]) = (b - a) / 25 for a, b \in [0,25] with a \leq b. More commonly, we fix the lower bound at the minimal possible value (sometimes -\infty) and characterize the distribution through the cumulative distribution function P(X\leq a) = F(a), in our case P(X\leq a) = a / 25 for a \in [0,25]. The probability density function is the derivative of this function: f(a) = F'(a), in our case f(a) = \mathds 1[a\in[0,25]]\cdot 1/25.

Definition: Expected Value.
Consider a probability space (\Omega, \mathcal A , P) and a random variable X:\Omega \mapsto \mathbb R on this space, with probability density function f_X. Then, the expected value \mathbb E[X] of X is defined as the integral

    \[ \mathbb E[X] = \int_{-\infty}^{\infty} x f_X(x) dx, \]

and for a transformation g(X) of X,

    \[ \mathbb E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx. \]

 

Definition: Variance, Standard Deviation.
Consider a probability space (\Omega, \mathcal A , P) and a random variable X:\Omega \mapsto \mathbb R on this space. Then, the variance of X is defined as Var[X] = \mathbb E[(X- \mathbb E[X])^2], and its square root is called the standard deviation of X, denoted sd(X) = \sqrt{Var[X]}. The variance can generally be computed as Var[X] = \mathbb E[X^2] - \mathbb E[X]^2.

 

Before moving to economic models, we need to introduce two more key concepts, given by the definitions above. The expected value gives us the best possible prediction of X given its distribution. Crucially, you always use the probabilities referring to X, even if you consider a transformation g(X) of X. For discrete distributions, when \{x_i\}_{i\in I} is the set of values that X can take with positive probability, the expression simplifies to \mathbb E[X] = \sum_{i\in I} P(X=x_i) \cdot x_i. This shows the intuition of the expected value: you expect the value x_i with probability p_i = P(X=x_i), and your expectation is then simply the probability-weighted sum of all these values. While the expected value gives you a “mean” of what to expect for X, it is silent on how good this prediction is (note that the expected value need not be a possible realization of X: in the case of the dice roll, you can convince yourself that E[D] = 3.5, which is not a number that the dice can show). This is the purpose of the variance, which measures the expected (squared) degree of deviation of X from the expected value. As the variance is based on a square term, we usually re-normalize it by taking the square root when talking about deviations from the expected value, which gives us the standard deviation.
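To convince yourself of E[D] = 3.5 for the fair dice roll D, simply apply the discrete formula:

    \[ \mathbb E[D] = \sum_{d=1}^6 P(D=d) \cdot d = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5. \]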

To close this section, you may test your understanding of the concepts thus far by answering two questions. First, if you were neutral to risk (i.e., you only care about your expected profit), would you play the game where you can win 100 Euros if you roll a 6 if you needed to buy in for 15 Euros (also think about whether you would spontaneously agree to this game without doing the math)? And second, which game do you prefer in terms of expectations, the one where you get 100 Euros for a 6, or the one where you get D \cdot X Euros, and X is drawn randomly from [0,25]? And which of these games has the more certain/less volatile outcome? For the latter game, you may use that the dice roll is independent of the randomly drawn number X, so that E[D^kX^k]=E[D^k]E[X^k] for any k\in \mathbb N. (If you get stuck, feel free to open and read through the solution.)


The expected value of game 1 is

    \[\mathbb E[\pi] = \sum_{i\in I} P(X=x_i) \cdot x_i = \frac{5}{6} \cdot 0 + \frac{1}{6} \cdot 100 = \frac{100}{6} \approx 16.67.\]

So you should want to pay no more than 16.67 Euros to play this game. If I offer this to you for a buy-in of 15 Euros, if you are risk-neutral, you should do it.

On a side note, in practice, most people would probably not take this game for 15 Euros, as people are risk-averse (they don’t assess games based on expected values alone) and they dislike losses more than they like gains. Behavioral economists argue that under these conditions, it is rational for people not to play such a game. Things would change only if we played this game repeatedly, say 1000 times, where you would expect to eventually be close to the expected profit on average. This is indeed a central result of probability theory, the law of large numbers, as introduced later: P(|\frac{1}{n}\sum_{i=1}^n X_i - E[X_i]|>\varepsilon) \to 0 as n\to\infty for any fixed \varepsilon > 0, if the realizations of X_i are independent and follow the same distribution (independent and identically distributed, i.i.d.). Verbally, by repeating a game/an experiment often enough, the average of realizations approaches the expected value with arbitrary precision.

For game 2, the expected value is

    \[\begin{split} \mathbb E[D\cdot X] = \mathbb E[D] \mathbb E[X] &= 3.5 \cdot \int_{-\infty}^{\infty} x \mathds 1 [x\in[0,25]] \cdot \frac{1}{25} dx \\ &= 3.5 \cdot \frac{1}{25} \int_{0}^{25} x dx \\ & =3.5 \cdot \frac{1}{25} [\frac{1}{2}x^2]_{x=0}^{x=25} \\ &= 3.5 \cdot \frac{625}{50} = 3.5 \cdot 12.5 = 43.75 \end{split}\]

Therefore, this game is much preferable to game 1 in terms of the expected profit. To address the volatility of outcomes, we need to compute the variance. Since Var[X] = E[X^2] - E[X]^2, it remains to compute the expectations of the squared terms. First, for game 1,

    \[\mathbb E[\pi^2] = \sum_{i\in I} P(X=x_i) \cdot x_i^2 = \frac{5}{6} \cdot 0 + \frac{1}{6} \cdot 100^2 = \frac{10000}{6},\]

so that

    \[ Var[\pi] = \mathbb E[\pi^2] - \mathbb E[\pi]^2 = \frac{10000}{6} - \left ( \frac{100}{6}\right )^2 = 10 000 \left (\frac{1}{6} - \frac{1}{36}\right ) = \frac{10 000 \cdot 5}{36} \approx 1388.89 \]

with sd(\pi) = \sqrt{ 1388.89} \approx 37.27.

Next,

    \[\mathbb E [D^2] = \sum_{d=1}^6 P(D=d) \cdot d^2 = 1/6 (1 + 4 + 9 + 16 + 25 + 36) = 91/6 \]

and in analogy to above,

    \[ \mathbb E[X^2] = \int_{-\infty}^{\infty} x^2 \mathds 1 [x\in[0,25]] \cdot \frac{1}{25} dx = \frac{1}{25} [\frac{1}{3}x^3]_{x=0}^{x=25} = \frac{25^3}{75} = \frac{625}{3} \]

so that

    \[ Var(DX) = \mathbb E [D^2]\mathbb E[X^2] - \mathbb E[D\cdot X]^2 = \frac{91}{6} \frac{625}{3} - \left (43.75\right )^2 \approx 1245.66 \]

with sd(DX) = \sqrt{1245.66} \approx 35.29.

Therefore, not only is game 2 better in terms of the expected value, but it is also slightly less uncertain/volatile. The more risk-averse an agent is, the more they value certainty, and the uncertainty comparison could therefore be an additional practical reason to opt for game 2 rather than game 1.
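As a numerical cross-check of these computations, the following simulation sketch (assuming numpy) exploits that, by the law of large numbers mentioned above, sample means and standard deviations of repeated plays approach their theoretical counterparts:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Game 1: 100 Euros if a fair dice shows a 6, 0 Euros otherwise
pi1 = np.where(rng.integers(1, 7, size=n) == 6, 100.0, 0.0)

# Game 2: the dice result times an independent uniform draw from [0, 25]
pi2 = rng.integers(1, 7, size=n) * rng.uniform(0, 25, size=n)

print(pi1.mean(), pi1.std())  # approx. 16.67 and 37.27
print(pi2.mean(), pi2.std())  # approx. 43.75 and 35.29
```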

 

Linear Regression Model

Having laid the foundation of stochastic concepts, this final section proceeds to discuss basic econometric models used in empirical practice. Recalling the introduction to this chapter, these models aim to take the first step from correlations to causality by allowing us to “control” for, or “hold constant”, third confounding variables. We will keep this intuition firmly in mind when discussing their definition and estimation.

Specification

One final concept that we need to understand the relationship between random variables, which we intend to model in the following, is that of a conditional expectation – the value we expect a random variable Y to take knowing the value of X. This quantity we call \mathbb E[Y|X], the expected value of Y given X. To understand this concept in simple terms, recall game 2 of the dice roll experiment: in the first step, we drew a random number from [0,25] with a uniform distribution, i.e. all parts of equal length within [0,25] were equally likely. For the purpose of notational consistency, let’s for now call this variable Y rather than X, as we did before. The expected value, as calculated earlier, was \mathbb E[Y] = 12.5 = 25/2. More generally, if we draw randomly with uniform distribution from [0,a], then the expected value is \mathbb E[Y] = a/2. But what if we draw the upper bound at random in a first step? Then things become a lot more complicated, and depend on the distribution from which we draw the upper bound a. Still, if X is the random variable that gives us the upper bound a, we can say that \mathbb E[Y|X=a] = a/2. Note that a\in \mathbb R is still a concrete realization of X. However, we do not need to condition on concrete realizations: no matter what X will be, given that we know X, we know that the expected value of Y will be X/2. Therefore, we write \mathbb E[Y|X] = X/2. This object, now, is a random variable that is a function of X: \mathbb E[Y|X] = g(X) with g(x) = x/2.

This simple example illustrates an important point: when we condition on concrete values of X, the conditional expectation \mathbb E[Y|X=x] will be a real number, as the “unconditional” expectation \mathbb E[Y] is, too (note that this statement refers to the type of object; in general, it will not be the case that \mathbb E[Y|X=x] = \mathbb E[Y]). More generally, the conditional expectation of Y given X is a random variable itself, and functionally depends on X: \mathbb E[Y|X] = g(X).
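A short simulation makes the two-stage experiment concrete (a sketch assuming numpy, and assuming purely for illustration that the upper bound X is itself drawn uniformly from [0,25]):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

x = rng.uniform(0, 25, size=n)  # stage 1: draw the upper bound X
y = rng.uniform(0, x)           # stage 2: draw Y uniformly from [0, X]

# Conditionally on X being close to a = 20, the mean of Y is close to a/2 = 10
a = 20.0
print(y[np.abs(x - a) < 0.5].mean())  # approx. 10

print(y.mean())  # approx. 6.25 = E[X]/2: averaging E[Y|X] = X/2 over X
```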

Now it is time to put these concepts to use. Recall that initially, we set out to come up with a model that allows to causally address how a variable X affects another variable Y. Since we want to speak to the future where the realizations of X and Y are not known, we treat these quantities as random variables, and exploit that we have data that records existing realizations of these random variables in the past. The simplest way to describe a relationship between Y and X using the conditional expectation is to define e := Y -\mathbb E[Y|X], where e is the deviation of Y from what we would expect for it given X, or the error made from predicting Y using X. Re-arranging terms, one obtains

    \[ Y = \mathbb E[Y|X] + e \]

We know that generally, \mathbb E[Y|X] = g(X) for an unknown function g(\cdot). While we do not know g(\cdot) itself, we do know Taylor’s theorem, which tells us that a polynomial approximation to g(\cdot) may be “good”. The simple, linear model assumes that a first-order approximation fares sufficiently well. It should be mentioned that there are a number of tests for linearity, but they are beyond the scope of this course. Under this assumption of linearity, g(x) = \beta_0 + \beta x, and the equation becomes

    \[ Y =  \beta_0 + \beta X + e \]

In this model, if we observe that X is higher by \Delta x > 0 units, we expect Y to be higher by \beta \cdot \Delta x units. Note that this does not assert anything about causality: a non-zero coefficient \beta simply tells us that X is useful in predicting Y, similar to the way ice cream sales are useful in predicting shark attacks. To move closer to a causal interpretation, let us consider a random vector Z of k third variables that may describe an indirect relationship between X and Y, such as seasonality in the ice cream/sharks example, or the population and employment level in the unemployment rate/GDP growth example. Similar to before, we can write

    \[ Y = \mathbb E [ Y | X, Z] + e \overset{\text{Ass. of linearity}}= \beta_0 +  \beta^x X + \beta^z_1 Z_1 + \ldots + \beta^z_k Z_k + e \]

Then, \beta^x measures the predictive potential of X for Y conditional on Z, i.e. holding constant Z. To see this more explicitly, consider the equation in concrete realizations,

    \[ Y = \mathbb E [ Y | X=x, Z=z] + e = \beta_0 + \beta^x x + \beta^z_1 z_1 + \ldots + \beta^z_k z_k + e \]

where the marginal effect of increasing the realization x of X on Y can be obtained as

    \[ \frac{\partial Y}{\partial x} = \frac{\partial (\beta_0 + \beta^x x + \beta^z_1 z_1 + \ldots + \beta^z_k z_k + e)}{\partial x} = \beta^x. \]

In this view, \beta^x is the effect that X has on Y, holding the control variables Z_1, \ldots, Z_k constant at the levels z_1, \ldots, z_k. By the linear nature of the model, the levels at which these variables are held constant do not affect the marginal effect of X on Y. Therefore, coming back to the example of unemployment rate and GDP growth, an appropriate empirical model is

    \[ \text{GDP growth} = \beta_0 + \beta^x UR + \beta^z_1 POP + \beta^z_2 L + e. \]

In this model, \beta^x describes the relationship between UR and GDP growth, holding constant the levels of population and employment. Still, the sign and magnitude of \beta^x only tell you how useful UR is in predicting GDP growth at constant population and employment, and this relationship could, in practice, also arise due to further omitted variables that are not yet included in the model, or because GDP growth affects the unemployment rate. This aspect is discussed further in the last section of this chapter. The mechanics of “holding constant” are illustrated in the sketch below.
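The following sketch (assuming numpy, with an invented data-generating process in which population growth g drives both the unemployment rate and GDP growth, standing in for the POP and L controls above) anticipates the least-squares estimation method of the next section: regressing growth on UR alone yields a positive coefficient, while controlling for g uncovers the negative effect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Invented numbers: population growth g raises both UR and GDP growth,
# while the true effect of UR on growth, holding g constant, is -0.3
g = rng.uniform(0.0, 0.5, size=n)
ur = 0.05 + 0.5 * g + 0.02 * rng.standard_normal(n)
growth = 0.8 * g - 0.3 * ur + 0.01 * rng.standard_normal(n)

ones = np.ones(n)

# Growth on a constant and UR only: the coefficient picks up the confounder g
b_simple = np.linalg.lstsq(np.column_stack([ones, ur]), growth, rcond=None)[0]
print(b_simple[1])  # positive, despite the negative causal effect

# Growth on a constant, UR and g: holding g constant uncovers the true effect
b_ctrl = np.linalg.lstsq(np.column_stack([ones, ur, g]), growth, rcond=None)[0]
print(b_ctrl[1])    # approx. -0.3
```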

Estimation

Assume that you have arrived at a model that you think is useful in describing the relationship of interest, of the form

(1)   \begin{equation*} Y = \beta_0 + \beta^x X + \beta^z_1 Z_1 + \ldots + \beta^z_k Z_k + e. \end{equation*}

This model will tell you something interesting only if you manage to obtain a plausible estimate of the coefficient of interest, \beta^x. For what follows, we return to the simpler case of only one random variable on the right-hand side of the model, but the methods easily extend (with some slight complication of the algebra) to the multivariate case. So, for now, we return to

(2)   \begin{equation*} Y = \beta_0 + \beta^x X + e. \end{equation*}

Assume that you observe n units of data on (X, Y), \{(x_i, y_i)\}_{i=1}^n. The intuition behind the approach to estimating the coefficients \beta_0 and \beta^x is relatively simple. We know that the expected value of Y given X, \mathbb E[Y|X], which we assume to be linear, i.e. \mathbb E[Y|X] = \beta_0 + \beta^x X , is the best prediction that can be obtained for Y given X. Therefore, when calculating the estimate \hat\beta = (\hat\beta_0, \hat\beta^x)'\in\mathbb R^2 of \beta = (\beta_0, \beta^x)'\in\mathbb R^2 from the data, we choose \hat\beta such that it gives the best linear prediction of the data points \{y_i\}_{i=1}^n given \{x_i\}_{i=1}^n. If we slightly rephrase the issue as optimizing the quality of prediction, things begin to sound very familiar. Indeed, the only part that we have yet to specify is what we mean by the “best” prediction. As per the economist’s preference for the Euclidean space (recall that the Euclidean norm captures the direct distance), it seems natural to minimize the distance between the vector y = (y_1, \ldots, y_n)'\in\mathbb R^n of data points for Y and the vector of predictions \mathbf X \beta, where

    \[ \mathbf X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \]

is the data matrix for the equation’s right-hand side (RHS) that stacks the observations of the RHS variables in its columns. To make sure that you understand the structure of this matrix, think about how it would look if you had the more general model of the first estimation equation with k=1, i.e. only one control variable Z_1 = Z.


The matrix with one control variable is

    \[ \mathbf X = \begin{pmatrix} 1 & x_1 & z_1 \\ 1 & x_2 & z_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & z_n\end{pmatrix} \]

where \{z_i\}_{i\in I} are the observations for the random variable Z. The vector \beta would in this case be \beta = (\beta_0,\beta^x,\beta^z)'.

 

With this in mind, we can express the optimization problem as:

    \[ \min_{b\in\mathbb R^2} \| y - \mathbf X b \|_2 \hspace{0.5cm}\text{or}\hspace{0.5cm}  \min_{b\in\mathbb R^2}\sqrt{\sum_{i=1}^n (y_i - b_0 - b^x x_i)^2 }. \]

With the tools the previous chapters have equipped us with, solving this should be a relatively simple task: there are no constraints, and the objective function is a polynomial in the entries of b. One source of complication could be the square root, but we can easily circumvent it by solving the equivalent problem

(3)   \begin{equation*} \min_{b\in\mathbb R^2} \sum_{i=1}^n (y_i - b_0 - b^x x_i)^2 \end{equation*}

This formulation gives the most common estimators of the linear model their name: least-squares estimators. To see this name more intuitively, note that
y_i - b_0 - b^x x_i =: \hat e_i(b) is the error made in predicting y_i from x_i using the coefficient vector b = (b_0, b^x)', which we also call the residual. Therefore, the solution \hat\beta = \arg\min_{b\in\mathbb R^2} \sum_{i=1}^n (y_i - b_0 - b^x x_i)^2, provided that it exists, minimizes the residual sum of squares RSS(b) = \sum_{i=1}^n \hat e_i(b)^2, i.e. the sum of squared prediction errors.

We are now set up to derive the (hopefully unique) solution for the estimator \hat\beta = \arg\min_{b\in\mathbb R^2} \sum_{i=1}^n \hat e_i(b)^2. To test your understanding of Chapter 4, you can try to derive it yourself before reading on. You may use without proof two helpful properties of differential calculus with matrices that are investigated on the problem sets of the course:

  1. For A\in\mathbb R^{n\times m} and x\in\mathbb R^m, \frac{d}{dx} Ax = A, and
  2. For A\in\mathbb R^{m\times m} and x\in\mathbb R^m, \frac{d}{dx} x'Ax = x' (A + A').

Further, it may be helpful to recall the discussion of positive definiteness and the rank; you may assume that \mathbf X has full column rank. To help you get started, things remain more tractable if you stick with the vector notation, and start from the problem

    \[ \min_{b\in\mathbb R^2} (y - \mathbf X b) ' (y - \mathbf X b). \]

If you have tried to solve it, or you want to read on directly, let’s get to the solution of this problem. The first step is to multiply out the objective. Doing so, the problem becomes

    \[ \min_{b\in\mathbb R^2} y'y - y'\mathbf X b - b ' \mathbf X ' y + b ' \mathbf X ' \mathbf X b. \]

A first useful observation is that  b ' \mathbf X ' y is a scalar, so that  b ' \mathbf X ' y = ( b ' \mathbf X ' y)' = y' \mathbf X b, and the problem simplifies to

    \[ \min_{b\in\mathbb R^2} y'y - 2 y'\mathbf X b + b ' \mathbf X ' \mathbf X b. \]

In obtaining the first order condition, the two properties of differential calculus with matrices mentioned above are very useful. Applying them, we know that

    \[ \frac{d}{db} y'\mathbf X b =  y'\mathbf X \hspace{0.5cm}\text{and}\hspace{0.5cm}  \frac{d}{db} b ' \mathbf X ' \mathbf X b =  b' (\mathbf X ' \mathbf X + (\mathbf X ' \mathbf X)')  = 2 b' \mathbf X ' \mathbf X . \]

Putting these together, the first order condition is

    \[ \mathbf 0' = \frac{d}{db} \left ( y'y - 2 y'\mathbf X b + b ' \mathbf X ' \mathbf X b \right ) = - 2 y'\mathbf X + 2 b' \mathbf X'\mathbf X  \]

so that the solution \hat \beta must satisfy

    \[ y'\mathbf X = \hat \beta  ' \mathbf X'\mathbf X \hspace{0.5cm}\text{or equivalently}\hspace{0.5cm} \mathbf X'\mathbf X \hat \beta = \mathbf X ' y. \]

While it looks like we are almost there, there are two things that we need to address: (i) can we invert \mathbf X'\mathbf X to obtain a unique solution? (ii) If we can, does this solution constitute a (strict) local minimum? (And is this local minimum global?)

We know that matrices of the form A'A for A\in\mathbb R^{n\times m} are always positive semi-definite: for v\in\mathbb R^m, v'A' A v = (Av)' Av = w'w = \sum_{i=1}^n w_i^2 \geq 0 for w = Av\in\mathbb R^n. Now, recalling Chapter 2, this matrix is indeed positive definite if for any v\neq\mathbf 0, Av\neq \mathbf 0, which is the case if A has full column rank, i.e. no column of A can be written as a linear combination of the remaining columns. In our case, A=\mathbf X has just two columns, and one of them is the vector \mathbf 1 = (1, 1,\ldots,1)' containing only ones. Therefore, the matrix does not have full column rank if there exists \lambda\in\mathbb R such that x_i = \lambda \cdot 1 = \lambda for all i\in \{1,\ldots,n\}, in which case the second column of \mathbf X, x = (x_1,x_2,\ldots,x_n)', can be expressed as x = \lambda \cdot \mathbf 1.

In less abstract terms, this occurs if the random variable X exhibits no variation across the observations in our data, i.e. for every observed unit i, we observe the same realization x_i = \lambda. In this case, we would say that X is collinear with the constant. However, in practice, the variables we usually consider do vary across observations, and the issue does not occur. Therefore, a key assumption in least-squares estimation is that \mathbf X has full column rank, or that the RHS random variables (including the “constant” that is always equal to one) are not linearly dependent. This assumption is usually met unless one adds variables that essentially give the same information. In practice, statistical software packages will usually give you an error message if this condition is violated in the data you analyze.

While collinearity with the constant is a relatively trivial issue that is easily avoided, in models with more RHS variables, the issue of collinearity can be less direct to spot. If you have vectors x and z of observations for random variables X and Z, respectively, then one form of collinearity occurs if \mathbf 1 = \lambda x + \mu z for some \lambda,\mu \in \mathbb R. A simple example is X=\mathds 1[i \text{ is at least 50 years old}] and Z = \mathds 1[i \text{ is younger than 50 years old}] when units i refer to persons, another one is X = age and Z = years until 100th birthday, where Z = 100 - age and 1 = 1/100 X + 1/100 Z. Intuitively, however, the issue of collinearity is still similar to the univariate case: you are adding at least one variable that does not give you new information beyond what you already have, and if you have two (sets of) variables that measure essentially the same thing, a statistical model will be unable to choose either one over the other in predicting the outcome variable Y. On this, recall also our discussions of non-full ranks and their interpretation as “redundant equations” in the analysis of linear equation systems in Chapter 2: even in the narrow mathematical sense, the full rank condition can be interpreted as the model not containing any redundant variable.

Returning to the optimization problem, if the observations for X, \{x_i\}_{i\in\{1,\ldots,n\}} vary across individuals, then the unique solution is

    \[ \hat \beta = (\mathbf X'\mathbf X )^{-1} \mathbf X ' y \]

and it constitutes a strict local minimum by positive definiteness of \mathbf X'\mathbf X, since the second derivative of the objective function is 2\mathbf X'\mathbf X. This local solution is also global, as for any b\in\mathbb R^2 with b\neq \mathbf 0, \lim_{\lambda\to\infty} (y'y - 2 y'\mathbf X (\lambda b) + (\lambda b) ' \mathbf X ' \mathbf X (\lambda b)) = \lim_{\lambda\to\infty} \lambda^2 b' \mathbf X ' \mathbf X b = \infty, so that the objective diverges to +\infty in any asymptotic direction.
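The closed form translates directly into code. The following sketch (assuming numpy, with invented true parameters) computes \hat\beta on simulated data; note that, for numerical stability, one solves the linear system \mathbf X'\mathbf X b = \mathbf X' y rather than explicitly inverting \mathbf X'\mathbf X:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Simulated data from Y = 1 + 2 X + e (parameters invented for illustration)
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])          # data matrix with a constant column

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves X'X b = X'y
print(beta_hat)                               # approx. [1.0, 2.0]

# Cross-check against numpy's built-in least-squares solver
print(np.linalg.lstsq(X, y, rcond=None)[0])
```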

This solution \hat \beta = (\mathbf X'\mathbf X )^{-1} \mathbf X ' y, called the ordinary least squares (OLS) estimator, can be shown to have very strong theoretical properties under standard assumptions: it is unbiased, i.e. \mathbb E[\hat \beta] = \beta, meaning that it is expected to estimate the true parameter \beta, and within the class of linear unbiased estimators, it has the lowest variance (the Gauss-Markov theorem), i.e. it is expected to provide the estimate closest to \beta for any given sample. This estimator always estimates the coefficients of the linear conditional expectation function \mathbb E[Y|X] = \beta_0 + \beta^x X, and as the excursion below discusses briefly, even if the conditional expectation function is not linear, the coefficients estimated by the OLS estimator recover the best linear prediction of Y given X. This is a crucial point: when you estimate OLS, you always estimate the best linear prediction, which corresponds to the conditional mean if the latter can be described by a linear function. Any discussion of “misspecification” or “estimation bias” does (or at least: should!) not argue that we are unable to recover the linear prediction coefficients, but that we are unable to interpret these coefficients in a causal/desired way. The last section addresses this issue in more detail.

Excursion. Non-linearity and best linear prediction.

A crucial question concerns the quality of the OLS estimator when the true conditional expectation function is not linear, i.e. there exist no parameters \beta_0, \beta^x\in\mathbb R such that \mathbb E[Y|X] = \beta_0 + \beta^x X. A first case that can generally be easily handled by the concepts introduced thus far is the one of linearity in parameters. One such example is \mathbb E[Y|X] = \beta_0 + \beta_1^x X + \beta_2^x X^2. Here, even though the conditional mean is not linear in X, the inclusion of Z=X^2 achieves a setup that is consistent with the linear model we introduced. Note that generally, X^2 \neq \lambda X for any given \lambda \in \mathbb R, so that inclusion of the squared term does not create a collinearity issue. However, in this case the marginal effect of X on Y depends on the level of X: \frac{\partial Y}{\partial X} = \beta_1^x + 2\beta_2^x X. Another example is \mathbb E[Y|X] = \beta_0 + \beta_1^x \log(X), the case of log-linearity. This setup, too, is consistent with our model. If Y = \log(\tilde Y), then we have a log-log model where \beta_1^x can be interpreted as an elasticity capturing a relationship in percentages: if X increases by 1%, then \tilde Y increases by \beta_1^x %.

However, linearity in parameters may often be a restrictive assumption. Without taking a stance on the functional form of \mathbb E[Y|X], we can still interpret the linear model

    \[ Y = \beta_0 + \beta^x X + e =  \tilde X' \beta + e;\hspace{0.5cm} \beta = \begin{pmatrix} \beta_0 \\ \beta^x \end{pmatrix}, \quad \tilde X = \begin{pmatrix} 1 \\ X \end{pmatrix}. \]

and its multivariate extensions in a useful way. To see this, note that the OLS estimator recovers the best linear prediction of Y given X in the data, i.e. given the sample \{(x_i,y_i)\}_{i\in\{1,\ldots,n\}} of observations of pairs of X and Y. It relies only on minimizing the residual sum of squares, and not on linearity of the conditional mean for the underlying random variables. Thus, the question at hand is how to interpret this linear prediction if not through the conditional mean. For this, consider the linear forecast for Y using an arbitrary vector b\in\mathbb R^2. Then, the forecast \tilde X' b for Y is good if

  1. \mathbb E[Y - \tilde X' b ] = 0, and
  2. \mathbb E[(Y - \tilde X' b)^2 ] = Var[Y - \tilde X' b] is as small as possible.

Equality in point 2 follows from Var[Y - \tilde X' b] = \mathbb E[(Y - \tilde X' b)^2 ] -  \mathbb E[Y - \tilde X' b]^2 and \mathbb E[Y - \tilde X' b]=0. Therefore, such a forecast would be right “on average”, and have the minimum amount of volatility. Let’s see what the optimum looks like:

    \[ \mathbb E[(Y - \tilde X' b)^2 ] = \mathbb E[Y^2 - 2Y\tilde X' b + (\tilde X' b)^2] = \mathbb E[Y^2] - 2\mathbb E[Y\tilde X' b] + \mathbb E[(\tilde X' b)^2]. \]

where the second equality uses linearity of the mean. By this property of linearity, we can also isolate non-random vectors at the beginning and end of expressions from the expected value, specifically the vector b. Writing (\tilde X' b)^2 = b' \tilde X \tilde X' b, the expression becomes

    \[ \mathbb E[(Y - \tilde X'b)^2 ] = \mathbb E[Y^2] - 2\mathbb E[Y \tilde X']b + b'\mathbb E[\tilde X \tilde X'] b. \]

Optimizing this expression with respect to b is analogous to the OLS problem, and provided that \mathbb E[\tilde X \tilde X'] is invertible, one obtains the unique solution

    \[ \beta = \mathbb E[\tilde X \tilde X']^{-1} \mathbb E[\tilde XY]. \]

This looks very similar to the OLS solution. This is no accident: while OLS recovers the best linear prediction in the data, the estimated parameter is the best linear prediction in expectation, or in the population. To reiterate, regardless of the functional form of the conditional expectation \mathbb E[Y|X], the vector \beta of parameters estimated by the OLS estimator \hat \beta gives \tilde X' \beta as the best linear prediction of Y given X, called the linear projection of Y onto X. If the conditional mean is indeed linear, it coincides with the linear projection of Y onto X. Still, the linear projection is the more general concept, as it always exists, regardless of the functional form of the conditional mean.

In terms of interpretation, without relying on the conditional mean and its intuition of marginal effects, the linear projection coefficients \beta tell you how much, on average, Y can be expected to be higher if X is observed to be higher by one unit. If you can argue that the model appropriately absorbs third variables, you can therefore claim that the linear projection model recovers average effects.
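To see the linear projection at work, consider the following sketch (assuming numpy; the quadratic conditional mean is invented for illustration). For X uniform on [-1,1] and \mathbb E[Y|X] = X^2, we have \mathbb E[X] = \mathbb E[X^3] = 0 and \mathbb E[X^2] = 1/3, so the population linear projection coefficients are \beta = (1/3, 0)', and OLS recovers exactly these:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

x = rng.uniform(-1, 1, size=n)
y = x**2 + 0.1 * rng.standard_normal(n)  # E[Y|X] = X^2: a non-linear conditional mean

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approx. [1/3, 0]: the best linear prediction, not the conditional mean
```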

Inference

One thing we have only insufficiently addressed thus far is the quality of estimation. While the OLS estimator is unbiased, and it is the “most efficient” estimator (lowest variance among the set of linear unbiased estimators), in practice, its variance (i.e., the expected squared error) can still be very high – just not as high as the one you would obtain with another such estimator. Therefore, if you have a simple model Y = \beta_0 + \beta^x X + e and I tell you that OLS has been applied to obtain \hat\beta^x = 0.05 based on 10 data points, would you be confident in ruling out that the true value of \beta^x is 0? And would you rule out that it is 1?

To answer such questions, we need to address the uncertainty associated with our estimation. Only once we can quantify this uncertainty will we be able to perform meaningful inference on the estimates we obtain. To do so, we rely on the key concepts of convergence in probability and convergence in distribution.

Definition: Convergence in Probability.
Let \{T_n\}_{n\in\mathbb N} be a sequence of random variables, and let T be a random variable. Then, we say that T_n converges to T in probability if for any \varepsilon > 0,

    \[ \lim_{n\to\infty} P(|T_n - T|> \varepsilon) = 0 \hspace{0.5cm}\text{or equivalently}\hspace{0.5cm} \lim_{n\to\infty} P(|T_n - T|\leq \varepsilon) = 1. \]

We write T_n \overset p \to T.

 

Definition: Convergence in Distribution.
Let \{T_n\}_{n\in\mathbb N} be a sequence of random variables, and let T be a random variable. Then, we say that T_n converges to T in distribution if for any x\in\mathbb R at which F_T is continuous,

    \[ \lim_{n\to\infty} F_{T_n}(x) =\lim_{n\to\infty} P(T_n\leq x) = P(T \leq x) = F_T(x). \]

We write T_n \overset d \to T. If the distribution of T is known, e.g. T is normally distributed with mean \mu and variance \sigma^2, i.e. T\sim N(\mu,\sigma^2), we write T_n \overset d \to  N(\mu,\sigma^2).

 

To put it more simply, T_n converges to T in probability if for large n, T_n and T are as good as equal (deviation smaller than an arbitrarily small \varepsilon) with probability approaching 1. The fairly abstract definition of convergence in distribution has a relatively simple interpretation: the sequence of distribution functions F_{T_n} associated with the sequence T_n of random variables approaches the distribution function F_T of T pointwise, i.e. at any given location x (this criterion is indeed weaker than the one of uniform convergence, where we would require convergence across locations x). In simpler words, if n is large enough, then the distribution of T_n should be as good as identical to the one of T. This is good news especially if the distribution of T_n is difficult to characterize for finite n, but the one of T is easily obtained. Note that in distinction to convergence in probability, just as you can draw two highly distinct numbers from the same normal distribution, specific realizations of T_n can still be very different from those of T; it is only their probability distributions that coincide in the limit. Therefore, this concept is weaker than convergence in probability. To link the two concepts, note that convergence in probability implies convergence of T_n - T to the “random variable” L with expectation \mathbb E[L] = 0 and Var[L] = 0, i.e. the variable L that is equal to 0 with probability 1. Indeed, most probability limits you will come across are real numbers/matrices, rather than stochastic random variables. This gives the following convention: if T_n\overset p \to C, where C is a non-random object (e.g. a real number or matrix), then this implies T_n\overset d \to C, where C is interpreted as the “random variable” equal to C with probability 1.
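The following sketch (assuming numpy) previews both concepts for the sample mean of fair dice rolls, whose limiting behavior is formalized by the law of large numbers and the central limit theorem below:

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 1_000, 10_000

rolls = rng.integers(1, 7, size=(reps, n))  # reps independent samples of n fair dice rolls
means = rolls.mean(axis=1)

# Convergence in probability: every sample mean is close to E[D] = 3.5
print(np.abs(means - 3.5).max())  # small, and shrinking in n

# Convergence in distribution: sqrt(n) (mean - 3.5) is approximately N(0, 35/12)
t = np.sqrt(n) * (means - 3.5)
print(t.mean(), t.var())          # approx. 0 and 35/12 = 2.9167
```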

In our context, the random variable T_n will be the OLS estimator. Treating the observations \{(x_i,y_i)\}_{i\in\{1,\ldots,n\}} as realizations of n independent and identically distributed random variables (X_i,Y_i), the OLS estimator, written in terms of the random variables rather than their concrete realizations, is

    \[ \hat \beta = \left (\mathbf X ' \mathbf X \right )^{-1 } (\mathbf X ' y) =\left  (\sum_{i=1}^n \begin{pmatrix} 1 & X_i \\ X_i & X_i^2 \end{pmatrix} \right )^{-1 }\sum_{i=1}^n \begin{pmatrix} Y_i \\ X_i Y_i \end{pmatrix} = \left (\frac{1}{n} \sum_{i=1}^n \begin{pmatrix} 1 & X_i \\ X_i & X_i^2 \end{pmatrix} \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \begin{pmatrix} Y_i \\ X_i Y_i \end{pmatrix} \]

Adopting the convention \tilde X_i = (1, X_i)', we can more compactly represent it as

    \[ \hat \beta = \left (\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \tilde X_i Y_i \]

To address the error we make in estimation, it is useful to consider the quantity \hat \beta - \beta. To express it in a useful way, we plug Y_i = \tilde X_i'\beta + e_i with e_i = Y_i - \tilde X_i'\beta into the expression above, and obtain

    \[\begin{split} \hat \beta - \beta &=  \left (\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \tilde X_i (\tilde X_i'\beta + e_i ) - \beta \\ & = \left [\left (\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i' \right ]\beta - \beta + \left (\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \tilde X_i e_i \\ &= \left (\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \tilde X_i e_i. \end{split}\]

This is the quantity that we need to study in order to characterize the error we make when estimating \beta using \hat\beta. The final ingredients are given in the theorems below. For the sake of compactness, they are not proven here; the results are fundamental enough that you will find comprehensive proofs in standard textbooks and even on Wikipedia. For our purposes, however, it is much more important to apply them properly than to master their theoretical justification.

Theorem: Law of large numbers.
Consider a set \{T_i\}_{i\in\{1,\ldots,n\}} of n independently and identically distributed random variables with \mathbb E[T_i^2] < \infty. Then,

    \[ \bar T_n := \frac{1}{n} \sum_{i=1}^n T_i \overset p \to \mathbb E[T_i]. \]
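
A short simulation sketch (illustrative distribution and numbers) shows the law of large numbers at work:

    import numpy as np

    rng = np.random.default_rng(2)
    # T_i: iid draws from an exponential distribution with E[T_i] = 2.
    for n in [100, 10_000, 1_000_000]:
        T_bar = rng.exponential(scale=2.0, size=n).mean()
        print(n, T_bar)                            # approaches E[T_i] = 2 as n grows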

 

Theorem: Central limit theorem.
Consider a set \{T_i\}_{i\in\{1,\ldots,n\}} of n independently and identically distributed random variables with \mathbb E[T_i^4] < \infty. Then, for \bar T_n := \frac{1}{n} \sum_{i=1}^n T_i, it holds that

    \[ \sqrt{n}(\bar T_n - \mathbb E[T_i]) \overset d \to N(0, Var[T_i]). \]
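
Analogously, a brief sketch (again with illustrative numbers) visualizes the central limit theorem: even for strongly non-normal T_i, the centered and scaled mean behaves like a N(0, Var[T_i]) draw.

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 500, 10_000
    # T_i: iid Bernoulli(0.3), so E[T_i] = 0.3 and Var[T_i] = 0.21.
    draws = rng.binomial(1, 0.3, size=(reps, n))
    Z = np.sqrt(n) * (draws.mean(axis=1) - 0.3)
    print(Z.mean(), Z.var())                       # approximately 0 and 0.21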

 

Theorem: Slutsky’s theorem.
Let \{T_n\}_{n\in\mathbb N} and \{W_n\}_{n\in\mathbb N} be sequences of random variables (or random matrices) with T_n\overset d \to N(0, \Sigma) and W_n \overset p \to W, where W is a non-random matrix. Then,

    \[W_n T_n \overset d \to N(0, W\Sigma W') \]

and if W is invertible (so that W_n is invertible with probability approaching 1), then also

    \[ W_n^{-1} T_n \overset d \to N(0, W^{-1}\Sigma (W^{-1})') \]

 

One last detail will help us put everything together: note that

    \[\begin{split} \mathbb E[\tilde X_i e_i] &= \mathbb E[\tilde X_i Y_i -\tilde X_i \tilde X_i'\beta]  \\& =  \mathbb E[\tilde X_i Y_i] -\mathbb E[\tilde X_i \tilde X_i'] \beta \\& =  \mathbb E[\tilde X_i Y_i] -\mathbb E[\tilde X_i \tilde X_i'] \left (\mathbb E[\tilde X_i \tilde X_i']^{-1}\mathbb E[\tilde X_i Y_i]\right ) \\& = \mathbf 0. \end{split} \]

For a derivation of \beta = \mathbb E[\tilde X_i \tilde X_i']^{-1}\mathbb E[\tilde X_i Y_i], see the excursion above. With this, we proceed as follows:

Step 1. Applying the central limit theorem,

    \[ \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde X_i e_i = \sqrt{n} \left ( \frac{1}{n}\sum_{i=1}^n \tilde X_i e_i - \underbrace{\mathbb E[\tilde X_i e_i]}_{=0}\right ) \overset d \to N(0, Var[\tilde X_ie_i]). \]

The variance of the asymptotic distribution is

    \[ Var[\tilde X_ie_i] = \mathbb E[\tilde X_i \tilde X_i ' e_i^2] - \underbrace{\mathbb E[\tilde X_i e_i]\mathbb E[\tilde X_i e_i]'}_{=\mathbf 0} = \mathbb E[\tilde X_i \tilde X_i ' e_i^2] \]

so that compactly,

    \[ \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde X_i e_i \overset d \to N(0, \mathbb E[\tilde X_i \tilde X_i ' e_i^2]) \]

Step 2. By the law of large numbers,

    \[ \frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \overset p \to \mathbb E[\tilde X_i \tilde X_i']. \]

Moreover, the law of large numbers gives

    \[ \frac{1}{n} \sum_{i=1}^n \tilde X_i e_i \overset p \to \mathbb E[\tilde X_i e_i] = \mathbf 0. \]

Step 3. The two components above correspond exactly to the setup of Slutsky’s theorem. First, combining the two results of Step 2, and recalling that convergence in probability to a constant implies convergence in distribution to that same constant, Slutsky’s theorem yields:

Corollary: OLS consistency.
For the OLS estimator \hat\beta = \left (\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i' \right )^{-1 }\frac{1}{n}\sum_{i=1}^n \tilde X_i Y_i, it holds that

    \[ \hat \beta - \beta \overset p \to \mathbb E[\tilde X_i \tilde X_i']^{-1} \mathbb E[\tilde X_i e_i] = \mathbf 0 \]

and thus also \hat\beta \overset p \to \beta. Therefore, we say that \hat\beta is a consistent estimator of \beta.
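
Consistency is easy to make visible in a small Monte Carlo sketch (simulated data with hypothetical coefficients): the estimation error shrinks as the sample grows.

    import numpy as np

    rng = np.random.default_rng(4)
    beta = np.array([1.0, 2.0])                    # (beta_0, beta^x)

    for n in [100, 10_000, 1_000_000]:
        X = rng.normal(size=n)
        e = rng.normal(size=n) * (1 + 0.5 * np.abs(X))  # heteroskedasticity is allowed
        Y = beta[0] + beta[1] * X + e
        Xt = np.column_stack([np.ones(n), X])
        beta_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)
        print(n, np.abs(beta_hat - beta).max())    # shrinks towards 0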

 

This tells us that deviations of \hat\beta from \beta of any given magnitude become arbitrarily unlikely as the sample grows. However, it does not tell us anything about the distribution of \hat \beta at the concrete, finite sample sizes n we face in practice. For this, we consider instead the distribution of \sqrt{n}(\hat \beta - \beta):

    \[ \sqrt{n}(\hat \beta - \beta) = \left (\underbrace{\frac{1}{n} \sum_{i=1}^n \tilde X_i \tilde X_i'}_{\overset p \to \mathbb E[\tilde X_i \tilde X_i']} \right )^{-1 }\underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde X_i e_i}_{\overset d \to N(0, \mathbb E[\tilde X_i \tilde X_i ' e_i^2])}  \hspace{0.5cm}\overset d \to \hspace{0.5cm} N(0,  \mathbb E[\tilde X_i \tilde X_i']^{-1} \mathbb E[\tilde X_i \tilde X_i ' e_i^2]\mathbb E[\tilde X_i \tilde X_i']^{-1} ) \]

We define \Sigma_{\hat\beta} := \mathbb E[\tilde X_i \tilde X_i']^{-1} \mathbb E[\tilde X_i \tilde X_i ' e_i^2]\mathbb E[\tilde X_i \tilde X_i']^{-1} as the asymptotic variance of \hat\beta. With this result, for sufficiently large samples, the distribution of \sqrt{n}(\hat \beta - \beta) is approximately equal to that of a random variable with distribution N(0, \Sigma_{\hat\beta}). This is useful because it also means that the distribution of \hat \beta - \beta is approximately equal to N(0, \frac{1}{n}\Sigma_{\hat\beta}) (by Var[aX] = a^2Var[X]), and finally that

    \[ \hat \beta \overset a \sim N(\beta, \frac{1}{n}\Sigma_{\hat\beta}). \]

Here, \overset a \sim is used to indicate that asymptotically, the distribution of \hat\beta can be described as N(\beta, \frac{1}{n}\Sigma_{\hat\beta}). With some oversimplification: for large enough n, the distribution of \hat\beta is as good as indistinguishable from a normal distribution with mean \beta and variance \frac{1}{n}\Sigma_{\hat\beta}.

The important piece of information obtained from this is that we now have an approximation to the finite-sample variance of \hat \beta that is “good” if the sample is “large”. In practice, a few hundred observations are usually considered sufficient for simple models, and a few thousand for more complex models, i.e. longer vectors \beta. The matrix \Sigma_{\hat\beta} = \mathbb E[\tilde X_i \tilde X_i']^{-1} \mathbb E[\tilde X_i \tilde X_i ' e_i^2]\mathbb E[\tilde X_i \tilde X_i']^{-1} is straightforward to estimate by using sample averages instead of expectations and replacing the unobserved e_i with its sample counterpart \hat e_i = Y_i - \tilde X_i ' \hat \beta:

    \[ \hat \Sigma_{\hat\beta} = \left (\frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i'\right )^{-1}\frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i' \hat e_i^2 \left (\frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i'\right )^{-1}. \]
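
The following sketch (simulated data with hypothetical parameters) computes this “sandwich” estimator exactly as written:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 2_000
    X = rng.normal(size=n)
    Y = 1.0 + 2.0 * X + rng.normal(size=n) * (1 + 0.5 * np.abs(X))

    Xt = np.column_stack([np.ones(n), X])          # rows are tilde X_i'
    beta_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)
    e_hat = Y - Xt @ beta_hat                      # sample residuals

    A = (Xt.T @ Xt) / n                            # (1/n) sum of tilde X_i tilde X_i'
    B = (Xt * e_hat[:, None] ** 2).T @ Xt / n      # (1/n) sum of tilde X_i tilde X_i' e_i^2
    Sigma_hat = np.linalg.inv(A) @ B @ np.linalg.inv(A)
    print(Sigma_hat / n)                           # estimated variance matrix of beta_hat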

In the matrix \frac{1}{n} \hat \Sigma_{\hat\beta}, the (1,1) element gives the estimated variance of \hat\beta_0, and the (2,2) element gives that of \hat \beta^x; they are denoted as \hat\sigma^2_{\beta_0} and \hat\sigma^2_{\beta^x}, respectively. One can show that

    \[ \frac{\hat\beta_0 - \beta_0}{\hat\sigma_{\beta_0}} \overset a \sim N(0, 1)\hspace{0.5cm}\text{and}\hspace{0.5cm}  \frac{\hat\beta^x - \beta^x}{\hat\sigma_{\beta^x}} \overset a \sim N(0, 1). \]

This circumstance can be used to test hypotheses such as “\beta^x = 0” or “\beta^x = 1”, as mentioned in the introduction of this subsection. The interested reader is referred to test theory; in a nutshell, the procedure is as follows. You assume that your hypothesized value, say \beta^x = 0, is indeed the true value of \beta^x. Under this hypothesis, \frac{\hat\beta^x}{\hat\sigma_{\beta^x}} should be normally distributed with zero mean and unit variance. You then investigate the probability of observing a deviation of at least \bigl|\frac{\hat\beta^x}{\hat\sigma_{\beta^x}} \bigr| from zero under this hypothesis, i.e. in a scenario where \frac{\hat\beta^x}{\hat\sigma_{\beta^x}} is indeed distributed N(0,1): if Z\sim N(0,1), then for c>0,

    \[ P(|Z|>c) = P(Z<-c) + P(Z > c). \]

By symmetry of the normal distribution, P(Z<-c) = P(Z>c) , and

    \[ P(|Z|>c) = 2P(Z>c) = 2 (1 - P(Z\leq c)) = 2(1 - \Phi(c)) \]

where \Phi(\cdot) is the distribution function of the standard normal distribution N(0,1). Therefore, the probability of observing a deviation from zero of at least \bigl|\frac{\hat\beta^x}{\hat\sigma_{\beta^x}} \bigr| is

    \[ p = 2\left (1-\Phi\left (\bigl|\frac{\hat\beta^x}{\hat\sigma_{\beta^x}} \bigr|\right )\right ) \]

This quantity is called the p-value and is typically reported in regression output tables. If this probability is sufficiently large, you cannot reject the hypothesis that \beta^x = 0. If it is sufficiently low, however, you reject the hypothesis. If your model is specified in a way that allows for a marginal effect interpretation, this rejection is required before you may claim that the model identifies a non-zero effect of X on Y.
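
Putting the inference steps together, a minimal sketch (simulated data with hypothetical coefficients; scipy provides \Phi) computes the t-statistic and the p-value for the hypothesis \beta^x = 0:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    n = 2_000
    X = rng.normal(size=n)
    Y = 1.0 + 0.1 * X + rng.normal(size=n)         # small true effect beta^x = 0.1

    Xt = np.column_stack([np.ones(n), X])
    beta_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)
    e_hat = Y - Xt @ beta_hat

    A = (Xt.T @ Xt) / n
    B = (Xt * e_hat[:, None] ** 2).T @ Xt / n
    V = np.linalg.inv(A) @ B @ np.linalg.inv(A) / n   # (1/n) Sigma_hat
    se_x = np.sqrt(V[1, 1])                        # hat sigma_{beta^x}

    t = beta_hat[1] / se_x                         # test statistic for beta^x = 0
    p = 2 * (1 - norm.cdf(abs(t)))                 # p = 2(1 - Phi(|t|))
    print(t, p)                                    # a small p lets us reject beta^x = 0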

To conclude the discussion of the linear model: we introduced the concept of the conditional expectation and the linear conditional expectation regression model. This model is also useful if the true conditional expectation is not linear, among others due to a Taylor approximation justification. In this model, coefficients can be interpreted as marginal effects of one variable on the outcome Y, holding the other variables constant. The model therefore allows us to describe the partial relationship between X and Y that is unexplained by third variables Z. It can be estimated using the OLS estimator, which we derived from a simple unconstrained optimization problem. This estimator has powerful theoretical properties, and it always consistently estimates the best linear prediction of Y given X. The practical task of the econometrician is usually not to ensure “consistent” estimation of the linear prediction (which is guaranteed), but to write down a linear prediction that is useful for studying the practical issue at hand. Beyond estimation, we covered inference, i.e. methods of quantifying estimation uncertainty, as well as their justification through large-sample asymptotics. These methods allow us to test for, and especially to reject, certain values of the estimated parameters in the model.

Correlation and Causality Revisited

The mathematics of econometric models is one side of the coin the empirical economist has to understand; their practical application is the other. While this script intends to give an introduction to the mathematical foundations of the economic profession, the latter aspect is still important enough to deserve a brief discussion here as well. As mentioned in the first sections of this chapter, our goal is usually to arrive at a causal conclusion (“X causes Y”), while our methods measure only a correlational relationship (“X predicts Y”). While the linear model takes one step from correlation towards causality by allowing us to hold fixed some third variables, its interpretation is still inherently correlational: the most we can say is that “holding constant Z, X predicts Y”, or equivalently, “X predicts Y given Z”. So, how do we argue for causality, and what needs to be kept in mind in these arguments?

The typical approach is very simple: you assume that there exists a true, causal model of the form

    \[ Y = \gamma_0 + \gamma^x X + e \]

i.e. that Y has some baseline level \gamma_0 and is linearly driven by X (\gamma^x X) and other factors e. This is an assumption that you cannot test in practice, just like you cannot test any assumption you impose on a theoretical model. Your research questions are typically:

    1. Does X affect Y, i.e. does the hypothesis \gamma^x \neq 0 hold?
    2. If \gamma^x \neq 0, is the causal effect economically relevant?

Regarding the latter point, not every non-zero effect is considered relevant by the economist. If you find that the introduction of 1000 robots to an industry reduces employment in this industry by 0.0001%, and there are no industries in your sample that introduce more than 1000 robots in a given year, then you would not claim that robot adoption reduces employment. The threshold for relevance, however, is subjective and also depends on the context.

There are two common issues that may prevent you from estimating the causal \gamma-coefficients through the predictive model:

    1. Omitted variables: there are third variables Z that cause Y and are correlated with X.
    2. Reverse causality: Y causes X.

While reverse causality is not a problem in the causal model, it is an issue in the predictive model. Suppose you are a firm that sells backpacks at 50 Euros a piece. An employee working in marketing wants to convince you that you should spend more on marketing, as he thinks business expansion could greatly boost the number of backpacks you sell. He shows you estimates from a linear model:

    \[ x_t = 0.02 \cdot r_t + e_t \]

where x_t is the number of backpacks sold in month t and r_t is your firm’s revenue in this month. He has also learned about inference and shows you that his estimates are highly statistically significant. Would you, then, believe that if you managed to grow as a firm in terms of revenue, this would attract more new customers?

Well, if you re-write his estimates, you obtain

    \[ r_t = 50 \cdot x_t - 50 \cdot e_t \]

Given that your revenue comes from sales of backpacks, which you sell at 50 Euros a piece, this equation does not look very surprising. This captures the essence of reverse causality: just because X is a good predictor of Y, it does not follow that X causes Y, even if X and Y are causally related; the predictive quality could equally arise because Y causes X. Indeed, his estimates from the predictive equation are perfectly consistent with causal equations in which X does not cause Y, i.e. \gamma^x = 0, but Y causes X with coefficient 50.
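
This mechanical relationship is easy to replicate. In the following sketch (hypothetical monthly sales figures; the noise term is omitted for clarity), regressing sales on revenue recovers the coefficient 0.02 even though causality runs from sales to revenue:

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.poisson(200, size=120).astype(float)   # monthly backpack sales
    r = 50.0 * x                                   # revenue: 50 Euros per backpack

    # Univariate regression of x_t on r_t without an intercept
    coef = (r @ x) / (r @ r)
    print(coef)                                    # exactly 0.02, although x causes r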

So, how do we manage to estimate the causal coefficients using the predictive model? Consider omitted variables first: since they cause Y, they must be contained in the remaining factors e. If they enter Y linearly, we can write e = \gamma^z Z + \varepsilon, so that the causal equation becomes

    \[ Y = \gamma_0 + \gamma^x X + \gamma^z Z + \varepsilon. \]

Intuitively, this equation explicitly accounts for all variables Z that would induce X to predict Y for non-causal reasons. Indeed, one can show that estimation of this model allows us to recover the causal coefficient \gamma^x; the formalities of this investigation are, however, beyond the purpose of this course. Note that we design the model to estimate the causal coefficient of X, \gamma^x. The causal coefficient(s) of Z may not be estimated consistently if there are further third variables that determine Y and are correlated with Z. Thus, always limit your interpretation as much as possible to the coefficients of the variable(s) for which you constructed the model, and abstain from speculating about the coefficients of control variables.
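
A simulation sketch (all coefficients hypothetical) illustrates the omitted variable problem and how including Z as a control resolves it:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 50_000
    Z = rng.normal(size=n)
    X = 0.8 * Z + rng.normal(size=n)               # X is correlated with Z
    Y = 1.0 + 2.0 * X + 3.0 * Z + rng.normal(size=n)   # true gamma^x = 2, gamma^z = 3

    def ols(M, y):
        # OLS coefficients via the normal equations
        return np.linalg.solve(M.T @ M, M.T @ y)

    short = ols(np.column_stack([np.ones(n), X]), Y)
    long_ = ols(np.column_stack([np.ones(n), X, Z]), Y)
    print(short[1])                                # biased upwards by the omitted Z
    print(long_[1])                                # close to the causal coefficient 2.0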

Control variables can also be used to address issues of reverse causality. Usually, if we are in the comfortable situation of having data for several periods, the first step to circumvent reverse causality is to adjust the timing of the variables, exploiting that causation can only work forward in time, never backwards. Continuing the example of the backpack firm: since last month’s revenue cannot have been caused by this month’s backpack sales, this step suggests specifying the model

    \[ x_t = \beta_0 + \beta^r r_{t-1} + e_t. \]

However, a second step may still be necessary if backpack sales are serially dependent, i.e. if x_t depends on x_{t-1}. For instance, if customers are satisfied with your product and recommend it to others, then high sales numbers in one month may lead to higher sales in the following months because more people recommend your product. In your model, this means that x_{t-1} causes x_t (not directly, but indirectly through increased word-of-mouth advertising). This is an issue because x_{t-1} is clearly correlated with your explanatory variable of interest, r_{t-1}. This discussion suggests including x_{t-1} as a control variable. Your model becomes

    \[ x_t = \beta_0 + \beta^r r_{t-1} + \beta^x x_{t-1} + e_t. \]

Whether it is sufficient to include just one lag of the dependent (i.e., left-hand side) variable is not a trivial question, but an issue of central interest in the field of time series analysis. Intuitively, a second lag would be needed if it explains today’s values even conditional on the first lag, which may occur for more complex dynamic processes. In practice, a simple but effective approach is to estimate the model once with and once without the second lag and to compare the estimates obtained for the coefficient of interest. If they differ across models, the second lag should be included, an analogous test should be performed on whether or not to include a third lag, and the process should be iterated until the coefficient of interest stabilizes.
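
As a sketch of this practice (simulated monthly data with hypothetical coefficients; revenue is treated as exogenous for simplicity), one can compare the estimate of \beta^r with one and with two lags of the dependent variable:

    import numpy as np

    rng = np.random.default_rng(9)
    T = 5_000
    r = rng.normal(10.0, 1.0, size=T)              # exogenous revenue series
    x = np.zeros(T)
    for t in range(1, T):
        # sales depend on last month's sales (word of mouth) and last month's revenue
        x[t] = 1.0 + 0.5 * x[t - 1] + 0.3 * r[t - 1] + rng.normal()

    def ols(M, y):
        return np.linalg.solve(M.T @ M, M.T @ y)

    ones = np.ones(T - 2)
    one_lag = ols(np.column_stack([ones, r[1:-1], x[1:-1]]), x[2:])
    two_lags = ols(np.column_stack([ones, r[1:-1], x[1:-1], x[:-2]]), x[2:])
    print(one_lag[1], two_lags[1])                 # beta^r barely moves: one lag suffices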

To summarize, we have briefly reviewed the two most common issues in empirical analysis, omitted variables and reverse causality, and we have seen how they can be addressed through control variables and the appropriate timing of variables. In practice, omitted variables in particular are tricky for two reasons: (i) you do not know ex ante which third variables may be relevant for the relationship of interest, and (ii) not all of the variables you deem relevant are measured in the data available for your analysis. To address the former point, we usually estimate relatively rich models with many variables, as the costs of missing one important variable are much larger than those of including a few irrelevant ones, especially with large datasets. Next to specific variables, we also include fixed effects aimed at absorbing unmeasured variables that vary at higher levels of aggregation; the interested reader can look up the standard fixed effects and the two-way fixed effects models. To address the latter point, several techniques have been developed to estimate causal parameters in environments where not all relevant third variables are observed, or where we cannot be sure that they are. Here, the interested reader is referred to instrumental variables estimation, difference-in-differences estimation, regression discontinuity designs, and propensity score matching. All these techniques are frequently used in the economic literature, with the former two having been the most prominent.