Chapter 4 – Optimization

An overview of this chapter’s contents and take-aways can be found here.

We have seen a lot of new concepts in the last three chapters. This chapter is concerned with putting all the pieces together and uses them to give you an in-depth understanding of the economist’s favorite exercise: optimization. As anyone with some background in economics should know by now, optimization is virtually everywhere in economic considerations: households optimize the allocation of their time budget to working for income and leisure, optimally divide their budget into consumption and savings over time, optimize the composition of their consumption bundles, firms optimize input use, output quantities, factor composition (i.e., the production shares of labor and capital, respectively), and governments (or: social planners) seek to maximize welfare through public goods provision, minimize negative externalities through market regulations, and optimize the social value generated from each unit of tax money over a broad portfolio of possible actions. In doing so, the agents we consider are almost always constrained in some form: households do not have infinitely much time, firms face output targets, and governments can not spend more money as is generated from tax revenue. Thus, to be able to track all these issues mathematically, we as economists need to firmly understand the techniques of constrained optimization.

Introduction and Key Concepts

As a first step to solving it, let us consider how to write down an optimization problem mathematically.

    \[ (\mathcal{P}_{min}) \hspace{2cm} \begin{aligned} & \underset{x\in \text{ dom}(f)}{\text{minimize}} & & f(x)\\ & \text{subject to} & & g_i(x) = 0, \; i = 1, \ldots, m.\\ & & & h_i(x) \leq 0, \; i = 1, \ldots, k.\\ \end{aligned} \]

 

    \[ (\mathcal{P}_{max}) \hspace{2cm} \begin{aligned} & \underset{x\in \text{ dom}(f)}{\text{maximize}} & & f(x)\\ & \text{subject to} & & g_i(x) = 0, \; i = 1, \ldots, m.\\ & & & h_i(x) \leq 0, \; i = 1, \ldots, k.\\ \end{aligned} \]

Here, f, g_i, i\in\{1,\ldots,m\} and h_i, i\in\{1,\ldots,k\} are functions mapping from the set of x‘s, the domain dom(f), to \mathbb R (i.e., there is only one quantity to be maximized, not a vector). We call f our objective function and x the choice variables. In words, we are looking for the x in the domain of f that yields either the smallest (or largest) value f can possibly attain when we require that the functions g_i must attain the value 0 and h_i can not lie strictly above 0 — we want to minimize (or maximize) f subject to the equality constraints given by the g_i and the inequality constraints by h_i. You may wonder if it is not a restrictive formulation to have only equality constraints requiring a function to be equal to zero, and “less or equal” inequalities. Indeed, it is not: note that we can always re-write a condition \tilde g(x) = c as g(x):= \tilde g(x) - c = 0 and \tilde h(x) \geq c as  h(x):=c-\tilde h(x) \leq 0. Lastly, if m=k=0, we say that the problem is unconstrained.

Throughout the entire chapter, we will most often consider maximization problems, as it is the problem more common in economics. Indeed, every minimization problem with objective f can be equivalently solved as the maximization problem of -f, such that this approach is without loss of generality. Still, conceptually, maximization and minimization problems are of course distinct (one example is the different second order condition for the solution).

Before moving on, a few words on our conceptual approach: as with multivariate differentiation, the elaborations in the online course will focus on more on the “how” of optimization rather than the “why”, which is paid more attention in the companion script and perhaps our course week. Still, we do discuss why the conditions we use in our approaches to solving optimization problems are valid so as to allow you to understand what you are doing when applying them, which should make your applications less error-prone. In doing so, we focus again on intuitive and suggestive reasoning, and omit the formal proofs. While it is of course more important that you know how to correctly apply the methods of constrained optimization and you need not master their formal justification completely just yet, feel encouraged to dig deeper into the formalities behind this issue when you find the time! Due to the importance of constrained optimization in economics, any good economist should be thoroughly aware of why his or her approach to solving these optimization problems works, and once you dig deeper into economic or econometric theory, you will find that formally understanding constrained optimization methods is a necessary condition for being able to deal with the more advanced techniques and more complex issues studied there. So while the details of multivariate differentiation are good to know, the ones of constrained optimization are the ones you are strongly advised to familiarize yourself with at least until the end of your master’s studies – but the sooner the better, since doing so will facilitate any application you see during your coursework and also exams, and also greatly train your skill of dealing with Taylor expansions.

Definitions

Before digging into the whole process of optimization, it is wise to (i) formally define what we are looking for (i.e. maximum or minimum) and (ii) evaluate how “likely” it is that we will find it by looking at specific properties of the objective function and of its domain.

Definition: Extremum: Minimum and Maximum of a Set.
Let X\subseteq\mathbb R. Then, \bar x\in \mathbb R is called the maximum of X, denoted \bar x = \max X, if \bar x = \sup(X) and \bar x\in X. Conversely, \underline{x}\in \mathbb R is called the minimum of X, denoted \underline{x} = \min X, if \underline{x} = \inf(X) and \underline{x}\in X. x\in \mathbb R is called an extremum of X if x=\max X or x = \min X.

 

Verbally, x is the maximum of X if x is the smallest number greater or equal than all elements of X, and it is also contained in X. More simply put, it is the largest value in the set if there is any such value. This need not be the case, recall that e.g. open intervals (a,b) neither have a maximum nor a minimum, but only infimum and supremum. However, for sets of real numbers, if there is a minimum or a maximum, it (i) is unique and (ii) coincides with the infimum or the supremum, respectively.

In terms of our optimization problem, we will be looking for the maximum of the set of attainable values of f under the constraints of the problem \mathcal P:

    \[A_{\mathcal P}(f) = \{f(x): x\in\text{dom}(f),\ g_i(x) = 0,\ h_j(x)\leq 0 \ \forall i\in\{1,\ldots,m\}\forall j\in\{1,\ldots,k\}\}.\]

Now, next to the maximum attainable value, we are frequently also (most of the time: even more) interested in the (set of) solution(s), i.e. the arguments x that maximize f under the constraints of \mathcal P. So, let’s define them in a next step:

Definition: Local and Global Maximizers.
Let X\subseteq\mathbb R^n, f:X\mapsto\mathbb R. Then, x_0\in X is

    • a global maximizer for f if \forall x\in X: f(x_0)\geq f(x)
    • a strict global maximizer for f if \forall x\in X\backslash\{x_0\}: f(x_0) > f(x).
    • a local maximizer for f if there exists \varepsilon>0 such that \forall x\in X\cap B_\varepsilon(x_0): f(x_0)\geq f(x)
    • a strict local maximizer for f if there exists \varepsilon>0 such that \forall x\in X\cap B_\varepsilon(x_0)\backslash\{x_0\}: f(x_0) > f(x)

 

Verbally, local means that there must be a neighborhood (i.e. an open ball B_\varepsilon(x_0) around x_0, restricted to the domain X of f to ensure that f is defined for every considered argument) such that x_0 maximizes f in this neighborhood, or respectively, it maximizes the restricted function

    \[f|_{B_\varepsilon(x_0)\cap X}:B_\varepsilon(x_0)\cap X\mapsto\mathbb R, x\mapsto f(x).\]

Strict means that x_0 is the unique maximizer of f (local: when restricted to a neighborhood), such that all other x (in this neighborhood) yield strictly lower values for f if f is defined in x, i.e. x\in X. Note that local implies global, because if all elements of X yield smaller values for f than x_0, then so do especially all those that lie in neighborhoods of x_0. The converse is not always true, and some local maximizers may not be global ones.

In optimization, infrequently care about the whole set X, as it contains many points that do not satisfy the given constraints. Let us see how we can formally restrict the problem to only those points consistent with the problem’s constraints:

Definition: Restriction of a Function.
Let X,Y be sets and f:X\mapsto Y a function. Then, for S\subseteq X, the function f|_S:S\mapsto Y, x\mapsto f(x) is called the restriction of f on S.

 

Definition: Constraint Set.
Consider an optimization problem \mathcal P with objective function f:X\mapsto\mathbb R, X\subseteq\mathbb R^n, equality constraints g_i(x) = 0 \forall i\in\{1,\ldots,m\} and inequality constraints h_j(x)\leq 0 \forall j\in\{1,\ldots,k\}. Then, the set

    \[C(\mathcal P):= \{x\in X: ((\forall i\in\{1,\ldots,m\}: g_i(x) = 0) \land (\forall j\in\{1,\ldots,k\}: h_j(x)\leq 0 ))\}\]

is called the constraint set of \mathcal P.

 

The constraint set of a problem \mathcal P defines the restriction f|_{C(P)}. Beyond now being able to represent optimization problems more compactly as

    \[ \underset{x\in C(\mathcal P)}{\text{maximize}} f(x) \]

the value of having defined the constraint set is that it more formally narrows down what we are indeed after when considering the problem \mathcal P mathematically: finding the maximizers of the restricted function f|_{C(\mathcal P)}! Note that for an unconstrained problem, we simply have C(\mathcal P) = X so that f|_{C(\mathcal P)} = f. Beyond the general maximizers defined above, this allows us to define the maximizers we care about in optimization problems:

Definition: Solutions, Maximizing Arguments.
Consider an optimization problem \mathcal P with constraint set C(\mathcal P). The solutions to \mathcal P are given by the set of global maximizers of f|_{C(\mathcal P)},

    \[\arg\max_{x\in C(\mathcal P)} f(x) := \{x_0 \in C(\mathcal P): (\forall x\in C(\mathcal P): f(x_0)\geq f(x)\}.\]

We call \arg\max_{x\in C(\mathcal P)} f(x) the maximizing arguments or the arg max of the problem \mathcal P. If this set contains only a single element x^*, we also write x^* = \arg\max_{x\in C(\mathcal P)} f(x).

 

As indicated by the bold words, it is crucial to note that the solutions typically constitute a set that may contain arbitrarily many elements or none at all (consider e.g. \arg\max_{x\in\mathbb R} x^2 = \varnothing or \arg\max_{x\in\mathbb R} \cos(x) = \{2\pi n: n\in\mathbb N\}). Only if there is a unique maximizer, by convention, the arg max also refers to a number or a vector (but still to the set as well)! Finally, note that regardless of whether the problem \mathcal P has solutions, and even regardless of whether there are any values that satisfy the constraints (i.e., whether C(\mathcal P)\neq \varnothing)), the arg max is always defined!

Now that we have introduced all key concepts of the optimization context, it is a good time to make sure that we are thoroughly familiar with them. To practice, consider these two objects that you frequently see across all economic fields, and applications of mathematics more generally:

    \[\max_{x\in C(\mathcal P)} f(x) \hspace{0.5cm}\text{and}\hspace{0.5cm} \arg\max_{x\in C(\mathcal P)} f(x).\]

Try to answer the following questions on their differences and their relationship:

  1. What types of mathematical objects do the quantities refer to (real number/vector/set/function/etc.)?
  2. How do these quantities relate to the domain and codomain of f?
  3. Can you characterize the relationship of the quantities by a quantifying statement?

In doing so, be aware that the former expression is shorthand notation for \max \{f(x): x\in C(\mathcal P)\}.

1. \max_{x\in C(\mathcal P)} f(x) is a real number, \arg\max_{x\in C(\mathcal P)} f(x) is a set.
2. \max_{x\in C(\mathcal P)} f(x) is an element in the codomain of f, \arg\max_{x\in C(\mathcal P)} f(x) is a subset of the domain of f.
3. The relationship between the quantities can be described by

    \[\forall x^*\in \arg\max_{x\in C(\mathcal P)} f(x): (f(x^*) = \max_{x\in C(\mathcal P)} f(x)).\]

 

Characterizing the Set of Solutions

For this subsection, make sure to recall the concepts of (i) a bounded set, (ii) a closed set, (iii) supremum and infimum, and (iv) the Heine-Borel theorem for characterization of compactness in \mathbb R^n that have been introduced throughout the earlier chapters of this script.

First, we need one more definition:

Definition: Bounded Function.
Let f:X\mapsto\mathbb R be a real-valued function. If im(X), the image of X under f is bounded, we say that f is a bounded function. Moreover, for S\subseteq X, we say that f is bounded on S if f|_S is bounded.

 

Recall that when discussing Heine-Borel, we highlighted the value of compact subsets of \mathbb R for optimization: a continuous function will necessarily assume both a (global) maximum and minimum on them, such that compactness of the set is a sufficient condition for existence of solutions to optimization problems! To transfer this intuition to the \mathbb R^n, we need to review what Heine-Borel precisely says about compact sets: they are closed and bounded. Boundedness of a set X\subseteq\mathbb R^n, defined as X being contained in some ball B_r(x_0) where x_0\in\mathbb R^n and 0\leq r<\infty is immensely helpful because it restricts the degree to which the arguments x of a function f:X\mapsto\mathbb R can extend into any direction in X — if they move too far away from x_0, then boundedness tells us that they can no longer be arguments of f! Similar to the univariate case where bounded sets are intervals with finite bounds, this prevents solution-breakers like limits x_k\to\infty, as e.g. in f(x_1,x_2) = x_1 + x_2 where if x is allowed to infinitely expand e.g. in direction e_2 = (0,1)', then it is allowed to diverge to +\infty as x_2\to\infty. As with univariate functions, we can prevent this by defining f on a bounded set. Next, why do we need closedness? For univariate functions, this is also rather straightforward: if e.g. f:(a,b)\mapsto\mathbb R, x\mapsto 2x + 3, then the set is bounded and f can not diverge due to “too large” arguments, but clearly, for any x,y\in(a,b): (x<y\Rightarrow f(x)<f(y)), and since the domain of f does not include its boundary, there are neither minimizing nor maximizing values! This logic extends to the multivariate case in a one-to-one fashion.


551x386

Finally, why do we need continuity? To see this, consider the function

    \[ f:[0,1]\mapsto\mathbb R, x\mapsto f(x) = \begin{cases} x^2 + 2 & x<2/3 \\ 1 & 1\geq 2/3\end{cases}\]

the graph of which is illustrated in the figure above. What is the maximizer of f? Clearly, it is not an element of [2/3,1], so we can restrict our search to [0,2/3). But this set is no longer closed, and we may run into our boundary problem again — and the way f is defined, we indeed do! More generally, if f is discontinuous, the function can “jump”, i.e. it can increase or decrease steadily into one direction, but just before reaching the high/low point, attain an entirely different value. In other words, when f approaches x_0\in X, it approaches the level \lim_{x\to x_0}f(x). If this level would be a maximum/minimum, we need to ensure that it lies in the range of f by requiring f(x_0) = \lim_{x\to x_0}f(x), which is precisely the definition of continuity.

It is possible to show formally that the intuitive line of reasoning above goes through formally:

Theorem: Weierstrass Extreme Value Theorem.
Suppose that X\subseteq\mathbb R^n is compact, and that f:X\mapsto\mathbb R is continuous, then, f assumes its maximum and minimum on X, such that \arg\max_{x\in X} f(x) \neq \varnothing and \arg\min_{x\in X} f(x) \neq \varnothing.

 

Unconstrained Optimization

As per our tradition of moving from easier to harder problems, we begin the study of optimization problems with those that are not subject to any constraints, but just seek to maximize a function f over its domain, i.e.

    \[ (\mathcal P)\hspace{0.5cm}\underset{x\in\text{dom}(f)}{\text{maximize }} f(x)\]

where dom(f)=X\subseteq\mathbb R^n and f:X\mapsto\mathbb R is a real-valued function. From a technical side, note that when considering constrained problems, we may equivalently consider the “unconstrained” problem with objective f|_{C(\mathcal P)}, but the key complication is that continuity of f and appealing properties of the domain do not transfer easily to this issue, such that we need to consider it later on its own right.

First and Second Order Necessary Conditions

To solve optimization problems, we typically rely on a combination of necessary and sufficient conditions. Here, we focus on the perhaps most commonly known conditions: the first and second order necessary conditions for a local extremizer.

For univariate functions f:X\mapsto\mathbb R, X\subseteq\mathbb R, you may know how to approach the problem of unconstrained optimization: set the first derivative of f to zero, and check that the second derivative is smaller than 0 — any point that satisfies these conditions is a candidate for an interior solution. Next to the interior solutions, you will have to consider border solutions: points of non-differentiability and points on the boundary of the support X of f. There are, of course, applications where you need not worry about border solutions: if f is differentiable everywhere (especially: f\in C^2(X), so that the second derivative is easily computed) or the boundary points are either non-existent (X=\mathbb R) or excluded from the domain (X=(a,b)). Either way, you evaluate f at all candidate points that you found, and the one yielding the highest value is your global maximizer. Here, we focus on how this procedure extends to the multivariate context.

First of all, it is noteworthy that we restrict attention to local maximizers in our analytical search, and only subsequently identify the global maximum from the rather crude and inelegant technique of comparing the value of the objective at all candidates. To find necessary conditions, we restrict attention to the interior, the (finite set of) boundary points are considered in isolation subsequently! Note that we assume away the non-differentiability issue by imposing f\in C^2(X), otherwise, we would worry about these points too.

To come up with the necessary conditions we use, we imagine for the moment that there is a point x^*\in int(X) that maximizes f locally, and search for some (ideally strong) characteristic features x^* must necessarily have. Starting from the univariate case, suppose x^* is a local maximizer, such that for the \varepsilon-ball restricted to X with \varepsilon>0, there are no values of f above f(x^*), i.e. \forall x\in B_\varepsilon(x^*): f(x)\leq f(x^*). Then, how does f look like around x^*, provided that it is (twice) differentiable and thus especially continuous?


537x299

As you can see, f must either be flat around x^*, or constitute a “hill” with peak x^* (a mixture also exists, with flatness on one side and a “downhill” part on the other). What does this mean? As illustrated, intuitively, we should have a zero slope of f at x^*, or for a multivariate function f, a zero gradient, i.e. a zero slope into any direction (generalizing the concepts of flatness and hilltop to the \mathbb R^n). Indeed, this intuition holds formally:

Theorem: Unconstrained Interior Maximum – First Order Necessary Condition.
Let X\subseteq\mathbb R^n and f\in C^1(X). Suppose that x^*\in int (X) is a local maximizer of f. Then, \nabla f(x^*) = \mathbf 0.

 

The intuition for the formal reason behind this theorem is easy to see from the univariate case: recall that a differentiable function f was strictly increasing (decreasing) in x_0 if and only if f'(x_0) > 0\ (f'(x_0) < 0), and because f is twice differentiable, f' is differentiable and thus especially continuous. Thus, if f'(x^*) > 0, (f'(x^*) < 0), by continuity, f'(x)>0 (f'(x)<0) for points “close enough” to x^*, and thus, there lie points slightly to the right (left) of x^* with strictly larger values, and x^* can not be a maximizer! Thus, x^* can not be a local maximizer unless f'(x^*) = 0 – which gives the necessary condition. For the multivariate case, this reasoning applies along any of the fundamental directions of the domain, which involves slightly more notation, but the principle is the same.

The theorem above suggests that points with a zero gradient are important. Indeed, they deserve an own label:

Definition: Critical Point or Stationary Point.
Let X\subseteq\mathbb R^n, f:X\mapsto\mathbb R and x^*\in X. Then, if f is differentiable at x^* and \nabla f(x^*) = \mathbf 0, we call x^* a critical point of f or a stationary point of f.

 

Note that just because we have defined generally, a critical point need not exist in every specific scenario! There are a broad variety of functions that have a non-zero gradient everywhere, e.g. f(x_1, x_2) = x_1 + x_2, just like there are univariate functions with a globally non-zero derivative. Also, it is easily seen that this condition is not sufficient for a local maximizer, because \nabla f(x^*) = \mathbf 0 applies also to local minima that are not local maxima. Accordingly, the first order necessary condition is the same for local minima and maxima! Moreover, \nabla f(x) = 0 applies to “saddle points”, as illustrated in the left panel of the figure above, which are neither local minima nor maxima.

As has just emerged, the distinction between local maxima and minima lies beyond the first derivative. Thus, it is generally not sufficient to set the first derivative to zero when looking for a maximum! But as you know, we can consult the second derivative to tell maximizers and minimizers apart. To find a further helpful necessary condition, our thought experiment is again to start from a local maximizer and think about a property it necessary satisfies.

To see the intuition, consider again the univariate case: if x^* is a local maximizer of f, in a small-enough neighborhood around x^*, there can be no points x so that f(x) > f(x^*). Accordingly, it must necessarily be the case that f'(x) \geq 0 to the left and f'(x) \leq 0 to the right of x^* – or respectively, f' must be a decreasing function around x^*. Because f' is differentiable, we can express this notion using its derivative f": in a neighborhood of x^*, we must have that f"(x)\leq 0. To extend this to the multivariate case, recall that matrix definiteness can be viewed as a generalization of the sign, so that the second derivative of a multivariate function, its Hessian, is “smaller or equal to zero” if it is negative semi-definite. Indeed, it formally holds that:

Theorem: Unconstrained Interior Maximum — Second Order Necessary Condition.
Let X\subseteq\mathbb R^n and f\in C^1(X). Suppose that x^*\in int (X) is a local maximizer of f. Then, if f is twice continuously differentiable at x^*, H_f(x^*) is negative semi-definite.

 

Similarly, for a local minimum, we can of course invert the logic of our intuitive reasoning above. This suggests:

Theorem: Unconstrained Interior Minimum — Second Order Necessary Condition.
Let X\subseteq\mathbb R^n and f\in C^1(X). Suppose that x^*\in int (X) is a local minimizer of f. Then, if f is twice continuously differentiable at x^*, H_f(x^*) is positive semi-definite.

 

Indeed, also the formal reasoning behind the second order necessary condition for the minimum is exactly analogous to the one for the maximum.

One can nicely illustrate the grasp of this necessary condition by illustrating critical values of bivariate functions it rules out: consider f(x_1,x_2) = x_1^2 - x_2^2. The necessary condition of a critical point tells us that any candidate x^* = (x_1^*,x_2^*) must satisfy

    \[\nabla f (x_1^*,x_2^*) = (2x_1^*, -2x_2^*) = \mathbf 0\wspace\Leftrightarrow  (x_1^*,x_2^*) = \mathbf 0.\]

Thus, there is a unique candidate… But is it also a maximizer (or minimizer)? Let’s consult the Hessian — it is straightforward to verify that for all x=(x_1,x_2)\in\mathbb R^2,

    \[H_f(x) = \begin{pmatrix}2&0\\0&-2\end{pmatrix}.\]

What about its definiteness? Pick z = (z_1,z_2)'\in\mathbb R^2. Then,

    \[z'H_f(x)z = (z_1,z_2) \begin{pmatrix}2&0\\0&-2\end{pmatrix}\begin{pmatrix}z_1\\z_2\end{pmatrix} = (z_1,z_2) \begin{pmatrix}2z_1\\-2z_2\end{pmatrix} = 2( z_1^2 - z_2^2).\]

Thus, there exist z = (1,0) and \tilde z = (0,1) such that

    \[z'H_f(x)z > 0 > \tilde z'H_f(x)\tilde z\]

and H_f(x) is neither positive nor negative semi-definite, i.e. it is indefinite. Thus, the solution x^* = \mathbf 0 can neither constitute a local maximum nor a local minimum: we call this a saddle point. To see why, consult the figure below, and think how this would fit on a horse (or your preferred animal to ride).


412x362

The crucial thing to take away from multivariate saddle points is that the critical value constitutes a maximum in one direction (here: e_2 = (0,1)', i.e. moving along the x_2-axis) but a minimum in the other (e_1 = (1,0)').

Also, it is important to know that this condition is indeed only necessary – it still applies to more points than we would ideally hope for, and even combining both necessary conditions introduced above, we still find points that are no local extremizers. To see this in an example, consider f(x) = x^3:


240x400

Clearly, the function has a saddle point at x=0. Since f'(x) = 3x^2 and f"(x) = 6x, we get the unique critical value at x=0 with second derivative f"(x) = 0, which is indeed negative semi-definite! Therefore, both necessary conditions for a local maximum hold at this point. However, so does the positive semi-definiteness condition for the local minimum. The graph will ensure you that the point is neither: f'(x)>0 for any x\neq 0, the function strictly increases (decreases) in value for infinitely small deviations to the right (left) of x=0. Thus, both the first and second order necessary conditions hold, but we don’t have a maximum (or a minimum). Hence, we need more conditions to rule out points like x=0 in the example here, and ideally this time sufficient ones that leave no room for critical points that are not the type of local extremum we want to find.

Sufficient Conditions and Application Procedure

Continuing our intuitive investigations into the univariate case, we can indeed strengthen our requirements imposed on the second derivative to get a sufficient condition. Above, we said that around a local maximum, f' must be decreasing, a notion that also allows f' to be constant. However, if it is not, i.e. if f"(x)<0 and f' is strictly decreasing around x^*, then we know that any point in a small-enough neighborhood of x^* will lead to strictly smaller values of the objective f – and in such cases, x^* always constitutes a strict local maximizer! From the word “always”, you can immediately see that the condition considered will be sufficient. To generalize to the multivariate case, we again rely on the intuitive link between the sign of a scalar second derivative and the definiteness of a matrix valued one.

Note that we only strengthened the second order condition; the first order necessary condition, i.e. x^* being a critical point, must still be imposed separately. This gives:

Theorem: Unconstrained Interior Strict Local Maximum — Sufficient Conditions.
Let X\subseteq\mathbb R^n, f\in C^2(X) and x^*\in  int(X). Suppose that x^* is a critical point of f, and that H_f(x^*) is negative definite. Then, x^* is a strict local maximizer of f.

 

Again, analogous reasoning can be applied to obtain the corresponding theorem for the minimum:

Theorem: Unconstrained Interior Strict Local Minimum — Sufficient Conditions.
Let X\subseteq\mathbb R^n, f\in C^2(X) and x^*\in  int(X). Suppose that x^* is a critical point of f, and that H_f(x^*) is positive definite. Then, x^* is a strict local minimizer of f.

 

While the necessary condition was not sufficiently restrictive to find only local maxima, the sufficient condition faces the converse issue: while it only finds local maxima, it may be even too restrictive to find all of them! This needs to be taken into account when coming up with our solution technique below.

Now that we have established the sufficient condition for a local interior maximum, let us take a moment to think about what to make of the results thus far. Indeed, we are ready to discuss the general approach to solving unconstrained optimization problems. When concerned with interior maxima, we can restrict the candidate set to critical values, because all potential maxima are necessarily critical values! Then, if f is “sufficiently smooth” (i.e., it satisfies all required differentiability conditions — here: being twice continuously differentiable at all critical values) we can check the Hessian and determine its definiteness. If at a critical value x^*, H_f is negative (positive) definite, we have sufficient evidence that x^* is a local maximum (minimum)! That may not be necessary — but we can rule out any value x^* where H_f is not at least negative (positive) semi-definite, because they violate the necessary conditions. Thus, we are generally happy if all critical values are associated with either a positive/negative definite Hessian or an indefinite one — but we have to beware semi-definite Hessians, for which no theorem tells us whether or not such points are extrema! As a further note of caution, all theorems thus far have concerned interior points — for boundary points, no theorem exists, and we must either consider them in isolation or simply consider f with an open set dom(f) such that the boundary issue is avoided altogether, which, however, may prevent us to argue for solution existence using the Weierstrass theorem (as typically, an open set is either not closed or not bounded).

Accordingly, the consideration of the Hessian function helps us to categorize all our candidates – the critical points solving the first order condition plus “border solutions” – into the following categories:

 

\begin{tabular}{c|cccc}\hline Candidate Type & Property & Characterization (SOC) & keep? & Reason\\\hline\hline Interior & Hessian neg. def. & strict local maximizer & yes & Suff. SOC\\ Interior & Hessian pos. def. & strict local minimizer & no & Necess. SOC\\ Interior & Hessian indefinte & not a local extremizer & no & Necess. SOC\\ Interior & Hessian otherwise not neg. semi-def. & not a local maximizer & no & Necess. SOC\\ Interior & Hessian neg. semi-def. & don't know & yes & can not rule out\\ Border & Point of Non-differentiability & - & yes & -\\ Border & Boundary Point & - & yes & -\\\hline \end{tabular}

 

The last two colums indicate whether we keep the respective point in the set of candidates for which we compare values in the final step when searching for a global maximum, and why (or why not). Note that indefinite and positive definite Hessians are special cases of Hessians that are not negative definite. The reason that they are listed separately is that we can derive a destinctive characterization from the conditions we discussed.

Now, we know the final set of candidates for which we know, or can not rule out, that they constitute a local maximum. In the last step, it remains to evaluate the function at all these points and see which point yields the largest value to conclude on the global maximum. When doing so, one thing has to be kept in mind: we need some justification that the local maximum we found indeed constitutes a global maximum, that is, that the global maximum does indeed exist (in which case, recall, it will definitely be also a local maximum). This is not ex ante guaranteed! In most cases, we can establish this either through the Weierstrass theorem, or by considering the limit behavior of the function – in the multivariate context, along any of its (fundamental) directions. More details can be found in the script.

To see the intuition in the univariate case, consider

    \[f:[0,\infty)\mapsto\mathbb R, x\mapsto \frac{x}{x^2 + 1}.\]

Here, the only critical point is x=1, where f"(1) = -0.5. Thus, we identify the point as a strict local maximum per our conditions. Since f(1) = 0.5 > 0 = f(0), the only boundary point here, x=0, can be ruled out as the global maximum. Still, ex ante, it is not guaranteed that \lim_{x\to\infty} f(x)< f(1) = 0.5, which we have to additionally establish to argue for the global maximum. This is straightforward here, since

    \[\lim_{x\to\infty} f(x) = \lim_{x\to\infty} \frac{1}{x + 1/x} = 0\]

where the last equality follows from the quotient rule of the limit.

If instead, the function had been defined as

    \[f:[0,10]\mapsto\mathbb R, x\mapsto \frac{x}{x^2 + 1}\]

we knew that the global maxiumum must exist by the Weierstrass theorem (closed intervals of \mathbb R are compact), and here, it is sufficient to compare the interior solution x^* with the boundary points, i.e. f(1)=0.5 to f(0)=0 and f(10)\approx 0.1.

In summary, we can summarize our approach to finding the global maximizer as follows:

Algorithm: Solving the Unconstrained Maximization Problem.

  1. Find the set of initial candidates: interior points that satisfy the first order necessary condition (“critical points”) and border solutions – boundary points and points of non-differentiability.
  2. Compute the Hessian at every interior point and classify the set of candidates according to the table above.
  3. Keep all points that can not be ruled out as local maximizers.
  4. Find the candidate(s) that yield the largest value for f among all remaining candidates.
  5. In case of uncertainty about existence of the global maximizer, study the limit behavior of f.

 

Indeed, strictly speaking, this “cookbook recipe” for the solution of the unconstrained problem is everything you need to know in practice, and many undergraduate programs indeed teach nothing but this. However, as always, the person knowing the reasons for why some procedure works are much more likely to correctly perform all steps than the one who has only memorized the sequence, especially when things get tricky and non-standard. Also, the correct procedure itself is of course much easier to memorize if you know how it comes to be. Thus, feel encouraged to understand at least the intuition-based elaborations above, if not the formal details outlined in the companion script.

When discussing convexity of functions in the last chapter, we briefly noted that the property is also quite useful for optimization. Let’s see how precisely this is:

Theorem: Sufficiency for the Global Unconstrained Maximum.
Let X\subseteq\mathbb R^n be a convex set, and f\in C^2(X). Then, if f is concave and for x^*\in int(X), it holds that \nabla f(x^*) = \mathbf 0, then x^* is a global maximizer of f.

 

Analogously, it holds that:

Theorem: Sufficiency for the Global Unconstrained Minimum.
Let X\subseteq\mathbb R^n be a convex set, and f\in C^2(X). Then, if f is convex and for x^*\in int(X), it holds that \nabla f(x^*) = \mathbf 0, then x^* is a global minimizer of f.

 

The reason for this fact can be seen intuitively again from the univariate case: the second derivative of a concave (convex) function is non-positive (non-negative) everywhere. As such, the first derivative is non-increasing (non-decreasing), and any critical point x^* will only have points with f'(x)\geq 0 (f'(x)\leq 0) to its left and only points with f'(x)\leq 0 (f'(x)\geq 0) to its right. Thus, f increases (decreases) before the point and decreases thereafter – which defines a global maximum. However, be careful to note that there may be a plateau of global maxima (minima): it is not ruled out that f'(x) remains at 0 on some interval in the domain!

This is extremely powerful: when we have a concave objective function (and many classical economic functions are concave, e.g. most standard utility or production functions), then the critical point criterion (before only necessary for a local maximizer) is now sufficient for a global maximizer! This reduces our multi-step procedure defined in the algorithm above to just looking at the first-order condition, which represents a dramatic reduction in the effort necessary to solve the problem. Thus, it may oftentimes be worthwhile to check convexity and concavity of the objective function in a first step.

If you feel like testing your understanding of the optimization basics and our discussion of unconstrained optimization, you can take a short quiz found here.

Optimization with Equality Constraints

Let us take the next step to solving the general optimization problem by now allowing for equality constraints:

    \[ (\mathcal P)\hspace{0.5cm}\underset{x\in C(\mathcal P)}{\text{maximize }} f(x)\hspace{0.5cm}\text{where}\hspace{0.5cm} C(\mathcal P) = \{x\in\text{dom}(f): g_i(x) = 0\ \forall i\in\{1,\ldots,m\}\}.\]

Alternatively, you may read the problem in forms like

    \[\max f(x) \hspace{0.5cm}\text{subject to}\hspace{0.5cm} g_i(x) = 0\ \forall i\in\{1,\ldots,n\}.\]

Perhaps, you are familiar with the Lagrangian method — but unless you have had some lectures explicitly devoted to mathematical analysis in your undergraduate studies, I believe you are not familiar with its formal justification, and perhaps also not how we may apply it for multivariate functions. Thus, the following addresses precisely these issues, while restricting attention to the intuition when it comes to the formal details.

Before moving on, make yourself aware that what we are doing here is actually already quite useful in a broad range of applications, since the loss in generality from omitting inequality constraints from the problem formulation, depending on the application in mind, may not be too severe: consider for instance utility maximization with the budget constraint p_1x_1 + p_2x_2 \leq y, meaning that the consumer can not spend more on consumption than his income y. Now, if there is no time dimension (and thus no savings motive) and he has a utility function u(x_1,x_2) which strictly increases in both arguments, as we typically assume (“strictly more is strictly better”), the consumer will never find it optimal to not spend all his income, so that the problem is entirely equivalent to one with the equality constraint p_1x_1 + p_2x_2 = y!

Level Sets and Implicit Functions for Optimization

Do you recall our definition of the lower- and upper-level sets when we discussed the generalization of convexity to the multivariate case? Here, we need a closely related concept: the level set, which is both a lower and the upper level set given a certain value!

Definition: Level Set.
Let X\subseteq\mathbb R^n, g:X\mapsto\mathbb R, and c\in\mathbb R. Then, we call L_c(g) = \{x\in X: g(x) = c\} the c-level set of g.

 

This definition may look indeed familiar, in the sense that you should be able to describe L_c(g) with vocabulary that has been introduced very early on: think about what we typically would call a set \{x\in X: g(x) = c\} when c is a value in the codomain of g. As a hint, we denote it by g^{-1}[\{c\}]. … if you have practiced your notation, you will know from this that the level set is nothing but the pre-image of \{c\} under g! To see why we care about level sets is analytically straightforward: in our optimization problem, when there is only one constraint g, we are looking for solutions on the level set L_0(g), i.e. C(\mathcal P) = L_0(g), in words, the constraint set is nothing but the zero-level set of g! With multiple constraints, note that

    \[ \begin{split} C(\mathcal P) &= \{x\in X: g_i(x) = 0\ \forall i\in\{1,\ldots,m\}\} \\& = \{x\in X: (g_1(x) = 0\land g_2(x) = 0\land \ldots \land g_m(x) = 0)\} \\& = \{x\in X: g_1(x) = 0\} \cap \{x\in X: g_2(x) = 0\} \cap \ldots \cap \{x\in X: g_m(x) = 0\} \\&= \bigcap_{i=1}^m L_0(g_i), \end{split}\]

and that we are looking for solutions in the intersections of the level sets.

From a geometrical perspective, level sets are oftentimes useful to map three-dimensional objects onto a two-dimensional space. The most famous practical example is perhaps the one of contour lines as used for geographical maps, as shown below (figure taken from https://www.ordnancesurvey.co.uk/blog/2015/11/map-reading-skills-making-sense-of-contour-lines/, accessed at June 16, 2020):


680x495

Also in economics, we have various important examples: for instance budget sets (with the restriction that all money is spent) where the level is equal to the disposable income y, and utility functions with level sets L_{\bar u}(u) = \{x\in\mathbb R^2: u(x) = \bar u\} for the \mathbb R^2. For utility, we usually call sets L_{\bar u}(u) indifference sets, and their graphical representation in the \mathbb R^2 the indifference curve, which represents all points (x_1,x_2)\in X that yield the same utility level \bar u. For the Cobb-Douglas utility with equal coefficients, i.e. f(x) = \sqrt{x_1x_2}, this relationship is shown here (the RHS plot displays utility on the vertical axis):


855x334

Coming back to our pre-image interpretation, make sure to understand that one fixes a value of interest, c, that the function may take, for instance either the altitude of a coordinate point, or the utility level associated with a consumption choice. Then, the level set is the collection of all the points x that yield this value c when evaluating g at x, i.e. g(x) = c. Note that level sets may just consist of one element, be empty altogether, but conversely may also encompass even the whole domain of g, namely when g is a “flat” function.

Now that we have convinced ourselves that equality-constrained maximization is nothing but maximization on level sets, let us think about whether we can use this insight to reduce our new issue of solving the equality-constrained problem to the old one that we already know, the unconstrained problem. There is good news and bad news. The first piece of good news is that precisely this is indeed possible! The (full set of) bad news, however, is that things will get quite heavy in notation. To somewhat mitigate this problem, the following restricts attention to the scenario of one equality constraint — but the generalization entirely works in a one to one fashion, the respective result is given at the end of this section. Finally, the last piece of good news, and perhaps the most re-assuring fact is that while mathematically somewhat tedious to write down, the intuition of the approach is really simple.

To understand how we generally proceed to dealing with equality constrained problems, consider the following simple exemplary problem:

    \[\max_{x\in\mathbb R^2} x_1\cdot x_2^2 \hspace{0.5cm}\text{subject to}\hspace{0.5cm}x_1 + 2x_2 = 5.\]

Our solution approach is to reduce the problem to one without constraints, which we know how to solve from the previous section. The level set that we consider is L_g(0) for g:\mathbb R^2\mapsto\mathbb R, (x_1,x_2)\mapsto x_1 + 2x_2 - 5. Here, we can express the condition (x_1,x_2)\in L_g(0) through an explicit function for x_1 with argument x_2:

    \[(x_1,x_2)\in L_g(0)\hspace{0.5cm}\Leftrightarrow\hspace{0.5cm}x_1 = 5-2x_2\text{ and } x_2\in\mathbb R.\]

Thus, the constraint indeed fixes x_1 given x_2, and we are only truly free to choose x_2. Of course, you could also have gone the other way around to express x_2 as a function of x_1; the key take-away is that the constraint reduces the number of free variables by one. Imposing the restriction of the constraint onto the objective, the problem becomes

    \[\max_{x_2\in\mathbb R} x_2^2\cdot (5-2x_2).\]

The constraint need not be imposed separately, as it has already been plugged in for x_1 and is guaranteed to hold if we set x_1 = 5 - 2x_2 after solving for the optimal x_2. Thus, being able to plug in the explicit function representing the constraint into the objective has helped us to re-write the problem explicitly as an unconstrained one! If you need practice in unconstrained optimization, go ahead and solve it, as the only local maximizer (which is not a global maximizer because the objective diverges to +\infty as x_2\to-\infty).

The issue to transferring this idea to general applications is that coming up with the explicit function, depending on how g looks like, can be prohibitively complex or simply infeasible. In fact, this approach is very unlikely to work outside the context of linear constraints as we have seen it in the example above; already if we introduce square terms, e.g.

    \[g(x) = x_1^2 + x_2^2 - 5,\]

it is no longer possible to solve g(x) = 0 for x_1 as a function of x_2 explicitly, as you arrive at x_1 = \pm \sqrt{5 - x_2^2}, that is, two values rather than one that give a value for x_1 consistent with the constraint given x_2, but per our definition of a function, we can have only one value. This is why we need the concept of implicit functions.

To understand the concept of implicit functions, let us think again about why the explicit function was valuable. Starting from a point x_0 = (x_{0,1}, x_{0,2})' in the constraint set L_g(0), when we varied x_2 slightly, we were able to derive an explicit rule on how to adjust x_1 in response in order to remain on the level set. That is, when we write the explicit function as x_1 = h(x_2), for any movement \Delta x_2 from x_{2,0}, we knew that

    \[g(h(x_{2,0} + \Delta x_2), x_{2,0} + \Delta x_2) = 0\]

and the point (h(x_{2,0} + \Delta x_2), x_{2,0} + \Delta x_2)' would also be an element of the level set. So what was the key condition for this to work? Well, we can only “pull g back towards 0” by adjusting x_1 if the function g indeed moves with x_1, else, the variation in g induced by variation in x_2 can not be counterbalanced by adjustment of x_1 (e.g. if g(x_1,x_2) = x_1x_2 + 5 and \Delta x_2 = - x_{2,0}). Formally, it is necessary that \frac{\partial g}{\partial x_1} \neq 0. This is precisely the approach we want to generalize now. To be applicable to more general problems, we will weaken the two following strong requirements of the explicit function: (i) being able to derive an explicit mathematical expression for h(\cdot), and (ii) “global” validity of the adjustment rule, that is, applicability to arbitrarily large changes \Delta x_2. The motivation of (ii) is that we will again be concerned with local extremizers in our search, so that we are already happy with adjustment rules in small neighborhoods of some value for x.

The intuition above should help understand the rather abstract implicit function theorem presented below. To make it even more accessible, before introducing it, let us re-state the above in terms of a general problem in n arguments, i.e. X\subseteq\mathbb R^n and f,g:X\mapsto\mathbb R, and think about the conditions that we need to locally express the adjustment in one argument in response to variation in all others necessary to remain on the level set in this context.

Suppose that f and g are “smooth” in an appropriate sense, for now at least once continuously differentiable, and start from a point x_0 on the level set L_g(0). We are looking for a way to express a rule for how to stay on this level set locally, that is, in some neighborhood U of x_0, by defining a function of one variable in all others. Now, suppose that we start from x_0 and only marginally vary the components at positions 2,3,\ldots,n of x_0, but not the first component, say to \tilde x = x_0 + \Delta x where \Delta x_1 = 0 and \|\Delta x\| < \varepsilon where \varepsilon > 0 small. Because g is continuous, g(\tilde x) = g(x_{0,1}, x_{0,2} + \Delta x_2,\ldots x_{0,n} + \Delta x_n) should still lie “really close” to g(x_0) = 0 (g(x_0) = 0 holds because x_0\in L_0(g)!). By continuous differentiability of g, the first partial derivative of g is continuous, and we should be able to vary the first argument by \Delta x_1 to move g(x) back to zero, i.e. cancel any potential change in g induced by the marginal variation in the other arguments through a function h(\cdot) such that

    \[g(h(x_{0,2} + \Delta x_2,\ldots, x_{0,n} + \Delta x_n), x_{0,2} + \Delta x_2,\ldots, x_{0,n} + \Delta x_n) = g(x_0) = 0.\]

Of course, we typically need not pick the first element to do this, so that we may instead simply decompose x = (y,z) where y is the univariate variable that we want to adjust, and z are all the other variables that are allowed to move freely (note that although we write x = (y,z), we don’t require y to be the first element of x! You may e.g. have seen this in decompositions like x = (x_i, x_{-i}) that says that x is composed of the i-th element and all other elements). Rather, we can pick any variable for y that is able to induce variation in g in a neighborhood of x_0, i.e. any y with \frac{\partial g}{\partial y}(x^*) \neq 0, as by continuity of g‘s partial derivatives, this is sufficient to ensure that \frac{\partial g}{\partial y} will remain non-zero in some small neighborhood U around x_0.

To complete the theorem, consider one more interesting observation: if we define \tilde g(z) := g(h(z), z) for x_0 = (y_0, z_0)\in L_0(g), then around z_0, we have \tilde g(z) = 0, so that in the neighborhood we consider, \frac{d\tilde g}{dz} = 0. Assuming for now that h is differentiable, even though we can not compute h itself, we can actually use this to derive an expression for the derivative of h using the chain rule or respectively, the total derivative of \tilde g with respect to z:

    \[\begin{split} 0 = \frac{d\tilde g}{dz}(z) &= \frac{\partial g}{\partial y}(h(z),z)\frac{d y}{d z}(z) +  \frac{\partial g}{\partial z}(h(z),z)\frac{d z}{d z}(z) \\&= \frac{\partial g}{\partial y}(h(z),z) \cdot \nabla h(z) + \frac{\partial g}{\partial z}(h(z),z) \end{split}\]

where \frac{\partial g}{\partial z} denotes the vector of partial derivatives with respect to elements of z of length n-1, and the second line follows as \frac{d z}{d z}(z) is equal to I_{n-1}, the identity matrix of dimension n-1. Re-arranging the expression, noting that \frac{\partial g}{\partial y}(x(z)) is a non-zero real number around z_0 and we can thus divide by it, we get

    \[ \nabla h(x) = - \left (\frac{\partial g}{\partial y}(x(z))\right )^{-1}\frac{\partial g}{\partial z}(x(z)). \]

Finally, we are ready to put everything together by stating the result justifying our approach above:

Theorem: Univariate Implicit Function Theorem.
Let X_1\subseteq\mathbb R, X_2\subseteq\mathbb R^{n-1} and X:=X_1\times X_2, and g:X\mapsto\mathbb R. Suppose that g\in C^1(X), and that for a (y^*,z^*)\in X_1\times X_2, g(y^*,z^*) = 0. Then, if \frac{\partial g}{\partial y}(y^*,z^*) \neq 0, there exists an open set U\subseteq\mathbb R^{n-1} such that z^*\in U and h:U\mapsto\mathbb R for which y^* = h(z^*) and \forall z\in U: g(h(z),z) = 0. Moreover, it holds that h\in C^1(U) with derivative

    \[\nabla h(z) = -\left (\frac{\partial g}{\partial y}(h(z),z)\right )^{-1}\frac{\partial g}{\partial z}(h(z),z)\hspace{0.5cm}\forall z\in U.\]

 

Hopefully, the elaborations above help you in understanding the rather abstract statement of this theorem. The next section sheds light on how we end up using it for our optimization context.

The Lagrangian Method for One Constraint

The idea that we employ to make use of the implicit function theorem in our constrained optimization context is the following: we consider again a local maximizer as the starting point and contemplate how to characterize it, first through necessary conditions. For this, we locally re-write the constrained problem to an unconstrained one by making use of the implicit function. So, let us focus again on a local maximizer x^* = (y^*, z^*)', this time of the constrained problem

    \[\max_{x\in\mathbb R^n} f(x) \hspace{0.5cm}\text{subject to}\hspace{0.5cm}g(x) = 0.\]

This means that g(x^*) = 0 and for an \varepsilon > 0, for any x\in B_\varepsilon(x^*)\cap L_g(0), f(x)\leq f(x^*). If we now meet the requirement of the implicit function theorem, i.e. \frac{\partial g}{\partial y}(x^*)\neq 0, then we can define our implicit function h, for which especially y^* = h(z^*), and we also have a \delta > 0 such that:

    \[ \forall z\in B_\delta(z^*): f(h(z),z) \leq f(h(z^*),z^*). \]

Don’t worry too much about the \delta here, it is certainly >0, and represents the room of variation left for z when imposing that (y,z)\in B_\varepsilon(x^*) and simultaneously that y = h(z). The important message of this statement is that if x^* = (y^*, z^*) is a local maximizer in the constrained problem and \frac{\partial g}{\partial y}(x^*)\neq 0, then z^* is a local maximizer in the unconstrained problem

(1)   \begin{equation*}\max_{z\in \mathbb R^{n-1}} f(h(z), z),\end{equation*}

But as a local maximizer in an unconstrained problem, we know how to characterize z^* through necessary and sufficient conditions from the previous section! Hence, it should be possible to derive the same kind of conditions also for the constrained problem, and it indeed is. The following discusses how this transfer works.

Let us start with the first order condition: z^* must be a critical point of \tilde f(z) = f(h(z),z) = (f\circ x)(z) where x(z) = (h(z),z) as we already defined above. This means that \frac{d\tilde f}{dz}(z^*) = \mathbf 0. To obtain a workable first order condition for the constrained problem from this, we need to get rid of the unknown function h. In analogy to the derivation above, this can be achieved by applying the total derivative:

    \[\mathbf 0 = \frac{d\tilde f}{dz}(z^*) = \frac{\partial f}{\partial y}(h(z^*),z^*) \cdot \nabla h(z^*) + \frac{\partial f}{\partial z}(h(z^*),z^*).\]

Now, we just plug in what we know about \nabla h(z^*) from the implicit function theorem, and use that h(z^*) = y^*:

    \[\mathbf 0 = \frac{\partial f}{\partial y}(y^*,z^*) \cdot \left [-\left (\frac{\partial g}{\partial y}(y^*,z^*)\right )^{-1}\frac{\partial g}{\partial z}(y^*,z^*)\right ] + \frac{\partial f}{\partial z}(y^*,z^*).\]

Re-arranging terms and plugging in that x^* = (y^*, z^*), we get

    \[\mathbf 0 = \frac{\partial f}{\partial z}(x^*)-\left [\frac{\partial f}{\partial y}(x^*) \cdot \left (\frac{\partial g}{\partial y}(x^*)\right )^{-1}\right ]\frac{\partial g}{\partial z}(x^*) =: \frac{\partial f}{\partial z}(x^*) - \lambda \frac{\partial g}{\partial z}(x^*)\]

where we define \lambda as the expression in square brackets. Moreover, it holds that

    \[\frac{\partial f}{\partial y}(x^*) - \lambda \frac{\partial g}{\partial y}(x^*) = \frac{\partial f}{\partial y}(x^*) - \left [\frac{\partial f}{\partial y}(x^*) \cdot \left (\frac{\partial g}{\partial y}(x^*)\right )^{-1}\right ] \frac{\partial g}{\partial y}(x^*) = \frac{\partial f}{\partial y}(x^*) - \frac{\partial f}{\partial y}(x^*) = 0.\]

Thus, we can summarize more compactly

    \[ \nabla f(x^*) - \lambda\nabla g(x^*) = \mathbf 0\hspace{0.5cm}\Leftrightarrow\hspace{0.5cm}\nabla f(x^*) = \lambda\nabla g(x^*). \]

This is indeed what we were after: a characterization of the local maximizer x^* in the constrained problem using first derivatives. Note that this result is an implication of x^* being a local maximizer, such that it is a necessary condition for a local maximum. Let us write this down:

Theorem: Lagrange’s Necessary First Order Condition.
Consider the constrained problem \max_{x\in L_0(g)} f(x) where X\subseteq\mathbb R^n and f,g\in C^1(X). Let x^*\in  L_0(g) and suppose that \nabla g(x^*)\neq \mathbf 0. Then, x^* is a local maximizer of the constrained problem only if there exists \lambda\in\mathbb R: \nabla f(x^*) = \lambda \nabla g(x^*). If such \lambda\in\mathbb R exists, we call it the Lagrange multiplier associated with x^*.

 

The condition “\nabla g(x^*)\neq \mathbf 0” ensures that we can find the implicit function to express how we may remain within the level set around x^*. Conversely, this means that in comparison to the unconstrained problem, we need to take care of one more “border solution” candidate: singularities of the level set, i.e. points x^s where g(x^s) = 0 and \nabla g(x^s) = \mathbf 0.

Let’s take a moment to consider and formally justify the Lagrangian function, or simply, the Lagrangian. The necessary condition is “\exists\lambda\in\mathbb R: \nabla f(x^*) = \lambda \nabla g(x^*)“. Thus, for said \lambda, it is an easy exercise to verify that evaluated at x^*, the gradient of the following function is equal to \mathbf 0:

    \[\mathcal L(\lambda,x) = f(x) - \lambda g(x) = f(x) + \lambda (0 - g(x)).\]

We call this function the Lagrangian. Importantly, from our considerations in the previous section, the necessary condition for a solution to unconstrained maximization of the Lagrangian is \nabla \mathcal L(\lambda,x) = \mathbf 0, which is precisely the necessary condition for the constrained problem that we just derived. Further, note that the derivative of \mathcal L(\lambda, x) with respect to \lambda is just -g(x), so that requiring g(x) = 0 is equivalent to requiring \frac{\partial\mathcal L}{\partial \lambda}(\lambda,x) = 0. Thus, any interior solution candidate in the constrained problem is associated with a \lambda^*\in\mathbb R so that (\lambda^*, x^*) is a critical point of the Lagrangian function! As a technical comment, however, as you can convince yourself with the help of the companion script, the Lagrangian function never has maxima or minima, but only saddle points. Thus, be careful not to say that constrained maximization is equivalent to maximization of the Lagrangian: the Lagrangian has no local maximizers, and the latter problem never has a solution!

Still, the equivalence of critical points gives the usual Lagrangian first order conditions for a solution x^* that you may be familiar with:

\begin{tabular}{cc} $[x]:$ & $\frac{\partial \mathcal L}{\partial x}(\lambda^*, x^*) = \mathbf 0 \hspace{0.5cm}\Leftrightarrow\hspace{0.5cm} \nabla f(x^*) = \lambda^* \nabla g(x^*)$\\ $[\lambda]:$ & $\frac{\partial \mathcal L}{\partial \lambda}(\lambda^*, x^*) = 0\hspace{0.5cm}\Leftrightarrow\hspace{0.5cm} g(x^*) = 0$ \end{tabular}

 

This gives a system of n+1 equations in n+1 unknowns, representing our starting point for the search for solution candidates of any specific equality-constrained problem with one constraint. In the unconstrained case, we could consult the second derivative to further narrow down the set of candidates we needed to compare, and were able to also find a sufficient condition for local maximizers. Let us see how this generalizes to the constrained case. In doing so, we will confine ourselves to studying the theorem and not worry too much about how precisely it comes to be.

Definition: Leading Principal Minor.
Consider a matrix A = (a_{ij})_{i,j\in\{1,\ldots,n\}}\in\mathbb R^{n\times n}. Then, for k\leq n, the k-th leading principal minor of A, or the leading principal minor of A of order k, is the matrix obtained from eliminating all rows and columns with index above k from A, i.e. the matrix M^A_k = (a_{ij})_{i,j\in\{1,\ldots,k\}}\in\mathbb R^{k\times k}.

 

In words, the k-th leading principal minor of A corresponds to the upper k\times k block of A. Let’s consider an example. Let

    \[A = \begin{pmatrix} 1 & 4 & 3 & 2\\ 2 & 0 & 0 & 0\\ 3 & 4 & -1 & -2\\ 0 & 1 & e & \pi \end{pmatrix}.\]

Then, the leading principal minors of order 1,2,3 and 4, respectively, are

    \[ M^A_1 = (1),\hspace{0.5cm} M^A_2 = \begin{pmatrix} 1 & 4 \\ 2 & 0 \end{pmatrix},\hspace{0.5cm} M^A_3 = \begin{pmatrix} 1 & 4 & 3 \\ 2 & 0 & 0 \\ 3 & 4 & -1 \end{pmatrix},\hspace{0.5cm} M^A_4 = A. \]

Because it will be important for the theorem, answer the following: what are the last two leading principal minors of A?


M^A_3 and M^A_4.
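Leading principal minors and their determinants are also easy to extract programmatically, which is handy for checking your hand computations. Below is a minimal sketch in Python (the helper leading_principal_minors is our own name, not a library function):

```python
import numpy as np

def leading_principal_minors(A):
    """Return the leading principal minors M_1, ..., M_n of a square matrix A."""
    n = A.shape[0]
    return [A[:k, :k] for k in range(1, n + 1)]

# The example matrix from above.
A = np.array([[1, 4, 3, 2],
              [2, 0, 0, 0],
              [3, 4, -1, -2],
              [0, 1, np.e, np.pi]])

for k, M in enumerate(leading_principal_minors(A), start=1):
    print(f"order {k}: det = {np.linalg.det(M):.4f}")
```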

 

The last concept that we need for our “second order condition” is the Bordered Hessian Matrix, given by the Hessian of the Lagrangian function:

    \[H_{\mathcal L}(\lambda, x) = \begin{pmatrix} \frac{\partial^2\mathcal L}{\partial \lambda^2}(\lambda, x) & \frac{\partial^2\mathcal L}{\partial \lambda \partial x}(\lambda, x) \\ \frac{\partial^2\mathcal L}{\partial x\partial \lambda}(\lambda, x) & \frac{\partial^2\mathcal L}{\partial x \partial x'}(\lambda, x)\end{pmatrix} = \begin{pmatrix} 0 & -\nabla g(x) \\ -(\nabla g(x))' & H_f(x) - \lambda H_g(x)\end{pmatrix}\]

where H_f and H_g denote the Hessians of f and g, respectively.

Theorem: Sufficient Conditions for the Constrained Problem.
Consider the constrained problem \max_{x\in L_0(g)} f(x) where X\subseteq\mathbb R^n and f,g\in C^2(X). Let x^*\in L_0(g) and \lambda^*\in\mathbb R such that (\lambda^*, x^*) is a critical point of the Lagrangian function, i.e. \nabla f(x^*) = \lambda^*\nabla g(x^*) and g(x^*) = 0. Let m=1 denote the number of equality constraints (we write m to anticipate the generalization to multiple constraints below), and denote by M^{H_{\mathcal L}}_{m+1}(\lambda^*, x^*),\ldots, M^{H_{\mathcal L}}_{n}(\lambda^*, x^*) the last n-m leading principal minors of H_{\mathcal L}(\lambda^*, x^*), where M^{H_{\mathcal L}}_j denotes the leading principal minor of order m+j (so that M^{H_{\mathcal L}}_{n} is the full bordered Hessian). If

    • \forall j\in\{m + 1,\ldots, n\}: \text{sgn}(\det(M^{H_{\mathcal L}}_j)) = (-1)^m, then x^* is a local minimizer of the constrained problem.
    • \forall j\in\{m + 1,\ldots, n\}: \text{sgn}(\det(M^{H_{\mathcal L}}_j)) = (-1)^j, then x^* is a local maximizer of the constrained problem.

 

Here, sgn(x) denotes the sign function, equal to -1 if x<0, to 0 for x=0 and to 1 for x>0.

Note that we introduced the variable m=1 only to indicate how the theorem generalizes to multiple constraints. Further, note that (i) here, we only have a second order sufficient, but not a second order necessary condition, so our ability to rule out critical points based on failure of a necessary condition is comparatively limited (this is not too bad: usually, there are not too many critical points anyway), and (ii) as with the second order condition in the unconstrained case, we require the functions f and g to be C^2, i.e. twice continuously differentiable, such that again, the second order condition is subject to a stronger smoothness requirement. Typically, in economic problems, this regularity assumption will be satisfied.

Finally, a note on the search for global maxima: because the level set L_0(g) is usually not a convex set, the theorem for the unconstrained case does not transfer, such that a concave objective is no longer sufficient for a critical point to be the global maximum.

Lagrangian with One Equality Constraint: Example and Interpretation

To get a feeling for the Lagrangian method, before moving on to multiple constraints, let us consider an example here: consider the problem

    \[\max_{x\in\mathbb R^2} -\|x\|_2 \hspace{0.5cm}\text{subject to}\hspace{0.5cm} x_1 + x_2 = 1.\]

That is, we seek the vector with minimum Euclidean length \|x\|_2 = \sqrt{x_1^2 + x_2^2} in the \mathbb R^2 that satisfies x_1 + x_2 = 1 (for economic context, you may think of the length as a cost function and 1 as a production target). How do we approach this issue? First, a trick to get rid of the unwieldy square root; let’s consider the equivalent problem

    \[\max_{x\in\mathbb R^2} -(x_1^2 + x_2^2) \hspace{0.5cm}\text{subject to}\hspace{0.5cm} x_1 + x_2 = 1.\]

It is always worthwhile to think about this: can you re-write the objective as a simpler expression without changing the solutions? Just remember to plug the solutions into the original objective in the end; then this approach works just fine and may save you a lot of work!

In step zero, we think about points that the Lagrangian method won’t find: boundary points and singularities! However, because \mathbb R^2 is an open set, boundary points are not an issue. Next, at any x\in\mathbb R^2, the gradient of the constraint function g(x) = x_1 + x_2 - 1 is \nabla g(x) = (1,1)'\neq \mathbf 0, such that there are no singularities. Thus, we can restrict attention to critical values as found by the Lagrangian method. Make sure that you always think about this step!

In the first step, we look for our candidates using the first order necessary conditions, or the Lagrangian first order conditions (FOCs), given by

\begin{tabular}{cc} $[x]:$ & $\nabla f(x^*) = \lambda^* \nabla g(x^*)$\\ $[\lambda]:$ & $g(x^*) = 0$ \end{tabular}

So, let’s compute the gradient of f(x) = -\|x\|^2_2 (the one of g, we already took care of above):

    \[\nabla f(x) = -(2x_1,2x_2)'.\]

Thus, the Lagrangian FOC for x^* tells us that

    \[ \begin{pmatrix} -2x_1^*\\-2x_2^*\end{pmatrix} = \lambda^* \begin{pmatrix} 1\\1 \end{pmatrix}. \]

This gives either \lambda^* = x_1^* = x_2^* = 0 or x_1^* = x_2^* = -\lambda^*/2 \neq 0. Considering the FOC for \lambda, i.e. the constraint, the former case can be ruled out; plugging in the latter gives

    \[-\lambda^*/2 - \lambda^*/2 = 1\hspace{0.5cm}\Rightarrow\hspace{0.5cm} \lambda^* = -1,\ x_1^* = x_2^* = \frac{1}{2}.\]

Hence, there is just a unique candidate to consider: x^* = (1/2,1/2)'. Let’s consult the second order sufficient condition and see whether we can decide upon the type of extremum: recall the form of the Bordered Hessian above. Computing it for an arbitrary (\lambda, x) gives

    \[ H_{\mathcal L}(\lambda,x) = \begin{pmatrix} 0 & -1 & -1\\ -1 & -2 & 0\\ -1 & 0 & -2\end{pmatrix}. \]

Thus, it is the same irrespective of x and \lambda, and we don’t need to plug in our candidate solution (\lambda^*, x^*). Now, let’s apply the theorem. We have n=2 variables and m=1 constraint, so n-m=1, and indeed, we only have to compute the determinant of the last leading principal minor, i.e. the Bordered Hessian itself. Before doing so, let’s form expectations: the point is (i) a maximum if the determinant’s sign is (-1)^n = 1, i.e. \det(H_{\mathcal L}(\lambda^*,x^*))  > 0, (ii) a minimum if the determinant’s sign is (-1)^m = -1, i.e. \det(H_{\mathcal L}(\lambda^*,x^*))  < 0, and (iii) may be either of the two, or a saddle point, if \det(H_{\mathcal L}(\lambda^*,x^*)) = 0.

Applying our 3\times 3 determinant rule, all the right-diagonals yield a zero product, while we twice get (-1)(-1)(-2) = -2 on the left-diagonals, which enter the determinant negatively, such that \det(H_{\mathcal L}(\lambda^*,x^*)) = 0 - (-4) = 4 > 0, and we have indeed found a maximum!

To conclude, x^* = (1/2, 1/2)' minimizes the Euclidean length of vectors in \mathbb R^2 that satisfy x_1+x_2 = 1, and it has length
    \[\|x^*\|_2 = \sqrt{(1/2)^2 + (1/2)^2} = \frac{\sqrt{2}}{2}.\]
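If you want to cross-check such a solution numerically, a generic constrained solver will do; here is a minimal sketch using scipy.optimize.minimize (the solver setup and starting point are our own, arbitrary picks):

```python
import numpy as np
from scipy.optimize import minimize

# Maximizing -(x1^2 + x2^2) subject to x1 + x2 = 1 is the same as
# minimizing x1^2 + x2^2 subject to the same constraint.
res = minimize(
    fun=lambda x: x[0]**2 + x[1]**2,
    x0=np.array([0.0, 1.0]),  # arbitrary feasible starting point
    constraints=[{"type": "eq", "fun": lambda x: x[0] + x[1] - 1}],
)
print(res.x)                  # approx. [0.5, 0.5]
print(np.linalg.norm(res.x))  # approx. 0.7071 = sqrt(2)/2
```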

To conclude our investigations of the case of one equality constraint, let us focus on some intuition and interpretation of the method. First, for our somewhat abstract second order sufficient condition that more or less fell from the sky, there is something intuitive to note: the number of leading principal minors to consider. Recall that the gradient of g, \nabla g, was required not to have a “row rank deficit”, which for row vectors is a fancy way of saying that they are not zero (generally, a matrix A\in\mathbb R^{n\times m} has a row rank deficit if its rank is below n; this concept will be important in the multivariate extension below). In general, this means that all constraints in g move in independent directions at the solution; in our case of just one constraint, it suffices that the function g moves at all with variation in x around the solution, i.e. that we have not included a nonsense constraint such as 0=0, or even worse, -5=0 (also the seemingly smart attempt to re-write inequality constraints using indicator functions, e.g. x'x - 5 \leq 0 as \mathds {1}[x'x - 5 > 0] = 0, falls victim to this issue at every point where \mathds {1}[x'x - 5 > 0] is differentiable). As such, the constraint restricts m=1 directions of the \mathbb R^n in the maximization of f, but it leaves free the remaining n-m, which is exactly the number of leading principal minors that we have to consider! Intuitively, you can think about this in analogy to our discussion of the saddle point graph: we need to rule out that we have a minimum in one direction and a maximum in another.

The second, and perhaps far more important, piece of interpretation concerns the Lagrange multiplier. Have you ever heard the expression “shadow price of the budget constraint”? Then you may know that we associate it with the Lagrange multiplier in constrained utility maximization subject to the budget constraint. But what does it mean? Let’s consider a general constraint \tilde g(x) = y, but to ease interpretation, keep in mind that y may be the budget and we may have \tilde g(x) = \sum_{i=1}^n p_ix_i as the sum of expenditures on individual goods. Note that given \tilde g, we can express the optimal solution (assuming for now that it is unique) as x^*(y), i.e. it will depend only on the value of y (the budget available) that constrains the problem. Denote by F(y) = f(x^*(y)) the objective function’s value at the optimum (the indirect utility function U(y) = u(x^*(y))), and let G(y) = \tilde g(x^*(y)) be the implicit constraint function (expressing expenditures at the optimum, E(y) = \sum_{i=1}^n p_ix_i^*(y)). Then, note that by the chain rule,

    \[\frac{d}{dy} F(y) = \nabla f(x^*(y)) \frac{dx^*}{dy}(y) = \lambda^*\nabla g(x^*(y)) \frac{dx^*}{dy}(y) = \lambda^* \frac{d}{dy} G(y)\]

and thus

    \[ \lambda = \frac{\frac{d}{dy} F(y)}{\frac{d}{dy} G(y)}\hspace{0.5cm} \left (\lambda = \frac{\frac{d}{dy} U(y)}{\frac{d}{dy} E(y)}\right ).\]

In case of the utility maximization problem, \lambda tells us the ratio of the change in utility and the change in expenditures. Because the household always spends all of its income (provided that the utility function is strictly increasing in consumption), we have \frac{d}{dy} E(y) = 1 so that \lambda = \frac{d}{dy} U(y), i.e. \lambda measures the utility’s responsiveness to increases in y. Hence, we can conversely interpret it as the marginal utility cost of being budget-constrained! For the general problem, the intuition is similar, because if we continue to require \tilde g(x) = y, then trivially \frac{d}{dy} G(y) = 1, and we can interpret \lambda as the marginal change in the objective function associated with marginally relaxing the constraint. Note that whenever relaxing the constraint can not hurt the objective, as with a budget constraint and monotone utility, this means that the multiplier must be non-negative, i.e. \lambda \geq 0! (For a general equality constraint, relaxation may also hurt, and the multiplier can take either sign; in our length example above, \lambda^* = -1, as a higher production target forces longer vectors.) If \lambda = 0, we are in an interesting special case: here, the constraint does not matter, in the sense that x^*(y) is also a critical point of the unconstrained problem.

Indeed, this intuition will be crucial when we generalize our insights to inequality constraints: if an inequality constraint “matters”, i.e. if the requirement h_j(x)\leq 0 constrains our optimization opportunities, then it must have a non-zero “cost” in terms of the objective, that is, it is associated with a multiplier unequal to zero, and the constraint is binding at x^* in the sense that h_j(x^*) = 0. Otherwise, if h_j(x^*) < 0, we say that the constraint is slack: we could deviate marginally from x^* and still satisfy the constraint; it thus seems to have no value cost for the objective, and must be associated with a multiplier equal to 0. This suggests that when \mu_j is the multiplier associated with the constraint h_j\leq 0, we should have an equality of the form \mu_jh_j(x^*) = 0, called a complementary slackness condition (think about why this single equality encompasses the two cases just described). Indeed, next to the constraint itself holding, this is almost all we need to generalize the method to inequality constraints!

Consider again the Lagrangian (with constraint \tilde g(x) - y = 0):

    \[\mathcal L(\lambda, x) = f(x) + \lambda (y - \tilde g(x)).\]

Suppose now that we marginally relax the (budget) constraint, that is, assume that y marginally increases (the household gets more money). Then, note that

    \[\frac{\partial\mathcal L}{\partial y} = \lambda.\]
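That is, at the optimum, the multiplier measures exactly the marginal value of relaxing the constraint. You can also see this shadow-price logic numerically: solve the problem for two nearby values of y and compare the change in the optimal value with \lambda^*. Below is a small cross-check of ours for the length example with target y, reusing scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize

def max_value(y):
    """Optimal value F(y) of: max -(x1^2 + x2^2) subject to x1 + x2 = y."""
    res = minimize(lambda x: x[0]**2 + x[1]**2, x0=[0.0, y],
                   constraints=[{"type": "eq", "fun": lambda x: x[0] + x[1] - y}])
    return -res.fun

y, eps = 1.0, 1e-4
print((max_value(y + eps) - max_value(y)) / eps)  # approx. -1 = lambda*
```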

Lagrangian Optimization: Multiple Constraints

To conclude our studies of equality-constrained optimization problems, we need to briefly consider the method’s generalization to multiple constraints. There is no need for much discussion: all formal ideas and the intuition transfer from the case of one constraint. The only tricky part in the generalization is the rank condition for the Jacobian of the constraint function g= (g_1,\ldots,g_m)':\mathbb R^n\supseteq X\mapsto\mathbb R^m, the intuition of which we have already considered above. The reason for the close-to one-to-one transferability is that when starting from a problem of the form

    \[\max_{x\in C(\mathcal P)} f(x), \hspace{0.5cm}C(\mathcal P) = \{ x\in\mathbb R^n: (\forall i\in\{1,\ldots, m\}: g_i(x) = 0)\},\]

we can always define g as above so that

    \[C(\mathcal P) = \{ x\in\mathbb R^n: g(x) = \mathbf 0\}\]

which looks very much like the constraint set of the one-constraint problem, only that we are now dealing with a vector-valued function that must be equal to the zero vector. But, as you may recall from Chapter 1, vector spaces are endowed with algebraic structures very similar to those of the real numbers, which in many cases, including this one, allow for convenient and relatively straightforward generalizations. Indeed, we may rely on the multivariate implicit function theorem:

Theorem: Multivariate Implicit Function Theorem.
Let X_1\subseteq\underline{\mathbb R^m}, X_2\subseteq\underline{\mathbb R^{n-m}} and X:=X_1\times X_2, and g:X\mapsto\underline{\mathbb R^m}. Suppose that g\in C^1(X,\mathbb R^m), and that for a (y^*,z^*)\in X_1\times X_2, g(y^*,z^*) = \underline{\mathbf 0}. Then, if \underline{rk\left (\frac{\partial g}{\partial y}(y^*,z^*)\right ) = m}, there exists an open set U\subseteq\underline{\mathbb R^{n-m}} such that z^*\in U and h:U\mapsto\underline{\mathbb R^m} for which y^* = h(z^*) and \forall z\in U: g(h(z),z) = \underline{\mathbf 0}. Moreover, it holds that h\in C^1(U, \mathbb R^m) with derivative

    \[\underline{J_h(z)} = -\left (\frac{\partial g}{\partial y}(h(z),z)\right )^{-1}\frac{\partial g}{\partial z}(h(z),z)\hspace{0.5cm}\forall z\in U.\]

 

The important changes from the univariate theorem are underlined. You can see that the only thing that has really changed is the rank condition, which encompasses the univariate case: when m=1, rk\left (\frac{\partial g}{\partial y}(y^*,z^*)\right ) = m is equivalent to \frac{\partial g}{\partial y}(y^*,z^*) \neq 0. The rest is just an adjustment to the increased dimensionality of g. Note that in the general case, \frac{\partial g}{\partial y} is the matrix-valued Jacobian. The intuition is the same as before: for every constraint g_i(x) = 0, we need to be able to vary g_i independently of the other constraints so as to ensure that it holds in a neighborhood of the point of interest. As in the univariate case, we can not have m>n, i.e. more (meaningful) constraints than directions of variation.
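To make the derivative formula tangible, here is a small symbolic sanity check on a toy system of our own construction, where the implicit function h happens to be known explicitly (a sketch using sympy):

```python
import sympy as sp

y1, y2, z = sp.symbols("y1 y2 z", real=True)

# Toy constraints g(y, z) = 0 with known explicit solution h(z).
g = sp.Matrix([y1 + z**2 - 1,        # solved by y1 = 1 - z^2
               y2 - sp.exp(z)])      # solved by y2 = exp(z)
h = sp.Matrix([1 - z**2, sp.exp(z)])

dg_dy = g.jacobian([y1, y2])
dg_dz = g.jacobian([z])

# The theorem's formula J_h = -(dg/dy)^{-1} dg/dz should match h'(z).
print(sp.simplify(-dg_dy.inv() * dg_dz - h.diff(z)))  # zero matrix
```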

In analogy to above, this theorem allows us to derive the necessary first order condition that gives us the set of “interior” candidates for extrema:

Theorem: Lagrange’s Necessary First Order Conditions for Multiple Constraints.
Consider the constrained problem \max_{x\in L_{\mathbf 0}(g)} f(x) where X\subseteq\mathbb R^n and f\in C^1(X),g\in C^1(X,\mathbb R^m). Let x^*\in L_{\mathbf 0}(g) = \cap_{i\in\{1,\ldots,m\}}L_{0}(g_i) and suppose that rk(J_g(x^*)) = m. Then, x^* is a local maximizer of the constrained problem only if there exists \Lambda = (\lambda_1,\ldots,\lambda_m)'\in\mathbb R^m: \nabla f(x^*) = \Lambda' J_g(x^*). If such \Lambda\in\mathbb R^m exists, we call \lambda_i the Lagrange multiplier associated with x^* for the i-th constraint.

 

Here, instead of \nabla f(x^*) = \Lambda' J_g(x^*), you may be more familiar with the condition

    \[\nabla f(x^*) = \sum_{i=1}^m \lambda_i\nabla g_i(x^*).\]

You can check that these two characterizations are equivalent using the definition of the Jacobian. As before, this theorem gives us all the candidates for local extrema that we have to consider when searching for the global maximum/minimum. As in the one-constraint case, the i-th multiplier can be interpreted as the value cost of the i-th constraint at x^* (cf. the shadow price discussion above).

Determining the type of the candidates with the second order sufficient condition is exactly analogous to before (because we had already formulated the second order condition quite generally earlier):

Theorem: Second Order Sufficient Conditions for Multiple Constraints.
Consider the constrained problem \max_{x\in L_{\mathbf 0}(g)} f(x) where X\subseteq\mathbb R^n and f\in C^2(X),g\in C^2(X,\mathbb R^m). Let x^*\in L_{\mathbf 0}(g) = \cap_{i\in\{1,\ldots,m\}}L_{0}(g_i) and \Lambda^*\in\mathbb R^m such that (\Lambda^*, x^*) is a critical point of the Lagrangian function, i.e. \nabla f(x^*) = (\Lambda^*)' J_g(x^*) and g(x^*) = \mathbf 0. Denote by M^{H_{\mathcal L}}_{m + 1}(\Lambda^*, x^*),\ldots, M^{H_{\mathcal L}}_{n}(\Lambda^*, x^*) the last n-m leading principal minors of H_{\mathcal L}(\Lambda^*, x^*), where again M^{H_{\mathcal L}}_j denotes the leading principal minor of order m+j. If

    • \forall j\in\{m + 1,\ldots, n\}: \text{sgn}(\det(M^{H_{\mathcal L}}_j)) = (-1)^m, then x^* is a local minimizer of the constrained problem.
    • \forall j\in\{m + 1,\ldots, n\}: \text{sgn}(\det(M^{H_{\mathcal L}}_j)) = (-1)^j, then x^* is a local maximizer of the constrained problem.

 

Next to the slightly different first order condition, the only difference is the form of the Bordered Hessian. It still corresponds to the Hessian of the Lagrangian function; however, due to the multitude of constraints, this Hessian now looks slightly different:

    \[H_{\mathcal L}(\Lambda, x) = H_{\mathcal L}(\lambda_1,\ldots,\lambda_m, x) =  \begin{pmatrix} \mathbf{0}_{m\times m} & -J_g(x)\\ -(J_g(x))' & H_f(x) - \sum_{i=1}^m \lambda_i H_{g_i}(x) \end{pmatrix}. \]

You can try to compute this yourself using the definition of the Hessian and appreciating the vector structure of \Lambda. Note again that we need a stronger smoothness assumption for the second order sufficient condition, as f and g need to be C^2.

Equality Constraints and the Lagrangian: A Recipe

We have done quite some math to establish our approach to equality-constrained problems using the Lagrangian function. Having a good understanding of this analytic justification will greatly aid your general knowledge of econ-related math; however, it may be a bit overwhelming, such that the path to approaching a specific problem may not be crystal clear just yet. Thus, in our last discussion of problems with equality constraints, as we did for unconstrained problems, let us summarize our findings in a “cookbook recipe” to follow when confronted with specific applications.

Starting Point. Consider a problem

    \[\max_{x\in\text{dom}(f)} f(x)\hspace{0.5cm}\text{subject to}\hspace{0.5cm}\forall i\in\{1,\ldots,m\}: g_i(x) = 0\]

and suppose that f and g = (g_1,\ldots, g_m)' are sufficiently smooth, i.e. f\in C^2(X) where X = \text{dom}(f)\subseteq\mathbb R^n and g\in C^2(X,\mathbb R^m). This ensures that both our first and second order conditions are applicable. If there are points of non-differentiability, they have to be considered as border solution points separately.

Analytical Steps.

  1. Collect all candidate points that can not be identified via the “interior” (local extremum) method, i.e. boundary points of the domain X of f (if any), and singularities of the level set L_{\mathbf 0}(g), i.e. points x^* with rk(J_g(x^*))<m. Next, turn to the interior candidates.
  2. (First Order Necessary Condition/”FOC”) Compute the set of interior candidates, i.e. the critical points of the Lagrangian function \{x_1^*,\ldots, x_K^*\} and the vectors of associated multipliers \{\Lambda_1^*, \ldots, \Lambda_K^*\}.
  3. (Second Order Sufficient Condition/”SOC”) Can we identify the type of all candidates using the Bordered Hessian Criterion? Rule out Minima!
  4. (Value Comparison; only in case of multiple candidates) For all candidates x^* that remain until here (including those of step 1), compute f(x^*) and determine which point(s) remain(s) as candidates for the global extremizer.
  5. (Solution Existence) Argue that the global extremizer candidate you have found indeed yields the global extremum of interest by consulting the limits of f (within the constraint set) or using the Weierstrass theorem.

For the last step, for economic applications, it is oftentimes helpful to recall that the budget set, and sets defined in a very similar way, are compact, as are the intersections of a finite number of compact sets (as closedness and boundedness are preserved under finite set intersections). The “solution existence” part can also be taken care of in the first step, however, if you have to rely on a limits argument, it may be helpful to know f(x^*) for your global extremizer candidate x^*.
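For small problems, steps 2 and 3 lend themselves to symbolic computation. The following sketch runs the recipe on our length example from above (names and setup are ours; for other problems, swap in your own f and g):

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam", real=True)
f = -(x1**2 + x2**2)   # objective
g = x1 + x2 - 1        # equality constraint g(x) = 0
L = f - lam * g        # Lagrangian

# Step 2 (FOCs): critical points of the Lagrangian.
foc = [sp.diff(L, v) for v in (lam, x1, x2)]
candidates = sp.solve(foc, [lam, x1, x2], dict=True)
print(candidates)  # [{lam: -1, x1: 1/2, x2: 1/2}]

# Step 3 (SOC): bordered Hessian; with n = 2, m = 1, we check one minor,
# the full bordered Hessian, whose sign should be (-1)^n = +1 for a maximum.
H = sp.hessian(L, (lam, x1, x2))
for sol in candidates:
    print(sp.det(H.subs(sol)))  # 4 > 0: a local maximizer
```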

Inequality Constraints

The logic we employ to solve problems that also feature inequality constraints is the one outlined above: a solution candidate will be required to satisfy the Lagrangian conditions for all equality constraints, and further, for the inequality constraints, we will impose (i) feasibility, i.e. being contained in the constraint set, and (ii) our complementary slackness condition. The result we rely on is the Karush-Kuhn-Tucker theorem, which tells us the necessary and sufficient conditions for an optimum. Here, we only state it, as the application procedure is more or less analogous to the “Lagrangian cookbook recipe”, and the central take-away from this chapter should be the general foundation of unconstrained optimization and how we formally/intuitively justify the Lagrangian approach.

For the problem \mathcal P with only inequality constraints

    \[\max_{x\in\text{dom}(f)} f(x)\hspace{0.5cm}\text{subject to}\hspace{0.5cm} \forall j\in\{1,\ldots,k\}: h_j(x) \leq 0,\]

where f,h_j are all C^1 functions, the result may be summarized as follows:

Theorem: Karush-Kuhn-Tucker Theorem.
For \mu = (\mu_1,\ldots,\mu_k)'\in\mathbb R^k, consider the optimality conditions

    • (Feasibility) \forall j\in\{1,\ldots,k\}: h_j(x) \leq 0,
    • (FOC for x) \nabla f(x) = \sum_{j=1}^k \mu_j \nabla h_j(x),
    • (Complementary Slackness) \forall j\in\{1,\ldots,k\}: \mu_jh_j(x) = 0.

Then,

  1. (Necessity) If x^*\in dom(f) is a local maximizer of \mathcal P for which the set \{\nabla h_j(x^*): h_j(x^*)=0\} consists of linearly independent vectors, there exists \mu^*\in\mathbb R^k such that (x^*,\mu^*) satisfies the optimality conditions.
  2. (Sufficiency) If f is concave, every h_j is quasi-convex and there exists x_0\in dom(f) such that h_j(x_0) < 0 for all j\in\{1,\ldots,k\}, then, if the optimality conditions hold at x^*\in dom(f), x^* is a local maximizer of \mathcal P.

 

In contrast to the Lagrange theorem discussed earlier, what has changed in the Karush-Kuhn-Tucker (KKT) theorem is that the rank condition has become more complex still. Intuitively, when varying candidates x^*, the set of “relevant” constraints changes, because at different x^*, different inequalities h_j(x^*)\leq 0 will hold with equality. In consequence, we have a different rank condition at different x^*, expressed by the more unwieldy characterization above. Nonetheless, the concept is the same as before, and as there, we must separately treat points that violate this condition as additional “singularity” candidates, if there are any. The rest of the “necessity” part of the theorem follows the logic discussed earlier: points must be feasible, and either an inequality constraint holds with equality (h_j(x) = 0), or it does not constrain our ability to maximize f, i.e. the value of the objective function, locally (\mu_j = 0), which we summarize in the complementary slackness condition.
The “sufficiency” part of the theorem summarizes the key insights from the field of “convex optimization”. Deriving this result is beyond the scope of this class. However, note that when all conditions are met, that is, the linear independence rank condition (also called constraint qualification), the shape criteria for the objective and constraint functions, and \exists x_0\in dom(f)\forall j\in\{1,\ldots,k\}: h_j(x_0) < 0 (also called Slater’s condition) are all satisfied, then the optimality conditions are indeed equivalent to the local maximizer property, which greatly simplifies the search for local extremizers. Furthermore, note that for the search for local minimizers, the conditions are exactly the same, with the exception that f must be convex rather than concave to obtain sufficiency.
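As a numerical illustration of these conditions, here is a sketch with a toy problem of our own (note that scipy’s convention for inequality constraints is fun(x) >= 0, so we pass -h):

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: max -(x1 - 2)^2 - (x2 - 2)^2  subject to  h(x) = x1 + x2 - 1 <= 0.
# The unconstrained maximizer (2, 2) is infeasible, so the constraint should bind.
res = minimize(lambda x: (x[0] - 2)**2 + (x[1] - 2)**2,
               x0=[0.0, 0.0], method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: 1 - x[0] - x[1]}])
x = res.x
print(x, x[0] + x[1] - 1)  # approx. [0.5, 0.5], h(x*) approx. 0: binding

# Complementary slackness: h binds, so mu > 0; from the FOC for x,
# grad f = mu * grad h, i.e. (-2(x1-2), -2(x2-2)) = mu * (1, 1), so mu = 3 >= 0.
print(-2 * (x[0] - 2))  # approx. 3
```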
As you may have noted, thus far, we have not addressed problems with both equality and inequality constraints, which are, however, very relevant to economics – for instance, you can think of any equality-constrained problem with added non-negativity constraints for some choice variables. So, what about problems \mathcal P of the form

    \[\max_{x\in\text{dom}(f)} f(x)\hspace{0.5cm}\text{subject to}\hspace{0.5cm} \forall i\in\{1,\ldots,m\}: g_i(x) = 0\hspace{0.5cm}\land\hspace{0.5cm}\forall j\in\{1,\ldots,k\}: h_j(x) \leq 0?\]

Here, we also have a variant of KKT:

Theorem: Karush-Kuhn-Tucker with Equality Constraints.
For \Lambda = (\lambda_1,\ldots, \lambda_m)'\in\mathbb R^m and \mu = (\mu_1,\ldots,\mu_k)'\in\mathbb R^k, consider the optimality conditions

    • (Feasibility) \forall j\in\{1,\ldots,k\}: h_j(x) \leq 0 and \forall i\in\{1,\ldots,m\}: g_i(x) = 0,
    • (FOC for x) \nabla f(x) = \sum_{i=1}^m \lambda_i \nabla g_i(x) + \sum_{j=1}^k \mu_j \nabla h_j(x),
    • (Complementary Slackness) \forall j\in\{1,\ldots,k\}: \mu_jh_j(x) = 0.

Then, if x^*\in dom(f) is a local maximizer of the constrained problem for which the set \{\nabla h_j(x^*): h_j(x^*)=0\}\cup\{\nabla g_i(x^*):i\in\{1,\ldots,m\}\} consists of linearly independent vectors, there exist \Lambda^*\in\mathbb R^m and \mu^*\in\mathbb R^k such that (x^*,\Lambda^*,\mu^*) satisfies the optimality conditions.

 

Thus, the “inequality-only” KKT transfers with two caveats: the first is technical and refers to the fact that constraint qualification has become more complex: because all equality constraints always “bind”, i.e., hold with equality, they need to be considered in the linear independence test set. The second is more severe from a methodological point of view: sufficient conditions are no longer “readily” obtained, and for problems with both equality and inequality constraints, we usually only use necessary conditions. While this may sound unfortunate, in many applications, it suffices to check for solution existence and then compare the values of candidates that remain after the FOC, which are usually not many.
That being said, it may still be quite tedious — or at least, time-consuming — to apply KKT, and it is also more error-prone relative to the simpler Lagrangian method. Fortunately, in economics, we can frequently avoid KKT by simplifying the problem to an equality-only or entirely unconstrained problem. This shall be the purpose of our remaining discussion.

Problem Simplifications

In the following, let us consider when and how we can re-write seemingly inequality-constrained problems as equality-constrained, which greatly facilitates the analytic solution. To start off, consider again the general constrained maximization problem,

    \[\max_{x\in\text{dom}(f)} f(x)\hspace{0.5cm}\text{subject to}\hspace{0.5cm} \forall i\in\{1,\ldots,m\}: g_i(x) = 0\hspace{0.5cm}\land\hspace{0.5cm}\forall j\in\{1,\ldots,k\}: h_j(x) \leq 0.\]

Instead of solving it using KKT, we may be able to re-write it as

    \[\max_{x\in\text{dom}(f)} f(x)\hspace{0.5cm}\text{subject to}\hspace{0.5cm} \forall i\in\{1,\ldots,\tilde m\}: g_i(x) = 0,\]

which we can apply the more familiar and straightforward Lagrangian method to. Generally, there are two approaches to simplifying the problem: (i) imposing that an inequality constraint is binding, i.e. that it necessarily holds with equality (and thus is “not really an inequality constraint at all”), or (ii) dropping inequality constraints.

Let us study these methods in the context of the budget-constrained utility maximization problem we already introduced,

    \[\max_{x\in\mathbb R^n} u(x) \hspace{0.5cm}\text{subject to}\hspace{0.5cm} p'x - y \leq 0,\ \forall i\in\{1,\ldots,n\}: x_i\geq 0 \]

with price vector p such that p_i>0 for all i, and y>0. Utility functions that we typically consider are such that u(x) is strictly increasing in all arguments, i.e. \forall x\in\mathbb R^n\forall j\in\{1,\ldots,n\}:\frac{\partial u}{\partial x_j}(x) > 0, an assumption that we are usually ready to impose because strictly more is typically strictly better in the view of the consumer. Moreover, we assume that the marginal utility of the first unit of consumption is infinitely high, i.e. \lim_{x_j\to 0} \frac{\partial u}{\partial x_j}(x) = \infty. Common examples that satisfy these conditions are Cobb-Douglas or Constant Elasticity of Substitution utility functions.

The first simplification (“not really an inequality constraint”) can be applied to the budget constraint: a point x with p'x - y < 0 can never be a solution, because the consumer can spend more to increase utility! Thus, all solutions of the initial problem will be solutions of the re-written problem

    \[\max_{x\in\mathbb R^n} u(x) \hspace{0.5cm}\text{subject to}\hspace{0.5cm} p'x - y = 0,\ \forall i\in\{1,\ldots,n\}: x_i\geq 0. \]

Whenever an argument of this kind applies, i.e. the constraint must bind at any solution, the simplification is justified. Next, note that the marginal utility from spending money on good i is \frac{\partial u}{\partial x_i}(x)/p_i (because 1 dollar gives 1/p_i units of the good). For any x such that x_i = 0 and x_j > 0, we can marginally decrease spending on x_j, increase spending on x_i and, by our assumption of unbounded marginal utility as x_i\to 0, generate a large leap in utility. Thus, a point x with x_i = 0 can not be an optimum, the constraint x_i\geq 0 will never bind, and if x^* solves our problem, it is also a local maximizer in the problem

    \[\max_{x\in\mathbb R^n} u(x) \hspace{0.5cm}\text{subject to}\hspace{0.5cm} p'x - y = 0. \]

The Lagrangian approach to this problem gives \nabla u(x^*) = \lambda p, and we can ex-post impose that x^*\in\mathbb R^n_+, i.e. that all entries in x^* must be non-negative. In case of a bivariate Cobb-Douglas function u(x) = x_1^\alpha x_2^{1-\alpha}, \alpha \in (0,1), it may be a good exercise to formally verify that the optimal input ratio x_1^*/x_2^* is given by \frac{\alpha}{1-\alpha}\cdot\frac{p_2}{p_1}; pay special attention to the line of arguments that ensures that this is indeed the unique global maximum (a symbolic cross-check is sketched below). Thus, we can restrict the role of inequality constraints to feasibility of points in the Lagrangian problem, making them a further criterion in the “cookbook recipe” outlined above according to which we may rule out candidates identified from the FOC.
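If you want to let a computer algebra system do the bookkeeping for the Cobb-Douglas exercise, a sketch like the following works (symbol names are ours; we use the tangency condition implied by the Lagrangian FOC):

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2", positive=True)
alpha, p1, p2 = sp.symbols("alpha p1 p2", positive=True)

u = x1**alpha * x2**(1 - alpha)   # Cobb-Douglas utility

# The FOC grad u = lam * p implies the tangency condition MRS = p1/p2.
mrs = sp.simplify(sp.diff(u, x1) / sp.diff(u, x2))   # (alpha/(1-alpha)) * x2/x1
sol_x1 = sp.solve(sp.Eq(mrs, p1 / p2), x1)[0]
print(sp.simplify(sol_x1 / x2))   # alpha*p2/((1 - alpha)*p1)
```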

Accordingly, we may be able to reduce the problem to either a standard Lagrangian problem if we are able to argue that the inequality constraints are actually equality constraints, or solve an equivalent Lagrangian problem with an additional feasibility condition that rules out some of the critical values. Indeed, most optimization problems you face during the Master’s classes that feature inequality constraints can be reduced to such Lagrangian problems that you can solve using the methods discussed above.

As a final note on problem simplification, recall our motivation for deriving our one-equality-constraint procedure from the implicit function theorem: it may not always be possible to represent constraints using global, explicit functions. Sometimes, however, it is possible. If so, you can always get rid of the respective constraint: solve for the explicit function representing the constraint and plug it into the problem! This usually makes the problem much easier to solve, as you reduce its dimensionality and the number of constraints. As an example, you can consult the elaborations on the explicit function in the section “Level Sets and Implicit Functions for Optimization” that we used to re-write the two-dimensional constrained problem as a one-dimensional, unconstrained one; a small symbolic illustration follows below. When applying this trick, be sure to plug the explicit function not only into the objective, but also into the other constraints, if any.
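For instance, in our length example, the constraint admits the global explicit representation x_2 = 1 - x_1, which reduces the problem to an unconstrained univariate one (a sketch of ours):

```python
import sympy as sp

x1 = sp.symbols("x1", real=True)

# Substitute the explicit function x2 = 1 - x1 into the objective.
f_sub = -(x1**2 + (1 - x1)**2)

crit = sp.solve(sp.diff(f_sub, x1), x1)   # [1/2]
print(crit, sp.diff(f_sub, x1, 2))        # second derivative -4 < 0: a maximum
```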

Conclusion

We started by formally studying unconstrained optimization problems, which allowed us to highlight these problems’ key aspects and properties, to formally justify our first- and second-order condition solution approach, and to make clear how we may generalize the respective insights to the case where optimization is subject to equality constraints. For notational simplicity, we proceeded to a rigorous formal study of the problem with one equality constraint, where we outlined necessary and sufficient conditions for interior solutions. Thereafter, we generalized this approach to multiple constraints, exploiting the structural similarity of the vector space \mathbb R^n to the space of real numbers, and identified a stepwise procedure for solving general equality-constrained problems using the Lagrangian method. Subsequently, we investigated problems that feature inequality constraints (such as budget constraints and non-negativity constraints) and outlined possible approaches to reduce them to standard Lagrangian problems. In case this reduction is not applicable, we may consult the Karush-Kuhn-Tucker conditions. Finally, we saw that when we have an intuitive way of thinking about what “relaxing the constraint” refers to, we can use a simple and elegant way to determine the type of an extremizer that avoids computing a large number of matrices and their determinants. From our discussions in this chapter, it has emerged that a thorough understanding of the Lagrangian method, and especially the intuition for the Lagrange multipliers, suffices to understand the approach to most economic optimization problems.

To test how well you understood the second half of the discussions on optimization, you can do a short quiz found here.