Chapter 3 – Multivariate Calculus

An overview of this chapter’s contents and take-aways can be found here.

Broadly, Chapters 0 and 1 have covered topics of fundamental concern to all mathematical disciplines, while the elaborations of Chapter 1 were already more closely linked to what mathematicians call linear algebra, that is, the study of linear operations such as scalar multiplication and vector addition. Subsequently, Chapter 2 has discussed a central building block of linear algebra: characterizing and solving systems of linear equations. For the rest of this course, we want to move away from linear algebra and instead consider key issues in mathematical analysis, where we are concerned with analyzing mathematical objects, especially functions and related equations (say \forall x\in X: f(x) = cf'(x)), and investigating whether they are continuous, differentiable, invertible, have maxima and minima, and much more. This chapter deals with what is perhaps the most central building block of analysis: differentiation and integration of “general” functions, that is, functions mapping between vector spaces, without further restrictions on domain and codomain.

Introduction

As always, let us first consider why we as economists should bother with the concepts discussed here. While you certainly know how to e.g. take derivatives of functions mapping from \mathbb R to \mathbb R, being familiar with more general methods of (functional) analysis is invaluable because the typical functions we consider have more than one argument. To give one ubiquitous example, you may think about utility derived from a vector of goods (quantities). When thinking about optimization of these quantities, we cannot proceed without the insights from multivariate calculus.

In economic problems, we rarely care about the choice variables (e.g. quantities of goods consumed) directly, but rather about the outcome they produce (e.g. utility derived from consumption), and therefore, there is a function mapping choice into outcome in the background of almost every economic consideration. Thus, it should be clear that functional analysis is a key competence an economist should acquire. Of the tools of functional analysis, especially differentiation and integration are important for the economist: think about the concept of marginal willingness-to-pay or the aggregate welfare in an economy.

As stated initially, the functions that we study here map between vector spaces. There is some related terminology that one should be familiar with before moving on, so let us review it here in a first step. Recall from the introduction that we write a function f as

    \[f:X\mapsto Y, x\mapsto y=f(x).\]

This statement is a complete summary of all of f’s relevant aspects: the domain X, the codomain Y and the rule x\mapsto y that associates elements in the domain with the respective element in the codomain. Depending on what sets the domain and codomain come from, we attach different labels to f. The most important ones are:

    • If X\subseteq\mathbb R, we call f a univariate function.
    • If X\subseteq\mathbb R^n for n>1, we call f a multivariate function.
    • If Y\subseteq\mathbb R, we call f a real-valued function.
    • If Y\subseteq\mathbb R^m for m>1, we call f a vector-valued function.
    • If X and Y are sets of functions, we call f an operator.

To familiarize yourself with these expressions, think about the domain and codomain of

  1. a multivariate real-valued function
  2. a univariate vector-valued function

1. domain: vectors of length >1, codomain: real numbers; 2. domain: real numbers, codomain: vectors of length >1

 

and what labels we attach to

  1. the absolute value |\cdot| of a real number
  2. a norm \|\cdot\| of a vector of length n>1
  3. the scalar product

1. univariate real-valued function; 2. multivariate real-valued function; 3. multivariate real-valued function.

 

Basic Concepts

Let us begin with some basics. Here, we will have a brief discussion of invertibility, a concept which translates in a one-to-one fashion from univariate real-valued functions, and then consider the very important concepts of convexity and concavity.

Invertibility of Functions

Recall from our introductory discussion of functions in Chapter 0 that we could invert a function f:X\mapsto Y, X,Y\subseteq \mathbb R if and only if for any y\in Y, there exists exactly one x\in X such that f(x) = y. This is the case because then, and only then, can we identify a unique element x\in X that is mapped onto y for any y\in Y. Recall that when considering g:Y\mapsto X as a candidate for the inverse function, we require g to be defined everywhere on Y, i.e. for all y\in Y, and by definition of g as a function, g must take exactly one value x\in X for any y\in Y, rather than multiple values or no value at all. An alternative way to think about this: if for any y\in Y there exists exactly one x\in X such that f(x) = y, then knowing y is equivalent to knowing x when it comes to identifying the pair (x,y) in the graph of f. Since every y\in Y belongs to exactly one such pair, we can define the rule that associates y’s with x’s so that it yields the same pairs (x,y) as f. To see these abstract elaborations graphically, let’s consider two examples of functions where X and Y contain only three elements.



Here, the function f is represented by the blue arrows. As the green dotted boxes indicate, there are well-defined pairs (x,y), every one of which can be identified by knowing either the x- or the y-value. Hence, we can invert the rule of f and define the inverse function as the rule associating y’s with x’s to obtain the same pairs as we do under f, as indicated by the green arrows.



Here, again, f is represented by the blue arrows. However, now we have two sources of ambiguity in the attempt to invert the mapping of f: first, the value y_3 is associated with two x-values. By the definition of a function, we can, however, map y_3 only onto one x value when considering a mapping Y\mapsto X, so that it is not possible to define a value of the inverse function at y_3. Secondly, the value y_1 is associated with no x-value at all, and there is no candidate value for the inverse function to take at y_1.

What should be especially clear from these examples is that the conditions under which the inverse function exists or fails to exist are not specific to univariate, real-valued functions, but apply to any function f:X\mapsto Y with arbitrary domain X and codomain Y.

We can define invertibility elegantly using the concepts of injectivity and surjectivity.

Definition: Surjective Function.
Let f:X\mapsto Y for some sets X and Y. Then, f is said to be surjective if \forall y\in Y\exists x\in X:(f(x) = y), i.e. for every y in the codomain of f, there exists at least one element x in the domain that is mapped onto it.

 

Surjectivity rules out issues like the one we faced with y_1 in the example above. Note that next to the mapping rule of f (x\mapsto y=f(x)), surjectivity crucially depends on the set Y we choose to define f. Consider e.g. f(x) = x^2 where x\in\mathbb R. Is f surjective? It depends! If we define f: \mathbb R\mapsto \mathbb R, i.e. X = Y = \mathbb R, then any y\in Y, y<0 does not have an x\in X for which f(x) = y, so that f is not surjective. On the other hand, if we set Y=\mathbb R_+, then for any y\in Y there exists x=\sqrt{y}\in X with f(x)=y, and f is surjective! This principle holds true more generally: given the domain X that we consider, we can simply define Y=f[X] to “throw out” the values not mapped onto by f and ensure surjectivity (of course, in defining the inverse function, we must then pay special attention to restricting its domain to f[X]). As we have seen, surjectivity is the first requirement for the existence of the inverse function: it guarantees that for every y\in Y, we can find at least one element of X to map y onto. Now, we just have to know whether this element of X is unique. Enter injectivity:

Definition: Injective Function.
Let f:X\mapsto Y for some sets X and Y. Then, f is said to be injective if \forall x_1,x_2\in X: (x_1\neq x_2\Rightarrow f(x_1)\neq f(x_2)), i.e. every two different elements in X have a different image under f.

 

For the inverse function, injectivity rules out that for a y\in Y, we have two different elements x_1,x_2\in X so that f(x_1)=f(x_2) = y, like x_2 and x_3 in our example above. Coming back to the example of the square function, is f: \mathbb R\mapsto \mathbb R_+, x\mapsto x^2 injective? Clearly not: e.g. (-1)^2 = 1 = 1^2, so that f(-1) = f(1). Thus, whether we can invert a given function may also depend on the domain that we consider: setting e.g. f: \mathbb R_+\mapsto \mathbb R_+, x\mapsto x^2 also achieves injectivity because for x_1,x_2\in\mathbb R_+, if x_1\neq x_2 then also x_1^2\neq x_2^2. Thus, if f is defined on X=\mathbb R_+ (rather than X=\mathbb R), we can invert f on Y=\mathbb R_+ (rather than Y=\mathbb R) as f^{-1}:\mathbb R_+\mapsto\mathbb R_+, y\mapsto \sqrt{y}. Then, indeed for any x\in\mathbb R_+, f^{-1}(f(x)) = \sqrt{x^2} = x.

In terms of language, sometimes, we also call surjective functions onto, because they map onto the whole space Y, and injective functions one-to-one, because they map every one element in X to one distinct element in Y. Before moving on, a last definition:

Definition: Bijective Function.
Let f:X\mapsto Y for some sets X and Y. Then, f is said to be bijective if f is injective and surjective.

 

Clearly, if we have inverted f to the function f^{-1}, then the function f^{-1} is also invertible with (f^{-1})^{-1} = f. This allows us to conclude:

Definition: Inverse Function.
Let f:X\mapsto Y for some sets X and Y. Then, the function g:Y\mapsto X, y\mapsto g(y) such that \forall x\in X: g(f(x)) = x and \forall y\in Y: f(g(y)) = y is called the inverse function of f. We write g=f^{-1}.

 

Theorem: Existence of the Inverse Function.
Let f:X\mapsto Y for some sets X and Y. Then, the inverse function f^{-1} of f exists if and only if f is bijective.

 

Indeed, the proof of this theorem is really simple – it does nothing but put our verbal elaborations into a more formal mathematical argument. If you are curious, you can find it in the companion script.
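For functions between finite sets, the theorem can be checked mechanically. The following is a minimal Python sketch, not part of the course material: functions are stored as dictionaries, the names is_injective, is_surjective and inverse are made up for illustration, and the concrete arrows of f_good are an assumption (the first figure is not reproduced here). The arrows of f_bad follow the verbal description of the second example, where y_3 is hit by x_2 and x_3 and y_1 is not hit at all.

```python
def is_injective(f, domain):
    """True if no two distinct domain elements share an image."""
    images = [f[x] for x in domain]
    return len(set(images)) == len(images)


def is_surjective(f, domain, codomain):
    """True if every codomain element is the image of some domain element."""
    return {f[x] for x in domain} == set(codomain)


def inverse(f, domain, codomain):
    """Return the inverse mapping as a dict, or None if f is not bijective."""
    if not (is_injective(f, domain) and is_surjective(f, domain, codomain)):
        return None
    # Reverse every pair (x, y) in the graph of f.
    return {f[x]: x for x in domain}


X = ["x1", "x2", "x3"]
Y = ["y1", "y2", "y3"]

# Hypothetical bijective example: every y is hit by exactly one x.
f_good = {"x1": "y2", "x2": "y1", "x3": "y3"}
# Second example from the text: y3 is hit twice, y1 is never hit.
f_bad = {"x1": "y2", "x2": "y3", "x3": "y3"}

print(inverse(f_good, X, Y))  # the reversed pairs of f_good
print(inverse(f_bad, X, Y))   # None, because f_bad is not bijective
```

Note that the inverse is obtained simply by reversing the pairs in the graph of f, which is exactly the verbal argument given earlier.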

The considerations above reveal when a function is invertible, but do not address how we compute the inverse function if it exists. Formally, we are looking for the function g that we need to apply to f(x) to arrive at the initial value again, i.e. g such that g(f(x)) = x for all x in the domain of f. For univariate, real-valued functions such as the square function we discussed in our example, this is relatively straightforward as we know a variety of inverse relationships (on suitably restricted domains) – x^k and y^{1/k}, \ln(x) and \exp(y), \sin(x) and \arcsin(y), etc. For multivariate functions, things get more tricky. There is, however, one exception where we can easily compute the inverse function: a function f of the form

    \[f:X\mapsto X, x\mapsto Ax\]

with a square, invertible matrix A. In this case, it is easily seen that the inverse function must be characterized by f^{-1}(y) = A^{-1}y, as then,

    \[f^{-1}(f(x)) = A^{-1}\cdot(Ax) = x\hspace{0.5cm}\text{and}\hspace{0.5cm}f(f^{-1}(y)) = A\cdot (A^{-1}y) = y.\]
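To see the two identities at work numerically, here is a small Python sketch for the 2x2 case. The matrix A, the vector x and the helper names mat_vec and inverse_2x2 are illustrative assumptions; exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction


def mat_vec(A, x):
    """Matrix-vector product Ax for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def inverse_2x2(A):
    """Inverse of a 2x2 matrix via the determinant formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("A is singular, so x -> Ax has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical invertible matrix (determinant 1) and test vector.
A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(1)]]
A_inv = inverse_2x2(A)

x = [Fraction(3), Fraction(-4)]
y = mat_vec(A, x)            # y = f(x) = Ax
x_back = mat_vec(A_inv, y)   # f^{-1}(y) = A^{-1} y

print(x_back == x)                          # f^{-1}(f(x)) = x
print(mat_vec(A, mat_vec(A_inv, y)) == y)   # f(f^{-1}(y)) = y
```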

Before moving on, be reminded again to not confuse the inverse function f^{-1}(y) with the preimage of a set S, f^{-1}[S]!! The latter quantity is always defined, but captures a fundamentally different concept – it is a set potentially containing a multitude of values (or none at all), whereas f^{-1} is a function and f^{-1}(y) a concrete value in its codomain, existence of which however depends on bijectivity of f.

Convexity and Concavity of Multivariate Real-Valued Functions

In this subsection, we consider two elementary properties functions can have: convexity and concavity. We restrict attention to multivariate real-valued functions, i.e. those functions f:\mathbb R^n\mapsto\mathbb R that may take vectors as arguments but map into real numbers. The properties’ importance stems from optimization and will thus be emphasized in the next chapter.

First, be reminded of the definition of a convex set from chapter 1:

Definition: Convex Set.
A set X is said to be convex if for any x,x'\in X and any \lambda\in[0,1], \lambda x + (1-\lambda) x'\in X.

 

It is the set that contains any convex combination of its elements. Verbally, a set is convex if for any two points in the set, the line segment connecting them is fully contained in the set. To develop a bit more intuition for this concept, consider the following figure and determine which of the illustrated subsets of \mathbb R^2 are convex (think especially about the intuition of the line):




A and D are convex, B and C are not.

 

Definition: Convex and Concave Real Valued Function.
Let X \subseteq \mathbb{R}^n be a convex set. A function f: X \rightarrow \mathbb{R} is convex if for any x,y \in X and \lambda \in [0,1],

    \[f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda)f(y)\]

Moreover, if for any x,y \in X such that y \neq x and \lambda \in (0,1),

    \[f(\lambda x + (1-\lambda) y) < \lambda f(x) + (1-\lambda)f(y)\]

we say that f is strictly convex. Moreover, we say that f is (strictly) concave if -f is (strictly) convex.

 

Note that the definition of a concave real-valued function also requires that the function be defined on a convex domain, i.e. a set X which satisfies \forall x,y\in X\forall\lambda\in[0,1]: \lambda x + (1-\lambda)y\in X. For the most frequent cases, X=\mathbb R^n and X=\mathbb R^n_+, this is extremely straightforward to verify, and nothing you need to be scared of, but it should be kept in mind nonetheless. We require this in the definition because otherwise, f(\lambda x + (1-\lambda) y) is not always defined, and we cannot evaluate the inequality defining convexity/concavity. The definition of concavity using -f may be a bit awkward. To check concavity, you can equivalently consider the defining inequalities

    \[f(\lambda x + (1-\lambda) y) \geq \lambda f(x) + (1-\lambda)f(y)\hspace{0.5cm}\forall x,y\in X\forall\lambda\in[0,1]\]

and for strict concavity

    \[f(\lambda x + (1-\lambda) y) > \lambda f(x) + (1-\lambda)f(y)\hspace{0.5cm}\forall x,y\in X\text{ so that }x\neq y\text{ and }\forall\lambda\in(0,1).\]

For univariate real-valued functions, you are likely well-familiar with the graphical representation of these concepts: note that all points \lambda f(x) + (1-\lambda)f(y) with \lambda\in[0,1] lie on the line segment connecting f(x) and f(y). Then, convexity (concavity) states that the graph of f must lie below (above) this line segment everywhere between x and y. This relationship is illustrated in the figure below.



When considering functions with multiple arguments, the conceptual idea is similar, yet graphically more challenging to display. Let us have a look at a simple convex function defined in X \subseteq \mathbb{R}^2, say, f(x_1,x_2)= x_1^2 + x_2^2. It is composed of two strictly convex univariate functions (because the square function is strictly convex, this can be easily verified using the intuition from the figure above). Indeed, we can formally show that it is strictly convex according to our definition introduced above. Let’s see how the graph of this function looks:



The graph of f illustrated here, G(f) = \{(x,f(x)): x\in\mathbb R^2\} lies in \mathbb{R}^{3}. Recall that we consider real-valued functions that map to \mathbb R, and that the codomain of f corresponds to the third, or vertical dimension in the plot.
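If you want to convince yourself numerically before attempting a formal proof, you can test the defining inequality of convexity for f(x_1,x_2) = x_1^2 + x_2^2 on randomly drawn points. The Python sketch below is only a sanity check under assumed sampling choices (the box [-5,5]^2, the seed and the tolerance are arbitrary): finding no violation does not prove convexity, whereas a single violation would disprove it.

```python
import random


def f(x):
    """The example function f(x1, x2) = x1^2 + x2^2."""
    return x[0] ** 2 + x[1] ** 2


def violates_convexity(func, x, y, lam, tol=1e-9):
    """True if func(lam*x + (1-lam)*y) exceeds lam*func(x) + (1-lam)*func(y)."""
    mix = [lam * xi + (1 - lam) * yi for xi, yi in zip(x, y)]
    return func(mix) > lam * func(x) + (1 - lam) * func(y) + tol


random.seed(0)  # fixed seed so the experiment is reproducible
violations = 0
for _ in range(10_000):
    x = [random.uniform(-5, 5), random.uniform(-5, 5)]
    y = [random.uniform(-5, 5), random.uniform(-5, 5)]
    lam = random.random()
    if violates_convexity(f, x, y, lam):
        violations += 1

print("violations found:", violations)
```

Applying the same check to -f quickly produces violations, in line with -f being concave rather than convex.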

Like with the univariate function, for any two points x,y\in \mathbb{R}^2, i.e. points in the domain of f, we want to consider the line

    \[L(x,y) = \{\lambda x + (1-\lambda)y: \lambda\in [0,1]\} = \{x + \lambda(y-x): \lambda\in [0,1]\}\]

and investigate whether on this line, the function lies “below” the line segment connecting f(x) and f(y). The challenge in graphical representation added by multiple dimensions is that the line L(x,y) does not live in the same space as the domain anymore: for the univariate functions considered above, the domain was a subset of \mathbb R with dimensionality 1, as was the line in the domain connecting any two points x and y. Here, however, the domain is a subset of the \mathbb R^2 with dimensionality 2, and the line L(x,y) still only extends along one direction, that is, the vector z = y - x, and is therefore still of dimensionality 1.

This suggests that to graphically represent convexity of multivariate functions, for any two candidate points x and y, we have to restrict the plotted domain to 1 dimension. Intuitively, this dimension must capture the directionality of the line, and allow further extension to the left and right in order to get a plot like in the univariate case. This is indeed what we do: given the candidates x and y in the domain of f, we restrict the plotted domain to

    \[D(x,y) = \{x + \lambda(y-x): \lambda\in\mathbb R\}.\]

Clearly, this domain features only one direction along which it extends: z=y-x. This is precisely the directionality of the line piece we consider. To see the uni-dimensionality more directly, note that we can define the restriction of f on D(x,y) as

    \[g: \mathbb R\mapsto\mathbb R, t\mapsto f(x + tz)\]

where points on the line L(x,y) correspond to g(t) for t\in[0,1].
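The restriction to a line is easy to write down in code. The following Python sketch builds g for f(x_1,x_2) = x_1^2 + x_2^2 and two arbitrarily chosen points x and y (the points and the helper name restrict_to_line are assumptions for illustration).

```python
def f(x):
    """The example function f(x1, x2) = x1^2 + x2^2."""
    return x[0] ** 2 + x[1] ** 2


def restrict_to_line(func, x, y):
    """Return g(t) = func(x + t*z) with direction z = y - x."""
    z = [yi - xi for xi, yi in zip(x, y)]

    def g(t):
        return func([xi + t * zi for xi, zi in zip(x, z)])

    return g


x, y = [1.0, 0.0], [-1.0, 2.0]
g = restrict_to_line(f, x, y)

print(g(0.0), f(x))  # t = 0 corresponds to the point x
print(g(1.0), f(y))  # t = 1 corresponds to the point y
# At the midpoint t = 0.5, g lies below the average of its endpoint values.
print(g(0.5) < 0.5 * g(0.0) + 0.5 * g(1.0))
```

Here g is an ordinary univariate function, so everything you know about convexity on the real line applies to it directly.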

A graphical illustration of this technical procedure looks like this:



Clearly, we see that once the function is restricted to the uni-dimensional domain, our graphical investigation of convexity works as usual – we just need to judge upon the convexity of the restricted function g as defined above! Note, however, that the restriction is possible only after picking the points x and y! For investigations of convexity, we need to consider any possible combination of x and y, and most of them will give rise to different directionalities y-x and therefore different uni-dimensional restrictions and different plots!

In other words, (strict) convexity of a function f:X\mapsto \mathbb R corresponds to the scenario where for any x,y\in X with x\neq y, the restricted function g:\mathbb R\mapsto\mathbb R, t\mapsto f(x+t(y-x)) is (strictly) convex. As the following theorem shows, this conclusion holds not only intuitively but also formally:

Theorem: Graphical Characterization of Convexity.
Let X \subseteq \mathbb{R}^n be a convex set and f:X\mapsto \mathbb R. Then, f is (strictly) convex if and only if \forall x,z\in X such that z\neq \mathbf 0, the function g:\mathbb R\mapsto\mathbb R, t\mapsto f(x+tz) is (strictly) convex.

 

The proof, given in the companion script, can be helpful for you if you feel the need to practice dealing with formal investigations of convexity; the approach taken there is very similar to the majority of convexity proofs you come across in economic coursework.

Now that we have a proper idea of what convexity (and concavity as its opposite) looks like in more general vector spaces or respectively, for general multivariate real-valued functions f:\mathbb R^n\mapsto\mathbb R, we move to a related but weaker concept: quasi-convexity, with the natural opposite quasi-concavity. The reason is that for many applications, requiring convexity in the narrow sense as discussed above is too restrictive: consider the example of monotonic transformations. An increasing transformation of f is (g\circ f)(x) = g(f(x)) such that the function g(y) is increasing, i.e. y_1\geq y_2\Rightarrow g(y_1)\geq g(y_2). A decreasing transformation is the opposite, where y_1\geq y_2\Rightarrow g(y_1)\leq g(y_2). Strict versions with strict inequalities also exist. See also the definition of a monotonic function in the introductory chapter. Then, for a monotonic transformation of an initially convex function, it is not guaranteed that the resulting function will also be convex. As such, the narrow range of functions convexity (and concavity) applies to restricts our ability to perform general functional analysis. The appealing aspect of considering quasi-convexity instead is that while applying to a much broader class of functions, it preserves most of the convenient properties of convex functions that we are interested in.
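A concrete instance of convexity being lost under an increasing transformation can be checked in a few lines. In the Python sketch below, the choices f(x) = x^2, g(y) = \ln(1+y) and the evaluation points 1 and 3 are assumptions made purely for illustration: g is strictly increasing on \mathbb R_+, yet the composition g\circ f violates the convexity inequality at the midpoint.

```python
import math


def f(x):
    return x ** 2           # strictly convex


def g(y):
    return math.log(1 + y)  # strictly increasing for y >= 0


def h(x):
    return g(f(x))          # increasing transformation of f


x, y, lam = 1.0, 3.0, 0.5
mid = lam * x + (1 - lam) * y           # the midpoint, here 2.0
lhs = h(mid)                            # ln(5)
rhs = lam * h(x) + (1 - lam) * h(y)     # average of ln(2) and ln(10)

# Convexity of h would require lhs <= rhs at every such triple.
print("convexity inequality violated:", lhs > rhs)
```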

As you will see in the next chapter, the convexity of the upper-level set (for concave functions) and convexity of the lower-level set (for convex functions) are the specific characteristics of concave and convex functions one would wish to preserve. As multivariate convexity and concavity can be reduced to univariate ones, let us consider these concepts for the univariate case. For what follows, note that a subset S\subseteq\mathbb R of the real line is convex if and only if S is an interval, i.e. if there are -\infty\leq a \leq b\leq \infty such that S = [a,b], S = (a,b], S=[a,b) or S = (a,b) (cf. the definition of a convex set above).

Definition: Lower and Upper Level Set of a Function.
Let X\subseteq\mathbb R^n be a convex set and f: X \rightarrow \mathbb{R} be a real-valued function. Then, for c \in \mathbb{R}, the set

    \[L_c^- := \{x | x \in X, f(x) \leq c\},\]

is called the lower-level set of f at c, and

    \[L_c^+ := \{x | x \in X, f(x) \geq c\}\]

is called the upper-level set of f at c.

 

To understand what follows, it is crucial to note that both the lower and the upper level sets at any level c are subsets of the domain of f collecting arguments of the function that satisfy the respective level restriction (f(x) \leq c or f(x) \geq c) – in a two-dimensional plot, they correspond to collections of points on the horizontal axis!
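To make the point that level sets live in the domain, the following Python sketch computes both level sets of the square function at c = 4 on a discretized piece of the real line. The grid, the level c and the helper names are assumptions for illustration, and checking that the selected grid points are consecutive is only a discrete stand-in for "the set is an interval".

```python
def lower_level_set(f, grid, c):
    """Grid points x with f(x) <= c."""
    return [x for x in grid if f(x) <= c]


def upper_level_set(f, grid, c):
    """Grid points x with f(x) >= c."""
    return [x for x in grid if f(x) >= c]


def is_contiguous(subset, grid):
    """True if the subset is one run of consecutive grid points."""
    idx = [grid.index(x) for x in subset]
    return all(b - a == 1 for a, b in zip(idx, idx[1:]))


def square(x):
    return x ** 2


grid = [i / 10 for i in range(-30, 31)]  # -3.0, -2.9, ..., 3.0
lower = lower_level_set(square, grid, 4.0)
upper = upper_level_set(square, grid, 4.0)

print(lower[0], lower[-1], is_contiguous(lower, grid))  # one interval
print(is_contiguous(upper, grid))                       # two separate pieces
```

The lower-level set of the convex square function is a single interval, while its upper-level set splits into two pieces and is therefore not convex.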

If one considers a convex function and draws a horizontal line (a “level” line), the lower-level set of the function at this level, i.e. the set of elements x in the domain with an image below this line, is convex. Similarly, if one considers a concave function and draws the level line, the upper-level set of the function, containing those elements x in the domain with an image above this line, is convex. The following figure, plotting a strictly convex function left and a strictly concave function right, illustrates this relationship graphically:



Quasiconvexity and quasiconcavity are precisely defined so as to preserve these two characteristic properties (and only these):

Definition: Quasiconvexity, Quasiconcavity.
Let X\subseteq\mathbb R^n be a convex set. A real-valued function f: X \rightarrow \mathbb{R} is called quasiconvex if for all c \in \mathbb{R}, the lower-level set of f at c is convex. Alternatively, f is called quasiconcave if for all c \in \mathbb{R}, the upper-level set of f at c is convex.

 

The following is an often more workable characterization:

Theorem: Quasiconvexity, Quasiconcavity.
Let X\subseteq\mathbb R^n be a convex set. A real-valued function f: X \rightarrow \mathbb{R} is quasiconvex if and only if

    \[\forall x,y \in X\forall \lambda \in [0,1]:f(\lambda x + (1-\lambda)y) \leq \max\{f(x),f(y)\}\]

Conversely, f is quasiconcave if and only if

    \[\forall x,y \in X\forall \lambda \in [0,1]:f(\lambda x + (1-\lambda)y) \geq \min\{f(x),f(y)\}\]

 

In the spirit of the definitions above, we further have the following characterizations that can sometimes be helpful:

Definition: Strict Quasiconvexity, Strict Quasiconcavity.
Let X\subseteq\mathbb R^n be a convex set. A real-valued function f: X \rightarrow \mathbb{R} is called strictly quasiconvex if

    \[\forall x,y \in X\text{ such that }x\neq y\text{ and }\forall \lambda \in (0,1):f(\lambda x + (1-\lambda)y) < \max\{f(x),f(y)\}\]

Conversely, f is strictly quasiconcave if

    \[\forall x,y \in X\text{ such that }x\neq y\text{ and }\forall \lambda \in (0,1):f(\lambda x + (1-\lambda)y) > \min\{f(x),f(y)\}\]

 

When considering quasi-concavity and quasi-convexity, note that we are in fact dealing with a strict broadening of concepts: all convex functions are quasi-convex, and all concave functions are quasi-concave. This stems from the fact that we have defined the concepts from a characteristic feature of convex or respectively, concave functions.
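The max/min characterization from the theorem above lends itself to a quick numerical experiment. The Python sketch below evaluates all four inequalities for the cubic x\mapsto x^3 on a small grid of points and weights; the cubic, the grid and the helper names are assumptions chosen for illustration. As before, an inequality that holds on all samples is merely consistent with the property, while one that fails on some sample rules the property out.

```python
from itertools import product


def cube(x):
    return x * x * x  # monotonic on the whole real line


# Sample triples (x, y, lambda) from a small grid.
XS = [float(i) for i in range(-3, 4)]
LAMS = [0.0, 0.25, 0.5, 0.75, 1.0]
SAMPLES = list(product(XS, XS, LAMS))


def convex(f, x, y, lam):
    return f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y)


def concave(f, x, y, lam):
    return f(lam * x + (1 - lam) * y) >= lam * f(x) + (1 - lam) * f(y)


def quasiconvex(f, x, y, lam):
    return f(lam * x + (1 - lam) * y) <= max(f(x), f(y))


def quasiconcave(f, x, y, lam):
    return f(lam * x + (1 - lam) * y) >= min(f(x), f(y))


def holds_on_samples(condition, f):
    """True if the inequality holds at every sampled (x, y, lambda)."""
    return all(condition(f, x, y, lam) for x, y, lam in SAMPLES)


for condition in (convex, concave, quasiconvex, quasiconcave):
    print(condition.__name__, holds_on_samples(condition, cube))
```

On these samples, the cubic fails both the convexity and the concavity inequality, while the two quasi-inequalities hold throughout.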

To practice your understanding of this concept, considering the level sets of the functions graphically illustrated below, determine which are convex, concave, quasi-convex and quasi-concave.




linear: all, monotonic: not convex, not concave, quasi-convex, quasi-concave, camel back: none

 

Be sure to remember that, as you have seen in the examples, like concavity and convexity, quasi-concavity and quasi-convexity are not mutually exclusive. However, for the “non-quasi” concepts, the only functions satisfying both properties are linear (more precisely, affine) functions. Accordingly, we will call a function that is both quasi-concave and quasi-convex quasi-linear. While linear functions are indeed also quasi-linear, they are not the only functions with this property: for instance, as the example above has already hinted at, monotonic functions are another instance of quasi-linear functions! Of course, unlike the specific example you just saw above, monotonic functions can also be convex or concave, consider e.g. f_1(x) = x^2 or f_2(x) = \sqrt{x}, both defined on \mathbb R_+.

As a final note of caution before moving on, an established result is that convex and concave functions are continuous on the interior of their domain. Continuity was not a property we wanted to maintain when coming up with our definitions of quasi-convexity, and indeed, there are quasi-convex or quasi-concave functions that are discontinuous, e.g. indicator functions such as \mathds{1}[x>0].

Multivariate Differential Calculus

Wikipedia provides a good explanation of what calculus actually is about (https://en.wikipedia.org/wiki/Calculus, accessed August 03, 2019.):

“Calculus […] is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus. Differential calculus concerns instantaneous rates of change and the slopes of curves. Integral calculus concerns accumulation of quantities and the areas under and between curves. These two branches are related to each other by the fundamental theorem of calculus [stating that differentiation and integration are inverse operations].”

The remainder of this chapter is concerned with introducing you to both of these branches in the context of multivariate functions, with greater emphasis on differential calculus.

When introducing the matter of multivariate differential calculus, there are two central issues of interest: how differentiation of functions mapping between vector spaces works, and why. Given the complexity of the latter issue, the former is fascinatingly simple and relatively easy to understand. On the other hand, the formal justification of multivariate differential calculus is quite an extensive topic, and may be quite tough to absorb. Hence, the approach taken in this course is the following:

  1. Here in the online course, we will focus on the more application-relevant issue of how multivariate derivatives can be computed, and develop a (more or less) superficial intuition for the conceptual reason behind our method.
  2. In the companion script, you can find a comprehensive treatment of multivariate differential calculus consistent with our notation and relying only on results and knowledge already introduced and developed during this course. There, you can read everything about why multivariate differentiation works, including all required intermediate results and relevant proofs.

Starting Point: The Univariate Derivative

As repeatedly done before, let’s start from the simplest case: univariate real-valued functions f:X\mapsto \mathbb R where the domain is a subset of the real line: X\subseteq \mathbb R. Now, when asked what the instantaneous rate of change, or slope, of f at x_0\in X is, how would you go about answering this? Don’t think about rules on how to determine the slope, but rather about how to conceptually describe the slope of a general function f at an arbitrary point x_0!

One common characterization is that the slope tells us how sensitive f is to changes in x, i.e. how much f(x) varies relative to the variation in x. Let’s consider a fixed real change in the argument from x_0 to a fixed x\in X, where “real” means that the argument indeed changes, so that x\neq x_0. Let us define the change as h:=x-x_0 so that for the new argument x,

    \[x = x_0 + x - x_0 = x_0 +h.\]

Then clearly, h is equal to a fixed real number and |h|>0, so that h is not “infinitely small”. This consideration is very helpful because it allows us to characterize the relative change of f, i.e. the ratio of the variation in f and the one in the argument when moving from x_0 to x:

    \[\frac{\Delta f(x)}{\Delta x} := \frac{f(x) - f(x_0)}{x-x_0} = \frac{f(x_0+h) - f(x_0)}{h}. \tag{1}\]

Now, we know the relative change for any fixed change h\neq 0. This suggests that, when concerned with finding the relative change induced by a marginal, i.e. infinitely small variation in x, we should be able to derive it from letting h go to zero in equation (1)! We have to be careful about one detail: the expression in equation (1) is always well-defined for fixed h\neq 0 (as long as x_0+h\in X); a limit, on the other hand, is not guaranteed to exist.

Definition: Differentiability and Derivative of a Univariate Real-Valued Function.
Let X\subseteq\mathbb R and consider the function f:X\mapsto\mathbb R. Let x_0\in X. If

    \[\lim_{h\to 0} \frac{f(x_0+h) - f(x_0)}{h}\]

exists, f is said to be differentiable at x_0, and we call this limit the derivative of f at x_0, denoted by f'(x_0). If for all x_0\in X, f is differentiable at x_0, f is said to be differentiable over X or differentiable. If f is differentiable, the function f':X\mapsto\mathbb R, x\mapsto f'(x) is called the derivative of f.

 

Note the following crucial distinction: the derivative of f at x_0, f'(x_0), is a limit and takes a value in \mathbb R, i.e. it is a real number. On the other hand, the derivative of f, f', like f is a function that maps from X to \mathbb R!

To summarize in words what we just did, we defined the derivative by first looking at a fixed change h and then studying what happens to \Delta f(x)/h if h becomes infinitely small. If (and only if) we arrive at the same, well-defined rate f'(x_0) regardless of how we let h approach 0, this rate of marginal change is unique and well-defined, and we can use it to draw conclusions about the function’s behavior at x_0. This concept is extremely helpful because it allows us to study the local behavior of functions (i.e. in small neighborhoods around fixed points x_0).
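You can watch the difference quotient from equation (1) settle down as h shrinks. The Python sketch below does this for f(x) = x^2 at x_0 = 1 and, as a contrast, for the absolute value at x_0 = 0, where the quotient depends on the sign of h. The functions, points and step sizes are illustrative assumptions, and floating-point values for small h are only approximations of the exact quotients.

```python
def difference_quotient(f, x0, h):
    """Relative change (f(x0 + h) - f(x0)) / h for a fixed h != 0."""
    return (f(x0 + h) - f(x0)) / h


def square(x):
    return x ** 2


# For the square function at x0 = 1, the quotient approaches 2 from both sides.
for h in [1.0, 0.1, 0.01, 0.001, -0.001, -0.01]:
    print(h, difference_quotient(square, 1.0, h))

# For the absolute value at x0 = 0, the quotient is +1 for h > 0 and -1 for
# h < 0, so no common limit exists and |x| is not differentiable at 0.
print(difference_quotient(abs, 0.0, 0.001))
print(difference_quotient(abs, 0.0, -0.001))
```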

Function Notation and Conceptual Foundations

Above, we had already seen two levels of concepts at the heart of differential calculus: a derivative of a function at a point in the domain (real number) and a derivative function mapping points onto the derivative of the function at them. The third, and highest level concept is the derivative operator, a mapping between function spaces, associating the function f with its derivative f'. As the derivative operator can associate a derivative function only with functions f that are indeed differentiable, its domain corresponds to the set of once differentiable functions:

Definition: Set of k times Differentiable Functions.
Let X\subseteq\mathbb R. Consider a differentiable function f:X\mapsto\mathbb R. If its derivative f' is also differentiable, we say that f is two times differentiable, and call the derivative f'' of f' the second derivative of f. In analogy, we define f^{(k)}, the k-th derivative of f, recursively as the derivative of f^{(k-1)}, the (k-1)-th derivative of f. If f^{(k)} exists, we call f k times differentiable. For any k\in\mathbb N, we define

    \[D^k(X,\mathbb{R}) = \{f:X\mapsto\mathbb R: f\text{ is } k\text{ times differentiable over X}\}\]

as the set of univariate real-valued functions with domain X that are k times differentiable. Moreover, we define

    \[ C^k(X,\mathbb{R}) = \{f\in D^k(X,\mathbb R): f^{(k)}\text{ is continuous}\} \]

as the set of k times continuously differentiable functions, i.e. k times differentiable functions with continuous k-th derivative f^{(k)}.
If X=\mathbb R, we write D^k(X,\mathbb R) = D^k(\mathbb R) and C^k(X,\mathbb R) = C^k(\mathbb R).

 

Now we know the domain of the differential operator. As for the derivative, we impose no restrictions beyond it being a function, so the codomain of the differential operator is simply the space of all functions mapping from X to \mathbb R, which we denote as F_X:

    \[F_X:= \{f:X\mapsto \mathbb R\}.\]

This gives us everything we need to define the differential operator:

Definition: Differential Operator for Univariate Real-Valued Functions.
Let X\subseteq\mathbb R. Then, the differential operator is defined as the function

    \[ \frac{d}{dx}: D^1(X,\mathbb{R})\mapsto F_X, f\mapsto f' \]

where f' denotes the derivative of f\in D^1(X,\mathbb{R}).

 

Take the time to appreciate what this means. First, the derivative and the differential operator are not the same thing, indeed, they are fundamentally different functions, because one maps between function spaces and the other between sets of real vectors. The precise relationship is that the derivative of a specific function f (in the domain D^1(X,\mathbb{R}) of the differential operator) is a specific value (in the codomain) of the differential operator!

 

Secondly, you frequently see the expressions

    \begin{equation*} \frac{df}{dx}(x)\hspace{0.5cm}\text{and}\hspace{0.5cm} \frac{df(x)}{dx}. \tag{2} \end{equation*}

Note that, without further explanation, these expressions are not defined! So what should we make of them? Despite their frequent use, things are actually a bit tricky; the deliberations to follow give a thorough discussion. If you care less about these technical details, it is fine if you just memorize the take-away, as summarized below.

The formally correct way of writing the mapping rule of the derivative function, i.e. the rule x\mapsto y = f'(x) of the function f', is:

    \[f'(x) = \left [\frac{d}{dx}(f)\right ](x).\]

This states that we first evaluate the differential operator at f to obtain the derivative f': f' = \frac{d}{dx}(f); the resulting function is then evaluated at x. Because this looks a bit weird, one commonly writes f' = \frac{df}{dx}, so that the first expression in equation (2) represents a justified notational convention for evaluating the derivative function at specific points x\in X, or respectively, when x is interpreted as a variable argument, for the mapping rule x\mapsto y=f'(x) of the derivative function.

The second expression in equation (2) is supposed to refer to the same object. However, this is arguably problematic: f(x) is not a function (like f), but rather a specific value in \mathbb R, and thus not an element in the domain of the differential operator! As such, this notation blurs the lines between the concepts of functions and mapping rules, and may easily cause conceptual misunderstandings.

Nonetheless, it may be more convenient to work with the mapping rule directly as a reduced-form representation of the function f, especially when it is clear what the domain of f is. For instance, suppose that we want to compute the derivative of f:\mathbb R\mapsto\mathbb R, x\mapsto x^2 + \sin(x). Here, we know that the derivative of f will, like f, have domain \mathbb R, and to fully characterize it, we only need to compute its mapping rule. We write this rule as \frac{d}{dx}f(x), where f(x) is the mapping rule of f. Then, it is formally correct to write:

    \[ \frac{d}{dx}f(x) = \frac{d}{dx}(x^2 + \sin(x)) = \frac{d}{dx}x^2 + \frac{d}{dx}\sin(x) = 2x + \cos(x) \]

To summarize, in practice,

  1. It is justified to pull the function onto the numerator of the derivative operator, as in \frac{df}{dx}(x).
  2. It is imprecise and not advisable to pull the mapping rule onto the numerator of the derivative operator, as in \frac{df(x)}{dx}, e.g. \frac{d(x^2 + \sin(x))}{dx}.
  3. When working with mapping rules as a reduced form representation of the function, as e.g. in f(x) = x^2 + \sin(x), the correct way to express the derivative’s mapping rule is \frac{d}{dx}f(x) = \frac{d}{dx}(x^2 + \sin(x)).

To get used to this (perhaps less familiar, but in advanced texts more prominent!) way of handling derivatives, as an exercise, let us re-state the central rules for derivatives of univariate real-valued functions we considered in Chapter 0. Before doing so, because this is the first time we formally deal with function spaces, we first need to define the concept formally and introduce the basis operations we will use.

Theorem: Basis Operations in Function Spaces.
Let F be a set of functions with domain X and codomain Y. Then, the usual basis operations “+” and “\cdot” are such that

    \[\forall f,g\in F: f+g\text{ is such that } (\forall x\in X: (f+g)(x) = f(x) + g(x))\]

and

    \[\forall f\in F\forall \lambda\in\mathbb R: \lambda f\text{ is such that } (\forall x\in X: (\lambda\cdot f)(x) = \lambda \cdot f(x)).\]

If F is closed under these operations, then \mathbb F:= (F, +, \cdot) constitutes a vector space.

 

As a technical note, this is a theorem rather than a definition because it asserts that closure under the operations just defined is sufficient for the vector space property. Further, in this space, vector multiplication is defined as

    \[\forall f,g\in F: f\cdot g\text{ is such that } (\forall x\in X: (f\cdot g)(x) = f(x) \cdot g(x))\]

Note that all objects considered above (f+g, \lambda\cdot f and f\cdot g) refer to functions, and the defining statement refers to a property the function has to satisfy at every value in the domain.
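The pointwise nature of these operations can be sketched in a few lines of Python, where functions are ordinary values. The helper names add, scale and multiply and the example functions are hypothetical choices for illustration only.

```python
def add(f, g):
    """The function f + g, defined pointwise."""
    return lambda x: f(x) + g(x)

def scale(lam, f):
    """The function lam * f, defined pointwise."""
    return lambda x: lam * f(x)

def multiply(f, g):
    """The function f * g, defined pointwise."""
    return lambda x: f(x) * g(x)

f = lambda x: x ** 2
g = lambda x: 3 * x + 1

# Each result is again a function; we only get a number after evaluating it.
print(add(f, g)(2.0))       # f(2) + g(2)
print(scale(2.0, f)(3.0))   # 2 * f(3)
print(multiply(f, g)(2.0))  # f(2) * g(2)
```

Note that add(f, g) is a function, whereas add(f, g)(2.0) is a real number, which mirrors the distinction between f+g and (f+g)(x).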

Now, for the derivative rules:

Theorem: Rules for Univariate Derivatives.
Let X\subseteq\mathbb R, f,g\in D^{1}(X,\mathbb{R}) and \lambda, \mu\in\mathbb R. Then,

  1. (Linearity) \lambda f + \mu g is differentiable and

        \[\frac{d}{dx}(\lambda f + \mu g) = \lambda \frac{df}{dx} + \mu \frac{dg}{dx},\]

  2. (Product Rule) The product fg is differentiable and

        \[\frac{d}{dx}(fg) = \frac{df}{dx}\cdot g + f\cdot \frac{dg}{dx},\]

  3. (Quotient Rule) If \forall x\in X, g(x)\neq 0, the quotient f/g is differentiable and

        \[\frac{d}{dx}(f/g) = \frac{\frac{df}{dx}\cdot g - f\cdot \frac{dg}{dx}}{g\cdot g},\]

  4. (Chain Rule) if all the following expressions are well-defined, then g \circ f is differentiable and

        \[\frac{d}{dx}(g \circ f)= \left (\frac{dg}{dx}\circ f\right ) \cdot\frac{df}{dx}.\]

 

Note that all expressions in the theorem are sums, products and compositions of functions and therefore functions themselves! To make this distinction even clearer, let us re-consider the theorem for derivatives at specific points, which are no longer functions, but values in \mathbb R. Pay careful attention to where the argument x is put: never in the numerator of the differential expression!

Theorem: Rules for Univariate Derivatives at Specific Points.
Let X\subseteq\mathbb R, f,g:X\mapsto\mathbb{R} and \lambda, \mu\in\mathbb R. Let x_0\in X and suppose that f and g are differentiable at x_0. Then,

  1. (Linearity) \lambda f + \mu g is differentiable at x_0 and

        \[\frac{d(\lambda f + \mu g)}{dx}(x_0) = \lambda \frac{df}{dx}(x_0) + \mu \frac{dg}{dx}(x_0),\]

  2. (Product Rule) The product fg is differentiable at x_0 and

        \[\frac{d(fg)}{dx}(x_0) = \frac{df}{dx}(x_0)\cdot g(x_0) + f(x_0)\cdot \frac{dg}{dx}(x_0),\]

  3. (Quotient Rule) If g(x_0)\neq 0, then the quotient f/g is differentiable at x_0 and

        \[\frac{d(f/g)}{dx}(x_0) = \frac{\frac{df}{dx}(x_0)\cdot g(x_0) - f(x_0)\cdot \frac{dg}{dx}(x_0)}{g(x_0)^2},\]

  4. (Chain Rule) If all the following expressions are well-defined, then g \circ f is differentiable at x_0 and

        \[\frac{d(g \circ f)}{dx}(x_0) = \frac{dg}{dx}(f(x_0)) \cdot\frac{df}{dx}(x_0).\]
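
As a plausibility check of the rules at a specific point, the following Python sketch compares both sides of the product, quotient and chain rule numerically. The central-difference helper derivative_at, the step size and the choices f(x) = x^2, g = sin and x_0 = 0.7 are assumptions made for illustration; the check is approximate and does not replace the proof.

```python
import math

def derivative_at(f, x0, h=1e-6):
    """Central-difference approximation of the derivative of f at x0."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

f = lambda x: x ** 2
g = math.sin
x0 = 0.7  # g(x0) != 0, so the quotient rule applies

df = derivative_at(f, x0)
dg = derivative_at(g, x0)

product_lhs = derivative_at(lambda x: f(x) * g(x), x0)
product_rhs = df * g(x0) + f(x0) * dg

quotient_lhs = derivative_at(lambda x: f(x) / g(x), x0)
quotient_rhs = (df * g(x0) - f(x0) * dg) / g(x0) ** 2

# Chain rule: the derivative of g is evaluated at f(x0), not at x0.
chain_lhs = derivative_at(lambda x: g(f(x)), x0)
chain_rhs = derivative_at(g, f(x0)) * df

print(product_lhs, product_rhs)
print(quotient_lhs, quotient_rhs)
print(chain_lhs, chain_rhs)
```

In each printed pair, the two numbers should agree up to the numerical error of the approximation.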

 

Before moving on, make sure that you are thoroughly familiar with the three conceptual levels of differential calculus and know the differences between them (here summarized from highest to lowest level):

  1. Derivative Operator \frac{d}{dx}: Function that maps between function spaces; maps differentiable functions onto their derivative function,
  2. Derivative Function \frac{df}{dx}: Function that maps between spaces of real vectors: maps arguments x of a function f onto the derivative of f evaluated at x,
  3. Derivative at a Point \frac{df}{dx}(x_0): Element of a real vector space (e.g. \mathbb R^n): concrete value of the derivative function at a specific point.
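
The three levels can be mimicked in Python with a higher-order function. This is only a sketch under stated assumptions: d_dx is a hypothetical numerical stand-in for the derivative operator (it approximates derivatives by central differences instead of computing them exactly), and the example function is the one used above.

```python
import math

def d_dx(f, h=1e-6):
    """Level 1: takes a function and returns (an approximation of) its derivative function."""
    def f_prime(x):
        return (f(x + h) - f(x - h)) / (2 * h)
    return f_prime

f = lambda x: x ** 2 + math.sin(x)

f_prime = d_dx(f)      # level 2: a function, here approximately x -> 2x + cos(x)
value = f_prime(1.0)   # level 3: a real number, approximately 2 + cos(1)

print(value, 2 + math.cos(1.0))
```

The call d_dx(f)(1.0) reads exactly like the formally correct expression [\frac{d}{dx}(f)](x): first the operator is applied to the function, then the resulting function is evaluated at a point.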

From Univariate to Multivariate Derivatives

With the conceptual foundation laid, we are now prepared to have a look at how precisely we can derive insight about the local behavior of a function from computing the derivative. For what follows, let f be a function with domain X\subseteq\mathbb R and codomain \mathbb{R}. The existence and value of the derivative of a function f gives us three important pieces of information about f:

1. If f is differentiable at x_0\in X, then f is also continuous at x_0.

To see this, recall that in Chapter 0, we characterized continuity at x_0 by the property \lim_{x\to x_0}f(x) = f(x_0)! Now, consider the derivative of f at x_0,

    \[f'(x_0) = \lim_{h\to 0} \frac{f(x_0 + h) - f(x_0)}{h} = \lim_{x\to x_0} \frac{f(x) - f(x_0)}{x-x_0}.\]

This gives

    \[\begin{split} 0 &= 0\cdot f'(x_0) = \lim_{x\to x_0} (x-x_0)\lim_{x\to x_0} \frac{f(x) - f(x_0)}{x-x_0} \\&= \lim_{x\to x_0} \left((x-x_0) \frac{f(x) - f(x_0)}{x-x_0}\right ) \\&= \lim_{x\to x_0} [f(x) - f(x_0)]. \end{split}\]

where the third equality follows from the product rule of the limit. Because f(x_0) is a constant, \lim_{x\to x_0} [f(x) - f(x_0)] = [\lim_{x\to x_0} f(x)] - f(x_0), and the equation above becomes

    \[ f(x_0) = \lim_{x\to x_0} f(x).\]

2. If f is differentiable at x_0\in X, then there exists a “good” linear approximation to f around x_0 (called the Taylor Approximation).

We like linear functions because they are simple and we know how they work. Unfortunately, it is not likely that the functions involved in our applications are linear. If a given function is too complex to handle but differentiable around a point of interest, a good solution is often to work with a local linear approximation to the function rather than the function itself: let’s make no further restrictions on f than assuming differentiability. Consider the following approximation to f at x_0\in X:

    \[T_{1,x_0}(x) = f(x_0) + f'(x_0)(x-x_0),\]

the so-called Taylor approximation to f at x_0 of order 1 (because we only take the first derivative). This expression is a linear function of the deviation x-x_0 from the point of investigation, with the known values f(x_0) as intercept and f'(x_0) as slope. Now, what do we mean when we say that this is a “good approximation around x_0”? Denote by \varepsilon_1(x):= f(x) - T_{1,x_0}(x) the error we make when approximating f using T_{1,x_0} at x\in X. Because f is an arbitrary function, when x is far away from x_0, this error may be quite large. However, as we approach x_0, the error becomes negligibly small relative to the distance x-x_0! Formally:

    \[\frac{\varepsilon_1(x)}{x-x_0} =  \frac{f(x) - f(x_0) - f'(x_0)(x-x_0)}{x-x_0}=\frac{f(x) - f(x_0)}{x-x_0} - f'(x_0)\hspace{0.5cm}\overset{x\to x_0}\to\hspace{0.5cm}0\]

where the limit result follows from the definition of the derivative.
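
A small numerical illustration may help here. The sketch below uses f = exp at x_0 = 0 (an arbitrary choice made for this example) and prints the absolute error of the first-order approximation relative to the distance from x_0; the helper names taylor1 and relative_error are hypothetical.

```python
import math

def taylor1(f, f_prime, x0):
    """First-order Taylor approximation of f at x0, returned as a function of x."""
    return lambda x: f(x0) + f_prime(x0) * (x - x0)

def relative_error(f, T, x0, dist):
    """Absolute approximation error at x = x0 + dist, divided by |x - x0|."""
    x = x0 + dist
    return abs(T(x) - f(x)) / abs(dist)

x0 = 0.0
T1 = taylor1(math.exp, math.exp, x0)  # exp is its own derivative

for dist in [1.0, 0.1, 0.01, 0.001]:
    print(dist, relative_error(math.exp, T1, x0, dist))
```

The printed ratio shrinks together with the distance, which is what the limit statement asserts: the error vanishes faster than x-x_0 itself.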

The graphical intuition is illustrated in the figure below, where the point x_1 is too far away from x_0 for approximation quality to be decent, but x_2 is close enough to x_0 such that the Taylor approximation and the true function almost coincide — note especially that |\varepsilon_1(x_2)| is much smaller than |x_2-x_0|.


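[Figure: the function f and its first-order Taylor approximation at x_0, with the points x_1 and x_2 marked.]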

In practice, however, we usually don’t know how close is close enough: the Taylor approximation result is just a limit statement for moving arbitrarily close to the point x_0, and for specific functions, even at small but fixed distances such as x-x_0=0.0000001, the difference may be quite large, so treat this result with caution.

Note also that, if f is more than once differentiable, we can get an even better approximation by taking higher order derivatives, and computing the Taylor approximation as a polynomial of higher degree: the higher the polynomial order, the more flexible the function and the better the approximation (for instance, our graphical example looks rather close to -cx^2 + d, so that a second order approximation should fare much better on a wider neighborhood of x_0). Because this concept will be helpful repeatedly, let’s take the time to consider the formal definition.

Theorem: Taylor Expansion for Univariate Functions.
Let X\subseteq\mathbb R and f\in D^{k}(X,\mathbb R) where k\in\mathbb N\cup\{\infty\}, i.e. f is k times differentiable. Then, for N\in\mathbb N\cup\{\infty\}, N\leq k, the Taylor expansion of order N for f at x_0\in X is

    \[ f(x) = T_{N,x_0}(x) + \varepsilon_N(x) = f(x_0) + \sum_{n=1}^N \frac{f^{(n)}(x_0)}{n!}(x-x_0)^n + \varepsilon_N(x),\]

where \varepsilon_N(x) is the approximation error of T_{N,x_0} for f at x\in X, and n! = 1\cdot 2 \cdot \ldots \cdot (n-1)\cdot n denotes the factorial of n. Then, the approximation quality satisfies \lim_{h\to 0} \varepsilon_N(x_0+h)/h^N = 0. Further, if f is N+1 times differentiable, there exists a \lambda\in(0,1) such that

    \[\varepsilon_N(x_0+h) = \frac{f^{(N+1)}(x_0+\lambda h)}{(N+1)!}h^{N+1}.\]

 

Some things are worth noting: (i) in contrast to the Taylor approximation T_{N, x_0}(x) for f at x_0, the Taylor expansion is always equal to the function value f(x) because it encompasses the approximation error as an unspecified object, (ii) when considering small deviations from x_0 rather than arbitrary points x, it may at times be more convenient to express the expansion directly in terms of the deviation h=x-x_0 rather than x:

    \[ f(x_0+h) = T_{N,x_0}(x_0+h) + \varepsilon_N(x_0+h) = f(x_0) + \sum_{n=1}^N \frac{f^{(n)}(x_0)}{n!}h^n + \varepsilon_N(x_0+h),\]

and (iii) the statement \lim_{h\to 0} \varepsilon_N(x_0+h)/h^N = 0 says that higher N yield better approximations, because h^N goes “faster” to zero the larger N is. To see this, consider a small h, e.g. h=0.01, and compute h^1=0.01, h^2=0.0001, h^4=10^{-8} etc. Thus, when considering larger N, we can divide the error by ever smaller numbers and still have convergence to zero. Indeed, for many functions encountered in practice (the so-called analytic functions, e.g. polynomials, \exp, \sin and \cos), the error \varepsilon_N(x) vanishes as N\to\infty, so that the Taylor approximation of infinite order N=\infty is perfect, i.e. \varepsilon_\infty(x) = 0, at least for x sufficiently close to x_0. Be aware, however, that this is not guaranteed for every infinitely differentiable function.
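
The effect of the order N can be checked numerically. The sketch below uses f = exp and x_0 = 0, where every derivative at x_0 equals 1, so that the Taylor polynomial is simply the sum of h^n/n!; this choice of function and the helper name taylor_exp are assumptions made for the example.

```python
import math

def taylor_exp(N, h):
    """Taylor polynomial of exp of order N at x0 = 0, evaluated at x0 + h."""
    return sum(h ** n / math.factorial(n) for n in range(N + 1))

h = 0.1
for N in [1, 2, 4]:
    error = abs(math.exp(h) - taylor_exp(N, h))
    # The error itself and the error relative to h^N both shrink as N grows.
    print(N, error, error / h ** N)
```

For this function, raising the order from 1 to 4 at the same fixed h reduces the error by several orders of magnitude.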

An immediate corollary, and nonetheless a very useful one, of the Taylor expansion theorem is the following:

Corollary: Mean Value Theorem.
Let X\subseteq\mathbb R and f\in D^1(X,\mathbb R), i.e. f is a differentiable function. Then, for any x_1,x_2\in X such that x_2> x_1 and [x_1,x_2]\subseteq X, there exists x^*\in(x_1,x_2) such that

    \[f'(x^*) = \frac{f(x_2) - f(x_1)}{x_2 - x_1}.\]

 

The interested reader can check the companion script to see that this is indeed a direct implication of the Taylor expansion theorem. We will see later that this theorem is incredibly helpful for establishing existence of (local) maxima and minima.
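
To see the mean value theorem at work, the following Python sketch locates such a point x^* for f(x) = x^3 on the interval [0, 2]. The example function and the helper mean_value_point are illustrative assumptions; the bisection search additionally assumes that f' is continuous and that f' minus the secant slope changes sign on the interval, which holds here but is not part of the theorem.

```python
import math

def mean_value_point(f_prime, slope, lo, hi, tol=1e-12):
    """Bisection search for x with f_prime(x) = slope on [lo, hi].

    Assumes f_prime is continuous and f_prime - slope changes sign on [lo, hi].
    """
    g = lambda x: f_prime(x) - slope
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

f = lambda x: x ** 3
f_prime = lambda x: 3 * x ** 2
x1, x2 = 0.0, 2.0

slope = (f(x2) - f(x1)) / (x2 - x1)          # slope of the secant line
x_star = mean_value_point(f_prime, slope, x1, x2)
print(x_star, f_prime(x_star), slope)
```

Here the secant slope is 4, and solving 3x^2 = 4 by hand gives x^* = \sqrt{4/3}, which lies strictly between x_1 and x_2.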

3. If f is differentiable on the interval (a,b)\subseteq\mathbb R, then

  1. f is constant on (a,b) if and only if \forall x_0\in (a,b): f'(x_0) = 0,
  2. f is monotonically increasing (decreasing) on (a,b) if and only if \forall x_0\in (a,b): f'(x_0) \geq 0 (f'(x_0) \leq 0),
  3. If \forall x_0\in (a,b): f'(x_0) > 0 (f'(x_0) < 0), then f is strictly monotonically increasing (decreasing) on (a,b).

This fact relates the numeric value of the derivative to the shape of the function’s graph, helping us (like, in fact, points 1. and 2.) to get an idea of what the function looks like around some point x_0 without referring to the graphical illustration. As initially stated, this is the key motivation for us wanting to preserve these properties: like with the basis operations in the vector space, we strive to ensure that the objects we can no longer visually grasp behave in a fashion “similar to” those we can, allowing us to think more intuitively about highly abstract mathematical objects.

Note that 3.3. only provides a sufficient condition for strict monotonicity. As we will also see in the next chapter, there are strictly monotonic functions (e.g. f(x) = x^3 on \mathbb R) that have a zero derivative at some points in their domain.
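
To verify the example by hand: for f(x) = x^3 and any a<b, we have

    \[b^3 - a^3 = (b-a)(a^2 + ab + b^2) = (b-a)\left[\left(a + \tfrac{b}{2}\right)^2 + \tfrac{3}{4}b^2\right] > 0,\]

because b-a>0 and the term in square brackets is a sum of squares that can only be zero if a=b=0, which a<b rules out. Hence, f is strictly monotonically increasing on \mathbb R, although f'(x) = 3x^2 gives f'(0) = 0.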

If you feel like testing your understanding of the basics of functional analysis and univariate differential calculus discussed thus far, you can take a short quiz found here.

Defining a Multivariate Derivative

The last section has given an extensive discussion of the formal nature of the univariate derivative and its usefulness for functional analysis. Our next step is to look at how this concept generalizes to multivariate and vector-valued functions. In doing so, we will go over the central equations motivating this generalization here, but leave all thorough formal discussion to the companion script.

In the following, let us consider multivariate, but still real-valued functions, that is, functions of the form

    \[f:\mathbb R^n\mapsto\mathbb R,\hspace{0.5cm}n\in\mathbb N.\]

You are already familiar with many examples, e.g. the scalar product, vector norms, or utility or cost functions with multiple goods/inputs.

Now, how can we generalize the concept of the derivative? The fundamental issue is that unlike in \mathbb R, in \mathbb R^n we can move away from x_0 in multiple directions (recall that \mathbb R^n has n fundamental directions), and without specification of the direction, it is ex ante ambiguous what precisely we mean by a “small change”. Still, you will see shortly that this issue can be resolved with only a slight conceptual twist.

To define the multivariate derivative, we start from the univariate case. Recall that we call d^* the derivative of f at x_0 in the domain of f if

    \[\lim_{h\to 0} \frac{f(x_0+h) - f(x_0)}{h} = d^*.\]

Our conceptual problem now is that when the domain is multivariate, i.e. X\subseteq\mathbb R^n and n>1, then there is no clear notion of what we mean by “\lim_{h\to 0}”, as explained in the paragraph above. So, how can we rephrase this statement to something that does generalize to \mathbb R^n? Note that we can re-write the equation above as

    \begin{equation*} 0 = \lim_{h\to 0} \frac{f(x_0+h) - f(x_0)}{h} - d^* = \lim_{h\to 0} \frac{f(x_0+h) - f(x_0) - d^*\cdot h}{h}. \tag{3} \end{equation*}

A key fact that we can use now is the following:

Proposition: Continuity of the Norm.
Consider the normed vector space (\mathbb X, \|\cdot\|) where \mathbb X = (X,+,\cdot) is a real vector space. Then, the norm \|\cdot\| is continuous.

 

While sounding rather abstract, this fact is actually strikingly intuitive: recall the general definition of continuity of a real-valued function f at x_0 in its domain X:

    \[\forall \varepsilon > 0\ \exists \delta>0\ \forall x\in X:(\|x - x_0\|<\delta\Rightarrow |f(x) - f(x_0)|<\varepsilon)\]

Verbally, continuity says that if two arguments x and x_0 considered don’t lie far apart, the function values don’t lie far apart either. Consider now the scenario where f(x) = \|x\|. Clearly, for two vectors x and x_0 to lie “close” to each other, a necessary condition is that they are of similar length – which is precisely what is measured by the norm of the vectors!

To see how this helps us, recall that we can pull the limit in and out of any continuous function. Thus, the proposition above tells us that it can especially be pulled in and out of norms. Recall that on \mathbb R, the absolute value |\cdot| is a norm. Further, the norm property ((i): non-negativity) yields that for x\in\mathbb R, |x| = 0 \Leftrightarrow x=0. Therefore, equation (3) is equivalent to

    \[\begin{split} 0 &= \left | \lim_{h\to 0} \frac{f(x_0+h) - f(x_0) - d^*\cdot h}{h}\right | =  \lim_{h\to 0} \left |\frac{f(x_0+h) - f(x_0) - d^*\cdot h}{h}\right |  \\&=  \lim_{h\to 0} \frac{|f(x_0+h) - f(x_0) - d^*\cdot h|}{|h|} \end{split}\]

where the second step follows from norm continuity and the third because for any a,b\in\mathbb R, |a/b| = |a|/|b|.

Finally, because |\cdot| is a norm on \mathbb R, we can argue (see the mini-excursion below) that the expression derived above is indeed equivalent to

    \begin{equation*} \lim_{h\to 0} \frac{\|f(x_0+h) - f(x_0) - d^*h\|_a}{\|h\|_b} = 0 \tag{4} \end{equation*}

where \|\cdot\|_a and \|\cdot\|_b are any two norms on \mathbb R. This expression appears much more suitable for generalization: regardless of the dimensionality of h and the codomain of f, the norms in the numerator and denominator will continue to be real-valued, such that the fraction exists also for higher-dimensional functions.

To summarize the deliberations above, we started from equation (3), an equivalent way to define the univariate derivative. The difficulty in generalizing this expression to the multivariate case was division by a term that would become a vector. With equation (4), we found an equivalent way to express the derivative, where now, we have norms in numerator and denominator, and these norms will always be real-valued, regardless of whether we use real numbers or vectors as inputs. Because all of the reasoning relied on equivalent representations, there is no loss in generality when working with equation (4) rather than equation (3) or respectively, the initial definition of the univariate derivative!

Mini-Excursion: The last step is without loss of generality!
You would be very right to argue that the last step, going from the absolute value as a specific example of a norm on \mathbb R to the general norm notation, is quite a stretch: even if the absolute value induces the natural metric of \mathbb R, there is of course a great variety of other norms we could potentially use. So is the expression in equation (4) with arbitrary norms on \mathbb R really equivalent to the representation with the absolute value?

Yes! This is due to the fact that for any norm \|\cdot\| that we can come up with for \mathbb R, there is a constant c>0 such that for all x\in\mathbb R, it holds that \|x\| = c\cdot |x|. If we use this same norm in the numerator and the denominator of equation (4), the constant c cancels, and we arrive at the statement in absolute values.

But why does this hold? This relationship is ensured by the absolute homogeneity of the norm: note that for any x\in\mathbb R, \|x\| = \|x\cdot 1\| = |x|\cdot \|1\|, so that c = \|1\| > 0.

Therefore, not only when replacing |\cdot| by a general norm, but even if we use different norms in the numerator and denominator, say \|\cdot\|_a and \|\cdot\|_b, respectively, we get constants c_a, c_b > 0 so that

    \[\forall x\in\mathbb R:\ \|x\|_a = c_a\cdot |x|\text{ and }\|x\|_b = c_b\cdot |x|.\]

Hence,

    \[ \lim_{h\to 0} \frac{\|f(x_0+h) - f(x_0) - d^*h\|_a}{\|h\|_b} = \frac{c_a}{c_b}\cdot \lim_{h\to 0} \frac{|f(x_0+h) - f(x_0) - d^*h|}{|h|} \]

and one expression is equal to zero if and only if the other is as well.

 

With the equivalent representation of the univariate derivative given in equation (4), we are (almost) set up for generalizing the derivative concept. For the case of real-valued functions f, we can stick with the absolute value as the norm in the numerator of the limit expression defining the derivative, whereas in the denominator, we make use of a norm for the domain. This inspires the following definition:

Definition: Multivariate Derivative of Real-valued Functions.
Let X\subseteq\mathbb R^n, f:X\mapsto\mathbb R and \|\cdot\| a norm on \mathbb R^n. Further, let x_0\in int(X) be an interior point of X. Then, f is differentiable at x_0 if there exists d^*\in\mathbb R^{1\times n} such that

    \[ \lim_{\|h\|\to 0}\frac{|f(x_0+h) - f(x_0) - d^*h|}{\|h\|} = 0. \]

Then, we call d^* the derivative of f at x_0, denoted \frac{df}{dx}(x_0) or D_f(x_0). If f is differentiable at every x_0\in X, we say that f is differentiable, and we call D_f:X\mapsto \mathbb R^{1\times n}, x\mapsto D_f(x) the derivative of f.

 

Because h is no longer a scalar but now a vector, to be able to subtract d^*h from f(x_0+h) - f(x_0), we require d^*h to be uni-dimensional, and therefore, d^* must now be a row vector of length n. The only thing that we don’t know yet is how we let a norm go to zero, that is, how precisely we can understand \lim_{\|h\|\to 0}.

Formally, we call g_0 the limit \lim_{\|h\|\to 0} g(h) of a function g: X\mapsto \mathbb R if

    \[ \forall \varepsilon > 0 \exists \delta > 0: (h\in B_\delta(\mathbf 0) \Rightarrow |g(h) - g_0| < \varepsilon).\]

That is, we consider a ball of radius \delta around \mathbf 0, the only element of \mathbb R^n with norm zero, and study the function’s behavior on this potentially very small ball. This solves the issue of approaching x_0 from multiple directions, because the ball contains small deviations h in every direction! This concludes our definition of the derivative of multivariate real-valued functions.
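
The definition can be probed numerically. In the sketch below, the function f(x_1,x_2) = x_1x_2, the point x_0 = (2,3) and the candidate row vector d^* = (3,2) are assumptions chosen for illustration, and the Euclidean norm plays the role of \|\cdot\|. The code evaluates the fraction from the definition for deviations h of shrinking length in several directions.

```python
import math

def f(x):
    return x[0] * x[1]

x0 = (2.0, 3.0)
d_star = (3.0, 2.0)  # candidate derivative at x0 (an assumption to be checked)

def ratio(h):
    """|f(x0 + h) - f(x0) - d* h| / ||h||, using the Euclidean norm."""
    shifted = (x0[0] + h[0], x0[1] + h[1])
    linear_part = d_star[0] * h[0] + d_star[1] * h[1]
    return abs(f(shifted) - f(x0) - linear_part) / math.hypot(h[0], h[1])

# Shrink h along several directions, not just along the coordinate axes.
for angle in [0.0, 1.0, 2.5, 4.0]:
    for radius in [1e-1, 1e-3, 1e-5]:
        h = (radius * math.cos(angle), radius * math.sin(angle))
        print(angle, radius, ratio(h))
```

In every direction, the ratio shrinks together with the radius, which is consistent with d^* being the derivative of f at x_0. A finite number of directions can of course only illustrate the definition, not prove it.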

 

The next step is to consider the general case of \mathbb R^m, m>1, as the codomain of our function. Consider again our “generalization equation” (4). Because we still consider arguments of length n, i.e. a domain X\subseteq\mathbb R^n, nothing changes in the denominator. However, because f is now vector-valued, the object fed into the numerator norm becomes a vector of length m, and we have to use an appropriate norm here as well. Hence, we define the general multivariate derivative as follows:

Definition: Multivariate Derivative of Vector-valued Functions.
Let X\subseteq\mathbb R^n, Y\subseteq\mathbb R^m and f:X\mapsto Y. Further, let x_0\in int(X) be an interior point of X. Denote by _k\|\cdot\| a norm on \mathbb R^k, k\in\{n,m\}. Then, f is differentiable at x_0 if there exists a matrix D^*\in\mathbb R^{m\times n} such that

    \[ \lim_{_n\|h\|\to 0}\frac{_m\|f(x_0+h) - f(x_0) - D^*h\|}{_n\|h\|} = 0. \]

Then, we call D^* the derivative of f at x_0, denoted D_f(x_0) or \frac{df}{dx}(x_0). If f is differentiable at every x_0\in X, we say that f is differentiable, and we call D_f:X\mapsto \mathbb R^{m\times n}, x\mapsto D_f(x) the derivative of f.

 

Again, the dimensions of the derivative — which is now a matrix — are given by the dimension conformity requirement in the numerator, that is, by the fact that we need to be able to subtract D^*h from f(x_0+h) - f(x_0).
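
The same kind of numerical check works in the vector-valued case. The function f(x_1,x_2) = (x_1x_2, x_1 + x_2^2), the point x_0 = (1,2) and the candidate matrix D^* below are assumptions chosen for this example; the Euclidean norm is used in both numerator and denominator.

```python
import math

def f(x):
    return [x[0] * x[1], x[0] + x[1] ** 2]

x0 = [1.0, 2.0]
D_star = [[2.0, 1.0],   # candidate 2x2 derivative matrix at x0 (to be checked)
          [1.0, 4.0]]

def norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(c * c for c in v))

def ratio(h):
    """||f(x0 + h) - f(x0) - D* h|| / ||h||."""
    fx = f([a + b for a, b in zip(x0, h)])
    f0 = f(x0)
    Dh = [sum(row[j] * h[j] for j in range(len(h))) for row in D_star]
    residual = [fx[i] - f0[i] - Dh[i] for i in range(len(fx))]
    return norm(residual) / norm(h)

for scale in [1e-1, 1e-3, 1e-5]:
    print(scale, ratio([scale, -2 * scale]), ratio([scale, scale]))
```

Note how the matrix product D^*h only makes sense because D^* has as many columns as h has entries and as many rows as f has components.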

Having defined the derivatives formally, it is a good time to make ourselves aware of the three conceptual levels of differential calculus in the context of general derivatives of multivariate, potentially vector-valued functions: Consider X\subseteq\mathbb R^n, Y\subseteq\mathbb R^m and the class D^1(X,Y) of differentiable functions with domain X and codomain Y, a subset of the set F(X,Y) of all functions f:X\mapsto Y. Then, the fundamental concepts relevant to differential calculus are

    1. the differential operator D: D^1(X,Y)\mapsto F(X,\mathbb R^{m\times n}), f\mapsto D_f = \frac{df}{dx}
      (alternatively denoted as \frac{d}{dx}), mapping between function spaces and associating every function in its domain with its derivative,
    2. the derivative function D_f = \frac{df}{dx} with domain X, mapping elements x\in X onto the derivative of the differentiable function f\in D^1(X,Y) at x,
    3. the derivative of a differentiable function f\in D^1(X,Y) evaluated at x, D_f(x) = \frac{df}{dx}(x), a value in the codomain \mathbb R^{m\times n} of the derivative function.

Mini-Excursion: Two Notations for the Differential Operator.
You have seen that the definitions above use the notations D and \frac{d}{dx} for the differential operator interchangeably. While both of these are quite common, classical mathematical textbooks more often use D_f(x_0) as the notation for the derivative of a multivariate function f at x_0 in its domain. However, you may think of this object as the same thing as \frac{df}{dx}(x_0): it is the exact same concept! The reason why textbooks may hesitate to write the multivariate differential operator as \frac{d}{dx} is that the change dx in the denominator refers to instantaneous variation in a multivariate object, and we don’t know how to divide by vectors. Then again, as we will see below, the notation \frac{\partial f}{\partial \bar x}(x_0) for taking the partial derivatives with respect to multiple elements \bar x of x (but not all, i.e. \bar x \neq x, thus the “partial” operator \partial), is widely accepted. Long story short, if the new notation D_f(x_0) confuses you, be clear that it is nothing else but (the generalization of) the regular derivative of f at x_0, \frac{df}{dx}(x_0), and sticking to it also in the multivariate context is perfectly fine.

 

Computing Multivariate Derivatives: Gradients, Jacobians and Hessians

We have spent quite some effort on discussing how to conceptually think of a multivariate derivative: we know the condition a candidate vector d^* or matrix D^* must satisfy in order to be called the derivative of a real- or vector-valued function at a point in its domain, respectively. Yet, in practice, we care less about characterizing the derivative in such an abstract way, but rather want to explicitly compute the function and evaluate it at certain points! So how do we go about this? This is where the online course takes its shortcut: we will not discuss why we may proceed as we do, but just how to compute multivariate derivatives. As already mentioned, the interested reader may find a comprehensive treatment of the mathematical background in the companion script.

First, we need a bunch of definitions:

Definition: Partial Derivative.
Consider a function f:X\mapsto\mathbb R where X\subseteq \mathbb R^n, and let x_0\in X. Then, if for j\in\{1,\ldots,n\}, the function

    \[\tilde f_{j,x_0}:\mathbb R\mapsto\mathbb R, t\mapsto f(x_{0,1},\ldots,x_{0,j-1}, x_{0,j}+t, x_{0,j+1}, \ldots, x_{0,n})\]

is differentiable at t=0, we say that f is partially differentiable at x_0 with respect to (the j-th argument) x_j, and we call \frac{d \tilde f_{j,x_0}}{dt}(0) the partial derivative of f at x_0 with respect to x_j, denoted by  \frac{\partial f}{\partial x_j}(x_0) or f_j(x_0).

 

Again, note the three conceptual levels: the operator \frac{\partial}{\partial x}, the function \frac{\partial f}{\partial x} and the value \frac{\partial f}{\partial x}(x). Verbally, the j-th partial derivative looks at the function f, considers all arguments but the j-th one fixed at x_0, that is, it treats them as constants, and takes the univariate derivative with respect to the j-th argument. Intuitively, it can be viewed as the description of how f varies along the j-th fundamental direction of the \mathbb R^n starting from x_0. If we collect all these partial derivatives in a row vector, we call the resulting object the gradient:

Definition: Gradient.
Consider a function f:X\mapsto\mathbb R where X\subseteq \mathbb R^n, and let x_0\in X. Then, if for all j\in\{1,\ldots,n\}, f is partially differentiable with respect to x_j at x_0, then we call the row vector

    \[\nabla f(x_0) = (f_1(x_0), f_2(x_0),\ldots, f_n(x_0))\]

the gradient of f at x_0. If for all x_0\in X and for all j\in\{1,\ldots,n\}, f is partially differentiable with respect to x_j at x_0, then we call the function \nabla f:X\mapsto \mathbb R^{1\times n}, x_0\mapsto\nabla f(x_0) the gradient of f.

 

Also the gradient has an operator, function and value level. To get some feeling for the gradient concept, let’s consider some examples:

    \[f^1(x_1,x_2) = x_1 + x_2,\]

    \[f^2(x_1,x_2) = x_1x_2\]

and

    \[f^3(x_1,x_2) = x_1x_2^2 + \cos(x_1).\]

Recall that the j-th partial derivative is obtained from differentiating the function as if it had only one variable argument, namely x_j, and try to compute the gradients of f^1, f^2 and f^3.

    \[f^1_1(x_1,x_2) = 1,\hspace{0.5cm}f^1_2(x_1,x_2) = 1,\]

    \[\frac{\partial f^2}{\partial x_1} (x_1,x_2) = x_2,\hspace{0.5cm}\frac{\partial f^2}{\partial x_2}(x_1,x_2) = x_1,\]

    \[f^3_1(x_1,x_2) = x_2^2 - \sin(x_1),\hspace{0.5cm}f^3_2(x_1,x_2) = 2x_1x_2.\]

The gradient \nabla f is the respective collection (f_1,f_2) for f\in\{f^1, f^2, f^3\}.
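If you would like to double-check such hand computations, a central difference quotient mimics the definition of the partial derivative quite directly. The following is a minimal sketch and not part of the formal development: the helper names `partial` and `gradient`, the step size `h` and the evaluation point `x0` are all assumptions chosen for illustration.

```python
import math


def partial(f, x, j, h=1e-6):
    """Central-difference estimate of the j-th partial derivative of f at x."""
    up, down = list(x), list(x)
    up[j] += h      # vary only the j-th argument ...
    down[j] -= h    # ... and keep all others fixed at x
    return (f(up) - f(down)) / (2 * h)


def gradient(f, x, h=1e-6):
    """Collect all partial derivatives in a list (our stand-in for the row vector)."""
    return [partial(f, x, j, h) for j in range(len(x))]


# The three example functions from the text.
def f1(x):
    return x[0] + x[1]


def f2(x):
    return x[0] * x[1]


def f3(x):
    return x[0] * x[1] ** 2 + math.cos(x[0])


x0 = [1.5, -2.0]  # arbitrary evaluation point (an assumption of this sketch)

# Closed-form gradients from the text, evaluated at x0.
exact = {
    "f1": [1.0, 1.0],
    "f2": [x0[1], x0[0]],
    "f3": [x0[1] ** 2 - math.sin(x0[0]), 2 * x0[0] * x0[1]],
}
numeric = {"f1": gradient(f1, x0), "f2": gradient(f2, x0), "f3": gradient(f3, x0)}

for name in exact:
    print(name, numeric[name], exact[name])
```

The numerical and the closed-form values should agree up to a small discretization error, which is all such a check can tell you: it does not replace the analytical derivation.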

 

Continue reading only if you are done with the exercise above, as the following paragraph includes a discussion of the results one obtains.

As you see, the partial derivatives can include none, some, or all of the components of the vector x! Generally, the point to be made is that simply because we are taking the derivative in one direction (say x_1), the other components (here: x_2, more generally, x_2,x_3,\ldots,x_n) do not necessarily drop out, because they may interact non-linearly in f! Only if the terms containing x_j are additively separable from the rest (e.g. in f^1, or \cos(x_1) in f^3) will they drop out when taking the partial derivative with respect to a different x_l, l\neq j.

Verbally, the gradient value \nabla f(x_0) summarizes how f extends along all the fundamental directions of the space \mathbb R^n around the point x_0. As such, it offers a complete characterization of the instantaneous variation f exhibits around x_0, and is intuitively in line with what we would view as a suitable candidate for the derivative of f at x_0. Indeed, the following relationship holds:

Theorem: The Gradient and the Derivative.
Let X\subseteq\mathbb R^n and f:X\mapsto\mathbb R. Further, let x_0\in int(X) be an interior point of X, and suppose that f is differentiable at x_0. Then, all partial derivatives of f at x_0 exist, and D_f(x_0) = \nabla f(x_0).

 

As a technicality, the theorem contains the condition that f is differentiable at x_0. How do we verify this? A quite straightforward sufficient condition is the following:

Theorem: Partial Differentiability and Differentiability.
Let X\subseteq\mathbb R^n and f:X\mapsto\mathbb R. Further, let x_0\in int(X) be an interior point of X. If all the partial derivatives of f exist in a neighborhood of x_0 and are continuous at x_0, then f is differentiable at x_0.

 

Definition: Set of Continuously Differentiable Functions.
Let X\subseteq\mathbb R^n and f:X\mapsto\mathbb R. Then, we define

    \[C^1(X,\mathbb R) := \left \{f:X\mapsto\mathbb R: \left (\forall j\in\{1,\ldots,n\}: \frac{\partial f}{\partial x_j}\text{ is continuous}\right )\right \}\]

as the set of continuously differentiable real-valued functions over X. If the context reveals that the codomain of considered functions f is equal to \mathbb R, it is appropriate to use the alternative notation C^1(X).

 

The Theorem above tells us more compactly that “if f\in C^1(X,\mathbb R), then f is differentiable.” As the theorem only provides a sufficient condition, there is still room for exceptional cases where not all partial derivatives are continuous, but the function is still differentiable. However, we rarely deal with such functions in economics, such that this technicality should be of little concern to you.

Now that we know how to differentiate multivariate real-valued functions, the last step is to consider vector-valued functions. Indeed, we already more or less know how to do it! The key insight here is that we can re-write the vector-valued function as a collection of real-valued functions that we know how to deal with: note that we may write f:\mathbb R^n\mapsto\mathbb R^m as

(5)   \begin{equation*}  f = \begin{pmatrix} f^1\\ f^2\\ \vdots\\ f^m \end{pmatrix} \text{ so that }\forall x\in\mathbb R^n: f(x) = \begin{pmatrix} f^1(x)\\ f^2(x)\\ \vdots\\ f^m(x) \end{pmatrix} \end{equation*}

where for any i\in\{1,\ldots,m\}, f^i:\mathbb R^n\mapsto\mathbb R is a multivariate real-valued function. Let’s see how we stack the partial derivatives of all these functions into a collecting object:

Definition: Jacobian.
Let n,m\in\mathbb N, X\subseteq \mathbb R^n and f:X\mapsto\mathbb R^m and for i\in\{1,\ldots,m\}, let f^i:X\mapsto\mathbb R such that f = (f^1,\ldots,f^m)'. Let x_0\in X. Then, if at x_0, \forall i\in\{1,\ldots,m\}, f^i is partially differentiable with respect to any x_j, j\in\{1,\ldots,n\}, we call

    \[J_f(x_0) = \begin{pmatrix} \nabla f^1(x_0)\\ \nabla f^2(x_0)\\ \vdots\\ \nabla f^m(x_0) \end{pmatrix} = \begin{pmatrix} f^1_1(x_0) & f_2^1(x_0) & \ldots & f_n^1(x_0)\\ f^2_1(x_0) & f_2^2(x_0) & \ldots & f_n^2(x_0)\\ \vdots & \vdots & \ddots & \vdots\\ f^m_1(x_0) & f_2^m(x_0) & \ldots & f_n^m(x_0) \end{pmatrix} \]

the Jacobian of f at x_0. If the above holds at any x_0\in X, we call the mapping J_f:X\mapsto \mathbb R^{m\times n}, x_0\mapsto J_f(x_0) the Jacobian of f.

 

Note again the three conceptual levels: the Jacobian operator J, function J_f and value J_f(x_0). As with the gradient, this object is an intuitive candidate for the derivative: it summarizes how any component function f^i varies along all the fundamental directions of the \mathbb R^n starting from x_0. Again, it turns out that the Jacobian precisely meets the theoretical requirements of the derivative:

Theorem: The Jacobian and the Derivative.
Let X\subseteq\mathbb R^n, f:X\mapsto\mathbb R^m and f^1,\ldots,f^m:X\mapsto\mathbb R such that equation (5) holds. Further, let x_0\in int(X) be an interior point of X, and suppose that f is differentiable at x_0. Then, for any f^i, i\in\{1,\ldots,m\}, all partial derivatives of f^i at x_0 exist, and D_f(x_0) = J_f(x_0).

 

While the gradient and the Jacobian may look intimidating at first, they are nothing but mere collections of partial derivatives, i.e. they describe rules for how we order them when presenting them together. This means that once you have understood firmly what a partial derivative is, these concepts are indeed rather straightforward to grasp, so don’t let the intense notation fool you!

To conclude our discussion of the first multivariate derivative, note that the rules you are well-familiar with (linearity of the derivative, product rule and chain rule) go through also for the multi-dimensional case. An exception is the quotient rule, because the quotient of two vectors is not a well-defined object. That being said, we need to take some care concerning the order of the factors, because unlike with real-valued functions where e.g. f'(x)g(x) = g(x)f'(x), recall that matrix multiplication is not commutative! Thus, be sure to respect the order in which the differential expressions appear in the following theorem:

Theorem: Rules for General Multivariate Derivatives.
Let X\subseteq\mathbb R^n, f,g:X\mapsto\mathbb R^m and h:\mathbb R^m\mapsto\mathbb R^k. Suppose that f,g and h are differentiable functions. Then,

    1. (Linearity) For all \lambda,\mu\in\mathbb R, \lambda f + \mu g is differentiable and D_{\lambda f + \mu g} = \lambda D_f + \mu D_g.
    2. (Product Rule) f'\cdot g is differentiable and D_{f'g} = f'\cdot D_g + g'\cdot D_f.
    3. (Chain Rule) h \circ f is differentiable and D_{h\circ f} = (D_h\circ f) \cdot D_f.

 

At the product rule, note that the prime indicates transposition and does not refer to our notation f' for the derivative, which is reserved for the univariate context!

As an example for the chain rule, consider v(x)= x'A'Ax where x\in\mathbb R^n and A\in\mathbb R^{m\times n}. Then, what is D_v(x)? Note that we can write v(x) = h(f(x)) where f(x) = Ax and h(y) = y'y. Albeit somewhat tedious, taking the partial derivatives of f is rather straightforward, and you should arrive at D_f(x)=A. For h, you should obtain D_h(y) = 2y'. Then, the chain rule tells us that

    \[ D_v(x) = D_h(f(x))\cdot D_f(x) = 2(Ax)'\cdot A = 2x'A'A.\]

This way, you have elegantly avoided dealing with squared expressions in taking the derivative directly.
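To convince yourself that the chain rule result is right, you can compare it to difference quotients of v for a concrete matrix. The sketch below does this with plain Python lists; the matrix `A`, the point `x` and all helper names are assumptions made up for this illustration.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def v(A, x):
    """v(x) = x'A'Ax, computed as (Ax)'(Ax)."""
    y = matvec(A, x)
    return sum(yi * yi for yi in y)


def derivative_formula(A, x):
    """Entries of the row vector 2x'A'A, computed as 2 * A'(Ax)."""
    return [2 * g for g in matvec(transpose(A), matvec(A, x))]


def derivative_numeric(A, x, h=1e-6):
    """Central-difference estimate of all partial derivatives of v at x."""
    result = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        result.append((v(A, up) - v(A, down)) / (2 * h))
    return result


# Hypothetical example with m = 2 and n = 3.
A = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
x = [1.0, 0.5, -2.0]

print(derivative_formula(A, x))
print(derivative_numeric(A, x))
```

Both lines should show (approximately) the same three numbers, one for each component of x.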

This result could have also been derived from the product rule with f(x) = x and g(x) = A'Ax so that v = f'g, i.e. \forall x\in\mathbb R^n: v(x) = f'(x)\cdot g(x). Then, we can rather straightforwardly compute that D_f(x) = \mathbf{I_n} and D_g(x) = A'A, and the product rule gives us

    \[\begin{split} D_v(x) &= f'(x) D_g(x) + g'(x) D_f(x) = x'A'A + (A'Ax)'\mathbf{I_n}  \\&= x'A'A + x'A'(A')' = 2 x'A'A. \end{split}\]

To test your understanding, try to apply the product rule again with f(x) = g(x) = Ax; this should in fact be the simplest way of obtaining this result.


We have D_f(x) = A and f'(x) = x'A'. This gives:

    \[\begin{split} D_v(x) &= f'(x) D_g(x) + g'(x) D_f(x) \overset{f=g}= 2f'(x) D_f(x) \\&= 2x'A'A. \end{split}\]

 

Now we know how to take the first derivative of general functions. But what about higher order derivatives? You may have grasped that when starting from a function f:X\mapsto\mathbb R where X\subseteq\mathbb R^n, then taking the derivative comes with an increase in dimension: while for an x_0\in X, f(x_0)\in\mathbb R, \frac{df}{dx}(x_0)' = \nabla f(x_0)'\in\mathbb R^n, and for f:X\mapsto\mathbb R^m where f(x_0)\in\mathbb R^m, the derivative \frac{df}{dx}(x_0) = J_f(x_0) is already a matrix in \mathbb R^{m\times n}. Because we do not touch multi-dimensional matrices here (i.e. spaces of the form \mathbb R^{n_1\times n_2\times\ldots\times n_k}), which you indeed never come across in regular economic studies, this puts a natural limit to the derivatives we consider here: the first derivative for f:X\mapsto\mathbb R^m, which you already know from the last subsection, and the second derivative for f:X\mapsto\mathbb R. Like with univariate functions, it can be obtained from taking the derivative of the first derivative, provided that it exists.

Definition: Second Order Partial Derivative.
Let X\subseteq\mathbb R^n be an open set and f:X\mapsto\mathbb R. Further, let x_0\in X, and suppose that f is differentiable at x_0. Then, if the i-th partial derivative of f, f_i = \frac{\partial f}{\partial x_i}, is differentiable at x_0, we call its j-th partial derivative at x_0 the (i,j)-second order partial derivative at x_0, denoted f_{i,j}(x_0) = \frac{\partial f_i}{\partial x_j}(x_0) = \frac{\partial^2 f}{\partial x_i\partial x_j}(x_0).

 

Higher order partial derivatives are defined in the exact same way, so that e.g. the (i,j,k,l) fourth order derivative of f at x_0 is \frac{\partial^4 f}{\partial x_i\partial x_j\partial x_k\partial x_l}(x_0). By requiring X to be an open set, we ensure X = int(X) so that it has only interior points. Recall that like the function f, the partial derivatives \frac{\partial f}{\partial x_i} are functions from X\subseteq\mathbb R^n to \mathbb R, so it makes indeed sense to think about their partial derivatives. As an example, take the function given by

    \[f^3(x_1,x_2,x_3) = x_1x_2^2 + \cos(x_1)e^{x_3}\]

with gradient

    \[\nabla f^3(x) = (x_2^2 - \sin(x_1)e^{x_3}, 2x_1x_2, \cos(x_1)e^{x_3}).\]

Should you need practice with the gradient concept, try verifying that this expression is correct. Before turning to the second order partial derivatives of this function, let us first study how we need to order them to obtain a second derivative, and make ourselves familiar with a very powerful rule in computing them.

Definition: Hessian or Hessian Matrix.
Let X\subseteq\mathbb R^n be an open set and f:X\mapsto\mathbb R. Further, let x_0\in X, and suppose that f is differentiable at x_0 and that all second order partial derivatives of f at x_0 exist. Then, the matrix

    \[ H_f(x_0) = \begin{pmatrix} \nabla{f_1}(x_0) \\ \nabla{f_2}(x_0)\\ \vdots  \\ \nabla{f_n} (x_0) \end{pmatrix} = \begin{pmatrix} f_{1,1}(x_0) & f_{1,2}(x_0) & \cdots & f_{1,n}(x_0) \\ f_{2,1}(x_0) & f_{2,2}(x_0) & \cdots & f_{2,n}(x_0) \\ \vdots  & \vdots  & \ddots & \vdots  \\ f_{n,1}(x_0) & f_{n,2}(x_0) & \cdots & f_{n,n}(x_0) \end{pmatrix} \]

is called the Hessian of f at x_0.

 

Now, remember the set C^1(X,\mathbb R) that we defined for X\subseteq\mathbb R^n to indicate that all partial derivatives of f exist and are continuous? Analogously, we can define more generally:

Definition: Set of k times Continuously Differentiable Functions.
Let X\subseteq\mathbb R^n and f:X\mapsto\mathbb R. Then, we define

    \[C^k(X,\mathbb R) := \left \{f:X\mapsto\mathbb R: \text{all }k\text{-th order partial derivatives of }f\text{ are continuous}\right \}\]

as the set of k times continuously differentiable real-valued functions over X. If the context reveals that the codomain of considered functions f is equal to \mathbb R, it is appropriate to use the alternative notation C^k(X).

 

This concept is useful in the given context because:

Theorem: Schwarz’s Theorem/Young’s Theorem.
Let X\subseteq\mathbb R^n be an open set and f:X\mapsto\mathbb R. If f \in C^k(X), then the order in which the derivatives up to order k are taken can be permuted.

 

Here, permuted means simply to interchange in order, so that e.g. f_{1,2}(x) = f_{2,1}(x). You can assume that the functions we are typically concerned with are sufficiently well-behaved such that their partial derivatives we consider are continuous, and the order is interchangeable! Nonetheless, if in doubt, continuity of the respective partial derivatives is of course subject to investigation before applying this theorem.

Corollary: Hessian and Gradient.
Let X\subseteq\mathbb R^n and f\in C^2(X). Then, the Hessian is symmetric and corresponds to the derivative, i.e. the Jacobian of the transposed gradient (\nabla f)': H_f = J_{(\nabla f)'}. By its nature as the derivative of the first derivative, H_f(x_0) is the second derivative of f at x_0\in X.

 

From a technical side, note that (\nabla f)' is the function that maps x onto the column vector (\nabla f(x))'. The corollary holds because if the second order partial derivatives, i.e. the partial derivatives of the functions in the gradient, are all continuous, then we can take the derivative of the (transposed) gradient (\nabla f)':X\mapsto\mathbb R^n by our sufficient condition for multivariate differentiability studied above. Because the (transposed) gradient is nothing but a vector-valued function, its derivative will coincide with its Jacobian J_{(\nabla f)'}. However, from the way the second order partial derivatives are arranged in the Hessian H_f, it follows that these two objects are precisely the same! Note, however, that the Hessian is certainly equal to the second derivative only if f\in C^2(X) because otherwise, the second derivative may not even be defined! Also note that the Hessian is always a Jacobian (of the transposed gradient), but not every Jacobian is a Hessian; be sure to know the distinction between these two concepts.

Finally, let’s put all this knowledge to work. For the function we considered above, given by f^3(x_1,x_2,x_3) = x_1x_2^2 + \cos(x_1)e^{x_3}, compute the second order partial derivatives at a point (x_1,x_2,x_3) to obtain the Hessian at this point (Hint: the second order partial derivatives are continuous and you can exploit symmetry; this should save you 3 computations). Does the second derivative of f^3 exist? If so, what is it equal to?


The Hessian of f^3 at (x_1,x_2,x_3) is:

    \[ H_{f^3}(x_1,x_2,x_3)  = \begin{pmatrix} -\cos(x_1)e^{x_3} & 2x_2 & -\sin(x_1)e^{x_3} \\ 2x_2 & 2x_1 &0 \\ -\sin(x_1)e^{x_3} & 0  & \cos(x_1)e^{x_3} \end{pmatrix}. \]

Since the second order partial derivatives are all continuous, we also know that this Hessian gives us the second derivative of f^3.
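As a sanity check on this matrix, one can approximate every second order partial derivative by a second-order difference quotient and compare. This is only a rough numerical sketch: the function names, the step size `h` and the evaluation point `x0` are assumptions of this example, not part of the text.

```python
import math


def f3(x):
    """The three-argument example function from the text."""
    x1, x2, x3 = x
    return x1 * x2 ** 2 + math.cos(x1) * math.exp(x3)


def hessian_formula(x):
    """Hessian of f3 as derived by hand."""
    x1, x2, x3 = x
    c = math.cos(x1) * math.exp(x3)
    s = math.sin(x1) * math.exp(x3)
    return [[-c, 2 * x2, -s],
            [2 * x2, 2 * x1, 0.0],
            [-s, 0.0, c]]


def hessian_numeric(f, x, h=1e-4):
    """Finite-difference estimate of all second order partial derivatives."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                # Evaluate f after moving argument i by di and argument j by dj.
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            H[i][j] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
    return H


x0 = [0.3, -1.2, 0.5]  # arbitrary evaluation point (an assumption)
H_exact = hessian_formula(x0)
H_approx = hessian_numeric(f3, x0)

for row_exact, row_approx in zip(H_exact, H_approx):
    print(row_exact, row_approx)
```

Note that the numerical matrix is (approximately) symmetric as well, in line with Schwarz's theorem.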

 

Higher Order Taylor Approximations and Total Derivatives

At the end of the last subsection, we learned about the second derivative of a multivariate function. The second derivative, equal to the Hessian matrix, can be used to improve over the linear first order Taylor approximation of a multivariate function by introducing a “squared” term:

Theorem: Second Order Multivariate Taylor Approximation.
Let X \subseteq \mathbb{R}^n be an open set and consider f \in C^2(X). Let x_0\in X. Then, the second order Taylor approximation to f at x_0\in X is

    \[T_{2,x_0}(x) = f(x_0) + \nabla f(x_0) \cdot (x-x_0) + \frac{1}{2} (x-x_0)^{\prime} \cdot H_f(x_0)\cdot(x-x_0).\]

The error \varepsilon_{2,x_0}(x) = f(x) - T_{2,x_0}(x) approaches 0 at a faster rate than \|x-x_0\|^2, i.e.

    \[\lim_{\|h\|\to 0}\frac{\varepsilon_{2,x_0}(x_0+h)}{\|h\|^2} = 0.\]

 

Again, this theorem tells us how around x_0, we can arrive at a “good” functional approximation of an arbitrary f\in C^2(X), where “good” again means that the error becomes negligible relative to the squared distance \|h\|^2 to the approximation point x_0.
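To see what "negligible relative to the squared distance" means in practice, the following sketch evaluates the absolute approximation error divided by \|h\|^2 for shrinking h. It uses the two-argument example f(x_1,x_2) = x_1x_2^2 + \cos(x_1) from above; the approximation point, the direction and the step sizes are assumptions picked for illustration.

```python
import math


def f(x1, x2):
    return x1 * x2 ** 2 + math.cos(x1)


def taylor2(x1, x2, a1, a2):
    """Second order Taylor approximation of f at (a1, a2), evaluated at (x1, x2)."""
    h1, h2 = x1 - a1, x2 - a2
    g1, g2 = a2 ** 2 - math.sin(a1), 2 * a1 * a2      # gradient at (a1, a2)
    h11, h12, h22 = -math.cos(a1), 2 * a2, 2 * a1      # Hessian entries at (a1, a2)
    linear = g1 * h1 + g2 * h2
    quadratic = 0.5 * (h11 * h1 ** 2 + 2 * h12 * h1 * h2 + h22 * h2 ** 2)
    return f(a1, a2) + linear + quadratic


a1, a2 = 1.0, 2.0      # approximation point x_0 (an assumption)
d1, d2 = 0.6, -0.8     # direction with Euclidean norm 1 (an assumption)

ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    x1, x2 = a1 + t * d1, a2 + t * d2
    error = abs(f(x1, x2) - taylor2(x1, x2, a1, a2))
    ratios.append(error / t ** 2)   # ||h|| = t because the direction has norm 1

print(ratios)
```

The printed ratios shrink as t gets smaller, which is exactly the statement of the limit in the theorem, checked here along one direction only.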

What about approximations of other orders? Recall that every time we take the derivative, we increase the order of the object to be considered, so that the first derivative of a real-valued function is vector-valued, and the second derivative, i.e. the derivative of the vector-valued first derivative, is matrix-valued. Accordingly, the third derivative requires three dimensions, the fourth requires four, and so forth. Although it is mathematically possible to define Taylor approximations of orders >2 also for multivariate functions, this complication is the reason why applied mathematicians usually stick to approximations of orders 0, 1 or 2.

The approximations of order 0 and 1, respectively, are defined in the way we would expect: if f\in C^1(X), the first-order Taylor expansion is

    \[T_{1,x_0}(x) = f(x_0) + \nabla f(x_0) \cdot (x-x_0),\]

and the order 0 approximation is just T_{0,x_0}(x) = f(x_0).

The multivariate Taylor expansion, adding the error to the Taylor approximation, is also defined in analogy to the univariate case. However, for the second order approximation, due to the dimensional complication, an explicit formula (using the third derivative) is harder to come by. For the first order approximation, if f\in C^2(X), the error is equal to

    \[ \varepsilon_{1,x_0}(x_0+h) = \frac{1}{2} h' \cdot H_f(x_0 + \lambda h)\cdot h \]

for a \lambda\in(0,1), and thus a direct generalization of the univariate case. This generalization principle holds also for errors of order 0 approximations if f\in C^1(X), where \varepsilon_{0,x_0}(x_0+h) = \nabla f(x_0 + \lambda h)h for a \lambda\in(0,1).

Recall that in the univariate case, we had the mean value theorem as a corollary of Taylor’s theorem. What about the multivariate case? In analogy to before, for a real-valued function f\in C^1(X), using the order 0 expansion (approximating at x_1 and using h=x_2-x_1), we may arrive at

    \[f(x_2) - f(x_1) = \nabla f(x^*)(x_2 - x_1)\]

where x^* is a convex combination of x_1 and x_2. Now, the issue arises that the RHS is a scalar product, and we cannot solve for \nabla f(x^*). Thus, unfortunately, the mean value theorem does not generalize in a way that pins down the gradient itself, such that we cannot as easily derive a sufficient condition for existence of a vector x^* that sets the gradient to zero using the Taylor approach.

The Total Derivative. An object frequently used in economics is the total derivative. It is instructive to discuss it here, since it is closely linked to Taylor’s theorem. Typically, you will read the two-variable version like this:

(6)   \begin{equation*} df = \frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2. \end{equation*}

So, how do we make sense of this? And how can we use it to derive insights on the function f? The purpose of this expression is to capture the instantaneous rate of change of f as the vector of arguments marginally varies in a specific direction dx = (dx_1,dx_2)'. That is, it characterizes df(x_0) = \lim_{t\to 0} \Delta f(x_0)/t when considering changes of the form \Delta x = t\cdot dx with a fixed vector dx of relative variation of elements in x. Accordingly, df is a function of x_0 and moreover of the direction components dx_1 and dx_2! As such, a more explicit way of writing this concept is

    \[df(x_0,dx) = \frac{\partial f}{\partial x_1}(x_0) dx_1 + \frac{\partial f}{\partial x_2}(x_0) dx_2 = \nabla f(x_0)\cdot\begin{pmatrix} dx_1\\dx_2\end{pmatrix}.\]

Why should we care about fixing specific ratios for the variation in the argument’s components? In economics, we are frequently concerned with trade-offs so that increasing one argument (e.g. consumption of the first good) cannot go without decreasing the other (e.g. consumption of the second good, leisure, etc.), and the exchange ratio is usually exogenously given, at least when fixing the starting point of variation x_0. The total derivative tells us more directly what instantaneous variations in light of such trade-offs look like! Indeed, it tells us that computing this variation is as simple as multiplying the gradient with the respective vector of relative variations.

The formal justification behind this relationship and some more elaborations can be found in the companion script. Intuitively, you can see that it closely relates to Taylor’s theorem by noting that the RHS of the above equation looks a lot like the middle term you get in the first order Taylor expansion. To see this rather abstract concept in action, consider the following scenario: suppose you care only about studying and sleeping, so that your utility function may be written as

    \[u(p, s) = 5\sqrt{p} - (8-s)^2\]

where p is the number of pages you read in your favorite economics textbook in a day, and s is the number of hours of sleep you are getting per night. Suppose you are currently getting 6 hours of sleep and reading 36 pages. You are thinking about reading just a bit more at the expense of sleeping. Assuming that you can read 4 pages in 1 hour, the vector characterizing how you marginally exchange pages for sleep is (dp, ds) = (4,-1). So, how does your utility change as you move towards reading more and sleeping less? Let us consult the total derivative:

    \[du(p_0, s_0) = \frac{5}{2\sqrt{p_0}}dp + 2(8-s_0)ds.\]

Plugging in your current schedule (p_0, s_0) = (36, 6) and the variation vector (dp, ds), you get

    \[du(36, 6) = \frac{5}{2\sqrt{36}}\cdot 4 + 2\cdot (8-6)\cdot (-1) = \frac{5}{3} - 4 = -\frac{7}{3}.\]

Thus, your utility is decreasing, and you should in fact be reading less and sleeping more!

If, on the other hand, you currently manage to read the 36 pages and still get 7 hours of sleep, and you are more efficient at reading, managing 6 pages per hour, things look different:

    \[du(36, 7) = \frac{5}{2\sqrt{36}}\cdot 6 + 2\cdot (8-7)\cdot (-1) = \frac{5}{2} - 2 = \frac{1}{2}.\]
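The two computations above amount to multiplying the gradient with the direction vector, which is easy to reproduce in a few lines. The following is a minimal sketch; the function names `grad_u` and `total_derivative` are made up for this illustration.

```python
import math


def u(p, s):
    """Utility from reading p pages and sleeping s hours."""
    return 5 * math.sqrt(p) - (8 - s) ** 2


def grad_u(p, s):
    """Partial derivatives of u with respect to p and s."""
    return (5 / (2 * math.sqrt(p)), 2 * (8 - s))


def total_derivative(p0, s0, dp, ds):
    """du(p0, s0) in direction (dp, ds): gradient times direction."""
    u_p, u_s = grad_u(p0, s0)
    return u_p * dp + u_s * ds


# First scenario: 36 pages, 6 hours of sleep, 4 pages per hour.
du_first = total_derivative(36, 6, 4, -1)
# Second scenario: 36 pages, 7 hours of sleep, 6 pages per hour.
du_second = total_derivative(36, 7, 6, -1)

print(du_first, du_second)
```

Up to floating point rounding, the two printed numbers correspond to -7/3 and 1/2 from the text.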

The idea we saw above for two arguments can, of course, be generalized to arbitrary functions f:\mathbb R^n\mapsto\mathbb R. Defining the total derivative as a function of the considered location x_0 and the direction of change dx=(dx_1,\ldots, dx_n), we write

    \[df:\mathbb R^n\times \mathbb R^n\mapsto\mathbb R, (x_0,dx)\mapsto df(x_0,dx) = \sum_{i=1}^n\frac{\partial f}{\partial x_i}(x_0) dx_i\]

or more compactly

(7)   \begin{equation*} df = \sum_{i=1}^n\frac{\partial f}{\partial x_i} dx_i = \nabla f\cdot dx. \end{equation*}

In economics, these considerations are valuable in theoretical models when we are doing comparative statics, i.e. we consider how some equilibrium state (corresponding to x_0) and an economic output quantity, e.g. GDP(x_0), marginally respond to exogenous impulses that change economic quantities in fixed ratios. Oftentimes, as the example just given already hints at, we will not choose these ratios of change ad hoc, but assume that they are driven by some background variable, such as technology shocks. If z denotes the level of technology in the economy, we consider f=GDP as the outcome, and x_i are all relevant determinants of GDP, to highlight that the ratios are driven by the technology variable and endogenously depend upon it, we would write accordingly:

(8)   \begin{equation*} \frac{df}{dz} = \sum_{i=1}^n\frac{\partial f}{\partial x_i} \frac{dx_i}{dz} = \nabla f\cdot \frac{dx}{dz}. \end{equation*}
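As a small worked illustration of equation (8), return to the reading-and-sleeping utility u(p,s) = 5\sqrt{p} - (8-s)^2 and suppose, purely as an assumption for this sketch, that a background variable z (hours shifted from sleeping to reading) drives both arguments via p(z) = 36 + 4z and s(z) = 6 - z. Then, at z=0,

    \[\frac{du}{dz} = \frac{\partial u}{\partial p}\frac{dp}{dz} + \frac{\partial u}{\partial s}\frac{ds}{dz} = \frac{5}{2\sqrt{36}}\cdot 4 + 2\cdot(8-6)\cdot(-1) = \frac{5}{3} - 4 = -\frac{7}{3}.\]

Differentiating the composite function z\mapsto u(p(z),s(z)) = 5\sqrt{36+4z} - (2+z)^2 directly gives \frac{10}{\sqrt{36+4z}} - 2(2+z), which at z=0 is again \frac{5}{3} - 4 = -\frac{7}{3}, as the chain rule promises.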

 

Before moving on, a note of caution: equations (7) and (8) look quite similar; indeed, the naive mathematician could think that the latter could be obtained from just dividing the former by “dz”. Of course, this is in no way a well-defined operation, as dz is not a well-defined mathematical object, but arises only from our notational convention for the derivative of x with respect to z, \frac{dx}{dz}. Thus, remember that the total derivative works also when the direction of change is determined implicitly through a background variable z, but that this result does not follow by dividing the total derivative by the change dz! Moreover, do not extrapolate the total derivative to non-marginal changes! The concept, by its definition, captures an instantaneous variation relying on the Taylor approximation, and for non-marginal changes, the linear approximation is by no means guaranteed to fare well. To illustrate the correct and the false interpretation in an example, re-consider our reading-and-sleeping utility: here, we saw that moving marginally in direction (6,-1) (exchange rate 6 units of pages for one unit of sleep) when starting from the status quo of (36,7) could increase utility locally around this point. When considering instead the non-marginal change of reading 6 more pages at the expense of sleeping one less hour, one gets

    \[\begin{split} \Delta u(p,s) &= u(42,6) - u(36,7) = 5(\sqrt{42}-\sqrt{36}) - (8-6)^2 + (8-7)^2 \\&\approx 5(6.48 - 6) - 4 + 1 = -0.6, \end{split}\]

and thus a loss in utility.
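The gap between the marginal and the non-marginal view becomes visible when you scale the change by a factor t and look at the average rate of change (u(x_0 + t\cdot dx) - u(x_0))/t. The sketch below does this for the direction (6,-1) at (36,7); the function name `average_change` and the chosen values of t are assumptions for illustration.

```python
import math


def u(p, s):
    """Utility from reading p pages and sleeping s hours."""
    return 5 * math.sqrt(p) - (8 - s) ** 2


def average_change(t, p0=36.0, s0=7.0, dp=6.0, ds=-1.0):
    """Average rate of change of u when moving from (p0, s0) by t * (dp, ds)."""
    return (u(p0 + t * dp, s0 + t * ds) - u(p0, s0)) / t


steps = [1.0, 0.1, 0.01, 0.001]
rates = [average_change(t) for t in steps]

for t, rate in zip(steps, rates):
    print(t, rate)
```

For t=1, the rate is the negative non-marginal change computed above, whereas for small t, the rates approach the value 1/2 of the total derivative. This is the sense in which the total derivative is a purely local statement.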

You should take away from this:

    1. The standard form of the total derivative, df(x_0) = \sum_{i=1}^n\frac{\partial f}{\partial x_i}(x_0) dx_i, corresponds to the instantaneous variation of f from x_0 in the relative change direction dx = (dx_1,\ldots, dx_n)'.
    2. The total derivative with a “background variable” in the denominator,

          \[\frac{df}{dz} = \sum_{i=1}^n\frac{\partial f}{\partial x_i} \frac{dx_i}{dz},\]

      captures the same concept and follows directly from the chain rule.

    3. If someone ever asks me to explain the total derivative to them, I will not tell them to derive 2. from 1. by dividing the equation by dz.
    4. The arguments dx_i, i\in\{1,\ldots,n\}, express how x_i changes marginally relative to the other changes dx_j, j\neq i, and do not represent an absolute change in the i-th argument!

Differentiability and Continuity and Convexity

Let us re-consider the valuable properties of differentiable functions we initially highlighted for univariate derivatives: while we saw the linear approximation generalization in the previous subsection, the other two points, i.e. continuity and monotonicity, have not yet been addressed.

Indeed, the following generalizes to the multivariate case:

Theorem: Multivariate Differentiability implies Continuity.
Suppose that f:X\mapsto Y, X\subseteq\mathbb R^n, Y\subseteq\mathbb R^m is differentiable at x_0\in X. Then, f is continuous at x_0. Accordingly, if f is differentiable, f is also continuous.

 

However, be careful to note that differentiability and partial differentiability, i.e. existence of all partial derivatives, are not equivalent! Indeed, there may be discontinuous functions where all partial derivatives exist. One such example is

    \[f: \mathbb{R}^2 \mapsto \mathbb{R}, (x,y)\mapsto f(x,y)=\begin{cases}0&\text{if } xy=0\\1 &\text{else}\end{cases}.\]

It is partially differentiable with respect to x and y with partial derivative \frac{\partial f}{\partial x}(0,0) = \frac{\partial f}{\partial y}(0,0) = 0, but it is not continuous in (0,0).
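You can make this behavior tangible by evaluating the difference quotients along the two axes and the function values along the diagonal. This is a minimal sketch with step sizes chosen for illustration; keep in mind that floating point numbers only imitate the real line.

```python
def f(x, y):
    """0 on the two coordinate axes, 1 everywhere else."""
    return 0.0 if x * y == 0 else 1.0


steps = [10.0 ** (-k) for k in range(1, 9)]

# Difference quotients at the origin along the x- and y-axis.
quotients_x = [(f(t, 0.0) - f(0.0, 0.0)) / t for t in steps]
quotients_y = [(f(0.0, t) - f(0.0, 0.0)) / t for t in steps]

# Function values when approaching the origin along the diagonal x = y.
diagonal_values = [f(t, t) for t in steps]

print(quotients_x)
print(quotients_y)
print(diagonal_values)
```

All difference quotients are zero, so both partial derivatives at the origin are zero, while along the diagonal the function stays at 1 and thus does not approach f(0,0) = 0.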

That being said, recall also that partial differentiability with continuous partial derivatives was sufficient for differentiability, so that if all partial derivatives exist and are continuous, then the function itself must be continuous as well.

Next, let us turn to the third important (and not yet generalized) feature of univariate derivatives: they told us whether a given function was increasing, decreasing or constant on some interval. For multivariate functions, this characterization is no longer too meaningful, because how f evolves along one dimension depends on the position in the other dimensions: e.g. f(x_1,x_2) = x_1x_2 is monotonically increasing in x_1 if and only if x_2\geq 0. Thus, the more convenient concept to characterize multivariate functions is convexity.

Proposition: Convexity of Twice Differentiable Univariate Functions.
Let X\subseteq \mathbb R be a convex, open subset of \mathbb R and suppose that f\in C^2(X), i.e. f is a twice differentiable univariate function such that f''(x) is continuous. Then, f is convex if and only if \forall x\in X: f''(x)\geq 0.

 

Corollary: Strict Convexity of Univariate Functions.
Let X\subseteq \mathbb R be a convex subset of \mathbb R and suppose that f\in C^2(X). Then, if for any x\in int(X): f''(x)>0, f is strictly convex.

 

Note that we again focus only on interior points where the derivative exists. Verbally, the second derivative, i.e. the “slope of the slope”, gives us an equivalent condition for convexity and a sufficient condition for strict convexity. With some formal effort (see the companion script), we can generalize this concept to multivariate functions. Conceptually, the issue is how we should think about statements such as “H>0” or “H\geq 0” when H is a matrix, for instance the second derivative Hessian matrix of a multivariate real-valued function. It turns out that definiteness of a matrix can be viewed as a generalization of its “sign”! To see why, consider again a real number x\in\mathbb R. Then, if x\geq 0, for any v\in\mathbb R, it holds that v\cdot x \cdot v = x\cdot v^2 \geq 0. Similarly, if H is positive semi-definite, it holds that for any v\in\mathbb R^n, v'Hv \geq 0. Accordingly, we can derive:

Proposition: Multivariate Convexity.
Let X be a convex subset of \mathbb{R}^n and f\in C^2(X). Then, f is convex if and only if, for all x\in int(X), H_f(x) is positive semi-definite. Further, if for all x\in int(X), H_f(x) is positive definite, then f is strictly convex.
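To get a feeling for the condition v'Hv\geq 0, the following sketch evaluates the quadratic form on many unit vectors for two constant 2\times 2 Hessians. Restricting attention to unit vectors is without loss, since scaling v does not change the sign of v'Hv. Note that this is a sampling check under assumptions (the example function x_1^2 + x_1x_2 + x_2^2 and the number of sampled directions are made up for illustration), not a proof of definiteness.

```python
import math


def quad_form(H, v):
    """Compute v'Hv for a square matrix H (list of rows) and a vector v."""
    n = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(n) for j in range(n))


def min_quad_form_on_circle(H, samples=360):
    """Smallest v'Hv over evenly spaced unit vectors in R^2 (sampling only)."""
    values = []
    for k in range(samples):
        angle = 2 * math.pi * k / samples
        values.append(quad_form(H, [math.cos(angle), math.sin(angle)]))
    return min(values)


# Hessian of the hypothetical example x1^2 + x1*x2 + x2^2 (constant in x).
H_convex = [[2.0, 1.0], [1.0, 2.0]]
# Hessian of f^2(x1, x2) = x1*x2 from the gradient examples (constant in x).
H_saddle = [[0.0, 1.0], [1.0, 0.0]]

min_convex = min_quad_form_on_circle(H_convex)
min_saddle = min_quad_form_on_circle(H_saddle)

print(min_convex, min_saddle)
```

The first minimum is strictly positive, consistent with a positive definite Hessian and thus a strictly convex function. The second minimum is negative, so the Hessian of x_1x_2 is not positive semi-definite and the function is not convex.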

 

Integral Theory

For the last bit on multivariate calculus, let us discuss integration, albeit far less extensively than differentiation. The conceptual perspective here is quite the opposite of that of differentiation: while thus far, we were interested in the marginal change of a function f, we now care about its accumulation in the codomain. Actually, it is quite intuitive that we should consider integration and differentiation as “inverse” operations also in a narrow sense. This is because f is the instantaneous change of its accumulation, i.e. the rate at which the area under it accumulates! Accumulation is also an issue of frequent interest to economists, e.g. when we care about aggregation of (the choices of) individual firms/households to a national economy, or when forming expectations about outcomes, where we aggregate all possible events and weight them by their probability.

However, you are likely aware that the derivative operator does not have an inverse, as constants vanish when taking derivatives. To see this, consider F_1(x) = x^2 + 5 and F_2(x) = x^2 - 2: both have the derivative f(x) = 2x, so that more than one function F has the feature F' = f we are looking for. In other words, the derivative is not injective, and thus we cannot invert it!

However, similar to the non-invertible (because non-injective) function f(x) = x^2, we can of course define the preimage of f under the differential operator, i.e. the set D^{-1}[\{f\}] =\{F: D_F = f\} of functions that have f as their derivative, just like we can define f^{-1}[\{y\}] = \{x\in \mathbb R: y=x^2\} as the pre-image of any value y of f(x) = x^2. For the case of univariate functions, you likely know the following characterization:

    \[  \int f(x)dx :=  F(x) + C,\hspace{0.5cm}C\in\mathbb R \]

where F is an antiderivative of f, i.e. a function with F' = f. Recall that the reason for ambiguity in the inverse derivative, or antiderivative, was that constants vanish. Thus, up to said constant, we should be able to uniquely pin down the antiderivative through the function F(x) that does not contain a constant... and we indeed can! The object in this equation is called the indefinite integral of f. Note that the expression is a notational simplification for the pre-image of f under the differential operator, and that we describe a set here, rather than an equation.

Before moving on to the well-defined definite integral, check that you are familiar with the following rules for indefinite integrals:

Theorem: Rules for Indefinite Integrals.
Let f, g be two integrable functions and let a, b, n\in\mathbb R be constants. Then, wherever the expressions are defined,

    • \int (af(x)+bg(x))dx = a \int f(x) dx + b\int g(x)dx,
    • \int x^n dx= \frac{x^{n+1}}{n+1}+C \text{ if } n\neq -1 \text{ and } \int \frac{1}{x}dx = \ln|x| + C,
    •  \int e^x dx = e^x + C \text{ and } \int e^{f(x)}f'(x) dx = e^{f(x)}+C ,
    • \int (f(x))^n f'(x) dx = \frac{1}{n+1}(f(x))^{n+1}+C \text{ if } n\neq -1 and
    • \int \frac{f'(x)}{f(x)}dx= \ln |f(x)| + C .
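
A simple way to convince yourself of such rules is to differentiate the claimed right-hand side and check that the integrand comes back. The sketch below does this numerically for the power rule and the two rules involving e^{f(x)}f'(x) and (f(x))^n f'(x). The inner function f(x) = x^2 + 1, the exponent n = 3 and the evaluation point are hypothetical choices made only for illustration.

```python
import math


def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric difference quotient."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Hypothetical inner function and its derivative (illustrative assumption).
def f(x):
    return x**2 + 1


def f_prime(x):
    return 2 * x


n = 3  # hypothetical exponent, different from -1

# Each entry pairs an integrand with the antiderivative claimed by the rule.
rules = {
    "power rule": (
        lambda x: x**n,
        lambda x: x**(n + 1) / (n + 1),
    ),
    "exponential of an inner function": (
        lambda x: math.exp(f(x)) * f_prime(x),
        lambda x: math.exp(f(x)),
    ),
    "power of an inner function": (
        lambda x: f(x)**n * f_prime(x),
        lambda x: f(x)**(n + 1) / (n + 1),
    ),
}

x0 = 0.7  # arbitrary evaluation point

# Deviation between the derivative of the antiderivative and the integrand.
errors = {
    name: abs(central_difference(antiderivative, x0) - integrand(x0))
    for name, (integrand, antiderivative) in rules.items()
}
for name, err in errors.items():
    print(name, err)
```

Such a check at a single point is of course no proof, but it catches most slips when applying the rules by hand.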

 

Although there is a more formal definition of integrability, it suffices here to understand an integrable function as a function whose integral you can compute with the usual rules. Another important rule, which can be thought of as the reverse of the product rule, is integration by parts:

Theorem: Integration by parts.
Let u,v be two differentiable functions. Then,

    \[\int u(x)v'(x)dx = u(x)v(x)- \int u'(x)v(x)dx.\]
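
As a brief worked illustration (the choice u(x) = x and v'(x) = e^x is our own example), take v(x) = e^x and note that u'(x) = 1. Then,

    \[\int x e^x dx = x e^x - \int 1\cdot e^x dx = x e^x - e^x + C = (x-1)e^x + C.\]

Differentiating the right-hand side with the product rule indeed returns e^x + (x-1)e^x = xe^x.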

 

Definite Integrals and the Fundamental Theorem of Calculus

Remaining with univariate functions for now, we know that a unique function F can be attributed to f:X\to \mathbb R such that D_F = f if we require that F does not contain a constant. For simplicity, let’s focus on convex sets X, i.e. intervals. So, how do we compute F? The idea is quite easy: note that while the antiderivative is not well-defined in general because of the constant C, for any function \tilde F \in D^{-1}[\{f\}], i.e. any function that satisfies \tilde F(x) = F(x) + C for some C\in\mathbb R, and for any specific values x,y\in X, we have that

    \[\tilde F(y) - \tilde F(x) = F(y) + C - (F(x) + C) = F(y) - F(x).\]

Supposing that y\geq x, this can be used to compute the uniquely defined definite integral that tells us the accumulation of f from x to y, that is, on the interval (x,y):

Definition: Definite Integral.
Let X\subseteq\mathbb R be an interval and consider an integrable function f:X\to \mathbb R. Then, the definite integral of f from x\in X to y\in X is

    \[\int_x^yf(t)dt = \tilde F(y) - \tilde F(x),\hspace{0.5cm}\text{where}\hspace{0.5cm} \tilde F\in D^{-1}[\{f\}].\]

 

This gives us the usual rule that you are likely familiar with: to compute \int_x^yf(t)dt, compute the stem function F, and take the difference F(y) - F(x). For instance, the stem function of f(x) = 4x^3 is x^4, such that \int_{-1}^3 4x^3dx = 3^4 - (-1)^4 = 81 - 1 = 80.
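
To connect this rule with the idea of accumulation, the following minimal sketch compares F(3) - F(-1) for the example above with a midpoint Riemann sum, i.e. a sum of many thin rectangles under the graph of f. The helper `midpoint_integral` and the number of rectangles are illustrative assumptions.

```python
def midpoint_integral(func, lower, upper, steps=100_000):
    """Approximate the definite integral of func on [lower, upper]
    by summing rectangles evaluated at the midpoint of each subinterval."""
    width = (upper - lower) / steps
    total = sum(func(lower + (i + 0.5) * width) for i in range(steps))
    return total * width


def f(t):
    return 4 * t**3


def F(t):
    # Stem function of f without a constant.
    return t**4


approx = midpoint_integral(f, -1.0, 3.0)  # accumulation via thin rectangles
exact = F(3.0) - F(-1.0)                  # difference of the stem function
print(approx, exact)
```

The two numbers agree up to a small discretization error that shrinks as the number of rectangles grows.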

Before moving on, keep the following in mind: the inverse of the differential operator D is not generally well-defined. However, any function \tilde F in the pre-image of a function f under D is characterized by a uniquely defined accumulation between any two points x and y, called the definite integral of f. Because we like uniquely defined quantities, we mostly restrict attention to this object. Indeed, when we care about accumulation, as we mostly do when considering antiderivatives, it’s all we need.

Definition: Infimum and Supremum of a Set.
Let X\subseteq\mathbb R. Then, the infimum \inf(X) of X is the largest value that is smaller than or equal to every element of X, i.e. \inf(X) = \max\{a\in\mathbb R:\forall x\in X: x\geq a\}, and the supremum \sup(X) of X is the smallest value that is larger than or equal to every element of X, i.e. \sup(X) = \min\{b\in\mathbb R:\forall x\in X: x\leq b\}.

 

These concepts are a helpful generalization of maximum and minimum, and exist under much more general conditions. For instance, for an interval (a,b), there is no maximum or minimum, but infimum and supremum exist and are equal to a and b, respectively. I need them for the theorem below to ensure that a = \inf(X) always refers to the lower bound of the interval X, be it X = (a,b), X = [a,b), X = (a,b] or X = [a,b], regardless of whether the lower bound is open or closed. Note that we may have a = -\infty.

Theorem: Fundamental Theorem of Calculus.
Let X be an interval in \mathbb R and f:X\to\mathbb R. Let a = \inf(X), suppose that f is integrable, and define F: X\to\mathbb R, x\mapsto \int_a^x f(t)dt. Then, F is differentiable, and

    \[\forall x\in X: F'(x) = D_F(x) = f(x).\]

 

A slightly informal but – especially given the theorem’s importance – conveniently short and manageable proof can be found in the companion script.
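
The statement of the theorem can also be checked numerically. The sketch below builds the accumulation function F(x) = \int_a^x f(t)dt with a midpoint Riemann sum and then differentiates it with a symmetric difference quotient. The function f(t) = \cos(t) + t, the interval X = [0,2] (so that a = 0) and all helper names are hypothetical choices for illustration only.

```python
import math


def midpoint_integral(func, lower, upper, steps=20_000):
    """Midpoint Riemann sum of func on [lower, upper]."""
    width = (upper - lower) / steps
    total = sum(func(lower + (i + 0.5) * width) for i in range(steps))
    return total * width


def central_difference(func, x, h=1e-3):
    """Approximate func'(x) with a symmetric difference quotient."""
    return (func(x + h) - func(x - h)) / (2 * h)


def f(t):
    return math.cos(t) + t


a = 0.0  # infimum of the hypothetical interval X = [0, 2]


def F(x):
    # Accumulation of f from a to x, computed numerically.
    return midpoint_integral(f, a, x)


# Interior points of X at which we compare F'(x) with f(x).
points = [0.5, 1.0, 1.5]
max_gap = max(abs(central_difference(F, x) - f(x)) for x in points)
print("largest deviation of F' from f:", max_gap)
```

The derivative of the numerically accumulated function matches f up to small approximation errors, in line with F'(x) = f(x).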

Multivariate Integrals

As we did with the derivatives, let us extend the notion of the integral to the multivariate case by first looking at a function mapping from \mathbb{R}^2 to \mathbb{R}. If in the univariate case, the definite integral was measuring an area under the graph, it is now only natural to require the definite integral to measure the volume under the graph. In higher dimensions, we have to go on without graphic illustrations, but the concept of “summing up the function values over infinitely small areas of the domain” remains valid. Also, indefinite integrals should still be considered the antiderivative, but now to the multivariate derivative, and again intuition might fail us here. Luckily, for probably all intents and purposes for which you will come across integrals in your master courses, the following theorem will be of practical help:

Theorem: Fubini’s Theorem.
Let X and Y be two intervals in \mathbb{R}, let f: X\times Y \rightarrow \mathbb{R} and suppose that f is continuous. Then, for any I =I_x \times I_y\subseteq X\times Y with intervals I_x\subseteq X and I_y\subseteq Y,

    \[ \int_I f(x,y)d(x,y) = \int_{I_x} \biggl( \int_{I_y} f(x,y) dy \biggr) dx,\]

and all the integrals on the right-hand side are well-defined.

 

It tells us that when concerned with a multi-dimensional integral, we can integrate with respect to each dimension (or fundamental direction) “in isolation” or rather, integrate in an arbitrary succession with respect to all the single variables. The theorem is pretty powerful as it only needs continuity of the function as a prerequisite, and then allows you to reduce a multivariate integral to a lower-dimensional one! You can also apply the theorem repeatedly if you are faced with higher dimensional integrals, so that

    \[ \int_I f(x_1,\ldots,x_n)d(x_1,\ldots,x_n) = \int_{I_1}\left (\cdots\left (\int_{I_n}f(x_1,\ldots,x_n) dx_n\right )\cdots \right )dx_1. \]

Thus, a scheme applies that is very similar to what we have seen for differentiation of multivariate functions: if the operation can be performed, that is, here if we can integrate the function f, so long as f satisfies a continuity condition, then the multivariate version of the operation can be computed by repeatedly applying the univariate concept subject to a certain scheme of ordering!
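
To see the interchangeability of the order of integration at work, the following minimal sketch computes an integral over a rectangle as two nested univariate midpoint sums, once integrating over y first and once over x first. The function f(x,y) = x^2 + xy, the rectangle [0,1]\times[0,2] and the helper names are hypothetical choices. For this example, integrating by hand in either order gives 5/3.

```python
def midpoint_integral(func, lower, upper, steps=400):
    """Midpoint Riemann sum of a univariate function on [lower, upper]."""
    width = (upper - lower) / steps
    total = sum(func(lower + (i + 0.5) * width) for i in range(steps))
    return total * width


def f(x, y):
    return x**2 + x * y


# Hypothetical rectangle I = I_x x I_y.
x_low, x_high = 0.0, 1.0
y_low, y_high = 0.0, 2.0

# Inner integral over y for each fixed x, then outer integral over x.
dy_then_dx = midpoint_integral(
    lambda x: midpoint_integral(lambda y: f(x, y), y_low, y_high),
    x_low, x_high,
)

# Inner integral over x for each fixed y, then outer integral over y.
dx_then_dy = midpoint_integral(
    lambda y: midpoint_integral(lambda x: f(x, y), x_low, x_high),
    y_low, y_high,
)

print(dy_then_dx, dx_then_dy)
```

Both orders of integration approximate the same value, which is what allows us to pick whichever order is more convenient in a given application.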

For a final property, note that linearity of the integral implies especially that we can pull constants with respect to the integrating variable x, i.e. any expression c(z) that may depend on arbitrary variables z but not on x, out of the integral, so that \int c(z)f(x) dx = c(z) \int f(x)dx. Thus, we obtain the following corollary of Fubini’s theorem:

Corollary: Integration of Multiplicatively Separable Functions.
Let X_f \subseteq \mathbb{R}^n, X_g\subseteq\mathbb R^m, and let f:X_f \rightarrow \mathbb{R}, g: X_g \rightarrow \mathbb{R} be continuous functions. Then, for any intervals A\subseteq X_f, B\subseteq X_g,

    \[ \int_{A \times B}f(x)g(y)d(x,y) = \biggl( \int_A f(x) dx \biggr) \biggl( \int_B g(y) dy \biggr).\]

 

This follows directly from applying Fubini’s Theorem. Note that f and g can be multivariate, so that whenever you can separate a function into two factors that depend on disjoint subsets of variables, you can multiplicatively separate integration! An important economic example is the Cobb-Douglas production function: Suppose that firms’ stocks of capital k and labor l are independently and uniformly distributed on [0,1] (it is not too important what this means here, it just ensures that the first equality below holds), and that individual level output is y = f(k,l) = Ak^\alpha l^{1-\alpha}. Here, the corollary for multiplicatively separable functions can help us determine the aggregated output of the whole economy Y as

    \[\begin{split} Y &= \int_{[0,1]\times[0,1]}f(k,l)d(k,l) = \int_{[0,1]\times[0,1]}Ak^\alpha l^{1-\alpha}d(k,l) = A \left (\int_{[0,1]} k^\alpha dk\right )\left( \int_{[0,1]}l^{1-\alpha} dl\right ) \\&= A \left (\left [\frac{k^{1+\alpha}}{1+\alpha}\right ]_{k=0}^{k=1}\right )\left (\left [\frac{l^{2-\alpha}}{2-\alpha}\right ]_{l=0}^{l=1}\right ) = \frac{A}{(1+\alpha)(2-\alpha)}. \end{split}\]
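
The aggregate output formula can be checked numerically. The sketch below compares three quantities: a nested midpoint sum over the unit square, the product of the two univariate integrals, and the closed form A/((1+\alpha)(2-\alpha)). The parameter values A = 2 and \alpha = 0.3, as well as the helper names and grid sizes, are hypothetical choices for illustration.

```python
def midpoint_integral(func, lower, upper, steps):
    """Midpoint Riemann sum of a univariate function on [lower, upper]."""
    width = (upper - lower) / steps
    total = sum(func(lower + (i + 0.5) * width) for i in range(steps))
    return total * width


A = 2.0      # hypothetical productivity level
alpha = 0.3  # hypothetical capital exponent


def output(k, l):
    # Cobb-Douglas output of a firm with capital k and labor l.
    return A * k**alpha * l**(1 - alpha)


# Double integral over [0,1] x [0,1] as two nested univariate sums.
double_integral = midpoint_integral(
    lambda k: midpoint_integral(lambda l: output(k, l), 0.0, 1.0, 200),
    0.0, 1.0, 200,
)

# Product of the two separate univariate integrals, as in the corollary.
capital_part = midpoint_integral(lambda k: k**alpha, 0.0, 1.0, 10_000)
labor_part = midpoint_integral(lambda l: l**(1 - alpha), 0.0, 1.0, 10_000)
product_of_integrals = A * capital_part * labor_part

closed_form = A / ((1 + alpha) * (2 - alpha))
print(double_integral, product_of_integrals, closed_form)
```

All three numbers should be close. The nested sum is the least precise here because it uses a coarser grid and the integrand has an unbounded slope at zero.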

To conclude this section on integrals, take away that (i) the differential operator can generally not be inverted, (ii) the definite integral, referring to the accumulation of a function between two points, can be well-defined nonetheless, and corresponds to the usual integral you are familiar with, and (iii) that like differentiation, we can handle multivariate integration by applying techniques for univariate functions according to a certain scheme of ordering, which applies under rather general conditions.

If you feel like testing your understanding of the concepts discussed in this chapter since the last quiz, you can take a short quiz found here