# An Introduction to Structural Econometrics in Julia

This tutorial is adapted from my Julia introductory lecture taught in the graduate course Practical Computing for Economists, Department of Economics, University of Chicago.

The tutorial is in 5 parts:

1. Installing Julia + Juno IDE, as well as useful packages
2. Defining a structural econometric challenge
3. Data generation, management, and regression visualization
4. Numerical simulation of optimal agent behavior under constraints
5. Parallelized estimation by the Method of Simulated Moments

Here is the GitHub repository to replicate the results in this tutorial.

## 1. Installing Julia + Juno IDE, as well as useful packages

Perhaps the greatest obstacle to using Julia in the past has been the absence of an easy-to-install IDE. There used to be an IDE called Julia Studio which was as easy to use as the popular RStudio for R. Back then, you could install and run Julia + Julia Studio in 5mins, compared to the hours it could take to install Python and its basic packages and IDE. When Julia version 0.3.X was released, Julia Studio no longer worked, and I recommended the IJulia Notebook, which requires the installation of Python and IPython just to use Julia, so any argument that Julia is more convenient to install than Python was lost.

Now, with Julia version 0.4.X, Juno has provided an excellent IDE that comes pre-bundled with Julia for convenience, and you can install Julia + Juno IDE in 5mins. Here are some instructions to help you through the installation process:

2. After the brief download (the entire Julia language + Juno IDE is less than 1GB), open the file and click through the installation instructions.
3. Open Juno, and try to run a very simple line of code. For example, type 2+2, highlight this text, right-click, and choose the option Evaluate. A bubble should display 4 next to the line of code.
• Trouble-shooting: On my Mac running OS X Mavericks, 2+2 failed and an unhelpful error was produced. After some searching, I found that the solution was to install the Jewel package. To install Jewel from within Juno, just type Pkg.add(“Jewel”), highlight this text, and Evaluate. After this, 2+2 was successful.
4. You have successfully installed Julia + Juno IDE. Now, you will want to run the following codes to install some other packages used in the econometric exercises below:
Pkg.update()
Pkg.build("Ipopt")
Pkg.build("JuMP")


## 2. Defining a structural econometric challenge

To motivate our application, we consider a very simple economic model, which I have taught previously in the mathematical economics course for undergraduates at the University of Chicago. Although the model is analytically simple, the econometrics become sufficiently complicated to warrant the Method of Simulated Moments, so this serves us well as a teachable case.

Let $c_i \geq 0$ denote consumption and $0 \leq l_i \leq 1$ denote leisure. Consider an agent who wishes to maximize Cobb-Douglas utility over consumption and leisure, that is,

$U(c_i,l_i) =$ $c^\gamma_i l^{1-\gamma}_i$.

where $0 \leq \gamma \leq 1$ is the relative preference for consumption. The budget constraint is given by,

$c_i \leq (1-\tau)w_i(1-l_i)+\epsilon_i$,

where $w_i$ is the wage observed in the data, $\epsilon_i$ is other income that is not observed in the data, and $\tau$ is the tax rate.

The agent’s problem is to maximize $U(c_i,l_i)$ subject to the budget constraint. We assume that non-labor income is uncorrelated with the wage offer, so that $\mathbb{E}[\epsilon_i | w_i]=0$. Although this assumption is a bit unrealistic, as we expect high-wage agents to also tend to have higher non-labor income, it helps keep the example simple. The model is also a bit contrived in that we treat the tax rate as unobservable, but this only makes our job more difficult.

The goal of the econometrician is to identify the model parameters $\gamma$ and $\tau$ from the data $(c_i,l_i,w_i)^N_{i=1}$ and the assumed structure. In particular, the econometrician is interested in the policy-relevant parameter $\bar\psi \equiv \frac{1}{N}\sum^N_{i=1}\psi(w_i)$, where,

$\psi(w_i) \equiv \mathbb{E}_{\epsilon} \frac{\partial}{\partial \tau} C(w_i,\epsilon; \gamma, \tau)$,

and $C(\cdot)$ denotes the demand for consumption. $\psi(w_i)$ is the marginal propensity for an agent with wage $w_i$ to consume in response to the tax rate. $\bar{\psi}$ is the population average marginal propensity to consume in response to the tax rate. Of course, we can solve the model analytically to find that $\psi(w_i) = -\gamma w_i$ and $\bar\psi = -\gamma \bar{w}$, where $\bar{w}$ is the average wage, but we will show that the numerical methods achieve the correct answer even when we cannot solve the model.

## 3. Data generation, management, and regression visualization

The replication code for this section is available here.

To generate data that follows the above model, we first solve analytically for the demand functions for consumption and leisure. In particular, they are,

$C(w_i,\epsilon_i; \gamma, \tau) = \gamma (1-\tau) w_i + \gamma \epsilon_i$

$L(w_i,\epsilon_i; \gamma, \tau) = (1-\gamma) + \frac{(1-\gamma) \epsilon_i}{ (1-\tau) w_i}$

Thus, we need only draw values of $w_i$ and $\epsilon_i$, as well as choose parameter values for $\gamma$ and $\tau$, in order to generate the values of $c_i$ and $l_i$ that agents in this model would choose. We implement this in Julia as follows:

####### Set Simulation Parameters #########
srand(123)           # set the seed to ensure reproducibility
N = 1000             # set number of agents in economy
gamma = .5           # set Cobb-Douglas relative preference for consumption
tau = .2             # set tax rate

####### Draw Income Data and Optimal Consumption and Leisure #########
epsilon = randn(N)                                               # draw unobserved non-labor income
wage = 10+randn(N)                                               # draw observed wage
consump = gamma*(1-tau)*wage + gamma*epsilon                     # Cobb-Douglas demand for c
leisure = (1.0-gamma) + ((1.0-gamma)*epsilon)./((1.0-tau)*wage)  # Cobb-Douglas demand for l


This code is relatively self-explanatory. Our parameter choices are $N=1000$, $\epsilon_i \sim \mathcal{N}(0,1)$, $\gamma=1/2$, and $\tau=1/5$. We draw the wage to have distribution $w_i \sim \mathcal{N}(10,1)$, but this is arbitrary.

We combine the variables into a DataFrame, and export the data as a CSV file. In order to better understand the data, we also non-parametrically regress $c_i$ on $w_i$, and plot the result with Gadfly. The Julia code is as follows:

####### Organize, Describe, and Export Data #########
using DataFrames
df = DataFrame(consump=consump,leisure=leisure,wage=wage,epsilon=epsilon)  # create data frame
plot_c = plot(df,x=:wage,y=:consump,Geom.smooth(method=:loess))            # plot E[consump|wage] using Gadfly
draw(SVG("plot_c.svg", 4inch, 4inch), plot_c)                              # export plot as SVG
writetable("consump_leisure.csv",df)                                       # export data as CSV


Again, the code is self-explanatory. The regression graph produced by the plot function is:

## 4. Numerical simulation of optimal agent behavior under constraints

The replication code for this section is available here.

We now use constrained numerical optimization to generate optimal consumption and leisure data without analytically solving for the demand function. We begin by importing the data and the necessary packages:

####### Prepare for Numerical Optimization #########

using DataFrames
using JuMP
using Ipopt
N = size(df)[1]


Using the JuMP syntax for non-linear modeling, first we define an empty model associated with the Ipopt solver, and then add $N$ values of $c_i$ and $N$ values of $l_i$ to the model:

m = Model(solver=IpoptSolver())    # define empty model solved by Ipopt algorithm
@defVar(m, c[i=1:N] >= 0)       # define positive consumption for each agent
@defVar(m, 0 <= l[i=1:N] <= 1)  # define leisure in [0,1] for each agent 

This syntax is especially convenient, as it allows us to define vectors of parameters, each satisfying the natural inequality constraints. Next, we define the budget constraint, which also follows this convenient syntax:

 @addConstraint(m, c[i=1:N] .== (1.0-t)*(1.0-l[i]).*w[i] + e[i] )        # each agent must satisfy the budget constraint

Finally, we define a scalar-valued objective function, which is the sum of each individual’s utility:

 @setNLObjective(m, Max, sum{ g*log(c[i]) + (1-g)*log(l[i]) , i=1:N } )  # maximize the sum of utility across all agents

Notice that we can optimize one objective function instead of optimizing $N$ objective functions because the individual constrained maximization problems are independent across individuals, so the maximum of the sum is the sum of the maxima. Finally, we can apply the solver to this model and extract optimal consumption and leisure as follows:

 status = solve(m)                                                       # run numerical optimization c_opt = getValue(c)                                                     # extract demand for c l_opt = getValue(l)                                                     # extract demand for l

To make sure it worked, we compare the consumption extracted from this numerical approach to the consumption we generated previously using the true demand functions:

 cor(c_opt,array(df[:consump])) 0.9999999998435865

Thus, we consumption values produced by the numerically optimizer’s approximation to the demand for consumption are almost identical to those produced by the true demand for consumption. Putting it all together, we create a function that can solve for optimal consumption and leisure given any particular values of $\gamma$, $\tau$, and $\epsilon$:

 function hh_constrained_opt(g,t,w,e)      m = Model(solver=IpoptSolver())                                         # define empty model solved by Ipopt algorithm   @defVar(m, c[i=1:N] >= 0)                                               # define positive consumption for each agent
@defVar(m, 0 <= l[i=1:N] <= 1)                                          # define leisure in [0,1] for each agent
@addConstraint(m, c[i=1:N] .== (1.0-t)*(1.0-l[i]).*w[i] + e[i] )        # each agent must satisfy the budget constraint
@setNLObjective(m, Max, sum{ g*log(c[i]) + (1-g)*log(l[i]) , i=1:N } )  # maximize the sum of utility across all agents
status = solve(m)                                                       # run numerical optimization
c_opt = getValue(c)                                                     # extract demand for c
l_opt = getValue(l)                                                     # extract demand for l
demand = DataFrame(c_opt=c_opt,l_opt=l_opt)                             # return demand as DataFrame
end

hh_constrained_opt(gamma,tau,array(df[:wage]),array(df[:epsilon]))          # verify that it works at the true values of gamma, tau, and epsilon


## 5. Parallelized estimation by the Method of Simulated Moments

The replication codes for this section are available here.

We saw in the previous section that, for a given set of model parameters $\gamma$ and $\tau$ and a given draw of $\epsilon_{i}$ for each $i$, we have enough information to simulation $c_{i}$ and $l_{i}$, for each $i$. Denote these simulated values by $\hat{c}_{i}\left(\epsilon;\gamma,\tau\right)$ and $\hat{l}_{i}\left(\epsilon;\gamma,\tau\right)$. With these, we can define the moments,

$\hat{m}\left(\gamma,\tau\right)=\mathbb{E}_{\epsilon}\left[\begin{array}{c} \frac{1}{N}\sum_{i}\left[\hat{c}_{i}\left(\epsilon\right)-c_{i}\right]\\ \frac{1}{N}\sum_{i}\left[\hat{l}_{i}\left(\epsilon\right)-l_{i}\right] \end{array}\right]$

which is equal to zero under the model assumptions. A method of simulated moments (MSM) approach to estimate $\gamma$ and $\tau$ is then,

$\left(\hat{\gamma},\hat{\tau}\right)=\arg\min_{\gamma\in\left[0,1\right],\tau\in\left[0,1\right]}\hat{m}\left(\gamma,\tau\right)'W\hat{m}\left(\gamma,\tau\right)$

where $W$ is a $2\times2$ weighting matrix, which is only relevant when the number of moments is greater than the number of parameters, which is not true in our case, so $W$ can be ignored and the method of simulated moments simplifies to,

$\left(\hat{\gamma},\hat{\tau}\right)=\arg\min_{\gamma\in\left[0,1\right],\tau\in\left[0,1\right]}\left\{ \mathbb{E}_{\epsilon}\left[\frac{1}{N}\sum_{i}\left[\hat{c}_{i}\left(\epsilon\right)-c_{i}\right]\right]\right\} ^{2}+\left\{ \mathbb{E}_{\epsilon}\left[\frac{1}{N}\sum_{i}\left[\hat{l}_{i}\left(\epsilon\right)-l_{i}\right]\right]\right\} ^{2}$

Assuming we know the distribution of $\epsilon_i$, we can simply draw many values of $\epsilon_i$ for each $i$, and average the moments together across all of the draws of $\epsilon_i$. This is Monte Carlo numerical integration. In Julia, we can create this objective function with a random draw of $\epsilon$ as follows:

function sim_moments(params)
this_epsilon = randn(N)                                                     # draw random epsilon
ggamma,ttau = params                                                        # extract gamma and tau from vector
this_demand = hh_constrained_opt(ggamma,ttau,array(df[:wage]),this_epsilon) # obtain demand for c and l
c_moment = mean( this_demand[:c_opt] ) - mean( df[:consump] )               # compute empirical moment for c
l_moment = mean( this_demand[:l_opt] ) - mean( df[:leisure] )               # compute empirical moment for l
[c_moment,l_moment]                                                         # return vector of moments
end


In order to estimate $\hat{m}\left(\gamma,\tau\right)$, we need to run sim_moments(params) many times and take the unweighted average across them to achieve the expectation across $\epsilon_i$. Because each calculation is computer-intensive, it makes sense to compute the contribution of $\hat{m}\left(\gamma,\tau\right)$ for each draw of $\epsilon_i$ on a different processor and then average across them.

Previously, I presented a convenient approach for parallelization in Julia. The idea is to initialize processors with the addprocs() function in an “outer” script, then import all of the needed data and functions to all of the different processors with the require() function applied to an “inner” script, where the needed data and functions are already managed by the inner script. This is incredibly easy and much simpler than the manual spawn-and-fetch approaches suggested by Julia’s official documentation.

In order to implement the parallelized method of simulated moments, the function hh_constrained_opt() and sim_moments() are stored in a file called est_msm_inner.jl. The following code defines the parallelized MSM and then minimizes the MSM objective using the optimize command set to use the Nelder-Mead algorithm from the Optim package:

####### Prepare for Parallelization #########

print(nprocs())               # Now there are 4 active processors
require("est_msm_inner.jl")   # This distributes functions and data to all active processors

####### Define Sum of Squared Residuals in Parallel #########

function parallel_moments(params)
params = exp(params)./(1.0+exp(params))   # rescale parameters to be in [0,1]
results = @parallel (hcat) for i=1:numReps
sim_moments(params)
end
avg_c_moment = mean(results[1,:])
avg_l_moment = mean(results[2,:])
SSR = avg_c_moment^2 + avg_l_moment^2
end

####### Minimize Sum of Squared Residuals in Parallel #########

using Optim
function MSM()
println(out)                                       # verify convergence
exp(out.minimum)./(1.0+exp(out.minimum))           # return results in rescaled units
end


Parallelization is performed by the @parallel macro, and the results are horizontally concatenated from the various processors by the hcat command. The key tuning parameter here is numReps, which is the number of draws of $\epsilon$ to use in the Monte Carlo numerical integration. Because this example is so simple, a small number of repetitions is sufficient, while a larger number would be needed if $\epsilon$ entered the model in a more complicated manner. The process is run as follows and requires 268 seconds to run on my Macbook Air:

numReps = 12                                         # set number of times to simulate epsilon
gamma_MSM, tau_MSM = MSM()                           # Perform MSM
gamma_MSM
0.49994494921381816
tau_MSM
0.19992279518894465


Finally, given the MSM estimates of $\gamma$ and $\tau$, we define the numerical derivative, $\frac{df(x)}{dx} \approx \frac{f(x+h)-f(x-h)}{2h}$, for some small $h$, as follows:

function Dconsump_Dtau(g,t,h)
opt_plus_h = hh_constrained_opt(g,t+h,array(df[:wage]),array(df[:epsilon]))
opt_minus_h = hh_constrained_opt(g,t-h,array(df[:wage]),array(df[:epsilon]))
(mean(opt_plus_h[:c_opt]) - mean(opt_minus_h[:c_opt]))/(2*h)
end

barpsi_MSM = Dconsump_Dtau(gamma_MSM,tau_MSM,.1)
-5.016610457903023


Thus, we estimate the policy parameter $\bar\psi$ to be approximately $-5.017$ on average, while the true value is $\bar\psi = -\gamma \bar{w} = -(1/2)\times 10=-5$, so the econometrician’s problem is successfully solved.

# Selection Bias Corrections in Julia, Part 1

Selection bias arises when a data sample is not a random draw from the population that it is intended to represent. This is especially problematic when the probability that a particular individual appears in the sample depends on variables that also affect the relationships we wish to study. Selection bias corrections based on models of economic behavior were pioneered by the economist James J. Heckman in his seminal 1979 paper.

For an example of selection bias, suppose we wish to study the effectiveness of a treatment (a new medicine for patients with a particular disease, a preschool curriculum for children facing particular disadvantages, etc.). A random sample is drawn from the population of interest, and the treatment is randomly assigned to a subset of this sample, with the remaining subset serving as the untreated (“control”) group. If the subsets followed instructions, then the resulting data would serve as a random draw from the data generating process that we wish to study.

However, suppose the treatment and control groups do not comply with their assignments. In particular, if only some of the treated benefit from treatment while the others are in fact harmed by treatment, then we might expect the harmed individuals to leave the study. If we accepted the resulting data as a random draw from the data generating process, it would appear that the treatment was much more successful than it actually was; an individual who benefits is more likely to be present in the data than one who does not benefit.

Conversely, if treatment were very beneficial, then some individuals in the control group may find a way to obtain treatment, possibly without our knowledge. The benefits received by the control group would make it appear that the treatment was less beneficial than it actually was; the receipt of treatment is no longer random.

In this tutorial, I present some parameterized examples of selection bias. Then, I present examples of parametric selection bias corrections, evaluating their effectiveness in recovering the data generating processes. Along the way, I demonstrate the use of the GLM package in Julia. A future tutorial demonstrates non-parametric selection bias corrections.

Example 1: Selection on a Normally-Distributed Unobservable

Suppose that we wish to study of the effect of the observable variable $X_{i}$ on $Y_i$. The data generating process is given by,

$Y_i=\beta_0+X_i\beta_1+\epsilon_i$,

where $X_i$ and $\epsilon_i$ are independent in the population. Because of this independence condition, the ordinary least squares estimator would be unbiased if the data $(Y,X)$ were drawn randomly from the population. However, suppose that the probability that individual $i$ were in the data set $\mathcal{S}$ were a function of $X$ and $\epsilon$. For example, suppose that,

$\Pr\left( i \in \mathcal{S}\Big| X_i,\epsilon_i\right)=1$ if $X_i > \epsilon_i$,

and the probability is zero otherwise.  This selection rule ensures that, among the individuals in the data (the $i$ in $\mathcal{S}$), the covariance between $X$ and $\epsilon$ will be positive, even though the covariance is zero in the population. When $X$ covaries positively with $\epsilon$, then the OLS estimate of $\beta_1$ is biased upward, i.e., the OLS estimator converges to a value that is greater than $\beta_1$.

To see the problem, consider the following simulation of the above process in which $X$ and $\epsilon$ are drawn as independent standard normal random variables:

srand(2)
N = 1000
X = randn(N)
epsilon = randn(N)
beta_0 = 0.
beta_1 = 1.
Y = beta_0 + beta_1.*X + epsilon
populationData = DataFrame(Y=Y,X=X,epsilon=epsilon)
selected = X.>epsilon
sampleData = DataFrame(Y=Y[selected],X=X[selected],epsilon=epsilon[selected])


There are 1,000 individuals in the population data, but 492 of them are selected to be included in the sample data. The covariance between X and epsilon in the population data is,

cov(populationData[:X],populationData[:epsilon])
0.00861456704877879


which is approximately zero, but in the sample data, it is,

cov(sampleData[:X],sampleData[:epsilon])
0.32121357192108513


which is approximately 0.32.

Now, we regress $Y$ on $X$ with the two data sets to obtain,

linreg(array(populationData[:X]),array(populationData[:Y]))
2-element Array{Float64,1}:
-0.027204
1.00882

linreg(array(sampleData[:X]),array(sampleData[:Y]))
2-element Array{Float64,1}:
-0.858874
1.4517


where, in Julia 0.3.0, the command array() replaces the old command matrix() in converting a DataFrame into a numerical Array. This simulation demonstrates severe selection bias associated with using the sample data to estimate the data generating process instead of the population data, as the true parameters, $\beta_0=0,\beta_1=1$, are not recovered by the sample estimator.

Correction 1: Heckman (1979)

The key insight of Heckman (1979) is that the correlation between $X$ and $\epsilon$ can be represented as an omitted variable in the OLS moment condition,

$\mathbb{E}\left[Y_i\Big|X_i,i\in\mathcal{S}\right]=\mathbb{E}\left[\beta_0 + \beta_1X_i+\epsilon_i\Big|X_i,i\in\mathcal{S}\right]=\beta_0 + \beta_1X_i+\mathbb{E}\left[\epsilon_i\Big|X_i,i\in\mathcal{S}\right]$,

Furthermore, using the conditional density of the standard Normally distributed $\epsilon$,

$\mathbb{E}\left[\epsilon_i\Big|X_i,i\in\mathcal{S}\right]=\mathbb{E}\left[\epsilon_i\Big|X_i,X_i>\epsilon_i\right]=\frac{-\phi\left(X_i\right)}{\Phi\left(X_i\right)}$.

which is called the inverse Mills ratio, where $\phi$ and $\Phi$ are the probability and cumulative density functions of the standard normal distribution. As a result, the moment condition that holds in the sample is,

$\mathbb{E}\left[Y_i\Big|X_i,i\in\mathcal{S}\right]=\beta_0 + \beta_1X_i+\frac{-\phi\left(X_i\right)}{\Phi\left(X_i\right)}$

Returning to our simulation, the inverse Mills ratio is added to the sample data as,

sampleData[:invMills] = -pdf(Normal(),sampleData[:X])./cdf(Normal(),sampleData[:X])


Then, we run the regression corresponding to the sample moment condition,

linreg(array(sampleData[[:X,:invMills]]),array(sampleData[:Y]))
3-element Array{Float64,1}:
-0.166541
1.05648
0.827454


We see that the estimate for $\beta_1$ is now 1.056, which is close to the true value of 1, compared to the non-corrected estimate of 1.452 above. Similarly, the estimate for $\beta_0$ has improved from -0.859 to -0.167, when the true value is 0. To see that the Heckman (1979) correction is consistent, we can increase the sample size to $N=100,000$, which yields the estimates,

linreg(array(sampleData[[:X,:invMills]]),array(sampleData[:Y]))
3-element Array{Float64,1}:
-0.00417033
1.00697
0.991503


which are very close to the true parameter values.

Note that this analysis generalizes to the case in which $X$ contains $K$ variables and the selection rule is,

$\delta_0 + \delta_1 X_1 + \delta_2 X_2 + \ldots + \delta_K X_k > \epsilon$,

which is the case considered by Heckman (1979). The only difference is that the coefficients $\delta_k$ must first be estimated by regressing an indicator for $i \in \mathcal{S}$ on $X$, then using the fitted equation within the inverse Mills ratio. This requires that we observe $X$ for $i \notin \mathcal{S}$. Probit regression is covered in a slightly different context below.

As a matter of terminology, the process of estimating $\delta$ is called the “first stage”, and estimating $\beta$ conditional on the estimates of $\delta$ is called the “second stage”. When the coefficient on the inverse Mills ratio is positive, it is said that “positive selection” has occurred, with “negative selection” otherwise. Positive selection means that, without the correction, the estimate of $\beta_1$ would have been upward-biased, while negative selection results in a downward-biased estimate. Finally, because the selection rule is driven by an unobservable variables $\epsilon$, this is a case of “selection on unobservables”. In the next section we consider a case of “selection on observables”.

Example 2: Probit Selection on Observables

Suppose that we wish to know the mean and variance of $Y$ in the population. However, our sample of $Y$ suffers from selection bias. In particular, there is some $X$ such that the probability of observing $Y$ depends on $X$ according to,

$\Pr\left(i\in\mathcal{S}\Big| X_i\right)=F\left(X_i\right)$,

where $F$ is some function with range $[0,1]$. Notice that, if $Y$ and $X$ were independent, then the resulting sample distribution of $Y$ would be a random draw from the population (marginal distribution) of $Y$. Instead, we suppose $\mathrm{Cov}\left(X,Y\right)\neq 0$. For example,

srand(2)
N = 10000
populationData = DataFrame(rand(MvNormal([0,0.],[1 .5;.5 1]),N)')
names!(populationData,[:X,:Y])

mean(array(populationData),1)
1x2 Array{Float64,2}:
-0.0281916  -0.022319

cov(array(populationData))
2x2 Array{Float64,2}:
0.98665   0.500912
0.500912  1.00195


In this simulated population, the estimated mean and variance of $Y$ are -0.022 and 1.002, and the covariance between $X$ and $Y$ is 0.501. Now, suppose the probability that $Y_i$ is observed is a probit regression of $X_i$,

$\Pr\left( i\in\mathcal{S}\Big| X_i \right) = \Phi\left( \beta_0 + \beta_1 X_i \right)$,

where $\Phi$ is the CDF of the standard normal distribution. Letting $D_i=1$ indicate that $i \in \mathcal{S}$, we can generate the sample selection rule $D$ as,

beta_0 = 0
beta_1 = 1
index = (beta_0 + beta_1*data[:X])
probability = cdf(Normal(0,1),index)
D = zeros(N)
for i=1:N
D[i] = rand(Bernoulli(probability[i]))
end
populationData[:D] = D
sampleData = populationData
sampleData[D.==0,:Y] = NA


The sample data has missing values in place of $Y_i$ if $D_i=0$. The estimated mean and variance of $Y$ in the sample data are 0.275 (which is too large) and 0.862 (which is too small).

Correction 2: Inverse Probability Weighting

The reason for the biased estimates of the mean and variance of $Y$ in Example 2 is sample selection on the observable $X$. In particular, certain values of $Y$ are over-represented due to their relationship with $X$. Inverse probability weighting is a way to correct for the over-representation of certain types of individuals, where the “type” is captured by the probability of being included in the sample.

In the above simulation, conditional on appearing in the population, the probability that an individual of type $X_i=1$ is included in the sample is 0.841. By contrast, the probability that an individual of type $X_i=0$ is included in the sample is 0.5, so type $X_i$ is over-represented by a factor of 0.841/0.5 = 1.682. If we could reduce the impact that type $X_i=1$ has in the computation of the mean and variance of $Y$ by a factor of 1.682, we would alter the balance of types in the sample to match the balance of types in the population. Inverse probability weighting generalizes this logic by weighting each individual’s impact by the inverse of the probability that this individual appears in the sample.

Before we can make the correct, we must first estimate the probability of sample inclusion. This can be done by fitting the probit regression above by least-squares. For this, we use the GLM package in Julia, which can be installed the usual way with the command Pkg.add(“GLM”).

using GLM
Probit = glm(D ~ X, sampleData, Binomial(), ProbitLink())
DataFrameRegressionModel{GeneralizedLinearModel,Float64}:

Coefficients:
Estimate Std.Error  z value Pr(>|z|)
(Intercept)  0.114665  0.148809 0.770554   0.4410
X             1.14826   0.21813  5.26414    1e-6

estProb = predict(Probit)
weights = 1./estProb[D.==1]/sum(1./estProb[D.==1])


which are the inverse probability weights needed to match the sample distribution to the population distribution.

Now, we use the inverse probability weights to correct the mean and variance estimates of $Y$,

correctedMean = sum(sampleData[D.==1,:Y].*weight)
-0.024566923132025013

correctedVariance = (N/(N-1))*sum((sampleData[D.==1,:Y]-correctedMean).^2.*weight)
1.0094029613131092


which are very close to the population values of -0.022319 and 1.00195. The logic here extends to the case of multivariate $X$, as more coefficients are added to the Probit regression. The logic also extends to other functional forms of $F$, for example, switching from Probit to Logit is achieved by replacing the ProbitLink() with LogitLink() in the glm() estimation above.

Example 3: Generalized Roy Model

For the final example of this tutorial, we consider a model which allows for rich, realistic economic behavior. In words, the Generalized Roy Model is the economic representation of a world in which each individual must choose between two options, where each option has its own benefits, and one of the options costs more than the other. In math notation, the first alternative, denoted $D_i=1$, relates the outcomes $Y_i$ to the individual’s observable characteristics, $X_i$, by,

$Y_{1,i} = \mu_1\left(X_i\right)+U_{1,i}$.

Similarly, the second alternative, denoted $D_i=0$, relates $Y_i$ to $X_i$, by,

$Y_{0,i} = \mu_0\left(X_i\right)+U_{0,i}$.

The value of $Y_i$ that appears in our sample is thus given by,

$Y_i = D_iY_{1,i} + (1-D_i)Y_{0,i}$.

Finally, the value of $D_i$ is chosen by individual $i$ according to,

$D_i=1$ if $Y_{1,i}-Y_{0,i}-C_i>0$,

where $C_i$ is the cost of choosing the alternative $D_i=1$ and is given by,

$C_i = \mu_C\left(X_i,Z_i\right)+U_{C,i}$,

where $Z$ contains additional characteristics of $i$ that are not included in $X$.

We assume that the data only contains $Y_i,D_i,X_i,Z_i$; it does not contain the variables $Y_{i,1},Y_{i,0},U_{i,1},U_{i,0},U_{i,C}$ or the functions $\mu_1,\mu_0,\mu_C$. Assuming that the three $\mu$ functions follow the linear form and that the unobservables $U$ are independent and Normally distributed, we can simulate the data generating process as,

srand(2)
N = 1000
sampleData = DataFrame(rand(MvNormal([0,0.],[1 .5; .5 1]),N)')
names!(sampleData,[:X,:Z])
U1 = rand(Normal(0,.5),N)
U0 = rand(Normal(0,.7),N)
UC = rand(Normal(0,.9),N)
betas1 = [0,1]
betas0 = [.3,.2]
betasC = [.1,.1,.1]
Y1 = betas1[1] + betas1[2].*sampleData[:X] + U1
Y0 = betas0[1] + betas0[2].*sampleData[:X] + U0
C = betasC[1] + betasC[2].*sampleData[:X] + betasC[3].*sampleData[:Z] + UC
D = Y1-Y0-C.>0
Y = D.*Y1 + (1-D).*Y0
sampleData[:D] = D
sampleData[:Y] = Y


In this simulation, about 38% of individuals choose the alternative $D_i=1$. About 10% of individuals choose $D_i=0$ even though they receive greater benefits under $D_i=1$ due to the high cost $C_i$ associated with $D_i=1$.

Solution 3: Heckman Correction for Generalized Roy Model

The identification of this model is attributable to Heckman and Honore (1990). Estimation proceeds in steps. The first step is to notice that the left- and right-hand terms in the following moment equation motivate a Probit regression:

$\mathbb{E}\left[D_i\Big|X_i,Z_i\right]=\Pr\left(\mu_D\left(X_i,Z_i\right)>U_{D,i}\Big|X_i,Z_i\right)=\Phi\left(\mu_D\left(X_i,Z_i\right)\right)$,

where $U_{D,i}\equiv -\left(U_{1,i}-U_{0,i}-U_{C,i}\right)/\sigma_D\sim\mathcal{N}\left(0,1\right)$ is the negative of the total error term arising in the equation that determines $D_i$ above, $\sigma_D \equiv \sqrt{\mathrm{Var}\left(U_{1,i}-U_{0,i}-U_{C,i}\right)}$, and,

$\mu_D\left(X,Z\right) \equiv \left([1,X]\beta_1 -[1,X]\beta_0 -[1,X,Z]\beta_C\right)/\sigma_D \equiv [1,X,Z]\beta_D$,

In the simulation above, $\beta_D = [-.4,.7,-.1]/\sqrt{.5+.7+.9}\approx[-0.276,0.483,-0.069]$. We can estimate $\beta_D$ from the Probit regression of $D$ on $X$ and $Z$.

betasD = coef(glm(D~X+Z,sampleData,Binomial(),ProbitLink()))
3-element Array{Float64,1}:
-0.299096
0.59392
-0.103155


Next, notice that,

$\mathbb{E}\left[Y_i\Big|D_i=1,X_i,Z_i\right] =[1,X_i]\beta_1+\mathbb{E}\left[U_{1,i}\Big|D_i=1,X_i,Z_i\right]$,

where,

$\mathbb{E}\left[U_{1,i}\Big|D_i=1,X_i,Z_i\right] =\mathbb{E}\left[U_{1,i}\Big|\mu_D\left(X_i,Z_i\right)>U_{i,D},X_i,Z_i\right]=\rho_1 \frac{-\phi\left(\mu_D\left(X_i,Z_i\right)\right)}{\Phi\left(\mu_D\left(X_i,Z_i\right)\right)}$,

which is the inverse Mills ratio again, where $\rho_1 \equiv \frac{\mathrm{Cov}\left(U_1,U_D\right)}{\mathrm{Var}\left(U_D\right)}$. Substituting in the estimate for $\beta_D$, we consistently estimate $\beta_1$:

fittedVals = hcat(ones(N),array(sampleData[[:X,:Z]]))*betasD

sampleData[:invMills1] = -pdf(Normal(0,1),fittedVals)./cdf(Normal(0,1),fittedVals)

correctedBetas1 = linreg(array(sampleData[D,[:X,:invMills1]]),vec(array(sampleData[D,[:Y]])))
3-element Array{Float64,1}:
0.0653299
0.973568
-0.135445


To see how well the correction has performed, compare these estimates to the uncorrected estimates of $\beta_1$,

biasedBetas1 = linreg(array(sampleData[D,[:X]]),vec(array(sampleData[D,[:Y]])))
2-element Array{Float64,1}:
0.202169
0.927317


Similar logic allows us to estimate $\beta_0$:

sampleData[:invMills0] = pdf(Normal(0,1),fittedVals)./(1-cdf(Normal(0,1),fittedVals))

correctedBetas0 = linreg(array(sampleData[D.==0,[:X,:invMills0]]),vec(array(sampleData[D.==0,[:Y]])))
3-element Array{Float64,1}:
0.340621
0.207068
0.323793

biasedBetas0 = linreg(array(sampleData[D.==0,[:X]]),vec(array(sampleData[D.==0,[:Y]])))
2-element Array{Float64,1}:
0.548451
0.295698


In summary, we can consistently estimate the benefits associated with each of two alternative choices, even though we only observe each individual in one of the alternatives, subject to heavy selection bias, by extending the logic introduced by Heckman (1979).

# Kalman Filter for Panel Data and MLE in Julia, Part 1

* The script to reproduce the results of this tutorial in Julia is located here.

The Kalman Filter is notoriously difficult to estimate. I think this can be attributed to the following issues:

• It necessitates a large number of parameters that are difficult to manage.
• It requires the sequential inversion of matrices, so we must be careful to keep the matrices invertible and monitor numerical error.
• Most papers and Internet resources about the Kalman Filter assume that we have time-series rather than panel data, so they do not show how to adapt the parameters for panel data estimation
• Similarly, this literature makes “macro” assumptions like initializing at the steady state rather than the “micro” approach of estimating the initial state.
• Professional software is not well-suited to parameter estimation (more on this below).

In this tutorial, I will show how to write your own program that estimates the MLE of the most basic type of panel data Kalman Filter: time-invariant parameters and dedicated measures. It will explicitly impose assumptions to ensure identification, so that the estimates returned are uniquely determined from the data, unlike what would happen if we maximized the likelihood with the unrestricted Kalman Filter. In Part 2 of the tutorial, I generalize the estimation to permit a time-varying parameter space as well as time-varying parameter values, show how to identify the initial state, permit non-dedicated measures, and show how to “anchor” the model so that the units of the state space are determined to scale.

Kalman Filter DGP

The data generating process (DGP) corresponding to the panel data Kalman Filter is,

$\theta_{i,t+1}=A\theta_{i,t}+\nu_{i,t+1},\;\nu_{i,t+1}\sim\mathcal{N}\left(0,V\right)$,

$Y_{i,t+1}=C\theta_{i,t+1}+\omega_{i,t+1},\;\omega_{i,t+1}\sim\mathcal{N}\left(0,W\right)$,

$\theta_{i,1} \sim \mathcal{N}\left(\mu_1,\Sigma_1 \right)$,

for time periods $t=2,\ldots,T$. We assume that $Y$, referred to as “the measures” or “the data”, is the only observed term in this entire system. The first equation determines the evolution of the unobserved state, $\theta$, and we will refer to it as the “transition equation”. The second equation determines the relationship between the state and measures or signals of the state, $Y$, and we will refer to it as the “measurement equation”. The shocks, $v_{i,t+1}$ and $w_{i,t+1}$, are assumed to be independent across $i$, across $t$, and even across components so that $V$ and $W$ are diagonal.

For our purposes, the parameters of interest are typically $A,V$, although we generally need to identify $C,W$ as well so that we can separate them from $A,V$. It is no trouble to add other observables to these equations, e.g., in optimal control theory, there is usually an observed “input”, call it $u$, on the right-hand side the transition equation, so that the equation takes the form,

$\theta_{i,t+1}=A\theta_{i,t}+Bu_{i,t}+\nu_{i,t+1},\;\nu_{i,t+1}\sim\mathcal{N}\left(0,V\right)$,

but this is a detail that could be added later with little difficulty, so we omit it until Part 2 of this tutorial. Also, we will assume for now that the initial distribution of the state is characterized by $\mu_1=0$ and $\Sigma_1=\mathbf{1}$, but relax this assumption in Part 2.

Failure of Identification and Necessary Assumptions

Before simulating the DGP, notice that it is indifferent to the scale or even the sign of $\theta$. In particular, consider dividing $\theta$ by a constant $\alpha \neq 0$, and multiplying $C$ by $\alpha$. Then,

$Y_{i,t+1}=\left(\alpha C\right)\frac{\theta_{i,t+1}}{\alpha}+\omega_{i,t+1}=C\theta_{i,t+1}+\omega_{i,t+1}$,

so $\alpha$ cancels out and we are left with the same expression.

Since our only information about $\theta$ comes from $Y$, this means that we will never be able to know $C$ and $\theta$ apart from one another, that is, we will never know the scale of $\theta$. Similarly, we cannot distinguish $A$ and $\theta$. By choosing $\alpha<0$, we see that neither the sign nor the magnitude of $A$ is determined by knowledge of $Y$. Let me emphasize this point: we can change the sign of every entry in $A$ as we wish and get the same data. Since $A$ is one of the parameters of interest — in fact, our desire to learn $A$ is often the motivation for using the Kalman Filter — this is a major issue. Since $\alpha$ can take on a continuum of values, we see that the DGP results in the same $Y$ for a continuum of parameters $\alpha A, \alpha C$.

If we need to know the unique value of each parameter, we must make additional assumptions. The following assumptions, adapted from Cunha and Heckman (2008), are sufficient to ensure identification:

1. Each of the measures (i.e., the components of $Y$) is a measure of only one of the factors (i.e., the components of $\theta$). This is the “dedicated measures” assumption, and it is equivalent to requiring each row of $C$ to have exactly one entry that is not assumed to be zero.
2. There are at least 3 dedicated measures for each factor. This means that each column in $C$ must have at least three entries that are not assumed to be zero.

The first assumption can be relaxed a bit if there are more than 3 measures per factor, but we can never relax the assumption that there is at least one dedicated measure for one of the factors.

In practice, are these assumptions valid? Possibly, it depends on your faith in the measures available in your data. You must be able to confidently claim that certain measures in your data are measuring only one factor. The alternative is to abandon estimating the model. In other words, the Kalman Filter is not all-powerful — certain models simply cannot be estimated with certain data, and trying to force it to estimate a model that is not identified from your data will result in incorrect estimates.

Notice that professional Kalman Filter software does not allow one to impose these assumptions. This is one of the main motivations for this tutorial: at present, you must program the Kalman Filter yourself if you want valid estimates of the model parameters, and estimating the model parameters rather than projecting the state from given model parameters is the main purpose of the Kalman Filter in microeconomic applications.

Simulation of the Kalman Filter DGP

Before simulating the Kalman Filter DGP, we must first deal with the issue I mentioned at the beginning of this tutorial: the Kalman Filter DGP necessitates a large number of parameters that are difficult to manage. We need a function that can convert an array of parameters into the matrices $A,V,C,W$, including exponentiating variance terms so that when we use an optimizer later, the optimizer can try out different parameters without error. The following function, unpackParams, is my solution to this problem:

function unpackParams(params)
place = 0
A = reshape(params[(place+1):(place+stateDim^2)],(stateDim,stateDim))
place += stateDim^2
V = diagm(exp(params[(place+1):(place+stateDim)]))
place += stateDim
C = zeros(stateDim*obsDim,stateDim)
for j in [1:stateDim]
C[(1+obsDim*(j-1)):obsDim*j,j] = [1,params[(place+1+(obsDim-1)*(j-1)):(place+(obsDim-1)*(j))]]
end
place += (obsDim-1)*stateDim
W = diagm(exp(params[(place+1):(place+obsDim*stateDim)]))
return ["A"=>A,"V"=>V,"C"=>C,"W"=>W]
end


where stateDim is the length of $\theta$ and obsDim is the number of measures on each component of $\theta$. This structure assumes that all measures are dedicated to a single factor, and the same number of measures is available for each factor. The function can be modified as needed to account for other structures. Here is the output from using this function on a set of parameters when there are two components of $\theta$ with 3 measures on each:

julia> stateDims = 2
julia> obsDims = 3
julia> params0 = [1.,.3,.3,1.,0.,0.,.5,-.5,.5,-.5,0.,0.,0.,0.,0.,0.]
julia> unpackParams(params0)
"V" 2x2 Array{Float64,2}:
1.0  0.0
0.0  1.0,
"C" 6x2 Array{Float64,2}:
1.0   0.0
0.5   0.0
-0.5   0.0
0.0   1.0
0.0   0.5
0.0  -0.5,
"A" 2x2 Array{Float64,2}:
1.0  0.3
0.3  1.0,
"W" 6x6 Array{Float64,2}:
1.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  0.0  0.0  0.0  0.0
0.0  0.0  1.0  0.0  0.0  0.0
0.0  0.0  0.0  1.0  0.0  0.0
0.0  0.0  0.0  0.0  1.0  0.0
0.0  0.0  0.0  0.0  0.0  1.0]


Clearly, the most difficult part to construct is $C$, the measurement matrix.

Now that we can convert an array of parameters into the appropriate parameter matrices, it is straight-forward to define a function that simulates the DGP of the Kalman Filter:

function KalmanDGP(params)
# initialize data
data = zeros(N,stateDim*obsDim*T+1)
# parameters
unpacked = unpackParams(params)
A = unpacked["A"]
V = unpacked["V"]
C = unpacked["C"]
W = unpacked["W"]
# draw from DGP
for i=1:N
# data of individual i
iData = ones(stateDim*obsDim*T+1)
current_state = rand(MvNormal(reshape(init_exp,(stateDim,)),init_var))
iData[1:stateDim*obsDim] = rand(MvNormal(eye(obsDim*stateDim)))
for t=2:T
current_state = A*current_state + rand(MvNormal(V))
iData[((t-1)*stateDim*obsDim+1):(t*stateDim*obsDim)] = C*current_state + rand(MvNormal(W))
end
data[i,:] = iData
end
return data
end


Here is a simulated data set that satisfies the assumptions of time-invariant parameters and at least three dedicated measures (in particular, I use obsDims=4):

srand(2)
N = 1000
T = 4
stateDims = 2
obsDims = 4
init_exp = zeros(stateDim)
init_var = eye(stateDim)
data=KalmanDGP(params0)


Estimation Strategy for the Kalman Filter DGP

Now that we understand the Kalman Filter model, we can consider the Kalman Filter itself. The Kalman Filter is the estimator of what I have been calling the Kalman Filter DGP. The aim of the Kalman Filter for panel data is to estimate the mean and variance of $\theta_{i,t}$ for each $i$ and each $t$.

Estimation can be divided into two steps. Denoting $\mathcal{Y}^t_i$ as the set of all data available at or before time $t$, the projection step is trivial to derive and given by:

$\mathbb{E}\left[\theta_{i,t+1}\Big|\mathcal{Y}^t_i\right] = A\mathbb{E}\left[\theta_{i,t}\Big| \mathcal{Y}^t_i\right]$

$\mathbb{E}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right] = C\mathbb{E}\left[\theta_{i,t+1}\Big| \mathcal{Y}^t_i\right]$

$\mathrm{Var}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right] = A\mathbb{E}\left[\theta_{i,t}\Big| \mathcal{Y}^t_i\right]A'+V$

$\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^t_i\right] = C\mathbb{E}\left[\theta_{i,t}\Big| \mathcal{Y}^t_i\right]C'+W$

The update step is:

$\mathbb{E}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t+1}_i\right] = \mathbb{E}\left[\theta_{i,t+1}\Big| \mathcal{Y}^t_i\right]+K_{t+1} \tilde{Y}_{i,t+1}$

$\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t+1}_i\right] = \mathrm{Var}\left[\theta_{i,t+1}\Big| \mathcal{Y}^{t}_i\right]-K_{t+1} C\mathrm{Var}\left[\theta_{i,t+1}\Big| \mathcal{Y}^{t}_i\right]$

where,

$\tilde{Y}_{t+1}=Y_{t+1}-\mathbb{E}\left[Y_{i,t+1}\Big|\mathcal{Y}^{t}_i\right]$

is the error from projecting $Y_{t+1}$ at time $t$, and,

$K_{t+1}=\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t}_i\right]C'\left(\mathrm{Var}\left[Y_{i,t+1}\Big|\mathcal{Y}^{t}_i\right]\right)^{-1}$

is called the optimal Kalman gain and is derived as the solution to a least-squares problem.

The individual likelihood of the Kalman Filter is given by,

$\mathcal{L}\left(A,V,C,W\Big|\mathcal{Y}^T_i\right)\propto\Pr\left(\mathcal{Y}^T_i\Big|A,V,C,W\right)=\Pr\left(Y^{1}_i\Big|A,V,C,W\right)\prod_{t=1}^{T-1}\Pr\left(Y^{t+1}_i\Big|\mathcal{Y}^t_i,A,V,C,W\right)$

The initial conditions enter through $\Pr\left(Y^1_i\Big|A,V,C,W\right)$. In this tutorial, we assume $\Pr\left(Y^1_i\Big|A,V,C,W\right)=\Pr\left(Y^1_i\right)=\phi\left(Y_1^i\Big|0,\mathbf{1}\right)$,  where $\phi$ is the multivariate normal PDF. We permit individual-specific initial conditions and unknown parameters in Part 2 of the tutorial. Furthermore, $\Pr\left(Y^{t+1}_i\Big|\mathcal{Y}^t_i,A,V,C,W\right)=\phi\left(Y_i^{t+1}\big|\mathbb{E}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right],\mathrm{Var}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right]\right)$.

Finally, since individuals are assumed to be independent,

$\mathcal{L}\left(A,V,C,W\Big|\mathcal{Y}^T\right)\propto \prod_{i=1}^N\Pr\left(Y_{i,1}\right)\prod_{t=1}^{T-1}\Pr\left(Y^{t+1}_i\Big|\mathcal{Y}^t_i,A,V,C,W\right)$

Now that we have a simple expression for the likelihood, we need only convert it into a Julia code.

Estimation of the Kalman Filter in Julia

To simplify computation, we write a single function that simultaneously predicts and updates (in that order) from any time period $t$ to the next time period $t+1$. It begins with the “posterior” estimates of the expected state and variance of the state at time $t$, that is, $\mathbb{E}\left[\theta_{i,t}\Big|\mathcal{Y}^{t}_i\right],\mathrm{Var}\left[\theta_{i,t}\Big|\mathcal{Y}^{t}_i\right]$, uses these to predict the “prior” estimates at time $t+1$, that is, $\mathbb{E}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t}_i\right],\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t}_i\right]$, uses the priors to obtain the likelihood of the new observation at time $t+1$, and finally uses the priors as well as the new observation at time $t+1$ to update to the posterior estimates at time $t+1$. The function, incrementKF (“a time increment of the Kalman Filter”, is as follows:

function incrementKF(params,post_exp,post_var,new_obs)
# unpack parameters
unpacked = unpackParams(params)
A = unpacked["A"]
V = unpacked["V"]
C = unpacked["C"]
W = unpacked["W"]
# predict
prior_exp = A*post_exp
prior_var = A*post_var*A' + V
obs_prior_exp = C*prior_exp
obs_prior_var = C*prior_var*C' + W
# update
residual = new_obs - obs_prior_exp
obs_prior_cov = prior_var*C'
kalman_gain = obs_prior_cov*inv(obs_prior_var)
post_exp = prior_exp + kalman_gain*residual
post_var = prior_var - kalman_gain*obs_prior_cov'
# step likelihood
dist = MvNormal(reshape(obs_prior_exp,(length(obs_prior_exp),)),obs_prior_var)
log_like = logpdf(dist,new_obs)
return ["post_exp"=>post_exp,"post_var"=>post_var,"log_like"=>log_like]
end


We see that the code is quite simple, as it closely mimics the mathematical formulas above in syntax. The part of the code that does not mimic the usual mathematical syntax — preparing parameters — is handled elsewhere by unpackParams.

Now that we have the code that predicts and updates the Kalman Filter for any individual while outputting the contribution to the log-likelihood, we can iterate over time and collect all of the log-likelihood contributions from an individual’s time path:

function indivKF(params,init_exp,init_var,i)
iData = data[i,:]
# initialization
post_exp = init_exp
post_var = init_var
init_obs = array([iData[obsDict[1]]])'
dist = MvNormal(eye(length(init_obs)))
log_like=logpdf(dist,init_obs)
for t = 1:(T-1)
# predict and update
new_obs = array([iData[obsDict[t+1]]])'
new_post = incrementKF(params,post_exp,post_var,new_obs)
# replace
post_exp = new_post["post_exp"]
post_var = new_post["post_var"]
# contribute
log_like += new_post["log_like"]
end
return log_like
end


This initializes the individual’s contribution to the likelihood using the assumed initial distribution of $\mathcal{N}\left(0,\mathbf{1}\right)$, then proceeds to predict and update the Kalman Filter over time until all log-likelihood contributions have been collected. The only confusing part is obsDict, which is a dictionary containing the column names of the observations, which are assumed to be stored in a DataFrame called data. This strategy isn’t necessary, but it avoids a lot of complications, and also makes our example better resemble a real-world application to a data set.

Finally, we loop through each individual, summing all of their lifetime log-likelihood contributions:

function sampleKF(params,init_exp,init_var)
log_like = 0.0
N = size(data,1)
for i in 1:N
log_like += indivKF(params,init_exp,init_var,i)
end
neg_avg_log_like = -log_like/N
println("current average negative log-likelihood: ",neg_avg_log_like)
return neg_avg_log_like[1]
end


where I like to include a print statement so that I can monitor the convergence of the likelihood. Thus, we have the panel data Kalman Filter likelihood, under the appropriate assumptions to ensure identification. Finally, we use a wrapper to prepare it for optimization:

function wrapLoglike(params)
print("current parameters: ",params)
return sampleKF(params,init_exp,init_var)
end


where I have included a print statement so that I can monitor the parameters that the optimizer is testing and know in advance if the estimates are diverging (if it were diverging, I would know my code has a bug and could stop the optimizer before it wastes too much time).

We conclude by estimating the Kalman Filter MLE on simulated data.

Results

First, we simulate data from KalmanDGP:

srand(2)
N = 1000
T = 4
stateDim = 2
obsDim = 3
init_exp = zeros(stateDim)
init_var = eye(stateDim)
params0 = [1.,0.,0.,1.,0.,0.,.5,-.5,.5,-.5,0.,0.,0.,0.,0.,0.]

data=KalmanDGP(params0)
data = DataFrame(data)
colnames!(data,["one_1","one_2","one_3","one_4","one_5","one_6","two_1","two_2","two_3","two_4","two_5","two_6","three_1","three_2","three_3","three_4","three_5","three_6","four_1","four_2","four_3","four_4","four_5","four_6","outcome"])
obsDict = {["one_1","one_2","one_3","one_4","one_5","one_6"],["two_1","two_2","two_3","two_4","two_5","two_6"],["three_1","three_2","three_3","three_4","three_5","three_6"],["four_1","four_2","four_3","four_4","four_5","four_6"]}


Everything here is the same as in the DGP code above, except that we are converting the data matrix into a DataFrame. The columns are named using the syntax time_number, so that three_4 means the 4th observation in time period 3. It creates obsDict such that, for example, data[obsDict[3]] would return the variables from three_1 to three_6 — all of the observations at time 3 in the correct order.

Now, we are ready to estimate the Kalman Filter as follows:

MLE = optimize(wrapLoglike,params0,method=:cg,ftol=1e-8)


When it is finished (and we have watched it converge using the print statements), we can view the parameters in the appropriate context by:

optimParams = unpackParams(MLE.minimum)
["V" 2x2 Array{Float64,2}:
1.11323  0.0
0.0      0.918925,
"C" 6x2 Array{Float64,2}:
1.0        0.0
0.489902   0.0
-0.490378   0.0
0.0        1.0
0.0        0.528178
0.0       -0.517231,
"A" 2x2 Array{Float64,2}:
0.976649    -0.00976326
0.00762499   0.994169  ,
"C" 6x6 Array{Float64,2}:
0.977998  0.0      0.0     0.0       0.0      0.0
0.0       0.97972  0.0     0.0       0.0      0.0
0.0       0.0      1.0241  0.0       0.0      0.0
0.0       0.0      0.0     0.990665  0.0      0.0
0.0       0.0      0.0     0.0       1.00625  0.0
0.0       0.0      0.0     0.0       0.0      1.01042]


This compares favorably to the true parameters,

julia> unpackParams(params0)
["V" 2x2 Array{Float64,2}:
1.0  0.0
0.0  1.0,
"C" 6x2 Array{Float64,2}:
1.0   0.0
0.5   0.0
-0.5   0.0
0.0   1.0
0.0   0.5
0.0  -0.5,
"A" 2x2 Array{Float64,2}:
1.0  0.0
0.0  1.0,
"W" 6x6 Array{Float64,2}:
1.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  0.0  0.0  0.0  0.0
0.0  0.0  1.0  0.0  0.0  0.0
0.0  0.0  0.0  1.0  0.0  0.0
0.0  0.0  0.0  0.0  1.0  0.0
0.0  0.0  0.0  0.0  0.0  1.0]


On a single processor of my personal laptop, optimization requires 446 seconds. In response to a reader comment, I also added this version of the code that achieves optimization in 393 seconds by removing all global variables.

# Parallel Processing in Julia: Bootstrapping the MLE

* Scripts that reproduce the results of this tutorial in Julia are located here and here.

To review our extended example, we first simulated the OLS DGP, solved for the MLE of OLS, bootstrapped the MLE, and computed various non-parametric p-values.

The only part of our previous work that may have slowed down your computer was forming the 10,000 bootstrap samples (samples). In this post, we will see how to speed up the bootstrapping by using more than one processor to collect the bootstrap samples.

The Estimator that We Wish to Bootstrap

We are interested in the following MLE estimator, adapted from our previous work. First, we import the two needed packages and generate the data:

# Prepare
srand(2)
using Distributions
using Optim
using DataFrames
N=1000
K=3

# Generate variables
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]
genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.01,0.05,0.05,0.07]
Y = X*trueParams + epsilon

# Export data
data = DataFrame(hcat(Y,X))
names!(data,[:Y,:one,:X1,:X2,:X3])
writetable("data.csv",data)


Next, we define the likelihood of OLS:

function loglike(rho,y,x)
beta = rho[1:4]
sigma2 = exp(rho[5])+eps(Float64)
residual = y-x*beta
dist = Normal(0, sqrt(sigma2))
contributions = logpdf(dist,residual)
loglikelihood = sum(contributions)
return -loglikelihood
end


Finally, we define the bootstrap function, which includes within it a wrapper for the likelihood:

function bootstrapSamples(B)
println("hi")
samples = zeros(B,5)
for b=1:B
theIndex = sample(1:N,N)
x = X[theIndex,:]
y = Y[theIndex,:]
function wrapLoglike(rho)
return loglike(rho,y,x)
end
samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
end
samples[:,5] = exp(samples[:,5])
println("bye")
return samples
end


The reason for the cute println statements will become apparent later. Save all of this in a file called “bootstrapFunction.jl”. Finally, we import the data created above:

data = readtable("data.csv")
N = size(data,1)
Y = array(data[:Y])
X = array(data[[:one,:X1,:X2,:X3]])


Parallel Bootstrapping

We now have the function, bootstrapSamples, saved in the file “bootstrapFunction.jl”, that we wish to run in parallel. The idea is that, instead of running the MLE $B$ times on one processor, we run it $B/4$ times on each of four processors, then collect the results back to the first processor. Unfortunately, Julia makes it very difficult to parallelize functions within the same file that the functions are defined; the user must explicitly tell Julia every parameter and function that is needed on each processor, which is a frustrating process of copy-pasting the command @everywhere all around your code.

Instead, we separate our work into two files, the first one you have already created. Create a second file in the same location on your computer as “bootstrapFunction.jl”. In the second file, write the following commands:

addprocs(4)
require("tutorial_5_bootstrapFunction.jl")
B=10000
b=2500
samples_pmap = pmap(bootstrapSamples,[b,b,b,b])
samples = vcat(samples_pmap[1],samples_pmap[2],samples_pmap[3],samples_pmap[4])


The command addprocs(4) means to add four processors into your Julia session. The command require is the key to getting this right: it ensures that all of the functions and parameters contained in “bootstrapFunction.jl” are available to each processor. pmap is the function that actually applies bootstrapSamples to each value in the list [b,b,b,b]. I set the list next to it to be [b,b,b,b], where b=2500, so that bootstrapEstimates would collect 2,500 bootstrap samples on each processor. The resulting matrices from all four processors are stored in the length-4 vector called samples_pmap. Finally, we vertically stack (vcat) the four matrices into one matrix, samples, which is exactly the matrix of bootstrap samples we have seen previously.

Tip: For parallel processing in Julia, separate your code into two files, one file containing the functions and parameters that need to be run in parallel, and the other file managing the processing and collecting results. Use the require command in the managing file to import the functions and parameters to all processors.

Results

Once you have run the outer file, you can do speed tests to compare the two ways of getting 10,000 bootstrap samples. We use the @elapsed command to time each code. Here are the results:

@elapsed pmap(bootstrapSamples,[b,b,b,b])
From worker 2:	hi
From worker 3:	hi
From worker 4:	hi
From worker 5:	hi
From worker 2:	bye
From worker 5:	bye
From worker 4:	bye
From worker 3:	bye
66.528345161


Each of the processors tells us “hi” and “bye” when it begins and finishes its job, respectively. The total time was a bit more than a minute (66.5 seconds). By comparison, the following code will compute 10,000 bootstrap samples on only one processor, as in our previous work:

@elapsed bootstrapSamples(10000)
hi
bye
145.265103729


So, in the absence of parallel processing, the same result would require just over 145 seconds, or almost two and a half minutes. Thus, parallel processing on a four-core personal computer allowed us to reduce the time required to bootstrap our estimator by a factor of around 2.2. As a general rule, parallel processing of the bootstrap estimator saves you more time when the function being bootstrapped is more complicated.

# Julia Attracts an International Audience

I created this site 2 days ago to help economists learn Julia. I am pleasantly surprised that so many share this interest. Below, see the number of views of this site over these 2 days by country. Hopefully, all of this momentum in favor of Julia will speed its accumulation of useful packages and documentation.

Please, do not hesitate to comment or email me with ideas to improve the codes I post or ideas of worthwhile codes to work on next. Thank you for visiting this site.

# Stepdown p-values for Multiple Hypothesis Testing in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

To finish yesterday’s tutorial on hypothesis testing with non-parametric p-values in Julia, I show how to perform the bootstrap stepdown correction to p-values for multiple testing, which many economists find intimidating (including me) but is actually not so difficult.

First, I will show the simple case of Holm’s (1979) stepdown procedure. Then, I will build on the bootstrap example to apply one of the most exciting advances in econometric theory from the past decade, the bootstrap step-down procedure, developed by Romano and Wolf (2004) and extended by Chicago professor Azeem Shaikh. These methods allow one to bound the family-wise error rate, which is the probability of making at least one false discovery (i.e., rejecting a null hypothesis when the null hypothesis was true).

Motivation

The standard example to motivate this discussion is to suppose you compute p-values for, say, 10 parameters with independent estimators. If all 10 have p-value equal to 0.05 corresponding to the null hypothesis of each, then we reject the null hypothesis for any one of them with 95% confidence. However, the probability that all of them reject the null hypothesis is $1-\left(1-0.05\right)^{10} \approx 0.4$ (by independence, the joint probability is the product of marginal probabilities), so we only have 60% confidence in rejecting all of them jointly. If we wish to test for all of the hypotheses jointly, we need some way to adjust the p-values to give us greater confidence in jointly rejecting them. Holm’s solution to this problem is to iteratively inflate the individual p-values, so that the smallest p-value is inflated the most, and the largest p-value is not inflated at all (unless one of the smaller p-values is inflated above the largest p-value, then the largest p-value will be inflated to preserve the order of p-values). The specifics of the procedure are below.

The Romano-Wolf-Shaikh approach is to account for the dependence among the estimators, and penalize estimators more as they become more independent. Here’s how I motivate their approach: First, consider the extreme case of two estimators that are identical. Then, if we reject the null hypothesis for the first, there is no reason to penalize the second; it would be very strange to reject the null hypothesis for the first but not the second when they are identical. Second, suppose they are almost perfectly dependent. Then, rejecting the null for one strongly suggests that we should also reject the null for the second, so we do not want to penalize one very strongly for the rejection of the other. But as they become increasingly independent, the rejection of one provides less and less affirmative information about the other, and we approach the case of perfect independence shown above, which receives the greatest penalty. The specifics of the procedure are below.

Holm’s Correction to the p-values

Suppose you have a vector of p-values, $p$, of length $K$, and a desired maximum family-wise error rate, $\alpha$ (the most common choice is $\alpha=0.05$. These are all of the needed ingredients. Holm’s stepdown procedure is as follows:

1. If the smallest element of $p$ is greater than $\frac{\alpha}{(K+1)}$, reject no null hypotheses and exit (do not continue to step 2). Else, reject the null hypothesis corresponding to the smallest element of $p$ (continue to step 2).
2. If the second-smallest element of $p$ is greater than $\frac{\alpha}{(K+1)-1}$, reject no remaining null hypotheses and exit (do not continue to step 3). Else, reject the null hypothesis corresponding to the second-smallest element of $p$ (continue to step 3).
3. If the third-smallest element of $p$ is greater than $\frac{\alpha}{(K+1)-2}$, reject no remaining null hypotheses and exit (do not continue to step 4). Else, reject the null hypothesis corresponding to the third-smallest element of $p$ (continue to step 4).
4. And so on.

We could program this directly as a loop-based function that takes $p,\alpha$ as parameters, but a simpler approach is to compute the p-values equivalent to this procedure for any $\alpha$, which are,

$\tilde{p}_{(k)} =\min\left\{\max_{j:p_{(j)}\leq p_{(k)}} \left\{ (K+1-j)p_{(j)}\right\},1\right\}$,

where $(k)$ means the $k^{\mathit{th}}$ smallest element. This expression is equivalent to the algorithm above in the sense that, for any family-wise error rate $\alpha\in (0,1)$, $\tilde{p}_{(k)}\leq\alpha$ if and only if the corresponding hypothesis is rejected by the algorithm above.

The following code computes these p-values. Julia (apparently) lacks a command that would tell me the index of the rank of the p-values, so my loop below does this, including the handling of ties (when some of the p-values are the same):

function Holm(p)
K = length(p)
sort_index = -ones(K)
sorted_p = sort(p)
for j=1:K
num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
end
holm_p = [sorted_holm_p[sort_index[k]] for k=1:K]
return holm_p
end


This is straight-forward except for sort_index, which I constructed such that, e.g., if the first element of sort_index is 3, then the first element of pvalues is the third smallest. Unfortunately, it arbitrarily breaks ties in favor of the parameter that appears first in the MLE array, so the second entry of sort_index is 1 and the third entry is 2, even though the two corresponding p-values are equal.

Bootstrap Stepdown p-values

Given our work yesterday, it is relatively easy to replace bootstrap marginal p-values with bootstrap stepdown p-values. We begin with samples, as created yesterday. The following code creates tNullDistribution, which is the same as nullDistribution from yesterday except as t-statistics (i.e., divided by standard error).

function stepdown(MLE,bootSamples)
K = length(MLE)
tMLE = MLE
bootstrapSE = std(bootSamples,1)
tNullDistribution = bootSamples
p = -ones(K)
for i=1:K
tMLE[i] = abs(tMLE[i]/bootstrapSE[i])
tNullDistribution[:,i] = abs((tNullDistribution[:,i]-mean(tNullDistribution[:,i]))/bootstrapSE[i])
p[i] = mean(tMLE[i].<tNullDistribution[:,i])
end
sort_index = -ones(K)
stepdown_p = -ones(K)
sorted_p = sort(p)
for j=1:K
num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
end
for k=1:K
current_index = [sort_index.>=k]
stepdown_p[sort_index[k]] = mean(maximum(tNullDistribution[:,sort_index.>=k],2).>tMLE[sort_index[k]])
end
return ["single_pvalues"=>p,"stepdown_pvalues"=>stepdown_p,"Holm_pvalues"=>Holm(p)]
end


The only difference between the single p-values and the stepdown p-values is the use of the maximum t-statistic in the comparison to the null distribution, and the maximum is taken over only the parameter estimates whose p-values have not yet been computed. Notice that I used a dictionary in the return so that I could output single, stepdown, and Holm p-values from the stepdown function.

Results

To test out the above corrected p-values, we return to MLE and samples from yesterday. Suppose we want to perform the two-sided tests simultaneously for the null hypotheses $\beta_0=0,\beta_1=0,\beta_2=0,\beta_3=0$. The results from the above code are,

julia> stepdown(MLE[1:4],samples[:,1:4])
Dict{ASCIIString,Any} with 3 entries:
"Holm_pvalues"     => {0.766,0.766,0.18,0.036}
"stepdown_pvalues" => [0.486,0.649,0.169,0.034]
"single_pvalues"   => [0.486,0.383,0.06,0.009]


We see that the p-value corresponding to the null hypothesis $\beta_2=0$ is inflated by both the bootstrap stepdown and Holm procedures, and is no longer significant at the 0.10 level.

# Bootstrapping and Non-parametric p-values in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

Suppose that we wish to test to see if the the parameter estimates of $\beta$ are statistically different from zero and if the estimate of $\sigma^2$ is different from one for the OLS parameters defined in a previous post. Suppose further that we do not know how to compute analytically the standard errors of the MLE parameter estimates; the MLE estimates were presented in the previous post.

We decide to bootstrap by resampling cases in order to estimate the standard errors. This means that we treat the sample of $N$ individuals as if it were a population from which we randomly draw $B$ samples, each of size $N$. This produces a sample of MLEs of size $B$, that is, it provides an empirical approximation to the distribution of the MLE. From the empirical approximation, we can compare the full-sample point MLE to the MLE distribution under the null hypothesis.

To perform bootstrapping, we rely on Julia’s built-in sample function.

Wrapper Functions and the Likelihood of Interest

Now that we have a bootstrap index, we define the log-likelihood as a function of x and y, which are any subsets of X and Y, respectively.

function loglike(rho,y,x)
beta = rho[1:4]
sigma2 = exp(rho[5])
residual = y-*(x,beta)
dist = Normal(0, sqrt(sigma2))
contributions = logpdf(dist,residual)
loglikelihood = sum(contributions)
return -loglikelihood
end


Then, if we wish to evaluate loglike across various subsets of x and y, we use what is called a wrapper, which simply creates a copy of loglike that has already set the values of x and y. For example, the following function will evaluate loglike when x=X and y=Y:

function wrapLoglike(rho)
return loglike(rho,Y,X)
end


We do this because we want the optimizer to find the optimal $\rho$, holding x and y fixed, but we also want to be able to adjust x and y to suit our purposes. The wrapper function allows the user to modify x and y, but tells the optimizer not to bother them.

Tip: Use wrapper functions to manage arguments of your objective function that are not supposed to be accessed by the optimizer. Give the optimizer functions with only one argument — the parameters over which it is supposed to optimize your objective function.

Bootstrapping the OLS MLE

Now, we will use a random index, which is drawn for each b using the sample function, to take a random sample of individuals from the data, feed them into the function using a wrapper, then have the optimizer maximize the wrapper across the parameters. We repeat this process in a loop, so that we obtain the MLE for each subset. The following loop stores the MLE in each row of the matrix samples using 1,000 bootstrap samples of size one-half (M) of the available sample:

B=1000
samples = zeros(B,5)
for b=1:B
theIndex = sample(1:N,N)
x = X[theIndex,:]
y = Y[theIndex,:]
function wrapLoglike(rho)
return loglike(rho,y,x)
end
samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
end
samples[:,5] = exp(samples[:,5])


The resulting matrix contains 1,000 samples of the MLE. As always, we must remember to exponentiate the variance estimates, because they were stored in log-units.

Bootstrapping for Non-parametric p-values

Estimates of the standard errors of the MLE estimates can be obtained by computing the standard deviation of each column,

bootstrapSE = std(samples,1)


where the number 1 indicates that the standard deviation is taken over columns (instead of rows).

Standard errors like these can be used directly for hypothesis testing under parametric assumptions. For example, if we assume an MLE is normally distributed, then we reject the null hypothesis that the parameter is equal to some point if the parameter estimate differs from the point by at least 1.96 standard errors (using that the sample is large). However, we can make fewer assumptions using non-parametric p-values. The following code creates the distribution implied by the null hypothesis that $\beta_0 =0, \beta_1=0, \beta_2=0, \beta_3=0, \sigma^2=1$ by subtracting the mean from each distribution (thus imposing a zero mean) and then adding 1 to the distribution of $\sigma^2$ (thus imposing a mean of one); this is called nullDistribution.

nullDistribution = samples
pvalues = ones(5)
for i=1:5
nullDistribution[:,i] = nullDistribution[:,i]-mean(nullDistribution[:,i])
end
nullDistribution[:,5] = 1 + nullDistribution[:,5]


The non-parametric p-value (for two-sided hypothesis testing) is the fraction of times that the absolute value of the MLE is greater than the absolute value of the null distribution.

pvalues = [mean(abs(MLE[i]).<abs(nullDistribution[:,i])) for i=1:5]


If we are interested in one-sided hypothesis testing, the following code would test the null hypothesis $\beta_0 =0$ against the alternative that $\beta_0>0$:

pvalues = [mean(MLE[i].<nullDistribution[:,i]) for i=1:5]


Conversely, the following code would test the null hypothesis $\beta_0 =0$ against the alternative that $\beta_0<0$:

pvalues = [mean(MLE[i].>nullDistribution[:,i]) for i=1:5]


Thus, two-sided testing uses the absolute value (abs), and one-sided testing only requires that we choose the right comparison operator (.> or .<).

Results
Let the true parameters be,

julia> trueParams = [0.01,0.05,0.05,0.07]


The resulting bootstrap standard errors are,

julia> bootstrapSE = std(samples,1)
1x5 Array{Float64,2}:
0.0308347  0.0311432  0.0313685  0.0305757  0.0208229


and the non-parametric two-sided p-value estimates are,

julia> pvalues = [mean(abs(MLE[i]).<abs(nullDistribution[:,i])) for i=1:5]
5-element Array{Any,1}:
0.486
0.383
0.06
0.009
0.289


Thus, we reject the null hypotheses for the third and fourth parameters only and conclude that $\beta_2 \neq 0, \beta_3 \neq 0$, but find insufficient evidence to reject the null hypotheses that $\beta_0 =0, \beta_1 =0$ and $\sigma^2=1$.

# Maximum Likelihood Estimation (MLE) in Julia: The OLS Example

* The script to reproduce the results of this tutorial in Julia is located here.

We continue working with OLS, using the model and data generating process presented in the previous post. Recall that,

$\epsilon|X \sim \mathcal{N}\left(0,\sigma^2\right) \implies Y|X \sim \mathcal{N}\left( X\beta, \sigma^2 \right)$,

or, equivalently,

$\left(Y-X\beta\right)|X \sim \mathcal{N}\left( 0, \sigma^2 \right)$,

which is a more convenient expression because the distribution does not depend on $X$, conditional on $X$. Denote the parameter vector by $\rho \equiv [\beta, \sigma^2]$. We will now see how to obtain the MLE estimate $\hat\rho$ of $\rho$. By Bayes’ Rule and independence across individuals (i), the likelihood of $\rho$ satisfies,

$\mathcal{L}\left(\rho|Y,X\right) \propto \prod_{i=1}^N \phi\left( Y_i - X_i\beta, \sigma^2 \right|\rho)$;

where $\phi$ is the normal probability distribution function (PDF). $\hat\rho$ is the $\arg\max$ of this expression, and we will show how to find it using a numerical search algorithm in Julia.

Computing the Log-Likelihood of OLS

First, we define the log-likelihood in Julia as follows (we are using the data X and Y generated in the previous post):

using Distributions
function loglike(rho)
beta = rho[1:4]
sigma2 = exp(rho[5])
residual = Y-X*beta
dist = Normal(0, sqrt(sigma2))
contributions = logpdf(dist,residual)
loglikelihood = sum(contributions)
return -loglikelihood
end


This code first collects $beta$ (beta) and $\sigma^2$ (sigma2) from $\rho$ (rho), uses $\sigma^2$ to initialize the appropriate normal distribution (dist), then evaluates the normal distribution at each of the residuals, $Y_i-X_i\beta$ (residuals), returning the negative of the sum of the individual contributions to the log-likelihood (contributions).

Tip: Always remember to use the negative of the likelihood (or log-likelihood) that you wish to maximize, because the optimize command is a minimizer by default. Since the $\arg\max$ is the same when maximizing some function and minimizing the negative of the function, we need the negative sign in order to maximize the likelihood with a minimizer.

The only confusing part of this function is that sigma2 is read from rho as an exponential. This is strictly a unit conversion — it means that $\sigma^2$ was stored in $\rho$ in log-units, so we must exponentiate to return it to levels, that is, $\rho$ is defined as $\rho \equiv [\beta, \log\left(\sigma^2\right)]$. This is a common approach in numerical optimization to solve a practical problem: the numerical optimizer tries out many possible values for $\rho$ in its effort to search for the MLE, and it may naively try out a negative value of $\sigma^2$, which would be a logical contradiction that would crash the code. By storing $\sigma^2$ in $\log$ units, we make it perfectly acceptable for the optimizer to try out a negative value of the parameter, because a parameter that is negative in log-units is non-negative in level-units.

Tip: Always restrict variances so that the optimizer cannot try negative values. We have used log-units to achieve this goal.

Maximizing the Likelihood of OLS

Now, we are ready to find the MLE, $\hat\rho$. To do this, we will use the Optim package, which I previously showed how to install.

using Optim
params0 = [.1,.2,.3,.4,.5]
optimum = optimize(loglike,params0,method=:cg)
MLE = optimum.minimum
MLE[5] = exp(MLE[5])
println(MLE)


This says to optimize the function loglike, starting from the point params0, which is chosen somewhat arbitrarily. Numerical search algorithms have to start somewhere, and params0 serves as an initial guess of the optimum. Of course, the best possible guess is trueParams, because the optimizer would have to do much less work, but in practice, we do not know the true parameters so the optimizer will have to do the work. Notice that, at the end, we have to exponentiate the sigma2 parameter because the optimizer will return it in log-units due to our exponentiation above.

Results

Using the same random seed as before, the algorithm returns the estimates,

julia> MLE
5-element Array{Float64,1}:
0.112163
0.476432
-0.290571
0.010831
1.01085


which are very close to trueParams and the true variance of $\epsilon$, 1. The Optim package has various optimizers from which to choose; we were using the Newton Conjugate-Gradient (cg) algorithm above. If we replace this with the Nelder-Mead algorithm (nelder_mead), we obtain the almost-identical estimates,

julia> MLE
5-element Array{Float64,1}:
0.112161
0.476435
-0.290572
0.0108343
1.01085


In future posts, we will see cases where the choice of optimizer makes a major difference in the quality of the numerical MLE as well as the speed of computation, but for this simple example, the choice of optimizer does not much matter.

# Introductory Example: Ordinary Least Squares (OLS) Estimation in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

In this post, I show in Julia how to perform ordinary least squares (OLS) estimation after first simulating the OLS data generating process (DGP). At the end, we see that the parameter estimates converge to the true parameter as sample size grows large. If you have not yet installed Julia, it takes 5 minutes following these instructions.

As a reminder and to clarify notation, the OLS DGP is,

$Y = X\beta + \epsilon$,

where $Y$ is the $Nx1$ dependent variable, $X$ is the $N\mathrm{x}\tilde{K}$ matrix of independent variables, $\beta$ is the $\tilde{K}\mathrm{x}1$ vector of parameters that we wish to estimate, and $\epsilon$ is the $N\mathrm{x}1$ error satisfying $\epsilon \overset{\mathit{i.i.d.}}{\sim} \mathcal{N}\left(0,\sigma^2\right)$. Because we assume that the first column of $X$ is the constant number 1, we will find it useful below to work with $K \equiv \tilde{K}-1$. The least squares estimator is,

$\hat{\beta} = \left( X^T X \right)^{-1} \left( X^T Y \right)$.

Matrix Algebra and Simulated Random Variables: The OLS DGP

First, we generate the independent variables $X$, then we use $X$ to generate the independent variable $Y$. To begin, create a new file in Julia Studio and save it to your computer. In the file editor (script), insert the following commands:


using Distributions
N=1000
K=3
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]



The using command let Julia know that we will be using the Distributions package. The MvNormal() command initialized a multivariate normal distribution, which is an object including methods such as pdf for the probability distribution function and rand for drawing random variables.  eye(K) means $\mathit{I}_K$, the identity matrix of size $K\mathrm{x}K$. We only told MvNormal the covariance matrix, leaving the mean blank, which Julia assumes means that we would like a zero mean. The distribution of $X$ is arbitrary; we only used multivariate normal for simplicity. At the end, we concatenate the vector of ones to X using brackets, [].

Tip: To ensure that the matrices are of the appropriate dimension, use the size command. Above, we transposed $X$ after finding that rand returned it as a $3\mathrm{x}N$ matrix, when we need it to be $N \mathrm{x} 3$. Misaligned dimensions are one of the most common and frustrating errors to make in writing a program.

Now that we have created $X$ as a matrix containing a column of ones as well as three independent variables, we wish to multiply it by a vector of regression coefficients of length 4 (including the intercept) and add the normally distributed shock, $\epsilon$. For simplicity, we assume $\sigma^2=1$.

genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.1,0.5,-0.3,0.]
Y = X*trueParams + epsilon


Matrix algebra in Julia can be done as in a way comparable to Python, *(A,B), which means AxB, or in the more R-like way we used above, A*B. Then, you can click run (the little green arrow in Julia Studio) and it will perform the operations in the file above. To make sure it worked, you can now go to the Console and type,

julia> mean(Y)


and press Enter. If the code worked correctly, this should return the mean of simulated $Y$, which should be near the true intercept 0.1 (since each of the independent variables has mean zero, the true mean of $Y$ is just the intercept).

Functions in Julia: The OLS Estimator

Functions are defined in Julia using the command function, followed by the desired name of your function, and parentheses containing the arguments of the function. An end statement is required upon completion of the function definition. Indentation is required within the body of the function, and it is a good practice to explicitly include a return statement. The OLS estimator is defined with respect to any particular parameters as follows:

function OLSestimator(y,x)
estimate = inv(x'*x)*(x'*y)
return estimate
end


This function uses the dot product (*) three times, the transpose () twice, and the inverse (inv()) once. This function works for any matrices x and y with the appropriate dimensions. Once you have defined this function by running the file, you can obtain the OLS estimates of the true parameters by typing,

julia> estimates = OLSestimator(Y,X)


Because of the return statement, the parameter estimates will be returned by OLSestimator() and stored in estimates. If you compute estimates in the script, you can print them to the screen using,

println(estimates)


Finally, change the parameter defined at the beginning of your code. When N is small (say, 100), estimates will usually be further from trueParams than when N is large (say 10,000). This should be very easy to change; if you used the print statement for the estimates, just change the value of N, run the code, and see the new estimates printed to the console. In order to make your estimates reproducible (i.e., the exact same draws from the random distributions), set the random seed at the beginning of your code using,

srand(2)


where 2 is a possible seed you could choose.

Results

When I run the code above with random seed 2, I find that,

julia> estimates
4-element Array{Float64,1}:
0.11216
0.476437
-0.290574
0.0108337


so you can use this to check your results.

If you are only trying to estimate OLS, you can use the built-in command linreg(), but do not include the vector of ones, as linreg() will add another vector of ones. This is why I created X_noconstant above, which is just X without a column of ones. The syntax is,

julia> linreg(X_noconstant,Y)
4-element Array{Float64,1}:
0.11216
0.476437
-0.290574
0.0108337


and the estimates are identical to those in estimates.