Kalman Filter for Panel Data and MLE in Julia, Part 1

* The script to reproduce the results of this tutorial in Julia is located here.

The Kalman Filter is notoriously difficult to estimate. I think this can be attributed to the following issues:

  • It necessitates a large number of parameters that are difficult to manage.
  • It requires the sequential inversion of matrices, so we must be careful to keep the matrices invertible and monitor numerical error.
  • Most papers and Internet resources about the Kalman Filter assume that we have time-series rather than panel data, so they do not show how to adapt the parameters for panel data estimation
  • Similarly, this literature makes “macro” assumptions like initializing at the steady state rather than the “micro” approach of estimating the initial state.
  • Professional software is not well-suited to parameter estimation (more on this below).

In this tutorial, I will show how to write your own program that estimates the MLE of the most basic type of panel data Kalman Filter: time-invariant parameters and dedicated measures. It will explicitly impose assumptions to ensure identification, so that the estimates returned are uniquely determined from the data, unlike what would happen if we maximized the likelihood with the unrestricted Kalman Filter. In Part 2 of the tutorial, I generalize the estimation to permit a time-varying parameter space as well as time-varying parameter values, show how to identify the initial state, permit non-dedicated measures, and show how to “anchor” the model so that the units of the state space are determined to scale.

Kalman Filter DGP

The data generating process (DGP) corresponding to the panel data Kalman Filter is,



\theta_{i,1} \sim \mathcal{N}\left(\mu_1,\Sigma_1 \right),

for time periods t=2,\ldots,T. We assume that Y, referred to as “the measures” or “the data”, is the only observed term in this entire system. The first equation determines the evolution of the unobserved state, \theta, and we will refer to it as the “transition equation”. The second equation determines the relationship between the state and measures or signals of the state, Y, and we will refer to it as the “measurement equation”. The shocks, v_{i,t+1} and w_{i,t+1}, are assumed to be independent across i, across t, and even across components so that V and W are diagonal.

For our purposes, the parameters of interest are typically A,V, although we generally need to identify C,W as well so that we can separate them from A,V. It is no trouble to add other observables to these equations, e.g., in optimal control theory, there is usually an observed “input”, call it u, on the right-hand side the transition equation, so that the equation takes the form,


but this is a detail that could be added later with little difficulty, so we omit it until Part 2 of this tutorial. Also, we will assume for now that the initial distribution of the state is characterized by \mu_1=0 and \Sigma_1=\mathbf{1}, but relax this assumption in Part 2.

Failure of Identification and Necessary Assumptions

Before simulating the DGP, notice that it is indifferent to the scale or even the sign of \theta. In particular, consider dividing \theta by a constant \alpha \neq 0, and multiplying C by \alpha. Then,

Y_{i,t+1}=\left(\alpha C\right)\frac{\theta_{i,t+1}}{\alpha}+\omega_{i,t+1}=C\theta_{i,t+1}+\omega_{i,t+1},

so \alpha cancels out and we are left with the same expression.

Since our only information about \theta comes from Y, this means that we will never be able to know C and \theta apart from one another, that is, we will never know the scale of \theta. Similarly, we cannot distinguish A and \theta. By choosing \alpha<0, we see that neither the sign nor the magnitude of A is determined by knowledge of Y. Let me emphasize this point: we can change the sign of every entry in A as we wish and get the same data. Since A is one of the parameters of interest — in fact, our desire to learn A is often the motivation for using the Kalman Filter — this is a major issue. Since \alpha can take on a continuum of values, we see that the DGP results in the same Y for a continuum of parameters \alpha A, \alpha C.

If we need to know the unique value of each parameter, we must make additional assumptions. The following assumptions, adapted from Cunha and Heckman (2008), are sufficient to ensure identification:

  1. Each of the measures (i.e., the components of Y) is a measure of only one of the factors (i.e., the components of \theta). This is the “dedicated measures” assumption, and it is equivalent to requiring each row of C to have exactly one entry that is not assumed to be zero.
  2. There are at least 3 dedicated measures for each factor. This means that each column in C must have at least three entries that are not assumed to be zero.

The first assumption can be relaxed a bit if there are more than 3 measures per factor, but we can never relax the assumption that there is at least one dedicated measure for one of the factors.

In practice, are these assumptions valid? Possibly, it depends on your faith in the measures available in your data. You must be able to confidently claim that certain measures in your data are measuring only one factor. The alternative is to abandon estimating the model. In other words, the Kalman Filter is not all-powerful — certain models simply cannot be estimated with certain data, and trying to force it to estimate a model that is not identified from your data will result in incorrect estimates.

Notice that professional Kalman Filter software does not allow one to impose these assumptions. This is one of the main motivations for this tutorial: at present, you must program the Kalman Filter yourself if you want valid estimates of the model parameters, and estimating the model parameters rather than projecting the state from given model parameters is the main purpose of the Kalman Filter in microeconomic applications.

Simulation of the Kalman Filter DGP

Before simulating the Kalman Filter DGP, we must first deal with the issue I mentioned at the beginning of this tutorial: the Kalman Filter DGP necessitates a large number of parameters that are difficult to manage. We need a function that can convert an array of parameters into the matrices A,V,C,W, including exponentiating variance terms so that when we use an optimizer later, the optimizer can try out different parameters without error. The following function, unpackParams, is my solution to this problem:

function unpackParams(params)
    place = 0
    A = reshape(params[(place+1):(place+stateDim^2)],(stateDim,stateDim))
    place += stateDim^2
    V = diagm(exp(params[(place+1):(place+stateDim)]))
    place += stateDim
    C = zeros(stateDim*obsDim,stateDim)
    for j in [1:stateDim]
            C[(1+obsDim*(j-1)):obsDim*j,j] = [1,params[(place+1+(obsDim-1)*(j-1)):(place+(obsDim-1)*(j))]]
    place += (obsDim-1)*stateDim
    W = diagm(exp(params[(place+1):(place+obsDim*stateDim)]))
    return ["A"=>A,"V"=>V,"C"=>C,"W"=>W]

where stateDim is the length of \theta and obsDim is the number of measures on each component of \theta. This structure assumes that all measures are dedicated to a single factor, and the same number of measures is available for each factor. The function can be modified as needed to account for other structures. Here is the output from using this function on a set of parameters when there are two components of \theta with 3 measures on each:

julia> stateDims = 2
julia> obsDims = 3
julia> params0 = [1.,.3,.3,1.,0.,0.,.5,-.5,.5,-.5,0.,0.,0.,0.,0.,0.]
julia> unpackParams(params0)
"V" 2x2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0,
"C" 6x2 Array{Float64,2}:
  1.0   0.0
  0.5   0.0
 -0.5   0.0
  0.0   1.0
  0.0   0.5
  0.0  -0.5,
"A" 2x2 Array{Float64,2}:
 1.0  0.3
 0.3  1.0,
"W" 6x6 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0]

Clearly, the most difficult part to construct is C, the measurement matrix.

Now that we can convert an array of parameters into the appropriate parameter matrices, it is straight-forward to define a function that simulates the DGP of the Kalman Filter:

function KalmanDGP(params)
    # initialize data
    data = zeros(N,stateDim*obsDim*T+1)
    # parameters
    unpacked = unpackParams(params)
    A = unpacked["A"]
    V = unpacked["V"]
    C = unpacked["C"]
    W = unpacked["W"]
    # draw from DGP
    for i=1:N
        # data of individual i
        iData = ones(stateDim*obsDim*T+1)
        current_state = rand(MvNormal(reshape(init_exp,(stateDim,)),init_var))
        iData[1:stateDim*obsDim] = rand(MvNormal(eye(obsDim*stateDim)))
        for t=2:T
            current_state = A*current_state + rand(MvNormal(V))
            iData[((t-1)*stateDim*obsDim+1):(t*stateDim*obsDim)] = C*current_state + rand(MvNormal(W))
        # add individual to data
        data[i,:] = iData
    return data

Here is a simulated data set that satisfies the assumptions of time-invariant parameters and at least three dedicated measures (in particular, I use obsDims=4):

N = 1000
T = 4
stateDims = 2
obsDims = 4
init_exp = zeros(stateDim)
init_var = eye(stateDim)

Estimation Strategy for the Kalman Filter DGP

Now that we understand the Kalman Filter model, we can consider the Kalman Filter itself. The Kalman Filter is the estimator of what I have been calling the Kalman Filter DGP. The aim of the Kalman Filter for panel data is to estimate the mean and variance of \theta_{i,t} for each i and each t.

Estimation can be divided into two steps. Denoting \mathcal{Y}^t_i as the set of all data available at or before time t, the projection step is trivial to derive and given by:

\mathbb{E}\left[\theta_{i,t+1}\Big|\mathcal{Y}^t_i\right] = A\mathbb{E}\left[\theta_{i,t}\Big| \mathcal{Y}^t_i\right]

\mathbb{E}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right] = C\mathbb{E}\left[\theta_{i,t+1}\Big| \mathcal{Y}^t_i\right]

\mathrm{Var}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right] = A\mathbb{E}\left[\theta_{i,t}\Big| \mathcal{Y}^t_i\right]A'+V

\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^t_i\right] = C\mathbb{E}\left[\theta_{i,t}\Big| \mathcal{Y}^t_i\right]C'+W

The update step is:

\mathbb{E}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t+1}_i\right] = \mathbb{E}\left[\theta_{i,t+1}\Big| \mathcal{Y}^t_i\right]+K_{t+1} \tilde{Y}_{i,t+1}

\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t+1}_i\right] = \mathrm{Var}\left[\theta_{i,t+1}\Big| \mathcal{Y}^{t}_i\right]-K_{t+1} C\mathrm{Var}\left[\theta_{i,t+1}\Big| \mathcal{Y}^{t}_i\right]



is the error from projecting Y_{t+1} at time t, and,


is called the optimal Kalman gain and is derived as the solution to a least-squares problem.

The individual likelihood of the Kalman Filter is given by,


The initial conditions enter through \Pr\left(Y^1_i\Big|A,V,C,W\right). In this tutorial, we assume \Pr\left(Y^1_i\Big|A,V,C,W\right)=\Pr\left(Y^1_i\right)=\phi\left(Y_1^i\Big|0,\mathbf{1}\right),  where \phi is the multivariate normal PDF. We permit individual-specific initial conditions and unknown parameters in Part 2 of the tutorial. Furthermore, \Pr\left(Y^{t+1}_i\Big|\mathcal{Y}^t_i,A,V,C,W\right)=\phi\left(Y_i^{t+1}\big|\mathbb{E}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right],\mathrm{Var}\left[Y_{i,t+1}\Big|\mathcal{Y}^t_i\right]\right).

Finally, since individuals are assumed to be independent,

\mathcal{L}\left(A,V,C,W\Big|\mathcal{Y}^T\right)\propto \prod_{i=1}^N\Pr\left(Y_{i,1}\right)\prod_{t=1}^{T-1}\Pr\left(Y^{t+1}_i\Big|\mathcal{Y}^t_i,A,V,C,W\right)

Now that we have a simple expression for the likelihood, we need only convert it into a Julia code.

Estimation of the Kalman Filter in Julia

To simplify computation, we write a single function that simultaneously predicts and updates (in that order) from any time period t to the next time period t+1. It begins with the “posterior” estimates of the expected state and variance of the state at time t, that is, \mathbb{E}\left[\theta_{i,t}\Big|\mathcal{Y}^{t}_i\right],\mathrm{Var}\left[\theta_{i,t}\Big|\mathcal{Y}^{t}_i\right], uses these to predict the “prior” estimates at time t+1, that is, \mathbb{E}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t}_i\right],\mathrm{Var}\left[\theta_{i,t+1}\Big|\mathcal{Y}^{t}_i\right], uses the priors to obtain the likelihood of the new observation at time t+1, and finally uses the priors as well as the new observation at time t+1 to update to the posterior estimates at time t+1. The function, incrementKF (“a time increment of the Kalman Filter”, is as follows:

function incrementKF(params,post_exp,post_var,new_obs)
    # unpack parameters
    unpacked = unpackParams(params)
    A = unpacked["A"]
    V = unpacked["V"]
    C = unpacked["C"]
    W = unpacked["W"]
    # predict
    prior_exp = A*post_exp
    prior_var = A*post_var*A' + V
    obs_prior_exp = C*prior_exp
    obs_prior_var = C*prior_var*C' + W
    # update
    residual = new_obs - obs_prior_exp
    obs_prior_cov = prior_var*C'
    kalman_gain = obs_prior_cov*inv(obs_prior_var)
    post_exp = prior_exp + kalman_gain*residual
    post_var = prior_var - kalman_gain*obs_prior_cov'
    # step likelihood
    dist = MvNormal(reshape(obs_prior_exp,(length(obs_prior_exp),)),obs_prior_var)
    log_like = logpdf(dist,new_obs)
    return ["post_exp"=>post_exp,"post_var"=>post_var,"log_like"=>log_like]

We see that the code is quite simple, as it closely mimics the mathematical formulas above in syntax. The part of the code that does not mimic the usual mathematical syntax — preparing parameters — is handled elsewhere by unpackParams.

Now that we have the code that predicts and updates the Kalman Filter for any individual while outputting the contribution to the log-likelihood, we can iterate over time and collect all of the log-likelihood contributions from an individual’s time path:

function indivKF(params,init_exp,init_var,i)
    iData = data[i,:]
    # initialization
    post_exp = init_exp
    post_var = init_var
    init_obs = array([iData[obsDict[1]]])'
    dist = MvNormal(eye(length(init_obs)))
    for t = 1:(T-1)
        # predict and update
        new_obs = array([iData[obsDict[t+1]]])'
        new_post = incrementKF(params,post_exp,post_var,new_obs)
        # replace
        post_exp = new_post["post_exp"]
        post_var = new_post["post_var"]
        # contribute
        log_like += new_post["log_like"]
    return log_like

This initializes the individual’s contribution to the likelihood using the assumed initial distribution of \mathcal{N}\left(0,\mathbf{1}\right), then proceeds to predict and update the Kalman Filter over time until all log-likelihood contributions have been collected. The only confusing part is obsDict, which is a dictionary containing the column names of the observations, which are assumed to be stored in a DataFrame called data. This strategy isn’t necessary, but it avoids a lot of complications, and also makes our example better resemble a real-world application to a data set.

Finally, we loop through each individual, summing all of their lifetime log-likelihood contributions:

function sampleKF(params,init_exp,init_var)
    log_like = 0.0
    N = size(data,1)
    for i in 1:N
        log_like += indivKF(params,init_exp,init_var,i)
    neg_avg_log_like = -log_like/N
    println("current average negative log-likelihood: ",neg_avg_log_like)
    return neg_avg_log_like[1]

where I like to include a print statement so that I can monitor the convergence of the likelihood. Thus, we have the panel data Kalman Filter likelihood, under the appropriate assumptions to ensure identification. Finally, we use a wrapper to prepare it for optimization:

function wrapLoglike(params)
    print("current parameters: ",params)
    return sampleKF(params,init_exp,init_var)

where I have included a print statement so that I can monitor the parameters that the optimizer is testing and know in advance if the estimates are diverging (if it were diverging, I would know my code has a bug and could stop the optimizer before it wastes too much time).

We conclude by estimating the Kalman Filter MLE on simulated data.


First, we simulate data from KalmanDGP:

N = 1000
T = 4
stateDim = 2
obsDim = 3
init_exp = zeros(stateDim)
init_var = eye(stateDim)
params0 = [1.,0.,0.,1.,0.,0.,.5,-.5,.5,-.5,0.,0.,0.,0.,0.,0.]

data = DataFrame(data)
obsDict = {["one_1","one_2","one_3","one_4","one_5","one_6"],["two_1","two_2","two_3","two_4","two_5","two_6"],["three_1","three_2","three_3","three_4","three_5","three_6"],["four_1","four_2","four_3","four_4","four_5","four_6"]}

Everything here is the same as in the DGP code above, except that we are converting the data matrix into a DataFrame. The columns are named using the syntax time_number, so that three_4 means the 4th observation in time period 3. It creates obsDict such that, for example, data[obsDict[3]] would return the variables from three_1 to three_6 — all of the observations at time 3 in the correct order.

Now, we are ready to estimate the Kalman Filter as follows:

MLE = optimize(wrapLoglike,params0,method=:cg,ftol=1e-8)

When it is finished (and we have watched it converge using the print statements), we can view the parameters in the appropriate context by:

optimParams = unpackParams(MLE.minimum)
["V" 2x2 Array{Float64,2}:
 1.11323  0.0
 0.0      0.918925,
"C" 6x2 Array{Float64,2}:
  1.0        0.0
  0.489902   0.0
 -0.490378   0.0
  0.0        1.0
  0.0        0.528178
  0.0       -0.517231,
"A" 2x2 Array{Float64,2}:
 0.976649    -0.00976326
 0.00762499   0.994169  ,
"C" 6x6 Array{Float64,2}:
 0.977998  0.0      0.0     0.0       0.0      0.0
 0.0       0.97972  0.0     0.0       0.0      0.0
 0.0       0.0      1.0241  0.0       0.0      0.0
 0.0       0.0      0.0     0.990665  0.0      0.0
 0.0       0.0      0.0     0.0       1.00625  0.0
 0.0       0.0      0.0     0.0       0.0      1.01042]

This compares favorably to the true parameters,

julia> unpackParams(params0)
["V" 2x2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0,
"C" 6x2 Array{Float64,2}:
  1.0   0.0
  0.5   0.0
 -0.5   0.0
  0.0   1.0
  0.0   0.5
  0.0  -0.5,
"A" 2x2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0,
"W" 6x6 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0]

On a single processor of my personal laptop, optimization requires 446 seconds. In response to a reader comment, I also added this version of the code that achieves optimization in 393 seconds by removing all global variables.

Bradley J. Setzler

Parallel Processing in Julia: Bootstrapping the MLE

* Scripts that reproduce the results of this tutorial in Julia are located here and here.

To review our extended example, we first simulated the OLS DGP, solved for the MLE of OLS, bootstrapped the MLE, and computed various non-parametric p-values.

The only part of our previous work that may have slowed down your computer was forming the 10,000 bootstrap samples (samples). In this post, we will see how to speed up the bootstrapping by using more than one processor to collect the bootstrap samples.

The Estimator that We Wish to Bootstrap

We are interested in the following MLE estimator, adapted from our previous work. First, we import the two needed packages and generate the data:

# Prepare
using Distributions
using Optim
using DataFrames

# Generate variables
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]
genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.01,0.05,0.05,0.07]
Y = X*trueParams + epsilon

# Export data
data = DataFrame(hcat(Y,X))

Next, we define the likelihood of OLS:

function loglike(rho,y,x)
    beta = rho[1:4]
    sigma2 = exp(rho[5])+eps(Float64)
    residual = y-x*beta
    dist = Normal(0, sqrt(sigma2))
    contributions = logpdf(dist,residual)
    loglikelihood = sum(contributions)
    return -loglikelihood

Finally, we define the bootstrap function, which includes within it a wrapper for the likelihood:

function bootstrapSamples(B)
    samples = zeros(B,5)
    for b=1:B
		theIndex = sample(1:N,N)
		x = X[theIndex,:]
		y = Y[theIndex,:]
		function wrapLoglike(rho)
			return loglike(rho,y,x)
		samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
	samples[:,5] = exp(samples[:,5])
    return samples

The reason for the cute println statements will become apparent later. Save all of this in a file called “bootstrapFunction.jl”. Finally, we import the data created above:

data = readtable("data.csv")
N = size(data,1)
Y = array(data[:Y])
X = array(data[[:one,:X1,:X2,:X3]])

Parallel Bootstrapping

We now have the function, bootstrapSamples, saved in the file “bootstrapFunction.jl”, that we wish to run in parallel. The idea is that, instead of running the MLE B times on one processor, we run it B/4 times on each of four processors, then collect the results back to the first processor. Unfortunately, Julia makes it very difficult to parallelize functions within the same file that the functions are defined; the user must explicitly tell Julia every parameter and function that is needed on each processor, which is a frustrating process of copy-pasting the command @everywhere all around your code.

Instead, we separate our work into two files, the first one you have already created. Create a second file in the same location on your computer as “bootstrapFunction.jl”. In the second file, write the following commands:

samples_pmap = pmap(bootstrapSamples,[b,b,b,b])
samples = vcat(samples_pmap[1],samples_pmap[2],samples_pmap[3],samples_pmap[4])

The command addprocs(4) means to add four processors into your Julia session. The command require is the key to getting this right: it ensures that all of the functions and parameters contained in “bootstrapFunction.jl” are available to each processor. pmap is the function that actually applies bootstrapSamples to each value in the list [b,b,b,b]. I set the list next to it to be [b,b,b,b], where b=2500, so that bootstrapEstimates would collect 2,500 bootstrap samples on each processor. The resulting matrices from all four processors are stored in the length-4 vector called samples_pmap. Finally, we vertically stack (vcat) the four matrices into one matrix, samples, which is exactly the matrix of bootstrap samples we have seen previously.

Tip: For parallel processing in Julia, separate your code into two files, one file containing the functions and parameters that need to be run in parallel, and the other file managing the processing and collecting results. Use the require command in the managing file to import the functions and parameters to all processors.


Once you have run the outer file, you can do speed tests to compare the two ways of getting 10,000 bootstrap samples. We use the @elapsed command to time each code. Here are the results:

@elapsed pmap(bootstrapSamples,[b,b,b,b])
	From worker 2:	hi
	From worker 3:	hi
	From worker 4:	hi
	From worker 5:	hi
	From worker 2:	bye
	From worker 5:	bye
	From worker 4:	bye
	From worker 3:	bye

Each of the processors tells us “hi” and “bye” when it begins and finishes its job, respectively. The total time was a bit more than a minute (66.5 seconds). By comparison, the following code will compute 10,000 bootstrap samples on only one processor, as in our previous work:

@elapsed bootstrapSamples(10000)

So, in the absence of parallel processing, the same result would require just over 145 seconds, or almost two and a half minutes. Thus, parallel processing on a four-core personal computer allowed us to reduce the time required to bootstrap our estimator by a factor of around 2.2. As a general rule, parallel processing of the bootstrap estimator saves you more time when the function being bootstrapped is more complicated.

Bradley J. Setzler

Maximum Likelihood Estimation (MLE) in Julia: The OLS Example

* The script to reproduce the results of this tutorial in Julia is located here.

We continue working with OLS, using the model and data generating process presented in the previous post. Recall that,

\epsilon|X \sim \mathcal{N}\left(0,\sigma^2\right) \implies Y|X \sim \mathcal{N}\left( X\beta, \sigma^2 \right),

or, equivalently,

\left(Y-X\beta\right)|X \sim \mathcal{N}\left( 0, \sigma^2 \right),

which is a more convenient expression because the distribution does not depend on X, conditional on X. Denote the parameter vector by \rho \equiv [\beta, \sigma^2]. We will now see how to obtain the MLE estimate \hat\rho of \rho. By Bayes’ Rule and independence across individuals (i), the likelihood of \rho satisfies,

\mathcal{L}\left(\rho|Y,X\right) \propto \prod_{i=1}^N \phi\left( Y_i - X_i\beta, \sigma^2 \right|\rho);

where \phi is the normal probability distribution function (PDF). \hat\rho is the \arg\max of this expression, and we will show how to find it using a numerical search algorithm in Julia.

Computing the Log-Likelihood of OLS

First, we define the log-likelihood in Julia as follows (we are using the data X and Y generated in the previous post):

using Distributions
function loglike(rho)
   beta = rho[1:4]
   sigma2 = exp(rho[5])
   residual = Y-X*beta
   dist = Normal(0, sqrt(sigma2))
   contributions = logpdf(dist,residual)
   loglikelihood = sum(contributions)
   return -loglikelihood

This code first collects beta (beta) and \sigma^2 (sigma2) from \rho (rho), uses \sigma^2 to initialize the appropriate normal distribution (dist), then evaluates the normal distribution at each of the residuals, Y_i-X_i\beta (residuals), returning the negative of the sum of the individual contributions to the log-likelihood (contributions).

Tip: Always remember to use the negative of the likelihood (or log-likelihood) that you wish to maximize, because the optimize command is a minimizer by default. Since the \arg\max is the same when maximizing some function and minimizing the negative of the function, we need the negative sign in order to maximize the likelihood with a minimizer.

The only confusing part of this function is that sigma2 is read from rho as an exponential. This is strictly a unit conversion — it means that \sigma^2 was stored in \rho in log-units, so we must exponentiate to return it to levels, that is, \rho is defined as \rho \equiv [\beta, \log\left(\sigma^2\right)]. This is a common approach in numerical optimization to solve a practical problem: the numerical optimizer tries out many possible values for \rho in its effort to search for the MLE, and it may naively try out a negative value of \sigma^2, which would be a logical contradiction that would crash the code. By storing \sigma^2 in \log units, we make it perfectly acceptable for the optimizer to try out a negative value of the parameter, because a parameter that is negative in log-units is non-negative in level-units.

Tip: Always restrict variances so that the optimizer cannot try negative values. We have used log-units to achieve this goal.

Maximizing the Likelihood of OLS

Now, we are ready to find the MLE, \hat\rho. To do this, we will use the Optim package, which I previously showed how to install.

using Optim
params0 = [.1,.2,.3,.4,.5]
optimum = optimize(loglike,params0,method=:cg)
MLE = optimum.minimum
MLE[5] = exp(MLE[5])

This says to optimize the function loglike, starting from the point params0, which is chosen somewhat arbitrarily. Numerical search algorithms have to start somewhere, and params0 serves as an initial guess of the optimum. Of course, the best possible guess is trueParams, because the optimizer would have to do much less work, but in practice, we do not know the true parameters so the optimizer will have to do the work. Notice that, at the end, we have to exponentiate the sigma2 parameter because the optimizer will return it in log-units due to our exponentiation above.


Using the same random seed as before, the algorithm returns the estimates,

julia> MLE
5-element Array{Float64,1}:

which are very close to trueParams and the true variance of \epsilon, 1. The Optim package has various optimizers from which to choose; we were using the Newton Conjugate-Gradient (cg) algorithm above. If we replace this with the Nelder-Mead algorithm (nelder_mead), we obtain the almost-identical estimates,

julia> MLE
5-element Array{Float64,1}:

In future posts, we will see cases where the choice of optimizer makes a major difference in the quality of the numerical MLE as well as the speed of computation, but for this simple example, the choice of optimizer does not much matter.

Bradley J. Setzler