Bootstrapping and Non-parametric p-values in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

Suppose that we wish to test whether the parameter estimates of $\beta$ are statistically different from zero and whether the estimate of $\sigma^2$ is different from one for the OLS parameters defined in a previous post. Suppose further that we do not know how to compute the standard errors of the MLE parameter estimates analytically; the MLE estimates were presented in the previous post.

We decide to bootstrap by resampling cases in order to estimate the standard errors. This means that we treat the sample of $N$ individuals as if it were a population from which we randomly draw $B$ samples, each of size $N$. This produces a sample of MLEs of size $B$, that is, it provides an empirical approximation to the distribution of the MLE. From the empirical approximation, we can compare the full-sample point MLE to the MLE distribution under the null hypothesis.

To perform bootstrapping, we rely on the sample function (provided by the StatsBase package rather than by Julia itself), which draws with replacement by default.
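As a minimal sketch of resampling cases, the snippet below draws one bootstrap index with replacement and uses it to select rows. It assumes Julia 1.x and uses rand(1:N, N) from Base as a stand-in for sample(1:N, N); the small X and Y here are made-up placeholders, not the data from the previous post.

```julia
using Random

Random.seed!(2)                 # make the draw reproducible

N = 10                          # toy sample size (assumption)
X = [ones(N) randn(N, 3)]       # placeholder regressors, including a constant
Y = randn(N)                    # placeholder outcomes

# Draw N row numbers uniformly from 1:N, with replacement.
theIndex = rand(1:N, N)

# Use the same index for X and Y so that each (Y_i, X_i) pair stays together.
x = X[theIndex, :]
y = Y[theIndex]

println(size(x), " ", length(y))
```

Because the draw is with replacement, some individuals typically appear more than once in x and y while others do not appear at all.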

Wrapper Functions and the Likelihood of Interest

Before drawing any bootstrap index, we define the log-likelihood as a function of x and y, which can be any subsets of the rows of X and Y, respectively.

function loglike(rho,y,x)
    beta = rho[1:4]
    sigma2 = exp(rho[5])
    residual = y - x*beta
    dist = Normal(0, sqrt(sigma2))
    contributions = logpdf(dist,residual)
    loglikelihood = sum(contributions)
    return -loglikelihood
end


Then, if we wish to evaluate loglike across various subsets of x and y, we use what is called a wrapper, which simply creates a copy of loglike that has already set the values of x and y. For example, the following function will evaluate loglike when x=X and y=Y:

function wrapLoglike(rho)
    return loglike(rho,Y,X)
end


We do this because we want the optimizer to find the optimal $\rho$, holding x and y fixed, but we also want to be able to adjust x and y to suit our purposes. The wrapper function allows the user to modify x and y, but tells the optimizer not to bother them.

Tip: Use wrapper functions to manage arguments of your objective function that are not supposed to be accessed by the optimizer. Give the optimizer functions with only one argument — the parameters over which it is supposed to optimize your objective function.
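One way to build such wrappers on demand is with a closure. The sketch below assumes Julia 1.x; makeWrapper is a hypothetical helper name, and the Gaussian log-likelihood is written out by hand (rather than with Distributions) only so that the example is self-contained.

```julia
# Negative Gaussian log-likelihood with rho = [beta; log(sigma2)].
function loglike(rho, y, x)
    beta = rho[1:end-1]
    sigma2 = exp(rho[end])
    residual = y - x * beta
    n = length(y)
    return 0.5 * n * log(2pi * sigma2) + sum(abs2, residual) / (2 * sigma2)
end

# makeWrapper (hypothetical) returns a one-argument function that
# remembers the particular y and x it was built with.
makeWrapper(y, x) = rho -> loglike(rho, y, x)

x = [1.0 0.0; 1.0 1.0; 1.0 2.0]   # toy data
y = [0.1, 0.9, 2.1]
wrapped = makeWrapper(y, x)

rho = [0.0, 1.0, 0.0]             # beta = [0, 1], log(sigma2) = 0
println(wrapped(rho) == loglike(rho, y, x))
```

The optimizer only ever sees wrapped, a function of rho alone, while we remain free to build a new wrapper for each new pair of y and x.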

Bootstrapping the OLS MLE

Now, we will use a random index, which is drawn for each b using the sample function, to take a random sample of individuals from the data, feed them into the function using a wrapper, then have the optimizer minimize the wrapper (the negative log-likelihood) across the parameters. We repeat this process in a loop, so that we obtain the MLE for each bootstrap sample. The following loop stores the MLE in each row of the matrix samples using 1,000 bootstrap samples, each of size N (the size of the available sample):

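B=1000
samples = zeros(B,5)
for b=1:B
    # draw N row numbers with replacement
    theIndex = sample(1:N,N)
    x = X[theIndex,:]
    y = Y[theIndex,:]
    # wrapper that fixes this bootstrap sample
    function wrapLoglike(rho)
        return loglike(rho,y,x)
    end
    samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
end
# convert the variance column from log-units to levels
samples[:,5] = exp(samples[:,5])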


The resulting matrix contains 1,000 samples of the MLE. As always, we must remember to exponentiate the variance estimates, because they were stored in log-units.
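For readers who want to try the loop without installing Optim, here is a rough sketch under two loudly stated assumptions: it targets Julia 1.x, and it swaps the numerical optimizer for the closed-form Gaussian MLE (least squares for $\beta$, and the residual sum of squares divided by the sample size for $\sigma^2$). The data are simulated placeholders, so the numbers will not match those reported in this post.

```julia
using Random, Statistics

Random.seed!(2)
N = 200                                   # toy sample size (assumption)
X = [ones(N) randn(N, 3)]
Y = X * [0.1, 0.5, -0.3, 0.0] + randn(N)

B = 500                                   # number of bootstrap samples
samples = zeros(B, 5)
for b in 1:B
    theIndex = rand(1:N, N)               # resample cases with replacement
    x = X[theIndex, :]
    y = Y[theIndex]
    betaHat = x \ y                       # closed-form stand-in for optimize
    sigma2Hat = sum(abs2, y - x * betaHat) / N
    samples[b, 1:4] = betaHat
    samples[b, 5] = sigma2Hat             # already in levels, no exp needed
end

bootstrapSE = std(samples, dims = 1)
println(bootstrapSE)
```

Because the variance is computed directly in levels here, there is no log-unit column to exponentiate afterwards.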

Bootstrapping for Non-parametric p-values

Estimates of the standard errors of the MLE estimates can be obtained by computing the standard deviation of each column,

bootstrapSE = std(samples,1)


where the number 1 indicates that the standard deviation is taken over columns (instead of rows).
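In Julia 1.x, std lives in the Statistics standard library and the dimension is passed as a keyword argument. A small sketch with a made-up matrix:

```julia
using Statistics

# Toy "bootstrap" matrix: 3 draws (rows) of 2 parameters (columns).
samples = [1.0 10.0; 2.0 20.0; 3.0 30.0]

# dims = 1 collapses the rows, leaving one standard deviation per column.
bootstrapSE = std(samples, dims = 1)
println(bootstrapSE)
```

The result is a 1x2 matrix holding the standard deviations of the two columns, 1.0 and 10.0.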

Standard errors like these can be used directly for hypothesis testing under parametric assumptions. For example, if we assume an MLE is normally distributed, then we reject the null hypothesis that the parameter is equal to some point if the parameter estimate differs from the point by at least 1.96 standard errors (which corresponds to the 5 percent level and relies on the sample being large). However, we can make fewer assumptions using non-parametric p-values. The following code creates the distribution implied by the null hypothesis that $\beta_0 =0, \beta_1=0, \beta_2=0, \beta_3=0, \sigma^2=1$ by subtracting the mean from each distribution (thus imposing a zero mean) and then adding 1 to the distribution of $\sigma^2$ (thus imposing a mean of one); this is called nullDistribution.

# copy, so that recentering does not also overwrite samples
nullDistribution = copy(samples)
pvalues = ones(5)
for i=1:5
    nullDistribution[:,i] = nullDistribution[:,i]-mean(nullDistribution[:,i])
end
nullDistribution[:,5] = 1 + nullDistribution[:,5]


The non-parametric p-value (for two-sided hypothesis testing) is the fraction of draws from the null distribution whose absolute value is greater than the absolute value of the MLE.

pvalues = [mean(abs(MLE[i]).<abs(nullDistribution[:,i])) for i=1:5]


If we are interested in one-sided hypothesis testing, the following code would test each null hypothesis, for example $\beta_0 =0$, against the alternative that the parameter is greater than its null value, $\beta_0>0$:

pvalues = [mean(MLE[i].<nullDistribution[:,i]) for i=1:5]


Conversely, the following code would test each null hypothesis, for example $\beta_0 =0$, against the alternative that the parameter is less than its null value, $\beta_0<0$:

pvalues = [mean(MLE[i].>nullDistribution[:,i]) for i=1:5]


Thus, two-sided testing uses the absolute value (abs), and one-sided testing only requires that we choose the right comparison operator (.> or .<).
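The three cases can be collected into one function. This is only a sketch, assuming Julia 1.x (where abs must be broadcast over an array with a dot, abs.(...)); bootPvalue is a hypothetical helper name and the null draws are a toy vector chosen so that the fractions are easy to verify by hand.

```julia
using Statistics

# Share of null draws that are more extreme than the estimate.
function bootPvalue(estimate, nullDraws; side = :both)
    if side == :both
        return mean(abs(estimate) .< abs.(nullDraws))
    elseif side == :upper          # alternative: parameter > null value
        return mean(estimate .< nullDraws)
    else                           # alternative: parameter < null value
        return mean(estimate .> nullDraws)
    end
end

nullDraws = [-2.0, -1.0, 0.0, 1.0, 2.0]   # toy null distribution centered at 0
estimate = 1.5

println(bootPvalue(estimate, nullDraws))                 # two-sided
println(bootPvalue(estimate, nullDraws, side = :upper))  # one-sided, upper tail
println(bootPvalue(estimate, nullDraws, side = :lower))  # one-sided, lower tail
```

With these toy numbers, two of the five draws exceed 1.5 in absolute value, one draw exceeds 1.5, and four draws lie below 1.5, so the three fractions are 0.4, 0.2 and 0.8.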

Results
Let the true parameters be,

julia> trueParams = [0.01,0.05,0.05,0.07]


The resulting bootstrap standard errors are,

julia> bootstrapSE = std(samples,1)
1x5 Array{Float64,2}:
0.0308347  0.0311432  0.0313685  0.0305757  0.0208229


and the non-parametric two-sided p-value estimates are,

julia> pvalues = [mean(abs(MLE[i]).<abs(nullDistribution[:,i])) for i=1:5]
5-element Array{Any,1}:
0.486
0.383
0.06
0.009
0.289


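Thus, at the 10 percent significance level, we reject the null hypotheses for the third and fourth parameters only and conclude that $\beta_2 \neq 0, \beta_3 \neq 0$ (at the 5 percent level, only the fourth null hypothesis would be rejected, since the third p-value is 0.06), but find insufficient evidence to reject the null hypotheses that $\beta_0 =0, \beta_1 =0$ and $\sigma^2=1$.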

Maximum Likelihood Estimation (MLE) in Julia: The OLS Example

* The script to reproduce the results of this tutorial in Julia is located here.

We continue working with OLS, using the model and data generating process presented in the previous post. Recall that,

$\epsilon|X \sim \mathcal{N}\left(0,\sigma^2\right) \implies Y|X \sim \mathcal{N}\left( X\beta, \sigma^2 \right)$,

or, equivalently,

$\left(Y-X\beta\right)|X \sim \mathcal{N}\left( 0, \sigma^2 \right)$,

which is a more convenient expression because the distribution does not depend on $X$, conditional on $X$. Denote the parameter vector by $\rho \equiv [\beta, \sigma^2]$. We will now see how to obtain the MLE estimate $\hat\rho$ of $\rho$. By Bayes’ Rule and independence across individuals (i), the likelihood of $\rho$ satisfies,

$\mathcal{L}\left(\rho|Y,X\right) \propto \prod_{i=1}^N \phi\left( Y_i - X_i\beta, \sigma^2 \right)$,

where $\phi$ is the normal probability density function (PDF). $\hat\rho$ is the $\arg\max$ of this expression, and we will show how to find it using a numerical search algorithm in Julia.
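In practice we work with the logarithm of this expression, which turns the product into a sum. Writing it out, and assuming that $\phi\left(\cdot, \sigma^2\right)$ denotes the density of $\mathcal{N}\left(0,\sigma^2\right)$, the log-likelihood is (up to an additive constant that does not depend on $\rho$),

$\log \mathcal{L}\left(\rho|Y,X\right) = \sum_{i=1}^N \log \phi\left( Y_i - X_i\beta, \sigma^2 \right) = -\frac{N}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^N \left( Y_i - X_i\beta \right)^2$.

Since the logarithm is increasing, the $\arg\max$ of the log-likelihood is the same as the $\arg\max$ of the likelihood.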

Computing the Log-Likelihood of OLS

First, we define the log-likelihood in Julia as follows (we are using the data X and Y generated in the previous post):

using Distributions
function loglike(rho)
    beta = rho[1:4]
    sigma2 = exp(rho[5])
    residual = Y-X*beta
    dist = Normal(0, sqrt(sigma2))
    contributions = logpdf(dist,residual)
    loglikelihood = sum(contributions)
    return -loglikelihood
end


This code first collects $\beta$ (beta) and $\sigma^2$ (sigma2) from $\rho$ (rho), uses $\sigma^2$ to initialize the appropriate normal distribution (dist), then evaluates the log-density of that distribution at each of the residuals, $Y_i-X_i\beta$ (residual), returning the negative of the sum of the individual contributions to the log-likelihood (contributions).

Tip: Always remember to use the negative of the likelihood (or log-likelihood) that you wish to maximize, because the optimize command is a minimizer by default. Since the $\arg\max$ is the same when maximizing some function and minimizing the negative of the function, we need the negative sign in order to maximize the likelihood with a minimizer.

The only confusing part of this function is that sigma2 is read from rho as an exponential. This is strictly a unit conversion: it means that $\sigma^2$ was stored in $\rho$ in log-units, so we must exponentiate to return it to levels, that is, $\rho$ is defined as $\rho \equiv [\beta, \log\left(\sigma^2\right)]$. This is a common approach in numerical optimization to solve a practical problem: the numerical optimizer tries out many possible values for $\rho$ in its effort to search for the MLE, and it may naively try out a negative value of $\sigma^2$, which would be a logical contradiction that would crash the code. By storing $\sigma^2$ in $\log$ units, we make it perfectly acceptable for the optimizer to try out a negative value of the parameter, because a parameter that is negative in log-units is still strictly positive in level-units.

Tip: Always restrict variances so that the optimizer cannot try negative values. We have used log-units to achieve this goal.
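A quick sketch of why the log-unit trick works, assuming Julia 1.x (where exp and log are broadcast over arrays with a dot). The trial values are made up to mimic the kind of proposals an optimizer might make:

```julia
# Any real number the optimizer proposes maps to a strictly positive variance.
logSigma2Trials = [-5.0, -0.5, 0.0, 0.5, 5.0]   # hypothetical optimizer proposals
sigma2Trials = exp.(logSigma2Trials)

println(sigma2Trials)
println(all(sigma2Trials .> 0))

# Converting back: the log of the variance recovers the stored parameter.
println(log.(sigma2Trials) ≈ logSigma2Trials)
```

Even the proposal of -5.0 corresponds to a small but valid variance, so the normal distribution can always be constructed.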

Maximizing the Likelihood of OLS

Now, we are ready to find the MLE, $\hat\rho$. To do this, we will use the Optim package, which I previously showed how to install.

using Optim
params0 = [.1,.2,.3,.4,.5]
optimum = optimize(loglike,params0,method=:cg)
MLE = optimum.minimum
MLE[5] = exp(MLE[5])
println(MLE)


This says to optimize the function loglike, starting from the point params0, which is chosen somewhat arbitrarily. Numerical search algorithms have to start somewhere, and params0 serves as an initial guess of the optimum. Of course, the best possible guess is trueParams, because the optimizer would have to do much less work, but in practice, we do not know the true parameters so the optimizer will have to do the work. Notice that, at the end, we have to exponentiate the sigma2 parameter because the optimizer will return it in log-units due to our exponentiation above.

Results

Using the same random seed as before, the algorithm returns the estimates,

julia> MLE
5-element Array{Float64,1}:
0.112163
0.476432
-0.290571
0.010831
1.01085


which are very close to trueParams and the true variance of $\epsilon$, 1. The Optim package has various optimizers from which to choose; we were using the conjugate gradient (cg) algorithm above. If we replace this with the Nelder-Mead algorithm (nelder_mead), we obtain the almost-identical estimates,

julia> MLE
5-element Array{Float64,1}:
0.112161
0.476435
-0.290572
0.0108343
1.01085


In future posts, we will see cases where the choice of optimizer makes a major difference in the quality of the numerical MLE as well as the speed of computation, but for this simple example, the choice of optimizer does not matter much.
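For this particular model there is also a closed-form check on any numerical answer: the Gaussian MLE of $\beta$ coincides with the least squares estimate, and the MLE of $\sigma^2$ is the residual sum of squares divided by $N$. The sketch below assumes Julia 1.x and freshly simulated placeholder data (so it will not reproduce the numbers above); negloglike is a hand-written stand-in for the loglike function in this post.

```julia
using Random, LinearAlgebra

Random.seed!(2)
N = 1000
X = [ones(N) randn(N, 3)]
trueParams = [0.1, 0.5, -0.3, 0.0]
Y = X * trueParams + randn(N)

# Negative Gaussian log-likelihood in levels of sigma2.
function negloglike(beta, sigma2, y, x)
    n = length(y)
    return 0.5 * n * log(2pi * sigma2) + sum(abs2, y - x * beta) / (2 * sigma2)
end

betaHat = X \ Y                               # least-squares solution
sigma2Hat = sum(abs2, Y - X * betaHat) / N    # note: divides by N, not N - 4

# Moving away from the closed-form point can only raise the objective.
best = negloglike(betaHat, sigma2Hat, Y, X)
println(best <= negloglike(betaHat .+ 0.01, sigma2Hat, Y, X))
println(best <= negloglike(betaHat, 1.1 * sigma2Hat, Y, X))
```

If a numerical optimizer returns estimates far from betaHat and sigma2Hat, that is a sign that it stopped early or started from a poor initial guess.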

Introductory Example: Ordinary Least Squares (OLS) Estimation in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

In this post, I show in Julia how to perform ordinary least squares (OLS) estimation after first simulating the OLS data generating process (DGP). At the end, we see that the parameter estimates converge to the true parameter as sample size grows large. If you have not yet installed Julia, it takes 5 minutes following these instructions.

As a reminder and to clarify notation, the OLS DGP is,

$Y = X\beta + \epsilon$,

where $Y$ is the $N\mathrm{x}1$ dependent variable, $X$ is the $N\mathrm{x}\tilde{K}$ matrix of independent variables, $\beta$ is the $\tilde{K}\mathrm{x}1$ vector of parameters that we wish to estimate, and $\epsilon$ is the $N\mathrm{x}1$ error satisfying $\epsilon \overset{\mathit{i.i.d.}}{\sim} \mathcal{N}\left(0,\sigma^2\right)$. Because we assume that the first column of $X$ is the constant number 1, we will find it useful below to work with $K \equiv \tilde{K}-1$. The least squares estimator is,

$\hat{\beta} = \left( X^T X \right)^{-1} \left( X^T Y \right)$.

Matrix Algebra and Simulated Random Variables: The OLS DGP

First, we generate the independent variables $X$, then we use $X$ to generate the dependent variable $Y$. To begin, create a new file in Julia Studio and save it to your computer. In the file editor (script), insert the following commands:


using Distributions
N=1000
K=3
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]



The using command lets Julia know that we will be using the Distributions package. The MvNormal() command initializes a multivariate normal distribution, which is an object with methods such as pdf for the probability density function and rand for drawing random variables.  eye(K) means $\mathit{I}_K$, the identity matrix of size $K\mathrm{x}K$. We only told MvNormal the covariance matrix, leaving the mean blank, which Julia assumes means that we would like a zero mean. The distribution of $X$ is arbitrary; we only used multivariate normal for simplicity. At the end, we concatenate the vector of ones to X using brackets, [].

Tip: To ensure that the matrices are of the appropriate dimension, use the size command. Above, we transposed $X$ after finding that rand returned it as a $3\mathrm{x}N$ matrix, when we need it to be $N \mathrm{x} 3$. Misaligned dimensions are one of the most common and frustrating errors to make in writing a program.
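A sketch of the same construction with explicit size checks, assuming Julia 1.x, where eye is no longer available. Because the covariance matrix is the identity, independent standard normal draws from randn have the same distribution as the multivariate normal draws above, and randn(N, K) already returns the $N\mathrm{x}K$ orientation, so no transpose is needed.

```julia
using Random

Random.seed!(2)
N = 1000
K = 3

X_noconstant = randn(N, K)          # N-by-K independent standard normal draws
X = [ones(N) X_noconstant]          # prepend the column of ones

println(size(X_noconstant))
println(size(X))
```

The first call to size should report 1000 rows and 3 columns, and the second 1000 rows and 4 columns.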

Now that we have created $X$ as a matrix containing a column of ones as well as three independent variables, we wish to multiply it by a vector of regression coefficients of length 4 (including the intercept) and add the normally distributed shock, $\epsilon$. For simplicity, we assume $\sigma^2=1$.

genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.1,0.5,-0.3,0.]
Y = X*trueParams + epsilon


Matrix multiplication in Julia can be written either in function-call form, *(A,B), which means AxB, or in the infix form we used above, A*B. Then, you can click run (the little green arrow in Julia Studio) and it will perform the operations in the file above. To make sure it worked, you can now go to the Console and type,

julia> mean(Y)


and press Enter. If the code worked correctly, this should return the mean of simulated $Y$, which should be near the true intercept 0.1 (since each of the independent variables has mean zero, the true mean of $Y$ is just the intercept).

Functions in Julia: The OLS Estimator

Functions are defined in Julia using the command function, followed by the desired name of your function, and parentheses containing the arguments of the function. An end statement is required upon completion of the function definition. Indentation within the body of the function is not required by Julia, but it makes the code easier to read, and it is a good practice to explicitly include a return statement. The OLS estimator is defined for any data y and x as follows:

function OLSestimator(y,x)
    estimate = inv(x'*x)*(x'*y)
    return estimate
end


This function uses matrix multiplication (*) three times, the transpose (') twice, and the inverse (inv()) once. This function works for any matrices x and y with the appropriate dimensions. Once you have defined this function by running the file, you can obtain the OLS estimates of the true parameters by typing,

julia> estimates = OLSestimator(Y,X)


Because of the return statement, the parameter estimates will be returned by OLSestimator() and stored in estimates. If you compute estimates in the script, you can print them to the screen using,

println(estimates)


Finally, change the parameter N defined at the beginning of your code. When N is small (say, 100), estimates will usually be further from trueParams than when N is large (say, 10,000). This should be very easy to change; if you used the print statement for the estimates, just change the value of N, run the code, and see the new estimates printed to the console. In order to make your estimates reproducible (i.e., the exact same draws from the random distributions), set the random seed at the beginning of your code using,

srand(2)


where 2 is a possible seed you could choose.
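The following sketch puts the seed and the sample-size experiment together. It assumes Julia 1.x, where the seed is set with Random.seed! (srand was renamed); simulateOLS is a hypothetical helper name, and the backslash operator is used as a stand-in for OLSestimator. The draws differ from those of the Julia version used in this post, so the printed numbers will not match the results below.

```julia
using Random

# Simulate the DGP with N observations and return the largest absolute
# gap between the OLS estimates and the true parameters.
function simulateOLS(N; seed = 2)
    Random.seed!(seed)
    trueParams = [0.1, 0.5, -0.3, 0.0]
    X = [ones(N) randn(N, 3)]
    Y = X * trueParams + randn(N)
    estimates = X \ Y
    return maximum(abs.(estimates - trueParams))
end

for N in (100, 10_000, 100_000)
    println(N, " => ", simulateOLS(N))
end
```

Calling simulateOLS twice with the same N and seed returns exactly the same number, which is what reproducibility means here; the gap typically shrinks as N grows.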

Results

When I run the code above with random seed 2, I find that,

julia> estimates
4-element Array{Float64,1}:
0.11216
0.476437
-0.290574
0.0108337


so you can use this to check your results.

If you are only trying to estimate OLS, you can use the built-in command linreg(), but do not include the vector of ones, as linreg() will add another vector of ones. This is why I created X_noconstant above, which is just X without a column of ones. The syntax is,

julia> linreg(X_noconstant,Y)
4-element Array{Float64,1}:
0.11216
0.476437
-0.290574
0.0108337


and the estimates are identical to those in estimates.
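As far as we know, linreg is not available in Julia 1.x. A sketch of an equivalent there, using simulated placeholder data, is to add the column of ones ourselves and let the backslash operator solve the least-squares problem:

```julia
using Random

Random.seed!(2)
N = 1000
X_noconstant = randn(N, 3)
Y = 0.1 .+ X_noconstant * [0.5, -0.3, 0.0] + randn(N)

# Add the constant ourselves, then solve the least-squares problem.
X = [ones(N) X_noconstant]
viaBackslash = X \ Y
viaFormula = inv(X' * X) * (X' * Y)     # same formula as OLSestimator

println(viaBackslash ≈ viaFormula)
```

The backslash form avoids computing the explicit inverse, which is generally the more numerically stable choice.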

Getting Started: Installing Julia, Julia Studio, and Packages used in Economics

UPDATE: Julia Studio is no longer supported. Please see my more recent installation guide here.

In this post, I explain how to install Julia, Julia Studio, and 3 packages commonly used in economics on your personal computer in about 5 minutes.

Installing Julia

Unlike installing Python, it is very easy to install Julia and its packages. Simply download Julia Studio, which is the most popular IDE for Julia, and click install. This will also install the current version of the Julia language. Now, open Julia Studio. In the console, type:


julia> 2+2


and press Enter. If it returns the number 4, you have successfully installed Julia.

Installing Packages in Julia

Next, you need to install a few packages used frequently in economics. The following command will install the Distributions package, which allows you to simulate random variables and evaluate probability distribution functions. In the console, type:

julia> Pkg.add("Distributions")


Like R but unlike Python, Julia installs packages from within Julia. Also, install the packages called “DataFrames”, which is used for working with data, and “Optim”, which contains numerical optimizers.

That’s it, you should be ready to work in Julia after about 5 minutes of installations!

Why I Switched to Julia

The following story explains why I began programming in Julia. Since then, I have found that Julia improves the performance of my other econometric estimators. However, Julia has a major disadvantage in that it lacks informative documentation and tutorials, much less accumulated discussion on sites like stackoverflow. This blog is meant to record the skills I am learning in Julia over time, to serve as a tutorial for economists and others learning the Julia programming language.

Is Julia the Future of Computational Economics?

I am currently estimating a structural econometric model of game-theoretic parent-child interaction. Using the standard implementation of Python (the code is written entirely in NumPy and SciPy with data prepared by Pandas), the optimizer ran for 24 hours, then terminated due to the 5,000 iteration limit. It was converging smoothly, but never quite arrived. While waiting for the estimates last night (and growing increasingly impatient), I installed Julia and its packages, learned how to program in Julia, rewrote the estimation in Julia, and this morning successfully optimized the likelihood in Julia.

The contrast is staggering: the optimization that didn’t converge after 24 hours in Python converged after only 15 minutes in Julia while Python was still running on the same processor. Julia was already achieving a greater likelihood than Python after only 5 minutes even though Python had a 20-hour head start. They are both using the same optimization algorithm (including numerical tolerance), and the structure of the code is identical. Julia evaluates the likelihood in 0.5 seconds, while Python requires 21 seconds per evaluation, so Julia is about 40 times faster in the function evaluation, and about 100 times faster in the optimizer (I’m giving Python the benefit of the doubt even though it never converged).

The final iteration of Python was approaching the Julia optimal likelihood and getting closer; the only difference was that Julia arrived much, much more quickly. Since my next step is to bootstrap the estimator, speed is extremely important. Some practical arithmetic: on my four-core laptop, it would take Python two-thirds of a year to bootstrap this estimator 1,000 times, whereas Julia could do it in fewer than three days (though I’m planning to run the bootstrap in batch on the server).
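The arithmetic can be reproduced in a few lines. This sketch assumes that each bootstrap replication takes as long as one full estimation (24 hours in Python, 15 minutes in Julia) and that the four cores each run one replication at a time in parallel:

```julia
hoursPerFitPython = 24.0        # did not converge, so this is a lower bound
hoursPerFitJulia = 0.25         # 15 minutes
B = 1000                        # bootstrap replications
cores = 4                       # assumes one fit per core, run in parallel

daysPython = B * hoursPerFitPython / cores / 24
daysJulia = B * hoursPerFitJulia / cores / 24

println(daysPython)
println(daysJulia)
```

This gives 250 days for Python, roughly two-thirds of a year, and about 2.6 days for Julia.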

I am agnostic on programming languages; I use whatever gets the answer fastest and can be reproduced most clearly, and I often use multiple languages on the same project to get the best features of each. My only claim is that Julia has taken the Python code, with minimal syntax changes, and executed the code 100 times faster for someone who had no prior experience with Julia. This was not a contrived, time-testing code; this is the estimator motivated by economic theory. The 100-fold speed increase of Julia relative to Python has been found elsewhere in computational economics.

So, is Julia the programming language of the future in structural econometrics? I’m not sure, but it seems to dominate Python and R at the moment.