# Selection Bias Corrections in Julia, Part 1

Selection bias arises when a data sample is not a random draw from the population that it is intended to represent. This is especially problematic when the probability that a particular individual appears in the sample depends on variables that also affect the relationships we wish to study. Selection bias corrections based on models of economic behavior were pioneered by the economist James J. Heckman in his seminal 1979 paper.

For an example of selection bias, suppose we wish to study the effectiveness of a treatment (a new medicine for patients with a particular disease, a preschool curriculum for children facing particular disadvantages, etc.). A random sample is drawn from the population of interest, and the treatment is randomly assigned to a subset of this sample, with the remaining subset serving as the untreated (“control”) group. If the subsets followed instructions, then the resulting data would serve as a random draw from the data generating process that we wish to study.

However, suppose the treatment and control groups do not comply with their assignments. In particular, if only some of the treated benefit from treatment while the others are in fact harmed by treatment, then we might expect the harmed individuals to leave the study. If we accepted the resulting data as a random draw from the data generating process, it would appear that the treatment was much more successful than it actually was; an individual who benefits is more likely to be present in the data than one who does not benefit.

Conversely, if treatment were very beneficial, then some individuals in the control group may find a way to obtain treatment, possibly without our knowledge. The benefits received by the control group would make it appear that the treatment was less beneficial than it actually was; the receipt of treatment is no longer random.

In this tutorial, I present some parameterized examples of selection bias. Then, I present examples of parametric selection bias corrections, evaluating their effectiveness in recovering the data generating processes. Along the way, I demonstrate the use of the GLM package in Julia. A future tutorial demonstrates non-parametric selection bias corrections.

Example 1: Selection on a Normally-Distributed Unobservable

Suppose that we wish to study of the effect of the observable variable $X_{i}$ on $Y_i$. The data generating process is given by,

$Y_i=\beta_0+X_i\beta_1+\epsilon_i$,

where $X_i$ and $\epsilon_i$ are independent in the population. Because of this independence condition, the ordinary least squares estimator would be unbiased if the data $(Y,X)$ were drawn randomly from the population. However, suppose that the probability that individual $i$ were in the data set $\mathcal{S}$ were a function of $X$ and $\epsilon$. For example, suppose that,

$\Pr\left( i \in \mathcal{S}\Big| X_i,\epsilon_i\right)=1$ if $X_i > \epsilon_i$,

and the probability is zero otherwise.  This selection rule ensures that, among the individuals in the data (the $i$ in $\mathcal{S}$), the covariance between $X$ and $\epsilon$ will be positive, even though the covariance is zero in the population. When $X$ covaries positively with $\epsilon$, then the OLS estimate of $\beta_1$ is biased upward, i.e., the OLS estimator converges to a value that is greater than $\beta_1$.

To see the problem, consider the following simulation of the above process in which $X$ and $\epsilon$ are drawn as independent standard normal random variables:

srand(2)
N = 1000
X = randn(N)
epsilon = randn(N)
beta_0 = 0.
beta_1 = 1.
Y = beta_0 + beta_1.*X + epsilon
populationData = DataFrame(Y=Y,X=X,epsilon=epsilon)
selected = X.>epsilon
sampleData = DataFrame(Y=Y[selected],X=X[selected],epsilon=epsilon[selected])

There are 1,000 individuals in the population data, but 492 of them are selected to be included in the sample data. The covariance between X and epsilon in the population data is,

cov(populationData[:X],populationData[:epsilon])
0.00861456704877879

which is approximately zero, but in the sample data, it is,

cov(sampleData[:X],sampleData[:epsilon])
0.32121357192108513

which is approximately 0.32.

Now, we regress $Y$ on $X$ with the two data sets to obtain,

linreg(array(populationData[:X]),array(populationData[:Y]))
2-element Array{Float64,1}:
-0.027204
1.00882

linreg(array(sampleData[:X]),array(sampleData[:Y]))
2-element Array{Float64,1}:
-0.858874
1.4517

where, in Julia 0.3.0, the command array() replaces the old command matrix() in converting a DataFrame into a numerical Array. This simulation demonstrates severe selection bias associated with using the sample data to estimate the data generating process instead of the population data, as the true parameters, $\beta_0=0,\beta_1=1$, are not recovered by the sample estimator.

Correction 1: Heckman (1979)

The key insight of Heckman (1979) is that the correlation between $X$ and $\epsilon$ can be represented as an omitted variable in the OLS moment condition,

$\mathbb{E}\left[Y_i\Big|X_i,i\in\mathcal{S}\right]=\mathbb{E}\left[\beta_0 + \beta_1X_i+\epsilon_i\Big|X_i,i\in\mathcal{S}\right]=\beta_0 + \beta_1X_i+\mathbb{E}\left[\epsilon_i\Big|X_i,i\in\mathcal{S}\right]$,

Furthermore, using the conditional density of the standard Normally distributed $\epsilon$,

$\mathbb{E}\left[\epsilon_i\Big|X_i,i\in\mathcal{S}\right]=\mathbb{E}\left[\epsilon_i\Big|X_i,X_i>\epsilon_i\right]=\frac{-\phi\left(X_i\right)}{\Phi\left(X_i\right)}$.

which is called the inverse Mills ratio, where $\phi$ and $\Phi$ are the probability and cumulative density functions of the standard normal distribution. As a result, the moment condition that holds in the sample is,

$\mathbb{E}\left[Y_i\Big|X_i,i\in\mathcal{S}\right]=\beta_0 + \beta_1X_i+\frac{-\phi\left(X_i\right)}{\Phi\left(X_i\right)}$

Returning to our simulation, the inverse Mills ratio is added to the sample data as,

sampleData[:invMills] = -pdf(Normal(),sampleData[:X])./cdf(Normal(),sampleData[:X])

Then, we run the regression corresponding to the sample moment condition,

linreg(array(sampleData[[:X,:invMills]]),array(sampleData[:Y]))
3-element Array{Float64,1}:
-0.166541
1.05648
0.827454

We see that the estimate for $\beta_1$ is now 1.056, which is close to the true value of 1, compared to the non-corrected estimate of 1.452 above. Similarly, the estimate for $\beta_0$ has improved from -0.859 to -0.167, when the true value is 0. To see that the Heckman (1979) correction is consistent, we can increase the sample size to $N=100,000$, which yields the estimates,

linreg(array(sampleData[[:X,:invMills]]),array(sampleData[:Y]))
3-element Array{Float64,1}:
-0.00417033
1.00697
0.991503

which are very close to the true parameter values.

Note that this analysis generalizes to the case in which $X$ contains $K$ variables and the selection rule is,

$\delta_0 + \delta_1 X_1 + \delta_2 X_2 + \ldots + \delta_K X_k > \epsilon$,

which is the case considered by Heckman (1979). The only difference is that the coefficients $\delta_k$ must first be estimated by regressing an indicator for $i \in \mathcal{S}$ on $X$, then using the fitted equation within the inverse Mills ratio. This requires that we observe $X$ for $i \notin \mathcal{S}$. Probit regression is covered in a slightly different context below.

As a matter of terminology, the process of estimating $\delta$ is called the “first stage”, and estimating $\beta$ conditional on the estimates of $\delta$ is called the “second stage”. When the coefficient on the inverse Mills ratio is positive, it is said that “positive selection” has occurred, with “negative selection” otherwise. Positive selection means that, without the correction, the estimate of $\beta_1$ would have been upward-biased, while negative selection results in a downward-biased estimate. Finally, because the selection rule is driven by an unobservable variables $\epsilon$, this is a case of “selection on unobservables”. In the next section we consider a case of “selection on observables”.

Example 2: Probit Selection on Observables

Suppose that we wish to know the mean and variance of $Y$ in the population. However, our sample of $Y$ suffers from selection bias. In particular, there is some $X$ such that the probability of observing $Y$ depends on $X$ according to,

$\Pr\left(i\in\mathcal{S}\Big| X_i\right)=F\left(X_i\right)$,

where $F$ is some function with range $[0,1]$. Notice that, if $Y$ and $X$ were independent, then the resulting sample distribution of $Y$ would be a random draw from the population (marginal distribution) of $Y$. Instead, we suppose $\mathrm{Cov}\left(X,Y\right)\neq 0$. For example,

srand(2)
N = 10000
populationData = DataFrame(rand(MvNormal([0,0.],[1 .5;.5 1]),N)')
names!(populationData,[:X,:Y])

mean(array(populationData),1)
1x2 Array{Float64,2}:
-0.0281916  -0.022319

cov(array(populationData))
2x2 Array{Float64,2}:
0.98665   0.500912
0.500912  1.00195

In this simulated population, the estimated mean and variance of $Y$ are -0.022 and 1.002, and the covariance between $X$ and $Y$ is 0.501. Now, suppose the probability that $Y_i$ is observed is a probit regression of $X_i$,

$\Pr\left( i\in\mathcal{S}\Big| X_i \right) = \Phi\left( \beta_0 + \beta_1 X_i \right)$,

where $\Phi$ is the CDF of the standard normal distribution. Letting $D_i=1$ indicate that $i \in \mathcal{S}$, we can generate the sample selection rule $D$ as,

beta_0 = 0
beta_1 = 1
index = (beta_0 + beta_1*data[:X])
probability = cdf(Normal(0,1),index)
D = zeros(N)
for i=1:N
D[i] = rand(Bernoulli(probability[i]))
end
populationData[:D] = D
sampleData = populationData
sampleData[D.==0,:Y] = NA

The sample data has missing values in place of $Y_i$ if $D_i=0$. The estimated mean and variance of $Y$ in the sample data are 0.275 (which is too large) and 0.862 (which is too small).

Correction 2: Inverse Probability Weighting

The reason for the biased estimates of the mean and variance of $Y$ in Example 2 is sample selection on the observable $X$. In particular, certain values of $Y$ are over-represented due to their relationship with $X$. Inverse probability weighting is a way to correct for the over-representation of certain types of individuals, where the “type” is captured by the probability of being included in the sample.

In the above simulation, conditional on appearing in the population, the probability that an individual of type $X_i=1$ is included in the sample is 0.841. By contrast, the probability that an individual of type $X_i=0$ is included in the sample is 0.5, so type $X_i$ is over-represented by a factor of 0.841/0.5 = 1.682. If we could reduce the impact that type $X_i=1$ has in the computation of the mean and variance of $Y$ by a factor of 1.682, we would alter the balance of types in the sample to match the balance of types in the population. Inverse probability weighting generalizes this logic by weighting each individual’s impact by the inverse of the probability that this individual appears in the sample.

Before we can make the correct, we must first estimate the probability of sample inclusion. This can be done by fitting the probit regression above by least-squares. For this, we use the GLM package in Julia, which can be installed the usual way with the command Pkg.add(“GLM”).

using GLM
Probit = glm(D ~ X, sampleData, Binomial(), ProbitLink())
DataFrameRegressionModel{GeneralizedLinearModel,Float64}:

Coefficients:
Estimate Std.Error  z value Pr(>|z|)
(Intercept)  0.114665  0.148809 0.770554   0.4410
X             1.14826   0.21813  5.26414    1e-6

estProb = predict(Probit)
weights = 1./estProb[D.==1]/sum(1./estProb[D.==1])

which are the inverse probability weights needed to match the sample distribution to the population distribution.

Now, we use the inverse probability weights to correct the mean and variance estimates of $Y$,

correctedMean = sum(sampleData[D.==1,:Y].*weight)
-0.024566923132025013

correctedVariance = (N/(N-1))*sum((sampleData[D.==1,:Y]-correctedMean).^2.*weight)
1.0094029613131092

which are very close to the population values of -0.022319 and 1.00195. The logic here extends to the case of multivariate $X$, as more coefficients are added to the Probit regression. The logic also extends to other functional forms of $F$, for example, switching from Probit to Logit is achieved by replacing the ProbitLink() with LogitLink() in the glm() estimation above.

Example 3: Generalized Roy Model

For the final example of this tutorial, we consider a model which allows for rich, realistic economic behavior. In words, the Generalized Roy Model is the economic representation of a world in which each individual must choose between two options, where each option has its own benefits, and one of the options costs more than the other. In math notation, the first alternative, denoted $D_i=1$, relates the outcomes $Y_i$ to the individual’s observable characteristics, $X_i$, by,

$Y_{1,i} = \mu_1\left(X_i\right)+U_{1,i}$.

Similarly, the second alternative, denoted $D_i=0$, relates $Y_i$ to $X_i$, by,

$Y_{0,i} = \mu_0\left(X_i\right)+U_{0,i}$.

The value of $Y_i$ that appears in our sample is thus given by,

$Y_i = D_iY_{1,i} + (1-D_i)Y_{0,i}$.

Finally, the value of $D_i$ is chosen by individual $i$ according to,

$D_i=1$ if $Y_{1,i}-Y_{0,i}-C_i>0$,

where $C_i$ is the cost of choosing the alternative $D_i=1$ and is given by,

$C_i = \mu_C\left(X_i,Z_i\right)+U_{C,i}$,

where $Z$ contains additional characteristics of $i$ that are not included in $X$.

We assume that the data only contains $Y_i,D_i,X_i,Z_i$; it does not contain the variables $Y_{i,1},Y_{i,0},U_{i,1},U_{i,0},U_{i,C}$ or the functions $\mu_1,\mu_0,\mu_C$. Assuming that the three $\mu$ functions follow the linear form and that the unobservables $U$ are independent and Normally distributed, we can simulate the data generating process as,

srand(2)
N = 1000
sampleData = DataFrame(rand(MvNormal([0,0.],[1 .5; .5 1]),N)')
names!(sampleData,[:X,:Z])
U1 = rand(Normal(0,.5),N)
U0 = rand(Normal(0,.7),N)
UC = rand(Normal(0,.9),N)
betas1 = [0,1]
betas0 = [.3,.2]
betasC = [.1,.1,.1]
Y1 = betas1[1] + betas1[2].*sampleData[:X] + U1
Y0 = betas0[1] + betas0[2].*sampleData[:X] + U0
C = betasC[1] + betasC[2].*sampleData[:X] + betasC[3].*sampleData[:Z] + UC
D = Y1-Y0-C.>0
Y = D.*Y1 + (1-D).*Y0
sampleData[:D] = D
sampleData[:Y] = Y

In this simulation, about 38% of individuals choose the alternative $D_i=1$. About 10% of individuals choose $D_i=0$ even though they receive greater benefits under $D_i=1$ due to the high cost $C_i$ associated with $D_i=1$.

Solution 3: Heckman Correction for Generalized Roy Model

The identification of this model is attributable to Heckman and Honore (1990). Estimation proceeds in steps. The first step is to notice that the left- and right-hand terms in the following moment equation motivate a Probit regression:

$\mathbb{E}\left[D_i\Big|X_i,Z_i\right]=\Pr\left(\mu_D\left(X_i,Z_i\right)>U_{D,i}\Big|X_i,Z_i\right)=\Phi\left(\mu_D\left(X_i,Z_i\right)\right)$,

where $U_{D,i}\equiv -\left(U_{1,i}-U_{0,i}-U_{C,i}\right)/\sigma_D\sim\mathcal{N}\left(0,1\right)$ is the negative of the total error term arising in the equation that determines $D_i$ above, $\sigma_D \equiv \sqrt{\mathrm{Var}\left(U_{1,i}-U_{0,i}-U_{C,i}\right)}$, and,

$\mu_D\left(X,Z\right) \equiv \left([1,X]\beta_1 -[1,X]\beta_0 -[1,X,Z]\beta_C\right)/\sigma_D \equiv [1,X,Z]\beta_D$,

In the simulation above, $\beta_D = [-.4,.7,-.1]/\sqrt{.5+.7+.9}\approx[-0.276,0.483,-0.069]$. We can estimate $\beta_D$ from the Probit regression of $D$ on $X$ and $Z$.

3-element Array{Float64,1}:
-0.299096
0.59392
-0.103155

Next, notice that,

$\mathbb{E}\left[Y_i\Big|D_i=1,X_i,Z_i\right] =[1,X_i]\beta_1+\mathbb{E}\left[U_{1,i}\Big|D_i=1,X_i,Z_i\right]$,

where,

$\mathbb{E}\left[U_{1,i}\Big|D_i=1,X_i,Z_i\right] =\mathbb{E}\left[U_{1,i}\Big|\mu_D\left(X_i,Z_i\right)>U_{i,D},X_i,Z_i\right]=\rho_1 \frac{-\phi\left(\mu_D\left(X_i,Z_i\right)\right)}{\Phi\left(\mu_D\left(X_i,Z_i\right)\right)}$,

which is the inverse Mills ratio again, where $\rho_1 \equiv \frac{\mathrm{Cov}\left(U_1,U_D\right)}{\mathrm{Var}\left(U_D\right)}$. Substituting in the estimate for $\beta_D$, we consistently estimate $\beta_1$:

fittedVals = hcat(ones(N),array(sampleData[[:X,:Z]]))*betasD

sampleData[:invMills1] = -pdf(Normal(0,1),fittedVals)./cdf(Normal(0,1),fittedVals)

correctedBetas1 = linreg(array(sampleData[D,[:X,:invMills1]]),vec(array(sampleData[D,[:Y]])))
3-element Array{Float64,1}:
0.0653299
0.973568
-0.135445

To see how well the correction has performed, compare these estimates to the uncorrected estimates of $\beta_1$,

biasedBetas1 = linreg(array(sampleData[D,[:X]]),vec(array(sampleData[D,[:Y]])))
2-element Array{Float64,1}:
0.202169
0.927317

Similar logic allows us to estimate $\beta_0$:

sampleData[:invMills0] = pdf(Normal(0,1),fittedVals)./(1-cdf(Normal(0,1),fittedVals))

correctedBetas0 = linreg(array(sampleData[D.==0,[:X,:invMills0]]),vec(array(sampleData[D.==0,[:Y]])))
3-element Array{Float64,1}:
0.340621
0.207068
0.323793

biasedBetas0 = linreg(array(sampleData[D.==0,[:X]]),vec(array(sampleData[D.==0,[:Y]])))
2-element Array{Float64,1}:
0.548451
0.295698

In summary, we can consistently estimate the benefits associated with each of two alternative choices, even though we only observe each individual in one of the alternatives, subject to heavy selection bias, by extending the logic introduced by Heckman (1979).

# Introductory Example: Ordinary Least Squares (OLS) Estimation in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

In this post, I show in Julia how to perform ordinary least squares (OLS) estimation after first simulating the OLS data generating process (DGP). At the end, we see that the parameter estimates converge to the true parameter as sample size grows large. If you have not yet installed Julia, it takes 5 minutes following these instructions.

As a reminder and to clarify notation, the OLS DGP is,

$Y = X\beta + \epsilon$,

where $Y$ is the $Nx1$ dependent variable, $X$ is the $N\mathrm{x}\tilde{K}$ matrix of independent variables, $\beta$ is the $\tilde{K}\mathrm{x}1$ vector of parameters that we wish to estimate, and $\epsilon$ is the $N\mathrm{x}1$ error satisfying $\epsilon \overset{\mathit{i.i.d.}}{\sim} \mathcal{N}\left(0,\sigma^2\right)$. Because we assume that the first column of $X$ is the constant number 1, we will find it useful below to work with $K \equiv \tilde{K}-1$. The least squares estimator is,

$\hat{\beta} = \left( X^T X \right)^{-1} \left( X^T Y \right)$.

Matrix Algebra and Simulated Random Variables: The OLS DGP

First, we generate the independent variables $X$, then we use $X$ to generate the independent variable $Y$. To begin, create a new file in Julia Studio and save it to your computer. In the file editor (script), insert the following commands:

using Distributions
N=1000
K=3
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]

The using command let Julia know that we will be using the Distributions package. The MvNormal() command initialized a multivariate normal distribution, which is an object including methods such as pdf for the probability distribution function and rand for drawing random variables.  eye(K) means $\mathit{I}_K$, the identity matrix of size $K\mathrm{x}K$. We only told MvNormal the covariance matrix, leaving the mean blank, which Julia assumes means that we would like a zero mean. The distribution of $X$ is arbitrary; we only used multivariate normal for simplicity. At the end, we concatenate the vector of ones to X using brackets, [].

Tip: To ensure that the matrices are of the appropriate dimension, use the size command. Above, we transposed $X$ after finding that rand returned it as a $3\mathrm{x}N$ matrix, when we need it to be $N \mathrm{x} 3$. Misaligned dimensions are one of the most common and frustrating errors to make in writing a program.

Now that we have created $X$ as a matrix containing a column of ones as well as three independent variables, we wish to multiply it by a vector of regression coefficients of length 4 (including the intercept) and add the normally distributed shock, $\epsilon$. For simplicity, we assume $\sigma^2=1$.

genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.1,0.5,-0.3,0.]
Y = X*trueParams + epsilon

Matrix algebra in Julia can be done as in a way comparable to Python, *(A,B), which means AxB, or in the more R-like way we used above, A*B. Then, you can click run (the little green arrow in Julia Studio) and it will perform the operations in the file above. To make sure it worked, you can now go to the Console and type,

julia> mean(Y)

and press Enter. If the code worked correctly, this should return the mean of simulated $Y$, which should be near the true intercept 0.1 (since each of the independent variables has mean zero, the true mean of $Y$ is just the intercept).

Functions in Julia: The OLS Estimator

Functions are defined in Julia using the command function, followed by the desired name of your function, and parentheses containing the arguments of the function. An end statement is required upon completion of the function definition. Indentation is required within the body of the function, and it is a good practice to explicitly include a return statement. The OLS estimator is defined with respect to any particular parameters as follows:

function OLSestimator(y,x)
estimate = inv(x'*x)*(x'*y)
return estimate
end

This function uses the dot product (*) three times, the transpose () twice, and the inverse (inv()) once. This function works for any matrices x and y with the appropriate dimensions. Once you have defined this function by running the file, you can obtain the OLS estimates of the true parameters by typing,

julia> estimates = OLSestimator(Y,X)

Because of the return statement, the parameter estimates will be returned by OLSestimator() and stored in estimates. If you compute estimates in the script, you can print them to the screen using,

println(estimates)

Finally, change the parameter defined at the beginning of your code. When N is small (say, 100), estimates will usually be further from trueParams than when N is large (say 10,000). This should be very easy to change; if you used the print statement for the estimates, just change the value of N, run the code, and see the new estimates printed to the console. In order to make your estimates reproducible (i.e., the exact same draws from the random distributions), set the random seed at the beginning of your code using,

srand(2)

where 2 is a possible seed you could choose.

Results

When I run the code above with random seed 2, I find that,

julia> estimates
4-element Array{Float64,1}:
0.11216
0.476437
-0.290574
0.0108337

so you can use this to check your results.

If you are only trying to estimate OLS, you can use the built-in command linreg(), but do not include the vector of ones, as linreg() will add another vector of ones. This is why I created X_noconstant above, which is just X without a column of ones. The syntax is,

julia> linreg(X_noconstant,Y)
4-element Array{Float64,1}:
0.11216
0.476437
-0.290574
0.0108337

and the estimates are identical to those in estimates.