Stepdown p-values for Multiple Hypothesis Testing in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

To finish yesterday’s tutorial on hypothesis testing with non-parametric p-values in Julia, I show how to perform the bootstrap stepdown correction to p-values for multiple testing, which many economists find intimidating (including me) but is actually not so difficult.

First, I will show the simple case of Holm’s (1979) stepdown procedure. Then, I will build on the bootstrap example to apply one of the most exciting advances in econometric theory from the past decade, the bootstrap step-down procedure, developed by Romano and Wolf (2004) and extended by Chicago professor Azeem Shaikh. These methods allow one to bound the family-wise error rate, which is the probability of making at least one false discovery (i.e., rejecting a null hypothesis when the null hypothesis was true).

Motivation

The standard example to motivate this discussion is to suppose you compute p-values for, say, 10 parameters with independent estimators. If all 10 have p-value equal to 0.05 corresponding to the null hypothesis of each, then we reject the null hypothesis for any one of them with 95% confidence. However, the probability that all of them reject the null hypothesis is $1-\left(1-0.05\right)^{10} \approx 0.4$ (by independence, the joint probability is the product of marginal probabilities), so we only have 60% confidence in rejecting all of them jointly. If we wish to test for all of the hypotheses jointly, we need some way to adjust the p-values to give us greater confidence in jointly rejecting them. Holm’s solution to this problem is to iteratively inflate the individual p-values, so that the smallest p-value is inflated the most, and the largest p-value is not inflated at all (unless one of the smaller p-values is inflated above the largest p-value, then the largest p-value will be inflated to preserve the order of p-values). The specifics of the procedure are below.

The Romano-Wolf-Shaikh approach is to account for the dependence among the estimators, and penalize estimators more as they become more independent. Here’s how I motivate their approach: First, consider the extreme case of two estimators that are identical. Then, if we reject the null hypothesis for the first, there is no reason to penalize the second; it would be very strange to reject the null hypothesis for the first but not the second when they are identical. Second, suppose they are almost perfectly dependent. Then, rejecting the null for one strongly suggests that we should also reject the null for the second, so we do not want to penalize one very strongly for the rejection of the other. But as they become increasingly independent, the rejection of one provides less and less affirmative information about the other, and we approach the case of perfect independence shown above, which receives the greatest penalty. The specifics of the procedure are below.

Holm’s Correction to the p-values

Suppose you have a vector of p-values, $p$, of length $K$, and a desired maximum family-wise error rate, $\alpha$ (the most common choice is $\alpha=0.05$. These are all of the needed ingredients. Holm’s stepdown procedure is as follows:

1. If the smallest element of $p$ is greater than $\frac{\alpha}{(K+1)}$, reject no null hypotheses and exit (do not continue to step 2). Else, reject the null hypothesis corresponding to the smallest element of $p$ (continue to step 2).
2. If the second-smallest element of $p$ is greater than $\frac{\alpha}{(K+1)-1}$, reject no remaining null hypotheses and exit (do not continue to step 3). Else, reject the null hypothesis corresponding to the second-smallest element of $p$ (continue to step 3).
3. If the third-smallest element of $p$ is greater than $\frac{\alpha}{(K+1)-2}$, reject no remaining null hypotheses and exit (do not continue to step 4). Else, reject the null hypothesis corresponding to the third-smallest element of $p$ (continue to step 4).
4. And so on.

We could program this directly as a loop-based function that takes $p,\alpha$ as parameters, but a simpler approach is to compute the p-values equivalent to this procedure for any $\alpha$, which are,

$\tilde{p}_{(k)} =\min\left\{\max_{j:p_{(j)}\leq p_{(k)}} \left\{ (K+1-j)p_{(j)}\right\},1\right\}$,

where $(k)$ means the $k^{\mathit{th}}$ smallest element. This expression is equivalent to the algorithm above in the sense that, for any family-wise error rate $\alpha\in (0,1)$, $\tilde{p}_{(k)}\leq\alpha$ if and only if the corresponding hypothesis is rejected by the algorithm above.

The following code computes these p-values. Julia (apparently) lacks a command that would tell me the index of the rank of the p-values, so my loop below does this, including the handling of ties (when some of the p-values are the same):

function Holm(p)
K = length(p)
sort_index = -ones(K)
sorted_p = sort(p)
for j=1:K
num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
end
holm_p = [sorted_holm_p[sort_index[k]] for k=1:K]
return holm_p
end


This is straight-forward except for sort_index, which I constructed such that, e.g., if the first element of sort_index is 3, then the first element of pvalues is the third smallest. Unfortunately, it arbitrarily breaks ties in favor of the parameter that appears first in the MLE array, so the second entry of sort_index is 1 and the third entry is 2, even though the two corresponding p-values are equal.

Bootstrap Stepdown p-values

Given our work yesterday, it is relatively easy to replace bootstrap marginal p-values with bootstrap stepdown p-values. We begin with samples, as created yesterday. The following code creates tNullDistribution, which is the same as nullDistribution from yesterday except as t-statistics (i.e., divided by standard error).

function stepdown(MLE,bootSamples)
K = length(MLE)
tMLE = MLE
bootstrapSE = std(bootSamples,1)
tNullDistribution = bootSamples
p = -ones(K)
for i=1:K
tMLE[i] = abs(tMLE[i]/bootstrapSE[i])
tNullDistribution[:,i] = abs((tNullDistribution[:,i]-mean(tNullDistribution[:,i]))/bootstrapSE[i])
p[i] = mean(tMLE[i].<tNullDistribution[:,i])
end
sort_index = -ones(K)
stepdown_p = -ones(K)
sorted_p = sort(p)
for j=1:K
num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
end
for k=1:K
current_index = [sort_index.>=k]
stepdown_p[sort_index[k]] = mean(maximum(tNullDistribution[:,sort_index.>=k],2).>tMLE[sort_index[k]])
end
return ["single_pvalues"=>p,"stepdown_pvalues"=>stepdown_p,"Holm_pvalues"=>Holm(p)]
end


The only difference between the single p-values and the stepdown p-values is the use of the maximum t-statistic in the comparison to the null distribution, and the maximum is taken over only the parameter estimates whose p-values have not yet been computed. Notice that I used a dictionary in the return so that I could output single, stepdown, and Holm p-values from the stepdown function.

Results

To test out the above corrected p-values, we return to MLE and samples from yesterday. Suppose we want to perform the two-sided tests simultaneously for the null hypotheses $\beta_0=0,\beta_1=0,\beta_2=0,\beta_3=0$. The results from the above code are,

julia> stepdown(MLE[1:4],samples[:,1:4])
Dict{ASCIIString,Any} with 3 entries:
"Holm_pvalues"     => {0.766,0.766,0.18,0.036}
"stepdown_pvalues" => [0.486,0.649,0.169,0.034]
"single_pvalues"   => [0.486,0.383,0.06,0.009]


We see that the p-value corresponding to the null hypothesis $\beta_2=0$ is inflated by both the bootstrap stepdown and Holm procedures, and is no longer significant at the 0.10 level.

Bootstrapping and Non-parametric p-values in Julia

* The script to reproduce the results of this tutorial in Julia is located here.

Suppose that we wish to test to see if the the parameter estimates of $\beta$ are statistically different from zero and if the estimate of $\sigma^2$ is different from one for the OLS parameters defined in a previous post. Suppose further that we do not know how to compute analytically the standard errors of the MLE parameter estimates; the MLE estimates were presented in the previous post.

We decide to bootstrap by resampling cases in order to estimate the standard errors. This means that we treat the sample of $N$ individuals as if it were a population from which we randomly draw $B$ samples, each of size $N$. This produces a sample of MLEs of size $B$, that is, it provides an empirical approximation to the distribution of the MLE. From the empirical approximation, we can compare the full-sample point MLE to the MLE distribution under the null hypothesis.

To perform bootstrapping, we rely on Julia’s built-in sample function.

Wrapper Functions and the Likelihood of Interest

Now that we have a bootstrap index, we define the log-likelihood as a function of x and y, which are any subsets of X and Y, respectively.

function loglike(rho,y,x)
beta = rho[1:4]
sigma2 = exp(rho[5])
residual = y-*(x,beta)
dist = Normal(0, sqrt(sigma2))
contributions = logpdf(dist,residual)
loglikelihood = sum(contributions)
return -loglikelihood
end


Then, if we wish to evaluate loglike across various subsets of x and y, we use what is called a wrapper, which simply creates a copy of loglike that has already set the values of x and y. For example, the following function will evaluate loglike when x=X and y=Y:

function wrapLoglike(rho)
return loglike(rho,Y,X)
end


We do this because we want the optimizer to find the optimal $\rho$, holding x and y fixed, but we also want to be able to adjust x and y to suit our purposes. The wrapper function allows the user to modify x and y, but tells the optimizer not to bother them.

Tip: Use wrapper functions to manage arguments of your objective function that are not supposed to be accessed by the optimizer. Give the optimizer functions with only one argument — the parameters over which it is supposed to optimize your objective function.

Bootstrapping the OLS MLE

Now, we will use a random index, which is drawn for each b using the sample function, to take a random sample of individuals from the data, feed them into the function using a wrapper, then have the optimizer maximize the wrapper across the parameters. We repeat this process in a loop, so that we obtain the MLE for each subset. The following loop stores the MLE in each row of the matrix samples using 1,000 bootstrap samples of size one-half (M) of the available sample:

B=1000
samples = zeros(B,5)
for b=1:B
theIndex = sample(1:N,N)
x = X[theIndex,:]
y = Y[theIndex,:]
function wrapLoglike(rho)
return loglike(rho,y,x)
end
samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
end
samples[:,5] = exp(samples[:,5])


The resulting matrix contains 1,000 samples of the MLE. As always, we must remember to exponentiate the variance estimates, because they were stored in log-units.

Bootstrapping for Non-parametric p-values

Estimates of the standard errors of the MLE estimates can be obtained by computing the standard deviation of each column,

bootstrapSE = std(samples,1)


where the number 1 indicates that the standard deviation is taken over columns (instead of rows).

Standard errors like these can be used directly for hypothesis testing under parametric assumptions. For example, if we assume an MLE is normally distributed, then we reject the null hypothesis that the parameter is equal to some point if the parameter estimate differs from the point by at least 1.96 standard errors (using that the sample is large). However, we can make fewer assumptions using non-parametric p-values. The following code creates the distribution implied by the null hypothesis that $\beta_0 =0, \beta_1=0, \beta_2=0, \beta_3=0, \sigma^2=1$ by subtracting the mean from each distribution (thus imposing a zero mean) and then adding 1 to the distribution of $\sigma^2$ (thus imposing a mean of one); this is called nullDistribution.

nullDistribution = samples
pvalues = ones(5)
for i=1:5
nullDistribution[:,i] = nullDistribution[:,i]-mean(nullDistribution[:,i])
end
nullDistribution[:,5] = 1 + nullDistribution[:,5]


The non-parametric p-value (for two-sided hypothesis testing) is the fraction of times that the absolute value of the MLE is greater than the absolute value of the null distribution.

pvalues = [mean(abs(MLE[i]).<abs(nullDistribution[:,i])) for i=1:5]


If we are interested in one-sided hypothesis testing, the following code would test the null hypothesis $\beta_0 =0$ against the alternative that $\beta_0>0$:

pvalues = [mean(MLE[i].<nullDistribution[:,i]) for i=1:5]


Conversely, the following code would test the null hypothesis $\beta_0 =0$ against the alternative that $\beta_0<0$:

pvalues = [mean(MLE[i].>nullDistribution[:,i]) for i=1:5]


Thus, two-sided testing uses the absolute value (abs), and one-sided testing only requires that we choose the right comparison operator (.> or .<).

Results
Let the true parameters be,

julia> trueParams = [0.01,0.05,0.05,0.07]


The resulting bootstrap standard errors are,

julia> bootstrapSE = std(samples,1)
1x5 Array{Float64,2}:
0.0308347  0.0311432  0.0313685  0.0305757  0.0208229


and the non-parametric two-sided p-value estimates are,

julia> pvalues = [mean(abs(MLE[i]).<abs(nullDistribution[:,i])) for i=1:5]
5-element Array{Any,1}:
0.486
0.383
0.06
0.009
0.289


Thus, we reject the null hypotheses for the third and fourth parameters only and conclude that $\beta_2 \neq 0, \beta_3 \neq 0$, but find insufficient evidence to reject the null hypotheses that $\beta_0 =0, \beta_1 =0$ and $\sigma^2=1$.