Alex D'Amour
November 25, 2014
Why?
This talk: Find a model that provides the best predictive performance for our given sample size. Note that predictive performance includes estimation uncertainty, bias, and residual variation.
Holy grail: Eliminate high-dimensional nuisance without high-dimensional priors.
Is the truth…
This talk: The latter. There may exist a sparse set of predictors, but no reason to believe that the predictors as collected define the proper basis for such sparsity.
More reading: Liu and Yang, 2009. “Parametric or nonparametric? A parametricness index for model selection.”
Example Let \( X_i \) be a \( p \)-dimensional multivariate normal with covariance matrix \( \Sigma \) defined so that \( \Sigma_{k,l} = \rho^{|k-l|} \), \( 0 < \rho < 1 \).
Consider: \[ Y_i \sim X_{i,2} - \rho X_{i,1} + 0.2 X_{i,p} + \mathcal N(0,3). \]
“True” model includes covariates \( (1,2,p) \). But for any subset \( A \subset \{1,\cdots ,p\} \), \[ Y_i \mid X_{i,A} \sim \beta_A^{\top} X_{i,A} + \mathcal N(0, \sigma_A^2), \] because \( (Y_i,X_i) \) are jointly multivariate normal.
"Truth” only has special status because it has minimal residual variance.
Simulation: \( N = 100 \), \( p=25 \), \( \rho = 0.75 \).
For simplicity, consider only growing models \( A_k = \{1, \cdots, k\} \).
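The setup above can be sketched in a few lines of NumPy. This is my own illustration, not code from the talk; I read \( \mathcal N(0,3) \) as variance 3, and score the growing models with a generic AIC-style criterion as a stand-in for whatever criterion the talk uses.

```python
# Sketch of the simulation on the slide: N = 100, p = 25, rho = 0.75.
# All names here are illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
N, p, rho = 100, 25, 0.75

# AR(1)-style covariance: Sigma[k, l] = rho ** |k - l|
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)

# Y = X_2 - rho * X_1 + 0.2 * X_p + noise (1-indexed on the slide;
# the noise N(0, 3) is interpreted here as variance 3).
Y = X[:, 1] - rho * X[:, 0] + 0.2 * X[:, p - 1] + rng.normal(0.0, np.sqrt(3.0), size=N)

def residual_variance(X_A, Y):
    """MLE of the residual variance from OLS on the columns in X_A."""
    beta, *_ = np.linalg.lstsq(X_A, Y, rcond=None)
    resid = Y - X_A @ beta
    return np.mean(resid ** 2)

# Score the growing models A_k = {1, ..., k} with an AIC-style criterion.
aic = []
for k in range(1, p + 1):
    sigma2_hat = residual_variance(X[:, :k], Y)
    aic.append(N * np.log(sigma2_hat) + 2 * k)

best_k = int(np.argmin(aic)) + 1
print("selected model size:", best_k)
```

Because \( (Y_i, X_i) \) are jointly normal, every truncated model \( A_k \) is correctly specified in the conditional sense; the criterion is only trading residual variance against model size.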
Is it a…
This talk: Design problem!
Estimands mean something.
Separation of selection and inference.
[M]odels become stochastic in an opaque way when their selection is affected by human intervention based on post-hoc considerations such as “in retrospect only one of these two variables should be in the model” or “it turns out the predictive benefit of this variable is too weak to warrant the cost of collecting it.” (Berk et al 2013).
Wasserman's HARNESS
Issues with Data Splitting
Some statisticians are uncomfortable with data splitting. There are two common objections. The first is that the inferences are random: if we repeat the procedure we will get different answers. The second is that it is wasteful. (Wasserman, in response to Lockhart et al.)
Principled Data Splitting
Key idea: Inference is already conditional on \( X \). “Splitting on observables” can be used to improve power, restrict randomization without biasing inference.
Assume \( (Y_i, X_i) \) multivariate normal, as before.
Procedure:
Lemma: Under the multivariate normal model, for fixed sizes \( n_1 \) of the model-selection set and \( n_2 \) of the inference set, the optimal (oracle) splitting policy maximizes the leverage of the points in the inference set with respect to the selected model.
Proof: Linear regression information criteria have the form \[ \mathrm{myIC} = n_1\log \hat \sigma_A^2 + 2g(p_A, n_1) + C, \] where \( g \) is a function of model size and sample size, and \( C \) is a constant shared by all models.
Because of multivariate normality, the residuals for any set \( A \) are mean-zero normal, so \[ n_1 \hat \sigma_A^2 / \sigma^2_A \sim \chi^2_{n_1 - p_A}, \] and hence the expectations of \( \mathrm{myIC} \) do not depend on \( X \).
Meanwhile, the predictive variance over the inference set has the form \[ \operatorname{Var}(\hat Y^{rep}) = X_A^{rep}(X_A^{\top}X_A)^{-1}X_A^{rep\top} \sigma^2_A, \] with trace decreasing in the leverage of the inference set.
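The quantity the lemma refers to is easy to compute for a candidate split. A minimal sketch, assuming the same notation as above; the function and variable names are my own, and the split comparison is only an illustration of the trace computation, not the talk's actual splitting policy.

```python
# Sketch: trace of the predictive variance over the inference set, for a
# given selected model A and a candidate split. Names are illustrative.
import numpy as np

def predictive_variance_trace(X, A, inference_idx, sigma2_A=1.0):
    """sigma2_A * trace( X_rep (X_A^T X_A)^{-1} X_rep^T ), where X_A is the
    selection-set design restricted to columns A, and X_rep is the
    inference-set design on the same columns."""
    selection_idx = np.setdiff1d(np.arange(X.shape[0]), inference_idx)
    X_A = X[np.ix_(selection_idx, A)]
    X_rep = X[np.ix_(inference_idx, A)]
    G = np.linalg.inv(X_A.T @ X_A)   # (X_A^T X_A)^{-1}
    H = X_rep @ G @ X_rep.T          # cross-leverage of the inference points
    return sigma2_A * np.trace(H)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
A = [0, 1, 2]  # hypothetical selected model

# Compare two candidate splits of equal size by this criterion.
split_a = np.arange(50)
split_b = np.arange(50, 100)
print(predictive_variance_trace(X, A, split_a),
      predictive_variance_trace(X, A, split_b))
```

Scanning candidate splits of fixed sizes \( (n_1, n_2) \) by this trace is one way to approximate the oracle policy the lemma describes.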
Achieving the (leverage) oracle:
Relaxed assumptions:
Evaluation:
Cross-pollination:
Goal:
Don't care if:
Achievable by: