Belloni et al. (2012, Econometrica) consider the model
$$y_i = \alpha d_i + \varepsilon_i$$
$$d_i = z_i'\delta + u_i$$
where $y_i$ is the dependent variable, $d_i$ is an endogenous regressor and $z_i$ is a $p_z$-dimensional vector of instruments. $p_z$ is allowed to be large and may even exceed the sample size. We refer to $z_i$ as high-dimensional. The interest lies in estimating $\alpha$, the causal effect of the endogenous variable $d_i$ on the outcome variable $y_i$.
The choice and specification of instruments is crucial for the estimation of $\alpha$. However, it is often not clear a priori how to select or specify instruments. Many instruments can arise because many are simply available, and/or because we need to consider a large number of transformations of elementary variables to approximate the complex relationship between the endogenous regressor $d_i$ and the instruments $z_i$.
Belloni et al. suggest applying the lasso with theory-driven penalization to the first-stage equation $d_i = z_i'\delta + u_i$. Under the assumption of (approximate) sparsity, the rigorous lasso (or square-root lasso) can be applied to select appropriate instruments and to predict $d_i$. The prediction $\hat{d}_i = z_i'\hat{\delta}$ is then used as an estimate of the optimal instrument, where $\hat{\delta}$ is either the lasso, square-root lasso, post-lasso or post square-root lasso estimator. Instrument selection using lasso and square-root lasso is implemented in ivlasso.
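As a rough sketch of what this procedure computes, write $y_i$ for the outcome, $d_i$ for the endogenous regressor, $z_i$ for the $p_z$ instruments and $n$ for the sample size. The penalty level $\lambda$ and the penalty loadings $\psi_j$ below are generic placeholders introduced for illustration; their theory-driven values are given in Belloni et al. (2012) and are not reproduced here.

$$\hat{\delta} = \arg\min_{\delta}\; \frac{1}{n}\sum_{i=1}^{n}\left(d_i - z_i'\delta\right)^2 + \frac{\lambda}{n}\sum_{j=1}^{p_z}\psi_j\,|\delta_j|$$
$$\hat{d}_i = z_i'\hat{\delta}, \qquad \hat{\alpha} = \left(\sum_{i=1}^{n}\hat{d}_i\, d_i\right)^{-1}\sum_{i=1}^{n}\hat{d}_i\, y_i$$

With the post-lasso variant, $\hat{\delta}$ is instead obtained by OLS of $d_i$ on the instruments that the lasso selected.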
Next, we consider the case where $d_i$ is exogenous, but there are many control variables:
$$y_i = \alpha d_i + x_i'\beta + \varepsilon_i$$
In this setting, we allow the $p_x$-dimensional vector of controls, $x_i$, to be high-dimensional. The problem the researcher faces is that the “right” set of controls is not known. In traditional practice, this presents her with a difficult choice: use too few controls, or the wrong ones, and omitted variable bias will be present; use too many, and the model will suffer from overfitting.
The post-double-selection (PDS) methodology introduced in Belloni, Chernozhukov and Hansen (2014) uses the lasso estimator to select the controls. Specifically, the lasso is used twice:
1. Estimate a lasso regression with $y_i$ as the dependent variable and the control variables $x_i$ as regressors.
2. Estimate a lasso regression with $d_i$ as the dependent variable and again the control variables $x_i$ as regressors.

The lasso estimator achieves a sparse solution, i.e., most coefficients are set to zero. The final set of control variables to include in the OLS regression of $y_i$ on $d_i$ is the union of the controls selected in steps 1 and 2, hence the name post-double selection for the methodology.
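The two selection steps and the final OLS regression can be sketched in a few lines of dependency-free Python. This is an illustration only, not the package's implementation: the coordinate-descent `lasso`, the helper names, the simulated data and the fixed penalties `lam_y` and `lam_d` are all assumptions made for the example, and the theory-driven penalty used by the rigorous lasso is not reproduced.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso update rule)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*RSS + lam*||b||_1, no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta

def ols(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p + 1)]
    return [A[r][p] / A[r][r] for r in range(p)]

def post_double_selection(y, d, X, lam_y, lam_d):
    """Lasso y on X, lasso d on X, then OLS of y on d and the union of controls."""
    sel_y = {j for j, b in enumerate(lasso(X, y, lam_y)) if b != 0.0}  # step 1
    sel_d = {j for j, b in enumerate(lasso(X, d, lam_d)) if b != 0.0}  # step 2
    selected = sorted(sel_y | sel_d)
    design = [[d[i]] + [X[i][j] for j in selected] for i in range(len(y))]
    return ols(design, y)[0], selected

# Simulated mean-zero data (hypothetical): true effect of d is 1.0,
# control 0 confounds d and y, control 2 affects y only.
random.seed(0)
n, p = 300, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
d = [row[0] + random.gauss(0, 1) for row in X]
y = [1.0 * d[i] + 2.0 * X[i][0] + X[i][2] + random.gauss(0, 1) for i in range(n)]

alpha_hat, selected = post_double_selection(y, d, X, lam_y=0.3, lam_d=0.3)
print(alpha_hat, selected)
```

Taking the union matters because a control that the lasso drops from the outcome equation may still be strongly related to $d_i$; leaving it out of the final regression would reintroduce omitted variable bias.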
The post-regularization or CHS methodology is closely related. Instead of using the lasso-selected controls in a post-regularization OLS estimation, the selected variables are used to construct orthogonalized versions of the dependent variable and the exogenous causal variables of interest. The orthogonalized versions are based either on the lasso or post-lasso estimated coefficients; the post-lasso is OLS applied to lasso-selected variables. See Chernozhukov, Hansen & Spindler (2015) for details.
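One way to write the orthogonalization step, for a single exogenous causal variable, is sketched below. The symbols are assumptions introduced for illustration: $y_i$ is the dependent variable, $d_i$ the causal variable, $x_i$ the controls, and $\hat{\theta}_y$ and $\hat{\theta}_d$ are the lasso or post-lasso coefficients from regressing $y_i$ and $d_i$, respectively, on $x_i$.

$$\tilde{y}_i = y_i - x_i'\hat{\theta}_y, \qquad \tilde{d}_i = d_i - x_i'\hat{\theta}_d$$
$$\hat{\alpha} = \left(\sum_{i=1}^{n}\tilde{d}_i^{\,2}\right)^{-1}\sum_{i=1}^{n}\tilde{d}_i\,\tilde{y}_i$$

The causal effect is then estimated by regressing the orthogonalized outcome on the orthogonalized causal variable.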
The post-double-selection and post-regularization approaches for many controls are implemented in pdslasso.
Many controls and many instruments
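Chernozhukov, Hansen & Spindler (2015) also consider the case where we have both many instruments and many controls:
$$y_i = \alpha d_i + x_i'\beta + \varepsilon_i$$
$$d_i = x_i'\gamma + z_i'\delta + u_i$$
where $p_x > n$ and/or $p_z > n$ are allowed, with $n$ denoting the sample size. The above model can be estimated using ivlasso, which allows for low and/or high-dimensional sets of instruments.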
The pdslasso package implements methods for:
- endogenous and/or exogenous regressors,
- low and high-dimensional instruments,
- low and high-dimensional control variables.