Partially linear model with stacking #
Stacking regression is a simple and powerful method for combining predictions from multiple learners. It is available in Stata via the pystacked package (see here). Below is an example with the partially linear model, but stacking can be used with any model supported by ddml.
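To see what pystacked does on its own, it can also be run directly outside of ddml. A minimal sketch, assuming the $Y and $X globals from the earlier example are still in memory:

. pystacked $Y $X || method(ols) || method(lassocv), type(reg)

pystacked fits each learner, estimates the stacking weights from cross-validated predictions, and reports the weights in its output.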
Step 1: Initialization #
Preparation: use the data and globals as above. Use the name m1 for this new estimation to distinguish it from the previous example, which uses the default name m0; this keeps multiple estimations available for comparison. Also specify 5 cross-fitting repetitions.
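(For reference, a sketch of that preparation, following the 401(k) example used throughout the ddml documentation; the control list in $X is illustrative, while Y = net_tfa and D = e401 match the output below.)

. * load the 401(k) pension data used in the previous example (not shown here)
. global Y net_tfa        // outcome: net total financial assets
. global D e401           // treatment: 401(k) eligibility
. global X age inc educ fsize marr twoearn db pira hown  // controls (illustrative)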
. set seed 42
. ddml init partial, kfolds(2) reps(5) mname(m1)
Cross-fitting repetitions
The results of DDML depend on the exact cross-fit fold split. We recommend re-running the (final) model multiple times on different random folds; see the option reps(integer).
Step 2: Add learners #
Add supervised machine learners for estimating conditional expectations. The first learner in the stacked ensemble is OLS. We also use cross-validated lasso, ridge, and two random forests with different settings, whose options we store in the following macros:
. global rflow max_features(5) min_samples_leaf(1) max_samples(.7)
. global rfhigh max_features(5) min_samples_leaf(10) max_samples(.7)
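The opt() settings are passed through to the underlying scikit-learn learner (for method(rf) with type(reg), the random forest regressor), so the macro contents must be valid options for that learner. A quick way to verify what a macro expands to:

. display "$rflow"
max_features(5) min_samples_leaf(1) max_samples(.7)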
In each step, we add the mname(m1) option to ensure that the learners are not added to the m0 model, which is still in memory. We also specify the names of the variables containing the estimated conditional expectations using the learner(varname) option. This avoids overwriting the variables that were created for the m0 model under default naming.
. ddml E[Y|X], mname(m1) learner(Y_m1): pystacked $Y $X || ///
> method(ols) || ///
> method(lassocv) || ///
> method(ridgecv) || ///
> method(rf) opt($rflow) || ///
> method(rf) opt($rfhigh), type(reg)
Learner Y_m1 added successfully.
. ddml E[D|X], mname(m1) learner(D_m1): pystacked $D $X || ///
> method(ols) || ///
> method(lassocv) || ///
> method(ridgecv) || ///
> method(rf) opt($rflow) || ///
> method(rf) opt($rfhigh), type(reg)
Learner D_m1 added successfully.
Options
Note: Options before ":" and after the first comma refer to ddml. Options that come after the final comma refer to the estimation command. Make sure not to confuse the two types of options.
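Schematically, for the first call above (illustrative only; re-running it would attempt to add a duplicate learner):

. ddml E[Y|X], mname(m1) learner(Y_m1): ///  ddml options: before the colon
>     pystacked $Y $X                   ///
>     || method(ols) || method(lassocv) ///  learners, separated by ||
>     , type(reg)                       //   pystacked options: after the final comma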
Check if learners were correctly added (output omitted):
. ddml desc, mname(m1) learners
Step 3/4: Cross-fitting and estimation #
. qui ddml crossfit, mname(m1)
. ddml estimate, mname(m1) robust
DDML estimation results:
spec  r     Y learner     D learner         b        SE
 opt  1          Y_m1          D_m1  7362.283  (937.426)
 opt  2          Y_m1          D_m1  6958.283  (899.946)
 opt  3          Y_m1          D_m1  6531.201  (872.895)
 opt  4          Y_m1          D_m1  6532.662  (952.414)
 opt  5          Y_m1          D_m1  6672.368  (981.239)
opt = minimum MSE specification for that resample.

Mean/med.   Y learner     D learner         b        SE
 mse mn     [min-mse]         [mse]  6811.360  (973.863)
 mse md     [min-mse]         [mse]  6672.368  (962.606)

Median over min-mse specifications

y-E[y|X]  = Y_m1                                  Number of obs   =      9915
D-E[D|X,Z]= D_m1
------------------------------------------------------------------------------
| Robust
net_tfa | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
e401 | 6672.368 962.6062 6.93 0.000 4785.695 8559.042
------------------------------------------------------------------------------
Summary over 5 resamples:
       D eqn       mean        min        p25        p50        p75        max
        e401  6811.3596  6531.2007  6532.6626  6672.3682  6958.2832  7362.2832
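By default, ddml estimate reports the median over the min-MSE specifications, as above. To inspect a single resample instead, the stored results can be replayed; a sketch using the spec() and rep() options (check the ddml help for availability in your version):

. ddml estimate, mname(m1) spec(mse) rep(3) replay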
Examine the learner weights used by pystacked (not shown):
. ddml extract, mname(m1) show(pystacked)
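Because ddml estimate posts standard e(b) and e(V) results, the m0 and m1 estimates can also be stored and compared side by side with Stata's usual machinery (est_m0 and est_m1 are arbitrary names):

. qui ddml estimate, mname(m0) robust
. estimates store est_m0
. qui ddml estimate, mname(m1) robust
. estimates store est_m1
. estimates table est_m0 est_m1, b se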