-----------------------------------------------------------------------------------------------------------------------------------------------------------------
help ddml v1.2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Title
ddml -- Stata package for Double Debiased Machine Learning
ddml implements algorithms for causal inference aided by supervised machine learning as proposed in "Double/debiased machine learning for treatment and
structural parameters" (Econometrics Journal, 2018). Five different models are supported, allowing for binary or continuous treatment variables and
endogeneity, high-dimensional controls and/or instrumental variables. ddml supports a variety of different ML programs, including but not limited to
lassopack and pystacked.
The package includes the wrapper program qddml, which uses a simplified one-line syntax, but offers less flexibility.
qddml relies on crossfit, which can be used as a standalone program.
Please check the examples provided at the end of the help file.
Syntax
Estimation with ddml proceeds in four steps.
Step 1. Initialize ddml and select model:
ddml init model [if] [in] [ , mname(name) kfolds(integer) fcluster(varname) foldvar(varlist) reps(integer) norandom tabfold vars(varlist) ]
where model is one of partial, iv, interactive, fiv or interactiveiv; see model descriptions.
Step 2. Add supervised ML programs for estimating conditional expectations:
ddml eq [ , mname(name) vname(varname) learner(varname) vtype(string) predopt(string) ] : command depvar vars [ , cmdopt ]
where, depending on the model chosen in Step 1, eq is one of E[Y|X], E[Y|D,X], E[Y|X,Z], E[D|X], E[D|X,Z] or E[Z|X]. command is a supported supervised ML program
(e.g. pystacked or cvlasso). See supported programs.
Note: Options before ":" and after the first comma refer to ddml. Options that come after the final comma refer to the estimation command.
Step 3. Cross-fitting:
ddml crossfit [ , mname(name) shortstack ]
This step implements the cross-fitting algorithm. Each learner is fitted iteratively on training folds and out-of-sample predicted values are obtained.
Step 4. Estimate causal effects:
ddml estimate [ , mname(name) robust cluster(varname) vce(type) atet ateu trim(real) ]
The ddml estimate command returns treatment effect estimates for all combinations of learners added in Step 2.
Optional. Report/post selected results:
ddml estimate [ , mname(name) spec(integer or string) rep(integer or string) allcombos notable replay ]
Auxiliary sub-programs:
Download latest ddml from Github:
ddml update
Report information about ddml model:
ddml desc [ , mname(name) learners crossfit estimates sample all ]
Export results in csv format:
ddml export [ using filename , mname(name) ]
Retrieve information from ddml:
ddml extract [ object_name , mname(name) show(display_item) ename(name) vname(varname) stata keys key1(string) key2(string) key3(string) subkey1(string)
subkey2(string) ]
display_item can be mse, n or pystacked. ddml stores many internal results on associative arrays. These can be retrieved using the different key options.
See ddml extract for details.
Drop the ddml estimation mname and all associated variables:
ddml drop mname
Report overlap plots (interactive and interactiveiv models only):
ddml overlap [ , mname(name) replist(numlist) pslist(namelist) n(integer) kernel(name) name(name [, replace]) title(string) subtitle(string) lopt0(string)
lopt1(string) ]
One overlap (line) plot of propensity scores is reported for each treatment variable learner; by default, propensity scores for all crossfit samples are
plotted. Overlap plots for the treatment variables are combined using graph combine.
Options
init options Description
-----------------------------------------------------------------------------------------------------------------------------------------------------------
mname(name) name of the DDML model. Allows running multiple DDML models simultaneously. Defaults to m0.
kfolds(integer) number of cross-fitting folds. The default is 5.
fcluster(varname) cluster identifiers for cluster randomization of random folds.
foldvar(varlist) integer variable with user-specified cross-fitting folds (one per cross-fitting repetition).
norandom use observations in existing order instead of randomizing before splitting into folds; if multiple resamples, applies to first
resample only; ignored if user-defined fold variables are provided in foldvar(varlist).
reps(integer) cross-fitting repetitions, i.e., how often the cross-fitting procedure is repeated on randomly generated folds.
tabfold prints a table with frequency of observations by fold.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
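A minimal sketch of supplying your own folds through foldvar(varlist). It assumes that ddml accepts an integer variable with fold identifiers 1 to K and infers the number of folds from it. The variable names u and myfold are hypothetical.

```stata
* Illustration only: build a user-defined fold variable and pass it to ddml.
use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
set seed 42

* Put observations in random order, then deal them into 5 folds of
* (almost) equal size; myfold takes the values 1, 2, ..., 5.
generate double u = runiform()
sort u
generate int myfold = mod(_n - 1, 5) + 1
tabulate myfold

* Assumption: fold identifiers numbered 1..K are accepted as is.
ddml init partial, foldvar(myfold)
```

With several cross-fitting repetitions, one fold variable per repetition would be listed in foldvar().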
Equation options Description
-----------------------------------------------------------------------------------------------------------------------------------------------------------
mname(name) name of the DDML model. Defaults to m0.
vname(varname) name of the dependent variable in the reduced form estimation. This is usually inferred from the command line but is mandatory for
the fiv model.
learner(varname) optional name of the variable to be created.
vtype(string) optional variable type of the variable to be created. Defaults to double. none can be used to leave the type field blank (required
when using ddml with rforest.)
predopt(string) predict option to be used to get predicted values. Typical values could be xb or pr. Default is blank.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Cross-fitting Description
-----------------------------------------------------------------------------------------------------------------------------------------------------------
mname(name) name of the DDML model. Defaults to m0.
shortstack asks for short-stacking to be used. Short-stacking runs constrained non-negative least squares on the cross-fitted predicted values
to obtain a weighted average of several base learners.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Estimation Description
-----------------------------------------------------------------------------------------------------------------------------------------------------------
mname(name) name of the DDML model. Defaults to m0.
spec(integer/string) select specification. This can either be the specification number, mse for minimum-MSE specification (the default) or ss for
short-stacking.
rep(integer/string) select resampling iteration. This can either be the cross-fit repetition number, mn for mean aggregation or md for median
aggregation (the default).
robust report SEs that are robust to the presence of arbitrary heteroskedasticity.
cluster(varname) select cluster-robust variance-covariance estimator, clustering on varname.
vce(type) select variance-covariance estimator, e.g. vce(hc3) or vce(cluster id).
noconstant suppress constant term (partial, iv, fiv models only). Since the residualized outcome and treatment may not be exactly mean-zero in
finite samples, ddml includes the constant by default in the estimation stage of partially linear models.
showconstant display constant term in summary estimation output table (partial, iv, fiv models only).
atet report average treatment effect of the treated (default is ATE).
ateu report average treatment effect of the untreated (default is ATE).
trim(real) trimming of propensity scores for the Interactive and Interactive IV models. The default is 0.01 (that is, values below 0.01 and
above 0.99 are set to 0.01 and 0.99, respectively).
allcombos estimates all possible specifications. By default, only the min-MSE (or short-stacking) specification is estimated and displayed.
replay used in combination with spec() and rep() to display and return estimation results.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Auxiliary Description
-----------------------------------------------------------------------------------------------------------------------------------------------------------
mname(name) name of the DDML model. Defaults to m0.
replist(numlist) (overlap plots) list of crossfitting resamples to plot. Defaults to all.
pslist(namelist) (overlap plots) varnames of propensity scores to plot (excluding the resample number). Defaults to all.
n(integer) (overlap plots) see teffects overlap.
kernel(name) (overlap plots) see teffects overlap.
name(name) (overlap plots) see graph combine.
title(string) (overlap plots) see graph combine.
subtitle(string) (overlap plots) see graph combine.
lopt0(string) (overlap plots) options for line plot of untreated; default is solid/navy; see line.
lopt1(string) (overlap plots) options for line plot of treated; default is short dash/dark orange; see line.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Models
This section provides an overview of supported models.
Throughout we use Y to denote the outcome variable, X to denote confounders, Z to denote instrumental variable(s), and D to denote the treatment
variable(s) of interest.
Partially linear model [partial]
Y = a.D + g(X) + U
D = m(X) + V
where the aim is to estimate a while controlling for X. To this end, we estimate the conditional expectations E[Y|X] and E[D|X] using a supervised machine
learner.
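The following sketch spells out, by hand, the idea behind cross-fitting in the partially linear model. It is an illustration under simplifying assumptions (2 folds, OLS for both conditional expectations, a single sample split), not a reproduction of ddml's internal code. The variable names u, fold, Yhat, Dhat, Yres and Dres are hypothetical.

```stata
* Hand-rolled cross-fitting for the partially linear model (illustration only).
use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
local X tw age inc fsize educ db marr twoearn pira hown
set seed 42

* Randomly assign each observation to one of K folds.
local K = 2
generate double u = runiform()
sort u
generate int fold = mod(_n - 1, `K') + 1

* Out-of-sample predictions of E[Y|X] and E[D|X].
generate double Yhat = .
generate double Dhat = .
forvalues k = 1/`K' {
    * Fit on all folds except k, keep the predictions for fold k only.
    quietly regress net_tfa `X' if fold != `k'
    quietly predict double tmp, xb
    quietly replace Yhat = tmp if fold == `k'
    drop tmp
    quietly regress e401 `X' if fold != `k'
    quietly predict double tmp, xb
    quietly replace Dhat = tmp if fold == `k'
    drop tmp
}

* Residualize outcome and treatment; the slope on Dres estimates a.
generate double Yres = net_tfa - Yhat
generate double Dres = e401 - Dhat
regress Yres Dres, vce(robust)
```

Because the fold assignment here differs from the one ddml draws, the point estimate will generally not match ddml output exactly.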
Interactive model [interactive]
Y = g(X,D) + U
D = m(X) + V
which relaxes the assumption that X and D are separable. D is a binary treatment variable. We estimate the conditional expectations E[D|X], as well as
E[Y|X,D=0] and E[Y|X,D=1] (jointly added using ddml E[Y|X,D]).
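As a rough illustration of how the three conditional expectations combine into an average treatment effect, the sketch below cross-fits them by hand and averages the doubly robust score described in Chernozhukov et al. (2018). It assumes OLS and logit as learners, 5 folds, one sample split, and propensity scores trimmed to [0.01, 0.99]. The variable names (u, fold, g0, g1, m, psi) are hypothetical, and the code is not ddml's internal implementation.

```stata
* Hand-rolled ATE for the interactive model (illustration only).
webuse cattaneo2, clear
local X prenatal1 mmarried fbaby mage medu
set seed 42

* Random fold assignment.
local K = 5
generate double u = runiform()
sort u
generate int fold = mod(_n - 1, `K') + 1

generate double g0 = .   // cross-fitted E[Y|X,D=0]
generate double g1 = .   // cross-fitted E[Y|X,D=1]
generate double m  = .   // cross-fitted E[D|X]
forvalues k = 1/`K' {
    quietly regress bweight `X' if mbsmoke == 0 & fold != `k'
    quietly predict double tmp, xb
    quietly replace g0 = tmp if fold == `k'
    drop tmp
    quietly regress bweight `X' if mbsmoke == 1 & fold != `k'
    quietly predict double tmp, xb
    quietly replace g1 = tmp if fold == `k'
    drop tmp
    quietly logit mbsmoke `X' if fold != `k'
    quietly predict double tmp, pr
    quietly replace m = tmp if fold == `k'
    drop tmp
}

* Trim propensity scores to [0.01, 0.99].
replace m = min(max(m, 0.01), 0.99)

* Doubly robust score; its sample mean is the ATE estimate.
generate double psi = g1 - g0 + mbsmoke*(bweight - g1)/m ///
    - (1 - mbsmoke)*(bweight - g0)/(1 - m)
mean psi
```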
Partially linear IV model [iv]
Y = a.D + g(X) + U
Z = m(X) + V
where the aim is to estimate a. We estimate the conditional expectations E[Y|X], E[D|X] and E[Z|X] using a supervised machine learner.
Interactive IV model [interactiveiv]
Y = g(Z,X) + U
D = h(Z,X) + V
Z = m(X) + E
where the aim is to estimate the local average treatment effect. We estimate, using a supervised machine learner, the following conditional expectations:
E[Y|X,Z=0] and E[Y|X,Z=1] (jointly added using ddml E[Y|X,Z]); E[D|X,Z=0] and E[D|X,Z=1] (jointly added using ddml E[D|X,Z]); E[Z|X].
Flexible Partially Linear IV model [fiv]
Y = a.D + g(X) + U
D = m(Z) + g(X) + V
where the estimand of interest is a. We estimate the conditional expectations E[Y|X], E[D^|X] and D^:=E[D|Z,X] using a supervised machine learner. The
instrument is then formed as D^-E^[D^|X] where E^[D^|X] denotes the estimate of E[D^|X].
Note: "{D}" is a placeholder that is used because the last step (estimation of E[D|X]) uses the fitted values from estimating E[D|X,Z]. Please see the
example section below.
Compatible programs
ddml is compatible with a large set of user-written Stata commands. It has been tested with
- lassopack for regularized regression (see lasso2, cvlasso, rlasso).
- the pystacked package (see pystacked). Note that pystacked requires Stata 16.
- rforest by Zou & Schonlau. Note that rforest requires the option vtype(none).
- svmachines by Guenther & Schonlau.
Beyond these, it is compatible with any Stata program that
- uses the standard "reg y x" syntax,
- supports if-conditions,
- and comes with predict post-estimation programs.
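As a sketch of these three requirements in practice, the example below uses probit, which follows the standard syntax, supports if-conditions and comes with a predict command. It assumes that predopt(pr) is passed through to predict as described in the options table. The choice of learners is for illustration only.

```stata
* Illustration only: an official Stata estimation command used as a ddml learner.
webuse cattaneo2, clear
global Y bweight
global D mbsmoke
global X prenatal1 mmarried fbaby mage medu
set seed 42

ddml init interactive, kfolds(5)
ddml E[Y|X,D]: reg $Y $X
* predopt(pr) requests predicted probabilities from predict after probit.
ddml E[D|X], predopt(pr): probit $D $X
ddml crossfit
ddml estimate
```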
Examples
Below we demonstrate the use of ddml for each of the 5 models supported. Note that estimation models are chosen for demonstration purposes only and kept
simple to allow you to run the code quickly.
Partially linear model I.
Preparation: we load the data, define global macros and set the seed.
. use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
. global Y net_tfa
. global D e401
. global X tw age inc fsize educ db marr twoearn pira hown
. set seed 42
We next initialize the ddml estimation and select the model. partial refers to the partially linear model. The model will be stored on a Mata object with
the default name "m0" unless otherwise specified using the mname(name) option.
Note that we set the number of random folds to 2, so that the model runs quickly. The default is kfolds(5). We recommend considering at least 5-10 folds
and even more if your sample size is small.
Note also that we recommend re-running the model multiple times on different random folds; see option reps(integer).
. ddml init partial, kfolds(2)
We add supervised machine learners for estimating the conditional expectation E[Y|X]. We first add simple linear regression.
. ddml E[Y|X]: reg $Y $X
We can add more than one learner per reduced form equation. Here, we add a random forest learner. We do this using pystacked. In the next example we show
how to use pystacked to stack multiple learners, but here we use it to implement a single learner.
. ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)
We do the same for the conditional expectation E[D|X].
. ddml E[D|X]: reg $D $X
. ddml E[D|X]: pystacked $D $X, type(reg) method(rf)
Optionally, you can check if the learners have been added correctly.
. ddml desc
Cross-fitting. The learners are iteratively fitted on the training data. This step may take a while.
. ddml crossfit
Finally, we estimate the coefficients of interest. Since we added two learners for each of our two reduced form equations, there are four possible
specifications. By default, the result shown corresponds to the specification with the lowest out-of-sample MSPE:
. ddml estimate, robust
To estimate all four specifications, we use the allcombos option:
. ddml estimate, robust allcombos
After having estimated all specifications, we can retrieve specific results. Here we use the specification relying on OLS for estimating both E[Y|X]
and E[D|X]:
. ddml estimate, robust spec(1) replay
You could manually retrieve the same point estimate by typing:
. reg Y1_reg D1_reg, robust
or graphically:
. twoway (scatter Y1_reg D1_reg) (lfit Y1_reg D1_reg)
where Y1_reg and D1_reg are the orthogonalized versions of net_tfa and e401.
To describe the ddml model setup or results in detail, you can use ddml describe with the relevant option (sample, learners, crossfit, estimates), or just
describe them all with the all option:
. ddml describe, all
Partially linear model II. Stacking regression using pystacked.
Stacking regression is a simple and powerful method for combining predictions from multiple learners. It is available in Stata via the pystacked package.
Below is an example with the partially linear model, but it can be used with any model supported by ddml.
Preparation: use the data and globals as above. Use the name m1 for this new estimation, to distinguish it from the previous example that uses the default
name m0. This enables having multiple estimations available for comparison. Also specify 5 resamplings.
. set seed 42
. ddml init partial, kfolds(2) reps(5) mname(m1)
Add supervised machine learners for estimating conditional expectations. The first learner in the stacked ensemble is OLS. We also use cross-validated
lasso, ridge and two random forests with different settings, which we save in the following macros:
. global rflow max_features(5) min_samples_leaf(1) max_samples(.7)
. global rfhigh max_features(5) min_samples_leaf(10) max_samples(.7)
In each step, we add the mname(m1) option to ensure that the learners are not added to the m0 model which is still in memory. We also specify the names of
the variables containing the estimated conditional expectations using the learner(varname) option. This avoids overwriting the variables created for the
m0 model using default naming.
. ddml E[Y|X], mname(m1) learner(Y_m1): pystacked $Y $X || method(ols) || method(lassocv) || method(ridgecv) || method(rf) opt($rflow) || method(rf)
opt($rfhigh), type(reg)
. ddml E[D|X], mname(m1) learner(D_m1): pystacked $D $X || method(ols) || method(lassocv) || method(ridgecv) || method(rf) opt($rflow) || method(rf)
opt($rfhigh), type(reg)
Note: Options before ":" and after the first comma refer to ddml. Options that come after the final comma refer to the estimation command. Make sure to
not confuse the two types of options.
Check if learners were correctly added:
. ddml desc, mname(m1) learners
Cross-fitting and estimation.
. ddml crossfit, mname(m1)
. ddml estimate, mname(m1) robust
Examine the stacking weights and MSEs reported by pystacked.
. ddml extract, mname(m1) show(pystacked)
. ddml extract, mname(m1) show(mse)
We can compare the effects with the first ddml model (if you have run the first example above).
. ddml estimate, mname(m0) replay
Partially linear model III. Multiple treatments.
We can also run the partially linear model with multiple treatments. In this simple example, we estimate the effect of both 401k eligibility e401 and
education educ. Note that we remove educ from the set of controls.
. use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
. global Y net_tfa
. global D1 e401
. global D2 educ
. global X tw age inc fsize db marr twoearn pira hown
. set seed 42
Initialize the model.
. ddml init partial, kfolds(2)
Add learners. Note that we add learners with both $D1 and $D2 as the dependent variable.
. ddml E[Y|X]: reg $Y $X
. ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)
. ddml E[D|X]: reg $D1 $X
. ddml E[D|X]: pystacked $D1 $X, type(reg) method(rf)
. ddml E[D|X]: reg $D2 $X
. ddml E[D|X]: pystacked $D2 $X, type(reg) method(rf)
Cross-fitting.
. ddml crossfit
Estimation.
. ddml estimate, robust
Partially linear IV model.
Preparation: we load the data, define global macros and set the seed.
. use https://statalasso.github.io/dta/AJR.dta, clear
. global Y logpgp95
. global D avexpr
. global Z logem4
. global X lat_abst edes1975 avelf temp* humid* steplow-oilres
. set seed 42
We initialize the model. Since the data set is very small, we consider 30 cross-fitting folds.
. ddml init iv, kfolds(30)
The partially linear IV model has three conditional expectations: E[Y|X], E[D|X] and E[Z|X]. For each reduced form equation, we add two learners: regress
and rforest.
We need to add the option vtype(none) for rforest to work with ddml since rforest's predict command doesn't support variable types.
. ddml E[Y|X]: reg $Y $X
. ddml E[Y|X], vtype(none): rforest $Y $X, type(reg)
. ddml E[D|X]: reg $D $X
. ddml E[D|X], vtype(none): rforest $D $X, type(reg)
. ddml E[Z|X]: reg $Z $X
. ddml E[Z|X], vtype(none): rforest $Z $X, type(reg)
Cross-fitting and estimation. We use the shortstack option to combine the base learners. Short-stacking is a computationally cheaper alternative to
stacking. Whereas stacking relies on cross-validated predicted values to obtain the relative weights for the base learners, short-stacking uses the
cross-fitted predicted values.
. ddml crossfit, shortstack
. ddml estimate, robust
If you are curious about what ddml does in the background:
. ddml estimate, allcombos spec(8) rep(1) robust
. ivreg Y2_rf (D2_rf = Z2_rf), robust
Interactive model--ATE and ATET estimation.
Preparation: we load the data, define global macros and set the seed.
. webuse cattaneo2, clear
. global Y bweight
. global D mbsmoke
. global X prenatal1 mmarried fbaby mage medu
. set seed 42
We use 5 folds and 5 resamplings; that is, we estimate the model 5 times using randomly chosen folds.
. ddml init interactive, kfolds(5) reps(5)
We need to estimate the conditional expectations of E[Y|X,D=0], E[Y|X,D=1] and E[D|X]. The first two conditional expectations are added jointly.
We consider two supervised learners: linear regression and gradient boosted trees, stacked using pystacked. Note that we use gradient boosted regression
trees for E[Y|X,D], but gradient boosted classification trees for E[D|X].
. ddml E[Y|X,D]: pystacked $Y $X, type(reg) methods(ols gradboost)
. ddml E[D|X]: pystacked $D $X, type(class) methods(logit gradboost)
Cross-fitting:
. ddml crossfit
In the final estimation step, we can estimate the average treatment effect (the default), the average treatment effect of the treated (atet), or the
average treatment effect of the untreated (ateu).
. ddml estimate
. ddml estimate, atet
Recall that we have specified 5 resampling iterations (reps(5)). By default, the median over the minimum-MSE specification per resampling iteration is
shown. At the bottom, a table of summary statistics over resampling iterations is shown.
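The aggregation over resampling iterations and the trimming threshold can be changed at the estimation stage. The lines below are a sketch that assumes the interactive model above has been cross-fitted and that rep() and trim() behave as described in the options table. The threshold 0.05 is an arbitrary value chosen for illustration.

```stata
* Assumes the interactive model above (kfolds(5) reps(5)) is in memory and cross-fitted.
ddml estimate, rep(mn)       // mean instead of median aggregation
ddml estimate, rep(1)        // results for the first cross-fit repetition only
ddml estimate, trim(0.05)    // set propensity scores outside [0.05, 0.95] to the bounds
```

Comparing results across repetitions and trimming thresholds gives a rough sense of how sensitive the estimate is to the random fold split and to extreme propensity scores.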
To estimate using the same two base learners but with short-stacking instead of stacking, we would enter the learners separately and use the shortstack
option:
. set seed 42
. ddml init interactive, kfolds(5) reps(5)
. ddml E[Y|X,D]: reg $Y $X
. ddml E[Y|X,D]: pystacked $Y $X, type(reg) method(gradboost)
. ddml E[D|X]: logit $D $X
. ddml E[D|X]: pystacked $D $X, type(class) method(gradboost)
. ddml crossfit, shortstack
. ddml estimate
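After estimation, the auxiliary sub-programs can be used to inspect the model. The sketch below assumes the short-stacked interactive model above is still in memory under the default name m0. The file name ddml_interactive.csv is hypothetical.

```stata
* Assumes the interactive model above is in memory as m0.
ddml overlap                              // overlap plots of the propensity scores
ddml extract, show(mse)                   // MSEs of the learners
ddml export using ddml_interactive.csv    // write results to a csv file
```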
Interactive IV model--LATE estimation.
Preparation: we load the data, define global macros and set the seed.
. use http://fmwww.bc.edu/repec/bocode/j/jtpa.dta, clear
. global Y earnings
. global D training
. global Z assignmt
. global X sex age married black hispanic
. set seed 42
We initialize the model.
. ddml init interactiveiv, kfolds(5)
We use stacking (implemented in pystacked) with two base learners for each reduced form equation.
. ddml E[Y|X,Z]: pystacked $Y c.($X)##c.($X), type(reg) m(ols lassocv)
. ddml E[D|X,Z]: pystacked $D c.($X)##c.($X), type(class) m(logit lassocv)
. ddml E[Z|X]: pystacked $Z c.($X)##c.($X), type(class) m(logit lassocv)
Cross-fitting and estimation.
. ddml crossfit
. ddml estimate, robust
To short-stack instead of stack:
. set seed 42
. ddml init interactiveiv, kfolds(5)
. ddml E[Y|X,Z]: reg $Y $X
. ddml E[Y|X,Z]: pystacked $Y c.($X)##c.($X), type(reg) m(lassocv)
. ddml E[D|X,Z]: logit $D $X
. ddml E[D|X,Z]: pystacked $D c.($X)##c.($X), type(class) m(lassocv)
. ddml E[Z|X]: logit $Z $X
. ddml E[Z|X]: pystacked $Z c.($X)##c.($X), type(class) m(lassocv)
Cross-fitting and estimation.
. ddml crossfit, shortstack
. ddml estimate, robust
Flexible Partially Linear IV model.
Preparation: we load the data, define global macros and set the seed.
. use https://github.com/aahrens1/ddml/raw/master/data/BLP.dta, clear
. global Y share
. global D price
. global X hpwt air mpd space
. global Z sum*
. set seed 42
We initialize the model.
. ddml init fiv
We add learners for E[Y|X] in the usual way.
. ddml E[Y|X]: reg $Y $X
. ddml E[Y|X]: pystacked $Y $X, type(reg)
There are some peculiarities that we need to bear in mind when adding learners for E[D|Z,X] and E[D|X]. The reason for this is that the estimation of
E[D|X] depends on the estimation of E[D|X,Z]. More precisely, we first obtain the fitted values D^=E[D|X,Z] and fit these against X to estimate E[D^|X].
When adding learners for E[D|Z,X], we need to provide a name for each learner using learner(name).
. ddml E[D|Z,X], learner(Dhat_reg): reg $D $X $Z
. ddml E[D|Z,X], learner(Dhat_pystacked): pystacked $D $X $Z, type(reg)
When adding learners for E[D|X], we explicitly refer to the learner from the previous step (e.g., learner(Dhat_reg)) and also provide the name of the
treatment variable (vname($D)). Finally, we use the placeholder {D} in place of the dependent variable.
. ddml E[D|X], learner(Dhat_reg) vname($D): reg {D} $X
. ddml E[D|X], learner(Dhat_pystacked) vname($D): pystacked {D} $X, type(reg)
That's it. Now we can move to cross-fitting and estimation.
. ddml crossfit
. ddml estimate, robust
If you are curious about what ddml does in the background:
. ddml estimate, allcombos spec(8) rep(1) robust
. gen Dtilde = $D - Dhat_pystacked_h_1
. gen Zopt = Dhat_pystacked_1 - Dhat_pystacked_h_1
. ivreg Y2_pystacked_1 (Dtilde=Zopt), robust
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and
structural parameters. The Econometrics Journal, 21: C1-C68. https://doi.org/10.1111/ectj.12097
Installation
To get the latest stable version of ddml from our website, check the installation instructions at https://statalasso.github.io/installation/. We update
the stable website version more frequently than the SSC version.
To verify that ddml is correctly installed, type whichpkg ddml (which requires whichpkg to be installed; ssc install whichpkg).
Authors
Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
achim.ahrens@gess.ethz.ch
Christian B. Hansen, University of Chicago, USA
Christian.Hansen@chicagobooth.edu
Mark E Schaffer, Heriot-Watt University, UK
m.e.schaffer@hw.ac.uk
Thomas Wiemann, University of Chicago, USA
wiemann@uchicago.edu
Also see (if installed)
Help: lasso2, cvlasso, rlasso, ivlasso, pdslasso, pystacked.