ddml help file

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
help ddml                                                                                                                                                    v1.2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Title

    ddml --       Stata package for Double Debiased Machine Learning

    ddml implements algorithms for causal inference aided by supervised machine learning as proposed in Double/debiased machine learning for treatment and
    structural parameters (Econometrics Journal, 2018). Five different models are supported, allowing for binary or continuous treatment variables and
    endogeneity, high-dimensional controls and/or instrumental variables.  ddml supports a variety of different ML programs, including but not limited to 
    lassopack and pystacked.

    The package includes the wrapper program qddml, which uses a simplified one-line syntax, but offers less flexibility.

    qddml relies on crossfit, which can be used as a standalone program.

    Please check the examples provided at the end of the help file.

Syntax

    Estimation with ddml proceeds in four steps.

    Step 1. Initialize ddml and select model:

        ddml init model [if] [in] [ , mname(name) kfolds(integer) fcluster(varname) foldvar(varlist) reps(integer) norandom tabfold vars(varlist) ]

    where model is one of partial, iv, interactive, fiv or interactiveiv; see model descriptions.

    Step 2. Add supervised ML programs for estimating conditional expectations:

        ddml eq [ , mname(name) vname(varname) learner(varname) vtype(string) predopt(string) ] :  command depvar vars [ , cmdopt ]

    where, depending on the model chosen in Step 1, eq is one of E[Y|X], E[Y|D,X], E[Y|X,Z], E[D|X], E[D|X,Z] or E[Z|X].  command is a supported supervised ML program
    (e.g. pystacked or cvlasso).  See supported programs.

    Note: Options before ":" and after the first comma refer to ddml.  Options that come after the final comma refer to the estimation command.

    Step 3. Cross-fitting:

        ddml crossfit [ , mname(name) shortstack ]

    This step implements the cross-fitting algorithm. Each learner is fitted iteratively on training folds and out-of-sample predicted values are obtained.

    Step 4. Estimate causal effects:

        ddml estimate [ , mname(name) robust cluster(varname) vce(type) atet ateu trim(real) ]

    The ddml estimate command returns treatment effect estimates for all combinations of learners added in Step 2.

    Optional. Report/post selected results:

        ddml estimate [ , mname(name) spec(integer or string) rep(integer or string) allcombos notable replay  ]

    Auxiliary sub-programs:

    Download latest ddml from Github:

        ddml update

    Report information about ddml model:

        ddml desc [ , mname(name) learners crossfit estimates sample all ]

    Export results in csv format:

        ddml export [ using filename , mname(name) ]

    Retrieve information from ddml:

        ddml extract [ object_name , mname(name) show(display_item) ename(name) vname(varname) stata keys key1(string) key2(string) key3(string) subkey1(string)
              subkey2(string) ]

    display_item can be mse, n or pystacked.  ddml stores many internal results in associative arrays.  These can be retrieved using the different key options.
    See ddml extract for details.

    Drop the ddml estimation mname and all associated variables:

        ddml drop mname

    Report overlap plots (interactive and interactiveiv models only):

        ddml overlap [ , mname(name) replist(numlist) pslist(namelist) n(integer) kernel(name) name(name [, replace]) title(string) subtitle(string) lopt0(string)
              lopt1(string) ]

    One overlap (line) plot of propensity scores is reported for each treatment variable learner; by default, propensity scores for all crossfit samples are
    plotted.  Overlap plots for the treatment variables are combined using graph combine.

Options

    init options          Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    mname(name)            name of the DDML model. Allows multiple DDML models to be run simultaneously. Defaults to m0.
    kfolds(integer)        number of cross-fitting folds. The default is 5.
    fcluster(varname)      cluster identifiers for cluster randomization of random folds.
    foldvar(varlist)       integer variable with user-specified cross-fitting folds (one per cross-fitting repetition).
    norandom               use observations in existing order instead of randomizing before splitting into folds; if multiple resamples, applies to first
                            resample only; ignored if user-defined fold variables are provided in foldvar(varlist).
    reps(integer)          cross-fitting repetitions, i.e., how often the cross-fitting procedure is repeated on randomly generated folds.
    tabfold                prints a table with frequency of observations by fold.
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
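
    To make the idea behind fcluster(varname) concrete, the sketch below assigns folds at the cluster level using only built-in Stata commands, so that all
    observations of a cluster end up in the same fold. The data are simulated and the names id, u, crank and fold are hypothetical; this is an illustration
    of cluster-level fold assignment under these assumptions, not ddml's internal randomization.

```stata
* Simulated data (illustration only): 50 clusters with 4 observations each
clear
set seed 42
set obs 200
gen int id = ceil(_n/4)

* Draw one random number per cluster and copy it to all cluster members
bysort id: gen double u = runiform() if _n == 1
bysort id (u): replace u = u[1]

* Rank clusters by their draw and deal them into K = 5 folds
egen long crank = group(u)
gen int fold = mod(crank, 5) + 1

* Every cluster must sit entirely inside one fold
bysort id (fold): assert fold[1] == fold[_N]

* Frequency of observations by fold (similar in spirit to tabfold)
tabulate fold
```

    A fold variable built this way could, in principle, be passed on via foldvar(varlist) instead of letting ddml draw the folds.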

    Equation options      Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    mname(name)            name of the DDML model. Defaults to m0.
    vname(varname)         name of the dependent variable in the reduced form estimation.  This is usually inferred from the command line but is mandatory for
                            the fiv model.
    learner(varname)       optional name of the variable to be created.
    vtype(string)          optional variable type of the variable to be created. Defaults to double.  none can be used to leave the type field blank (required
                            when using ddml with rforest.)
    predopt(string)        predict option to be used to get predicted values.  Typical values could be xb or pr. Default is blank.
    -----------------------------------------------------------------------------------------------------------------------------------------------------------

    Cross-fitting         Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    mname(name)            name of the DDML model. Defaults to m0.
    shortstack             asks for short-stacking to be used.  Short-stacking runs constrained non-negative least squares on the cross-fitted predicted values
                            to obtain a weighted average of several base learners.
    -----------------------------------------------------------------------------------------------------------------------------------------------------------

    Estimation            Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    mname(name)            name of the DDML model. Defaults to m0.
    spec(integer/string)   select specification. This can either be the specification number, mse for minimum-MSE specification (the default) or ss for
                            short-stacking.
    rep(integer/string)    select resampling iteration. This can either be the cross-fit repetition number, mn for mean aggregation or md for median
                            aggregation (the default).
    robust                 report SEs that are robust to the presence of arbitrary heteroskedasticity.
    cluster(varname)       select cluster-robust variance-covariance estimator.
    vce(type)              select variance-covariance estimator, e.g. vce(hc3) or vce(cluster id).
    noconstant             suppress constant term (partial, iv, fiv models only). Since the residualized outcome and treatment may not be exactly mean-zero in
                            finite samples, ddml includes the constant by default in the estimation stage of partially linear models.
    showconstant           display constant term in summary estimation output table (partial, iv, fiv models only).
    atet                   report average treatment effect on the treated (default is ATE).
    ateu                   report average treatment effect on the untreated (default is ATE).
    trim(real)             trimming of propensity scores for the Interactive and Interactive IV models. The default is 0.01 (that is, values below 0.01 and
                            above 0.99 are set to 0.01 and 0.99, respectively).
    allcombos              estimates all possible specifications. By default, only the min-MSE (or short-stacking) specification is estimated and displayed.
    replay                 used in combination with spec() and rep() to display and return estimation results.
    -----------------------------------------------------------------------------------------------------------------------------------------------------------

    Auxiliary             Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    mname(name)            name of the DDML model. Defaults to m0.
    replist(numlist)       (overlap plots) list of crossfitting resamples to plot. Defaults to all.
    pslist(namelist)       (overlap plots) varnames of propensity scores to plot (excluding the resample number). Defaults to all.
    n(integer)             (overlap plots) see teffects overlap.
    kernel(name)           (overlap plots) see teffects overlap.
    name(name)             (overlap plots) see graph combine.
    title(string)          (overlap plots) see graph combine.
    subtitle(string)       (overlap plots) see graph combine.
    lopt0(string)          (overlap plots) options for line plot of untreated; default is solid/navy; see line.
    lopt1(string)          (overlap plots) options for line plot of treated; default is short dash/dark orange; see line.
    -----------------------------------------------------------------------------------------------------------------------------------------------------------


Models

    This section provides an overview of supported models.

    Throughout we use Y to denote the outcome variable, X to denote confounders, Z to denote instrumental variable(s), and D to denote the treatment
    variable(s) of interest.

    Partially linear model [partial]

        Y = a.D + g(X) + U
        D = m(X) + V

    where the aim is to estimate a while controlling for X. To this end, we estimate the conditional expectations E[Y|X] and E[D|X] using a supervised machine
    learner.
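
    The logic of cross-fitting and partialling-out in the partial model can be sketched with built-in Stata commands only. The example below uses simulated
    data and OLS as the single learner; the variable names (y, d, x1, x2, fold, yhat, dhat) and the data-generating process are assumptions made for
    illustration, and the code is not ddml's internal implementation.

```stata
* Simulated data (illustration only): the true effect of d on y is 0.5
clear
set seed 42
set obs 1000
gen double x1 = rnormal()
gen double x2 = rnormal()
gen double d  = 0.8*x1 - 0.4*x2 + rnormal()
gen double y  = 0.5*d + x1 + x2 + rnormal()

* Randomly assign each observation to one of K folds
local K = 5
gen double u = runiform()
sort u
gen int fold = mod(_n, `K') + 1

* Cross-fitting: fit on all other folds, predict on the held-out fold
gen double yhat = .
gen double dhat = .
forvalues k = 1/`K' {
    quietly regress y x1 x2 if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace yhat = tmp if fold == `k'
    drop tmp
    quietly regress d x1 x2 if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace dhat = tmp if fold == `k'
    drop tmp
}

* Orthogonalize, then regress residualized outcome on residualized treatment
gen double yres = y - yhat
gen double dres = d - dhat
regress yres dres, vce(robust)
```

    The coefficient on dres is the estimate of a. With ddml, the two regress calls inside the loop would be replaced by whichever learners were added for
    E[Y|X] and E[D|X].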

    Interactive model [interactive]

        Y = g(X,D) + U
        D = m(X) + V

    which relaxes the assumption that X and D are separable.  D is a binary treatment variable.  We estimate the conditional expectations E[D|X], as well as
    E[Y|X,D=0] and E[Y|X,D=1] (jointly added using ddml E[Y|X,D]).
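
    As a rough illustration of how the three conditional expectations enter the ATE estimate, the sketch below cross-fits them on simulated data with
    built-in commands (OLS and logit as stand-in learners), trims the propensity scores to [0.01, 0.99] and averages the doubly robust score of Chernozhukov
    et al. (2018). All variable names and the data-generating process are assumptions for illustration; ddml's own implementation may differ in its details.

```stata
* Simulated data (illustration only): the true ATE is 1
clear
set seed 42
set obs 2000
gen double x = rnormal()
gen byte   d = runiform() < invlogit(0.5*x)
gen double y = d + x + rnormal()

* Two folds; the simulated rows are i.i.d., so alternating rows is random enough here
gen int fold = mod(_n, 2) + 1
gen double g0 = .
gen double g1 = .
gen double m  = .
forvalues k = 1/2 {
    * E[Y|X,D=0] and E[Y|X,D=1], each fitted on the relevant training subsample
    forvalues t = 0/1 {
        quietly regress y x if fold != `k' & d == `t'
        quietly predict double tmp if fold == `k', xb
        quietly replace g`t' = tmp if fold == `k'
        drop tmp
    }
    * Propensity score E[D|X]
    quietly logit d x if fold != `k'
    quietly predict double tmp if fold == `k', pr
    quietly replace m = tmp if fold == `k'
    drop tmp
}

* Trim propensity scores: values below 0.01 / above 0.99 are set to 0.01 / 0.99
replace m = min(max(m, 0.01), 0.99)

* Doubly robust score; its sample mean is the ATE estimate
gen double psi = g1 - g0 + d*(y - g1)/m - (1 - d)*(y - g0)/(1 - m)
mean psi
```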

    Partially linear IV model [iv]

        Y = a.D + g(X) + U
        Z = m(X) + V

    where the aim is to estimate a.  We estimate the conditional expectations E[Y|X], E[D|X] and E[Z|X] using a supervised machine learner.

    Interactive IV model [interactiveiv]

        Y = g(Z,X) + U
        D = h(Z,X) + V
        Z = m(X) + E

    where the aim is to estimate the local average treatment effect.  We estimate, using a supervised machine learner, the following conditional expectations:
    E[Y|X,Z=0] and E[Y|X,Z=1] (jointly added using ddml E[Y|X,Z]); E[D|X,Z=0] and E[D|X,Z=1] (jointly added using ddml E[D|X,Z]); E[Z|X].

    Flexible Partially Linear IV model [fiv]

        Y = a.D + g(X) + U
        D = m(Z) + g(X) + V 

    where the estimand of interest is a.  We estimate the conditional expectations E[Y|X], E[D^|X] and D^:=E[D|Z,X] using a supervised machine learner. The
    instrument is then formed as D^-E^[D^|X] where E^[D^|X] denotes the estimate of E[D^|X].

    Note: "{D}" is a placeholder that is used because the last step (estimation of E[D|X]) uses the fitted values from estimating E[D|X,Z].  Please see example
    section below.
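
    The construction of the instrument D^-E^[D^|X] can be sketched by hand. The example below uses simulated data, OLS as the only learner and two folds; the
    variable names (dfit, dhat, dhat_h, dtilde, zopt) and the way training-fold fitted values are reused are assumptions for illustration, and ddml's
    internal procedure may differ.

```stata
* Simulated data (illustration only): true a = 0.5, d is endogenous through v
clear
set seed 42
set obs 2000
gen double x  = rnormal()
gen double z  = rnormal()
gen double z2 = z^2
gen double v  = rnormal()
gen double d  = z + 0.5*z2 + x + v
gen double y  = 0.5*d + x + 0.5*v + rnormal()

gen int fold = mod(_n, 2) + 1
gen double yhat   = .
gen double dhat   = .
gen double dhat_h = .
forvalues k = 1/2 {
    * E[Y|X]
    quietly regress y x if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace yhat = tmp if fold == `k'
    drop tmp
    * D^ = E[D|Z,X]: fit on the training folds, keep fitted values for all rows
    quietly regress d z z2 x if fold != `k'
    quietly predict double dfit, xb
    quietly replace dhat = dfit if fold == `k'
    * E[D^|X]: fit the training-fold fitted values against X, predict held-out fold
    quietly regress dfit x if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace dhat_h = tmp if fold == `k'
    drop tmp dfit
}

* Residualized outcome, residualized treatment and the constructed instrument
gen double yres   = y - yhat
gen double dtilde = d - dhat_h
gen double zopt   = dhat - dhat_h
ivregress 2sls yres (dtilde = zopt), vce(robust)
```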

Compatible programs

    ddml is compatible with a large set of user-written Stata commands.  It has been tested with

       - lassopack for regularized regression (see lasso2, cvlasso, rlasso).

       - the pystacked package (see pystacked).  Note that pystacked requires Stata 16.

       - rforest by Zou & Schonlau. Note that rforest requires the option vtype(none).

       - svmachines by Guenther & Schonlau.

    Beyond these, it is compatible with any Stata program that

       - uses the standard "reg y x" syntax,

       - supports if-conditions,

       - and comes with predict post-estimation programs.
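
    The three requirements can be checked for a candidate program without involving ddml at all. The sketch below defines a hypothetical wrapper called
    myols (a thin layer around regress, invented here for illustration) and verifies that it accepts the "reg y x" syntax, honours an if-condition and
    supports predict. Whether a given program works smoothly with ddml still needs to be tested with ddml itself.

```stata
* Hypothetical learner: a thin wrapper around regress
capture program drop myols
program define myols
    syntax varlist(numeric) [if] [in]
    regress `varlist' `if' `in'
end

* Check the three requirements on the auto data shipped with Stata
sysuse auto, clear

* 1. and 2.: standard "reg y x" syntax combined with an if-condition
myols price weight length if foreign == 0

* 3.: predict is available afterwards, including for observations outside the if-sample
predict double price_hat, xb
summarize price_hat
```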

Examples

    Below we demonstrate the use of ddml for each of the 5 models supported.  Note that estimation models are chosen for demonstration purposes only and kept
    simple to allow you to run the code quickly.

    Partially linear model I.

    Preparation: we load the data, define global macros and set the seed.
        . use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
        . global Y net_tfa
        . global D e401
        . global X tw age inc fsize educ db marr twoearn pira hown
        . set seed 42

    We next initialize the ddml estimation and select the model.  partial refers to the partially linear model.  The model will be stored on a Mata object with
    the default name "m0" unless otherwise specified using the mname(name) option.
 
    Note that we set the number of random folds to 2, so that the model runs quickly. The default is kfolds(5). We recommend considering at least 5-10 folds
    and even more if your sample size is small.
 
    Note also that we recommend re-running the model multiple times on different random folds; see option reps(integer).
 
        . ddml init partial, kfolds(2)

    We add supervised machine learners for estimating the conditional expectation E[Y|X]. We first add simple linear regression.
        . ddml E[Y|X]: reg $Y $X

    We can add more than one learner per reduced form equation. Here, we add a random forest learner. We do this using pystacked.  In the next example we show
    how to use pystacked to stack multiple learners, but here we use it to implement a single learner.
        . ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)

    We do the same for the conditional expectation E[D|X].
        . ddml E[D|X]: reg $D $X
        . ddml E[D|X]: pystacked $D $X, type(reg) method(rf)

    Optionally, you can check if the learners have been added correctly.
        . ddml desc

    Cross-fitting. The learners are iteratively fitted on the training data.  This step may take a while.
        . ddml crossfit

    Finally, we estimate the coefficients of interest.  Since we added two learners for each of our two reduced form equations, there are four possible
    specifications.  By default, the result shown corresponds to the specification with the lowest out-of-sample MSPE:
        . ddml estimate, robust

    To estimate all four specifications, we use the allcombos option:
        . ddml estimate, robust allcombos

    After having estimated all specifications, we can retrieve specific results. Here we use the specification relying on OLS for estimating both E[Y|X]
    and E[D|X]:
        . ddml estimate, robust spec(1) replay

    You could manually retrieve the same point estimate by typing:
        . reg Y1_reg D1_reg, robust
    or graphically:
        . twoway (scatter Y1_reg D1_reg) (lfit Y1_reg D1_reg)

    where Y1_reg and D1_reg are the orthogonalized versions of net_tfa and e401.

    To describe the ddml model setup or results in detail, you can use ddml describe with the relevant option (sample, learners, crossfit, estimates), or just
    describe them all with the all option:
        . ddml describe, all

    Partially linear model II. Stacking regression using pystacked.

    Stacking regression is a simple and powerful method for combining predictions from multiple learners.  It is available in Stata via the pystacked package.
    Below is an example with the partially linear model, but it can be used with any model supported by ddml.

    Preparation: use the data and globals as above.  Use the name m1 for this new estimation, to distinguish it from the previous example that uses the default
    name m0.  This enables having multiple estimations available for comparison.  Also specify 5 resamplings.
        . set seed 42
        . ddml init partial, kfolds(2) reps(5) mname(m1)

    Add supervised machine learners for estimating conditional expectations.  The first learner in the stacked ensemble is OLS.  We also use cross-validated
    lasso, ridge and two random forests with different settings, which we save in the following macros:
        . global rflow max_features(5) min_samples_leaf(1) max_samples(.7)
        . global rfhigh max_features(5) min_samples_leaf(10) max_samples(.7)

    In each step, we add the mname(m1) option to ensure that the learners are not added to the m0 model which is still in memory.  We also specify the names of
    the variables containing the estimated conditional expectations using the learner(varname) option.  This avoids overwriting the variables created for the
    m0 model using default naming.

        . ddml E[Y|X], mname(m1) learner(Y_m1): pystacked $Y $X || method(ols) || method(lassocv) || method(ridgecv) || method(rf) opt($rflow) || method(rf)
            opt($rfhigh), type(reg)
        . ddml E[D|X], mname(m1) learner(D_m1): pystacked $D $X || method(ols) || method(lassocv) || method(ridgecv) || method(rf) opt($rflow) || method(rf)
            opt($rfhigh), type(reg)

    Note: Options before ":" and after the first comma refer to ddml.  Options that come after the final comma refer to the estimation command.  Make sure
    not to confuse the two types of options.

    Check if learners were correctly added:
        . ddml desc, mname(m1) learners

    Cross-fitting and estimation.
        . ddml crossfit, mname(m1)
        . ddml estimate, mname(m1) robust

    Examine the stacking weights and MSEs reported by pystacked.
        . ddml extract, mname(m1) show(pystacked)
        . ddml extract, mname(m1) show(mse)

    We can compare the effects with the first ddml model (if you have run the first example above).
        . ddml estimate, mname(m0) replay

    Partially linear model III. Multiple treatments.

    We can also run the partially linear model with multiple treatments.  In this simple example, we estimate the effect of both 401k eligibility e401 and
    education educ.  Note that we remove educ from the set of controls.
        . use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
        . global Y net_tfa
        . global D1 e401
        . global D2 educ
        . global X tw age inc fsize db marr twoearn pira hown
        . set seed 42

    Initialize the model.
        . ddml init partial, kfolds(2)

    Add learners. Note that we add learners with both $D1 and $D2 as the dependent variable.
        . ddml E[Y|X]: reg $Y $X
        . ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)
        . ddml E[D|X]: reg $D1 $X
        . ddml E[D|X]: pystacked $D1 $X, type(reg) method(rf)
        . ddml E[D|X]: reg $D2 $X
        . ddml E[D|X]: pystacked $D2 $X, type(reg) method(rf)

    Cross-fitting.
        . ddml crossfit

    Estimation.
        . ddml estimate, robust

    Partially linear IV model.

    Preparation: we load the data, define global macros and set the seed.
        . use https://statalasso.github.io/dta/AJR.dta, clear
        . global Y logpgp95
        . global D avexpr
        . global Z logem4
        . global X lat_abst edes1975 avelf temp* humid* steplow-oilres
        . set seed 42

    We initialize the model. Since the data set is very small, we consider 30 cross-fitting folds.
        . ddml init iv, kfolds(30)

    The partially linear IV model has three conditional expectations:  E[Y|X], E[D|X] and E[Z|X]. For each reduced form equation, we add two learners: regress
    and rforest.

    We need to add the option vtype(none) for rforest to work with ddml since rforest's predict command doesn't support variable types.
        . ddml E[Y|X]: reg $Y $X
        . ddml E[Y|X], vtype(none): rforest $Y $X, type(reg)
        . ddml E[D|X]: reg $D $X
        . ddml E[D|X], vtype(none): rforest $D $X, type(reg)
        . ddml E[Z|X]: reg $Z $X
        . ddml E[Z|X], vtype(none): rforest $Z $X, type(reg)

    Cross-fitting and estimation. We use the shortstack option to combine the base learners. Short-stacking is a computationally cheaper alternative to
    stacking. Whereas stacking relies on cross-validated predicted values to obtain the relative weights for the base learners, short-stacking uses the
    cross-fitted predicted values.
        . ddml crossfit, shortstack
        . ddml estimate, robust
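
    To illustrate what short-stacking does with the cross-fitted predicted values, the sketch below combines two hypothetical learners (linear and quadratic
    OLS) on simulated data. It assumes that the weights are non-negative and sum to one; with two learners this reduces to a one-dimensional least squares
    problem whose solution is clipped to [0,1]. The names p1, p2, w1, w2 and p_ss are invented for illustration, and the routine used by ddml is more
    general.

```stata
* Simulated data (illustration only): the outcome is quadratic in x
clear
set seed 42
set obs 1000
gen double x  = rnormal()
gen double x2 = x^2
gen double y  = x + 0.5*x2 + rnormal()

* Cross-fitted predictions from two learners over two folds
gen int fold = mod(_n, 2) + 1
gen double p1 = .
gen double p2 = .
forvalues k = 1/2 {
    quietly regress y x if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace p1 = tmp if fold == `k'
    drop tmp
    quietly regress y x x2 if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace p2 = tmp if fold == `k'
    drop tmp
}

* With weights w and 1-w, regressing (y - p2) on (p1 - p2) without a constant
* gives the unconstrained weight on learner 1; clipping enforces 0 <= w <= 1
gen double ydiff = y - p2
gen double pdiff = p1 - p2
quietly regress ydiff pdiff, noconstant
scalar w1 = min(max(_b[pdiff], 0), 1)
scalar w2 = 1 - scalar(w1)

* Short-stacked prediction: weighted average of the cross-fitted predictions
gen double p_ss = scalar(w1)*p1 + scalar(w2)*p2
display "short-stack weights: " scalar(w1) " (linear)  " scalar(w2) " (quadratic)"
```

    No additional cross-validation is needed here, because the weights are computed from predictions that are already out-of-sample.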

    If you are curious about what ddml does in the background:
        . ddml estimate, allcombos spec(8) rep(1) robust
        . ivreg Y2_rf (D2_rf = Z2_rf), robust

    Interactive model--ATE and ATET estimation.

    Preparation: we load the data, define global macros and set the seed.
        . webuse cattaneo2, clear
        . global Y bweight
        . global D mbsmoke
        . global X prenatal1 mmarried fbaby mage medu
        . set seed 42

    We use 5 folds and 5 resamplings; that is, we estimate the model 5 times using randomly chosen folds.
        . ddml init interactive, kfolds(5) reps(5)

    We need to estimate the conditional expectations E[Y|X,D=0], E[Y|X,D=1] and E[D|X]. The first two conditional expectations are added jointly.
 
    We consider two supervised learners: linear regression and gradient boosted trees, stacked using pystacked.  Note that we use gradient boosted regression
    trees for E[Y|X,D], but gradient boosted classification trees for E[D|X].
 
        . ddml E[Y|X,D]: pystacked $Y $X, type(reg) methods(ols gradboost)
        . ddml E[D|X]: pystacked $D $X, type(class) methods(logit gradboost)

    Cross-fitting:
        . ddml crossfit

    In the final estimation step, we can estimate the average treatment effect (the default), the average treatment effect on the treated (atet), or the
    average treatment effect on the untreated (ateu).
        . ddml estimate
        . ddml estimate, atet

    Recall that we have specified 5 resampling iterations (reps(5)).  By default, the median over the minimum-MSE specification per resampling iteration is
    shown.  At the bottom, a table of summary statistics over resampling iterations is shown.
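
    For intuition on median aggregation, the sketch below applies the median rule of Chernozhukov et al. (2018) to five made-up estimates and standard
    errors: the point estimate is the median across repetitions, and the aggregated variance is the median of each repetition's variance plus its squared
    deviation from the median estimate. The numbers are hypothetical, and we have not verified that ddml's reported standard errors follow exactly this
    formula.

```stata
* Hypothetical estimates and standard errors from 5 cross-fit repetitions
clear
input double(b se)
0.48 0.030
0.52 0.031
0.50 0.029
0.55 0.032
0.47 0.030
end

* Median of the point estimates
quietly summarize b, detail
scalar b_med = r(p50)

* Median of (variance + squared deviation from the median estimate)
gen double v = se^2 + (b - scalar(b_med))^2
quietly summarize v, detail
scalar se_med = sqrt(r(p50))

display "median estimate = " scalar(b_med) "   aggregated SE = " scalar(se_med)
```

    The squared-deviation term makes the aggregated standard error reflect the variability induced by the random fold splits.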

    To estimate using the same two base learners but with short-stacking instead of stacking, we would enter the learners separately and use the shortstack
    option:

        . set seed 42
        . ddml init interactive, kfolds(5) reps(5)
        . ddml E[Y|X,D]: reg $Y $X
        . ddml E[Y|X,D]: pystacked $Y $X, type(reg) method(gradboost)
        . ddml E[D|X]: logit $D $X
        . ddml E[D|X]: pystacked $D $X, type(class) method(gradboost)
        . ddml crossfit, shortstack
        . ddml estimate

    Interactive IV model--LATE estimation.

    Preparation: we load the data, define global macros and set the seed.
        . use http://fmwww.bc.edu/repec/bocode/j/jtpa.dta, clear
        . global Y earnings
        . global D training
        . global Z assignmt
        . global X sex age married black hispanic
        . set seed 42

    We initialize the model.
        . ddml init interactiveiv, kfolds(5)

    We use stacking (implemented in pystacked) with two base learners for each reduced form equation.
        . ddml E[Y|X,Z]: pystacked $Y c.($X)##c.($X), type(reg) m(ols lassocv)
        . ddml E[D|X,Z]: pystacked $D c.($X)##c.($X), type(class) m(logit lassocv)
        . ddml E[Z|X]: pystacked $Z c.($X)##c.($X), type(class) m(logit lassocv)

    Cross-fitting and estimation.
        . ddml crossfit
        . ddml estimate, robust

    To short-stack instead of stack:
        . set seed 42
        . ddml init interactiveiv, kfolds(5)
        . ddml E[Y|X,Z]: reg $Y $X
        . ddml E[Y|X,Z]: pystacked $Y c.($X)##c.($X), type(reg) m(lassocv)
        . ddml E[D|X,Z]: logit $D $X
        . ddml E[D|X,Z]: pystacked $D c.($X)##c.($X), type(class) m(lassocv)
        . ddml E[Z|X]: logit $Z $X
        . ddml E[Z|X]: pystacked $Z c.($X)##c.($X), type(class) m(lassocv)

    Cross-fitting and estimation.
        . ddml crossfit, shortstack
        . ddml estimate, robust

    Flexible Partially Linear IV model.

    Preparation: we load the data, define global macros and set the seed.
        . use https://github.com/aahrens1/ddml/raw/master/data/BLP.dta, clear
        . global Y share
        . global D price
        . global X hpwt air mpd space
        . global Z sum*
        . set seed 42

    We initialize the model.
        . ddml init fiv

    We add learners for E[Y|X] in the usual way.
        . ddml E[Y|X]: reg $Y $X
        . ddml E[Y|X]: pystacked $Y $X, type(reg)

    There are some peculiarities that we need to bear in mind when adding learners for E[D|Z,X] and E[D|X].  The reason for this is that the estimation of
    E[D|X] depends on the estimation of E[D|X,Z].  More precisely, we first obtain the fitted values D^=E[D|X,Z] and fit these against X to estimate E[D^|X].

    When adding learners for E[D|Z,X], we need to provide a name for each learner using learner(name).
        . ddml E[D|Z,X], learner(Dhat_reg): reg $D $X $Z
        . ddml E[D|Z,X], learner(Dhat_pystacked): pystacked $D $X $Z, type(reg)

    When adding learners for E[D|X], we explicitly refer to the learner from the previous step (e.g., learner(Dhat_reg)) and also provide the name of the
    treatment variable (vname($D)).  Finally, we use the placeholder {D} in place of the dependent variable.
        . ddml E[D|X], learner(Dhat_reg) vname($D): reg {D} $X
        . ddml E[D|X], learner(Dhat_pystacked) vname($D): pystacked {D} $X, type(reg)
 
    That's it. Now we can move to cross-fitting and estimation.
        . ddml crossfit
        . ddml estimate, robust

    If you are curious about what ddml does in the background:
        . ddml estimate, allcombos spec(8) rep(1) robust
        . gen Dtilde = $D - Dhat_pystacked_h_1
        . gen Zopt = Dhat_pystacked_1 - Dhat_pystacked_h_1
        . ivreg Y2_pystacked_1 (Dtilde=Zopt), robust

References

    Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and
    structural parameters.  The Econometrics Journal, 21: C1-C68. https://doi.org/10.1111/ectj.12097

Installation

    To get the latest stable version of ddml from our website, check the installation instructions at https://statalasso.github.io/installation/.  We update
    the stable website version more frequently than the SSC version.

    To verify that ddml is correctly installed, type whichpkg ddml (which requires whichpkg to be installed; ssc install whichpkg).

Authors

    Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
    achim.ahrens@gess.ethz.ch

    Christian B. Hansen, University of Chicago, USA
    Christian.Hansen@chicagobooth.edu

    Mark E Schaffer, Heriot-Watt University, UK
    m.e.schaffer@hw.ac.uk

    Thomas Wiemann, University of Chicago, USA
    wiemann@uchicago.edu

Also see (if installed)

    Help: lasso2, cvlasso, rlasso, ivlasso, pdslasso, pystacked.