qddml help file

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
help ddml                                                                                                                                                    v1.2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Title

    qddml --      Stata program for Double Debiased Machine Learning

    ddml implements algorithms for causal inference aided by supervised machine learning as proposed in Double/debiased machine learning for treatment and
    structural parameters (Econometrics Journal, 2018). Five different models are supported, allowing for binary or continuous treatment variables and
    endogeneity, high-dimensional controls and/or instrumental variables.  ddml supports a variety of different ML programs, including but not limited to 
    lassopack and pystacked.

    qddml is a wrapper program for ddml. It provides a convenient one-line syntax with almost the full flexibility of ddml.  The main restriction of qddml is
    that it can only be used with one machine learning program at a time, while ddml allows for multiple learners per reduced form equation.

    qddml uses stacking regression (pystacked) as the default machine learning program.

    qddml relies on crossfit, which can be used as a standalone program.

        qddml depvar regressors [(hd_controls)] [(endog=instruments)] [if exp] [in range] , model(name) [ cmd(string) cmdopt(string) mname(string) noreg ... ]

    Since qddml uses pystacked by default, it requires Stata 16 or higher, Python 3.x and at least scikit-learn 0.24. See this help file, this Stata blog
    entry and this YouTube video for how to set up Python on your system.  In short, install Python 3.x (we recommend Anaconda) and set the appropriate Python
    path using python set exec.  If you don't have Stata 16+, you can still use qddml with programs that don't rely on Python, e.g., using the option
    cmd(rlasso).
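
    As a rough sketch of the Python setup described above (the path below is a placeholder, not a real location; substitute your own Python executable), the
    following commands show which Python installation Stata currently uses, point Stata to a specific installation, and check that scikit-learn can be found:

        . python query
        . python set exec "/path/to/your/python", permanently
        . python which sklearn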

    Please check the examples provided at the end of the help file.

Options

    General               Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    model(name)            the model to be estimated; allows for partial, interactive, iv, fiv, late. See here for an overview.
    mname(string)          name of the DDML model. Allows multiple DDML models to be run simultaneously. Defaults to m0.
    kfolds(integer)        number of cross-fitting folds. The default is 5.
    fcluster(varname)      cluster identifiers for cluster randomization of random folds.
    foldvar(varname)       integer variable with user-specified cross-fitting folds.
    reps(integer)          number of re-sampling iterations, i.e., how often the cross-fitting procedure is repeated on randomly generated folds.
    shortstack             asks for short-stacking to be used.  Short-stacking runs constrained non-negative least squares on the cross-fitted predicted values
                            to obtain a weighted average of several base learners.
    robust                 report SEs that are robust to the presence of arbitrary heteroskedasticity.
    vce(type)              select variance-covariance estimator; see here.
    cluster(varname)       select cluster-robust variance-covariance estimator.
    noreg                  do not add regress as an additional learner.

    Learners              Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    cmd(string)            ML program used for estimating conditional expectations.  Defaults to pystacked.  See here for other supported programs.
    ycmd(string)           ML program used for estimating the conditional expectations of the outcome Y.  Defaults to cmd(string).
    dcmd(string)           ML program used for estimating the conditional expectations of the treatment variable(s) D.  Defaults to cmd(string).
    zcmd(string)           ML program used for estimating conditional expectations of instrumental variable(s) Z.  Defaults to cmd(string).
    *cmdopt(string)        options that are passed on to ML program.  The asterisk * can be replaced with either nothing (setting the default for all reduced
                            form equations), y (setting the default for the conditional expectation of Y), d (setting the default for D) or z (setting the
                            default for Z).
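    *vtype(string)         variable type of the variable to be created. Defaults to double.  none can be used to leave the type field blank (this is required
                            when using ddml with rforest).  The asterisk * can be replaced with either nothing (setting the default for all reduced form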
                            equations), y (setting the default for the conditional expectation of Y), d (setting the default for D) or z (setting the default
                            for Z).
    *predopt(string)       predict option to be used to get predicted values.  Typical values could be xb or pr. Default is blank. The asterisk * can be
                            replaced with either nothing (setting the default for all reduced form equations), y (setting the default for the conditional
                            expectation of Y), d (setting the default for D) or z (setting the default for Z).

    Output                Description
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
    verbose                show detailed output
    vverbose               show even more output

Models

    See here.

Compatible programs

    See here.

Examples

    Below we demonstrate the use of qddml for each of the 5 models supported.  Note that estimation models are chosen for demonstration purposes only and kept
    simple to allow you to run the code quickly.  Please also see the examples in the ddml help file.

    Partially linear model.

    Preparations: we load the data, define global macros and set the seed.
        . use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
        . global Y net_tfa
        . global D e401
        . global X tw age inc fsize educ db marr twoearn pira hown
        . set seed 42

    The options model(partial) selects the partially linear model and kfolds(2) selects two cross-fitting folds.  We use the options cmd() and cmdopt() to
    select random forest for estimating the conditional expectations.

    Note that we set the number of random folds to 2 so that the model runs quickly. The default is kfolds(5). We recommend considering at least 5-10 folds,
    and even more if your sample size is small.

    Note also that we recommend re-running the model multiple times on different random folds; see the option reps(integer).

        . qddml $Y $D ($X), kfolds(2) model(partial) cmd(pystacked) cmdopt(type(reg) method(rf))
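
    As a variation (a sketch reusing the macros defined above, not a recommended specification), the same model can be re-estimated over several
    cross-fitting repetitions with reps(), here 5 repetitions chosen arbitrarily:

        . qddml $Y $D ($X), kfolds(2) reps(5) model(partial) cmd(pystacked) cmdopt(type(reg) method(rf))

    If Python is not available, a learner that does not rely on Python can be selected with cmd(); this assumes lassopack (which provides rlasso) is installed:

        . qddml $Y $D ($X), kfolds(2) model(partial) cmd(rlasso)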

    Partially linear IV model.

    Preparations: we load the data, define global macros and set the seed.
        . use https://statalasso.github.io/dta/AJR.dta, clear
        . global Y logpgp95
        . global D avexpr
        . global Z logem4
        . global X lat_abst edes1975 avelf temp* humid* steplow-oilres
        . set seed 42

    Since the data set is very small, we consider 30 cross-fitting folds.
 
    We need to add the option vtype(none) for rforest to work with ddml since rforest's predict command doesn't support variable types.

        . qddml $Y ($X) ($D=$Z), kfolds(30) model(iv) cmd(rforest) cmdopt(type(reg)) vtype(none) robust

    Interactive model--ATE and ATET estimation.

    Preparations: we load the data, define global macros and set the seed.
        . webuse cattaneo2, clear
        . global Y bweight
        . global D mbsmoke
        . global X mage prenatal1 mmarried fbaby medu
        . set seed 42

    Note that we use gradient boosted regression trees for E[Y|X,D] (see ycmdopt()), but gradient boosted classification trees for E[D|X] (see dcmdopt()).
 
        . qddml $Y $D ($X), kfolds(5) reps(5) model(interactive) cmd(pystacked) ycmdopt(type(reg) method(gradboost)) dcmdopt(type(class) method(gradboost))

    qddml reports the ATE by default. The option atet returns the ATET estimate.

    If we want to retrieve the ATET estimate after estimation, we can simply use ddml estimate.
        . ddml estimate, atet

    Interactive IV model--LATE estimation.

    Preparations: we load the data, define global macros and set the seed.
        . use http://fmwww.bc.edu/repec/bocode/j/jtpa.dta, clear
        . global Y earnings
        . global D training
        . global Z assignmt
        . global X sex age married black hispanic
        . set seed 42

        . qddml $Y (c.($X)##c.($X)) ($D=$Z), kfolds(5) model(interactiveiv) cmd(pystacked) ycmdopt(type(reg) m(lassocv)) dcmdopt(type(class) m(lassocv))
            zcmdopt(type(class) m(lassocv))

    Flexible Partially Linear IV model.

    Preparations: we load the data, define global macros and set the seed.
        . use https://github.com/aahrens1/ddml/raw/master/data/BLP.dta, clear
        . global Y share
        . global D price
        . global X hpwt air mpd space
        . global Z sum*
        . set seed 42

    The syntax is the same as in the Partially Linear IV model, but we now estimate the optimal instrument flexibly.
        . qddml $Y ($X) ($D=$Z), model(fiv)

References

    Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and
    structural parameters.  The Econometrics Journal, 21: C1-C68. https://doi.org/10.1111/ectj.12097

Installation

    To get the latest stable version of ddml from our website, check the installation instructions at https://statalasso.github.io/installation/.  We update
    the stable website version more frequently than the SSC version.

    To verify that ddml is correctly installed, type whichpkg ddml (which requires whichpkg to be installed; ssc install whichpkg).

Authors

    Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
    achim.ahrens@gess.ethz.ch

    Christian B. Hansen, University of Chicago, USA
    Christian.Hansen@chicagobooth.edu

    Mark E Schaffer, Heriot-Watt University, UK
    m.e.schaffer@hw.ac.uk

    Thomas Wiemann, University of Chicago, USA
    wiemann@uchicago.edu

Also see (if installed)

    Help: lasso2, cvlasso, rlasso, ivlasso, pdslasso, pystacked.