Partially Linear Model (PLM)

Preparations

We load the data, define global macros and set the seed.

. use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
. global Y net_tfa
. global D e401
. global X tw age inc fsize educ db marr twoearn pira hown
. set seed 42

Step 1: Initialize DDML model

We next initialize the ddml estimation and select the model. The keyword partial refers to the partially linear model. The model will be stored in a Mata object with the default name “m0” unless otherwise specified using the mname(name) option.
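For reference, the partially linear model can be sketched as follows. The symbols \(\theta\), \(g\), \(m\), \(U\) and \(V\) are notation assumed here for illustration; they are not defined elsewhere on this page.

\[
Y = \theta D + g(X) + U, \qquad E[U \mid D, X] = 0,
\]
\[
D = m(X) + V, \qquad E[V \mid X] = 0.
\]

Taking expectations conditional on \(X\) in the first equation and subtracting gives

\[
Y - E[Y \mid X] = \theta \, \bigl(D - E[D \mid X]\bigr) + U,
\]

which is why the steps below add learners for the two conditional expectations \(E[Y|X]\) and \(E[D|X]\).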

Number of folds
Note that we set the number of random folds to 2 so that the model runs quickly. The default is kfolds(5). We recommend using at least 5-10 folds, and more if your sample size is small.

. ddml init partial, kfolds(2)
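As a minimal sketch of the alternative described above, the following call uses the default number of folds and stores the model under a custom name. The name m_plm is hypothetical and chosen only for illustration.

```stata
* Initialize a partially linear model with 5 folds (the stated default)
* and store it under the hypothetical name m_plm instead of m0
ddml init partial, kfolds(5) mname(m_plm)
```

Later ddml commands would then need to refer to the same model name; the help file documents the exact syntax.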

Step 2: Add machine learners

We add supervised machine learners for estimating the conditional expectation \(E[Y|X]\). We first add a simple linear regression.

. ddml E[Y|X]: reg $Y $X
Learner Y1_reg added successfully.

We can add more than one learner per reduced form equation. Here, we also add a random forest learner implemented in pystacked. (In the next example we show how to use pystacked to stack multiple learners, but here we use it to implement a single learner.)

. ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)
Learner Y2_pystacked added successfully.

We do the same for the conditional expectation \(E[D|X]\).

. ddml E[D|X]: reg $D $X
Learner D1_reg added successfully.

. ddml E[D|X]: pystacked $D $X, type(reg) method(rf)
Learner D2_pystacked added successfully.

Optionally, you can check if the learners have been added correctly.

. ddml desc

Model:                  partial, crossfit folds k=2, resamples r=1
Dependent variable (Y): net_tfa
 net_tfa learners:      Y1_reg Y2_pystacked
D equations (1):        e401
 e401 learners:         D1_reg D2_pystacked
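Further learners can be added in the same way before cross-fitting. The sketch below adds a gradient boosting learner to each equation; it assumes that the installed pystacked version supports method(gradboost). It is not part of the run whose output is shown on this page.

```stata
* Assumption: pystacked supports method(gradboost) (gradient boosting)
ddml E[Y|X]: pystacked $Y $X, type(reg) method(gradboost)
ddml E[D|X]: pystacked $D $X, type(reg) method(gradboost)

* Check that each equation now lists three learners
ddml desc
```

With three learners per equation there would be 3 x 3 = 9 possible specifications at the estimation stage rather than the four shown below.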

Step 3: Cross-fitting

Each learner is fitted in turn on the training folds and used to predict on the held-out fold. This step may take a while.

. ddml crossfit
Cross-fitting E[Y|X] equation: net_tfa
Cross-fitting fold 1 2 ...completed cross-fitting
Cross-fitting E[D|X] equation: e401
Cross-fitting fold 1 2 ...completed cross-fitting
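To make the idea concrete, the following is a rough manual sketch of 2-fold cross-fitting for the OLS learner of \(E[Y|X]\). It is not how ddml is implemented internally. The variable names fold_id, yhat_cf, p_tmp and y_res are hypothetical, and the random split will not reproduce the folds that ddml draws.

```stata
* Assumes $Y and $X are defined as in the Preparations step
set seed 42

* Randomly assign each observation to fold 1 or fold 2
generate byte fold_id = 1 + (runiform() > 0.5)
generate double yhat_cf = .

forvalues k = 1/2 {
    * Fit the learner on all observations outside fold k
    quietly regress $Y $X if fold_id != `k'
    * Predict only for the held-out fold k
    quietly predict double p_tmp if fold_id == `k', xb
    quietly replace yhat_cf = p_tmp if fold_id == `k'
    drop p_tmp
}

* Out-of-sample residual of the outcome
generate double y_res = $Y - yhat_cf
summarize y_res
```

Every prediction in yhat_cf comes from a model that did not see that observation during fitting, which is the defining property of cross-fitting.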

Step 4: Estimation

Finally, we obtain estimates of the coefficients of interest. Since we added two learners for each of our two reduced form equations, there are four possible specifications. By default, the result shown corresponds to the specification with the lowest out-of-sample MSPE:

. ddml estimate, robust

DDML estimation results:
spec  r     Y learner     D learner         b        SE
   1  1        Y1_reg        D1_reg  5397.308(1130.901)
   2  1        Y1_reg  D2_pystacked  6707.514 (880.374)
*  3  1  Y2_pystacked        D1_reg  7044.822(1127.173)
   4  1  Y2_pystacked  D2_pystacked  6991.835 (755.805)
* = minimum MSE specification for that resample.

Min MSE DDML model, specification 3
y-E[y|X]  = Y2_pystacked_1                         Number of obs   =      9915
D-E[D|X,Z]= D1_reg_1
------------------------------------------------------------------------------
             |               Robust
     net_tfa | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        e401 |   7044.822   1127.173     6.25   0.000     4835.603    9254.042
------------------------------------------------------------------------------

To estimate all four specifications, we use the allcombos option:

. ddml estimate, robust allcombos

DDML estimation results:
spec  r     Y learner     D learner         b        SE
   1  1        Y1_reg        D1_reg  5397.208(1130.776)
   2  1        Y1_reg  D2_pystacked  6705.740 (878.656)
*  3  1  Y2_pystacked        D1_reg  7044.518(1126.896)
   4  1  Y2_pystacked  D2_pystacked  6979.699 (753.471)
* = minimum MSE specification for that resample.

Min MSE DDML model
y-E[y|X]  = Y2_pystacked_1                         Number of obs   =      9915
D-E[D|X,Z]= D1_reg_1
------------------------------------------------------------------------------
             |               Robust
     net_tfa | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        e401 |   7044.518   1126.896     6.25   0.000     4835.843    9253.193
       _cons |  -317.8379   352.8666    -0.90   0.368    -1009.444     373.768
------------------------------------------------------------------------------

After having estimated all specifications, we can retrieve specific results. Here we use the specification relying on OLS for estimating both E[Y|X] and E[D|X]:

. ddml estimate, robust spec(1) replay

DDML estimation results:
spec  r     Y learner     D learner         b        SE
 opt  1  Y2_pystacked        D1_reg  7044.518(1126.896)
opt = minimum MSE specification for that resample.

DDML model, specification 1
y-E[y|X]  = Y1_reg_1                               Number of obs   =      9915
D-E[D|X,Z]= D1_reg_1
------------------------------------------------------------------------------
             |               Robust
     net_tfa | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        e401 |   5397.208   1130.776     4.77   0.000     3180.928    7613.488
       _cons |   -104.854   397.9023    -0.26   0.792     -884.728    675.0201
------------------------------------------------------------------------------

Inclusion of the constant
Since the residualized outcome and treatment may not be exactly mean-zero in finite samples, ddml includes the constant by default in the estimation stage of partially linear models. Asymptotically, the intercept is not required. Earlier versions of ddml (before 1.2) did not include the constant.

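You could manually retrieve the point estimate by regressing the orthogonalized outcome on the orthogonalized treatment. Note that the output shown below has no _cons row, so it corresponds to the estimate without a constant (5397.308, as in the first results table above) rather than to the 5397.208 obtained with the constant: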

. reg Y1_reg D1_reg, robust

Linear regression                               Number of obs     =      9,915
                                                F(1, 9914)        =      22.78
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0037
                                                Root MSE          =      39626

------------------------------------------------------------------------------
             |               Robust
    Y1_reg_1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    D1_reg_1 |   5397.308   1130.901     4.77   0.000     3180.512    7614.105
------------------------------------------------------------------------------

or graphically:

. twoway (scatter Y1_reg D1_reg) (lfit Y1_reg D1_reg)

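where Y1_reg and D1_reg are the orthogonalized versions of net_tfa and e401. (Stata resolves these abbreviations to the stored variables Y1_reg_1 and D1_reg_1, as the regression output shows.)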
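The manual regression can also be written as a formula. As a sketch, let \(\tilde{Y}_i\) and \(\tilde{D}_i\) denote the orthogonalized outcome and treatment for observation \(i\), and let \(\bar{\tilde{Y}}\) and \(\bar{\tilde{D}}\) be their sample means; this notation is assumed here for illustration. With a constant included, the OLS slope is

\[
\hat{\theta} = \frac{\sum_{i=1}^{n} (\tilde{D}_i - \bar{\tilde{D}})(\tilde{Y}_i - \bar{\tilde{Y}})}{\sum_{i=1}^{n} (\tilde{D}_i - \bar{\tilde{D}})^2}.
\]

Without a constant, the sample means are replaced by zero:

\[
\hat{\theta} = \frac{\sum_{i=1}^{n} \tilde{D}_i \tilde{Y}_i}{\sum_{i=1}^{n} \tilde{D}_i^2}.
\]

The two versions differ only to the extent that the residualized variables are not exactly mean-zero in the sample.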

To describe the ddml model setup or results in detail, you can use ddml describe with the relevant option (sample, learners, crossfit, estimates), or just describe them all with the all option:

. ddml describe, all
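The individual options named above can be used in the same way, for example to look at one part of the setup at a time:

```stata
* Describe one aspect of the model at a time
ddml describe, sample
ddml describe, learners
ddml describe, crossfit
ddml describe, estimates
```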