Getting started #

Before we get into stacking, let’s first use pystacked as a “regular” program for machine learning.

Gradient boosting #

We load the example data set and randomly split the data into a training and a test sample.

. clear all
. insheet using ///
                 https://statalasso.github.io/dta/housing.csv,  ///
                 clear comma
. set seed 789
. gen train=runiform()
. replace train=train>.75

As an example, we run pystacked with gradient boosting:

. pystacked medv crim-lstat if train, ///
          type(regress) method(gradboost)
Single base learner: no stacking done.

Stacking weights:
---------------------------------------
  Method         |      Weight
-----------------+---------------------
  gradboost      |      1.0000000

Python seed
Since pystacked uses Python, we also need to take the Python seed into account to ensure replicability. pystacked does this automatically for you. The default (pyseed(-1)) draws a number between 0 and 10^8 in Stata, which is then used as the Python seed. This way, you only need to deal with the Stata seed. For example, set seed 42 is sufficient, as the Python seed is generated automatically.
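
If you prefer to set the Python seed yourself, you can pass a value to pyseed() directly; the value below is just an arbitrary example:

. pystacked medv crim-lstat if train, ///
          type(regress) method(gradboost) pyseed(12345)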

We use the type option to indicate that we consider a regression task rather than a classification task. method(gradboost) selects gradient boosting with regression trees. We will later see that we can specify more than one learner in method().
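
For reference, listing more than one learner would look as follows (syntax only; we stick with a single learner for now):

. pystacked medv crim-lstat if train, ///
          type(regress) method(lassocv gradboost)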

The output above isn't particularly informative since, for now, we only consider one method. Yet, pystacked has fitted 100 boosted trees in the background. We can use predict in the usual Stata way to obtain predicted values:

. predict double ygb if !train
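
A quick, optional way to inspect the out-of-sample predictions is to summarize them alongside the outcome (output omitted):

. sum medv ygb if !train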

Options #

We can pass options to scikit-learn and fine-tune our gradient booster. For this, we use the cmdopt1() option. We need to use cmdopt1() because gradboost is the first (and only) method we are using with pystacked.
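
The numbering matters once we combine learners: with, say, method(lassocv gradboost), cmdopt1() would refer to lassocv and cmdopt2() to gradboost. The following is only a sketch of the syntax, not a specification we run here:

. pystacked medv crim-lstat if train, ///
       type(regress) method(lassocv gradboost) ///
       cmdopt2(learning_rate(.01))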

The options of each base learner are listed when you type:

. _pyparse, type(reg) method(gradboost) print
Default options: 
loss(squared_error) learning_rate(.1) n_estimators(100) subsample(1) 
criterion(friedman_mse) min_samples_split(2) min_samples_leaf(1) 
min_weight_fraction_leaf(0) max_depth(3) min_impurity_decrease(0) 
init(None) random_state(rng) max_features(None) alpha(.9) 
max_leaf_nodes(None) warm_start(False) validation_fraction(.1) 
n_iter_no_change(None) tol(.0001) ccp_alpha(0) 

Check the scikit-learn documentation to get more information about each option.

scikit-learn documentation
Please study the scikit-learn documentation carefully when using pystacked.

For demonstration, we try two things: (1) we reduce the learning rate from 0.1 to 0.01, and (2) we reduce the learning rate and, in addition, increase the number of trees to 1000.

. pystacked medv crim-lstat if train, ///
       type(regress) method(gradboost) ///
       cmdopt1(learning_rate(.01))
. predict double ygb2 if !train

. pystacked medv crim-lstat if train, ///
       type(regress) method(gradboost) ///
       cmdopt1(learning_rate(.01) n_estimators(1000))
. predict double ygb3 if !train

We can then compare the performance across the three models:

. gen double rgb_sq=(medv-ygb)^2
. gen double rgb2_sq=(medv-ygb2)^2
. gen double rgb3_sq=(medv-ygb3)^2

. sum rgb* if !train
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      rgb_sq |        361    14.19902    38.15371   .0000138   469.2815
     rgb2_sq |        361    28.66256    50.82527   .0019822   437.9858
     rgb3_sq |        361    13.38564    38.00065   8.70e-07   546.1962

Reducing the learning rate without increasing the number of trees deteriorates prediction performance; yet, if we also increase the number of trees, prediction performance improves slightly.
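
If you want to experiment further, you can combine these options with others from the list above, for example a larger tree depth. The following is only a sketch, not a tuned specification:

. pystacked medv crim-lstat if train, ///
       type(regress) method(gradboost) ///
       cmdopt1(learning_rate(.01) n_estimators(1000) max_depth(4))
. predict double ygb4 if !train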

Pipelines #

We can make use of pipelines to pre-process our predictors. This will become especially useful in the context of stacking when we want to pass 2nd-order polynomials to one method, but not the other (a sketch of this appears at the end of this section). Here, we use lasso with and without the poly2 pipeline:

. pystacked medv crim-lstat if train, ///
       type(regress) method(lassocv)  
. predict double ylasso1 if !train

. pystacked medv crim-lstat if train, ///
       type(regress) method(lassocv) ///
       pipe1(poly2)
. predict double ylasso2 if !train

. gen double rlasso1_sq=(medv-ylasso1)^2
. gen double rlasso2_sq=(medv-ylasso2)^2

. sum *sq if !train

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      rgb_sq |        361    14.19902    38.15371   .0000138   469.2815
     rgb2_sq |        361    28.66256    50.82527   .0019822   437.9858
     rgb3_sq |        361    13.38564    38.00065   8.70e-07   546.1962
  rlasso1_sq |        361    28.03915    76.37714   .0006696   897.5598
  rlasso2_sq |        361    22.85806    76.73137   4.81e-06   952.1228

The interactions and squared terms improve the performance of the lasso. However, gradient boosting performs much better in this application.
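
As a preview of the stacking sections that follow, pipelines can be applied to individual learners. The sketch below would pass the 2nd-order polynomial expansion only to the lasso (pipe1() refers to the first learner, analogous to cmdopt1()), while gradient boosting works with the raw predictors:

. pystacked medv crim-lstat if train, ///
       type(regress) method(lassocv gradboost) ///
       pipe1(poly2)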