Getting started #
Before we get into stacking, let’s first use pystacked as a “regular” program for machine learning.
Gradient boosting #
We load the example data set and randomly split the data into a training and a test sample.
. clear all
. insheet using ///
https://statalasso.github.io/dta/housing.csv, ///
clear comma
. set seed 789
. gen train=runiform()
. replace train=train>.75
As an example, we run pystacked with gradient boosting:
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost)
Single base learner: no stacking done.
Stacking weights:
---------------------------------------
Method | Weight
-----------------+---------------------
gradboost | 1.0000000
Python seed
Since pystacked uses Python, we also need to take the Python seed into account to ensure replicability. pystacked does this automatically for you. The default (pyseed(-1)) draws a number between 0 and 10^8 in Stata, which is then used as the Python seed. This way, you only need to deal with the Stata seed. For example, set seed 42 is sufficient, as the Python seed is generated automatically.
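If you prefer to fix the Python seed yourself, you can pass an explicit value to pyseed(). The call below is a minimal sketch; the value 123 is arbitrary and only illustrates the syntax:
. pystacked medv crim-lstat if train, ///
     type(regress) method(gradboost) pyseed(123)   // 123 is an arbitrary choice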
We use the type option to indicate that we consider a regression task rather than a classification task. method(gradboost) selects gradient boosting with regression trees. We will later see that we can specify more than one learner in method().
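As a preview of what comes later, listing several learners in method() is all that is needed to move from a single base learner to stacking; the combination below is just an illustration:
. pystacked medv crim-lstat if train, ///
     type(regress) method(gradboost lassocv)   // any combination of supported learners works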
The output isn’t particularly informative since, for now, we only consider one method. Yet, pystacked has fitted 100 boosted trees in the background. We can use predict in the Stata-typical way to obtain predicted values:
. predict double ygb if !train
Options #
We can pass options to scikit-learn and fine-tune our gradient booster. For this, we use the cmdopt1() option. We need to use cmdopt1() because gradboost is the first (and only) method we are using with pystacked.
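The number in cmdopt#() always refers to the position of the learner in method(). As a sketch, if we had listed the lasso first, the gradient-boosting options would go into cmdopt2() instead:
. pystacked medv crim-lstat if train, ///
     type(regress) method(lassocv gradboost) ///
     cmdopt2(learning_rate(.01))   // options for the second listed learner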
The options of each base learner are listed when you type:
. _pyparse, type(reg) method(gradboost) print
Default options:
loss(squared_error) learning_rate(.1) n_estimators(100) subsample(1)
criterion(friedman_mse) min_samples_split(2) min_samples_leaf(1)
min_weight_fraction_leaf(0) max_depth(3) min_impurity_decrease(0)
init(None) random_state(rng) max_features(None) alpha(.9)
max_leaf_nodes(None) warm_start(False) validation_fraction(.1)
n_iter_no_change(None) tol(.0001) ccp_alpha(0)
Check the scikit-learn documentation here to get more information about each option.
scikit-learn documentation
Please study the scikit-learn documentation carefully when using pystacked.
For demonstration, we try two things: (1) we reduce the learning rate from 0.1 to 0.01, and (2) we reduce the learning rate and, in addition, increase the number of trees to 1000.
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost) ///
cmdopt1(learning_rate(.01))
. predict double ygb2 if !train
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost) ///
cmdopt1(learning_rate(.01) n_estimators(1000))
. predict double ygb3 if !train
We can then compare the performance across the three models:
. gen double rgb_sq=(medv-ygb)^2
. gen double rgb2_sq=(medv-ygb2)^2
. gen double rgb3_sq=(medv-ygb3)^2
. sum rgb* if !train
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
rgb_sq | 361 14.19902 38.15371 .0000138 469.2815
rgb2_sq | 361 28.66256 50.82527 .0019822 437.9858
rgb3_sq | 361 13.38564 38.00065 8.70e-07 546.1962
Reducing the learning rate without increasing the number of trees worsens prediction performance; if we also increase the number of trees, we recover, and here even slightly improve on, the performance of the default settings.
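The same trade-off suggests further experiments along these lines; the following is only a sketch (results not shown, and the settings are arbitrary):
. pystacked medv crim-lstat if train, ///
     type(regress) method(gradboost) ///
     cmdopt1(learning_rate(.01) n_estimators(2000))   // arbitrary tuning values
. predict double ygb4 if !train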
Pipelines #
We can make use of pipelines to pre-process our predictors. This will become especially useful in the context of stacking when we want to pass 2nd-order polynomials to one method, but not the other. Here, we use the lasso with and without the poly2 pipeline:
. pystacked medv crim-lstat if train, ///
type(regress) method(lassocv)
. predict double ylasso1 if !train
. pystacked medv crim-lstat if train, ///
type(regress) method(lassocv) ///
pipe1(poly2)
. predict double ylasso2 if !train
. gen double rlasso1_sq=(medv-ylasso1)^2
. gen double rlasso2_sq=(medv-ylasso2)^2
. sum *sq if !train
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
rgb_sq | 361 14.19902 38.15371 .0000138 469.2815
rgb2_sq | 361 28.66256 50.82527 .0019822 437.9858
rgb3_sq | 361 13.38564 38.00065 8.70e-07 546.1962
rlasso1_sq | 361 28.03915 76.37714 .0006696 897.5598
rlasso2_sq | 361 22.85806 76.73137 4.81e-06 952.1228
The interactions and squared terms improve the performance of the lasso. However, gradient boosting performs much better in this application.
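This is the situation the per-learner pipelines are made for once we turn to stacking, which is covered later. In the sketch below (illustrative only), poly2 is applied to the lasso, the second listed learner, but not to gradient boosting:
. pystacked medv crim-lstat if train, ///
     type(regress) method(gradboost lassocv) ///
     pipe2(poly2)   // pipeline applied to the second learner only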