------------------------------------------------------------------------------------------------------------------------
help pystacked v0.4.8
------------------------------------------------------------------------------------------------------------------------
Title
pystacked -- Stata program for Stacking Regression
Overview
pystacked implements stacking regression (Wolpert, 1992) via scikit-learn's sklearn.ensemble.StackingRegressor and
sklearn.ensemble.StackingClassifier. Stacking is a way of combining multiple supervised machine learners (the
"base" or "level-0" learners) into a meta learner. The currently supported base learners are linear regression,
logit, lasso, ridge, elastic net, (linear) support vector machines, random forest, gradient boosting, and neural
nets (MLP).
pystacked can also be used with a single base learner and, thus, provides an easy-to-use API for scikit-learn's
machine learning algorithms.
pystacked requires Stata 16 or higher, a Python installation and scikit-learn (0.24 or higher).
pystacked has been tested with scikit-learn 0.24, 1.0, 1.1.0, 1.1.1 and 1.1.2. See here and here for how to set
up Python for Stata on your system.
Contents
Syntax overview
Syntax 1
Syntax 2
Other options
Postestimation and prediction options
Stacking
Supported base learners
Base learners: Options
Pipelines
Learner-specific predictors
Example Stacking Regression
Example Stacking Classification
Installation
Misc (references, contact, etc.)
Syntax overview
There are two alternative syntaxes. The first syntax is:
pystacked depvar predictors [if exp] [in range] [, methods(string) cmdopt1(string) cmdopt2(string) ...
pipe1(string) pipe2(string) ... xvars1(varlist) xvars2(varlist) ... otheropts ]
The second syntax is:
pystacked depvar predictors || method(string) opt(string) pipeline(string) xvars(varlist) || method(string)
opt(string) pipeline(string) xvars(varlist) || ... || [if exp] [in range] [, otheropts ]
The first syntax uses methods(string) to select base learners, where string is a list of base learners. Options
are passed on to base learners via cmdopt1(string), cmdopt2(string) to cmdopt10(string). That is, up to 10 base
learners can be specified and options are passed on in the order in which they appear in methods(string) (see
Command options). Likewise, the pipe*(string) option can be used for pre-processing predictors within Python on
the fly (see Pipelines). Furthermore, xvars*(varlist) allows you to specify a learner-specific varlist of predictors.
The second syntax imposes no limit on the number of base learners (aside from the increasing computational
complexity). Base learners are added before the comma using method(string) together with opt(string) and separated
by "||".
Syntax 1
Option Description
------------------------------------------------------------------------------------------------------------------
methods(string) a list of base learners, defaults to "ols lassocv gradboost" for regression and "logit
lassocv gradboost" for classification; see Base learners.
cmdopt*(string) options passed to the base learners, see Command options.
pipe*(string) pipelines passed to the base learners, see Pipelines. Regularized linear learners use the
stdscaler pipeline by default, which standardizes the predictors. To suppress this, use
nostdscaler. For other learners, there is no default pipeline.
xvars*(varlist) overwrites the default list of predictors. That is, you can specify learner-specific lists
of predictors. See here.
------------------------------------------------------------------------------------------------------------------
Note: * is replaced with 1 to 10. The number refers to the order given in methods(string).
Syntax 2
Option Description
------------------------------------------------------------------------------------------------------------------
method(string) a base learner, see Base learners.
opt(string) options, see Command options.
pipeline(string) pipelines passed to the base learners, see Pipelines. Regularized linear learners use the
stdscaler pipeline by default, which standardizes the predictors. To suppress this, use
nostdscaler. For other learners, there is no default pipeline.
xvars(varlist) overwrites the default list of predictors. That is, you can specify learner-specific lists
of predictors. See here.
------------------------------------------------------------------------------------------------------------------
Other options
Option Description
------------------------------------------------------------------------------------------------------------------
type(string) reg(ress) for regression problems or class(ify) for classification problems. The default
is regression.
finalest(string) final estimator used to combine base learners. The default is non-negative least squares
without an intercept and the additional constraint that weights sum to 1 (nnls1).
Alternatives are nnls0 (non-negative least squares without intercept without the
sum-to-one constraint), singlebest (use base learner with minimum MSE), ols (ordinary
least squares) or ridge for (logistic) ridge, which is the sklearn default. For more
information, see here.
nosavepred do not save predicted values (do not use if predict is used after estimation)
nosavebasexb do not save predicted values of each base learner (do not use if predict with basexb is
used after estimation)
njobs(int) number of jobs for parallel computing. The default is 0 (no parallelization), -1 uses all
available CPUs, -2 uses all CPUs minus 1.
backend(string) joblib backend used for parallelization; the default is 'loky' under Linux/MacOS and
'threading' under Windows. See here for more information.
folds(int) number of folds used for cross-validation (not relevant for voting); default is 5. Ignored
if foldvar(varname) is specified.
foldvar(varname) integer fold variable for cross-validation.
bfolds(int) number of folds used for base learners that use cross-validation (e.g. lassocv); default is
5.
norandom folds are created using the ordering of the data.
noshuffle cross-validation folds for base learners that use cross-validation (e.g. lassocv) are based
on the ordering of the data.
sparse converts predictor matrix to a sparse matrix. This will only lead to speed improvements if
the predictor matrix is sufficiently sparse. Not all learners support sparse matrices and
not all learners will benefit from sparse matrices in the same way. You can also use the
sparse pipeline to use sparse matrices for some learners, but not for others.
pyseed(int) set the Python seed. Note that since pystacked uses Python, we also need to set the Python
seed to ensure replicability. Three options: 1) pyseed(-1) draws a number between 0 and
10^8 in Stata which is then used as a Python seed. This way, you only need to deal with
the Stata seed. For example, set seed 42 is sufficient, as the Python seed is generated
automatically. 2) Setting pyseed(x) with any positive integer x allows you to control the
Python seed directly. 3) pyseed(0) sets the seed to None in Python. The default is
pyseed(-1).
------------------------------------------------------------------------------------------------------------------
Voting Description
------------------------------------------------------------------------------------------------------------------
voting use voting regressor (ensemble.VotingRegressor) or voting classifier
(ensemble.VotingClassifier); see here for a brief explanation.
votetype(string) type of voting classifier: hard (default) or soft
voteweights(numlist) positive weights used for voting regression/classification. The length of numlist should
be the number of base learners - 1. The last weight is calculated to ensure that
sum(weights)=1.
------------------------------------------------------------------------------------------------------------------
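As an illustration of the voting options (a sketch with placeholder variables; y is a binary outcome and x1-x57
are predictors), the following requests a soft-voting classifier with four parallel jobs and a fixed Python seed:
. pystacked y x1-x57, type(class) methods(logit rf gradboost) voting votetype(soft) njobs(4) pyseed(123)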
Postestimation and prediction options
Postestimation tables
After estimation, pystacked can report a table of in-sample (both cross-validated and full-sample-refitted) and,
optionally, out-of-sample (holdout sample) performance for both the stacking regression and the base learners.
For regression problems, the table reports the root MSPE (mean squared prediction error); for classification
problems, a confusion matrix is reported. The default holdout sample used for out-of-sample performance with the
holdout option is all observations not included in the estimation. Alternatively, the user can specify the
holdout sample explicitly using the syntax holdout(varname). The table can be requested postestimation as below,
or as part of the pystacked estimation command.
Table syntax:
pystacked [, table holdout[(varname)] ]
Postestimation graphs
pystacked can also report graphs of in-sample and, optionally, out-of-sample (holdout sample) performance for both
the stacking regression and the base learners. For regression problems, the graphs compare predicted vs actual
values of depvar. For classification problems, the default is to report ROC curves; optionally, histograms of
predicted probabilities are reported. As with the table option, the default holdout sample used for out-of-sample
performance is all observations not included in the estimation, but the user can instead specify the holdout
sample explicitly. The graphs can be requested postestimation as below, or as part of the pystacked estimation
command.
The graph option on its own reports the graphs using pystacked's default settings. Because graphs are produced
using Stata's twoway, roctab and histogram commands, the user can control how the combined graph
(graph(options)) and the individual learner graphs (lgraph(options)) appear by passing options to these commands.
Graph syntax:
pystacked [, graph[(options)] lgraph[(options)] histogram holdout[(varname)] ]
Prediction
To get stacking predicted values:
predict [type] newname [if exp] [in range] [, pr class xb resid ]
To get fitted values for each base learner:
predict [type] stub [if exp] [in range] [, basexb cvalid ]
Option Description
------------------------------------------------------------------------------------------------------------------
xb predicted value; the default for regression
pr predicted probability; the default for classification
class predicted class
resid residuals
basexb predicted values for each base learner (default = use base learners re-fitted on full
estimation sample)
cvalid cross-validated predicted values. Currently only supported if combined with basexb.
------------------------------------------------------------------------------------------------------------------
Note: Predicted values (in- and out-of-sample) are calculated and stored in Python memory when pystacked is run.
predict pulls the predicted values from Python memory and saves them in Stata memory. This means that no changes
should be made to the data in Stata memory between the pystacked call and the predict call. If changes to the data
set are made, predict will return an error.
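For example, a typical workflow (sketched here with the Boston housing data used in the examples below) is to run
pystacked and request predictions immediately afterwards, before the data in memory are modified:
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost)
. predict double yhat, xb
. predict double res, resid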
Stacking
Stacking is a way of combining cross-validated predictions from multiple base ("level-0") learners into a final
prediction. A final estimator ("level-1") is used to combine the base predictions.
The default final estimator for stacking regression is non-negative least squares (NNLS) without an intercept and
with the constraint that weights sum to one. Note that in this respect we deviate from the scikit-learn default
and follow the recommendation in Breiman (1996) and Hastie et al. (2009, p. 290). The scikit-learn defaults for
the final estimator are ridge regression for stacking regression and logistic ridge for classification tasks. To
use the scikit-learn default, use finalest(ridge). pystacked also supports ordinary (unconstrained) least squares
as the final estimator (finalest(ols)). Finally, singlebest uses the single base learner that exhibits the
smallest cross-validated mean squared error.
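For example, to select the single best-performing base learner instead of a weighted combination (a sketch using
the Boston housing data introduced below):
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost) finalest(singlebest)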
An alternative to stacking is voting. Voting regression uses the weighted average of base learners to form
predictions. By default, the unweighted average is used, but the user can specify weights using
voteweights(numlist). Voting classifier uses a majority rule by default (hard voting). An alternative is soft
voting where the (weighted) probabilities are used to form the final prediction.
Supported base learners
The following base learners are supported:
Base learners
ols Linear regression (regression only)
logit Logistic regression (classification only)
lassoic Lasso with penalty chosen by AIC/BIC (regression only)
lassocv Lasso with cross-validated penalty
ridgecv Ridge with cross-validated penalty
elasticcv Elastic net with cross-validated penalty
svm Support vector machines
gradboost Gradient boosting
rf Random forest
linsvm Linear SVM
nnet Neural net
The base learners can be chosen using the methods(string) option (Syntax 1), e.g. methods(lassocv gradboost nnet),
or the method(string) option (Syntax 2).
Please see links in the next section for more information on each method.
Base learners: Options
Options can be passed to the base learners via cmdopt*(string) (Syntax 1) or opt(string) (Syntax 2). The defaults
are adopted from scikit-learn.
To see the default options of each base learner, simply click on the "Show options" links below. To see which
alternative settings are allowed, please see the scikit-learn documentation linked below. We strongly recommend
that you read the scikit-learn documentation carefully.
The option showoptions shows the options passed on to Python. We recommend verifying that options have been
passed to Python as intended.
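For instance, to inspect the settings passed to scikit-learn's gradient booster (a sketch; learning_rate() mirrors
the corresponding scikit-learn parameter and is only an illustrative choice):
. pystacked medv crim-lstat, type(regress) methods(gradboost) cmdopt1(learning_rate(.05)) showoptions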
Linear regression
Method: ols
Type: reg
Documentation: linear_model.LinearRegression
Show options
Logistic regression
Method: logit
Type: class
Documentation: linear_model.LogisticRegression
Show options
Penalized regression with information criteria
Method: lassoic
Type: reg
Documentation: linear_model.LassoLarsIC
Show options
Penalized regression with cross-validation
Methods: lassocv, ridgecv and elasticcv
Type: reg
Documentation: linear_model.ElasticNetCV
Show lasso options
Show ridge options
Show elastic net options
Penalized logistic regression with cross-validation
Methods: lassocv, ridgecv and elasticcv
Type: class
Documentation: linear_model.LogisticRegressionCV
Show lasso options
Show ridge options
Show elastic net options
Random forest classifier
Method: rf
Type: class
Documentation: ensemble.RandomForestClassifier
Show options
Random forest regressor
Method: rf
Type: reg
Documentation: ensemble.RandomForestRegressor
Show options
Gradient boosted classification trees
Method: gradboost
Type: class
Documentation: ensemble.GradientBoostingClassifier
Show options
Gradient boosted regression trees
Method: gradboost
Type: reg
Documentation: ensemble.GradientBoostingRegressor
Show options
Linear SVM (SVR)
Method: linsvm
Type: reg
Documentation: svm.LinearSVR
Show options
SVM (SVR)
Method: svm
Type: reg
Documentation: svm.SVR
Show options
SVM (SVC)
Method: svm
Type: class
Documentation: svm.SVC
Show options
Neural net classifier (Multi-layer Perceptron)
Method: nnet
Type: class
Documentation: neural_network.MLPClassifier
Show options
Neural net regressor (Multi-layer Perceptron)
Method: nnet
Type: reg
Documentation: neural_network.MLPRegressor
Show options
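As a sketch of how options map to scikit-learn parameters: option names mirror the parameter names listed in the
documentation above, with values given in parentheses. For example, assuming that several options can be listed
within one cmdopt*() (verify with showoptions), the following grows 500 random forest trees of depth at most 6:
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf lassocv) cmdopt1(n_estimators(500) max_depth(6))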
Learner-specific predictors
By default, pystacked uses the same set of predictors for all base learners. This is often not desirable. For
example, when using linear machine learners such as the lasso, it is recommended to create interactions. There are
two methods to allow for learner-specific sets of predictors:
1) Pipelines, discussed in the next section, can be used to create polynomials on the fly.
2) The xvars*(varlist) option allows you to specify predictors for a specific learner. If xvars*(varlist) is missing
for a specific learner, the default predictor list is used.
Pipelines
Scikit-learn uses pipelines to pre-process input data on the fly. Pipelines can be used to impute missing
observations or to create transformations of predictors such as interactions and polynomials.
The following pipelines are currently supported:
Pipelines
stdscaler StandardScaler()
stdscaler0 StandardScaler(with_mean=False)
sparse SparseTransformer()
onehot OneHotEncoder()
minmaxscaler MinMaxScaler()
medianimputer SimpleImputer(strategy='median')
knnimputer KNNImputer()
poly2 PolynomialFeatures(degree=2)
poly3 PolynomialFeatures(degree=3)
Pipelines can be passed to the base learners via pipe*(string) (Syntax 1) or pipeline(string) (Syntax 2).
stdscaler0 is intended for sparse matrices, since stdscaler will make a sparse matrix dense.
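For instance, to suppress the default standardization of a regularized linear learner (a sketch; the lasso is the
first learner here, so its pipeline is set via pipe1()):
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv gradboost) pipe1(nostdscaler)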
Example using Boston Housing data (Harrison and Rubinfeld, 1978)
Data set
The data set is available from the UCI Machine Learning Repository. The following variables are included in the
data set of 506 observations:
Predictors
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion Black by town
LSTAT % lower status of the population
Outcome
MEDV Median value of owner-occupied homes in $1000's
Getting started
Load housing data.
. insheet using https://statalasso.github.io/dta/housing.csv, clear
Stacking regression with lasso, random forest and gradient boosting.
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost)
The weights determine how much each base learner contributes to the final stacking prediction.
Request the root MSPE table (in-sample only):
. pystacked, table
Re-estimate using the first 400 observations, and request the root MSPE table. RMSPEs for the in-sample
predictions (both refitted and cross-validated) and for the default holdout sample (all unused observations) are reported:
. pystacked medv crim-lstat if _n<=400, type(regress) pyseed(123) methods(lassocv rf gradboost)
. pystacked, table holdout
Graph predicted vs actual for the holdout sample:
. pystacked, graph holdout
Storing the predicted values:
. predict double yhat, xb
Storing the cross-validated predicted values:
. predict double yhat_cv, xb cvalid
We can also save the predicted values of each base learner:
. predict double yhat, basexb
Learner-specific predictors (Syntax 1)
pystacked allows the use of different sets of predictors for each base learner. For example, linear estimators
might perform better if interactions are provided as inputs. Here, we use interactions and 2nd-order polynomials
for the lasso, but not for the other base learners.
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf)
        xvars2(c.(crim-lstat)##c.(crim-lstat))
The same can be achieved using pipelines which create polynomials on-the-fly in Python.
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe2(poly2)
Learner-specific predictors (Syntax 2)
We demonstrate the same using the alternative syntax, which is often more convenient:
. pystacked medv crim-lstat || m(ols) || m(lassocv) xvars(c.(crim-lstat)##c.(crim-lstat)) || m(rf) || ,
        type(regress) pyseed(123)
. pystacked medv crim-lstat || m(ols) || m(lassocv) pipe(poly2) || m(rf) || , type(regress) pyseed(123)
Options of base learners (Syntax 1)
We can pass options to the base learners using cmdopt*(string). In this example, we change the maximum tree depth
for the random forest. Since random forest is the third base learner, we use cmdopt3(max_depth(3)).
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2)
cmdopt3(max_depth(3))
You can verify that the option has been passed to Python correctly:
. di e(pyopt3)
Options of base learners (Syntax 2)
The same results as above can be achieved using the alternative syntax, which imposes no limit on the number of
base learners.
. pystacked medv crim-lstat || m(ols) pipe(poly2) || m(lassocv) pipe(poly2) || m(rf) opt(max_depth(3)) ,
type(regress) pyseed(123)
Single base learners
You can use pystacked with a single base learner. In this example, we are using a conventional random forest:
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf)
Voting
You can also use pre-defined weights. Here, we assign weights of 0.5 to OLS, .1 to the lasso and, implicitly, .4
to the random forest.
. pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2)
voting voteweights(.5 .1)
Classification Example using Spam data
Data set
For demonstration we consider the Spambase Data Set from the UCI Machine Learning Repository. The data includes
4,601 observations and 57 variables. The aim is to predict whether an email is spam (i.e., unsolicited commercial
e-mail) or not. Each observation corresponds to one email.
Predictors
v1-v48 percentage of words in the e-mail that match a specific word, i.e. 100 * (number of times the word
appears in the e-mail) divided by total number of words in e-mail. To see which word each predictor
corresponds to, see link below.
v49-v54 percentage of characters in the e-mail that match a specific character, i.e. 100 * (number of times
the character appears in the e-mail) divided by total number of characters in e-mail. To see which
character each predictor corresponds to, see link below.
v55 average length of uninterrupted sequences of capital letters
v56 length of longest uninterrupted sequence of capital letters
v57 total number of capital letters in the e-mail
Outcome
v58 denotes whether the e-mail was considered spam (1) or not (0).
For more information about the data see https://archive.ics.uci.edu/ml/datasets/spambase.
Load spam data.
. insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma
Throughout this example, we add the option njobs(4), which enables parallelization with 4 cores.
We consider three base learners: logit, random forest and gradient boosting:
. pystacked v58 v1-v57, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)
Out-of-sample classification.
As the data is ordered by outcome, we first shuffle the data randomly.
. set seed 42
. gen u = runiform()
. sort u
Estimation on the first 2000 observations.
. pystacked v58 v1-v57 if _n<=2000, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)
We can get both the predicted class and the predicted probabilities:
. predict spam, class
. predict spam_p, pr
Confusion matrices: first in-sample only, then both in- and out-of-sample.
. pystacked, table
. pystacked, table holdout
Confusion matrix for a specified holdout sample.
. gen h = _n>3000
. pystacked, table holdout(h)
ROC curves for the default holdout sample. Specify a subtitle for the combined graph.
. pystacked, graph(subtitle(Spam data)) holdout
Predicted probabilities (hist option) for the default holdout sample. Specify the number of bins for the individual
learner graphs.
. pystacked, graph hist lgraph(bin(20)) holdout
Installation
pystacked requires Stata 16 or higher, a Python installation and scikit-learn (0.24 or higher). See
this help file, this Stata blog entry and this YouTube video for how to set up Python on your system. Installing
Anaconda is in most cases the easiest way of installing Python, including all required packages.
You can check your scikit-learn version using:
. python: import sklearn
. python: sklearn.__version__
Updating scikit-learn: If you use Anaconda, update scikit-learn through your Anaconda Python distribution. Make
sure that you have linked Stata with the correct Python installation using python query.
If you use pip, you can update scikit-learn by typing "<Python path> -m pip install -U scikit-learn" into the
terminal, or directly in Stata:
. shell <Python path> -m pip install -U scikit-learn
Note that you might need to restart Stata for changes to your Python installation to take effect.
For further information, see https://scikit-learn.org/stable/install.html.
To install/update pystacked, type
. net install pystacked, from(https://raw.githubusercontent.com/aahrens1/pystacked/main) replace
References
Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1), 49-64.
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental
Economics and Management, 5, 81-102.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning: data mining, inference,
and prediction. Springer Science & Business Media.
Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
https://doi.org/10.1016/S0893-6080(05)80023-1
Contact
If you encounter an error, contact us via email. If you have a question, you can also post on Statalist (please
tag @Achim Ahrens).
Acknowledgements
pystacked took some inspiration from Michael Droste's pylearn, which implements other scikit-learn programs for
Stata. Thanks to Jan Ditzen for testing an early version of the program. We also thank Brigham Frandsen and Marco
Alfano for feedback. All remaining errors are our own.
Citation
Please also cite scikit-learn; see https://scikit-learn.org/stable/about.html.
Authors
Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
achim.ahrens@gess.ethz.ch
Christian B. Hansen, University of Chicago, USA
Christian.Hansen@chicagobooth.edu
Mark E Schaffer, Heriot-Watt University, UK
Help file