--------------------------------------------------------------------------
help pystacked                                                      v0.4.8
--------------------------------------------------------------------------

Title

pystacked -- Stata program for Stacking Regression

Overview

pystacked implements stacking regression (Wolpert, 1992) via scikit-learn's
sklearn.ensemble.StackingRegressor and sklearn.ensemble.StackingClassifier.
Stacking is a way of combining multiple supervised machine learners (the
"base" or "level-0" learners) into a meta learner. The currently supported
base learners are linear regression, logit, lasso, ridge, elastic net,
(linear) support vector machines, gradient boosting, and neural nets (MLP).

pystacked can also be used with a single base learner and thus provides an
easy-to-use API for scikit-learn's machine learning algorithms.

pystacked requires Stata 16 or higher, a Python installation and scikit-learn
(0.24 or higher). pystacked has been tested with scikit-learn 0.24, 1.0,
1.1.0, 1.1.1 and 1.1.2. See here and here for how to set up Python for Stata
on your system.

Contents

    Syntax overview
    Syntax 1
    Syntax 2
    Other options
    Postestimation and prediction options
    Stacking
    Supported base learners
    Base learners: Options
    Pipelines
    Learner-specific predictors
    Example Stacking Regression
    Example Stacking Classification
    Installation
    Misc (references, contact, etc.)

Syntax overview

There are two alternative syntaxes. The first syntax is:

    pystacked depvar predictors [if exp] [in range]
        [, methods(string)
        cmdopt1(string) cmdopt2(string) ...
        pipe1(string) pipe2(string) ...
        xvars1(varlist) xvars2(varlist) ...
        otheropts ]

The second syntax is:

    pystacked depvar predictors
        || method(string) opt(string) pipeline(string) xvars(varlist)
        || method(string) opt(string) pipeline(string) xvars(varlist)
        || ... ||
        [if exp] [in range]
        [, otheropts ]

The first syntax uses methods(string) to select base learners, where string is
a list of base learners. Options are passed on to base learners via
cmdopt1(string), cmdopt2(string) to cmdopt10(string). That is, up to 10 base
learners can be specified, and options are passed on in the order in which
they appear in methods(string) (see Command options). Likewise, the
pipe*(string) option can be used for pre-processing predictors within Python
on the fly (see Pipelines). Furthermore, xvars*(varlist) allows you to specify
a learner-specific varlist of predictors.

The second syntax imposes no limit on the number of base learners (aside from
the increasing computational complexity). Base learners are added before the
comma using method(string) together with opt(string), separated by "||".

Syntax 1

    Option              Description
    --------------------------------------------------------------------------
    methods(string)     a list of base learners, defaults to "ols lassocv
                        gradboost" for regression and "logit lassocv
                        gradboost" for classification; see Base learners.
    cmdopt*(string)     options passed to the base learners, see Command
                        options.
    pipe*(string)       pipelines passed to the base learners, see Pipelines.
                        Regularized linear learners use the stdscaler pipeline
                        by default, which standardizes the predictors. To
                        suppress this, use nostdscaler. For other learners,
                        there is no default pipeline.
    xvars*(varlist)     overwrites the default list of predictors. That is,
                        you can specify learner-specific lists of predictors.
                        See here.
    --------------------------------------------------------------------------
    Note: * is replaced with 1 to 10. The number refers to the order given in
    methods(string).
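For illustration, a minimal sketch of Syntax 1 (it uses the Boston housing
variables medv and crim-lstat from the examples below): the numbering of
cmdopt*() and pipe*() follows the order of methods(), so here pipe2(poly2)
applies to lassocv and cmdopt3(max_depth(3)) applies to rf.

    . pystacked medv crim-lstat, methods(ols lassocv rf) pipe2(poly2) cmdopt3(max_depth(3))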
Syntax 2

    Option              Description
    --------------------------------------------------------------------------
    method(string)      a base learner, see Base learners.
    opt(string)         options, see Command options.
    pipeline(string)    pipelines passed to the base learners, see Pipelines.
                        Regularized linear learners use the stdscaler pipeline
                        by default, which standardizes the predictors. To
                        suppress this, use nostdscaler. For other learners,
                        there is no default pipeline.
    xvars(varlist)      overwrites the default list of predictors. That is,
                        you can specify learner-specific lists of predictors.
                        See here.
    --------------------------------------------------------------------------

Other options

    Option              Description
    --------------------------------------------------------------------------
    type(string)        reg(ress) for regression problems or class(ify) for
                        classification problems. The default is regression.
    finalest(string)    final estimator used to combine base learners. The
                        default is non-negative least squares without an
                        intercept and with the additional constraint that
                        weights sum to 1 (nnls1). Alternatives are nnls0
                        (non-negative least squares without an intercept and
                        without the sum-to-one constraint), singlebest (use
                        the base learner with minimum MSE), ols (ordinary
                        least squares) or ridge for (logistic) ridge, which is
                        the sklearn default. For more information, see here.
    nosavepred          do not save predicted values (do not use if predict is
                        used after estimation)
    nosavebasexb        do not save predicted values of each base learner (do
                        not use if predict with basexb is used after
                        estimation)
    njobs(int)          number of jobs for parallel computing. The default is
                        0 (no parallelization), -1 uses all available CPUs, -2
                        uses all CPUs minus 1.
    backend(string)     joblib backend used for parallelization; the default
                        is 'loky' under Linux/MacOS and 'threading' under
                        Windows. See here for more information.
    folds(int)          number of folds used for cross-validation (not
                        relevant for voting); default is 5. Ignored if
                        foldvar(varname) is specified.
    foldvar(varname)    integer fold variable for cross-validation.
    bfolds(int)         number of folds used for base learners that use
                        cross-validation (e.g. lassocv); default is 5.
    norandom            folds are created using the ordering of the data.
    noshuffle           cross-validation folds for base learners that use
                        cross-validation (e.g. lassocv) are based on the
                        ordering of the data.
    sparse              converts the predictor matrix to a sparse matrix. This
                        will only lead to speed improvements if the predictor
                        matrix is sufficiently sparse. Not all learners
                        support sparse matrices and not all learners will
                        benefit from sparse matrices in the same way. You can
                        also use the sparse pipeline to use sparse matrices
                        for some learners, but not for others.
    pyseed(int)         set the Python seed. Note that since pystacked uses
                        Python, we also need to set the Python seed to ensure
                        replicability. There are three options: 1) pyseed(-1)
                        draws a number between 0 and 10^8 in Stata which is
                        then used as a Python seed. This way, you only need to
                        deal with the Stata seed. For example, set seed 42 is
                        sufficient, as the Python seed is generated
                        automatically. 2) Setting pyseed(x) with any positive
                        integer x allows you to control the Python seed
                        directly. 3) pyseed(0) sets the seed to None in
                        Python. The default is pyseed(-1).
    --------------------------------------------------------------------------
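A minimal sketch combining Syntax 2 with a few of these options (Boston
housing variables from the examples below; the option values are purely
illustrative): pyseed(-1), the default, derives the Python seed from the Stata
seed, njobs(-1) uses all available CPUs, and folds(10) uses 10 cross-validation
folds.

    . set seed 42
    . pystacked medv crim-lstat || m(lassocv) || m(rf) || , type(regress) pyseed(-1) njobs(-1) folds(10)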
    Voting              Description
    --------------------------------------------------------------------------
    voting              use a voting regressor (ensemble.VotingRegressor) or
                        voting classifier (ensemble.VotingClassifier); see
                        here for a brief explanation.
    votetype(string)    type of voting classifier: hard (default) or soft.
    voteweights(numlist) positive weights used for voting
                        regression/classification. The length of numlist
                        should be the number of base learners minus 1. The
                        last weight is calculated to ensure that
                        sum(weights)=1.
    --------------------------------------------------------------------------

Postestimation and prediction options

Postestimation tables

After estimation, pystacked can report a table of in-sample (both
cross-validated and full-sample-refitted) and, optionally, out-of-sample
(holdout sample) performance for both the stacking regression and the base
learners. For regression problems, the table reports the root MSPE (square
root of the mean squared prediction error); for classification problems, a
confusion matrix is reported.

The default holdout sample used for out-of-sample performance with the holdout
option is all observations not included in the estimation. Alternatively, the
user can specify the holdout sample explicitly using the syntax
holdout(varname). The table can be requested postestimation as below, or as
part of the pystacked estimation command.

Table syntax:

    pystacked [, table holdout[(varname)] ]

Postestimation graphs

pystacked can also report graphs of in-sample and, optionally, out-of-sample
(holdout sample) performance for both the stacking regression and the base
learners. For regression problems, the graphs compare predicted vs actual
values of depvar. For classification problems, the default is to report ROC
curves; optionally, histograms of predicted probabilities are reported.

As with the table option, the default holdout sample used for out-of-sample
performance is all observations not included in the estimation, but the user
can instead specify the holdout sample explicitly. The graphs can be requested
postestimation as below, or as part of the pystacked estimation command.

The graph option on its own reports the graphs using pystacked's default
settings. Because graphs are produced using Stata's twoway, roctab and
histogram commands, the user can control how either the combined graph
(graph(options)) or the individual learner graphs (lgraph(options)) appear by
passing options to these commands.

Graph syntax:

    pystacked [, graph[(options)] lgraph[(options)] histogram holdout[(varname)] ]

Prediction

To get stacking predicted values:

    predict [type] newname [if exp] [in range] [, pr class xb resid ]

To get fitted values for each base learner:

    predict [type] stub [if exp] [in range] [, basexb cvalid ]

    Option              Description
    --------------------------------------------------------------------------
    xb                  predicted value; the default for regression
    pr                  predicted probability; the default for classification
    class               predicted class
    resid               residuals
    basexb              predicted values for each base learner (default = use
                        base learners re-fitted on the full estimation sample)
    cvalid              cross-validated predicted values. Currently only
                        supported if combined with basexb.
    --------------------------------------------------------------------------
    Note: Predicted values (in- and out-of-sample) are calculated and stored
    in Python memory when pystacked is run. predict pulls the predicted values
    from Python memory and saves them in Stata memory. This means that no
    changes to the data in Stata memory should be made between the pystacked
    call and the predict call. If changes to the data set are made, predict
    will return an error.
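A minimal sketch of the prediction syntax, assuming the Boston housing data
from the examples below is in memory (the variable names yhat, yres and ybase
are illustrative):

    . pystacked medv crim-lstat, methods(lassocv rf gradboost) pyseed(123)
    . predict double yhat, xb
    . predict double yres, resid
    . predict double ybase, basexb

The last call stores the fitted values of each base learner, one new variable
per learner, with names formed from the stub ybase.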
Stacking

Stacking is a way of combining cross-validated predictions from multiple base
("level-0") learners into a final prediction. A final estimator ("level-1") is
used to combine the base predictions.

The default final estimator for stacking regression is non-negative least
squares (NNLS) without an intercept and with the constraint that weights sum
to one. Note that in this respect we deviate from the scikit-learn default and
follow the recommendation in Breiman (1996) and Hastie et al. (2009, p. 290).
The scikit-learn defaults for the final estimator are ridge regression for
stacking regression and logistic ridge for classification tasks. To use the
scikit-learn default, use finalest(ridge). pystacked also supports ordinary
(unconstrained) least squares as the final estimator (finalest(ols)). Finally,
singlebest uses the single base learner that exhibits the smallest
cross-validated mean squared error.

An alternative to stacking is voting. Voting regression uses the weighted
average of base learners to form predictions. By default, the unweighted
average is used, but the user can specify weights using voteweights(numlist).
The voting classifier uses majority rule by default (hard voting). An
alternative is soft voting, where the (weighted) probabilities are used to
form the final prediction.

Supported base learners

The following base learners are supported:

    Base learners
        ols             Linear regression (regression only)
        logit           Logistic regression (classification only)
        lassoic         Lasso with penalty chosen by AIC/BIC (regression only)
        lassocv         Lasso with cross-validated penalty
        ridgecv         Ridge with cross-validated penalty
        elasticcv       Elastic net with cross-validated penalty
        svm             Support vector machines
        gradboost       Gradient boosting
        rf              Random forest
        linsvm          Linear SVM
        nnet            Neural net

The base learners can be chosen using methods(string) (Syntax 1), e.g.
methods(lassocv gradboost nnet), or method(string) (Syntax 2). Please see the
links in the next section for more information on each method.

Base learners: Options

Options can be passed to the base learners via cmdopt*(string) (Syntax 1) or
opt(string) (Syntax 2). The defaults are adopted from scikit-learn. To see the
default options of each base learner, simply click on the "Show options" links
below. To see which alternative settings are allowed, please see the
scikit-learn documentation linked below. We strongly recommend that you read
the scikit-learn documentation carefully.

The option showoptions shows the options passed on to Python. We recommend
verifying that options have been passed to Python as intended.
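For example, a small sketch of passing a learner option and verifying it
(Boston housing variables from the examples below; learning_rate is a
scikit-learn GradientBoostingRegressor option and the value 0.05 is purely
illustrative; e(pyopt2) is assumed to hold the options of the second learner,
by analogy with e(pyopt3) in the examples below):

    . pystacked medv crim-lstat, methods(lassocv gradboost) cmdopt2(learning_rate(0.05)) showoptions
    . di e(pyopt2)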
    Linear regression
        Method: ols
        Type: reg
        Documentation: linear_model.LinearRegression
        Show options

    Logistic regression
        Method: logit
        Type: class
        Documentation: linear_model.LogisticRegression
        Show options

    Penalized regression with information criteria
        Method: lassoic
        Type: reg
        Documentation: linear_model.LassoLarsIC
        Show options

    Penalized regression with cross-validation
        Methods: lassocv, ridgecv and elasticcv
        Type: reg
        Documentation: linear_model.ElasticNetCV
        Show lasso options / Show ridge options / Show elastic net options

    Penalized logistic regression with cross-validation
        Methods: lassocv, ridgecv and elasticcv
        Type: class
        Documentation: linear_model.LogisticRegressionCV
        Show lasso options / Show ridge options / Show elastic net options

    Random forest classifier
        Method: rf
        Type: class
        Documentation: ensemble.RandomForestClassifier
        Show options

    Random forest regressor
        Method: rf
        Type: reg
        Documentation: ensemble.RandomForestRegressor
        Show options

    Gradient boosted regression trees
        Method: gradboost
        Type: reg
        Documentation: ensemble.GradientBoostingRegressor
        Show options

    Linear SVM (SVR)
        Method: linsvm
        Type: reg
        Documentation: svm.LinearSVR
        Show options

    SVM (SVR)
        Method: svm
        Type: reg
        Documentation: svm.SVR
        Show options

    SVM (SVC)
        Method: svm
        Type: class
        Documentation: svm.SVC
        Show options

    Neural net classifier (Multi-layer Perceptron)
        Method: nnet
        Type: class
        Documentation: sklearn.neural_network.MLPClassifier
        Show options

    Neural net regressor (Multi-layer Perceptron)
        Method: nnet
        Type: reg
        Documentation: sklearn.neural_network.MLPRegressor
        Show options

Learner-specific predictors

By default, pystacked uses the same set of predictors for all base learners.
This is often not desirable. For example, when using linear machine learners
such as the lasso, it is recommended to create interactions. There are two
methods to allow for learner-specific sets of predictors:

    1) Pipelines, discussed in the next section, can be used to create
       polynomials on the fly.
    2) The xvars*(varlist) option allows you to specify predictors for a
       specific learner. If xvars*(varlist) is missing for a specific learner,
       the default predictor list is used.

Pipelines

Scikit-learn uses pipelines to pre-process input data on the fly. Pipelines
can be used to impute missing observations or to create transformations of
predictors such as interactions and polynomials.

The following pipelines are currently supported:

    Pipelines
        stdscaler       StandardScaler()
        stdscaler0      StandardScaler(with_mean=False)
        sparse          SparseTransformer()
        onehot          OneHotEncoder()
        minmaxscaler    MinMaxScaler()
        medianimputer   SimpleImputer(strategy='median')
        knnimputer      KNNImputer()
        poly2           PolynomialFeatures(degree=2)
        poly3           PolynomialFeatures(degree=3)

Pipelines can be passed to the base learners via pipe*(string) (Syntax 1) or
pipeline(string) (Syntax 2). stdscaler0 is intended for sparse matrices, since
stdscaler will make a sparse matrix dense.
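As an illustrative sketch of attaching different pipelines to different
learners with Syntax 2 (Boston housing variables from the example below; the
pairing of pipelines with learners is only an example): the lasso receives
second-order polynomials and interactions via poly2, while the neural net's
inputs are standardized via stdscaler.

    . pystacked medv crim-lstat || m(lassocv) pipe(poly2) || m(nnet) pipe(stdscaler) || , type(regress)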
Example using Boston Housing data (Harrison et al., 1978)

Data set

The data set is available from the UCI Machine Learning Repository. The
following variables are included in the data set of 506 observations:

    Predictors
        CRIM        per capita crime rate by town
        ZN          proportion of residential land zoned for lots over 25,000
                    sq.ft.
        INDUS       proportion of non-retail business acres per town
        CHAS        Charles River dummy variable (= 1 if tract bounds river;
                    0 otherwise)
        NOX         nitric oxides concentration (parts per 10 million)
        RM          average number of rooms per dwelling
        AGE         proportion of owner-occupied units built prior to 1940
        DIS         weighted distances to five Boston employment centres
        RAD         index of accessibility to radial highways
        TAX         full-value property-tax rate per $10,000
        PTRATIO     pupil-teacher ratio by town
        B           1000(Bk - 0.63)^2 where Bk is the proportion Black by town
        LSTAT       % lower status of the population

    Outcome
        MEDV        Median value of owner-occupied homes in $1000's

Getting started

Load housing data.

    . insheet using https://statalasso.github.io/dta/housing.csv, clear

Stacking regression with lasso, random forest and gradient boosting.

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost)

The weights determine how much each base learner contributes to the final
stacking prediction.

Request the root MSPE table (in-sample only):

    . pystacked, table

Re-estimate using the first 400 observations, and request the root MSPE table.
Root MSPEs for the estimation sample (both refitted and cross-validated) and
for the default holdout sample (all unused observations) are reported:

    . pystacked medv crim-lstat if _n<=400, type(regress) pyseed(123) methods(lassocv rf gradboost)
    . pystacked, table holdout

Graph predicted vs actual values for the holdout sample:

    . pystacked, graph holdout

Storing the predicted values:

    . predict double yhat, xb

Storing the cross-validated predicted values:

    . predict double yhat_cv, xb cvalid

We can also save the predicted values of each base learner:

    . predict double yhat, basexb

Learner-specific predictors (Syntax 1)

pystacked allows the use of different sets of predictors for each base
learner. For example, linear estimators might perform better if interactions
are provided as inputs. Here, we use interactions and 2nd-order polynomials
for the lasso, but not for the other base learners.

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) xvars2(c.(crim-lstat)##c.(crim-lstat))

The same can be achieved using pipelines, which create polynomials on-the-fly
in Python.

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe2(poly2)

Learner-specific predictors (Syntax 2)

We demonstrate the same using the alternative syntax, which is often more
convenient:

    . pystacked medv crim-lstat || m(ols) || m(lassocv) xvars(c.(crim-lstat)##c.(crim-lstat)) || m(rf) || , type(regress) pyseed(123)
    . pystacked medv crim-lstat || m(ols) || m(lassocv) pipe(poly2) || m(rf) || , type(regress) pyseed(123)

Options of base learners (Syntax 1)

We can pass options to the base learners using cmdopt*(string). In this
example, we change the maximum tree depth for the random forest. Since the
random forest is the third base learner, we use cmdopt3(max_depth(3)).

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2) cmdopt3(max_depth(3))

You can verify that the option has been passed to Python correctly:

    . di e(pyopt3)

Options of base learners (Syntax 2)

The same results as above can be achieved using the alternative syntax, which
imposes no limit on the number of base learners.

    . pystacked medv crim-lstat || m(ols) pipe(poly2) || m(lassocv) pipe(poly2) || m(rf) opt(max_depth(3)) , type(regress) pyseed(123)

Single base learner

You can use pystacked with a single base learner. In this example, we are
using a conventional random forest:

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf)
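Options can be passed to a single base learner in the same way as above; here,
as a purely illustrative sketch, the number of trees in the random forest is
increased (n_estimators is a scikit-learn random forest option; the value 500
is only an example).

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf) cmdopt1(n_estimators(500))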
Voting

You can also use voting with pre-defined weights. Here, we assign a weight of
0.5 to OLS, 0.1 to the lasso and, implicitly, 0.4 to the random forest.

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2) voting voteweights(.5 .1)

Classification: Example using Spam data

Data set

For demonstration we consider the Spambase Data Set from the UCI Machine
Learning Repository. The data includes 4,601 observations and 57 variables.
The aim is to predict whether an email is spam (i.e., unsolicited commercial
e-mail) or not. Each observation corresponds to one email.

    Predictors
        v1-v48      percentage of words in the e-mail that match a specific
                    word, i.e. 100 * (number of times the word appears in the
                    e-mail) divided by the total number of words in the
                    e-mail. To see which word each predictor corresponds to,
                    see the link below.
        v49-v54     percentage of characters in the e-mail that match a
                    specific character, i.e. 100 * (number of times the
                    character appears in the e-mail) divided by the total
                    number of characters in the e-mail. To see which character
                    each predictor corresponds to, see the link below.
        v55         average length of uninterrupted sequences of capital
                    letters
        v56         length of longest uninterrupted sequence of capital
                    letters
        v57         total number of capital letters in the e-mail

    Outcome
        v58         denotes whether the e-mail was considered spam (1) or not
                    (0).

For more information about the data see
https://archive.ics.uci.edu/ml/datasets/spambase.

Load spam data.

    . insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma

Throughout this example, we add the option njobs(4), which enables
parallelization with 4 cores.

We consider three base learners: logit, random forest and gradient boosting:

    . pystacked v58 v1-v57, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)

Out-of-sample classification. As the data is ordered by outcome, we first
shuffle the data randomly.

    . set seed 42
    . gen u = runiform()
    . sort u

Estimation on the first 2000 observations.

    . pystacked v58 v1-v57 if _n<=2000, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)

We can obtain both the predicted class and the predicted probabilities:

    . predict spam, class
    . predict spam_p, pr

Confusion matrix, first in-sample only, then both in- and out-of-sample.

    . pystacked, table
    . pystacked, table holdout

Confusion matrix for a specified holdout sample.

    . gen h = _n>3000
    . pystacked, table holdout(h)

ROC curves for the default holdout sample. Specify a subtitle for the combined
graph.

    . pystacked, graph(subtitle(Spam data)) holdout

Predicted probabilities (hist option) for the default holdout sample. Specify
the number of bins for the individual learner graphs.

    . pystacked, graph hist lgraph(bin(20)) holdout

Installation

pystacked requires Stata 16 or higher, a Python installation and scikit-learn
(0.24 or higher). See this help file, this Stata blog entry and this YouTube
video for how to set up Python on your system. Installing Anaconda is in most
cases the easiest way of installing Python, including all required packages.

You can check your scikit-learn version using:

    . python: import sklearn
    . python: sklearn.__version__

Updating scikit-learn: If you use Anaconda, update scikit-learn through your
Anaconda Python distribution. Make sure that you have linked Stata with the
correct Python installation using python query.
If you use pip, you can update scikit-learn by typing "<Python path> -m pip
install -U scikit-learn" into the terminal, or directly in Stata:

    . shell <Python path> -m pip install -U scikit-learn

Note that you might need to restart Stata for changes to your Python
installation to take effect. For further information, see
https://scikit-learn.org/stable/install.html.

To install/update pystacked, type

    . net install pystacked, from(https://raw.githubusercontent.com/aahrens1/pystacked/main) replace

References

Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1), 49-64.

Harrison, D. and Rubinfeld, D. L. (1978). Hedonic prices and the demand for
clean air. Journal of Environmental Economics and Management, 5(1), 81-102.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical
learning: data mining, inference, and prediction. Springer Science & Business
Media.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
https://doi.org/10.1016/S0893-6080(05)80023-1

Contact

If you encounter an error, contact us via email. If you have a question, you
can also post on Statalist (please tag @Achim Ahrens).

Acknowledgements

pystacked took some inspiration from Michael Droste's pylearn, which
implements other scikit-learn programs for Stata. Thanks to Jan Ditzen for
testing an early version of the program. We also thank Brigham Frandsen and
Marco Alfano for feedback. All remaining errors are our own.

Citation

Please also cite scikit-learn; see https://scikit-learn.org/stable/about.html.

Authors

Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
achim.ahrens@gess.ethz.ch

Christian B. Hansen, University of Chicago, USA
Christian.Hansen@chicagobooth.edu

Mark E. Schaffer, Heriot-Watt University, UK