help pystacked                                                                                                v0.4.8
--------------------------------------------------------------------------------------------------------------------

Title

    pystacked -- Stata program for Stacking Regression

Overview

    pystacked implements stacking regression (Wolpert, 1992) via scikit-learn's
    sklearn.ensemble.StackingRegressor and sklearn.ensemble.StackingClassifier. Stacking is a way of combining
    multiple supervised machine learners (the "base" or "level-0" learners) into a meta learner. The currently
    supported base learners are linear regression, logit, lasso, ridge, elastic net, (linear) support vector
    machines, gradient boosting, and neural nets (MLP).

    pystacked can also be used with a single base learner and thus provides an easy-to-use API for scikit-learn's
    machine learning algorithms.

    pystacked requires Stata 16 or higher, a Python installation and scikit-learn 0.24 or higher. pystacked has
    been tested with scikit-learn 0.24, 1.0, 1.1.0, 1.1.1 and 1.1.2. See here and here for how to set up Python
    for Stata on your system.

Contents

    Syntax overview
    Syntax 1
    Syntax 2
    Other options
    Postestimation and prediction options
    Stacking
    Supported base learners
    Base learners: Options
    Pipelines
    Learner-specific predictors
    Example Stacking Regression
    Example Stacking Classification
    Installation
    Misc (references, contact, etc.)

Syntax overview

    There are two alternative syntaxes. The first syntax is:

        pystacked depvar predictors [if exp] [in range] [, methods(string) cmdopt1(string) cmdopt2(string) ...
            pipe1(string) pipe2(string) ... xvars1(varlist) xvars2(varlist) ... otheropts]

    The second syntax is:

        pystacked depvar predictors || method(string) opt(string) pipeline(string) xvars(varlist) ||
            method(string) opt(string) pipeline(string) xvars(varlist) || ... || [if exp] [in range] [, otheropts]

    The first syntax uses methods(string) to select base learners, where string is a list of base learners.
    Options are passed on to the base learners via cmdopt1(string), cmdopt2(string) to cmdopt10(string). That is,
    up to 10 base learners can be specified, and options are passed on in the order in which they appear in
    methods(string) (see Command options). Likewise, the pipe*(string) options can be used for pre-processing
    predictors within Python on the fly (see Pipelines). Furthermore, xvars*(varlist) allows you to specify a
    learner-specific varlist of predictors.

    The second syntax imposes no limit on the number of base learners (aside from the increasing computational
    complexity). Base learners are added before the comma using method(string) together with opt(string), and
    separated by "||".

Syntax 1

    Option                  Description
    ----------------------------------------------------------------------------------------------------------------
    methods(string)         a list of base learners; defaults to "ols lassocv gradboost" for regression and
                            "logit lassocv gradboost" for classification; see Base learners.
    cmdopt*(string)         options passed to the base learners; see Command options.
    pipe*(string)           pipelines passed to the base learners; see Pipelines. Regularized linear learners use
                            the stdscaler pipeline by default, which standardizes the predictors. To suppress
                            this, use nostdscaler. For other learners, there is no default pipeline.
    xvars*(varlist)         overwrites the default list of predictors; that is, you can specify learner-specific
                            lists of predictors. See here.
    ----------------------------------------------------------------------------------------------------------------
    Note: * is replaced with 1 to 10. The number refers to the order given in methods(string).

Syntax 2

    Option                  Description
    ----------------------------------------------------------------------------------------------------------------
    method(string)          a base learner; see Base learners.
    opt(string)             options; see Command options.
    pipeline(string)        pipelines passed to the base learners; see Pipelines.
                            Regularized linear learners use the stdscaler pipeline by default, which standardizes
                            the predictors. To suppress this, use nostdscaler. For other learners, there is no
                            default pipeline.
    xvars(varlist)          overwrites the default list of predictors; that is, you can specify learner-specific
                            lists of predictors. See here.
    ----------------------------------------------------------------------------------------------------------------

Other options

    Option                  Description
    ----------------------------------------------------------------------------------------------------------------
    type(string)            reg(ress) for regression problems or class(ify) for classification problems. The
                            default is regression.
    finalest(string)        final estimator used to combine base learners. The default is non-negative least
                            squares without an intercept and with the additional constraint that weights sum
                            to 1 (nnls1). Alternatives are nnls0 (non-negative least squares without an intercept
                            and without the sum-to-one constraint), singlebest (use the base learner with minimum
                            MSE), ols (ordinary least squares) or ridge for (logistic) ridge, which is the
                            scikit-learn default. For more information, see here.
    nosavepred              do not save predicted values (do not use if predict is used after estimation).
    nosavebasexb            do not save predicted values of each base learner (do not use if predict with basexb
                            is used after estimation).
    njobs(int)              number of jobs for parallel computing. The default is 0 (no parallelization); -1 uses
                            all available CPUs, -2 uses all CPUs minus 1.
    backend(string)         joblib backend used for parallelization; the default is 'loky' under Linux/MacOS and
                            'threading' under Windows. See here for more information.
    folds(int)              number of folds used for cross-validation (not relevant for voting); default is 5.
                            Ignored if foldvar(varname) is specified.
    foldvar(varname)        integer fold variable for cross-validation.
    bfolds(int)             number of folds used for base learners that use cross-validation (e.g. lassocv);
                            default is 5.
    norandom                folds are created using the ordering of the data.
    noshuffle               cross-validation folds for base learners that use cross-validation (e.g. lassocv) are
                            based on the ordering of the data.
    sparse                  converts the predictor matrix to a sparse matrix. This will only lead to speed
                            improvements if the predictor matrix is sufficiently sparse. Not all learners support
                            sparse matrices, and not all learners will benefit from sparse matrices in the same
                            way. You can also use the sparse pipeline to use sparse matrices for some learners,
                            but not for others.
    pyseed(int)             set the Python seed. Note that since pystacked uses Python, we also need to set the
                            Python seed to ensure replicability. There are three options: 1) pyseed(-1) draws a
                            number between 0 and 10^8 in Stata which is then used as a Python seed. This way, you
                            only need to deal with the Stata seed; for example, set seed 42 is sufficient, as the
                            Python seed is generated automatically. 2) Setting pyseed(x) with any positive
                            integer x allows you to control the Python seed directly. 3) pyseed(0) sets the seed
                            to None in Python. The default is pyseed(-1).
    ----------------------------------------------------------------------------------------------------------------

    Voting                  Description
    ----------------------------------------------------------------------------------------------------------------
    voting                  use the voting regressor (ensemble.VotingRegressor) or voting classifier
                            (ensemble.VotingClassifier); see here for a brief explanation.
    votetype(string)        type of voting classifier: hard (default) or soft.
    voteweights(numlist)    positive weights used for voting regression/classification. The length of numlist
                            should be the number of base learners - 1. The last weight is calculated to ensure
                            that sum(weights)=1.
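    The voteweights() arithmetic can be sketched in Python. This is an illustration only (the three-learner setup
    and the prediction values are hypothetical; pystacked handles the weighting internally via scikit-learn's
    VotingRegressor):

```python
# voteweights(.5 .1) with three base learners: the user supplies two
# weights and the last one is implied so that all weights sum to 1.
user_weights = [0.5, 0.1]
last_weight = 1 - sum(user_weights)
weights = user_weights + [last_weight]   # [0.5, 0.1, 0.4]

# A voting regressor's prediction is the weighted average of the
# base learners' predictions (hypothetical values for one observation):
base_preds = [20.0, 24.0, 22.0]
vote_pred = sum(w * p for w, p in zip(weights, base_preds))
print(weights, vote_pred)
```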
    ----------------------------------------------------------------------------------------------------------------

Postestimation and prediction options

Postestimation tables

    After estimation, pystacked can report a table of in-sample (both cross-validated and full-sample-refitted)
    and, optionally, out-of-sample (holdout sample) performance for both the stacking regression and the base
    learners. For regression problems, the table reports the root MSPE (mean squared prediction error); for
    classification problems, a confusion matrix is reported. The default holdout sample used for out-of-sample
    performance with the holdout option is all observations not included in the estimation. Alternatively, the
    user can specify the holdout sample explicitly using the syntax holdout(varname). The table can be requested
    postestimation as below, or as part of the pystacked estimation command.

    Table syntax:

        pystacked [, table holdout[(varname)] ]

Postestimation graphs

    pystacked can also report graphs of in-sample and, optionally, out-of-sample (holdout sample) performance for
    both the stacking regression and the base learners. For regression problems, the graphs compare predicted vs
    actual values of depvar. For classification problems, the default is to report ROC curves; optionally,
    histograms of predicted probabilities are reported. As with the table option, the default holdout sample used
    for out-of-sample performance is all observations not included in the estimation, but the user can instead
    specify the holdout sample explicitly. The graphs can be requested postestimation as below, or as part of the
    pystacked estimation command. The graph option on its own reports the graphs using pystacked's default
    settings. Because graphs are produced using Stata's twoway, roctab and histogram commands, the user can
    control how the combined graph (graph(options)) or the individual learner graphs (lgraph(options)) appear by
    passing options to these commands.
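    The two performance measures reported by the table option can be sketched as follows. This is an illustrative
    Python sketch with hypothetical actual and predicted values; pystacked computes these internally:

```python
import numpy as np

# Regression: root MSPE on hypothetical actual vs predicted values.
y     = np.array([3.0, 1.0, 4.0, 2.0])
yhat  = np.array([2.5, 1.5, 4.0, 1.0])
rmspe = np.sqrt(np.mean((y - yhat) ** 2))

# Classification: 2x2 confusion matrix counting how often true class i
# is predicted as class j (hypothetical classes).
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])
confusion = np.array([[np.sum((y_true == i) & (y_pred == j)) for j in (0, 1)]
                      for i in (0, 1)])
print(rmspe)
print(confusion)
```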
    Graph syntax:

        pystacked [, graph[(options)] lgraph[(options)] histogram holdout[(varname)] ]

Prediction

    To get stacking predicted values:

        predict type newname [if exp] [in range] [, pr class xb resid]

    To get fitted values for each base learner:

        predict type stub [if exp] [in range] [, basexb cvalid]

    Option                  Description
    ----------------------------------------------------------------------------------------------------------------
    xb                      predicted value; the default for regression.
    pr                      predicted probability; the default for classification.
    class                   predicted class.
    resid                   residuals.
    basexb                  predicted values for each base learner (default = use base learners re-fitted on the
                            full estimation sample).
    cvalid                  cross-validated predicted values. Currently only supported if combined with basexb.
    ----------------------------------------------------------------------------------------------------------------
    Note: Predicted values (in- and out-of-sample) are calculated and stored in Python memory when pystacked is
    run. predict pulls the predicted values from Python memory and saves them in Stata memory. This means that no
    changes should be made to the data in Stata memory between the pystacked call and the predict call. If
    changes to the data set are made, predict will return an error.

Stacking

    Stacking is a way of combining cross-validated predictions from multiple base ("level-0") learners into a
    final prediction. A final estimator ("level-1") is used to combine the base predictions. The default final
    estimator for stacking regression is non-negative least squares (NNLS) without an intercept and with the
    constraint that weights sum to one. Note that in this respect we deviate from the scikit-learn default and
    follow the recommendation in Breiman (1996) and Hastie et al. (2009, p. 290). The scikit-learn defaults for
    the final estimator are ridge regression for stacking regression and logistic ridge for classification tasks.
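    Under the hood, pystacked builds a scikit-learn stacking object along these lines. This is a rough Python
    sketch, not pystacked's actual code: the synthetic data, the seed and the choice of base learners are
    illustrative, and scikit-learn's default final estimator (ridge) is used here rather than pystacked's nnls1
    default:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data standing in for the Stata predictor matrix and outcome.
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Base ("level-0") learners; the lasso gets a poly2 + stdscaler pipeline,
# mirroring the pipelines described elsewhere in this help file.
estimators = [
    ("ols", LinearRegression()),
    ("lassocv", make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LassoCV())),
    ("gradboost", GradientBoostingRegressor(random_state=123)),
]

# cv=5 matches the folds() default; final_estimator defaults to RidgeCV.
stack = StackingRegressor(estimators=estimators, cv=5)
stack.fit(X, y)
yhat = stack.predict(X)
print(yhat.shape)
```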
    To use the scikit-learn default, use finalest(ridge). pystacked also supports ordinary (unconstrained) least
    squares as the final estimator (finalest(ols)). Finally, singlebest uses the single base learner that
    exhibits the smallest cross-validated mean squared error.

    An alternative to stacking is voting. Voting regression uses the weighted average of base learners to form
    predictions. By default, the unweighted average is used, but the user can specify weights using
    voteweights(numlist). The voting classifier uses a majority rule by default (hard voting). An alternative is
    soft voting, where the (weighted) probabilities are used to form the final prediction.

Supported base learners

    The following base learners are supported:

    Base learners
    ----------------------------------------------------------------------------------------------------------------
    ols                     Linear regression (regression only)
    logit                   Logistic regression (classification only)
    lassoic                 Lasso with penalty chosen by AIC/BIC (regression only)
    lassocv                 Lasso with cross-validated penalty
    ridgecv                 Ridge with cross-validated penalty
    elasticcv               Elastic net with cross-validated penalty
    svm                     Support vector machines
    gradboost               Gradient boosting
    rf                      Random forest
    linsvm                  Linear SVM
    nnet                    Neural net
    ----------------------------------------------------------------------------------------------------------------

    The base learners can be chosen using the methods() option, e.g. methods(lassocv gradboost nnet) (Syntax 1),
    or the method(string) option (Syntax 2). Please see the links in the next section for more information on
    each method.

Base learners: Options

    Options can be passed to the base learners via cmdopt*(string) (Syntax 1) or opt(string) (Syntax 2). The
    defaults are adopted from scikit-learn. To see the default options of each base learner, simply click on the
    "Show options" links below. To see which alternative settings are allowed, please see the scikit-learn
    documentation linked below. We strongly recommend that you read the scikit-learn documentation carefully. The
    option showoptions shows the options passed on to Python.
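    The nnls1 idea can be illustrated in Python. One simple way to obtain non-negative weights that sum to one is
    to solve an unconstrained NNLS problem and rescale the solution; this is an illustration under that
    assumption with synthetic data, not necessarily pystacked's exact algorithm:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(42)
y = rng.normal(size=100)

# Hypothetical cross-validated predictions of three base learners, with
# increasing noise (so the first learner should receive the most weight).
Z = np.column_stack([y + rng.normal(scale=s, size=100) for s in (0.5, 1.0, 2.0)])

w, _ = nnls(Z, y)    # non-negative least squares, no intercept
w1 = w / w.sum()     # rescale so the weights sum to one
print(w1)
```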
    We recommend verifying that options have been passed to Python as intended.

    Linear regression
        Methods: ols            Type: reg       Documentation: linear_model.LinearRegression
        Show options
    Logistic regression
        Methods: logit          Type: class     Documentation: linear_model.LogisticRegression
        Show options
    Penalized regression with information criteria
        Methods: lassoic        Type: reg       Documentation: linear_model.LassoLarsIC
        Show options
    Penalized regression with cross-validation
        Methods: lassocv, ridgecv and elasticcv
                                Type: reg       Documentation: linear_model.ElasticNetCV
        Show lasso options  Show ridge options  Show elastic net options
    Penalized logistic regression with cross-validation
        Methods: lassocv, ridgecv and elasticcv
                                Type: class     Documentation: linear_model.LogisticRegressionCV
        Show lasso options  Show ridge options  Show elastic net options
    Random forest classifier
        Method: rf              Type: class     Documentation: ensemble.RandomForestClassifier
        Show options
    Random forest regressor
        Method: rf              Type: reg       Documentation: ensemble.RandomForestRegressor
        Show options
    Gradient boosted regression trees
        Method: gradboost       Type: reg       Documentation: ensemble.GradientBoostingRegressor
        Show options
    Linear SVM (SVR)
        Method: linsvm          Type: reg       Documentation: svm.LinearSVR
        Show options
    SVM (SVR)
        Method: svm             Type: reg       Documentation: svm.SVR
        Show options
    SVM (SVC)
        Method: svm             Type: class     Documentation: svm.SVC
        Show options
    Neural net classifier (Multi-layer Perceptron)
        Method: nnet            Type: class     Documentation: sklearn.neural_network.MLPClassifier
        Show options
    Neural net regressor (Multi-layer Perceptron)
        Method: nnet            Type: reg       Documentation: sklearn.neural_network.MLPRegressor
        Show options

Learner-specific predictors

    By default, pystacked uses the same set of predictors for all base learners. This is often not desirable. For
    example, when using linear machine learners such as the lasso, it is recommended to create interactions.
    There are two methods to allow for learner-specific sets of predictors:

    1) Pipelines, discussed in the next section, can be used to create polynomials on the fly.
    2) The xvars*(varlist) option allows you to specify predictors for a specific learner. If xvars*(varlist) is
    missing for a specific learner, the default predictor list is used.

Pipelines

    Scikit-learn uses pipelines to pre-process input data on the fly. Pipelines can be used to impute missing
    observations or to create transformations of predictors such as interactions and polynomials. The following
    pipelines are currently supported:

    Pipelines
    ----------------------------------------------------------------------------------------------------------------
    stdscaler               StandardScaler()
    stdscaler0              StandardScaler(with_mean=False)
    sparse                  SparseTransformer()
    onehot                  OneHotEncoder()
    minmaxscaler            MinMaxScaler()
    medianimputer           SimpleImputer(strategy='median')
    knnimputer              KNNImputer()
    poly2                   PolynomialFeatures(degree=2)
    poly3                   PolynomialFeatures(degree=3)
    ----------------------------------------------------------------------------------------------------------------

    Pipelines can be passed to the base learners via pipe*(string) (Syntax 1) or pipeline(string) (Syntax 2).
    stdscaler0 is intended for sparse matrices, since stdscaler would make a sparse matrix dense.

Example using Boston Housing data (Harrison et al., 1978)

Data set

    The data set is available from the UCI Machine Learning Repository. The following variables are included in
    the data set of 506 observations:

    Predictors
        CRIM        per capita crime rate by town
        ZN          proportion of residential land zoned for lots over 25,000 sq.ft.
        INDUS       proportion of non-retail business acres per town
        CHAS        Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        NOX         nitric oxides concentration (parts per 10 million)
        RM          average number of rooms per dwelling
        AGE         proportion of owner-occupied units built prior to 1940
        DIS         weighted distances to five Boston employment centres
        RAD         index of accessibility to radial highways
        TAX         full-value property-tax rate per $10,000
        PTRATIO     pupil-teacher ratio by town
        B           1000(Bk - 0.63)^2 where Bk is the proportion Black by town
        LSTAT       % lower status of the population

    Outcome
        MEDV        Median value of owner-occupied homes in $1000's

Getting started

    Load the housing data:
    . insheet using https://statalasso.github.io/dta/housing.csv, clear

    Stacking regression with lasso, random forest and gradient boosting:

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost)

    The weights determine how much each base learner contributes to the final stacking prediction.

    Request the root MSPE table (in-sample only):

    . pystacked, table

    Re-estimate using the first 400 observations, and request the root MSPE table. RMSPEs for both in-sample
    (both refitted and cross-validated) and the default holdout sample (all unused observations) are reported:

    . pystacked medv crim-lstat if _n<=400, type(regress) pyseed(123) methods(lassocv rf gradboost)
    . pystacked, table holdout

    Graph predicted vs actual for the holdout sample:

    . pystacked, graph holdout

    Storing the predicted values:

    . predict double yhat, xb

    Storing the cross-validated predicted values:

    . predict double yhat_cv, xb cvalid

    We can also save the predicted values of each base learner:

    . predict double ybase, basexb

Learner-specific predictors (Syntax 1)

    pystacked allows the use of different sets of predictors for each base learner. For example, linear
    estimators might perform better if interactions are provided as inputs. Here, we use interactions and
    2nd-order polynomials for the lasso, but not for the other base learners:

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf)
        xvars2(c.(crim-lstat)##c.(crim-lstat))

    The same can be achieved using pipelines, which create polynomials on-the-fly in Python:

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe2(poly2)

Learner-specific predictors (Syntax 2)

    We demonstrate the same using the alternative syntax, which is often more handy:

    . pystacked medv crim-lstat || m(ols) || m(lassocv) xvars(c.(crim-lstat)##c.(crim-lstat)) || m(rf) || ,
        type(regress) pyseed(123)
    . pystacked medv crim-lstat || m(ols) || m(lassocv) pipe(poly2) || m(rf) || , type(regress) pyseed(123)

Options of base learners (Syntax 1)

    We can pass options to the base learners using cmdopt*(string). In this example, we change the maximum tree
    depth for the random forest. Since random forest is the third base learner, we use cmdopt3(max_depth(3)):

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2)
        cmdopt3(max_depth(3))

    You can verify that the option has been passed to Python correctly:

    . di e(pyopt3)

Options of base learners (Syntax 2)

    The same results as above can be achieved using the alternative syntax, which imposes no limit on the number
    of base learners:

    . pystacked medv crim-lstat || m(ols) pipe(poly2) || m(lassocv) pipe(poly2) || m(rf) opt(max_depth(3)) ,
        type(regress) pyseed(123)

Single base learners

    You can use pystacked with a single base learner. In this example, we are using a conventional random forest:

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf)

Voting

    You can also use pre-defined weights. Here, we assign weights of 0.5 to OLS, .1 to the lasso and, implicitly,
    .4 to the random forest:

    . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2)
        voting voteweights(.5 .1)

Classification Example using Spam data

Data set

    For demonstration we consider the Spambase Data Set from the UCI Machine Learning Repository. The data
    includes 4,601 observations and 57 variables. The aim is to predict whether an email is spam (i.e.,
    unsolicited commercial e-mail) or not. Each observation corresponds to one email.

    Predictors
        v1-v48      percentage of words in the e-mail that match a specific word, i.e. 100 * (number of times
                    the word appears in the e-mail) divided by the total number of words in the e-mail. To see
                    which word each predictor corresponds to, see the link below.
        v49-v54     percentage of characters in the e-mail that match a specific character, i.e.
                    100 * (number of times the character appears in the e-mail) divided by the total number of
                    characters in the e-mail. To see which character each predictor corresponds to, see the link
                    below.
        v55         average length of uninterrupted sequences of capital letters
        v56         length of the longest uninterrupted sequence of capital letters
        v57         total number of capital letters in the e-mail

    Outcome
        v58         denotes whether the e-mail was considered spam (1) or not (0).

    For more information about the data see https://archive.ics.uci.edu/ml/datasets/spambase.

    Load the spam data:

    . insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma

    Throughout this example, we add the option njobs(4), which enables parallelization with 4 cores. We consider
    three base learners: logit, random forest and gradient boosting:

    . pystacked v58 v1-v57, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)

Out-of-sample classification

    As the data is ordered by outcome, we first shuffle the data randomly:

    . set seed 42
    . gen u = runiform()
    . sort u

    Estimation on the first 2000 observations:

    . pystacked v58 v1-v57 if _n<=2000, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)

    We can get both the predicted class and the predicted probabilities:

    . predict spam, class
    . predict spam_p, pr

    Confusion matrices, first in-sample only, then both in- and out-of-sample:

    . pystacked, table
    . pystacked, table holdout

    Confusion matrix for a specified holdout sample:

    . gen h = _n>3000
    . pystacked, table holdout(h)

    ROC curves for the default holdout sample. Specify a subtitle for the combined graph:

    . pystacked, graph(subtitle(Spam data)) holdout

    Predicted probabilities (hist option) for the default holdout sample. Specify the number of bins for the
    individual learner graphs:

    . pystacked, graph hist lgraph(bin(20)) holdout

Installation

    pystacked requires Stata 16 or higher, a Python installation and scikit-learn 0.24 or higher.
    See this help file, this Stata blog entry and this Youtube video for how to set up Python on your system.
    Installing Anaconda is in most cases the easiest way of installing Python including all required packages.

    You can check your scikit-learn version using:

    . python: import sklearn
    . python: sklearn.__version__

    Updating scikit-learn: If you use Anaconda, update scikit-learn through your Anaconda Python distribution.
    Make sure that you have linked Stata with the correct Python installation using python query. If you use pip,
    you can update scikit-learn by typing "<Python path> -m pip install -U scikit-learn" into the terminal, or
    directly in Stata:

    . shell <Python path> -m pip install -U scikit-learn

    Note that you might need to restart Stata for changes to your Python installation to take effect. For further
    information, see https://scikit-learn.org/stable/install.html.

    To install/update pystacked, type:

    . net install pystacked, from(https://raw.githubusercontent.com/aahrens1/pystacked/main) replace

References

    Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of
    Environmental Economics and Management, 5, 81-102.

    Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning: data mining,
    inference, and prediction. Springer Science & Business Media.

    Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
    https://doi.org/10.1016/S0893-6080(05)80023-1

Contact

    If you encounter an error, contact us via email. If you have a question, you can also post on Statalist
    (please tag @Achim Ahrens).

Acknowledgements

    pystacked took some inspiration from Michael Droste's pylearn, which implements other scikit-learn programs
    for Stata. Thanks to Jan Ditzen for testing an early version of the program. We also thank Brigham Frandsen
    and Marco Alfano for feedback.
    All remaining errors are our own.

Citation

    Please also cite scikit-learn; see https://scikit-learn.org/stable/about.html.

Authors

    Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
    achim.ahrens@gess.ethz.ch

    Christian B. Hansen, University of Chicago, USA
    Christian.Hansen@chicagobooth.edu

    Mark E. Schaffer, Heriot-Watt University, UK
