Help file
------------------------------------------------------------------------------------------------------------------------
help pystacked                                                                                                    v0.4.8
------------------------------------------------------------------------------------------------------------------------

Title

    pystacked --  Stata program for Stacking Regression

Overview

    pystacked implements stacking regression (Wolpert, 1992) via scikit-learn's sklearn.ensemble.StackingRegressor and
    sklearn.ensemble.StackingClassifier.  Stacking is a way of combining multiple supervised machine learners (the
    "base" or "level-0" learners) into a meta learner.  The currently supported base learners are linear regression,
    logit, lasso, ridge, elastic net, (linear) support vector machines, gradient boosting, and neural nets (MLP).

    pystacked can also be used with a single base learner and, thus, provides an easy-to-use API for scikit-learn's
    machine learning algorithms.

    pystacked requires Stata 16 or higher, a Python installation, and scikit-learn (0.24 or higher).  pystacked has
    been tested with scikit-learn 0.24, 1.0, 1.1.0, 1.1.1 and 1.1.2.  See here and here for how to set up Python for
    Stata on your system.

Contents

        Syntax overview
        Syntax 1
        Syntax 2
        Other options
        Postestimation and prediction options
        Stacking
        Supported base learners
        Base learners: Options
        Pipelines
        Learner-specific predictors
        Example Stacking Regression
        Example Stacking Classification
        Installation
        Misc (references, contact, etc.)

Syntax overview

    There are two alternative syntaxes. The first syntax is:

        pystacked depvar predictors [if exp] [in range] [, methods(string) cmdopt1(string) cmdopt2(string) ...
              pipe1(string) pipe2(string) ...  xvars1(varlist) xvars2(varlist) ...  otheropts ]

    The second syntax is:

        pystacked depvar predictors || method(string) opt(string) pipeline(string) xvars(varlist) || method(string)
              opt(string) pipeline(string) xvars(varlist) || ...  || [if exp] [in range] [, otheropts ]

    The first syntax uses methods(string) to select base learners, where string is a list of base learners.  Options
    are passed on to base learners via cmdopt1(string), cmdopt2(string) to cmdopt10(string).  That is, up to 10 base
    learners can be specified, and options are passed on in the order in which the learners appear in methods(string)
    (see Command options).  Likewise, the pipe*(string) options can be used for pre-processing predictors within
    Python on the fly (see Pipelines).  Furthermore, xvars*(varlist) allows you to specify a learner-specific varlist
    of predictors.

    The second syntax imposes no limit on the number of base learners (aside from the increasing computational
    complexity). Base learners are added before the comma using method(string) together with opt(string) and separated
    by "||".

Syntax 1

    Option                Description
    ------------------------------------------------------------------------------------------------------------------
    methods(string)        a list of base learners, defaults to "ols lassocv gradboost" for regression and "logit
                            lassocv gradboost" for classification; see Base learners.
    cmdopt*(string)        options passed to the base learners, see Command options.
    pipe*(string)          pipelines passed to the base learners, see Pipelines.  Regularized linear learners use the
                            stdscaler pipeline by default, which standardizes the predictors. To suppress this, use
                            nostdscaler.  For other learners, there is no default pipeline.
    xvars*(varlist)        overwrites the default list of predictors.  That is, you can specify learner-specific lists
                            of predictors.  See here.
    ------------------------------------------------------------------------------------------------------------------
    Note: * is replaced with 1 to 10. The number refers to the order given in methods(string).

Syntax 2

    Option                Description
    ------------------------------------------------------------------------------------------------------------------
    method(string)         a base learner, see Base learners.
    opt(string)            options, see Command options.
    pipeline(string)       pipelines passed to the base learner, see Pipelines.  Regularized linear learners use the
                            stdscaler pipeline by default, which standardizes the predictors. To suppress this, use
                            nostdscaler.  For other learners, there is no default pipeline.
    xvars(varlist)         overwrites the default list of predictors.  That is, you can specify learner-specific lists
                            of predictors.  See here.
    ------------------------------------------------------------------------------------------------------------------

Other options

    Option                Description
    ------------------------------------------------------------------------------------------------------------------
    type(string)           reg(ress) for regression problems or class(ify) for classification problems.  The default
                            is regression.
    finalest(string)       final estimator used to combine base learners.  The default is non-negative least squares
                            without an intercept and the additional constraint that weights sum to 1 (nnls1).
                            Alternatives are nnls0 (non-negative least squares without an intercept and without the
                            sum-to-one constraint), singlebest (use the base learner with minimum MSE), ols (ordinary
                            least squares) or ridge for (logistic) ridge, which is the sklearn default. For more
                            information, see here.
    nosavepred             do not save predicted values (do not use if predict is used after estimation)
    nosavebasexb           do not save predicted values of each base learner (do not use if predict with basexb is
                            used after estimation)
    njobs(int)             number of jobs for parallel computing. The default is 0 (no parallelization), -1 uses all
                            available CPUs, -2 uses all CPUs minus 1.
    backend(string)        joblib backend used for parallelization; the default is 'loky' under Linux/MacOS and
                            'threading' under Windows.  See here for more information.
    folds(int)             number of folds used for cross-validation (not relevant for voting); default is 5. Ignored
                            if foldvar(varname) is specified.
    foldvar(varname)       integer fold variable for cross-validation.
    bfolds(int)            number of folds used for base learners that use cross-validation (e.g. lassocv); default is
                            5.
    norandom               folds are created using the ordering of the data.
    noshuffle              cross-validation folds for base learners that use cross-validation (e.g. lassocv) are based
                            on ordering of the data.
    sparse                 converts predictor matrix to a sparse matrix. This will only lead to speed improvements if
                            the predictor matrix is sufficiently sparse. Not all learners support sparse matrices and
                            not all learners will benefit from sparse matrices in the same way. You can also use the
                            sparse pipeline to use sparse matrices for some learners, but not for others.
    pyseed(int)            set the Python seed. Note that since pystacked uses Python, we also need to set the Python
                            seed to ensure replicability.  Three options: 1) pyseed(-1) draws a number between 0 and
                            10^8 in Stata which is then used as the Python seed.  This way, you only need to deal with
                            the Stata seed. For example, set seed 42 is sufficient, as the Python seed is generated
                            automatically. 2) Setting pyseed(x) with any positive integer x allows you to control the
                            Python seed directly. 3) pyseed(0) sets the seed to None in Python.  The default is
                            pyseed(-1).
    ------------------------------------------------------------------------------------------------------------------
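
    For example (a sketch; y and x1-x5 are placeholder variables), several of the options above can be combined in a
    single call:

        . pystacked y x1-x5, methods(ols lassocv gradboost) finalest(singlebest) folds(10) pyseed(42)  // placeholders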

    Voting                Description
    ------------------------------------------------------------------------------------------------------------------
    voting                 use voting regressor (ensemble.VotingRegressor) or voting classifier
                            (ensemble.VotingClassifier); see here for a brief explanation.
    votetype(string)       type of voting classifier:  hard (default) or soft
    voteweights(numlist)   positive weights used for voting regression/classification.  The length of numlist should
                            be the number of base learners - 1. The last weight is calculated to ensure that
                            sum(weights)=1.
    ------------------------------------------------------------------------------------------------------------------
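
    As an illustration (a sketch; y is a placeholder outcome, binary for the classification call, and x1-x5 are
    placeholder predictors), a soft-voting classifier and a voting regression with user-specified weights could be
    requested as:

        . pystacked y x1-x5, type(class) methods(logit rf gradboost) voting votetype(soft)  // y, x1-x5: placeholders
        . pystacked y x1-x5, type(regress) methods(ols lassocv rf) voting voteweights(.5 .3)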

Postestimation and prediction options

    Postestimation tables

    After estimation, pystacked can report a table of in-sample (both cross-validated and full-sample-refitted) and,
    optionally, out-of-sample (holdout sample) performance for both the stacking regression and the base learners.
    For regression problems, the table reports the RMSPE (root mean squared prediction error); for classification
    problems, a confusion matrix is reported.  The default holdout sample used for out-of-sample performance with the
    holdout option is all observations not included in the estimation.  Alternatively, the user can specify the
    holdout sample explicitly using the syntax holdout(varname).  The table can be requested postestimation as below,
    or as part of the pystacked estimation command.

    Table syntax:

        pystacked [, table holdout[(varname)] ]

    Postestimation graphs

    pystacked can also report graphs of in-sample and, optionally, out-of-sample (holdout sample) performance for both
    the stacking regression and the base learners.  For regression problems, the graphs compare predicted vs actual
    values of depvar.  For classification problems, the default is to report ROC curves; optionally, histograms of
    predicted probabilities are reported.  As with the table option, the default holdout sample used for out-of-sample
    performance is all observations not included in the estimation, but the user can instead specify the holdout
    sample explicitly.  The graphs can be requested postestimation as below, or as part of the pystacked estimation
    command.

    The graph option on its own reports the graphs using pystacked's default settings.  Because graphs are produced
    using Stata's twoway, roctab and histogram commands, the user can control how the combined graph (graph(options))
    and the individual learner graphs (lgraph(options)) appear by passing options to these commands.

    Graph syntax:

        pystacked [, graph[(options)] lgraph[(options)] histogram holdout[(varname)] ]

    Prediction

    To get stacking predicted values:

        predict type newname [if exp] [in range] [, pr class xb resid ]

    To get fitted values for each base learner:

        predict type stub [if exp] [in range] [, basexb cvalid ]

    Option                Description
    ------------------------------------------------------------------------------------------------------------------
    xb                     predicted value; the default for regression
    pr                     predicted probability; the default for classification
    class                  predicted class
    resid                  residuals
    basexb                 predicted values for each base learner (default = use base learners re-fitted on full
                            estimation sample)
    cvalid                 cross-validated predicted values. Currently only supported if combined with basexb.
    ------------------------------------------------------------------------------------------------------------------

    Note: Predicted values (in- and out-of-sample) are calculated and stored in Python memory when pystacked is run.
    predict pulls the predicted values from Python memory and saves them in Stata memory. This means that no changes
    should be made to the data in Stata memory between the pystacked call and the predict call. If changes to the
    data set are made, predict will return an error.
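
    For example, after a regression fit, the first call below stores the stacking predictions in yhat and the second
    stores the cross-validated predictions of each base learner under the stub yh (a sketch; the variable names are
    placeholders):

        . predict double yhat, xb
        . predict double yh, basexb cvalid  // one new variable per base learner, named using the stub yh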

Stacking

    Stacking is a way of combining cross-validated predictions from multiple base ("level-0") learners into a final
    prediction. A final estimator ("level-1") is used to combine the base predictions.

    The default final estimator for stacking regression is non-negative least squares (NNLS) without an intercept and
    with the constraint that weights sum to one.  Note that in this respect we deviate from the scikit-learn default
    and follow the recommendation in Breiman (1996) and Hastie et al. (2009, p. 290).  The scikit-learn defaults for
    the final estimator are ridge regression for stacking regression and logistic ridge for classification tasks.  To
    use the scikit-learn default, use finalest(ridge).  pystacked also supports ordinary (unconstrained) least squares
    as the final estimator (finalest(ols)).  Finally, singlebest uses the single base learner that exhibits the
    smallest cross-validated mean squared error.

    An alternative to stacking is voting. Voting regression uses the weighted average of base learners to form
    predictions. By default, the unweighted average is used, but the user can specify weights using
    voteweights(numlist). The voting classifier uses majority rule by default (hard voting). An alternative is soft
    voting, where the (weighted) predicted probabilities are used to form the final prediction.
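
    For example (a sketch; y and x1-x5 are placeholder variables), the scikit-learn default final estimator, the
    single best base learner, or voting can be requested as follows:

        . pystacked y x1-x5, methods(ols lassocv gradboost) finalest(ridge)       // placeholders
        . pystacked y x1-x5, methods(ols lassocv gradboost) finalest(singlebest)
        . pystacked y x1-x5, methods(ols lassocv gradboost) voting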

Supported base learners

    The following base learners are supported:

    Base learners           
      ols                   Linear regression (regression only)
      logit                 Logistic regression (classification only)
      lassoic               Lasso with penalty chosen by AIC/BIC (regression only)
      lassocv               Lasso with cross-validated penalty
      ridgecv               Ridge with cross-validated penalty
      elasticcv             Elastic net with cross-validated penalty
      svm                   Support vector machines
      gradboost             Gradient boosting
      rf                    Random forest
      linsvm                Linear SVM
      nnet                  Neural net

    The base learners can be chosen using the methods(string) option (Syntax 1), e.g. methods(lassocv gradboost
    nnet), or the method(string) option (Syntax 2).

    Please see links in the next section for more information on each method.

Base learners: Options

    Options can be passed to the base learners via cmdopt*(string) (Syntax 1) or opt(string) (Syntax 2).  The defaults
    are adopted from scikit-learn.

    To see the default options of each base learner, simply click on the "Show options" links below. To see which
    alternative settings are allowed, please see the scikit-learn documentation linked below.  We strongly recommend
    that you read the scikit-learn documentation carefully.

    The option showoptions shows the options passed on to Python.  We recommend verifying that options have been
    passed to Python as intended.
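
    For example (a sketch; y and x1-x5 are placeholder variables, and max_depth(3) and n_estimators(500) are
    illustrative scikit-learn parameters for gradient boosting), showoptions can be used to check that the options
    arrive in Python as intended:

        . pystacked y x1-x5, methods(gradboost) cmdopt1(max_depth(3) n_estimators(500)) showoptions  // placeholders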

    Linear regression
    Method: ols
    Type: reg
    Documentation: linear_model.LinearRegression

    Show options

    Logistic regression
    Method: logit
    Type: class
    Documentation: linear_model.LogisticRegression

    Show options

    Penalized regression with information criteria
    Method: lassoic
    Type: reg
    Documentation: linear_model.LassoLarsIC

    Show options

    Penalized regression with cross-validation

    Methods: lassocv, ridgecv and elasticcv
    Type: reg
    Documentation: linear_model.ElasticNetCV

    Show lasso options
    Show ridge options
    Show elastic net options

    Penalized logistic regression with cross-validation
    Methods: lassocv, ridgecv and elasticcv
    Type: class
    Documentation: linear_model.LogisticRegressionCV

    Show lasso options
    Show ridge options
    Show elastic net options

    Random forest classifier
    Method: rf
    Type: class
    Documentation: ensemble.RandomForestClassifier

    Show options

    Random forest regressor
    Method: rf
    Type: reg
    Documentation: ensemble.RandomForestRegressor

    Show options

    Gradient boosted regression trees
    Method: gradboost
    Type: reg
    Documentation: ensemble.GradientBoostingRegressor

    Show options

    Linear SVM (SVR)
    Method: linsvm
    Type: reg
    Documentation: svm.LinearSVR

    Show options

    SVM (SVR)
    Method: svm
    Type: reg
    Documentation: svm.SVR

    Show options

    SVM (SVC)
    Method: svm
    Type: class
    Documentation: svm.SVC

    Show options

    Neural net classifier (Multi-layer Perceptron)
    Method: nnet
    Type: class
    Documentation: sklearn.neural_network.MLPClassifier

    Show options

    Neural net regressor (Multi-layer Perceptron)
    Method: nnet
    Type: reg
    Documentation: sklearn.neural_network.MLPRegressor

    Show options

Learner-specific predictors

    By default, pystacked uses the same set of predictors for all base learners. This is often not desirable. For
    example, when using linear machine learners such as the lasso, it is often advisable to include interactions.
    There are two methods to allow for learner-specific sets of predictors:

    1) Pipelines, discussed in the next section, can be used to create polynomials on the fly.

    2) The xvars*(varlist) option allows you to specify predictors for a specific learner, as in the sketch below.
    If xvars*(varlist) is not specified for a specific learner, the default predictor list is used.
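
    A minimal sketch (y and x1-x5 are placeholder variables): here the second learner, the lasso, receives all
    second-order interactions, while the other learners use the default predictor list.

        . pystacked y x1-x5, methods(ols lassocv rf) xvars2(c.(x1-x5)##c.(x1-x5))  // placeholders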

Pipelines

    Scikit-learn uses pipelines to pre-process input data on the fly.  Pipelines can be used to impute missing
    observations or to create transformations of predictors such as interactions and polynomials.

    The following pipelines are currently supported:

    Pipelines               
      stdscaler             StandardScaler()
      stdscaler0            StandardScaler(with_mean=False)
      sparse                SparseTransformer()
      onehot                OneHotEncoder()
      minmaxscaler          MinMaxScaler()
      medianimputer         SimpleImputer(strategy='median')
      knnimputer            KNNImputer()
      poly2                 PolynomialFeatures(degree=2)
      poly3                 PolynomialFeatures(degree=3)

    Pipelines can be passed to the base learners via pipe*(string) (Syntax 1) or pipeline(string) (Syntax 2).

    stdscaler0 is intended for sparse matrices, since stdscaler will make a sparse matrix dense.
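
    For example (a sketch; y and x1-x5 are placeholder variables), the poly2 pipeline below creates second-order
    polynomials and interactions for the lasso only, while gradient boosting receives the original predictors:

        . pystacked y x1-x5, methods(lassocv gradboost) pipe1(poly2)  // placeholders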

Example using Boston Housing data (Harrison and Rubinfeld, 1978)

    Data set

    The data set is available from the UCI Machine Learning Repository.  The following variables are included in the
    data set of 506 observations:

    Predictors    
      CRIM      per capita crime rate by town
      ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
      INDUS     proportion of non-retail business acres per town
      CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
      NOX       nitric oxides concentration (parts per 10 million)
      RM        average number of rooms per dwelling
      AGE       proportion of owner-occupied units built prior to 1940
      DIS       weighted distances to five Boston employment centres
      RAD       index of accessibility to radial highways
      TAX       full-value property-tax rate per $10,000
      PTRATIO   pupil-teacher ratio by town
      B         1000(Bk - 0.63)^2 where Bk is the proportion Black by town
      LSTAT     % lower status of the population

    Outcome       
      MEDV      Median value of owner-occupied homes in $1000's

    Getting started

    Load housing data.
        . insheet using https://statalasso.github.io/dta/housing.csv, clear

    Stacking regression with lasso, random forest and gradient boosting.
        . pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost)

    The weights determine how much each base learner contributes to the final stacking prediction.

    Request the RMSPE table (in-sample only):
        . pystacked, table

    Re-estimate using the first 400 observations, and request the RMSPE table.  RMSPEs are reported for both the
    in-sample predictions (both refitted and cross-validated) and the default holdout sample (all unused
    observations):
        . pystacked medv crim-lstat if _n<=400, type(regress) pyseed(123) methods(lassocv rf gradboost)
        . pystacked, table holdout

    Graph predicted vs actual for the holdout sample:
        . pystacked, graph holdout

    Storing the predicted values:
        . predict double yhat, xb

    Storing the cross-validated predicted values:
        . predict double yhat_cv, xb cvalid

    We can also save the predicted values of each base learner:
        . predict double yhat, basexb

    Learner-specific predictors (Syntax 1)

    pystacked allows the use of different sets of predictors for each base learner. For example, linear estimators
    might perform better if interactions are provided as inputs. Here, we use interactions and 2nd-order polynomials
    for the lasso, but not for the other base learners.
        . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf)
            xvars2(c.(crim-lstat)##c.(crim-lstat))

    The same can be achieved using pipelines which create polynomials on-the-fly in Python.
        . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe2(poly2)

    Learner-specific predictors (Syntax 2)

    We demonstrate the same using the alternative syntax, which is often more convenient:

        . pystacked medv crim-lstat || m(ols) || m(lassocv) xvars(c.(crim-lstat)##c.(crim-lstat)) || m(rf) || ,
            type(regress) pyseed(123)
        . pystacked medv crim-lstat || m(ols) || m(lassocv) pipe(poly2) || m(rf) || , type(regress) pyseed(123)

    Options of base learners (Syntax 1)

    We can pass options to the base learners using cmdopt*(string). In this example, we change the maximum tree depth
    for the random forest. Since random forest is the third base learner, we use cmdopt3(max_depth(3)).

        . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2)
            cmdopt3(max_depth(3))

    You can verify that the option has been passed to Python correctly:
        . di e(pyopt3)

    Options of base learners (Syntax 2)

    The same results as above can be achieved using the alternative syntax, which imposes no limit on the number of
    base learners.

        . pystacked medv crim-lstat || m(ols) pipe(poly2) || m(lassocv) pipe(poly2) || m(rf) opt(max_depth(3)) ,
            type(regress) pyseed(123)

    Single base learners

    You can use pystacked with a single base learner.  In this example, we are using a conventional random forest:

        . pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf)

    Voting

    You can also use pre-defined weights. Here, we assign a weight of 0.5 to OLS, .1 to the lasso and, implicitly, .4
    to the random forest.
        . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2)
            voting voteweights(.5 .1)

Classification Example using Spam data

    Data set

    For demonstration we consider the Spambase Data Set from the UCI Machine Learning Repository.  The data includes
    4,601 observations and 57 variables.  The aim is to predict whether an email is spam (i.e., unsolicited commercial
    e-mail) or not.  Each observation corresponds to one email.

    Predictors    
      v1-v48    percentage of words in the e-mail that match a specific word, i.e. 100 * (number of times the word
                  appears in the e-mail) divided by total number of words in e-mail.  To see which word each predictor
                  corresponds to, see link below.
      v49-v54   percentage of characters in the e-mail that match a specific character, i.e. 100 * (number of times
                  the character appears in the e-mail) divided by total number of characters in e-mail.  To see which
                  character each predictor corresponds to, see link below.
      v55       average length of uninterrupted sequences of capital letters
      v56       length of longest uninterrupted sequence of capital letters
      v57       total number of capital letters in the e-mail

    Outcome       
      v58       denotes whether the e-mail was considered spam (1) or not (0).

    For more information about the data see https://archive.ics.uci.edu/ml/datasets/spambase.

    Load spam data.
        . insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma

    Throughout this example, we add the option njobs(4), which enables parallelization with 4 cores.

    We consider three base learners: logit, random forest and gradient boosting:
        . pystacked v58 v1-v57, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)

    Out-of-sample classification.

    As the data is ordered by outcome, we first shuffle the data randomly.
        . set seed 42
        . gen u = runiform()
        . sort u

    Estimation on the first 2000 observations.
        . pystacked v58 v1-v57 if _n<=2000, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2)

    We can get both the predicted class and the predicted probabilities:
        . predict spam, class
        . predict spam_p, pr

    Confusion matrices: first in-sample only, then both in-sample and out-of-sample.
        . pystacked, table
        . pystacked, table holdout

    Confusion matrix for a specified holdout sample.
        . gen h = _n>3000
        . pystacked, table holdout(h)

    ROC curves for the default holdout sample. Specify a subtitle for the combined graph.
        . pystacked, graph(subtitle(Spam data)) holdout

    Histograms of predicted probabilities (hist option) for the default holdout sample.  Specify the number of bins
    for the individual learner graphs.
        . pystacked, graph hist lgraph(bin(20)) holdout


Installation

    pystacked requires Stata 16 or higher, a Python installation and scikit-learn (0.24 or higher).  See this help
    file, this Stata blog entry and this YouTube video for how to set up Python on your system.  Installing Anaconda
    is in most cases the easiest way of installing Python including all required packages.

    You can check your scikit-learn version using:
        . python: import sklearn
        . python: sklearn.__version__

    Updating scikit-learn:  If you use Anaconda, update scikit-learn through your Anaconda Python distribution. Make
    sure that you have linked Stata with the correct Python installation using python query.

    If you use pip, you can update scikit-learn by typing "<Python path> -m pip install -U scikit-learn" into the
    terminal, or directly in Stata:
        . shell <Python path> -m pip install -U scikit-learn

    Note that you might need to restart Stata for changes to your Python installation to take effect.

    For further information, see https://scikit-learn.org/stable/install.html.

    To install/update pystacked, type
        . net install pystacked, from(https://raw.githubusercontent.com/aahrens1/pystacked/main) replace

References

    Breiman, L. (1996). Stacked regressions. Machine Learning, 24, 49-64.

    Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental
    Economics and Management, 5, 81-102.

    Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning: data mining, inference,
    and prediction. Springer Science & Business Media.

    Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
    https://doi.org/10.1016/S0893-6080(05)80023-1

Contact

    If you encounter an error, contact us via email. If you have a question, you can also post on Statalist (please
    tag @Achim Ahrens).

Acknowledgements

    pystacked took some inspiration from Michael Droste's pylearn, which implements other scikit-learn programs for
    Stata.  Thanks to Jan Ditzen for testing an early version of the program. We also thank Brigham Frandsen and Marco
    Alfano for feedback.  All remaining errors are our own.

Citation

    Please also cite scikit-learn; see https://scikit-learn.org/stable/about.html.

Authors

    Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
    achim.ahrens@gess.ethz.ch

    Christian B. Hansen, University of Chicago, USA
    Christian.Hansen@chicagobooth.edu

    Mark E Schaffer, Heriot-Watt University, UK