## Getting started

Before we get into stacking, let’s first use `pystacked` as a “regular” program for machine learning.

### Gradient boosting

We load the example data set and randomly split the data into a training and a test sample.

```
. clear all
. insheet using ///
https://statalasso.github.io/dta/housing.csv, ///
clear comma
. set seed 789
. gen train=runiform()
. replace train=train>.75
```

As an example, we run `pystacked` with gradient boosting:

```
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost)
Single base learner: no stacking done.
Stacking weights:
---------------------------------------
          Method |      Weight
-----------------+---------------------
       gradboost |   1.0000000
```

**Python seed:** Since `pystacked` uses Python, we also need to take the Python seed into account to ensure replicability. `pystacked` does this automatically for you. The default (`pyseed(-1)`) draws a number between 0 and 10^8 in Stata, which is then used as the Python seed. This way, you only need to deal with the Stata seed. For example, `set seed 42` is sufficient, as the Python seed is generated automatically.
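A minimal sketch of the two possibilities (the seed values are arbitrary, and passing an explicit value to `pyseed()` is an assumption based on the default behaviour described above):

```
. * rely on the Stata seed; the Python seed is then drawn automatically
. set seed 42
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost)
. * alternatively, fix the Python seed directly (assumed syntax)
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost) pyseed(42)
```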

We use the `type` option to indicate that we consider a regression task rather than a classification task. `method(gradboost)` selects gradient boosting with regression trees. We will later see that we can specify more than one learner in `method()`.

The output isn’t particularly informative since, for now, we only consider one method. Yet, `pystacked` has fitted 100 boosted trees in the background. We can use `predict` in the Stata-typical way to obtain predicted values:

```
. predict double ygb if !train
```

### Options

We can pass options to scikit-learn and fine-tune our gradient booster. For this, we use the `cmdopt1()` option. We need to use `cmdopt1()` because *gradboost* is the first (and only) method we are using with `pystacked`.

The options of each base learner are listed when you type:

```
. _pyparse, type(reg) method(gradboost) print
Default options:
loss(squared_error) learning_rate(.1) n_estimators(100) subsample(1)
criterion(friedman_mse) min_samples_split(2) min_samples_leaf(1)
min_weight_fraction_leaf(0) max_depth(3) min_impurity_decrease(0)
init(None) random_state(rng) max_features(None) alpha(.9)
max_leaf_nodes(None) warm_start(False) validation_fraction(.1)
n_iter_no_change(None) tol(.0001) ccp_alpha(0)
```
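The same call should also work for the other base learners (an assumption based on the command above); for instance, to list the defaults of the lasso learner used in the Pipelines section below (output omitted):

```
. * assumed: _pyparse accepts any method() supported by pystacked
. _pyparse, type(reg) method(lassocv) print
```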

Check the scikit-learn documentation to get more information about each option.

**scikit-learn documentation:** Please study the scikit-learn documentation carefully when using `pystacked`.

For demonstration, we try two things: (1) we reduce the learning rate from 0.1 to 0.01, and (2) we reduce the learning rate and, in addition, increase the number of trees to 1000.

```
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost) ///
cmdopt1(learning_rate(.01))
. predict double ygb2 if !train
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost) ///
cmdopt1(learning_rate(.01) n_estimators(1000))
. predict double ygb3 if !train
```

We can then compare the performance across the three models:

```
. gen double rgb_sq=(medv-ygb)^2
. gen double rgb2_sq=(medv-ygb2)^2
. gen double rgb3_sq=(medv-ygb3)^2
. sum rgb* if !train
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      rgb_sq |        361    14.19902     38.15371   .0000138   469.2815
     rgb2_sq |        361    28.66256     50.82527   .0019822   437.9858
     rgb3_sq |        361    13.38564     38.00065   8.70e-07   546.1962
```

Reducing the learning rate without increasing the number of trees deteriorates prediction performance; if we also increase the number of trees, however, out-of-sample performance improves slightly relative to the default settings.
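Any of the options listed by `_pyparse` above can be adjusted in the same way. As a purely illustrative sketch (the values are not tuned), we could additionally restrict the tree depth:

```
. * illustrative only: max_depth appears in the option list above
. pystacked medv crim-lstat if train, ///
type(regress) method(gradboost) ///
cmdopt1(learning_rate(.01) n_estimators(1000) max_depth(2))
. predict double ygb4 if !train
```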

### Pipelines

We can make use of pipelines to pre-process our predictors. This will become especially useful in the context of stacking when we want to pass 2nd-order polynomials to one method, but not the other. Here, we use the lasso with and without the `poly2` pipeline:

```
. pystacked medv crim-lstat if train, ///
type(regress) method(lassocv)
. predict double ylasso1 if !train
. pystacked medv crim-lstat if train, ///
type(regress) method(lassocv) ///
pipe1(poly2)
. predict double ylasso2 if !train
. gen double rlasso1_sq=(medv-ylasso1)^2
. gen double rlasso2_sq=(medv-ylasso2)^2
. sum *sq if !train
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      rgb_sq |        361    14.19902     38.15371   .0000138   469.2815
     rgb2_sq |        361    28.66256     50.82527   .0019822   437.9858
     rgb3_sq |        361    13.38564     38.00065   8.70e-07   546.1962
  rlasso1_sq |        361    28.03915     76.37714   .0006696   897.5598
  rlasso2_sq |        361    22.85806     76.73137   4.81e-06   952.1228
```

The interactions and squared terms improve the performance of the lasso. However, gradient boosting performs much better in this application.