Cross-validation #
In the course of cross-validation, the data is repeatedly partitioned into training and validation data. The model is fit to the training data and the validation data is used to calculate the prediction error. This in turn enables us to identify the values of \(\lambda\) and \(\alpha\) that optimize predictive performance (i.e., minimize the estimated mean-squared prediction error).
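Schematically, the \(K\)-fold criterion being minimized can be written as (a standard formulation shown for illustration; the exact weighting used internally may differ):
\[
\widehat{\mathrm{MSPE}}(\lambda,\alpha)=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k}\sum_{i\in\mathcal{F}_k}\left(y_i-\hat{y}_i^{(-k)}(\lambda,\alpha)\right)^2,
\]
where \(\mathcal{F}_k\) is the \(k\)-th validation fold of size \(n_k\) and \(\hat{y}_i^{(-k)}(\lambda,\alpha)\) is the prediction for observation \(i\) from the model fitted without fold \(k\).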
cvlasso supports \(K\)-fold cross-validation and \(h\)-step ahead rolling cross-validation. The latter is intended for time-series or panel data with a large time dimension. \(h\)-step ahead rolling cross-validation was suggested by Rob J Hyndman in a blog post.
K-fold cross-validation #
We begin with 10-fold cross-validation (the default). If no fold variable is specified via the foldvar() option, the data is randomly partitioned into “folds” (a sketch using a user-defined fold variable is shown at the end of this subsection). We use seed(123) throughout this demonstration so that the outputs below can be reproduced.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
K-fold cross-validation with 10 folds. Elastic net with alpha=1.
Fold 1 2 3 4 5 6 7 8 9 10
| Lambda MSPE st. dev.
----------+---------------------------------------------
1| 163.62492 1.3162136 .13064798
2| 149.08894 1.2141972 .12282686
3| 135.84429 1.114079 .11387635
...
17| 36.930468 .5827423 .06260056 ^
...
27| 14.566138 .53408884 .05830419 *
...
100| .01636249 .54838029 .07390164
* lopt = the lambda that minimizes MSPE.
Run model: cvlasso, lopt
^ lse = largest lambda for which MSPE is within one standard error of the minimal MSPE.
Run model: cvlasso, lse
Note that parts of the output have been omitted for the sake of brevity. Columns 2 to 4 show the value of \(\lambda\), the estimate of the mean-squared prediction error (MSPE) and the associated standard error.
The \(\lambda\) value that minimizes the mean-squared prediction error is indicated by an asterisk (*). A hat (^) marks the largest \(\lambda\) at which the MSPE is within one standard error of the minimal MSPE. We denote these by \(\lambda_{lopt}\) and \(\lambda_{lse}\), respectively. The former is returned in e(lopt), the latter in e(lse).
. di e(lopt)
14.566138
. di e(lse)
36.930468
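Instead of relying on the random partition, a fold variable can be supplied via the foldvar() option. A minimal sketch (the variable name fid is our own; any integer variable identifying the folds will do):
. set seed 123
. gen int fid = ceil(10*runiform())    // random fold identifier in 1,...,10
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, foldvar(fid)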
Estimate the selected model #
To estimate the full model with either \(\lambda_{lopt}\) or \(\lambda_{lse}\), we can use the lopt or lse option. Internally, cvlasso calls lasso2 with either lambda(14.566138) or lambda(36.930468).
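For example, estimation with \(\lambda_{lopt}\) corresponds to the following direct lasso2 call (shown only for illustration; using the lopt option as below is more convenient):
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(14.566138)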
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lopt seed(123)
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lse seed(123)
The same results can be obtained using the replay syntax in two steps.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. cvlasso, lopt
. cvlasso, lse
If postest is specified, cvlasso posts the lasso2 estimation results.
. cvlasso, lopt postest
. ereturn list
K-fold cross-validation over lambda and alpha #
alpha() accepts a scalar or a list of elastic net parameters. Each \(\alpha\) value must lie in the interval [0,1]. If alpha() is a list with more than one element, cvlasso cross-validates over \(\lambda\) and \(\alpha\).
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, ///
alpha(0 0.1 0.5 1) seed(123)
Cross-validation over alpha (0 .1 .5 1).
alpha | lopt* Minimum MSPE
------------+----------------------------
0.000 | 12.093063 .54348993
0.100 | 25.454739 .5418149
0.500 | 15.986318 .53499607
1.000 | 14.566138 .53408884 #
* lambda value that minimizes MSPE for a given alpha
# alpha value that minimizes MSPE
The second column of the table shows the value of \(\lambda\) that minimizes the MSPE for a given value of \(\alpha\). A hash key (#) marks the value of \(\alpha\) that minimizes the overall MSPE.
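In this example, \(\alpha=1\) (the lasso) attains the smallest overall MSPE. To estimate the corresponding model, one could, for instance, re-run cross-validation with alpha() fixed at the selected value and apply lopt (a sketch; with alpha(1) this is equivalent to the earlier lasso example):
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(1) lopt seed(123)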
Plotting #
We can plot the estimated mean-squared prediction error over \(\lambda\) . Note that the plotting feature is not supported if we also cross-validate over \(\alpha\) .
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) plotcv
This produces the following graph:
The two vertical lines indicate \(\lambda_{lopt}\) and \(\lambda_{lse}\) (dashed line).
Similar to lasso2, cvlasso allows plotting options to be passed on to Stata’s line command using plotopt().
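For instance, a graph title could be added as follows (the title text is arbitrary and only illustrates passing a standard twoway option):
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) ///
        plotcv plotopt(title("10-fold cross-validation"))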
Prediction #
The predict postestimation command can be used to obtain predicted values and residuals for either e(lopt) or e(lse).
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. cap drop xbhat1
. predict double xbhat1, lopt
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. cap drop xbhat2
. predict double xbhat2, lse
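Residuals can be obtained analogously via the residuals option of predict (a sketch; ehat1 is an arbitrary variable name):
. cap drop ehat1
. predict double ehat1, lopt residuals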
Store intermediate steps #
cvlasso calls lasso2 internally. The saveest(string) option allows access to these intermediate estimation results.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, ///
seed(123) nfolds(3) saveest(step)
. estimates dir
. estimates restore step1
. estimates replay step1
Note: to speed up the computation, EBIC and \(R^2\) are not calculated.
Time-series example using rolling h-step ahead cross-validation #
Load airline passenger data:
. webuse air2, clear
There are 144 observations in the sample. origin() controls the sample range used for training and validation. In this example, origin(130) implies that data up to and including \(t=130\) are used for training in the first iteration. Data points \(t=131,...,144\) are successively used for validation.
. cvlasso air L(1/12).air, rolling origin(130)
Rolling forecasting cross-validation with 1-step ahead forecasts. Elastic net with alpha=1.
Training from-to (validation point): 13-130 (131), 13-131 (132), 13-132 (133), 13-133 (134),
> 13-134 (135), 13-135 (136), 13-136 (137), 13-137 (138), 13-138 (139), 13-139 (140),
> 13-140 (141), 13-141 (142), 13-142 (143), 13-143 (144).
The notation a-b (v) indicates that data points a to b are used for estimation (training), and data point v is used for forecasting (validation). Note that the training dataset starts at \(t=13\), since 12 lags are used as predictors.
The “optimal” model includes lags 1, 11 and 12.
. cvlasso, lopt
Estimate lasso with lambda=315.16 (lopt).
---------------------------------------------------
Selected | Lasso Post-est OLS
------------------+--------------------------------
air |
L1. | 0.1534004 0.1610229
L11. | 0.0638066 0.0724006
L12. | 0.8422566 0.8374074
------------------+--------------------------------
Partialled-out*|
------------------+--------------------------------
|
_cons | 11.5075093 8.2797832
---------------------------------------------------
The option h() controls the forecasting horizon (the default is h(1)).
. cvlasso air L(1/12).air, rolling origin(130) h(2)
Rolling forecasting cross-validation with 2-step ahead forecasts. Elastic net with alpha=1.
Training from-to (validation point): 13-130 (132), 13-131 (133), 13-132 (134), 13-133 (135),
> 13-134 (136), 13-135 (137), 13-136 (138), 13-137 (139), 13-138 (140), 13-139 (141),
> 13-140 (142), 13-141 (143), 13-142 (144).
In the above examples, the size of the training dataset increases by one data point in each step. To keep the size of the training dataset fixed, specify the fixedwindow option.
. cvlasso air L(1/12).air, rolling origin(130) fixedwindow
Rolling forecasting cross-validation with 1-step ahead forecasts. Elastic net with alpha=1.
Training from-to (validation point): 13-130 (131), 14-131 (132), 15-132 (133), 16-133 (134),
> 17-134 (135), 18-135 (136), 19-136 (137), 20-137 (138), 21-138 (139), 22-139 (140),
> 23-140 (141), 24-141 (142), 25-142 (143), 26-143 (144).
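The rolling options can also be combined; for example, a fixed-size training window with a two-step ahead horizon (a sketch; the output is analogous to the examples above):
. cvlasso air L(1/12).air, rolling origin(130) h(2) fixedwindow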
Panel data example using rolling h-step ahead cross-validation #
Rolling cross-validation can also be applied to panel data. For demonstration, load the Grunfeld data.
. webuse grunfeld, clear
Apply 1-step ahead cross-validation.
. cvlasso mvalue L(1/10).mvalue, rolling origin(1950)
Rolling forecasting cross-validation with 1-step ahead forecasts. Elastic net with alpha=1.
Training from-to (validation point): 1945-1950 (1951), 1945-1951 (1952), 1945-1952 (1953),
> 1945-1953 (1954).
The model selected by cross-validation:
. cvlasso, lopt
Estimate lasso with lambda=4828.76 (lopt).
---------------------------------------------------
Selected | Lasso Post-est OLS
------------------+--------------------------------
mvalue |
L1. | 0.7289970 0.7343915
L5. | 0.1181815 0.1239170
L7. | 0.0027785 0.0062233
L8. | 0.0613727 0.0647928
L9. | 0.1014168 0.1031103
------------------+--------------------------------
Partialled-out*|
------------------+--------------------------------
|
_cons | 42.6792365 21.8393696
---------------------------------------------------
More #
Please check the help file for more information and examples.
. help cvlasso