# Help file: cvlasso

```
---------------------------------------------------------------------------------------------------------
help cvlasso                                                                               lassopack v1.2
---------------------------------------------------------------------------------------------------------

Title

cvlasso -- Program for cross-validation using lasso, square-root lasso, elastic net, adaptive lasso
and post-OLS estimators

Syntax

Full syntax

cvlasso depvar regressors [if exp] [in range] [, alpha(numlist) alphacount(int) sqrt adaptive
prestd fe noftools noconstant tolopt(real) tolzero(real) maxiter(int) nfolds(int)
foldvar(varname) savefoldvar(varname) rolling h(int) origin(int) fixedwindow seed(real)
plotcv plotopt(string) saveest(string)]

Note: the fe option will take advantage of the ftools package (if installed) for the
fixed-effects transform; the speed gains using this package can be large.  See help
ftools, or type ssc install ftools to install it.

Estimators            Description
---------------------------------------------------------------------------------------------------
alpha(numlist)         a scalar elastic net parameter or an ascending list of elastic net
parameters.  If the number of alpha values is larger than 1,
cross-validation is conducted over alpha (and lambda).  The default is
alpha=1, which corresponds to the lasso estimator.  The elastic net
parameter controls the degree of L1-norm (lasso-type) to L2-norm
(ridge-type) penalization.  Each alpha value must be in the interval [0,1].
alphacount(int)        number of alpha values used for cross-validation across alpha.  By default,
cross-validation is only conducted across lambda, but not over alpha.
Ignored if alpha() is specified.
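sqrt                   square-root lasso estimator.
adaptive               adaptive lasso estimator.  The penalty loading for predictor j is set to
1/abs(beta0(j))^theta where beta0(j) is the OLS estimate or univariate OLS
estimate if p>n.  Theta is the adaptive exponent, and can be controlled
using the adatheta(real) option.
adaloadings(string)    alternative initial estimates, beta0, used for calculating adaptive
loadings.  For example, this could be the vector e(b) from an initial
lasso2 estimation.  The elements of the vector are raised to the power
-theta (note the minus).  See adaptive option.
adatheta(real)         exponent for calculating adaptive penalty loadings.  See adaptive option.
Default=1.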
ols                    post-estimation OLS.  Note that cross-validation using OLS will in most
cases lead to no unique optimal lambda (since MSPE is a step function over
lambda).
---------------------------------------------------------------------------------------------------
See overview of estimation methods.

Lambda(s)             Description
---------------------------------------------------------------------------------------------------
lambda(numlist)        a scalar lambda value or list of descending lambda values. Each lambda value
must be greater than 0.  If not specified, the default list is used which
is given by exp(rangen(log(lmax),log(lminratio*lmax),lcount)) (see
mf_range).
lcount(integer)†       number of lambda values for which the solution is obtained. Default is 100.
lminratio(real)†       ratio of minimum to maximum lambda. lminratio must be between 0 and 1.
Default is 1/1000.
lmax(real)†            maximum lambda value. Default is 2*max(X'y), and max(X'y) in the case of the
square-root lasso (where X is the pre-standardized regressor matrix and y
is the vector of the response variable).
lopt                   after cross-validation, estimate model with lambda that minimized the
mean-squared prediction error
lse                    after cross-validation, estimate model with the largest lambda for which the
MSPE is within one standard error of the minimal MSPE
---------------------------------------------------------------------------------------------------
† Not applicable if lambda() is specified.

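Loadings & standardization Description
---------------------------------------------------------------------------------------------------
notpen(varlist)        sets penalty loadings to zero for predictors in varlist.  Unpenalized
predictors are always included in the model.
partial(varlist)       variables in varlist are partialled out prior to estimation.
ploadings(matrix)      a row-vector of penalty loadings; overrides the default standardization
loadings (in the case of the lasso, =sqrt(avg(x^2))).  The size of the
vector should equal the number of predictors (excluding partialled out
variables and excluding the constant).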
prestd                 dependent variable and predictors are standardized prior to estimation
rather than standardized "on the fly" using penalty loadings.  See the lasso2
help file for more details.  By default the coefficient estimates are un-standardized
(i.e., returned in original units).
---------------------------------------------------------------------------------------------------
See discussion of standardization in the lasso2 help file.  Also see Section Data transformations
in cross-validation below.

FE & constant         Description
---------------------------------------------------------------------------------------------------
fe                     within-transformation is applied prior to estimation. Requires data to be
xtset.
noftools               do not use FTOOLS package for fixed-effects transform (slower; rarely used)
noconstant             suppress constant from estimation.  Default behaviour is to partial the
constant out (i.e., to center the regressors).
---------------------------------------------------------------------------------------------------

Optimization          Description
---------------------------------------------------------------------------------------------------
tolopt(real)           tolerance for lasso shooting algorithm (default=1e-10)
tolzero(real)          minimum below which coeffs are rounded down to zero (default=1e-4)
maxiter(int)           maximum number of iterations for the lasso shooting algorithm
(default=10,000)
---------------------------------------------------------------------------------------------------

Fold variable options Description
---------------------------------------------------------------------------------------------------
nfolds(integer)        the number of folds used for K-fold cross-validation. Default is 10.
foldvar(varname)       user-specified variable with fold IDs, ranging from 1 to #folds.  If not
specified, fold IDs are randomly generated such that each fold is of
approximately equal size.
savefoldvar(varname)   saves the fold ID variable.  Not supported in combination with rolling.
rolling                uses rolling h-step ahead cross-validation. Requires the data to be tsset.
h(integer)‡            changes the forecasting horizon. Default is 1.
origin(integer)‡       controls the number of observations in the first training dataset.
fixedwindow‡           ensures that the size of the training dataset is always the same.
seed(real)             set seed for the generation of a random fold variable. Only relevant if fold
variable is randomly generated.
---------------------------------------------------------------------------------------------------
‡ Only applicable with rolling option.

Plotting options      Description
---------------------------------------------------------------------------------------------------
plotcv                 plots the estimated mean-squared prediction error as a function of
ln(lambda)
plotopt(string)        overwrites the default plotting options. All options are passed on to line.
---------------------------------------------------------------------------------------------------

Display options       Description
---------------------------------------------------------------------------------------------------
omitgrid               suppresses the display of mean-squared prediction errors
---------------------------------------------------------------------------------------------------

Store lasso2 results  Description
---------------------------------------------------------------------------------------------------
saveest(string)        saves lasso2 results from each step of the cross-validation in string1, ...,
stringK where K is the number of folds.  Intermediate results can be
restored using estimates restore.
---------------------------------------------------------------------------------------------------

cvlasso may be used with time-series or panel data, in which case the data must be tsset or xtset
first; see help tsset or xtset.

All varlists may contain time-series operators or factor variables; see help varlist.

Replay syntax

cvlasso [, lopt lse postresults plotcv(method) plotopt(string)]

Replay options        Description
---------------------------------------------------------------------------------------------------
lopt                   show estimation results using the model corresponding to lambda=e(lopt)
lse                    show estimation results using the model corresponding to lambda=e(lse)
postresults            post lasso2 estimation results (to be used in combination with lse or lopt)
plotcv(method)         see plotting options above
plotopt(string)        see plotting options above
---------------------------------------------------------------------------------------------------

Postestimation:

predict [type] newvar [if] [in] [, xb residuals lopt lse noisily]

Predict options       Description
---------------------------------------------------------------------------------------------------
xb                     compute predicted values (the default)
residuals              compute residuals
lopt                   use lambda that minimized the mean-squared prediction error
lse                    use the largest lambda for which the MSPE is within one standard error of
the minimal MSPE
noisily                show estimation output if re-estimation required.
---------------------------------------------------------------------------------------------------

Contents

Description
Partitioning of folds
Data transformations in cross-validation
Examples of usage
--General demonstration
--Rolling cross-validation with time-series data
--Rolling cross-validation with panel data
Saved results
References
Website
Installation
Acknowledgements
Citation of lassopack

Description

cvlasso implements K-fold cross-validation and h-step ahead rolling cross-validation for the
following estimators: lasso, square-root lasso, adaptive lasso, ridge regression, elastic net.  See
help lasso2 for more information on these estimators.

The purpose of cross-validation is to assess the out-of-sample prediction performance of the
estimator.

The steps for K-fold cross-validation over lambda can be summarized as follows:

1. Split the data into K groups, referred to as folds, of approximately equal size. Let n(k) denote
the number of observations in the kth data partition with k=1,...,K.

2. The first fold is treated as the validation dataset and the remaining K-1 parts constitute the
training dataset.  The model is fit to the training data for a given value of lambda.  The
resulting estimate is denoted as betahat(1,lambda).  The mean-squared prediction error for group 1
is computed as

MSPE(1,lambda)=1/n(1)*sum([y(i) - x(i)'betahat(1,lambda)]^2)

for all i in the first data partition.

The procedure is repeated for k=2,...,K.  Thus, MSPE(2,lambda), ..., MSPE(K,lambda) are calculated.

3. The K-fold cross-validation estimate of the MSPE, which serves as a measure of prediction
performance, is

CV(lambda)=1/K*sum(MSPE(k,lambda)).

4. Steps 2 and 3 are repeated for a range of lambda values.

h-step ahead rolling cross-validation proceeds in a similar way, except that the partitioning of
training and validation takes account of the time-series structure.  Specifically, the training
window is iteratively extended (or moved forward) by one step.  See below for more details.

Partitioning of folds

cvlasso supports K-fold cross-validation and cross-validation using rolling h-step ahead forecasts.
K-fold cross-validation is the standard approach and relies on a fold ID variable.  Rolling h-step
ahead cross-validation is applicable with time-series data, or panels with large time dimension.

K-fold cross-validation

The fold ID variable marks the observations which are used as validation data.  For example, a fold
ID variable (with three folds) could have the following structure:

+------------------+
| fold   y      x  |
|------------------|
|  3     y1     x1 |
|  2     y2     x2 |
|  1     y3     x3 |
|  3     y4     x4 |
|  1     y5     x5 |
|  2     y6     x6 |
+------------------+

It is instructive to illustrate the cross-validation process implied by the above fold ID variable.
Let T denote a training observation and V denote a validation point.  The division of folds can be
summarized as follows:

          Step

         1  2  3
       +-       -+
     1 | T  T  V |
     2 | T  V  T |
     3 | V  T  T |
 i   4 | T  T  V |
     5 | V  T  T |
     6 | T  V  T |
       +-       -+

In the first step, the 3rd and 5th observations are in the validation dataset and the remaining data
constitute the training dataset.  In the second step, the validation dataset includes the 2nd and
6th observations, etc.

By default, the fold ID variable is randomly generated such that each fold is of approximately
equal size.  The default number of folds is equal to 10, but can be changed using the nfolds()
option.

Rolling h-step ahead cross-validation

To allow for time-series data, cvlasso supports cross-validation using rolling h-step forecasts
(option rolling); see Hyndman, 2016.  To use rolling cross-validation, the data must be tsset or
xtset.  The options h() and origin() control the forecasting horizon and the starting point of the
rolling forecast, respectively.

The following matrix illustrates the division between training and validation data over the course
of the cross-validation for the case of 1-step ahead forecasting (the default when rolling is
specified).

             Step

         1  2  3  4  5
       +-             -+
     1 | T  T  T  T  T |
     2 | T  T  T  T  T |
     3 | T  T  T  T  T |
 t   4 | V  T  T  T  T |
     5 | .  V  T  T  T |
     6 | .  .  V  T  T |
     7 | .  .  .  V  T |
     8 | .  .  .  .  V |
       +-             -+

In the first iteration (illustrated in the first column), the first three observations are in the
training dataset, which corresponds to origin(3).  The option h() controls the forecasting horizon
used for cross-validation (the default is 1).  If h(2) is specified, which corresponds to 2-step
ahead forecasting, the structure changes to:

             Step

         1  2  3  4  5
       +-             -+
     1 | T  T  T  T  T |
     2 | T  T  T  T  T |
     3 | T  T  T  T  T |
     4 | .  T  T  T  T |
 t   5 | V  .  T  T  T |
     6 | .  V  .  T  T |
     7 | .  .  V  .  T |
     8 | .  .  .  V  . |
     9 | .  .  .  .  V |
       +-             -+

The fixedwindow option ensures that the size of the training dataset is always the same. In this
example (using h(1)), each step uses three data points for training:

             Step

         1  2  3  4  5
       +-             -+
     1 | T  .  .  .  . |
     2 | T  T  .  .  . |
     3 | T  T  T  .  . |
 t   4 | V  T  T  T  . |
     5 | .  V  T  T  T |
     6 | .  .  V  T  T |
     7 | .  .  .  V  T |
     8 | .  .  .  .  V |
       +-             -+

Data transformations in cross-validation

An important principle in cross-validation is that the training dataset should not contain
information from the validation dataset.  This mimics the real-world situation where out-of-sample
predictions are made not knowing what the true response is.  The principle applies not only to
individual observations (the training and validation data do not overlap) but also to data
transformations.  Specifically, data transformations applied to the training data should not use
information from the validation data or full dataset.  In particular, standardization using the
full sample violates this principle.

cvlasso implements this principle for all data transformations supported by lasso2:  data
standardization, fixed effects and partialling-out.  In most applications using the estimators
supported by cvlasso, predictors are standardized to have mean zero and unit variance.  The above
principle means that the standardization applied to the training data is based only on observations
in the training data; further, the standardization transformation applied to the validation data
will also be based only on the means and variances of the observations in the training data.  The
same applies to the fixed effects transformation:  the group means used to implement the within
transformation to both the training data and the validation data are calculated using only the
training data.  Similarly, the projection coefficients used to "partial out" variables are
estimated using only the training data and are applied to both the training dataset and the
validation dataset.

General introduction using K-fold cross-validation

Dataset

The dataset is available through Hastie et al. (2015) on the authors' website.  The following
variables are included in the dataset of 97 men:

Predictors
lcavol    log(cancer volume)
lweight   log(prostate weight)
age       patient age
lbph      log(benign prostatic hyperplasia amount)
svi       seminal vesicle invasion
lcp       log(capsular penetration)
gleason   Gleason score
pgg45     percentage Gleason scores 4 or 5

Outcome
lpsa      log(prostate specific antigen)

. insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear tab

General demonstration

10-fold cross-validation across lambda.  The lambda value that minimizes the mean-squared
prediction error is indicated by an asterisk (*).  A hat (^) marks the largest lambda at which the
MSPE is within one standard error of the minimal MSPE.  The former is returned in e(lopt), the
latter in e(lse).  We use seed(123) throughout this demonstration for replicability of folds.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. di e(lopt)
. di e(lse)

Estimate the full model

Estimate the full model with either e(lopt) or e(lse).  cvlasso internally calls lasso2 with
lambda=lopt or lse, respectively.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lopt seed(123)
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lse seed(123)

The same as above can be achieved using the replay syntax.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. cvlasso, lopt
. cvlasso, lse

If postresults is specified, cvlasso posts the lasso2 estimation results.
. cvlasso, lopt postres
. ereturn list

Cross-validation over lambda and alpha

alpha() can be a scalar or list of elastic net parameters.  Each alpha value must lie in the
interval [0,1].  If alpha() is a list longer than 1, cvlasso cross-validates over lambda and alpha.
The table at the end of the output indicates the alpha value that minimizes the empirical MSPE.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0 0.1 0.5 1) lc(10) seed(123)

Alternatively, the alphacount() option can be used to control the number of alpha values used for
cross-validation.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, alphac(3) lc(10) seed(123)

Plotting

We can plot the estimated mean-squared prediction error over lambda.  Note that the plotting
feature is not supported if we cross-validate over alpha.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) plotcv

Prediction

The predict postestimation command can be used to obtain predicted values and residuals for
lambda=e(lopt) or lambda=e(lse).
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. cap drop xbhat1
. predict double xbhat1, lopt
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123)
. cap drop xbhat2
. predict double xbhat2, lse

Store intermediate steps

cvlasso internally calls lasso2.  To see intermediate estimation results, we can use the
saveest(string) option.
. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) nfolds(3) saveest(step)
. estimates dir
. estimates restore step1
. estimates replay step1

Time-series example using rolling h-step ahead cross-validation

. webuse air2, clear

There are 144 observations in the sample.  origin() controls the sample range used for training and
validation.  In this example, origin(130) implies that data up to and including t=130 are used for
training in the first iteration.  Data points t=131 to 144 are successively used for validation.
The notation `a-b (v)' indicates that data a to b are used for estimation (training), and data
point v is used for forecasting (validation).  Note that the training dataset starts with t=13
since 12 lags are used as predictors.
. cvlasso air L(1/12).air, rolling origin(130)

The optimal model includes lags 1, 11 and 12.
. cvlasso, lopt

The option h() controls the forecasting horizon (default=1).
. cvlasso air L(1/12).air, rolling origin(130) h(2)

In the above examples, the size of the training dataset increases by one data point each step.  To
keep the size of the training dataset fixed, specify fixedwindow.
. cvlasso air L(1/12).air, rolling origin(130) fixedwindow

Cross-validation over alpha with alpha={0, 0.1, 0.5, 1}.
. cvlasso air L(1/12).air, rolling origin(130) alpha(0 0.1 0.5 1)

Plot mean-squared prediction errors against ln(lambda).
. cvlasso air L(1/12).air, rolling origin(130)
. cvlasso, plotcv

Panel data example using rolling h-step ahead cross-validation

Rolling cross-validation can also be applied to panel data.  For demonstration, load Grunfeld data.
. webuse grunfeld, clear

. cvlasso mvalue L(1/10).mvalue, rolling origin(1950)

The model selected by cross-validation:
. cvlasso, lopt

Same as above with fixed size of training data.
. cvlasso mvalue L(1/10).mvalue, rolling origin(1950) fixedwindow

Saved results

cvlasso saves the following in e():

scalars
e(N)               sample size
e(nfolds)          number of folds
e(lmax)            largest lambda
e(lmin)            smallest lambda
e(lcount)          number of lambdas
e(sqrt)            =1 if sqrt-lasso, 0 otherwise
e(ols)             =1 if post-estimation OLS, 0 otherwise
e(partial_ct)      number of partialled out predictors
e(notpen_ct)       number of not penalized predictors
e(prestd)          =1 if pre-standardized, 0 otherwise
e(nalpha)          number of alphas
e(h)               forecasting horizon for rolling forecasts (only returned if rolling is
specified)
e(origin)          number of observations in first training dataset (only returned if rolling is
specified)
e(lopt)            optimal lambda (may be missing if no unique minimum MSPE)
e(lse)             lambda se (may be missing if no unique minimum MSPE)
e(mspemin)         minimum MSPE

macros
e(cmd)             cvlasso
e(method)          indicates which estimator is used (e.g. lasso, elastic net)
e(cvmethod)        indicates whether K-fold or rolling cross-validation is used
e(varXmodel)       predictors (excluding partialled-out variables)
e(varX)            predictors
e(partial)         partialled out predictors
e(notpen)          not penalized predictors
e(depvar)          dependent variable

matrices

e(lambdamat)       column vector of lambda values

functions
e(sample)          estimation sample

In addition, if cvlasso cross-validates over alpha and lambda:

scalars
e(alphamin)        optimal alpha, i.e., the alpha that minimizes the empirical MSPE

macros
e(alphalist)       list of alpha values

matrices
e(mspeminmat)      minimum MSPE for each alpha

In addition, if cvlasso cross-validates over lambda only:

scalars
e(alpha)           elastic net parameter

matrices
e(mspe)            matrix of MSPEs for each fold and lambda where each column corresponds to one
lambda value and each row corresponds to one fold.
e(mmspe)           column vector of MSPEs for each lambda
e(cvsd)            column vector of standard deviations of the MSPE (for each lambda)
e(cvupper)         column vector equal to MSPE + 1 standard deviation
e(cvlower)         column vector equal to MSPE - 1 standard deviation

References

Correia, S. 2016.  FTOOLS: Stata module to provide alternatives to common Stata commands optimized
for large datasets.  https://ideas.repec.org/c/boc/bocode/s458213.html

Hyndman, R.J. 2016.  Cross-validation for time series.  Hyndsight blog, 5 December 2016.
https://robjhyndman.com/hyndsight/tscv/

See lasso2 for further references.

Website

See https://statalasso.github.io/ for more information.

Installation

To get the latest stable version of lassopack from our website, check the installation instructions
at https://statalasso.github.io/installation/.  We update the stable website version more
frequently than the SSC version.

To verify that lassopack is correctly installed, type whichpkg lassopack (which
requires whichpkg to be installed; ssc install whichpkg).

Acknowledgements

Thanks to Sergio Correia for advice on the use of the FTOOLS package.

Citation of cvlasso

cvlasso is not an official Stata command. It is a free contribution to the research community, like
a paper. Please cite it as such:

Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018.  cvlasso:  Program for cross-validation using lasso,
square-root lasso, elastic net, adaptive lasso and post-OLS estimators.
http://ideas.repec.org/c/boc/bocode/s458458.html

Authors

Achim Ahrens, Economic and Social Research Institute, Ireland
achim.ahrens@esri.ie

Christian B. Hansen, University of Chicago, USA
Christian.Hansen@chicagobooth.edu

Mark E Schaffer, Heriot-Watt University, UK
m.e.schaffer@hw.ac.uk

Also see

Help: lasso2, rlasso (if installed)
```
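
The training/validation matrices shown under "Partitioning of folds" can be reproduced with a short Python sketch. This is an illustration only and not cvlasso's own code: the function names `kfold_layout` and `rolling_layout`, and the encoding of each row as a string of `T`, `V` and `.`, are assumptions made here. The rolling rule (train up to `origin + s`, validate on the single point `h` steps later) is inferred from the matrices in the help text.

```python
def kfold_layout(fold_ids):
    """Rows are observations, columns are steps 1..K.

    An observation is validation data ('V') in the step equal to its fold ID
    and training data ('T') in every other step.
    """
    n_folds = max(fold_ids)
    return [
        "".join("V" if fold == step else "T" for step in range(1, n_folds + 1))
        for fold in fold_ids
    ]


def rolling_layout(n_obs, origin, h=1, fixedwindow=False):
    """Rows are time points 1..n_obs, columns are cross-validation steps.

    Step s (counting from 0) trains on data up to t = origin + s and validates
    on the single point t = origin + s + h. With fixedwindow=True the start of
    the training window moves forward too, so its length stays at `origin`.
    """
    n_steps = n_obs - origin - h + 1
    if n_steps < 1:
        raise ValueError("origin + h must not exceed the number of observations")
    rows = []
    for t in range(1, n_obs + 1):
        cells = []
        for s in range(n_steps):
            train_start = s + 1 if fixedwindow else 1
            train_end = origin + s
            if t == train_end + h:
                cells.append("V")
            elif train_start <= t <= train_end:
                cells.append("T")
            else:
                cells.append(".")  # observation not used in this step
        rows.append("".join(cells))
    return rows


if __name__ == "__main__":
    examples = [
        ("K-fold, fold IDs 3 2 1 3 1 2", kfold_layout([3, 2, 1, 3, 1, 2])),
        ("rolling, origin 3, h 1", rolling_layout(8, 3)),
        ("rolling, origin 3, h 2", rolling_layout(9, 3, h=2)),
        ("rolling, origin 3, fixed window", rolling_layout(8, 3, fixedwindow=True)),
    ]
    for title, layout in examples:
        print(title)
        for i, row in enumerate(layout, start=1):
            print(f"{i:>3} | {'  '.join(row)} |")
```

Running the script prints the four layouts in the same row and column order as the matrices in the help text, which makes it easy to check how `origin`, `h` and a fixed window change the split before running a long cross-validation.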
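
Steps 3 and 4 of the K-fold procedure in the Description, together with the lopt and lse choices, can be sketched from a fold-by-lambda matrix of MSPEs (the layout documented for e(mspe)). The sketch below is hedged: the function name `cv_summary` is hypothetical, and the dispersion measure is assumed to be the standard error across folds (sample standard deviation divided by sqrt(K)). The help text does not spell out that formula, so results may differ from what cvlasso reports.

```python
import math


def cv_summary(mspe, lambdas):
    """Summarize a K-fold cross-validation over lambda.

    mspe[k][j] is the MSPE of fold k at lambdas[j]; lambdas are descending.
    Returns CV(lambda) for each lambda, the minimizing lambda (lopt) and the
    largest lambda whose CV value is within one standard error of the minimum
    (lse). The standard-error formula is an assumption of this sketch.
    """
    n_folds = len(mspe)
    n_lambda = len(lambdas)
    # CV(lambda) = 1/K * sum over folds of MSPE(k, lambda)
    cv = [sum(mspe[k][j] for k in range(n_folds)) / n_folds for j in range(n_lambda)]
    se = []
    for j in range(n_lambda):
        var = sum((mspe[k][j] - cv[j]) ** 2 for k in range(n_folds)) / (n_folds - 1)
        se.append(math.sqrt(var / n_folds))
    j_opt = min(range(n_lambda), key=lambda j: cv[j])
    threshold = cv[j_opt] + se[j_opt]
    lse = max(lam for lam, value in zip(lambdas, cv) if value <= threshold)
    return {"cv": cv, "se": se, "lopt": lambdas[j_opt], "lse": lse}


if __name__ == "__main__":
    # Made-up MSPEs for three folds and four lambda values, for illustration.
    lambdas = [4.0, 2.0, 1.0, 0.5]
    mspe = [
        [3.0, 1.2, 0.5, 1.3],
        [3.4, 1.4, 1.1, 1.1],
        [3.2, 1.6, 1.7, 1.5],
    ]
    summary = cv_summary(mspe, lambdas)
    print("CV(lambda):", [round(v, 3) for v in summary["cv"]])
    print("lopt:", summary["lopt"], "lse:", summary["lse"])
```

With these made-up numbers the minimum sits at lambda=1, while lambda=2 is the largest value within one standard error of it, so lse selects the more heavily penalized model.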