Help file: lasso2

---------------------------------------------------------------------------------------------------------
help lasso2                                                                                lassopack v1.2
---------------------------------------------------------------------------------------------------------

Title

    lasso2 --  Program for lasso, square-root lasso, elastic net, ridge, adaptive lasso and
                 post-estimation OLS

Syntax

    Full syntax

        lasso2 depvar regressors [if exp] [in range] [, alpha(real) sqrt adaptive adaloadings(string)
              adatheta(real) ols lambda(numlist) lcount(integer) lminratio(real) lmax(real)
              notpen(varlist) partial(varlist) norecover ploadings(string) unitloadings prestd stdcoef
              fe noftools noconstant tolopt(real) tolzero(real) maxiter(int) plotpath(method)
              plotvar(varlist) plotopt(string) plotlabel ic(string) lic(string) ebicgamma(real) noic
              long displayall postall postresults verbose vverbose wnorm]

        Note: the fe option will take advantage of the ftools package (if installed) for the
              fixed-effects transformation; the speed gains using this package can be large.  See help
              ftools or click on ssc install ftools to install.

    Estimators            Description
    ---------------------------------------------------------------------------------------------------
    alpha(real)            elastic net parameter, which controls the relative contribution of L1-norm
                            (lasso-type) to L2-norm (ridge-type) penalization.  alpha=1 corresponds
                            to the lasso (the default estimator), and alpha=0 to ridge regression.
                            alpha must be in the interval [0,1].
    sqrt                   square-root lasso estimator.
    adaptive               adaptive lasso estimator.  The penalty loading for predictor j is set to
                            1/abs(beta0(j))^theta where beta0(j) is the OLS estimate or univariate OLS
                            estimate if p>n.  Theta is the adaptive exponent, and can be controlled
                            using the adatheta(real) option.
    adaloadings(string)    alternative initial estimates, beta0, used for calculating adaptive
                            loadings.  For example, this could be the vector e(b) from an initial
                            lasso2 estimation.  The elements of the vector are raised to the power
                            -theta (note the minus).  See adaptive option.
    adatheta(real)         exponent for calculating adaptive penalty loadings. See adaptive option.
                            Default=1.
    ols                    post-estimation OLS.  If lambda is a list, post-estimation OLS results are
                            displayed and returned in e(betas).  If lambda is a scalar, post-estimation
                            OLS is always displayed, and this option controls whether standard or
                            post-estimation OLS results are stored in e(b).
    ---------------------------------------------------------------------------------------------------
    See overview of estimation methods.

    Lambda(s)             Description
    ---------------------------------------------------------------------------------------------------
    lambda(numlist)        a scalar lambda value or list of descending lambda values. Each lambda value
                            must be greater than 0.  If not specified, the default list is used which
                            is given by exp(rangen(log(lmax),log(lminratio*lmax),lcount)) (see 
                            mf_range).
    lcount(integer)†       number of lambda values for which the solution is obtained. Default is 100.
    lminratio(real)†       ratio of minimum to maximum lambda. lminratio must be between 0 and 1.
                            Default is 1/1000.
    lmax(real)†            maximum lambda value. Default is 2*max(X'y), and max(X'y) in the case of the
                            square-root lasso (where X is the pre-standardized regressor matrix and y
                            is the vector of the response variable).
    lic(string)            after first lasso2 estimation using list of lambdas, estimate model
                            corresponding to minimum information criterion.  'aic', 'bic', 'aicc', and
                            'ebic' (the default) are allowed.  Note the lower case spelling.  See
                            Information criteria for the definition of each information criterion.
    ebicgamma(real)        controls the xi parameter of the EBIC.  xi needs to lie in the [0,1]
                            interval.  xi=0 is equivalent to the BIC.  The default choice is
                            xi=1-log(n)/(2*log(p)).
    postresults            Used in combination with lic().  Stores estimation results of the model
                            selected by information criterion in e().
    ---------------------------------------------------------------------------------------------------
    † Not applicable if lambda() is specified.

    Loadings & standardization Description
    ---------------------------------------------------------------------------------------------------
    notpen(varlist)        sets penalty loadings to zero for predictors in varlist.  Unpenalized
                            predictors are always included in the model.
    partial(varlist)       variables in varlist are partialled out prior to estimation.
    norecover              suppresses recovery of partialled out variables after estimation.
    ploadings(matrix)      a row-vector of penalty loadings; overrides the default standardization
                            loadings (in the case of the lasso, =sqrt(avg(x^2))).  The size of the
                            vector should equal the number of predictors (excluding partialled out
                            variables and excluding the constant).
    unitloadings           penalty loadings set to a vector of ones; overrides the default
                            standardization loadings (in the case of the lasso, =sqrt(avg(x^2))).
    prestd                 dependent variable and predictors are standardized prior to estimation
                            rather than standardized "on the fly" using penalty loadings; see the
                            section Standardization of variables for details.  By default the
                            coefficient estimates are un-standardized (i.e., returned in original
                            units).
    stdcoef                return coefficients in standard deviation units, i.e., don't un-standardize.
                            Only supported with prestd option.
    ---------------------------------------------------------------------------------------------------
    See discussion of standardization.

    FE & constant         Description
    ---------------------------------------------------------------------------------------------------
    fe                     within-transformation is applied prior to estimation. Requires data to be
                            xtset.
    noftools               do not use ftools package for fixed-effects transform (slower; rarely used)
    noconstant             suppress constant from estimation.  Default behaviour is to partial the
                            constant out (i.e., to center the regressors).
    ---------------------------------------------------------------------------------------------------

    Optimization          Description
    ---------------------------------------------------------------------------------------------------
    tolopt(real)           tolerance for lasso shooting algorithm (default=1e-10)
    tolzero(real)          minimum below which coeffs are rounded down to zero (default=1e-4)
    maxiter(int)           maximum number of iterations for the lasso shooting algorithm
                            (default=10,000)
    ---------------------------------------------------------------------------------------------------

    Plotting options*     Description
    ---------------------------------------------------------------------------------------------------
    plotpath(method)       plots the coefficients path as a function of the L1-norm (norm), lambda
                            (lambda) or the log of lambda (lnlambda)
    plotvar(varlist)       list of variables to be included in the plot
    plotopt(string)        additional plotting options passed on to line. For example, use
                            plotopt(legend(off)) to turn off the legend.
    plotlabel              displays variable labels in graph.
    ---------------------------------------------------------------------------------------------------
    * Plotting is not available if lambda is a scalar value.

    Display options       Description
    ---------------------------------------------------------------------------------------------------
    displayall*            display full coefficient vectors including unselected variables (default:
                            display only selected, unpenalized and partialled-out)
    postall*               post full coefficient vector including unselected variables in e(b)
                            (default: e(b) has only selected, unpenalized and partialled-out)
    long†                  show long output; instead of showing only the points at which predictors
                            enter or leave the model, all models are shown.
    verbose                show additional output
    vverbose               show even more output
    ic(string)†            controls which information criterion is shown in the output.  'aic', 'bic',
                            'aicc', and 'ebic' (the default) are allowed.  Note the lower case
                            spelling.  See Information criteria for the definition of each information
                            criterion.
    noic†                  suppresses the calculation of information criteria.  This will lead to speed
                            gains if alpha<1, since calculation of effective degrees of freedom
                            requires one inversion per lambda.
    wnorm†                 displays L1 norm of beta estimates weighted by penalty loadings, i.e.,
                            ||Psi*beta||(1) instead of ||beta||(1), which is the default.  Note that
                            this also affects plotting if plotpath(norm) is specified.
    ---------------------------------------------------------------------------------------------------
    * Only applicable if lambda is a scalar value.  † Only applicable if lambda is a list (the
    default).

    Replay syntax

        lasso2 [, plotpath(method) plotvar(varlist) plotopt(string) plotlabel long postresults
              lic(string) ic(string) wnorm]

    Replay options        Description
    ---------------------------------------------------------------------------------------------------
    long                   show long output; instead of showing only the points at which predictors
                            enter or leave the model, all models are shown.
    ic(string)             controls which information criterion is shown in the output.  'aic', 'bic',
                            'aicc', and 'ebic' (the default) are allowed.  Note the lower case
                            spelling.  See Information criteria for the definition of each information
                            criterion.
    lic(string)            estimate model corresponding to minimum information criterion.  'aic',
                            'bic', 'aicc', and 'ebic' (the default) are allowed.  Note the lower case
                            spelling.  See Information criteria for the definition of each information
                            criterion.
    postresults            store estimation results in e() if lic(string) is used
    plotpath(method)       see Plotting options above
    plotvar(varlist)       see Plotting options above
    plotopt(string)        see Plotting options above
    plotlabel              see Plotting options above
    ---------------------------------------------------------------------------------------------------
    Only applicable if lambda was a list in the previous lasso2 estimation.

    Postestimation:

        predict [type] newvar [if] [in] [, xb residuals ols lambda(real) lid(int) approx noisily
              postresults]

    Predict options       Description
    ---------------------------------------------------------------------------------------------------
    xb                     compute predicted values (the default)
    residuals              compute residuals
    ols                    use post-estimation OLS for prediction
    lambda(real)‡          lambda value for prediction. Ignored if lasso2 was called with scalar lambda
                            value.
    lid(int)‡              index of lambda value for prediction.
    lic(string)            selects which information criterion to use for prediction.
    approx‡                linear approximation is used instead of re-estimation.  Faster, but only
                            exact if coefficient path is piecewise linear.  Only supported in
                            combination with lambda().
    noisily‡               show estimation output if re-estimation required
    postresults‡           store estimation results in e() if re-estimation is used
    ---------------------------------------------------------------------------------------------------
    ‡ Only applicable if lambda was a list in the previous lasso2 estimation.

    lasso2 may be used with time-series or panel data, in which case the data must be tsset or xtset
    first; see help tsset or xtset.

    All varlists may contain time-series operators or factor variables; see help varlist.

Contents

    Description
    Coordinate descent algorithm
    Penalization level
    Standardization of variables
    Information criteria
    Estimators
    Examples and demonstration
    --Data set
    --General demonstration
    --Information criteria
    --Plotting
    --Predicted values
    --Standardization
    --Penalty loadings and notpen()
    --Partialling vs penalization
    --Adaptive lasso
    Saved results
    References
    Website
    Installation
    Acknowledgements
    Citation of lasso2


Description

    lasso2 solves the following problem

        1/N*RSS + lambda/N*alpha*||Psi*beta||[1] + lambda/(2*N)*(1-alpha)*||Psi*beta||[2]^2, 
        
    where

    RSS        = sum(y(i)-x(i)'beta)^2 denotes the residual sum of squares,
    beta       is a p-dimensional parameter vector,
    lambda     is the overall penalty level,
    ||.||[j]   denotes the L(j) vector norm for j=1,2;
    alpha      the elastic net parameter, which determines the relative contribution of L1 (lasso-type)
                to L2 (ridge-type) penalization.
    Psi        is a p by p diagonal matrix of predictor-specific penalty loadings. Note that lasso2
                treats Psi as a row vector.
    N          number of observations

    Note: the above lambda differs from the definition used in parts of the lasso and elastic net
    literature; see for example the R package glmnet by Friedman et al. (2010).  We have here adopted
    an objective function following Belloni et al. (2012).  Specifically, lambda=2*N*lambda(GN) where
    lambda(GN) is the penalty level used by glmnet.
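
    For example, with N=100 observations a glmnet penalty of lambda(GN)=0.5 corresponds to
    lambda=2*100*0.5=100 in the parameterization used here.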

    In addition, if the option sqrt is specified, lasso2 estimates the square-root lasso (sqrt-lasso)
    estimator, which is defined as the solution to the following objective function:

        sqrt(1/N*RSS) + lambda/N*||Psi*beta||[1]. 
        
Coordinate descent algorithm

    lasso2 implements the elastic net and sqrt-lasso using coordinate descent algorithms.  The
    algorithm (then referred to as "shooting") was first proposed by Fu (1998) for the lasso, and by
    Van der Kooij (2007) for the elastic net.  Belloni et al. (2011) implement the coordinate descent
    for the sqrt-lasso, and have kindly provided Matlab code.

    Coordinate descent algorithms repeatedly cycle over predictors j=1,...,p and update single
    coefficient estimates until convergence.  Suppose the predictors are centered and standardized to
    have unit variance.  In that case, the update for coefficient j is obtained using univariate
    regression of the current partial residuals (i.e., excluding the contribution of predictor j)
    against predictor j.

    The algorithm requires an initial beta estimate, for which the ridge estimate is used.  If the
    coefficient path is obtained for a list of lambda values, lasso2 starts from the largest lambda
    value and uses previous estimates as warm starts.

    See Friedman et al. (2007, 2010), and references therein, for further information.
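
    For illustration only, the following Mata sketch implements the shooting updates for the plain
    lasso objective RSS + lambda*||beta||[1] on centered and standardized data.  It is a minimal
    sketch: the function name, iteration limit and convergence rule are arbitrary choices for this
    example, and the code is not the routine used internally by lasso2.

        mata:
        real colvector shoot_lasso(real matrix X, real colvector y,
                                   real scalar lambda, real scalar tol)
        {
            real scalar    p, j, iter, bold, rho, xx, change
            real colvector b, r

            p = cols(X)
            b = J(p, 1, 0)                        // start from the zero vector
            r = y                                 // residuals given current b
            for (iter=1; iter<=10000; iter++) {
                change = 0
                for (j=1; j<=p; j++) {
                    bold = b[j]
                    xx   = cross(X[.,j], X[.,j])
                    // inner product of predictor j with the partial residual
                    rho  = cross(X[.,j], r) + bold*xx
                    // soft-thresholding update for coefficient j
                    b[j] = sign(rho)*max((abs(rho) - lambda/2, 0))/xx
                    r    = r - X[.,j]*(b[j] - bold)
                    change = max((change, abs(b[j] - bold)))
                }
                if (change < tol) break           // convergence check
            }
            return(b)
        }
        end

    With standardized data held in Mata matrices X and y, shoot_lasso(X, y, 100, 1e-10) would return
    the coefficient vector at lambda=100 in this simplified parameterization.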

Penalization level: choice of lambda (and alpha)

    Penalized regression methods, such as the elastic net and the sqrt-lasso, rely on tuning parameters
    that control the degree and type of penalization.  The estimation methods implemented in lasso2 use
    two tuning parameters:  lambda, which controls the general degree of penalization, and alpha, which
    determines the relative contribution of L1-type to L2-type penalization.  lasso2 obtains elastic
    net and sqrt-lasso solutions for a given lambda value or a list of lambda values, and for a given
    alpha value (default=1).

    lassopack offers three approaches for selecting the "optimal" lambda (and alpha) value:

    (1) The penalty level may be chosen by cross-validation in order to optimize out-of-sample
    prediction performance.  K-fold cross-validation and rolling cross-validation (for panel and
    time-series data) are implemented in cvlasso.  cvlasso also supports cross-validation across alpha.

    (2) Theoretically justified and feasible penalty levels and loadings are available for the lasso
    and sqrt-lasso via the separate command rlasso.  The penalization is chosen to dominate the noise
    of the data-generating process (represented by the score vector), which allows derivation of
    theoretical results with regard to consistent prediction and parameter estimation.  Since the error
    variance is in practice unknown, Belloni et al. (2012) introduce the rigorous (or feasible) lasso
    that relies on an iterative algorithm for estimating the optimal penalization and is valid in the
    presence of non-Gaussian and heteroskedastic errors.  Belloni et al. (2016) extend the framework to
    the panel data setting.  In the case of the sqrt-lasso under homoskedasticity, the optimal penalty
    level is independent of the unknown error variance, leading to a practical advantage and better
    performance in finite samples (see Belloni et al., 2011, 2014).  See help rlasso for more details.

    (3) Lambda can also be selected using information criteria.  lasso2 calculates four information
    criteria:  Akaike Information Criterion (AIC; Akaike, 1974), Bayesian Information Criterion (BIC;
    Schwarz, 1978), Extended Bayesian information criterion (EBIC; Chen & Chen, 2008) and the corrected
    AIC (AICc; Sugiura, 1978, and Hurvich, 1989).  By default, lasso2 displays EBIC in the output, but
    all four information criteria are stored in e(aic), e(bic), e(ebic) and e(aicc).  See section
    Information criteria for more information.
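
    For example, using the prostate data introduced below, the three approaches correspond to the
    following calls; the first two assume that cvlasso and rlasso (both part of lassopack) are
    installed.
        . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(ebic)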

Standardization of variables

    Standard practice is for predictors to be "standardized", i.e., normalized to have mean zero and
    unit variance.  By default lasso2 achieves this by incorporating the standardization into the
    penalty loadings.  We refer to this method as standardization "on the fly", as standardization
    occurs during rather than before estimation.  Alternatively, the option prestd causes the
    predictors to be standardized prior to the estimation.

    Standardizing "on the fly" via the penalty loadings and pre-standardizing the data prior to
    estimation are theoretically equivalent.  The default standardizing "on the fly" is often faster.
    The prestd option can lead to improved numerical precision or more stable results in the case of
    difficult problems; the cost is the computation time required to pre-standardize the data.

Estimators

    Ridge regression (Hoerl & Kennard, 1970)

    The ridge estimator can be written as

         betahat(ridge) = (X'X+lambda*I(p))^(-1)X'y.

    Thus, even if X'X is not full rank (e.g. because p>n), the problem becomes nonsingular by adding a
    constant to the diagonal of X'X.  Another advantage of the ridge estimator over least squares stems
    from the variance-bias trade-off.  Ridge regression may improve over ordinary least squares by
    inducing a mild bias while decreasing the variance.  For this reason, ridge regression is a popular
    method in the context of multicollinearity.  In contrast to estimators relying on L1-penalization,
    the ridge does not yield sparse solutions and keeps all predictors in the model.
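
    As a simple illustration of this closed form (and not a lasso2 computation), the ridge
    coefficients can be computed directly in Mata.  The sketch below assumes the prostate data used
    in the examples further down are in memory; the variable names, the choice lambda=10 and the
    absence of standardization are arbitrary, and this lambda is on the scale of the formula above
    rather than lasso2's penalty parameterization.

        mata:
        // illustration only: closed-form ridge coefficients on centered data
        y = st_data(., "lpsa")
        X = st_data(., "lcavol lweight age lbph svi lcp gleason pgg45")
        y = y :- mean(y)                          // center the response
        X = X :- mean(X)                          // center the regressors
        lambda  = 10
        b_ridge = invsym(cross(X, X) + lambda*I(cols(X))) * cross(X, y)
        b_ridge                                   // display the coefficients
        end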

    Lasso estimator (Tibshirani, 1996)

    The lasso minimizes the residual sum of squares subject to a constraint on the absolute size of
    coefficient estimates.  Tibshirani (1996) motivates the lasso with two major advantages over least
    squares.  First, due to the nature of the L1-penalty, the lasso tends to produce sparse solutions
    and thus facilitates model interpretation.  Secondly, similar to ridge regression, lasso can
    outperform least squares in terms of prediction due to lower variance.  Another advantage is that
    the lasso is computationally attractive due to its convex form.  This is in contrast to model
    selection based on AIC or BIC (which employ L0 penalization) where each possible sub-model has to
    be estimated.

    Elastic net (Zou & Hastie, 2005)

    The elastic net applies a mix of L1 (lasso-type) and L2 (ridge-type) penalization.  It combines
    some of the strengths of lasso and ridge regression.  In the presence of groups of correlated
    regressors, the lasso typically selects only one variable from each group, whereas the ridge tends
    to produce similar coefficient estimates for groups of correlated variables.  On the other hand,
    the ridge does not yield sparse solutions, which impedes model interpretation.  The elastic net
    can produce sparse solutions (for some alpha greater than zero) and retains (or drops) correlated
    variables jointly.

    Adaptive lasso (Zou, 2006)

    The lasso is only variable selection consistent under the rather strong "irrepresentable
    condition", which imposes constraints on the degree of correlation between predictors in the true
    model and predictors outside of the model (see Zhao & Yu, 2006; Meinshausen & Bühlmann, 2006).  Zou
    (2006) proposes the adaptive lasso which uses penalty loadings of 1/abs(beta0(j))^theta where beta0
    is an initial estimator.  The adaptive lasso is variable-selection consistent for fixed p under
    weaker assumptions than the standard lasso.  If p<n, OLS can be used as the initial estimator.
    Huang et al. (2008) suggest using univariate OLS if p>n.  Other initial estimators are possible.

    Square-root lasso (Belloni et al., 2011, 2014)

    The sqrt-lasso is a modification of the lasso that minimizes (RSS)^(1/2) instead of RSS, while also
    imposing an L1-penalty.  The main advantage of the sqrt-lasso over the standard lasso is that the
    theoretically grounded, data-driven optimal lambda is independent of the unknown error variance
    under homoskedasticity.  See rlasso.

    Post-estimation OLS

    Penalized regression methods induce a bias that can be alleviated by post-estimation OLS, which
    applies OLS to the predictors selected by the first-stage variable selection method.  For the case
    of the lasso, Belloni and Chernozhukov (2013) have shown that the post-lasso OLS performs at least
    as well as the lasso under mild additional assumptions.
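
    As a purely illustrative check (the ols option does this automatically), post-estimation OLS can
    be reproduced manually after a scalar-lambda call by regressing the dependent variable on the
    selected predictors stored in the macro e(selected0) (see Saved results below); the example
    assumes the prostate data used further down.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . reg lpsa `e(selected0)'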

    For further information on the lasso and related methods, see for example the textbooks by Hastie
    et al. (2009, 2015; both available for free) and Bühlmann & Van de Geer (2011).

Information criteria
 
    The information criteria supported by lasso2 are the Akaike information criterion (AIC, Akaike,
    1974), the Bayesian information criterion (BIC, Schwarz, 1978), the corrected AIC (Sugiura, 1978;
    Hurvich, 1989), and the Extended BIC (Chen & Chen, 2008).  These are given by (omitting dependence
    on lambda and alpha):

        AIC     = N*log(RSS/N) + 2*df
        BIC     = N*log(RSS/N) + df*log(N) 
        AICc    = N*log(RSS/N) + 2*df*N/(N-df-1)
        EBIC    = BIC + 2*xi*df*log(p)

    where RSS(lambda,alpha) is the residual sum of squares and df(lambda,alpha) is the effective
    degrees of freedom, which is a measure of model complexity.  In the linear regression model, the
    degrees of freedom is simply the number of regressors.  Zou et al. (2007) show that the number of
    non-zero coefficients is an unbiased and consistent estimator of df(lambda,alpha) for the lasso.
    More generally, the degrees of freedom of the elastic net can be calculated as the trace of the
    projection matrix.  With an unbiased estimator for df available, the above information criteria can
    be employed to select tuning parameters.
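
    For the lasso, the estimated degrees of freedom can be inspected directly: after a default lasso2
    call, the column vector e(dof) (see Saved results below) holds the number of non-zero coefficients
    (the L0 norm, excluding the intercept) for each lambda value.  Using the prostate data from the
    examples below:
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . mat list e(dof)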

    The BIC is known to be model selection consistent if the true model is among the candidate models,
    whereas the AIC tends to yield an overfitted model.  On the other hand, the AIC is loss efficient
    in the sense that it selects the model that minimizes the squared average prediction error, while
    the BIC does not possess this property.  Zhang et al. (2010) show that these principles also apply
    when AIC and BIC are employed to select the tuning parameter for penalized regression.

    Both AIC and BIC tend to overselect regressors in the small-N-large-p case.  The AICc corrects the
    small sample bias of the AIC which can be especially severe in the high-dimensional context.
    Similarly, the EBIC addresses the shortcomings of the BIC when p is large by imposing a larger
    penalty on the number of coefficients.  Chen & Chen (2008) show that the EBIC performs better in
    terms of false discovery rate at the cost of a negligible reduction in the positive selection rate.

    The EBIC depends on an additional parameter, xi (denoted as gamma in the original article), which
    can be controlled using ebicgamma(real).  xi=0 is equivalent to the BIC.  We follow Chen & Chen
    (2008, p. 768) and use xi=1-log(n)/(2*log(p)) as the default choice.  An upper and lower threshold
    is applied to ensure that xi lies in the [0,1] interval.
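
    For example, with n=100 observations and p=50 predictors the default is
    xi=1-log(100)/(2*log(50)), which is approximately 0.41 (the ratio does not depend on the base of
    the logarithm).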

    The EBIC is displayed in the output of lasso2 by default (if lambda is a list), but all four
    information criteria are returned in e().  The lambda values that minimize the information criteria
    for a given alpha are returned in e(laic), e(lbic), e(laicc) and e(lebic), respectively.  To change
    the default display, use the ic(string) option.  noic suppresses the calculation of information
    criteria, which leads to a speed gain if alpha<1.

Example using prostate cancer data (Stamey et al., 1989)

    Data set

    The data set is available through Hastie et al. (2009) on the authors' website.  The following
    variables are included in the data set of 97 men:

    Predictors    
      lcavol    log(cancer volume)
      lweight   log(prostate weight)
      age       patient age
      lbph      log(benign prostatic hyperplasia amount)
      svi       seminal vesicle invasion
      lcp       log(capsular penetration)
      gleason   Gleason score
      pgg45     percentage Gleason scores 4 or 5

    Outcome       
      lpsa      log(prostate specific antigen)

    Load prostate cancer data.
        . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear
            tab

    General demonstration

    Estimate coefficient lasso path over (default) list of lambda values.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45

    The replay syntax can be used to re-display estimation results.
        . lasso2

    User-specified lambda list.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10)

    The list of returned e() objects depends on whether lambda() is a list (the default) or a scalar
    value.  For example, if lambda is a scalar, one vector of coefficient estimates is returned.  If
    lambda is a list, the whole coefficient path for a range of lambda values is obtained.  The last
    row of e(betas) is equal to the row vector e(b).
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10)
        . ereturn list
        . mat list e(betas)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . ereturn list
        . mat list e(b)

    Sqrt-lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt

    Ridge regression.  All predictors are included in the model.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0)

    Elastic net with alpha=0.1.  Even though alpha is close to zero (Ridge regression), the elastic net
    can produce sparse solutions.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1)

    The option ols triggers the use of post-estimation OLS.  OLS alleviates the shrinkage bias induced
    by L1 and L2 norm penalization.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ols
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt ols
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) ols

    Information criteria

    lasso2 calculates four information criteria: AIC, BIC, EBIC and AICc.  The EBIC is shown by default
    in the output along with the R-squared.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45

    To see another information criterion in the output, use the ic(string) option where string can be
    replaced by aic, bic, ebic or aicc (note the lower case spelling).  For example, to display AIC:
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ic(aic)

    In fact, there is no need to re-run the full model.  We can make use of the replay syntax:
        . lasso2, ic(aic)

    The long option triggers extended output; instead of showing only the points at which predictors
    enter or leave the model, all models are shown.  An asterisk marks the model (i.e., the value of
    lambda) that minimizes the information criterion (here, AIC).
        . lasso2, ic(aic) long

    To estimate the model corresponding to the minimum information criterion, click on the link at the
    bottom of the output or type one of the following:
        . lasso2, lic(aic)
        . lasso2, lic(ebic)
        . lasso2, lic(bic)
        . lasso2, lic(aicc)

    To store the estimation results of the selected model, add the postresults option.
        . lasso2, lic(ebic)
        . ereturn list
        . lasso2, lic(ebic) postres
        . ereturn list

    The same can also be achieved in one line without using the replay syntax.  lasso2 first obtains
    the full coefficient path for a list of lambda values, and then runs the model selected by AIC.
    Again, postresults can be used to store results of the selected model.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(aic) postres

    Plotting

    Plot coefficients against lambda:  As lambda increases, the coefficient estimates are shrunk
    towards zero.  Lambda=0 corresponds to OLS and if lambda is sufficiently large the model is empty.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(lambda)

    Plot coefficients against L1 norm.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(norm)

    The replay syntax can also be used for plotting.
        . lasso2, plotpath(norm)

    Only selected variables are plotted.
        . lasso2, plotpath(norm) plotvar(lcavol svi)

    The variable names can be displayed directly next to each series using plotlabel.
    plotopt(legend(off)) suppresses the legend.
        . lasso2, plotpath(lambda) plotlabel plotopt(legend(off))
        . lasso2, plotpath(norm) plotlabel plotopt(legend(off))

    Predicted values

    xbhat1 is generated by re-estimating the model for lambda=10.  The noisily option triggers the
    display of the estimation results.  xbhat2 is generated by linear approximation using the two beta
    estimates closest to lambda=10.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . cap drop xbhat1
        . predict double xbhat1, xb l(10) noisily
        . cap drop xbhat2
        . predict double xbhat2, xb l(10) approx

    The model is estimated explicitly using lambda=10.  If lasso2 is called with a scalar lambda value,
    the subsequent predict command requires no lambda() option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . cap drop xbhat3
        . predict double xbhat3, xb

    All three methods yield the same results.  However, note that the linear approximation is only
    exact for the lasso, which has a piecewise-linear coefficient path.
        . sum xbhat1 xbhat2 xbhat3

    It is also possible to obtain predicted values by referencing a specific lambda ID using the lid()
    option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . cap drop xbhat4
        . predict double xbhat4, xb lid(21)
        . cap drop xbhat5
        . predict double xbhat5, xb l(25.45473900468241)
        . sum xbhat4 xbhat5

    Standardization

    By default lasso2 standardizes the predictors to have unit variance.  Standardization is done by
    default "on the fly" via penalty loadings.  The coefficient estimates are returned in original
    units.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)

    Instead of standardizing "on the fly" by setting penalization loadings equal to standardization
    loadings, we can standardize the regressors prior to estimation with the prestd option.  Both
    methods are equivalent in theory.  Standardizing "on the fly" tends to be faster, but
    pre-standardization may lead to more stable results in the case of difficult problems.  See the
    section Standardization of variables above for more information.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
        . mat list e(Psi)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd
        . mat list e(Psi)

    The penalty loadings used are stored in e(Psi).  In the first case above, the standardization
    loadings are returned.  In the second case the penalty loadings are equal to one for all
    regressors.

    To get the coefficients in standard deviation units, stdcoef can be specified along with the prestd
    option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd stdcoef

    We can override any form of standardization with the unitloadings option, which sets the penalty
    loadings to a vector of 1s.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) unitloadings

    The same logic applies to the sqrt-lasso (and elastic net).
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt prestd

    Penalty loadings and notpen()

    By default the penalty loading vector is a vector of standard deviations.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
        . mat list e(Psi)

    We can set the penalty loading for specific predictors to zero, implying no penalization.
    Unpenalized predictors are always included in the model.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol)
        . mat list e(Psi)

    We can specify custom penalty loadings.  The option ploadings expects a row vector of size p where
    p is the number of regressors (excluding the constant, which is partialled out).  Because we
    pre-standardize the data (and we are using the lasso) the results are equivalent to the results
    above (standardizing on the fly and specifying lcavol as unpenalized).
        . mat myloadings = (0,1,1,1,1,1,1,1)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) ploadings(myloadings) prestd
        . mat list e(Psi)

    Partialling vs penalization

    If lambda and the penalty loadings are kept constant, partialling out variables and leaving them
    unpenalized yield the same results for the included/penalized regressors.  Yamada (2017) shows that
    the equivalence of partialling out and not penalizing holds for lasso and ridge regression.  The
    examples below suggest that the same result also holds for the elastic net in general and the
    sqrt-lasso.  Note that the equivalence only holds if the regressor matrix and other penalty
    loadings are the same.  Below we use the unitloadings option to achieve this; alternatively we
    could use the ploadings(.) option.

    Lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) partial(lcavol) unitload

    Sqrt-lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt notpen(lcavol) unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt partial(lcavol)
            unitload

    Ridge regression.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) notpen(lcavol)
            unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) partial(lcavol)
            unitload

    Elastic net.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) notpen(lcavol)
            unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) partial(lcavol)
            unitload

    Adaptive lasso

    The adaptive lasso relies on an initial estimator to calculate the penalty loadings.  The penalty
    loadings are given by 1/abs(beta0(j))^theta, where beta0(j) denotes the initial estimate for
    predictor j.  By default, lasso2 uses OLS as the initial estimator as originally suggested by Zou
    (2006).  If the number of parameters exceeds the number of observations, univariate OLS is used;
    see Huang et al. (2008).
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive
        . mat list e(Psi)

    See the OLS estimates for comparison.
        . reg lpsa lcavol lweight age lbph svi lcp gleason pgg45

    Theta (the exponent for calculating the adaptive loadings) can be changed using the adatheta()
    option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adat(2)
        . mat list e(Psi)

    Other initial estimators such as ridge regression are possible.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0)
        . mat bhat_ridge = e(b)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adaloadings(bhat_ridge)
        . mat list e(Psi)

Saved results

    The set of returned e-class objects depends on whether lambda is a scalar or a list (the default).

    scalars       
      e(N)               sample size
      e(cons)            =1 if constant is present, 0 otherwise
      e(fe)              =1 if fixed effects model is used, 0 otherwise
      e(alpha)           elastic net parameter
      e(sqrt)            =1 if the sqrt-lasso is used, 0 otherwise
      e(ols)             =1 if post-estimation OLS results are returned, 0 otherwise
      e(adaptive)        =1 if adaptive loadings are used, 0 otherwise
      e(p)               number of penalized regressors in model
      e(notpen_ct)       number of unpenalized variables
      e(partial_ct)      number of partialled out regressors (incl constant)
      e(prestd)          =1 if pre-standardized
      e(lcount)          number of lambda values

    scalars (only if lambda is a list)
      e(lmax)            largest lambda value
      e(lmin)            smallest lambda value
      e(aicmin)          minimum AIC
      e(bicmin)          minimum BIC
      e(aiccmin)         minimum AICc
      e(ebicmin)         minimum EBIC
      e(laic)            lambda corresponding to minimum AIC
      e(lbic)            lambda corresponding to minimum BIC
      e(laicc)           lambda corresponding to minimum AICc
      e(lebic)           lambda corresponding to minimum EBIC

    scalars (only if lambda is a scalar)
      e(lambda)          penalty level
      e(rmse)            root mean squared error
      e(rmseOLS)         root mean squared error of post-estimation OLS
      e(pmse)            minimized objective function (penalized mse, lasso/elastic net/ridge only)
      e(prmse)           minimized objective function (penalized rmse, sqrt-lasso only)
      e(k)               number of selected and unpenalized/partialled-out regressors including
                           constant (if present)
      e(s)               number of selected regressors
      e(s0)              number of selected and unpenalized regressors including constant (if present)
      e(niter)           number of iterations
      e(maxiter)         maximum number of iterations
      e(tss)             total sum of squares

    macros        
      e(cmd)             command name
      e(depvar)          name of dependent variable
      e(varX)            all predictors
      e(varXmodel)       penalized predictors
      e(partial)         partialled out predictors
      e(notpen)          unpenalized predictors
      e(method)          estimation method

    macros (only if lambda is a scalar)
      e(selected)        selected predictors
      e(selected0)       selected predictors excluding constant

    matrices      
      e(Psi)             row vector of penalty loadings used
      e(stdvec)          row vector of standardization loadings

    matrices (only if lambda is a list)
      e(lambdamat)       column vector of lambdas
      e(l1norm)          column vector of L1 norms for each lambda value (excludes the intercept)
      e(wl1norm)         column vector of weighted L1 norms for each lambda value (excludes the
                           intercept), see wnorm
      e(dof)             column vector of L0 norm for each lambda value (excludes the intercept)
      e(betas)           matrix of estimates, where each row corresponds to one lambda value. The
                           intercept is stored in the last column.
      e(ess)             column vector of explained sum of squares for each lambda value
      e(rss)             column vector of residual sum of squares for each lambda value
      e(ebic)            column vector of EBIC for each lambda value
      e(bic)             column vector of BIC for each lambda value
      e(aic)             column vector of AIC for each lambda value
      e(aicc)            column vector of AICc for each lambda value
      e(rsq)             column vector of R-squared for each lambda value

    matrices (only if lambda is a scalar)
      e(b)               posted coefficient vector (see postall and displayall). Used for prediction.
      e(beta)            coefficient vector
      e(betaOLS)         coefficient vector of post-estimation OLS
      e(betaAll)         full coefficient vector including omitted, factor base variables, etc.
      e(betaAllOLS)      full post-estimation OLS coefficient vector including omitted, factor base
                           variables, etc.

    functions     
      e(sample)          estimation sample

References

    Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
        Automatic Control, 19(6), 716–723.  https://doi.org/10.1109/TAC.1974.1100705

    Belloni, A., Chernozhukov, V., & Wang, L. (2011).  Square-root lasso: pivotal recovery of sparse
        signals via conic programming.  Biometrika 98(4), 791–806.  
        https://doi.org/10.1093/biomet/asr043

    Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse Models and Methods for Optimal
        Instruments With an Application to Eminent Domain. Econometrica 80(6), 2369–2429.  
        https://doi.org/10.3982/ECTA9626

    Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional
        sparse models. Bernoulli, 19(2), 521–547.  https://doi.org/10.3150/11-BEJ410

    Belloni, A., Chernozhukov, V., & Wang, L. (2014). Pivotal estimation via square-root Lasso in
        nonparametric regression. The Annals of Statistics 42(2), 757–788.  
        https://doi.org/10.1214/14-AOS1204

    Belloni, A., Chernozhukov, V., Hansen, C., & Kozbur, D. (2016). Inference in High Dimensional Panel
        Models with an Application to Gun Control. Journal of Business & Economic Statistics 34(4),
        590–605. Methodology.  https://doi.org/10.1080/07350015.2015.1102733

    Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the
        Lasso. The Annals of Statistics, 34(3), 1436–1462.  
        http://doi.org/10.1214/009053606000000281

    Bühlmann, P., & Van de Geer, S. (2011). Statistics for High-Dimensional Data. Berlin, Heidelberg:
        Springer-Verlag.

    Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large
        model spaces. Biometrika, 95(3), 759–771.  https://doi.org/10.1093/biomet/asn034

    Correia, S. 2016.  FTOOLS: Stata module to provide alternatives to common Stata commands optimized
        for large datasets.  https://ideas.repec.org/c/boc/bocode/s458213.html

    Fu, W. J. (1998). Penalized Regressions: The Bridge Versus the Lasso. Journal of Computational and
        Graphical Statistics 7(3), 397–416.  https://doi.org/10.2307/1390712

    Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization.
        The Annals of Applied Statistics 1(2), 302–332.  https://doi.org/10.1214/07-AOAS131

    Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear
        Models via Coordinate Descent. Journal of Statistical Software 33(1), 1–22.  
        https://doi.org/10.18637/jss.v033.i01

    Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).
        New York: Springer-Verlag.  https://web.stanford.edu/~hastie/ElemStatLearn/

    Hastie, T., Tibshirani, R., & Wainwright, M. J. (2015). Statistical Learning with Sparsity: The
        Lasso and Generalizations. Boca Raton: CRC Press, Taylor & Francis.  
        https://www.stanford.edu/~hastie/StatLearnSparsity/

    Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems.
        Technometrics 12(1), 69–82.  https://doi.org/10.1080/00401706.1970.10488635

    Huang, J., Ma, S., & Zhang, C.-H. (2008). Adaptive Lasso for Sparse High-Dimensional Regression
        Models. Statistica Sinica 18, 1603–1618.  https://doi.org/10.2307/24308572

    Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples.
        Biometrika, 76(2), 297–307.  http://doi.org/10.1093/biomet/76.2.297

    Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464.  
        https://doi.org/10.1214/aos/1176344136

    Stamey, T. A., Kabalin, J. N., Mcneal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang,
        N. (1989). Prostate Specific Antigen in the Diagnosis and Treatment of Adenocarcinoma of the
        Prostate. II. Radical Prostatectomy Treated Patients. The Journal of Urology 141(5), 1076–1083.
        https://doi.org/10.1016/S0022-5347(17)41175-X

    Sugiura, N. (1978). Further analysts of the data by Akaike's information criterion and the finite
        corrections. Communications in Statistics - Theory and Methods, 7(1), 13–26.  
        http://doi.org/10.1080/03610927808827599

    Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal
        Statistical Society. Series B (Methodological) 58(1), 267–288.  https://doi.org/10.2307/2346178

    Van der Kooij, A. (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling
        Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden.  
        http://hdl.handle.net/1887/12096

    Yamada, H. (2017). The Frisch–Waugh–Lovell theorem for the lasso and the ridge regression.
        Communications in Statistics - Theory and Methods 46(21), 10897–10902.  
        https://doi.org/10.1080/03610926.2016.1252403

    Zhang, Y., Li, R., & Tsai, C.-L. (2010). Regularization Parameter Selections via Generalized
        Information Criterion. Journal of the American Statistical Association, 105(489), 312–323.  
        http://doi.org/10.1198/jasa.2009.tm08013

    Zhao, P., & Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of Machine Learning
        Research, 7, 2541–2563.  http://dl.acm.org/citation.cfm?id=1248547.1248637

    Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of
        the Royal Statistical Society. Series B: Statistical Methodology 67(2), 301–320.  
        https://doi.org/10.1111/j.1467-9868.2005.00503.x

    Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical
        Association 101(476), 1418–1429.  https://doi.org/10.1198/016214506000000735

    Zou, H., Hastie, T., & Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. Ann.
        Statist., 35(5), 2173–2192.  https://doi.org/10.1214/009053607000000127

Website

    Please check our website https://statalasso.github.io/ for more information.

Installation

    To get the latest stable version of lassopack from our website, check the installation instructions
    at https://statalasso.github.io/installation/.  We update the stable website version more
    frequently than the SSC version.

    To verify that lassopack is correctly installed, click on or type whichpkg lassopack (which
    requires whichpkg to be installed; ssc install whichpkg).

Acknowledgements

    Thanks to Alexandre Belloni, who provided Matlab code for the square-root lasso estimator, Sergio
    Correia for advice on the use of the FTOOLS package, and Jan Ditzen.


Citation of lasso2

    lasso2 is not an official Stata command. It is a free contribution to the research community, like
    a paper. Please cite it as such:

    Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018.  lasso2: Program for lasso, square-root lasso,
        elastic net, ridge, adaptive lasso and post-estimation OLS.  
        http://ideas.repec.org/c/boc/bocode/s458458.html


Authors

        Achim Ahrens, Economic and Social Research Institute, Ireland
        achim.ahrens@esri.ie
        
        Christian B. Hansen, University of Chicago, USA
        Christian.Hansen@chicagobooth.edu

        Mark E Schaffer, Heriot-Watt University, UK
        m.e.schaffer@hw.ac.uk


Also see

       Help: cvlasso, rlasso, ivlasso, pdslasso (if installed).