----------------------------------------------------------------------------------------------------------------------------------
help lasso2                                                                                                       lassopack v1.4.2
----------------------------------------------------------------------------------------------------------------------------------

Title

    lasso2 --  Program for lasso, square-root lasso, elastic net, ridge, adaptive lasso and post-estimation OLS

Syntax

    Full syntax

        lasso2 depvar regressors [if exp] [in range] [, alpha(real) sqrt adaptive adaloadings(string) adatheta(real) ols
              lambda(numlist) lcount(integer) lminratio(real) lmax(real) lglmnet notpen(varlist) partial(varlist)
              psolver(string) norecover ploadings(string) unitloadings prestd stdcoef fe noftools noconstant tolopt(real)
              tolzero(real) maxiter(int) plotpath(method) plotvar(varlist) plotopt(string) plotlabel ic(string) lic(string)
              ebicgamma(real) noic long displayall postall postresults verbose vverbose wnorm]

        Note: the fe option will take advantage of the ftools package (if installed) for the fixed-effects transformation; the
              speed gains using this package can be large.  See help ftools or click on ssc install ftools to install.

    Estimators            Description
    ----------------------------------------------------------------------------------------------------------------------------
    alpha(real)            elastic net parameter, which controls the degree of L1-norm (lasso-type) to L2-norm (ridge-type)
                            penalization.  alpha=1 corresponds to the lasso (the default estimator), and alpha=0 to ridge
                            regression.  alpha must be in the interval [0,1].
    sqrt                   square-root lasso estimator.
    adaptive               adaptive lasso estimator.  The penalty loading for predictor j is set to 1/abs(beta0(j))^theta where
                            beta0(j) is the OLS estimate or univariate OLS estimate if p>n.  Theta is the adaptive exponent, and
                            can be controlled using the adatheta(real) option.
    adaloadings(string)    alternative initial estimates, beta0, used for calculating adaptive loadings.  For example, this
                            could be the vector e(b) from an initial lasso2 estimation.  The elements of the vector are raised
                            to the power -theta (note the minus).  See adaptive option.
    adatheta(real)         exponent for calculating adaptive penalty loadings. See adaptive option. Default=1.
    ols                    post-estimation OLS.  If lambda is a list, post-estimation OLS results are displayed and returned in
                            e(betas).  If lambda is a scalar, post-estimation OLS is always displayed, and this option controls
                            whether standard or post-estimation OLS results are stored in e(b).
    ----------------------------------------------------------------------------------------------------------------------------
    See overview of estimation methods.

    Lambda(s)             Description
    ----------------------------------------------------------------------------------------------------------------------------
    lambda(numlist)        a scalar lambda value or list of descending lambda values. Each lambda value must be greater than 0.
                            If not specified, the default list is used which is given by
                            exp(rangen(log(lmax),log(lminratio*lmax),lcount)) (see mf_range).
    lcount(integer)†       number of lambda values for which the solution is obtained. Default is 100.
    lminratio(real)†       ratio of minimum to maximum lambda. lminratio must be between 0 and 1. Default is 1/1000.
    lmax(real)†            maximum lambda value. Default is 2*max(abs(X'y)), and max(abs(X'y)) in the case of the square-root
                            lasso (where X is the pre-standardized regressor matrix and y is the vector of the response variable).
    fdev                   minimum fractional change in deviance (R-sq) to stop looping through lambdas (path)
    devmax                 maximum fraction of explained deviance (R-sq) to stop looping through lambdas (path)
    nodevcrit              override criteria to exit path; loop through all lambdas in list
    lic(string)            after first lasso2 estimation using list of lambdas, estimate model corresponding to minimum
                            information criterion.  'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed.  Note the lower
                            case spelling.  See Information criteria for the definition of each information criterion.
    ebicgamma(real)        controls the xi parameter of the EBIC.  xi needs to lie in the [0,1] interval.  xi=0 is equivalent to
                            the BIC.  The default choice is xi=1-log(n)/(2*log(p)).
    postresults            Used in combination with lic().  Stores estimation results of the model selected by information
                            criterion in e().
    lglmnet                Use the parameterizations for lambda, alpha, standardization, etc. employed by glmnet by Friedman et
                            al. (2010).
    ----------------------------------------------------------------------------------------------------------------------------
    † Not applicable if lambda() is specified.

    Loadings & standardization Description
    ----------------------------------------------------------------------------------------------------------------------------
    notpen(varlist)        sets penalty loadings to zero for predictors in varlist.  Unpenalized predictors are always included
                            in the model.
    partial(varlist)       variables in varlist are partialled out prior to estimation.
    psolver(string)        override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol;
                            default=qrxx)
    norecover              suppresses recovery of partialled out variables after estimation.
    ploadings(matrix)      a row-vector of penalty loadings; overrides the default standardization loadings (in the case of the
                            lasso, =sqrt(avg(x^2))).  The size of the vector should equal the number of predictors (excluding
                            partialled out variables and excluding the constant).
    unitloadings           penalty loadings set to a vector of ones; overrides the default standardization loadings (in the case
                            of the lasso, =sqrt(avg(x^2))).  nostd is a synonym for unitloadings.
    prestd                 dependent variable and predictors are standardized prior to estimation rather than standardized "on
                            the fly" using penalty loadings.  See here for more details.  By default the coefficient estimates
                            are un-standardized (i.e., returned in original units).
    stdcoef                return coefficients in standard deviation units, i.e., don't un-standardize.
    stdall                 return all results (coefficients, information criteria, norms, etc.) in standardized units.
    ----------------------------------------------------------------------------------------------------------------------------
    See discussion of standardization.

    FE & constant         Description
    ----------------------------------------------------------------------------------------------------------------------------
    fe                     within-transformation is applied prior to estimation. Requires data to be xtset.
    noftools               do not use ftools package for fixed-effects transform (slower; rarely used)
    noconstant             suppress constant from estimation.  Default behaviour is to partial the constant out (i.e., to center
                            the regressors).
    ----------------------------------------------------------------------------------------------------------------------------

    Optimization          Description
    ----------------------------------------------------------------------------------------------------------------------------
    tolopt(real)           tolerance for lasso shooting algorithm (default=1e-10)
    tolzero(real)          minimum below which coeffs are rounded down to zero (default=1e-4)
    maxiter(int)           maximum number of iterations for the lasso shooting algorithm (default=10,000)
    ----------------------------------------------------------------------------------------------------------------------------

    Plotting options*     Description
    ----------------------------------------------------------------------------------------------------------------------------
    plotpath(method)       plots the coefficients path as a function of the L1-norm (norm), lambda (lambda) or the log of lambda
                            (lnlambda)
    plotvar(varlist)       list of variables to be included in the plot
    plotopt(string)        additional plotting options passed on to line. For example, use plotopt(legend(off)) to turn off the
                            legend.
    plotlabel              displays variable labels in graph.
    ----------------------------------------------------------------------------------------------------------------------------
    * Plotting is not available if lambda is a scalar value.

    Display options       Description
    ----------------------------------------------------------------------------------------------------------------------------
    displayall*            display full coefficient vectors including unselected variables (default: display only selected,
                            unpenalized and partialled-out)
    postall*               post full coefficient vector including unselected variables in e(b) (default: e(b) has only selected,
                            unpenalized and partialled-out)
    long†                  show long output; instead of showing only the points at which predictors enter or leave the model,
                            all models are shown.
    verbose                show additional output
    vverbose               show even more output
    ic(string)†            controls which information criterion is shown in the output.  'aic', 'bic', 'aicc', and 'ebic' (the
                            default) are allowed.  Note the lower case spelling.  See Information criteria for the definition
                            of each information criterion.
    noic†                  suppresses the calculation of information criteria.  This will lead to speed gains if alpha<1, since
                            calculation of effective degrees of freedom requires one inversion per lambda.
    wnorm†                 displays L1 norm of beta estimates weighted by penalty loadings, i.e., ||Psi*beta||(1) instead of
                            ||beta||(1), which is the default.  Note that this also affects plotting if plotpath(norm) is
                            specified.
    ----------------------------------------------------------------------------------------------------------------------------
    * Only applicable if lambda is a scalar value.  † Only applicable if lambda is a list (the default).

    Replay syntax

        lasso2 [, plotpath(method) plotvar(varlist) plotopt(string) plotlabel long postresults lic(string) ic(string) wnorm]

    Replay options        Description
    ----------------------------------------------------------------------------------------------------------------------------
    long                   show long output; instead of showing only the points at which predictors enter or leave the model,
                            all models are shown.
    ic(string)             controls which information criterion is shown in the output.  'aic', 'bic', 'aicc', and 'ebic' (the
                            default) are allowed.  Note the lower case spelling.  See Information criteria for the definition of
                            each information criterion.
    lic(string)            estimate model corresponding to minimum information criterion.  'aic', 'bic', 'aicc', and 'ebic' (the
                            default) are allowed.  Note the lower case spelling.  See Information criteria for the definition of
                            each information criterion.
    postresults            store estimation results in e() if lic(string) is used
    plotpath(method)       see Plotting options above
    plotvar(varlist)       see Plotting options above
    plotopt(string)        see Plotting options above
    plotlabel              see Plotting options above
    ----------------------------------------------------------------------------------------------------------------------------
    Only applicable if lambda was a list in the previous lasso2 estimation.

    Postestimation:

        predict [type] newvar [if] [in] [, xb residuals u e ue xbu ols lambda(real) lid(int) lic(string) approx noisily postresults]

    Predict options       Description
    ----------------------------------------------------------------------------------------------------------------------------
    xb                     compute predicted values (the default)
    residuals              compute residuals
    e                      generate overall error component e(it).  Only after fe.
    ue                     generate combined residuals, i.e., u(i) + e(it). Only after fe.
    xbu                    prediction including fixed effect, i.e., a + xb + u(i). Only after fe.
    u                      fixed effect, i.e., u(i). Only after fe.
    ols                    use post-estimation OLS for prediction
    lambda(real)‡          lambda value for prediction. Ignored if lasso2 was called with scalar lambda value.
    lid(int)‡              index of lambda value for prediction.
    lic(string)            selects which information criterion to use for prediction.
    approx‡                linear approximation is used instead of re-estimation.  Faster, but only exact if coefficient path is
                            piecewise linear.  Only supported in combination with lambda().
    noisily                displays beta used for prediction.
    postresults‡           store estimation results in e() if re-estimation is used
    ----------------------------------------------------------------------------------------------------------------------------
    ‡ Only applicable if lambda was a list in the previous lasso2 estimation.

    lasso2 may be used with time-series or panel data, in which case the data must be tsset or xtset first; see help tsset or 
    xtset.

    All varlists may contain time-series operators or factor variables; see help varlist.
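
    For example, with hypothetical panel data (id, year, y, x1 and x2 are assumed variable names), time-series operators and
    the within transformation can be combined:
        . xtset id year
        . lasso2 y L(1/2).x1 x2, fe lambda(10)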

Contents

    Description
    Coordinate descent algorithm
    Penalization level
    Standardization of variables
    Information criteria
    Estimators
    lasso2 vs. Friedman et al.'s glmnet and StataCorp's lasso
    Examples and demonstration
    --Data set
    --General demonstration
    --Information criteria
    --Plotting
    --Predicted values
    --Standardization
    --Penalty loadings and notpen()
    --Partialling vs penalization
    --Adaptive lasso
    --Replication of glmnet and StataCorp's lasso
    Saved results
    References
    Website
    Installation
    Acknowledgements
    Citation of lassopack


Description

    lasso2 solves the following problem

        1/N RSS + lambda/N*alpha*||Psi*beta||[1] + lambda/(2*N)*(1-alpha)*||Psi*beta||[2]^2, 
        
    where

    RSS        = sum((y(i)-x(i)'beta)^2) denotes the residual sum of squares,
    beta       is a p-dimensional parameter vector,
    lambda     is the overall penalty level,
    ||.||[j]   denotes the L(j) vector norm for j=1,2;
    alpha      the elastic net parameter, which determines the relative contribution of L1 (lasso-type) to L2 (ridge-type)
                penalization.
    Psi        is a p by p diagonal matrix of predictor-specific penalty loadings. Note that lasso2 treats Psi as a row vector.
    N          number of observations

    Note: the above lambda and alpha differ from the definitions used in parts of the lasso and elastic net literature, e.g.,
    the R package glmnet by Friedman et al. (2010).  We have here adopted an objective function following Belloni et al. (2012).
    See below for more discussion and for examples of how to use the lglmnet option to replicate glmnet output.

    In addition, if the option sqrt is specified, lasso2 estimates the square-root lasso (sqrt-lasso) estimator, which is
    defined as the solution to the following objective function:

        sqrt(1/N*RSS) + lambda/N*||Psi*beta||[1]. 
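
    As a quick illustration, the calls below (a sketch using the prostate data introduced in the Examples section; it assumes
    the data have been loaded as shown there) fit the estimators defined by these objective functions for a single penalty
    level:  the lasso (the default), the elastic net with alpha=0.5, ridge regression (alpha=0) and the sqrt-lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) alpha(0.5)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) alpha(0)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) sqrt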
        
Coordinate descent algorithm

    lasso2 implements the elastic net and sqrt-lasso using coordinate descent algorithms.  The algorithm (then referred to as
    "shooting") was first proposed by Fu (1998) for the lasso, and by Van der Kooij (2007) for the elastic net.  Belloni et al.
    (2011) implement the coordinate descent for the sqrt-lasso, and have kindly provided Matlab code.

    Coordinate descent algorithms repeatedly cycle over predictors j=1,...,p and update single coefficient estimates until
    convergence.  Suppose the predictors are centered and standardized to have unit variance.  In that case, the update for
    coefficient j is obtained using univariate regression of the current partial residuals (i.e., excluding the contribution of
    predictor j) against predictor j.

    The algorithm requires an initial beta estimate for which the Ridge estimate is used.  If the coefficient path is obtained
    for a list of lambda values, lasso2 starts from the largest lambda value and uses previous estimates as warm starts.

    See Friedman et al. (2007, 2010), and references therein, for further information.
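
    For intuition, the Mata sketch below shows the generic soft-thresholding update at the heart of coordinate descent for the
    lasso.  It is written for the textbook parameterization 1/(2N)*RSS + lambda*||beta||[1] with a centered, standardized
    predictor, and is illustrative only; it is not lasso2's internal code, which uses the parameterization and penalty
    loadings described above.
        mata:
        // Soft-thresholding update for one standardized predictor xj.
        // r_partial holds the current partial residuals (excluding predictor j's contribution).
        real scalar soft_update(real colvector xj, real colvector r_partial, real scalar lambda)
        {
            real scalar N, rho
            N   = rows(xj)
            rho = quadcross(xj, r_partial)/N        // univariate regression coefficient
            // shrink towards zero; the coefficient is set to exactly zero if |rho| <= lambda
            return(sign(rho)*max((abs(rho)-lambda, 0)))
        }
        end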

Penalization level: choice of lambda (and alpha)

    Penalized regression methods, such as the elastic net and the sqrt-lasso, rely on tuning parameters that control the degree
    and type of penalization.  The estimation methods implemented in lasso2 use two tuning parameters:  lambda, which controls
    the general degree of penalization, and alpha, which determines the relative contribution of L1-type to L2-type
    penalization.  lasso2 obtains elastic net and sqrt-lasso solutions for a given lambda value or a list of lambda values, and
    for a given alpha value (default=1).

    lassopack offers three approaches for selecting the "optimal" lambda (and alpha) value:

    (1) The penalty level may be chosen by cross-validation in order to optimize out-of-sample prediction performance.  K-fold
    cross-validation and rolling cross-validation (for panel and time-series data) are implemented in cvlasso.  cvlasso also
    supports cross-validation across alpha.

    (2) Theoretically justified and feasible penalty levels and loadings are available for the lasso and sqrt-lasso via the
    separate command rlasso.  The penalization is chosen to dominate the noise of the data-generating process (represented by
    the score vector), which allows derivation of theoretical results with regard to consistent prediction and parameter
    estimation.  Since the error variance is in practice unknown, Belloni et al. (2012) introduce the rigorous (or feasible)
    lasso that relies on an iterative algorithm for estimating the optimal penalization and is valid in the presence of
    non-Gaussian and heteroskedastic errors.  Belloni et al. (2016) extend the framework to the panel data setting.  In the case
    of the sqrt-lasso under homoskedasticity, the optimal penalty level is independent of the unknown error variance, leading to
    a practical advantage and better performance in finite samples (see Belloni et al., 2011, 2014).  See help rlasso for more
    details.

    (3) Lambda can also be selected using information criteria.  lasso2 calculates four information criteria:  Akaike
    Information Criterion (AIC; Akaike, 1974), Bayesian Information Criterion (BIC; Schwarz, 1978), Extended Bayesian
    information criterion (EBIC; Chen & Chen, 2008) and the corrected AIC (AICc; Sugiura, 1978, and Hurvich, 1989).  By default,
    lasso2 displays EBIC in the output, but all four information criteria are stored in e(aic), e(bic), e(ebic) and e(aicc).
    See section Information criteria for more information.
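
    As a sketch using the prostate data introduced in the Examples section (cvlasso and rlasso are separate lassopack commands
    and must be installed), the three approaches correspond to:

    (1) cross-validation over lambda:
        . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45

    (2) theory-driven ("rigorous") penalization:
        . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45

    (3) selection by information criterion:
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(ebic)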

Standardization of variables

    Standard practice is for predictors to be "standardized", i.e., normalized to have mean zero and unit variance.  By default
    lasso2 achieves this by incorporating the standardization into the penalty loadings.  We refer to this method as
    standardization "on the fly", as standardization occurs during rather than before estimation.  Alternatively, the option
    prestd causes the predictors to be standardized prior to the estimation.

    Standardizing "on the fly" via the penalty loadings and pre-standardizing the data prior to estimation are theoretically
    equivalent.  The default standardizing "on the fly" is often faster.  The prestd option can lead to improved numerical
    precision or more stable results in the case of difficult problems; the cost is the computation time required to
    pre-standardize the data.
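
    A minimal check of this equivalence using the prostate data from the Examples section (the matrix names are arbitrary; in
    both cases the coefficients are returned in original units):
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . mat b_onthefly = e(b)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) prestd
        . mat b_prestd = e(b)
        . mat list b_onthefly
        . mat list b_prestd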

Estimators

    Ridge regression (Hoerl & Kennard, 1970)

    The ridge estimator can be written as

         betahat(ridge) = (X'X+lambda*I(p))^(-1)X'y.

    Thus, even if X'X is not full rank (e.g. because p>n), the problem becomes nonsingular by adding a constant to the diagonal
    of X'X.  Another advantage of the ridge estimator over least squares stems from the variance-bias trade-off.  Ridge
    regression may improve over ordinary least squares by inducing a mild bias while decreasing the variance.  For this reason,
    ridge regression is a popular method in the context of multicollinearity.  In contrast to estimators relying on
    L1-penalization, the ridge does not yield sparse solutions and keeps all predictors in the model.

    Lasso estimator (Tibshirani, 1996)

    The lasso minimizes the residual sum of squares subject to a constraint on the absolute size of coefficient estimates.
    Tibshirani (1996) motivates the lasso with two major advantages over least squares.  First, due to the nature of the
    L1-penalty, the lasso tends to produce sparse solutions and thus facilitates model interpretation.  Secondly, similar to
    ridge regression, lasso can outperform least squares in terms of prediction due to lower variance.  Another advantage is
    that the lasso is computationally attractive due to its convex form.  This is in contrast to model selection based on AIC or
    BIC (which employ L0 penalization) where each possible sub-model has to be estimated.

    Elastic net (Zou & Hastie, 2005)

    The elastic net applies a mix of L1 (lasso-type) and L2 (ridge-type) penalization.  It combines some of the strengths of
    lasso and ridge regression.  In the presence of groups of correlated regressors, the lasso typically selects only one
    variable from each group, whereas the ridge tends to produce similar coefficient estimates for groups of correlated
    variables.  On the other hand, the ridge does not yield sparse solutions, which impedes model interpretation.  The elastic
    net is able to produce sparse solutions (for some alpha greater than zero) and retains (or drops) correlated variables
    jointly.

    Adaptive lasso (Zou, 2006)

    The lasso is only variable selection consistent under the rather strong "irrepresentable condition", which imposes
    constraints on the degree of correlation between predictors in the true model and predictors outside of the model (see Zhao
    & Yu, 2006; Meinshausen & Bühlmann, 2006).  Zou (2006) proposes the adaptive lasso which uses penalty loadings of
    1/abs(beta0(j))^theta where beta0 is an initial estimator.  The adaptive lasso is variable-selection consistent for fixed p
    under weaker assumptions than the standard lasso.  If p<n, OLS can be used as the initial estimator.  Huang et al. (2008)
    suggest using univariate OLS if p>n.  Other initial estimators are possible.

    Square-root lasso (Belloni et al., 2011, 2014)

    The sqrt-lasso is a modification of the lasso that minimizes (RSS)^(1/2) instead of RSS, while also imposing an L1-penalty.
    The main advantage of the sqrt-lasso over the standard lasso is that the theoretically grounded, data-driven optimal lambda
    is independent of the unknown error variance under homoskedasticity.  See rlasso.

    Post-estimation OLS

    Penalized regression methods induce a bias that can be alleviated by post-estimation OLS, which applies OLS to the
    predictors selected by the first-stage variable selection method.  For the case of the lasso, Belloni and Chernozhukov (
    2013) have shown that the post-lasso OLS performs at least as well as the lasso under mild additional assumptions.
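
    A brief sketch using the prostate data from the Examples section (yhat_postols is an arbitrary new variable name):  the
    ols option reports post-estimation OLS coefficients, and predict with the ols option uses them for prediction.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) ols
        . predict double yhat_postols, xb ols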

    For further information on the lasso and related methods, see for example the textbooks by Hastie et al. (2009, 2015; both
    available for free) and Bühlmann & Van de Geer (2011).

Information criteria
 
    The information criteria supported by lasso2 are the Akaike information criterion (AIC, Akaike, 1974), the Bayesian
    information criterion (BIC, Schwarz, 1978), the corrected AIC (Sugiura, 1978; Hurvich, 1989), and the Extended BIC (Chen &
    Chen, 2008).  These are given by (omitting dependence on lambda and alpha):

        AIC     = N*log(RSS/N) + 2*df
        BIC     = N*log(RSS/N) + df*log(N) 
        AICc    = N*log(RSS/N) + 2*df*N/(N-df)
        EBIC    = BIC + 2*xi*df*log(p)

    where RSS(lambda,alpha) is the residual sum of squares and df(lambda,alpha) is the effective degrees of freedom, which is a
    measure of model complexity.  In the linear regression model, the degrees of freedom is simply the number of regressors.
    Zou et al. (2007) show that the number of non-zero coefficients is an unbiased and consistent estimator of df(lambda,alpha)
    for the lasso.  More generally, the degrees of freedom of the elastic net can be calculated as the trace of the projection
    matrix.  With an unbiased estimator for df available, the above information criteria can be employed to select tuning
    parameters.

    The BIC is known to be model selection consistent if the true model is among the candidate models, whereas the AIC tends to
    yield an overfitted model.  On the other hand, the AIC is loss efficient in the sense that it selects the model that
    minimizes the squared average prediction error, while the BIC does not possess this property.  Zhang et al. (2010) show that
    these principles also apply when AIC and BIC are employed to select the tuning parameter for penalized regression.

    Both AIC and BIC tend to overselect regressors in the small-N-large-p case.  The AICc corrects the small sample bias of the
    AIC which can be especially severe in the high-dimensional context.  Similarly, the EBIC addresses the shortcomings of the
    BIC when p is large by imposing a larger penalty on the number of coefficients.  Chen & Chen (2008) show that the EBIC
    performs better in terms of false discovery rate at the cost of a negligible reduction in the positive selection rate.

    The EBIC depends on an additional parameter, xi (denoted as gamma in the original article), which can be controlled using
    ebicgamma(real).  xi=0 is equivalent to the BIC.  We follow Chen & Chen (2008, p. 768) and use xi=1-log(n)/(2*log(p)) as
    the default choice.  An upper and lower threshold is applied to ensure that xi lies in the [0,1] interval.
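
    For example, with the prostate data used in the Examples section (n=97 observations and p=8 predictors), the formula gives
    a negative value, so xi is set to the lower threshold of 0 and the EBIC coincides with the BIC for that data set:
        . di 1 - log(97)/(2*log(8))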

    The EBIC is displayed in the output of lasso2 by default (if lambda is a list), but all four information criteria are
    returned in e().  The lambda values that minimize the information criteria for a given alpha are returned in e(laic),
    e(lbic), e(laicc) and e(lebic), respectively.  To change the default display, use the ic(string) option.  noic suppresses
    the calculation of information criteria, which leads to a speed gain if alpha<1.

lasso2 vs. Friedman et al.'s (2010) glmnet and StataCorp's lasso

    The parameterization used by lasso2 differs from StataCorp's lasso in only one respect:  lambda(StataCorp) =
    (1/2N)*lambda(lasso2).  The elastic net parameter alpha is the same in both parameterizations.  See below for examples.

    The parameterization used by Friedman et al.'s (2010) glmnet uses the same convention as StataCorp for lambda:
    lambda(glmnet) = (1/2N)*lambda(lasso2).  However, the glmnet treatment of the elastic net parameter alpha differs from both
    lasso2 and StataCorp's lasso.  The glmnet objective function is defined such that the dependent variable is assumed already
    to have been standardized.  Because the L2 norm is nonlinear, this affects the interpretation of alpha.  Specifically, the
    default lasso2 and StataCorp's lasso parameterization means that alpha is not invariant to changes in the scale of the
    dependent variable.  The glmnet parameterization of alpha, however, is scale-invariant - a useful feature.

    lasso2 provides an lglmnet option that enables the user to employ the glmnet parameterization for alpha and lambda.  See
    below for examples of its usage and how to replicate glmnet output.  We recommend the use of the lglmnet option in
    particular with cross-validation over alpha (see cvlasso).

Example using prostate cancer data (Stamey et al., 1989)

    Data set

    The data set is available through Hastie et al. (2009) on the authors' website.  The following variables are included in the
    data set of 97 men:

    Predictors    
      lcavol    log(cancer volume)
      lweight   log(prostate weight)
      age       patient age
      lbph      log(benign prostatic hyperplasia amount)
      svi       seminal vesicle invasion
      lcp       log(capsular penetration)
      gleason   Gleason score
      pgg45     percentage Gleason scores 4 or 5

    Outcome       
      lpsa      log(prostate specific antigen)

    Load prostate cancer data.
        . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear tab

    General demonstration

    Estimate coefficient lasso path over (default) list of lambda values.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45

    The replay syntax can be used to re-display estimation results.
        . lasso2

    User-specified lambda list.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10)

    The list of returned e() objects depends on whether lambda() is a list (the default) or a scalar value.  For example, if
    lambda is a scalar, one vector of coefficient estimates is returned.  If lambda is a list, the whole coefficient path for a
    range of lambda values is obtained.  The last row of e(betas) is equal to the row vector e(b).
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10)
        . ereturn list
        . mat list e(betas)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . ereturn list
        . mat list e(b)

    Sqrt-lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt

    Ridge regression.  All predictors are included in the model.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0)

    Elastic net with alpha=0.1.  Even though alpha is close to zero (alpha=0 corresponds to ridge regression), the elastic net
    can produce sparse solutions.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1)

    The option ols triggers the use of post-estimation OLS.  OLS alleviates the shrinkage bias induced by L1 and L2 norm
    penalization.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ols
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt ols
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) ols

    Information criteria

    lasso2 calculates four information criteria: AIC, BIC, EBIC and AICc.  The EBIC is shown by default in the output along with
    the R-squared.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45

    To see another information criterion in the output, use the ic(string) option where string can be replaced by aic, bic, ebic
    or aicc (note the lower case spelling).  For example, to display AIC:
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ic(aic)

    In fact, there is no need to re-run the full model.  We can make use of the replay syntax:
        . lasso2, ic(aic)

    The long option triggers extended output; instead of showing only the points at which predictors enter or leave the model,
    all models are shown.  An asterisk marks the model (i.e., the value of lambda) that minimizes the information criterion
    (here, AIC).
        . lasso2, ic(aic) long

    To estimate the model corresponding to the minimum information criterion, click on the link at the bottom of the output or
    type one of the following:
        . lasso2, lic(aic)
        . lasso2, lic(ebic)
        . lasso2, lic(bic)
        . lasso2, lic(aicc)

    To store the estimation results of the selected model, add the postresults option.
        . lasso2, lic(ebic)
        . ereturn list
        . lasso2, lic(ebic) postres
        . ereturn list

    The same can also be achieved in one line without using the replay syntax.  lasso2 first obtains the full coefficient path
    for a list of lambda values, and then runs the model selected by AIC.  Again, postresults can be used to store results of the
    selected model.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(aic) postres

    Plotting

    Plot coefficients against lambda:  As lambda increases, the coefficient estimates are shrunk towards zero.  Lambda=0
    corresponds to OLS and if lambda is sufficiently large the model is empty.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(lambda)

    Plot coefficients against L1 norm.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(norm)

    The replay syntax can also be used for plotting.
        . lasso2, plotpath(norm)

    Only selected variables are plotted.
        . lasso2, plotpath(norm) plotvar(lcavol svi)

    The variable names can be displayed directly next to each series using plotlabel.  plotopt(legend(off)) suppresses the
    legend.
        . lasso2, plotpath(lambda) plotlabel plotopt(legend(off))
        . lasso2, plotpath(norm) plotlabel plotopt(legend(off))

    Predicted values

    xbhat1 is generated by re-estimating the model for lambda=10.  The noisily option triggers the display of the estimation
    results.  xbhat2 is generated by linear approximation using the two beta estimates closest to lambda=10.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . cap drop xbhat1
        . predict double xbhat1, xb l(10) noisily
        . cap drop xbhat2
        . predict double xbhat2, xb l(10) approx

    The model is estimated explicitly using lambda=10.  If lasso2 is called with a scalar lambda value, the subsequent predict
    command requires no lambda() option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
        . cap drop xbhat3
        . predict double xbhat3, xb

    All three methods yield the same results.  Note, however, that the linear approximation is only exact for the lasso, whose
    coefficient path is piecewise linear.
        . sum xbhat1 xbhat2 xbhat3

    It is also possible to obtain predicted values by referencing a specific lambda ID using the lid() option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . cap drop xbhat4
        . predict double xbhat4, xb lid(21)
        . cap drop xbhat5
        . predict double xbhat5, xb l(25.45473900468241)
        . sum xbhat4 xbhat5

    Standardization

    By default lasso2 standardizes the predictors to have unit variance.  Standardization is done by default "on the fly" via
    penalty loadings.  The coefficient estimates are returned in original units.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)

    Instead of standardizing "on the fly" by setting penalization loadings equal to standardization loadings, we can standardize
    the regressors prior to estimation with the prestd option.  Both methods are equivalent in theory.  Standardizing "on the
    fly" tends to be faster, but pre-standardization may lead to more stable results in the case of difficult problems.  See
    here for more information.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
        . mat list e(Psi)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd
        . mat list e(Psi)

    The penalty loadings used in estimation are stored in e(Psi).  In the first case above, the standardization loadings are
    returned.  In the second case the penalty loadings are equal to one for all regressors.

    To get the coefficients in standard deviation units, stdcoef can be specified along with the prestd option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd stdcoef

    We can override any form of standardization with the unitloadings option, which sets the penalty loadings to a vector of
    1s.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) unitloadings

    The same logic applies to the sqrt-lasso (and elastic net).
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt prestd

    Penalty loadings and notpen()

    By default the penalty loading vector is a vector of standard deviations.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
        . mat list e(Psi)

    We can set the penalty loading for specific predictors to zero, implying no penalization.  Unpenalized predictors are always
    included in the model.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol)
        . mat list e(Psi)

    We can specify custom penalty loadings.  The option ploadings expects a row vector of size p where p is the number of
    regressors (excluding the constant, which is partialled out).  Because we pre-standardize the data (and we are using the
    lasso) the results are equivalent to the results above (standardizing on the fly and specifying lcavol as unpenalized).
        . mat myloadings = (0,1,1,1,1,1,1,1)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) ploadings(myloadings) prestd
        . mat list e(Psi)

    Partialling vs penalization

    If lambda and the penalty loadings are kept constant, partialling out variables and leaving them unpenalized yield the same
    results for the included/penalized regressors.  Yamada (2017) shows that the equivalence of partialling out and not
    penalizing holds for lasso and ridge regression.  The examples below suggest that the same result also holds for the elastic
    net in general and the sqrt-lasso.  Note that the equivalence only holds if the regressor matrix and other penalty loadings
    are the same.  Below we use the unitloadings option to achieve this; alternatively we could use the ploadings(.) option.

    Lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) partial(lcavol) unitload

    Sqrt-lasso.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt notpen(lcavol) unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt partial(lcavol) unitload

    Ridge regression.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) notpen(lcavol) unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) partial(lcavol) unitload

    Elastic net.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) notpen(lcavol) unitload
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) partial(lcavol) unitload

    Partialling-out is implemented in Mata using one of Mata's solvers.  In cases where the variables to be partialled out are
    collinear or nearly so, different solvers may generate different results.  Users may wish to check the stability of their
    results in such cases.  The psolver(.) option can be used to specify the Mata solver used.  The default behavior for solving
    AX=B for X is to use the QR decomposition applied to (A'A) and (A'B), i.e., qrsolve((A'A),(A'B)), abbreviated qrxx.
    Available options are qr, qrxx, lu, luxx, svd, svdxx, where, e.g., svd indicates using svsolve(A,B) and svdxx indicates
    using svsolve((A'A),(A'B)).  lasso2 will warn if collinear variables are dropped when partialling out.
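
    For example, to force the LU-based solver applied to (A'A) and (A'B) when partialling out a variable (a sketch using the
    prostate data from the Examples section):
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) partial(lcavol) psolver(luxx)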

    Adaptive lasso

    The adaptive lasso relies on an initial estimator to calculate the penalty loadings.  The penalty loadings are given by
    1/abs(beta0(j))^theta, where beta0(j) denotes the initial estimate for predictor j.  By default, lasso2 uses OLS as the
    initial estimator as originally suggested by Zou (2006).  If the number of parameters exceeds the numbers of observations,
    univariate OLS is used; see Huang et al. (2008).
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive
        . mat list e(Psi)

    See the OLS estimates for comparison.
        . reg lpsa lcavol lweight age lbph svi lcp gleason pgg45

    Theta (the exponent for calculating the adaptive loadings) can be changed using the adatheta() option.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adat(2)
        . mat list e(Psi)

    Other initial estimators such as ridge regression are possible.
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0)
        . mat bhat_ridge = e(b)
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adaloadings(bhat_ridge)
        . mat list e(Psi)

    Replication of glmnet and StataCorp's lasso

    Use Stata's auto dataset with missing data dropped.  The variable price1000 is used to illustrate scaling effects.
        . sysuse auto, clear
        . drop if rep78==.
        . gen double price1000 = price/1000

    To load the data into R for comparison with glmnet, use the following commands.  The packages haven and tidyr need to be
    installed.
        auto <- haven::read_dta("http://www.stata-press.com/data/r9/auto.dta")
        auto <- tidyr::drop_na(auto)
        n <- nrow(auto)
        price <- auto$price
        X <- auto[, c("mpg", "rep78", "headroom", "trunk", "weight", "length", "turn", "displacement", "gear_ratio", "foreign")]
        X$foreign <- as.integer(X$foreign)
        X <- as.matrix(X)

    Replication of StataCorp's lasso and elasticnet requires only the rescaling of lambda by 2N.  With N=69, the lasso2 lambda
    of 138000 corresponds to a StataCorp lambda of 138000/(2*69) = 1000.

        . lasso2 price mpg-foreign, lambda(138000)
        . lasso linear price mpg-foreign, grid(1, min(1000))
        . lassoselect lambda = 1000
        . lassocoef, display(coef, penalized)

        . lasso2 price mpg-foreign, alpha(0.6) lambda(138000)
        . elasticnet linear price mpg-foreign, alphas(0.6) grid(1, min(1000))
        . lassoselect alpha = 0.6 lambda = 1000
        . lassocoef, display(coef, penalized)

    glmnet uses the same definition of the lasso L1 penalty as StataCorp's lasso, so lasso2's default parameterization again
    requires only rescaling by 2N.  When the lglmnet option is used, the penalty level should be provided using the glmnet
    definition.  To estimate in R, load glmnet with library("glmnet") and use the following command:
        r<-glmnet(X,price,alpha=1,lambda=1000,thresh=1e-15)

        . lasso2 price mpg-foreign, lambda(138000)
        . lasso2 price mpg-foreign, lambda(1000) lglmnet

    The R code below uses glmnet to estimate an elastic net model.  lasso2 with the lglmnet option will replicate it.
        r<-glmnet(X,price,alpha=0.6,lambda=1000,thresh=1e-15)

        . lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet

    lasso2's default parameterization of the elastic net (like StataCorp's elasticnet) is not invariant to scaling:

        . lasso2 price mpg-foreign, alpha(0.6) lambda(138000)
        . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(138)

    When lasso2 uses the glmnet parameterization of the elastic net via the lglmnet option, results are invariant to scaling:
    the only difference is that the coefficients change by the same factor of proportionality as the dependent variable.

        . lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet
        . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet

    The reason that the default lasso2/StataCorp parameterization is not invariant to scaling is that the penalty on the L2 norm
    is influenced by scaling, and this in turn affects the relative weights on the L1 and L2 penalties.  The example below shows
    how to reparameterize so that the default lasso2 parameterization for the elastic net replicates the glmnet
    parameterization.  The example uses the scaling above, where the dependent variable is price1000 and the glmnet lambda=1.

    The large-sample standard deviation of price1000 = 2.8912586.
        . qui sum price1000
        . di r(sd) * 1/sqrt( r(N)/(r(N)-1))

    The lasso2 alpha = alpha(lglmnet)*SD(y) / (1-alpha(lglmnet) + alpha(lglmnet)*SD(y)).  In this example, alpha =
    (0.6*2.8912586)/(1-0.6 + 0.6*2.8912586) = 0.81262488.
        . di (0.6*2.8912586)/( 1-0.6 + 0.6*2.8912586)

    The lasso2 lambda = 2N*lambda(lglmnet) * (alpha(lglmnet) + (1-alpha(lglmnet))/SD(y)).  In this example, lambda = 2*69*1 *
    (0.6 + (1-0.6)/2.8912586) = 101.89203.
        . di 2*69*( 0.6 + (1-0.6)/2.8912586)

    lasso2 using the glmnet parameterization, and then replicated using the lasso2/StataCorp parameterization:
        . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet
        . lasso2 price1000 mpg-foreign, alpha(.81262488) lambda(101.89203)

Saved results

    The set of returned e-class objects depends on whether lambda is a scalar or a list (the default).

    scalars       
      e(N)               sample size
      e(cons)            =1 if constant is present, 0 otherwise
      e(fe)              =1 if fixed effects model is used, 0 otherwise
      e(alpha)           elastic net parameter
      e(sqrt)            =1 if the sqrt-lasso is used, 0 otherwise
      e(ols)             =1 if post-estimation OLS results are returned, 0 otherwise
      e(adaptive)        =1 if adaptive loadings are used, 0 otherwise
      e(p)               number of penalized regressors in model
      e(notpen_ct)       number of unpenalized variables
      e(partial_ct)      number of partialled out regressors (incl constant)
      e(prestd)          =1 if pre-standardized
      e(lcount)          number of lambda values

    scalars (only if lambda is a list)
      e(lmax)            largest lambda value
      e(lmin)            smallest lambda value

    scalars (only if lambda is a scalar)
      e(lambda)          penalty level
      e(r2)              R-sq for lasso estimation
      e(rmse)            root mean squared error
      e(rmseOLS)         root mean squared error of post-estimation OLS
      e(objfn)           minimized objective function
      e(k)               number of selected and unpenalized/partialled-out regressors including constant (if present)
      e(s)               number of selected regressors
      e(s0)              number of selected and unpenalized regressors including constant (if present)
      e(df)              L0 norm ("effective degrees of freedom")
      e(niter)           number of iterations
      e(maxiter)         maximum number of iterations
      e(tss)             total sum of squares
      e(aicmin)          minimum AIC
      e(bicmin)          minimum BIC
      e(aiccmin)         minimum AICc
      e(ebicmin)         minimum EBIC
      e(laic)            lambda corresponding to minimum AIC
      e(lbic)            lambda corresponding to minimum BIC
      e(laicc)           lambda corresponding to minimum AICc
      e(lebic)           lambda corresponding to minimum EBIC

    macros        
      e(cmd)             command name
      e(depvar)          name of dependent variable
      e(varX)            all predictors
      e(varXmodel)       penalized predictors
      e(partial)         partialled out predictors
      e(notpen)          unpenalized predictors
      e(method)          estimation method

    macros (only if lambda is a scalar)
      e(selected)        selected predictors
      e(selected0)       selected predictors excluding constant

    matrices      
      e(Psi)             row vector of penalty loadings used in estimation
      e(stdvec)          row vector of standardization loadings

    matrices (only if lambda is a list)
      e(lambdamat)       column vector of lambdas used for estimations
      e(lambdamat0)      full initial vector of lambdas (includes lambdas not used for estimation)
      e(l1norm)          column vector of L1 norms for each lambda value (excludes the intercept)
      e(wl1norm)         column vector of weighted L1 norms for each lambda value (excludes the intercept), see wnorm
      e(betas)           matrix of estimates, where each row corresponds to one lambda value. The intercept is stored in the
                           last column.
      e(IC)              matrix of information criteria (AIC, AICc, BIC, EBIC) for each lambda value
      (NB: Versions of all the matrices above in standardized units are also saved under the same names preceded by "s".)
      e(dof)             column vector of L0 norm for each lambda value (excludes the intercept)
      e(ess)             column vector of explained sum of squares for each lambda value
      e(rss)             column vector of residual sum of squares for each lambda value
      e(rsq)             column vector of R-squared for each lambda value

    matrices (only if lambda is a scalar)
      e(beta)            coefficient vector
      e(betaOLS)         coefficient vector of post-estimation OLS
      e(betaAll)         full coefficient vector including omitted, factor base variables, etc.
      e(betaAllOLS)      full post-estimation OLS coefficient vector including omitted, factor base variables, etc.
      (NB: Versions of all the matrices above in standardized units are also saved under the same names preceded by "s".)
      e(b)               posted coefficient vector (see postall and displayall). Used for prediction.

    functions     
      e(sample)          estimation sample

References

    Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6),
        716–723.  https://doi.org/10.1109/TAC.1974.1100705

    Belloni, A., Chernozhukov, V., & Wang, L. (2011).  Square-root lasso: pivotal recovery of sparse signals via conic
        programming.  Biometrika 98(4), 791–806.  https://doi.org/10.1093/biomet/asr043

    Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse Models and Methods for Optimal Instruments With an
        Application to Eminent Domain. Econometrica 80(6), 2369–2429.  https://doi.org/10.3982/ECTA9626

    Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli,
        19(2), 521–547.  https://doi.org/10.3150/11-BEJ410

    Belloni, A., Chernozhukov, V., & Wang, L. (2014). Pivotal estimation via square-root Lasso in nonparametric regression. The
        Annals of Statistics 42(2), 757–788.  https://doi.org/10.1214/14-AOS1204

    Belloni, A., Chernozhukov, V., Hansen, C., & Kozbur, D. (2016). Inference in High Dimensional Panel Models with an
        Application to Gun Control. Journal of Business & Economic Statistics 34(4), 590–605. Methodology.  
        https://doi.org/10.1080/07350015.2015.1102733

    Bühlmann, P., & Meinshausen, N. (2006). High-dimensional graphs and variable selection with the Lasso. {it:The Annals of
        Statistics], 34(3), 1436–1462.  http://doi.org/10.1214/009053606000000281

    Bühlmann, P., & Van de Geer, S. (2011). Statistics for High-Dimensional Data. Berlin, Heidelberg: Springer-Verlag.

    Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika,
        95(3), 759–771.  https://doi.org/10.1093/biomet/asn034

    Correia, S. 2016.  FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets.  
        https://ideas.repec.org/c/boc/bocode/s458213.html

    Fu, W. J. (1998). Penalized Regressions: The Bridge Versus the Lasso. Journal of Computational and Graphical Statistics
        7(3), 397–416.  https://doi.org/10.2307/1390712

    Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied
        Statistics 1(2), 302–332.  https://doi.org/10.1214/07-AOAS131

    Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate
        Descent. Journal of Statistical Software 33(1), 1–22.  https://doi.org/10.18637/jss.v033.i01

    Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York:
        Springer-Verlag.  https://web.stanford.edu/~hastie/ElemStatLearn/

    Hastie, T., Tibshirani, R., & Wainwright, M. J. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations.
        Boca Raton: CRC Press, Taylor & Francis.  https://www.stanford.edu/~hastie/StatLearnSparsity/

    Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 12(1), 69–82.
        https://doi.org/10.1080/00401706.1970.10488635

    Huang, J., Ma, S., & Zhang, C.-H. (2008). Adaptive Lasso for Sparse High-Dimensional Regression Models Supplement.
        Statistica Sinica 18, 1603–1618.  https://doi.org/10.2307/24308572

    Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2),
        297–307.  http://doi.org/10.1093/biomet/76.2.297

    Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464.  
        https://doi.org/10.1214/aos/1176344136

    Stamey, T. A., Kabalin, J. N., Mcneal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang, N. (1989). Prostate
        Specific Antigen in the Diagnosis and Treatment of Adenocarcinoma of the Prostate. II. Radical Prostatectomy Treated
        Patients. The Journal of Urology 141(5), 1076–1083. https://doi.org/10.1016/S0022-5347(17)41175-X

    Sugiura, N. (1978). Further analysts of the data by Akaike's information criterion and the finite corrections.
        Communications in Statistics - Theory and Methods, 7(1), 13–26.  http://doi.org/10.1080/03610927808827599

    Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B
        (Methodological) 58(1), 267–288.  https://doi.org/10.2307/2346178

    Van der Kooij, A. (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D.
        thesis, Department of Data Theory, University of Leiden.  http://hdl.handle.net/1887/12096

    Yamada, H. (2017). The Frisch–Waugh–Lovell theorem for the lasso and the ridge regression. Communications in Statistics -
        Theory and Methods 46(21), 10897–10902.  https://doi.org/10.1080/03610926.2016.1252403

    Zhang, Y., Li, R., & Tsai, C.-L. (2010). Regularization Parameter Selections via Generalized Information Criterion. Journal
        of the American Statistical Association, 105(489), 312–323.  http://doi.org/10.1198/jasa.2009.tm08013

    Zhao, P., & Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.  
        http://dl.acm.org/citation.cfm?id=1248547.1248637

    Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical
        Society. Series B: Statistical Methodology 67(2), 301–320.  https://doi.org/10.1111/j.1467-9868.2005.00503.x

    Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476),
        1418–1429.  https://doi.org/10.1198/016214506000000735

    Zou, H., Hastie, T., & Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist., 35(5), 2173–2192.  
        https://doi.org/10.1214/009053607000000127

Website

    Please check our website https://statalasso.github.io/ for more information.

Installation

    lasso2 is part of the lassopack package.  To get the latest stable version of lassopack from our website, check the
    installation instructions at https://statalasso.github.io/installation/.  We update the stable website version more
    frequently than the SSC version.  Earlier versions of lassopack are also available from the website.

    To verify that lassopack is correctly installed, click on or type whichpkg lassopack (which requires whichpkg to be
    installed; ssc install whichpkg).

Acknowledgements

    Thanks to Alexandre Belloni, who provided Matlab code for the square-root lasso estimator, Sergio Correia for advice on the
    use of the FTOOLS package, and Jan Ditzen.


Citation of lassopack

    lasso2 is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it
    as such:

    Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020).  LASSOPACK: Stata module for lasso, square-root lasso, elastic
        net, ridge, adaptive lasso estimation and cross-validation http://ideas.repec.org/c/boc/bocode/s458458.html

    Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020.  lassopack: model selection and prediction with regularized regression in
        Stata.  The Stata Journal, 20(1):176-235.  https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697.  Working paper
        version: https://arxiv.org/abs/1901.05397.

Authors

        Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
        achim.ahrens@gess.ethz.ch
        
        Christian B. Hansen, University of Chicago, USA
        Christian.Hansen@chicagobooth.edu

        Mark E. Schaffer, Heriot-Watt University, UK
        m.e.schaffer@hw.ac.uk


Also see

       Help: cvlasso, rlasso, lassologit, ivlasso, pdslasso (if installed).