----------------------------------------------------------------------------------------------------------------------------------help lasso2lassopack v1.4.2 ----------------------------------------------------------------------------------------------------------------------------------Titlelasso2-- Program for lasso, square-root lasso, elastic net, ridge, adaptive lasso and post-estimation OLSFull syntaxSyntaxlasso2depvarregressors[ifexp] [inrange] [,alpha(real)sqrtadaptiveadaloadings(string)adatheta(real)olslambda(numlist)lcount(integer)lminratio(real)lmax(real)lglmnetnotpen(varlist)partial(varlist)psolver(string)norecoverploadings(string)unitloadingsprestdstdcoeffenoftoolsnoconstanttolopt(real)tolzero(real)maxiter(int)plotpath(method)plotvar(varlist)plotopt(string)plotlabelic(string)lic(string)ebicgamma(real)noiclongdisplayallpostallpostresultsverbosevverbosewnorm] Note: thefeoption will take advantage of theftoolspackage (if installed) for the fixed-effects transformation; the speed gains using this package can be large. See help ftools or click on ssc install ftools to install.EstimatorsDescription ----------------------------------------------------------------------------------------------------------------------------alpha(real)elastic net parameter, which controls the degree of L1-norm (lasso-type) to L2-norm (ridge-type) penalization. alpha=1 corresponds to the lasso (the default estimator), and alpha=0 to ridge regression. alpha must be in the interval [0,1].sqrtsquare-root lasso estimator.adaptiveadaptive lasso estimator. The penalty loading for predictor j is set to 1/abs(beta0(j))^theta where beta0(j) is the OLS estimate or univariate OLS estimate if p>n. Theta is the adaptive exponent, and can be controlled using theadatheta(real)option.adaloadings(string)alternative initial estimates, beta0, used for calculating adaptive loadings. For example, this could be the vector e(b) from an initiallasso2estimation. The elements of the vector are raised to the power -theta (note the minus). Seeadaptiveoption.adatheta(real)exponent for calculating adaptive penalty loadings. Seeadaptiveoption. Default=1.olspost-estimation OLS. If lambda is a list, post-estimation OLS results are displayed and returned ine(betas). If lambda is a scalar, post-estimation OLS is always displayed, and this option controls whether standard or post-estimation OLS results are stored ine(b). ---------------------------------------------------------------------------------------------------------------------------- See overview of estimation methods.Lambda(s)Description ----------------------------------------------------------------------------------------------------------------------------lambda(numlist)a scalar lambda value or list of descending lambda values. Each lambda value must be greater than 0. If not specified, the default list is used which is given byexp(rangen(log(lmax),log(lminratio*lmax),lcount))(seemf_range).lcount(integer)† number of lambda values for which the solution is obtained. Default is 100.lminratio(real)† ratio of minimum to maximum lambda.lminratiomust be between 0 and 1. Default is 1/1000.lmax(real)† maximum lambda value. Default is 2*max(X'y), and max(X'y) in the case of the square-root lasso (where X is the pre-standardized regressor matrix and y is the vector of the response variable).fdevminimum fractional change in deviance (R-sq) to stop looping through lambdas (path)devmaxmaximum fraction of explained deviance (R-sq) to stop looping through lambdas (path)nodevcritoverride criteria to exit path; loop through all lambdas in listlic(string)after firstlasso2estimation using list of lambdas, estimate model corresponding to minimum information criterion. 'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower case spelling. See Information criteria for the definition of each information criterion.ebicgamma(real)controls thexiparameter of the EBIC.xineeds to lie in the [0,1] interval.xi=0 is equivalent to the BIC. The default choice isxi=1-log(n)/(2*log(p)).postresultsUsed in combination withlic(). Stores estimation results of the model selected by information criterion ine().lglmnetUse the parameterizations for lambda, alpha, standardization, etc. employed byglmnetby Friedman et al. (2010). ---------------------------------------------------------------------------------------------------------------------------- † Not applicable if lambda() is specified.Loadings & standardizationDescription ----------------------------------------------------------------------------------------------------------------------------notpen(varlist)sets penalty loadings to zero for predictors invarlist. Unpenalized predictors are always included in the model.partial(varlist)variables invarlistare partialled out prior to estimation.psolver(string)override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol; default=qrxx)norecoversuppresses recovery of partialled out variables after estimation.ploadings(matrix)a row-vector of penalty loadings; overrides the default standardization loadings (in the case of the lasso, =sqrt(avg(x^2))). The size of the vector should equal the number of predictors (excluding partialled out variables and excluding the constant).unitloadingspenalty loadings set to a vector of ones; overrides the default standardization loadings (in the case of the lasso, =sqrt(avg(x^2)).nostdis a synonym forunitloadings.prestddependent variable and predictors are standardized prior to estimation rather than standardized "on the fly" using penalty loadings. See here for more details. By default the coefficient estimates are un-standardized (i.e., returned in original units).stdcoefreturn coefficients in standard deviation units, i.e., don't un-standardize.stdallreturn all results (coefficients, information criteria, norms, etc.) in standardized units. ---------------------------------------------------------------------------------------------------------------------------- See discussion of standardization.FE & constantDescription ----------------------------------------------------------------------------------------------------------------------------fewithin-transformation is applied prior to estimation. Requires data to be xtset.noftoolsdo not useftoolspackage for fixed-effects transform (slower; rarely used)noconstantsuppress constant from estimation. Default behaviour is to partial the constant out (i.e., to center the regressors). ----------------------------------------------------------------------------------------------------------------------------OptimizationDescription ----------------------------------------------------------------------------------------------------------------------------tolopt(real)tolerance for lasso shooting algorithm (default=1e-10)tolzero(real)minimum below which coeffs are rounded down to zero (default=1e-4)maxiter(int)maximum number of iterations for the lasso shooting algorithm (default=10,000) ----------------------------------------------------------------------------------------------------------------------------Plotting options*Description ----------------------------------------------------------------------------------------------------------------------------plotpath(method)plots the coefficients path as a function of the L1-norm (norm), lambda (lambda) or the log of lambda (lnlambda)plotvar(varlist)list of variables to be included in the plotplotopt(string)additional plotting options passed on toline. For example, useplotopt(legend(off))to turn off the legend.plotlabeldisplays variable labels in graph. ----------------------------------------------------------------------------------------------------------------------------*Plotting is not available if lambda is a scalar value.Display optionsDescription ----------------------------------------------------------------------------------------------------------------------------displayall*display full coefficient vectors including unselected variables (default: display only selected, unpenalized and partialled-out)postall*post full coefficient vector including unselected variables in e(b) (default: e(b) has only selected, unpenalized and partialled-out)long† show long output; instead of showing only the points at which predictors enter or leave the model, all models are shown.verboseshow additional outputvverboseshow even more outputic(string)† controls which information criterion is shown in the output. 'aic', 'bic', 'aicc', and 'ebic' (the default' are allowed). Note the lower case spelling. See Information criteria for the definition of each information criterion.noic† suppresses the calculation of information criteria. This will lead to speed gains if alpha<1, since calculation of effective degrees of freedom requires one inversion per lambda.wnorm† displays L1 norm of beta estimates weighted by penalty loadings, i.e., ||Psi*beta||(1) instead of ||beta||(1), which is the default. Note that this also affects plotting ifplotpath(norm)} is specified. ----------------------------------------------------------------------------------------------------------------------------*Only applicable if lambda is a scalar value. † Only applicable if lambda is a list (the default). Replay syntaxlasso2[,plotpath(method)plotvar(varlist)plotopt(string)plotlabellongpostresultslic(string)ic(string)wnorm]Replay optionsDescription ----------------------------------------------------------------------------------------------------------------------------longshow long output; instead of showing only the points at which predictors enter or leave the model, all models are shown.ic(string)controls which information criterion is shown in the output. 'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower case spelling. See Information criteria for the definition of each information criterion.lic(string)estimate model corresponding to minimum information criterion. 'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower case spelling. See Information criteria for the definition of each information criterion.postresultsstore estimation results in e() iflic(string)is usedplotpath(method)see Plotting options aboveplotvar(varlist)see Plotting options aboveplotopt(string)see Plotting options aboveplotlabelsee Plotting options above ---------------------------------------------------------------------------------------------------------------------------- Only applicable if lambda was a list in the previouslasso2estimation. Postestimation:predict[type]newvar[if] [in] [,xbresidualsueuexbuolslambda(real)lid(int)approxnoisilypostresults]Predict optionsDescription ----------------------------------------------------------------------------------------------------------------------------xbcompute predicted values (the default)residualscompute residualsegenerate overall error component e(it). Only afterfe.uegenerate combined residuals, i.e., u(i) + e(it). Only afterfe.xbuprediction including fixed effect, i.e., a + xb + u(i). Only afterfe.ufixed effect, i.e., u(i). Only afterfe.olsuse post-estimation OLS for predictionlambda(real)‡ lambda value for prediction. Ignored iflasso2was called with scalar lambda value.lid(int)‡ index of lambda value for prediction.lic(string)selects which information criterion to use for prediction.approx‡ linear approximation is used instead of re-estimation. Faster, but only exact if coefficient path is piecewise linear. Only supported in combination with lambda().noisilydisplays beta used for prediction.postresults‡ store estimation results in e() if re-estimation is used ---------------------------------------------------------------------------------------------------------------------------- ‡ Only applicable if lambda was a list in the previouslasso2estimation.lasso2may be used with time-series or panel data, in which case the data must be tsset or xtset first; see helptssetorxtset. All varlists may contain time-series operators or factor variables; see help varlist.Description Coordinate descent algorithm Penalization level Standardization of variables Information criteria Estimators lasso2 vs. Friedman et al.'sContentsglmnetand StataCorp's lasso Examples and demonstration --Data set --General demonstration --Information criteria --Plotting --Predicted values --Standardization --Penalty loadings and notpen() --Partialling vs penalization --Adaptive lasso --Replication of glmnet and StataCorp's lasso Saved results References Website Installation Acknowledgements Citation of lassopackDescriptionlasso2solves the following problem 1/N RSS + lambda/N*alpha*||Psi*beta||[1] + lambda/(2*N)*(1-alpha)*||Psi*beta||[2], where RSS = sum(y(i)-x(i)'beta)^2 denotes the residual sum of squares, beta is a p-dimensional parameter vector, lambda is the overall penalty level, ||.||[j] denotes the L(j) vector norm for j=1,2; alpha the elastic net parameter, which determines the relative contribution of L1 (lasso-type) to L2 (ridge-type) penalization. Psi is a p by p diagonal matrix of predictor-specific penalty loadings. Note thatlasso2treats Psi as a row vector. N number of observations Note: the above lambda and alpha differ from the definitions used in parts of the lasso and elastic net literature, e.g., the R packageglmnetby Friedman et al. (2010). We have here adopted an objective function following Belloni et al. (2012). See below and below for more discussion and examples of how to use thelglmnetoption to replicateglmnetoutput. In addition, if the optionsqrtis specified,lasso2estimates the square-root lasso (sqrt-lasso) estimator, which is defined as the solution to the following objective function: sqrt(1/N*RSS) + lambda/N*||Psi*beta||[1].Coordinate descent algorithmlasso2implements the elastic net and sqrt-lasso using coordinate descent algorithms. The algorithm (then referred to as "shooting") was first proposed by Fu (1998) for the lasso, and by Van der Kooij (2007) for the elastic net. Belloni et al. (2011) implement the coordinate descent for the sqrt-lasso, and have kindly provided Matlab code. Coordinate descent algorithms repeatedly cycle over predictorsj=1,...,pand update single coefficient estimates until convergence. Suppose the predictors are centered and standardized to have unit variance. In that case, the update for coefficientjis obtained using univariate regression of the current partial residuals (i.e., excluding the contribution of predictorj) against predictorj. The algorithm requires an initial beta estimate for which the Ridge estimate is used. If the coefficient path is obtained for a list of lambda values,lasso2starts from the largest lambda value and uses previous estimates as warm starts. See Friedman et al. (2007,2010), and references therein, for further information.Penalized regression methods, such as the elastic net and the sqrt-lasso, rely on tuning parameters that control the degree and type of penalization. The estimation methods implemented inPenalization level: choice of lambda (and alpha)lasso2use two tuning parameters: lambda, which controls the general degree of penalization, and alpha, which determines the relative contribution of L1-type to L2-type penalization.lasso2obtains elastic net and sqrt-lasso solutions for a given lambda value or a list of lambda values, and for a given alpha value (default=1).lassopackoffers three approaches for selecting the "optimal" lambda (and alpha) value: (1) The penalty level may be chosen by cross-validation in order to optimize out-of-sample prediction performance.K-fold cross-validation and rolling cross-validation (for panel and time-series data) are implemented incvlasso.cvlassoalso supports cross-validation across alpha. (2) Theoretically justified and feasible penalty levels and loadings are available for the lasso and sqrt-lasso via the separate commandrlasso. The penalization is chosen to dominate the noise of the data-generating process (represented by the score vector), which allows derivation of theoretical results with regard to consistent prediction and parameter estimation. Since the error variance is in practice unknown, Belloni et al. (2012) introduce the rigorous (or feasible) lasso that relies on an iterative algorithm for estimating the optimal penalization and is valid in the presence of non-Gaussian and heteroskedastic errors. Belloni et al. (2016) extend the framework to the panel data setting. In the case of the sqrt-lasso under homoskedasticity, the optimal penalty level is independent of the unknown error variance, leading to a practical advantage and better performance in finite samples (see Belloni et al.,2011,2014). See helprlassofor more details. (3) Lambda can also be selected using information criteria.lasso2calculates four information criteria: Akaike Information Criterion (AIC; Akaike,1974), Bayesian Information Criterion (BIC; Schwarz,1978), Extended Bayesian information criterion (EBIC; Chen & Chen,2008) and the corrected AIC (AICc; Sugiura,1978, and Hurvich,1989). By default,lasso2displays EBIC in the output, but all four information criteria are stored ine(aic),e(bic),e(ebic)ande(aicc). See section Information criteria for more information.Standard practice is for predictors to be "standardized", i.e., normalized to have mean zero and unit variance. By defaultStandardization of variableslasso2achieves this by incorporating the standardization into the penalty loadings. We refer to this method as standardization "on the fly", as standardization occurs during rather than before estimation. Alternatively, the optionprestdcauses the predictors to be standardized prior to the estimation. Standardizing "on the fly" via the penalty loadings and pre-standardizing the data prior to estimation are theoretically equivalent. The default standardizing "on the fly" is often faster. Theprestdoption can lead to improved numerical precision or more stable results in the case of difficult problems; the cost is the computation time required to pre-standardize the data.EstimatorsRidge regression (Hoerl & Kennard,1970)The ridge estimator can be written as betahat(ridge) = (X'X+lambda*I(p))^(-1)X'y. Thus, even if X'X is not full rank (e.g. because p>n), the problem becomes nonsingular by adding a constant to the diagonal of X'X. Another advantage of the ridge estimator over least squares stems from the variance-bias trade-off. Ridge regression may improve over ordinary least squares by inducing a mild bias while decreasing the variance. For this reason, ridge regression is a popular method in the context of multicollinearity. In contrast to estimators relying on L1-penalization, the ridge does not yield sparse solutions and keeps all predictors in the model.Lasso estimator (Tibshirani,1996)The lasso minimizes the residual sum of squares subject to a constraint on the absolute size of coefficient estimates. Tibshirani (1996) motivates the lasso with two major advantages over least squares. First, due to the nature of the L1-penalty, the lasso tends to produce sparse solutions and thus facilitates model interpretation. Secondly, similar to ridge regression, lasso can outperform least squares in terms of prediction due to lower variance. Another advantage is that the lasso is computationally attractive due to its convex form. This is in contrast to model selection based on AIC or BIC (which employ L0 penalization) where each possible sub-model has to be estimated.Elastic net (Zou & Hastie,2005)The elastic net applies a mix of L1 (lasso-type) and L2 (ridge-type) penalization. It combines some of the strengths of lasso and ridge regression. In the presence of groups of correlated regressors, the lasso selects typically only one variable from each group, whereas the ridge tends to produce similar coefficient estimates for groups of correlated variables. On the other hand, the ridge does not yield sparse solutions impeding model interpretation. The elastic net is able to produce sparse solutions (for some alpha greater than zero) and retains (or drops) correlated variables jointly.Adaptive lasso (Zou,2006)The lasso is only variable selection consistent under the rather strong "irrepresentable condition", which imposes constraints on the degree of correlation between predictors in the true model and predictors outside of the model (see Zhao & Yu,2006; Meinshausen & Bühlmann,2006). Zou (2006) proposes the adaptive lasso which uses penalty loadings of 1/abs(beta0(j))^theta where beta0 is an initial estimator. The adaptive lasso is variable-selection consistent for fixed p under weaker assumptions than the standard lasso. If p<n, OLS can be used as the initial estimator. Huang et al. (2008) suggest to use univariate OLS if p>n. Other initial estimators are possible.Square-root lasso (Belloni et al.,2011,2014)The sqrt-lasso is a modification of the lasso that minimizes (RSS)^(1/2) instead of RSS, while also imposing an L1-penalty. The main advantage of the sqrt-lasso over the standard lasso is that the theoretically grounded, data-driven optimal lambda is independent of the unknown error variance under homoskedasticity. Seerlasso.Post-estimation OLSPenalized regression methods induce a bias that can be alleviated by post-estimation OLS, which applies OLS to the predictors selected by the first-stage variable selection method. For the case of the lasso, Belloni and Chernozhukov (2013) have shown that the post-lasso OLS performs at least as well as the lasso under mild additional assumptions. For further information on the lasso and related methods, see for example the textbooks by Hastie et al. (2009,2015; both available for free) and Bühlmann & Van de Geer (2011).The information criteria supported byInformation criterialasso2are the Akaike information criterion (AIC, Akaike,1974), the Bayesian information criterion (BIC, Schwarz,1978), the corrected AIC (Sugiura,1978; Hurvich,1989), and the Extended BIC (Chen & Chen,2008). These are given by (omitting dependence on lambda and alpha): AIC = N*log(RSS/N) + 2*dfBIC = N*log(RSS/N) +df*log(N) AICc = N*log(RSS/N) + 2*df*N/(N-df) EBIC = BIC + 2*xi*df*log(p) where RSS(lambda,alpha) is the residual sum of squares anddf(lambda,alpha) is the effective degrees of freedom, which is a measure of model complexity. In the linear regression model, the degrees of freedom is simply the number of regressors. Zou et al. (2007) show that the number of non-zero coefficients is an unbiased and consistent estimator ofdf(lambda,alpha) for the lasso. More generally, the degrees of freedom of the elastic net can be calculated as the trace of the projection matrix. With an unbiased estimator fordfavailable, the above information criteria can be employed to select tuning parameters. The BIC is known to be model selection consistent if the true model is among the candidate models, whereas the AIC tends to yield an overfitted model. On the other hand, the AIC is loss efficient in the sense that it selects the model that minimizes the squared average prediction error, while the BIC does not possess this property. Zhang et al. (2010) show that these principles also apply when AIC and BIC are employed to select the tuning parameter for penalized regression. Both AIC and BIC tend to overselect regressors in the small-N-large-p case. The AICc corrects the small sample bias of the AIC which can be especially severe in the high-dimensional context. Similarily, the EBIC addresses the shortcomings of the BIC when p is large by imposing a larger penalty on the number of coefficients. Chen & Chen (2008) show that the EBIC performs better in terms of false discovery rate at the cost of a negligible reduction in the positive selection rate. The EBIC depends on an additional parameter,xi(denoted as gamma in the original article), which can be controlled usingebicgamma(real).gamma=0 is equivalent to the BIC. We follow Chen & Chen (2008, p. 768) and usexi=1-log(n)/(2*log(p)) as the default choice. An upper and lower threshold is applied to ensure thatxilies in the [0,1] interval. The EBIC is displayed in the output oflasso2by default (if lambda is a list), but all four information criteria are returned ine(). The lambda values that minimize the information criteria for a given alpha are returned ine(laic),e(lbic),e(laicc)ande(lebic), respectively. To change the default display, use theic(string)option.noicsuppresses the calculation of information criteria, which leads to a speed gain if alpha<1.lasso2 vs. Hastie et al.'s (2010)glmnetThe parameterization used byand StataCorp's lassolasso2differs from StataCorp'slassoin only one respect:lambda(StataCorp)= (1/2N)*lambda(lasso2). The elastic net parameteralphais the same in both parameterizations. See below for examples. The parameterization used by Hastie et al.'s (2010)glmnetuses the same convention as StataCorp for lambda:lambda(glmnet)= (1/2N)*lambda(lasso2). However, theglmnettreatment of the elastic net parameter alpha differs from bothlasso2and StataCorp'slasso. Theglmnetobjective function is defined such that the dependent variable is assumed already to have been standardized. Because the L2 norm is nonlinear, this affects the interpretation of alpha. Specifically, the defaultlasso2and StataCorp'slassoparameterization means that alpha is not invariant changes in the scale of the dependent variable. Theglmnetparameterization of alpha, however, is scale-invariant - a useful feature.lasso2provides anlglmnetoption that enables the user to employ theglmnetparameterization for alpha and lambda. See below for examples of its usage and how to replicateglmnetoutput. We recommend the use of thelglmnetoption in particular with cross-validation over alpha (seecvlasso).Example using prostate cancer data (Stamey et al.,1989)Data setThe data set is available through Hastie et al. (2009) on the authors' website. The following variables are included in the data set of 97 men: Predictors lcavol log(cancer volume) lweight log(prostate weight) age patient age lbph log(benign prostatic hyperplasia amount) svi seminal vesicle invasion lcp log(capsular penetration) gleason Gleason score pgg45 percentage Gleason scores 4 or 5 Outcome lpsa log(prostate specific antigen) Load prostate cancer data. . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear tabGeneral demonstrationEstimate coefficient lasso path over (default) list of lambda values. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 The replay syntax can be used to re-display estimation results. . lasso2 User-specified lambda list. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10) The list of returnede()objects depends on whetherlambda()is a list (the default) or a scalar value. For example, if lambda is a scalar, one vector of coefficient estimates is returned. If lambda is a list, the whole coefficient path for a range of lambda values is obtained. The last row ofe(betas)is equal to the row vectore(b). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10) . ereturn list . mat list e(betas) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) . ereturn list . mat list e(b) Sqrt-lasso. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt Ridge regression. All predictors are included in the model. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0) Elastic net with alpha=0.1. Even though alpha is close to zero (Ridge regression), the elastic net can produce sparse solutions. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) The optionolstriggers the use of post-estimation OLS. OLS alleviates the shrinkage bias induced by L1 and L2 norm penalization. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ols . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt ols . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) olsInformation criterialasso2calculates four information criteria: AIC, BIC, EBIC and AICc. The EBIC is shown by default in the output along with the R-squared. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 To see another information criterion in the outout, use theic(string)option wherestringcan be replaced byaic,bic,ebicoraicc(note the lower case spelling). For example, to display AIC: . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ic(aic) In fact, there is no need to re-run the full model. We can make use of the replay syntax: . lasso2, ic(aic) Thelongoption triggers extended output; instead of showing only the points at which predictors enter or leave the model, all models are shown. An asterisk marks the model (i.e., the value of lambda) that minimizes the information criterion (here, AIC). . lasso2, ic(aic) long To estimate the model corresponding to the minimum information criterion, click on the link at the bottom of the output or type one of the following: . lasso2, lic(aic) . lasso2, lic(ebic) . lasso2, lic(bic) . lasso2, lic(aicc) To store the estimation results of the selected model, add thepostresultsoption. . lasso2, lic(ebic) . ereturn list . lasso2, lic(ebic) postres . ereturn list The same can also be achieved in one line without using the replay syntax.lasso2first obtains the full coefficient path for a list lambda values, and then runs the model selected by AIC. Again,postresultscan be used to store results of the selected model. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(aic) postresPlottingPlot coefficients against lambda: As lambda increases, the coefficient estimates are shrunk towards zero. Lambda=0 corresponds to OLS and if lambda is sufficiently large the model is empty. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(lambda) Plot coefficients against L1 norm. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(norm) The replay syntax can also be used for plotting. . lasso2, plotpath(norm) Only selected variables are plotted. . lasso2, plotpath(norm) plotvar(lcavol svi) The variable names can be displayed directly next to each series usingplotlabel.plotopt(legend(off))suppresses the legend. . lasso2, plotpath(lambda) plotlabel plotopt(legend(off)) . lasso2, plotpath(norm) plotlabel plotopt(legend(off))Predicted valuesxbhat1is generated by re-estimating the model for lambda=10. Thenoisilyoption triggers the display of the estimation results.xbhat2is generated by linear approximation using the two beta estimates closest to lambda=10. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 . cap drop xbhat1 . predict double xbhat1, xb l(10) noisily . cap drop xbhat2 . predict double xbhat2, xb l(10) approx The model is estimated explicitly using lambda=10. Iflasso2is called with a scalar lambda value, the subsequentpredictcommand requires nolambda()option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) . cap drop xbhat3 . predict double xbhat3, xb All three methods yield the same results. However note that the linear approximation is only exact for the lasso which is piecewise linear. . sum xbhat1 xbhat2 xbhat3 It is also possible to obtain predicted values by referencing a specific lambda ID using thelid()option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 . cap drop xbhat4 . predict double xbhat4, xb lid(21) . cap drop xbhat5 . predict double xbhat5, xb l(25.45473900468241) . sum xbhat4 xbhat5StandardizationBy defaultlasso2standardizes the predictors to have unit variance. Standardization is done by default "on the fly" via penalty loadings. The coefficient estimates are returned in original units. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) Instead of standardizing "on the fly" by setting penalization loadings equal to standardization loadings, we can standardize the regressors prior to estimation with theprestdoption. Both methods are equivalent in theory. Standardizing "on the fly" tends to be faster, but pre-standardization may lead to more stable results in the case of difficult problems. See here for more information. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) . mat list e(Psi) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd . mat list e(Psi) The used penalty loadings are stored ine(Psi). In the first case above, the standardization loadings are returned. In the second case the penalty loadings are equal to one for all regressors. To get the coefficients in standard deviation units,stdcoefcan be specified along with theprestdoption. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd stdcoef We can override any form of standardization with theunitloadingsoptions, which sets the penalty loadings to a vector of 1s. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) unitloadings The same logic applies to the sqrt-lasso (and elastic net). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt prestdPenalty loadings andBy default the penalty loading vector is a vector of standard deviations. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) . mat list e(Psi) We can set the penalty loading for specific predictors to zero, implying no penalization. Unpenalized predictor are always included in the model. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) . mat list e(Psi) We can specify custom penalty loadings. The optionnotpen()ploadingsexpects a row vector of sizepwherepis the number of regressors (excluding the constant, which is partialled out). Because we pre-standardize the data (and we are using the lasso) the results are equivalent to the results above (standardizing on the fly and specifying lcavol as unpenalized). . mat myloadings = (0,1,1,1,1,1,1,1) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) ploadings(myloadings) prestd . mat list e(Psi)Partialling vs penalizationIf lambda and the penalty loadings are kept constant, partialling out and not penalizing of variables yields the same results for the included/penalized regressors. Yamada (2017) shows that the equivalence of partialling out and not penalizing holds for lasso and ridge regression. The examples below suggest that the same result also holds for the elastic net in general and the sqrt-lasso. Note that the equivalence only holds if the regressor matrix and other penalty loadings are the same. Below we use theunitloadingsoption to achieve this; alternatively we could use theploadings(.)option. Lasso. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) partial(lcavol) unitload Sqrt-lasso. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt partial(lcavol) unitload Ridge regression. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) partial(lcavol) unitload Elastic net. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) partial(lcavol) unitload Partialling-out is implemented in Mata using one of Mata's solvers. In cases where the variables to be partialled out are collinear or nearly so, different solvers may generate different results. Users may wish to check the stability of their results in such cases. Thepsolver(.)option can be used to specify the Mata solver used. The default behavior for solving AX=B for X is to use the QR decomposition applied to (A'A) and (A'B), i.e., qrsolve((A'A),(A'B)), abbreviated qrxx. Available options are qr, qrxx, lu, luxx, svd, svdxx, where, e.g., svd indicates using svsolve(A,B) and svdxx indicates using svsolve((A'A),(A'B)).lasso2will warn if collinear variables are dropped when partialling out.Adaptive lassoThe adaptive lasso relies on an initial estimator to calculate the penalty loadings. The penalty loadings are given by 1/abs(beta0(j))^theta, where beta0(j) denotes the initial estimate for predictor j. By default,lasso2uses OLS as the initial estimator as originally suggested by Zou (2006). If the number of parameters exceeds the numbers of observations, univariate OLS is used; see Huang et al. (2008). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive . mat list e(Psi) See the OLS estimates for comparison. . reg lpsa lcavol lweight age lbph svi lcp gleason pgg45 Theta (the exponent for calculating the adaptive loadings) can be changed using theadatheta()option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adat(2) . mat list e(Psi) Other initial estimators such as ridge regression are possible. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) . mat bhat_ridge = e(b) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adaloadings(bhat_ridge) . mat list e(Psi)Replication of glmnet and StataCorp's lassoUse Stata's auto dataset with missing data dropped. The variable price1000 is used to illustrate scaling effects. . sysuse auto, clear . drop if rep78==. . gen double price1000 = price/1000 To load the data into R for comparison with glmnet, use the following commands. The packages haven and tidyr need to be installed.auto <- haven::read_dta("http://www.stata-press.com/data/r9/auto.dta")auto <- tidyr::drop_na()n <- nrow(auto)price <- auto$priceX <- auto[, c("mpg", "rep78", "headroom", "trunk", "weight", "length", "turn", "displacement", "gear_ratio", "foreign")]X$foreign <- as.integer(X$foreign)X <- as.matrix(X)Replication of StataCorp'slassoandelasticnetrequires only the rescaling of lambda by 2N. N=69 so thelasso2lambda becomes 138000/(2*69) = 1000 . lasso2 price mpg-foreign, lambda(138000) . lasso linear price mpg-foreign, grid(1, min(1000)) . lassoselect lambda = 1000 . lassocoef, display(coef, penalized) . lasso2 price mpg-foreign, alpha(0.6) lambda(138000) . elasticnet linear price mpg-foreign, alphas(0.6) grid(1, min(1000)) . lassoselect alpha = 0.6 lambda = 1000 . lassocoef, display(coef, penalized)glmnetuses the same definition of the lasso L0 penalty as StataCorp'slasso, solasso2's default parameterization again requires only rescaling by 2N. When thelglmnetoption is used with thelglmnetoption, the L0 penalty should be provided using the glmnet definition. To estimate in R, load glmnet with library("glmnet") and use the following command:r<-glmnet(X,price,alpha=1,lambda=1000,thresh=1e-15). lasso2 price mpg-foreign, lambda(138000) . lasso2 price mpg-foreign, lambda(1000) lglmnet The R code below usesglmnetto estimate an elastic net model.lasso2with thelglmnetoption will replicate it.r<-glmnet(X,price,alpha=0.6,lambda=1000,thresh=1e-15). lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnetlasso2's default parameterization of the elastic net (like StataCorp'selasticnet) isnotinvariant to scaling: . lasso2 price mpg-foreign, alpha(0.6) lambda(138000) . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(138) Whenlasso2uses the glmnet parameterization of the elastic net via theglmnetoptions, resultsareinvariant to scaling: the only difference is that the coefficients change by the same factor of proportionality as the dependent variable. . lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet The reason that the defaultlasso2/StataCorp parameterization is not invariant to scaling is because the penalty on L2 norm is influenced by scaling, and this in turn affects the relative weights on the L1 and L2 penalties. The example below shows how to reparameterize so that the defaultlasso2parameterization for the elastic net replicates theglmnetparameterization. The example using the scaling above, where the dependent variable is price1000 and theglmnetlambda=1. The large-sample standard deviation of price1000 = 2.8912586. . qui sum price1000 . di r(sd) * 1/sqrt( r(N)/(r(N)-1)) Thelasso2alpha = alpha(lglmnet)*SD(y) / (1-alpha(glmnet) + alpha(glmnet)*SD(y)). In this example, alpha = (0.6*2.8912586)/( 1-0.6 + 0.6*2.89125856) = 0.81262488. . di (0.6*2.8912586)/( 1-0.6 + 0.6*2.8912586) Thelasso2lambda = 2N*lambda(lglmnet) * (alpha(lglmnet) + (1-alpha(lglmnet))/SD(y)). In this example, lambda = 2*69*1 * (0.6 + (1-0.6)/2.8912586) = 101.89203. . di 2*69*( 0.6 + (1-0.6)/2.8912586)lasso2using theglmnetand then replicated using thelasso2/StataCorp parameterization: . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet . lasso2 price1000 mpg-foreign, alpha(.81262488) lambda(101.89203)The set of returned e-class objects depends on whether lambda is a scalar or a list (the default). scalarsSaved resultse(N)sample sizee(cons)=1 if constant is present, 0 otherwisee(fe)=1 if fixed effects model is used, 0 otherwisee(alpha)elastic net parametere(sqrt)=1 if the sqrt-lasso is used, 0 otherwisee(ols)=1 if post-estimation OLS results are returned, 0 otherwisee(adaptive)=1 if adaptive loadings are used, 0 otherwisee(p)number of penalized regressors in modele(notpen_ct)number of unpenalized variablese(partial_ct)number of partialled out regressors (incl constant)e(prestd)=1 if pre-standardizede(lcount)number of lambda values scalars (only if lambda is a list)e(lmax)largest lambda valuee(lmin)smallest lambda value scalars (only if lambda is a scalar)e(lambda)penalty levele(r2)R-sq for lasso estimatione(rmse)root mean squared errore(rmseOLS)root mean squared error of post-estimation OLSe(objfn)minimized objective functione(k)number of selected and unpenalized/partialled-out regressors including constant (if present)e(s)number of selected regressorse(s0)number of selected and unpenalized regressors including constant (if present)e(df)L0 norm ("effective degrees of freedom")e(niter)number of iterationse(maxiter)maximum number of iterationse(tss)total sum of squarese(aicmin)minimum AICe(bicmin)minimum BICe(aiccmin)minimum AICce(ebicmin)minimum EBICe(laic)lambda corresponding to minimum AICe(lbic)lambda corresponding to minimum BICe(laicc)lambda corresponding to minimum AICce(lebic)lambda corresponding to minimum EBIC macrose(cmd)command namee(depvar)name of dependent variablee(varX)all predictorse(varXmodel)penalized predictorse(partial)partialled out predictorse(notpen)unpenalized predictorse(method)estimation method macros (only if lambda is a scalar)e(selected)selected predictorse(selected0)selected predictors excluding constant matricese(Psi)row vector used penalty loadingse(stdvec)row vector of standardization loadings matrices (only if lambda is a list)e(lambdamat)column vector of lambdas used for estimationse(lambdamat0)full initial vector of lambdas (includes lambdas not used for estimation)e(l1norm)column vector of L1 norms for each lambda value (excludes the intercept)e(wl1norm)column vector of weighted L1 norms for each lambda value (excludes the intercept), seewnorme(betas)matrix of estimates, where each row corresponds to one lambda value. The intercept is stored in the last column.e(IC)matrix of information criteria (AIC, AICC, BIC, EEBIC) for each lambda value (NB: All the matrices above but in standardized units are also saved with the same name preceded by "s".)e(dof)column vector of L0 norm for each lambda value (excludes the intercept)e(ess)column vector of explained sum of squares for each lambda valuee(rss)column vector of residual sum of squares for each lambda valuee(rsq)column vector of R-squared for each lambda value matrices (only if lambda is a scalar)e(beta)coefficient vectore(betaOLS)coefficient vector of post-estimation OLSe(betaAll)full coefficient vector including omitted, factor base variables, etc.e(betaAllOLS)full post-estimation OLS coefficient vector including omitted, factor base variables, etc. (NB: All the matrices above but in standardized units are also saved with the same name preceded by "s".)e(b)posted coefficient vector (seepostallanddisplayall). Used for prediction. functionse(sample)estimation sampleAkaike, H. (1974). A new look at the statistical model identification.ReferencesIEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705 Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming.Biometrika98(4), 791–806. https://doi.org/10.1093/biomet/asr043 Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain.Econometrica80(6), 2369–2429. https://doi.org/10.3982/ECTA9626 Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models.Bernoulli, 19(2), 521–547. https://doi.org/10.3150/11-BEJ410 Belloni, A., Chernozhukov, V., & Wang, L. (2014). Pivotal estimation via square-root Lasso in nonparametric regression.TheAnnals of Statistics42(2), 757–788. https://doi.org/10.1214/14-AOS1204 Belloni, A., Chernozhukov, V., Hansen, C., & Kozbur, D. (2016). Inference in High Dimensional Panel Models with an Application to Gun Control.Journal of Business & Economic Statistics34(4), 590–605. Methodology. https://doi.org/10.1080/07350015.2015.1102733 Bühlmann, P., & Meinshausen, N. (2006). High-dimensional graphs and variable selection with the Lasso. {it:The Annals of Statistics], 34(3), 1436–1462. http://doi.org/10.1214/009053606000000281 Bühlmann, P., & Van de Geer, S. (2011). Statistics for High-Dimensional Data. Berlin, Heidelberg: Springer-Verlag. Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces.Biometrika, 95(3), 759–771. https://doi.org/10.1093/biomet/asn034 Correia, S. 2016. FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets. https://ideas.repec.org/c/boc/bocode/s458213.html Fu, W. J. (1998). Penalized Regressions: The Bridge Versus the Lasso.Journal of Computational and Graphical Statistics7(3), 397–416. https://doi.org/10.2307/1390712 Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization.The Annals of AppliedStatistics1(2), 302–332. https://doi.org/10.1214/07-AOAS131 Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent.Journal of Statistical Software33(1), 1–22. https://doi.org/10.18637/jss.v033.i01 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York: Springer-Verlag. https://web.stanford.edu/~hastie/ElemStatLearn/ Hastie, T., Tibshirani, R., & Wainwright, M. J. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: CRC Press, Taylor & Francis. https://www.stanford.edu/~hastie/StatLearnSparsity/ Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems.Technometrics12(1), 69–82. https://doi.org/10.1080/00401706.1970.10488635 Huang, J., Ma, S., & Zhang, C.-H. (2008). Adaptive Lasso for Sparse High-Dimensional Regression Models Supplement.Statistica Sinica18, 1603–1618. https://doi.org/10.2307/24308572 Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples.Biometrika, 76(2), 297–307. http://doi.org/10.1093/biomet/76.2.297 Schwarz, G. (1978). Estimating the Dimension of a Model.The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136 Stamey, T. A., Kabalin, J. N., Mcneal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang, N. (1989). Prostate Specific Antigen in the Diagnosis and Treatment of Adenocarcinoma of the Prostate. II. Radical Prostatectomy Treated Patients.The Journal of Urology141(5), 1076–1083. https://doi.org/10.1016/S0022-5347(17)41175-X Sugiura, N. (1978). Further analysts of the data by akaike’ s information criterion and the finite corrections.Communications in Statistics - Theory and Methods, 7(1), 13–26. http://doi.org/10.1080/03610927808827599 Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.Journal of the Royal Statistical Society. Series B(Methodological)58(1), 267–288. https://doi.org/10.2307/2346178 Van der Kooij A (2007). Prediction Accuracy and Stability of Regrsssion with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. http://hdl.handle.net/1887/12096 Yamada, H. (2017). The Frisch–Waugh–Lovell theorem for the lasso and the ridge regression.Communications in Statistics -Theory and Methods46(21), 10897–10902. https://doi.org/10.1080/03610926.2016.1252403 Zhang, Y., Li, R., & Tsai, C.-L. (2010). Regularization Parameter Selections via Generalized Information Criterion.Journalof the American Statistical Association, 105(489), 312–323. http://doi.org/10.1198/jasa.2009.tm08013 Zhao, P., & Yu, B. (2006). On Model Selection Consistency of Lasso.Journal of Machine Learning Research, 7, 2541–2563. http://dl.acm.org/citation.cfm?id=1248547.1248637 Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.Journal of the Royal StatisticalSociety. Series B: Statistical Methodology67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties.Journal of the American Statistical Association101(476), 1418–1429. https://doi.org/10.1198/016214506000000735 Zou, H., Hastie, T., & Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist., 35(5), 2173–2192. https://doi.org/10.1214/009053607000000127Please check our website https://statalasso.github.io/ for more information.WebsiteInstallationlasso2is part of thelassopackpackage. To get the latest stable version oflassopackfrom our website, check the installation instructions at https://statalasso.github.io/installation/. We update the stable website version more frequently than the SSC version. Earlier versions of lassopack are also available from the website. To verify thatlassopackis correctly installed, click on or type whichpkg lassopack (which requireswhichpkgto be installed; ssc install whichpkg).Thanks to Alexandre Belloni, who provided Matlab code for the square-root lasso estimator, Sergio Correia for advice on the use of the FTOOLS package, and Jan Ditzen.AcknowledgementsCitation of lasso2lasso2is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it as such: Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020). LASSOPACK: Stata module for lasso, square-root lasso, elastic net, ridge, adaptive lasso estimation and cross-validation http://ideas.repec.org/c/boc/bocode/s458458.html Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in Stata.The Stata Journal, 20(1):176-235. https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper version: https://arxiv.org/abs/1901.05397.Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E. Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.ukAuthorsHelp:Also seecvlasso,rlasso,lassologit,ivlasso,pdslasso(if installed).

**help lasso2**