----------------------------------------------------------------------------------------------------------------------------------
help lasso2 lassopack v1.4.2
----------------------------------------------------------------------------------------------------------------------------------
Title
lasso2 -- Program for lasso, square-root lasso, elastic net, ridge, adaptive lasso and post-estimation OLS
Syntax
Full syntax
lasso2 depvar regressors [if exp] [in range] [, alpha(real) sqrt adaptive adaloadings(string) adatheta(real) ols
lambda(numlist) lcount(integer) lminratio(real) lmax(real) lglmnet notpen(varlist) partial(varlist)
psolver(string) norecover ploadings(string) unitloadings prestd stdcoef fe noftools noconstant tolopt(real)
tolzero(real) maxiter(int) plotpath(method) plotvar(varlist) plotopt(string) plotlabel ic(string) lic(string)
ebicgamma(real) noic long displayall postall postresults verbose vverbose wnorm]
Note: the fe option will take advantage of the ftools package (if installed) for the fixed-effects transformation; the
speed gains using this package can be large. See help ftools or click on ssc install ftools to install.
Estimators Description
----------------------------------------------------------------------------------------------------------------------------
alpha(real) elastic net parameter, which controls the degree of L1-norm (lasso-type) to L2-norm (ridge-type)
penalization. alpha=1 corresponds to the lasso (the default estimator), and alpha=0 to ridge
regression. alpha must be in the interval [0,1].
sqrt square-root lasso estimator.
adaptive adaptive lasso estimator. The penalty loading for predictor j is set to 1/abs(beta0(j))^theta where
beta0(j) is the OLS estimate or univariate OLS estimate if p>n. Theta is the adaptive exponent, and
can be controlled using the adatheta(real) option.
adaloadings(string) alternative initial estimates, beta0, used for calculating adaptive loadings. For example, this
could be the vector e(b) from an initial lasso2 estimation. The elements of the vector are raised
to the power -theta (note the minus). See adaptive option.
adatheta(real) exponent for calculating adaptive penalty loadings. See adaptive option. Default=1.
ols post-estimation OLS. If lambda is a list, post-estimation OLS results are displayed and returned in
e(betas). If lambda is a scalar, post-estimation OLS is always displayed, and this option controls
whether standard or post-estimation OLS results are stored in e(b).
----------------------------------------------------------------------------------------------------------------------------
See overview of estimation methods.
Lambda(s) Description
----------------------------------------------------------------------------------------------------------------------------
lambda(numlist) a scalar lambda value or list of descending lambda values. Each lambda value must be greater than 0.
If not specified, the default list is used which is given by
exp(rangen(log(lmax),log(lminratio*lmax),lcount)) (see mf_range).
lcount(integer)† number of lambda values for which the solution is obtained. Default is 100.
lminratio(real)† ratio of minimum to maximum lambda. lminratio must be between 0 and 1. Default is 1/1000.
lmax(real)† maximum lambda value. Default is 2*max(X'y), and max(X'y) in the case of the square-root lasso (where
X is the pre-standardized regressor matrix and y is the vector of the response variable).
fdev minimum fractional change in deviance (R-sq) to stop looping through lambdas (path)
devmax maximum fraction of explained deviance (R-sq) to stop looping through lambdas (path)
nodevcrit override criteria to exit path; loop through all lambdas in list
lic(string) after first lasso2 estimation using list of lambdas, estimate model corresponding to minimum
information criterion. 'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower
case spelling. See Information criteria for the definition of each information criterion.
ebicgamma(real) controls the xi parameter of the EBIC. xi needs to lie in the [0,1] interval. xi=0 is equivalent to
the BIC. The default choice is xi=1-log(n)/(2*log(p)).
postresults Used in combination with lic(). Stores estimation results of the model selected by information
criterion in e().
lglmnet Use the parameterizations for lambda, alpha, standardization, etc. employed by glmnet by Friedman et
al. (2010).
----------------------------------------------------------------------------------------------------------------------------
† Not applicable if lambda() is specified.
Loadings & standardization Description
----------------------------------------------------------------------------------------------------------------------------
notpen(varlist) sets penalty loadings to zero for predictors in varlist. Unpenalized predictors are always included
in the model.
partial(varlist) variables in varlist are partialled out prior to estimation.
psolver(string) override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol;
default=qrxx)
norecover suppresses recovery of partialled out variables after estimation.
ploadings(matrix) a row-vector of penalty loadings; overrides the default standardization loadings (in the case of the
lasso, =sqrt(avg(x^2))). The size of the vector should equal the number of predictors (excluding
partialled out variables and excluding the constant).
unitloadings penalty loadings set to a vector of ones; overrides the default standardization loadings (in the case
of the lasso, =sqrt(avg(x^2))). nostd is a synonym for unitloadings.
prestd dependent variable and predictors are standardized prior to estimation rather than standardized "on
the fly" using penalty loadings. See here for more details. By default the coefficient estimates
are un-standardized (i.e., returned in original units).
stdcoef return coefficients in standard deviation units, i.e., don't un-standardize.
stdall return all results (coefficients, information criteria, norms, etc.) in standardized units.
----------------------------------------------------------------------------------------------------------------------------
See discussion of standardization.
FE & constant Description
----------------------------------------------------------------------------------------------------------------------------
fe within-transformation is applied prior to estimation. Requires data to be xtset.
noftools do not use ftools package for fixed-effects transform (slower; rarely used)
noconstant suppress constant from estimation. Default behaviour is to partial the constant out (i.e., to center
the regressors).
----------------------------------------------------------------------------------------------------------------------------
Optimization Description
----------------------------------------------------------------------------------------------------------------------------
tolopt(real) tolerance for lasso shooting algorithm (default=1e-10)
tolzero(real) minimum below which coeffs are rounded down to zero (default=1e-4)
maxiter(int) maximum number of iterations for the lasso shooting algorithm (default=10,000)
----------------------------------------------------------------------------------------------------------------------------
Plotting options* Description
----------------------------------------------------------------------------------------------------------------------------
plotpath(method) plots the coefficients path as a function of the L1-norm (norm), lambda (lambda) or the log of lambda
(lnlambda)
plotvar(varlist) list of variables to be included in the plot
plotopt(string) additional plotting options passed on to line. For example, use plotopt(legend(off)) to turn off the
legend.
plotlabel displays variable labels in graph.
----------------------------------------------------------------------------------------------------------------------------
* Plotting is not available if lambda is a scalar value.
Display options Description
----------------------------------------------------------------------------------------------------------------------------
displayall* display full coefficient vectors including unselected variables (default: display only selected,
unpenalized and partialled-out)
postall* post full coefficient vector including unselected variables in e(b) (default: e(b) has only selected,
unpenalized and partialled-out)
long† show long output; instead of showing only the points at which predictors enter or leave the model,
all models are shown.
verbose show additional output
vverbose show even more output
ic(string)† controls which information criterion is shown in the output. 'aic', 'bic', 'aicc', and 'ebic' (the
default) are allowed. Note the lower case spelling. See Information criteria for the definition
of each information criterion.
noic† suppresses the calculation of information criteria. This will lead to speed gains if alpha<1, since
calculation of effective degrees of freedom requires one inversion per lambda.
wnorm† displays L1 norm of beta estimates weighted by penalty loadings, i.e., ||Psi*beta||(1) instead of
||beta||(1), which is the default. Note that this also affects plotting if plotpath(norm) is
specified.
----------------------------------------------------------------------------------------------------------------------------
* Only applicable if lambda is a scalar value. † Only applicable if lambda is a list (the default).
Replay syntax
lasso2 [, plotpath(method) plotvar(varlist) plotopt(string) plotlabel long postresults lic(string) ic(string) wnorm]
Replay options Description
----------------------------------------------------------------------------------------------------------------------------
long show long output; instead of showing only the points at which predictors enter or leave the model,
all models are shown.
ic(string) controls which information criterion is shown in the output. 'aic', 'bic', 'aicc', and 'ebic' (the
default) are allowed. Note the lower case spelling. See Information criteria for the definition of
each information criterion.
lic(string) estimate model corresponding to minimum information criterion. 'aic', 'bic', 'aicc', and 'ebic' (the
default) are allowed. Note the lower case spelling. See Information criteria for the definition of
each information criterion.
postresults store estimation results in e() if lic(string) is used
plotpath(method) see Plotting options above
plotvar(varlist) see Plotting options above
plotopt(string) see Plotting options above
plotlabel see Plotting options above
----------------------------------------------------------------------------------------------------------------------------
Only applicable if lambda was a list in the previous lasso2 estimation.
Postestimation:
predict [type] newvar [if] [in] [, xb residuals u e ue xbu ols lambda(real) lid(int) approx noisily postresults]
Predict options Description
----------------------------------------------------------------------------------------------------------------------------
xb compute predicted values (the default)
residuals compute residuals
e generate overall error component e(it). Only after fe.
ue generate combined residuals, i.e., u(i) + e(it). Only after fe.
xbu prediction including fixed effect, i.e., a + xb + u(i). Only after fe.
u fixed effect, i.e., u(i). Only after fe.
ols use post-estimation OLS for prediction
lambda(real)‡ lambda value for prediction. Ignored if lasso2 was called with scalar lambda value.
lid(int)‡ index of lambda value for prediction.
lic(string) selects which information criterion to use for prediction.
approx‡ linear approximation is used instead of re-estimation. Faster, but only exact if coefficient path is
piecewise linear. Only supported in combination with lambda().
noisily displays beta used for prediction.
postresults‡ store estimation results in e() if re-estimation is used
----------------------------------------------------------------------------------------------------------------------------
‡ Only applicable if lambda was a list in the previous lasso2 estimation.
lasso2 may be used with time-series or panel data, in which case the data must be tsset or xtset first; see help tsset or
xtset.
All varlists may contain time-series operators or factor variables; see help varlist.
Contents
Description
Coordinate descent algorithm
Penalization level
Standardization of variables
Information criteria
Estimators
lasso2 vs. Friedman et al.'s glmnet and StataCorp's lasso
Examples and demonstration
--Data set
--General demonstration
--Information criteria
--Plotting
--Predicted values
--Standardization
--Penalty loadings and notpen()
--Partialling vs penalization
--Adaptive lasso
--Replication of glmnet and StataCorp's lasso
Saved results
References
Website
Installation
Acknowledgements
Citation of lassopack
Description
lasso2 solves the following problem
1/N*RSS + lambda/N*alpha*||Psi*beta||[1] + lambda/(2*N)*(1-alpha)*||Psi*beta||[2]^2,
where
RSS = sum(y(i)-x(i)'beta)^2 denotes the residual sum of squares,
beta is a p-dimensional parameter vector,
lambda is the overall penalty level,
||.||[j] denotes the L(j) vector norm for j=1,2;
alpha the elastic net parameter, which determines the relative contribution of L1 (lasso-type) to L2 (ridge-type)
penalization.
Psi is a p by p diagonal matrix of predictor-specific penalty loadings. Note that lasso2 treats Psi as a row vector.
N number of observations
Note: the above lambda and alpha differ from the definitions used in parts of the lasso and elastic net literature, e.g.,
the R package glmnet by Friedman et al. (2010). We have here adopted an objective function following Belloni et al. (2012).
See the discussion and examples below for how to use the lglmnet option to replicate glmnet output.
In addition, if the option sqrt is specified, lasso2 estimates the square-root lasso (sqrt-lasso) estimator, which is
defined as the solution to the following objective function:
sqrt(1/N*RSS) + lambda/N*||Psi*beta||[1].
Coordinate descent algorithm
lasso2 implements the elastic net and sqrt-lasso using coordinate descent algorithms. The algorithm (then referred to as
"shooting") was first proposed by Fu (1998) for the lasso, and by Van der Kooij (2007) for the elastic net. Belloni et al.
(2011) implement the coordinate descent for the sqrt-lasso, and have kindly provided Matlab code.
Coordinate descent algorithms repeatedly cycle over predictors j=1,...,p and update single coefficient estimates until
convergence. Suppose the predictors are centered and standardized to have unit variance. In that case, the update for
coefficient j is obtained using univariate regression of the current partial residuals (i.e., excluding the contribution of
predictor j) against predictor j.
The algorithm requires an initial beta estimate for which the Ridge estimate is used. If the coefficient path is obtained
for a list of lambda values, lasso2 starts from the largest lambda value and uses previous estimates as warm starts.
See Friedman et al. (2007, 2010), and references therein, for further information.
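To make the update step concrete, the Mata function below sketches one possible implementation of the shooting algorithm for
the lasso with unit penalty loadings and centered, standardized predictors. The function name cd_lasso and its interface are
hypothetical illustrations; this is not lasso2's internal code, which in addition handles penalty loadings, the elastic net
and sqrt-lasso, the constant, and warm starts along the lambda path.
        mata:
        // Sketch of the shooting (coordinate descent) algorithm for the lasso
        // with unit penalty loadings and standardized predictors (illustration
        // only; not lasso2's internal implementation).
        real colvector cd_lasso(real matrix X, real colvector y,
                                real scalar lam, real scalar tol, real scalar maxit)
        {
            real scalar    n, p, j, it, bold, rho, change
            real colvector b, r

            n = rows(X); p = cols(X)
            b = J(p, 1, 0)
            r = y                                     // residuals at b = 0
            for (it = 1; it <= maxit; it++) {
                change = 0
                for (j = 1; j <= p; j++) {
                    bold = b[j]
                    // correlation of predictor j with the current partial residuals
                    rho = quadcross(X[., j], r)/n + bold
                    // soft-thresholding update; the threshold lam/(2n) matches the
                    // objective 1/N*RSS + lambda/N*||beta||_1 when mean(x_j^2) = 1
                    b[j] = sign(rho)*max((abs(rho) - lam/(2*n), 0))
                    r = r - X[., j]*(b[j] - bold)     // update residuals
                    change = max((change, abs(b[j] - bold)))
                }
                if (change < tol) break               // converged
            }
            return(b)
        }
        end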
Penalization level: choice of lambda (and alpha)
Penalized regression methods, such as the elastic net and the sqrt-lasso, rely on tuning parameters that control the degree
and type of penalization. The estimation methods implemented in lasso2 use two tuning parameters: lambda, which controls
the general degree of penalization, and alpha, which determines the relative contribution of L1-type to L2-type
penalization. lasso2 obtains elastic net and sqrt-lasso solutions for a given lambda value or a list of lambda values, and
for a given alpha value (default=1).
lassopack offers three approaches for selecting the "optimal" lambda (and alpha) value:
(1) The penalty level may be chosen by cross-validation in order to optimize out-of-sample prediction performance. K-fold
cross-validation and rolling cross-validation (for panel and time-series data) are implemented in cvlasso. cvlasso also
supports cross-validation across alpha.
(2) Theoretically justified and feasible penalty levels and loadings are available for the lasso and sqrt-lasso via the
separate command rlasso. The penalization is chosen to dominate the noise of the data-generating process (represented by
the score vector), which allows derivation of theoretical results with regard to consistent prediction and parameter
estimation. Since the error variance is in practice unknown, Belloni et al. (2012) introduce the rigorous (or feasible)
lasso that relies on an iterative algorithm for estimating the optimal penalization and is valid in the presence of
non-Gaussian and heteroskedastic errors. Belloni et al. (2016) extend the framework to the panel data setting. In the case
of the sqrt-lasso under homoskedasticity, the optimal penalty level is independent of the unknown error variance, leading to
a practical advantage and better performance in finite samples (see Belloni et al., 2011, 2014). See help rlasso for more
details.
(3) Lambda can also be selected using information criteria. lasso2 calculates four information criteria: Akaike
Information Criterion (AIC; Akaike, 1974), Bayesian Information Criterion (BIC; Schwarz, 1978), Extended Bayesian
information criterion (EBIC; Chen & Chen, 2008) and the corrected AIC (AICc; Sugiura, 1978; Hurvich & Tsai, 1989). By default,
lasso2 displays EBIC in the output, but all four information criteria are stored in e(aic), e(bic), e(ebic) and e(aicc).
See section Information criteria for more information.
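For concreteness, the three routes can be invoked as follows using the prostate data from the examples below (a sketch only;
see help cvlasso and help rlasso for their full syntax, including the lopt option of cvlasso, which re-estimates the model at
the lambda that minimizes the estimated mean-squared prediction error):
        (1) Cross-validation:
        . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lopt
        (2) Rigorous (theory-driven) penalization:
        . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45
        (3) Information criteria (covered in this help file):
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(ebic)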
Standardization of variables
Standard practice is for predictors to be "standardized", i.e., normalized to have mean zero and unit variance. By default
lasso2 achieves this by incorporating the standardization into the penalty loadings. We refer to this method as
standardization "on the fly", as standardization occurs during rather than before estimation. Alternatively, the option
prestd causes the predictors to be standardized prior to the estimation.
Standardizing "on the fly" via the penalty loadings and pre-standardizing the data prior to estimation are theoretically
equivalent. The default standardizing "on the fly" is often faster. The prestd option can lead to improved numerical
precision or more stable results in the case of difficult problems; the cost is the computation time required to
pre-standardize the data.
Estimators
Ridge regression (Hoerl & Kennard, 1970)
The ridge estimator can be written as
betahat(ridge) = (X'X+lambda*I(p))^(-1)X'y.
Thus, even if X'X is not full rank (e.g. because p>n), the problem becomes nonsingular by adding a constant to the diagonal
of X'X. Another advantage of the ridge estimator over least squares stems from the variance-bias trade-off. Ridge
regression may improve over ordinary least squares by inducing a mild bias while decreasing the variance. For this reason,
ridge regression is a popular method in the context of multicollinearity. In contrast to estimators relying on
L1-penalization, the ridge does not yield sparse solutions and keeps all predictors in the model.
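As an illustration of this closed form, the short Mata sketch below applies it to simulated data (illustration only; it
ignores centering, standardization and penalty loadings, and is not how lasso2 computes its estimates):
        mata:
        X = rnormal(50, 3, 0, 1)                    // 50 observations, 3 predictors
        y = X*(1 \ 0 \ -1) + rnormal(50, 1, 0, 1)   // true beta = (1, 0, -1)
        lam = 10
        // betahat(ridge) = (X'X + lambda*I(p))^(-1) X'y
        b_ridge = invsym(quadcross(X, X) + lam*I(cols(X)))*quadcross(X, y)
        b_ridge
        end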
Lasso estimator (Tibshirani, 1996)
The lasso minimizes the residual sum of squares subject to a constraint on the absolute size of coefficient estimates.
Tibshirani (1996) motivates the lasso with two major advantages over least squares. First, due to the nature of the
L1-penalty, the lasso tends to produce sparse solutions and thus facilitates model interpretation. Secondly, similar to
ridge regression, lasso can outperform least squares in terms of prediction due to lower variance. Another advantage is
that the lasso is computationally attractive due to its convex form. This is in contrast to model selection based on AIC or
BIC (which employ L0 penalization) where each possible sub-model has to be estimated.
Elastic net (Zou & Hastie, 2005)
The elastic net applies a mix of L1 (lasso-type) and L2 (ridge-type) penalization. It combines some of the strengths of
lasso and ridge regression. In the presence of groups of correlated regressors, the lasso selects typically only one
variable from each group, whereas the ridge tends to produce similar coefficient estimates for groups of correlated
variables. On the other hand, the ridge does not yield sparse solutions impeding model interpretation. The elastic net is
able to produce sparse solutions (for some alpha greater than zero) and retains (or drops) correlated variables jointly.
Adaptive lasso (Zou, 2006)
The lasso is only variable selection consistent under the rather strong "irrepresentable condition", which imposes
constraints on the degree of correlation between predictors in the true model and predictors outside of the model (see Zhao
& Yu, 2006; Meinshausen & Bühlmann, 2006). Zou (2006) proposes the adaptive lasso which uses penalty loadings of
1/abs(beta0(j))^theta where beta0 is an initial estimator. The adaptive lasso is variable-selection consistent for fixed p
under weaker assumptions than the standard lasso. If p<n, OLS can be used as the initial estimator. Huang et al. (2008)
suggest to use univariate OLS if p>n. Other initial estimators are possible.
Square-root lasso (Belloni et al., 2011, 2014)
The sqrt-lasso is a modification of the lasso that minimizes (RSS)^(1/2) instead of RSS, while also imposing an L1-penalty.
The main advantage of the sqrt-lasso over the standard lasso is that the theoretically grounded, data-driven optimal lambda
is independent of the unknown error variance under homoskedasticity. See rlasso.
Post-estimation OLS
Penalized regression methods induce a bias that can be alleviated by post-estimation OLS, which applies OLS to the
predictors selected by the first-stage variable selection method. For the case of the lasso, Belloni and Chernozhukov (
2013) have shown that the post-lasso OLS performs at least as well as the lasso under mild additional assumptions.
For further information on the lasso and related methods, see for example the textbooks by Hastie et al. (2009, 2015; both
available for free) and Bühlmann & Van de Geer (2011).
Information criteria
The information criteria supported by lasso2 are the Akaike information criterion (AIC, Akaike, 1974), the Bayesian
information criterion (BIC, Schwarz, 1978), the corrected AIC (Sugiura, 1978; Hurvich & Tsai, 1989), and the Extended BIC (Chen &
Chen, 2008). These are given by (omitting dependence on lambda and alpha):
AIC = N*log(RSS/N) + 2*df
BIC = N*log(RSS/N) + df*log(N)
AICc = N*log(RSS/N) + 2*df*N/(N-df)
EBIC = BIC + 2*xi*df*log(p)
where RSS(lambda,alpha) is the residual sum of squares and df(lambda,alpha) is the effective degrees of freedom, which is a
measure of model complexity. In the linear regression model, the degrees of freedom is simply the number of regressors.
Zou et al. (2007) show that the number of non-zero coefficients is an unbiased and consistent estimator of df(lambda,alpha)
for the lasso. More generally, the degrees of freedom of the elastic net can be calculated as the trace of the projection
matrix. With an unbiased estimator for df available, the above information criteria can be employed to select tuning
parameters.
The BIC is known to be model selection consistent if the true model is among the candidate models, whereas the AIC tends to
yield an overfitted model. On the other hand, the AIC is loss efficient in the sense that it selects the model that
minimizes the squared average prediction error, while the BIC does not possess this property. Zhang et al. (2010) show that
these principles also apply when AIC and BIC are employed to select the tuning parameter for penalized regression.
Both AIC and BIC tend to overselect regressors in the small-N-large-p case. The AICc corrects the small sample bias of the
AIC, which can be especially severe in the high-dimensional context. Similarly, the EBIC addresses the shortcomings of the
BIC when p is large by imposing a larger penalty on the number of coefficients. Chen & Chen (2008) show that the EBIC
performs better in terms of false discovery rate at the cost of a negligible reduction in the positive selection rate.
The EBIC depends on an additional parameter, xi (denoted as gamma in the original article), which can be controlled using
ebicgamma(real). xi=0 is equivalent to the BIC. We follow Chen & Chen (2008, p. 768) and use xi=1-log(n)/(2*log(p)) as
the default choice. An upper and lower threshold is applied to ensure that xi lies in the [0,1] interval.
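For example, with the prostate data used in the examples below (n=97 observations and p=8 predictors), the default formula
gives a value of roughly -0.1, so xi is set to the lower bound of 0 and the EBIC coincides with the BIC:
        . di 1 - ln(97)/(2*ln(8))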
The EBIC is displayed in the output of lasso2 by default (if lambda is a list), but all four information criteria are
returned in e(). The lambda values that minimize the information criteria for a given alpha are returned in e(laic),
e(lbic), e(laicc) and e(lebic), respectively. To change the default display, use the ic(string) option. noic suppresses
the calculation of information criteria, which leads to a speed gain if alpha<1.
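For example, after estimating the path over the default lambda list, the minimizing lambda values can be inspected directly
and the model selected by the EBIC re-estimated (a short sketch; the full demonstration is in the examples below):
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
        . di e(laic) "  " e(lebic)
        . lasso2, lic(ebic) postresults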
lasso2 vs. Friedman et al.'s (2010) glmnet and StataCorp's lasso
The parameterization used by lasso2 differs from StataCorp's lasso in only one respect: lambda(StataCorp) =
(1/2N)*lambda(lasso2). The elastic net parameter alpha is the same in both parameterizations. See below for examples.
The parameterization used by Friedman et al.'s (2010) glmnet uses the same convention as StataCorp for lambda: lambda(glmnet)
= (1/2N)*lambda(lasso2). However, the glmnet treatment of the elastic net parameter alpha differs from both lasso2 and
StataCorp's lasso. The glmnet objective function is defined such that the dependent variable is assumed already to have
been standardized. Because the L2 norm is nonlinear, this affects the interpretation of alpha. Specifically, the default
lasso2 and StataCorp's lasso parameterization means that alpha is not invariant to changes in the scale of the dependent
variable. The glmnet parameterization of alpha, however, is scale-invariant - a useful feature.
lasso2 provides an lglmnet option that enables the user to employ the glmnet parameterization for alpha and lambda. See
below for examples of its usage and how to replicate glmnet output. We recommend the use of the lglmnet option in
particular with cross-validation over alpha (see cvlasso).
Example using prostate cancer data (Stamey et al., 1989)
Data set
The data set is available through Hastie et al. (2009) on the authors' website. The following variables are included in the
data set of 97 men:
Predictors
lcavol log(cancer volume)
lweight log(prostate weight)
age patient age
lbph log(benign prostatic hyperplasia amount)
svi seminal vesicle invasion
lcp log(capsular penetration)
gleason Gleason score
pgg45 percentage Gleason scores 4 or 5
Outcome
lpsa log(prostate specific antigen)
Load prostate cancer data.
. insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear tab
General demonstration
Estimate coefficient lasso path over (default) list of lambda values.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
The replay syntax can be used to re-display estimation results.
. lasso2
User-specified lambda list.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10)
The list of returned e() objects depends on whether lambda() is a list (the default) or a scalar value. For example, if
lambda is a scalar, one vector of coefficient estimates is returned. If lambda is a list, the whole coefficient path for a
range of lambda values is obtained. The last row of e(betas) is equal to the row vector e(b).
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10)
. ereturn list
. mat list e(betas)
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
. ereturn list
. mat list e(b)
Sqrt-lasso.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt
Ridge regression. All predictors are included in the model.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0)
Elastic net with alpha=0.1. Even though alpha is close to zero (Ridge regression), the elastic net can produce sparse
solutions.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1)
The option ols triggers the use of post-estimation OLS. OLS alleviates the shrinkage bias induced by L1 and L2 norm
penalization.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ols
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt ols
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) ols
Information criteria
lasso2 calculates four information criteria: AIC, BIC, EBIC and AICc. The EBIC is shown by default in the output along with
the R-squared.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
To see another information criterion in the output, use the ic(string) option where string can be replaced by aic, bic, ebic
or aicc (note the lower case spelling). For example, to display AIC:
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ic(aic)
In fact, there is no need to re-run the full model. We can make use of the replay syntax:
. lasso2, ic(aic)
The long option triggers extended output; instead of showing only the points at which predictors enter or leave the model,
all models are shown. An asterisk marks the model (i.e., the value of lambda) that minimizes the information criterion
(here, AIC).
. lasso2, ic(aic) long
To estimate the model corresponding to the minimum information criterion, click on the link at the bottom of the output or
type one of the following:
. lasso2, lic(aic)
. lasso2, lic(ebic)
. lasso2, lic(bic)
. lasso2, lic(aicc)
To store the estimation results of the selected model, add the postresults option.
. lasso2, lic(ebic)
. ereturn list
. lasso2, lic(ebic) postres
. ereturn list
The same can also be achieved in one line without using the replay syntax. lasso2 first obtains the full coefficient path
for a list of lambda values, and then runs the model selected by AIC. Again, postresults can be used to store results of the
selected model.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(aic) postres
Plotting
Plot coefficients against lambda: As lambda increases, the coefficient estimates are shrunk towards zero. Lambda=0
corresponds to OLS and if lambda is sufficiently large the model is empty.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(lambda)
Plot coefficients against L1 norm.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(norm)
The replay syntax can also be used for plotting.
. lasso2, plotpath(norm)
Only selected variables are plotted.
. lasso2, plotpath(norm) plotvar(lcavol svi)
The variable names can be displayed directly next to each series using plotlabel. plotopt(legend(off)) suppresses the
legend.
. lasso2, plotpath(lambda) plotlabel plotopt(legend(off))
. lasso2, plotpath(norm) plotlabel plotopt(legend(off))
Predicted values
xbhat1 is generated by re-estimating the model for lambda=10. The noisily option triggers the display of the estimation
results. xbhat2 is generated by linear approximation using the two beta estimates closest to lambda=10.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
. cap drop xbhat1
. predict double xbhat1, xb l(10) noisily
. cap drop xbhat2
. predict double xbhat2, xb l(10) approx
The model is estimated explicitly using lambda=10. If lasso2 is called with a scalar lambda value, the subsequent predict
command requires no lambda() option.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10)
. cap drop xbhat3
. predict double xbhat3, xb
All three methods yield the same results. However note that the linear approximation is only exact for the lasso which is
piecewise linear.
. sum xbhat1 xbhat2 xbhat3
It is also possible to obtain predicted values by referencing a specific lambda ID using the lid() option.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45
. cap drop xbhat4
. predict double xbhat4, xb lid(21)
. cap drop xbhat5
. predict double xbhat5, xb l(25.45473900468241)
. sum xbhat4 xbhat5
Standardization
By default lasso2 standardizes the predictors to have unit variance. Standardization is done by default "on the fly" via
penalty loadings. The coefficient estimates are returned in original units.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
Instead of standardizing "on the fly" by setting penalization loadings equal to standardization loadings, we can standardize
the regressors prior to estimation with the prestd option. Both methods are equivalent in theory. Standardizing "on the
fly" tends to be faster, but pre-standardization may lead to more stable results in the case of difficult problems. See
here for more information.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
. mat list e(Psi)
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd
. mat list e(Psi)
The penalty loadings used are stored in e(Psi). In the first case above, the standardization loadings are returned. In the
second case the penalty loadings are equal to one for all regressors.
To get the coefficients in standard deviation units, stdcoef can be specified along with the prestd option.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd stdcoef
We can override any form of standardization with the unitloadings option, which sets the penalty loadings to a vector of
1s.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) unitloadings
The same logic applies to the sqrt-lasso (and elastic net).
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt prestd
Penalty loadings and notpen()
By default the penalty loading vector is a vector of standard deviations.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10)
. mat list e(Psi)
We can set the penalty loading for specific predictors to zero, implying no penalization. Unpenalized predictors are always
included in the model.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol)
. mat list e(Psi)
We can specify custom penalty loadings. The option ploadings expects a row vector of size p where p is the number of
regressors (excluding the constant, which is partialled out). Because we pre-standardize the data (and we are using the
lasso) the results are equivalent to the results above (standardizing on the fly and specifying lcavol as unpenalized).
. mat myloadings = (0,1,1,1,1,1,1,1)
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) ploadings(myloadings) prestd
. mat list e(Psi)
Partialling vs penalization
If lambda and the penalty loadings are kept constant, partialling out variables and not penalizing them yield the same
results for the included/penalized regressors. Yamada (2017) shows that the equivalence of partialling out and not
penalizing holds for lasso and ridge regression. The examples below suggest that the same result also holds for the elastic
net in general and the sqrt-lasso. Note that the equivalence only holds if the regressor matrix and other penalty loadings
are the same. Below we use the unitloadings option to achieve this; alternatively we could use the ploadings(.) option.
Lasso.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) unitload
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) partial(lcavol) unitload
Sqrt-lasso.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt notpen(lcavol) unitload
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt partial(lcavol) unitload
Ridge regression.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) notpen(lcavol) unitload
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) partial(lcavol) unitload
Elastic net.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) notpen(lcavol) unitload
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) partial(lcavol) unitload
Partialling-out is implemented in Mata using one of Mata's solvers. In cases where the variables to be partialled out are
collinear or nearly so, different solvers may generate different results. Users may wish to check the stability of their
results in such cases. The psolver(.) option can be used to specify the Mata solver used. The default behavior for solving
AX=B for X is to use the QR decomposition applied to (A'A) and (A'B), i.e., qrsolve((A'A),(A'B)), abbreviated qrxx.
Available options are qr, qrxx, lu, luxx, svd, svdxx, where, e.g., svd indicates using svsolve(A,B) and svdxx indicates
using svsolve((A'A),(A'B)). lasso2 will warn if collinear variables are dropped when partialling out.
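For example, to partial out lcavol using the SVD-based solver instead of the default qrxx (a sketch using the prostate data
from the examples above):
        . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) partial(lcavol) psolver(svd)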
Adaptive lasso
The adaptive lasso relies on an initial estimator to calculate the penalty loadings. The penalty loadings are given by
1/abs(beta0(j))^theta, where beta0(j) denotes the initial estimate for predictor j. By default, lasso2 uses OLS as the
initial estimator as originally suggested by Zou (2006). If the number of parameters exceeds the numbers of observations,
univariate OLS is used; see Huang et al. (2008).
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive
. mat list e(Psi)
See the OLS estimates for comparison.
. reg lpsa lcavol lweight age lbph svi lcp gleason pgg45
Theta (the exponent for calculating the adaptive loadings) can be changed using the adatheta() option.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adat(2)
. mat list e(Psi)
Other initial estimators such as ridge regression are possible.
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0)
. mat bhat_ridge = e(b)
. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adaloadings(bhat_ridge)
. mat list e(Psi)
Replication of glmnet and StataCorp's lasso
Use Stata's auto dataset with missing data dropped. The variable price1000 is used to illustrate scaling effects.
. sysuse auto, clear
. drop if rep78==.
. gen double price1000 = price/1000
To load the data into R for comparison with glmnet, use the following commands. The packages haven and tidyr need to be
installed.
auto <- haven::read_dta("http://www.stata-press.com/data/r9/auto.dta")
auto <- tidyr::drop_na(auto)
n <- nrow(auto)
price <- auto$price
X <- auto[, c("mpg", "rep78", "headroom", "trunk", "weight", "length", "turn", "displacement", "gear_ratio", "foreign")]
X$foreign <- as.integer(X$foreign)
X <- as.matrix(X)
Replication of StataCorp's lasso and elasticnet requires only the rescaling of lambda by 2N. Here N=69, so the lasso2 lambda
of 138000 corresponds to a StataCorp lambda of 138000/(2*69) = 1000.
. lasso2 price mpg-foreign, lambda(138000)
. lasso linear price mpg-foreign, grid(1, min(1000))
. lassoselect lambda = 1000
. lassocoef, display(coef, penalized)
. lasso2 price mpg-foreign, alpha(0.6) lambda(138000)
. elasticnet linear price mpg-foreign, alphas(0.6) grid(1, min(1000))
. lassoselect alpha = 0.6 lambda = 1000
. lassocoef, display(coef, penalized)
glmnet uses the same definition of the lasso penalty as StataCorp's lasso, so lasso2's default parameterization again
requires only rescaling by 2N. When the lglmnet option is used, the penalty level lambda should be provided using the glmnet
definition. To estimate in R, load glmnet with library("glmnet") and use the following command:
r<-glmnet(X,price,alpha=1,lambda=1000,thresh=1e-15)
. lasso2 price mpg-foreign, lambda(138000)
. lasso2 price mpg-foreign, lambda(1000) lglmnet
The R code below uses glmnet to estimate an elastic net model. lasso2 with the lglmnet option will replicate it.
r<-glmnet(X,price,alpha=0.6,lambda=1000,thresh=1e-15)
. lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet
lasso2's default parameterization of the elastic net (like StataCorp's elasticnet) is not invariant to scaling:
. lasso2 price mpg-foreign, alpha(0.6) lambda(138000)
. lasso2 price1000 mpg-foreign, alpha(0.6) lambda(138)
When lasso2 uses the glmnet parameterization of the elastic net via the lglmnet option, results are invariant to scaling:
the only difference is that the coefficients change by the same factor of proportionality as the dependent variable.
. lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet
. lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet
The reason that the default lasso2/StataCorp parameterization is not invariant to scaling is that the penalty on the L2 norm
is affected by the scale of the dependent variable, and this in turn affects the relative weights on the L1 and L2 penalties.
The example below shows how to reparameterize so that the default lasso2 parameterization for the elastic net replicates the
glmnet parameterization. The example uses the scaling above, where the dependent variable is price1000 and the glmnet lambda=1.
The large-sample standard deviation of price1000 = 2.8912586.
. qui sum price1000
. di r(sd) * 1/sqrt( r(N)/(r(N)-1))
The lasso2 alpha = alpha(lglmnet)*SD(y) / (1-alpha(lglmnet) + alpha(lglmnet)*SD(y)). In this example, alpha =
(0.6*2.8912586)/( 1-0.6 + 0.6*2.8912586) = 0.81262488.
. di (0.6*2.8912586)/( 1-0.6 + 0.6*2.8912586)
The lasso2 lambda = 2N*lambda(lglmnet) * (alpha(lglmnet) + (1-alpha(lglmnet))/SD(y)). In this example, lambda = 2*69*1 *
(0.6 + (1-0.6)/2.8912586) = 101.89203.
. di 2*69*( 0.6 + (1-0.6)/2.8912586)
lasso2 using the glmnet parameterization, and then replicated using the lasso2/StataCorp parameterization:
. lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet
. lasso2 price1000 mpg-foreign, alpha(.81262488) lambda(101.89203)
Saved results
The set of returned e-class objects depends on whether lambda is a scalar or a list (the default).
scalars
e(N) sample size
e(cons) =1 if constant is present, 0 otherwise
e(fe) =1 if fixed effects model is used, 0 otherwise
e(alpha) elastic net parameter
e(sqrt) =1 if the sqrt-lasso is used, 0 otherwise
e(ols) =1 if post-estimation OLS results are returned, 0 otherwise
e(adaptive) =1 if adaptive loadings are used, 0 otherwise
e(p) number of penalized regressors in model
e(notpen_ct) number of unpenalized variables
e(partial_ct) number of partialled out regressors (incl constant)
e(prestd) =1 if pre-standardized
e(lcount) number of lambda values
scalars (only if lambda is a list)
e(lmax) largest lambda value
e(lmin) smallest lambda value
scalars (only if lambda is a scalar)
e(lambda) penalty level
e(r2) R-sq for lasso estimation
e(rmse) root mean squared error
e(rmseOLS) root mean squared error of post-estimation OLS
e(objfn) minimized objective function
e(k) number of selected and unpenalized/partialled-out regressors including constant (if present)
e(s) number of selected regressors
e(s0) number of selected and unpenalized regressors including constant (if present)
e(df) L0 norm ("effective degrees of freedom")
e(niter) number of iterations
e(maxiter) maximum number of iterations
e(tss) total sum of squares
e(aicmin) minimum AIC
e(bicmin) minimum BIC
e(aiccmin) minimum AICc
e(ebicmin) minimum EBIC
e(laic) lambda corresponding to minimum AIC
e(lbic) lambda corresponding to minimum BIC
e(laicc) lambda corresponding to minimum AICc
e(lebic) lambda corresponding to minimum EBIC
macros
e(cmd) command name
e(depvar) name of dependent variable
e(varX) all predictors
e(varXmodel) penalized predictors
e(partial) partialled out predictors
e(notpen) unpenalized predictors
e(method) estimation method
macros (only if lambda is a scalar)
e(selected) selected predictors
e(selected0) selected predictors excluding constant
matrices
e(Psi) row vector of penalty loadings used
e(stdvec) row vector of standardization loadings
matrices (only if lambda is a list)
e(lambdamat) column vector of lambdas used for estimations
e(lambdamat0) full initial vector of lambdas (includes lambdas not used for estimation)
e(l1norm) column vector of L1 norms for each lambda value (excludes the intercept)
e(wl1norm) column vector of weighted L1 norms for each lambda value (excludes the intercept), see wnorm
e(betas) matrix of estimates, where each row corresponds to one lambda value. The intercept is stored in the
last column.
e(IC) matrix of information criteria (AIC, AICc, BIC, EBIC) for each lambda value
(NB: All of the matrices above are also saved in standardized units, with the same name preceded by "s".)
e(dof) column vector of L0 norm for each lambda value (excludes the intercept)
e(ess) column vector of explained sum of squares for each lambda value
e(rss) column vector of residual sum of squares for each lambda value
e(rsq) column vector of R-squared for each lambda value
matrices (only if lambda is a scalar)
e(beta) coefficient vector
e(betaOLS) coefficient vector of post-estimation OLS
e(betaAll) full coefficient vector including omitted, factor base variables, etc.
e(betaAllOLS) full post-estimation OLS coefficient vector including omitted, factor base variables, etc.
(NB: All of the matrices above are also saved in standardized units, with the same name preceded by "s".)
e(b) posted coefficient vector (see postall and displayall). Used for prediction.
functions
e(sample) estimation sample
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6),
716–723. https://doi.org/10.1109/TAC.1974.1100705
Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic
programming. Biometrika 98(4), 791–806. https://doi.org/10.1093/biomet/asr043
Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse Models and Methods for Optimal Instruments With an
Application to Eminent Domain. Econometrica 80(6), 2369–2429. https://doi.org/10.3982/ECTA9626
Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli,
19(2), 521–547. https://doi.org/10.3150/11-BEJ410
Belloni, A., Chernozhukov, V., & Wang, L. (2014). Pivotal estimation via square-root Lasso in nonparametric regression. The
Annals of Statistics 42(2), 757–788. https://doi.org/10.1214/14-AOS1204
Belloni, A., Chernozhukov, V., Hansen, C., & Kozbur, D. (2016). Inference in High Dimensional Panel Models with an
Application to Gun Control. Journal of Business & Economic Statistics 34(4), 590–605.
https://doi.org/10.1080/07350015.2015.1102733
Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of
Statistics, 34(3), 1436–1462. http://doi.org/10.1214/009053606000000281
Bühlmann, P., & Van de Geer, S. (2011). Statistics for High-Dimensional Data. Berlin, Heidelberg: Springer-Verlag.
Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika,
95(3), 759–771. https://doi.org/10.1093/biomet/asn034
Correia, S. (2016). FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets.
https://ideas.repec.org/c/boc/bocode/s458213.html
Fu, W. J. (1998). Penalized Regressions: The Bridge Versus the Lasso. Journal of Computational and Graphical Statistics
7(3), 397–416. https://doi.org/10.2307/1390712
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied
Statistics 1(2), 302–332. https://doi.org/10.1214/07-AOAS131
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate
Descent. Journal of Statistical Software 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York:
Springer-Verlag. https://web.stanford.edu/~hastie/ElemStatLearn/
Hastie, T., Tibshirani, R., & Wainwright, M. J. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations.
Boca Raton: CRC Press, Taylor & Francis. https://www.stanford.edu/~hastie/StatLearnSparsity/
Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 12(1), 69–82.
https://doi.org/10.1080/00401706.1970.10488635
Huang, J., Ma, S., & Zhang, C.-H. (2008). Adaptive Lasso for Sparse High-Dimensional Regression Models. Statistica Sinica
18, 1603–1618. https://doi.org/10.2307/24308572
Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2),
297–307. http://doi.org/10.1093/biomet/76.2.297
Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464.
https://doi.org/10.1214/aos/1176344136
Stamey, T. A., Kabalin, J. N., Mcneal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang, N. (1989). Prostate
Specific Antigen in the Diagnosis and Treatment of Adenocarcinoma of the Prostate. II. Radical Prostatectomy Treated
Patients. The Journal of Urology 141(5), 1076–1083. https://doi.org/10.1016/S0022-5347(17)41175-X
Sugiura, N. (1978). Further analysts of the data by Akaike's information criterion and the finite corrections.
Communications in Statistics - Theory and Methods, 7(1), 13–26. http://doi.org/10.1080/03610927808827599
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B
(Methodological) 58(1), 267–288. https://doi.org/10.2307/2346178
Van der Kooij, A. (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis,
Department of Data Theory, University of Leiden. http://hdl.handle.net/1887/12096
Yamada, H. (2017). The Frisch–Waugh–Lovell theorem for the lasso and the ridge regression. Communications in Statistics -
Theory and Methods 46(21), 10897–10902. https://doi.org/10.1080/03610926.2016.1252403
Zhang, Y., Li, R., & Tsai, C.-L. (2010). Regularization Parameter Selections via Generalized Information Criterion. Journal
of the American Statistical Association, 105(489), 312–323. http://doi.org/10.1198/jasa.2009.tm08013
Zhao, P., & Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.
http://dl.acm.org/citation.cfm?id=1248547.1248637
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical
Society. Series B: Statistical Methodology 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476),
1418–1429. https://doi.org/10.1198/016214506000000735
Zou, H., Hastie, T., & Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist., 35(5), 2173–2192.
https://doi.org/10.1214/009053607000000127
Website
Please check our website https://statalasso.github.io/ for more information.
Installation
lasso2 is part of the lassopack package. To get the latest stable version of lassopack from our website, check the
installation instructions at https://statalasso.github.io/installation/. We update the stable website version more
frequently than the SSC version. Earlier versions of lassopack are also available from the website.
To verify that lassopack is correctly installed, click on or type whichpkg lassopack (which requires whichpkg to be
installed; ssc install whichpkg).
Acknowledgements
Thanks to Alexandre Belloni, who provided Matlab code for the square-root lasso estimator, Sergio Correia for advice on the
use of the FTOOLS package, and Jan Ditzen.
Citation of lassopack
lasso2 is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it
as such:
Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020). LASSOPACK: Stata module for lasso, square-root lasso, elastic
net, ridge, adaptive lasso estimation and cross-validation http://ideas.repec.org/c/boc/bocode/s458458.html
Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in
Stata. The Stata Journal, 20(1):176-235. https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper
version: https://arxiv.org/abs/1901.05397.
Authors
Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland
achim.ahrens@gess.ethz.ch
Christian B. Hansen, University of Chicago, USA
Christian.Hansen@chicagobooth.edu
Mark E. Schaffer, Heriot-Watt University, UK
m.e.schaffer@hw.ac.uk
Also see
Help: cvlasso, rlasso, lassologit, ivlasso, pdslasso (if installed).