Logistic Lasso: Spam data #
For demonstration we consider the Spambase Data Set from the UCI Machine Learning Repository. The data set includes 4,601 observations, 57 predictors, and one outcome variable. The aim is to predict whether an email is spam (i.e., unsolicited commercial e-mail) or not. Each observation corresponds to one email.
Predictors
v1-v48: percentage of words in the e-mail that match a specific word, i.e. 100 * (number of times the word appears in the e-mail) divided by the total number of words in the e-mail. To see which word each predictor corresponds to, see link below.
v49-v54: percentage of characters in the e-mail that match a specific character, i.e. 100 * (number of times the character appears in the e-mail) divided by the total number of characters in the e-mail. To see which character each predictor corresponds to, see link below.
v55: average length of uninterrupted sequences of capital letters
v56: length of longest uninterrupted sequence of capital letters
v57: total number of capital letters in the e-mail
Outcome
v58: denotes whether the e-mail was considered spam (1) or not (0).
. insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma
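To make the predictor definitions above concrete, here is a minimal Python sketch (outside Stata) that computes a word-frequency percentage and the three capital-letter statistics for a made-up e-mail. The function names, the example text, and the tokenization rule (words are runs of letters and digits) are assumptions for illustration; the rules actually used to build the Spambase data may differ.

```python
import re

def word_freq_pct(text, word):
    """100 * (number of times `word` appears) / (total number of words)."""
    # Assumption: a "word" is any run of letters or digits, case-insensitive.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)

def capital_run_stats(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Hypothetical e-mail text, not taken from the data set.
email = "FREE money now. Claim your FREE prize TODAY"
print(word_freq_pct(email, "free"))   # analogue of one of v1-v48
print(capital_run_stats(email))       # analogues of v55, v56, v57
```

In this toy example, "free" accounts for 2 of 8 words, and the capital-letter runs are FREE, C, FREE and TODAY.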
Introduction to lassologit #
The basic syntax for lassologit is to specify the dependent variable followed by a list of predictors:
. lassologit v58 v1-v57
The output of lassologit shows the penalty levels (lambda), the number of predictors included, the L1 norm, one information criterion (EBIC by default), McFadden's pseudo-R-squared and, in the last column, which predictors are included/removed from the model.
By default, one line per knot is shown. Knots are points at which predictors enter or leave the model.
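The notion of a knot can be illustrated with a short Python sketch. It is not how lassologit computes its output; it only scans a hypothetical coefficient path (the lambda values, variable names, and coefficients below are made up) and reports the lambdas at which the set of non-zero coefficients changes.

```python
def find_knots(lambdas, coef_path, tol=0.0):
    """Return (lambda, entered, removed) for each lambda where the active set changes."""
    knots = []
    previous = set()  # start from the empty model (all coefficients zero)
    for lam, coefs in zip(lambdas, coef_path):
        active = {name for name, b in coefs.items() if abs(b) > tol}
        if active != previous:
            entered = sorted(active - previous)
            removed = sorted(previous - active)
            knots.append((lam, entered, removed))
        previous = active
    return knots

# Hypothetical path: lambda decreases, predictors enter one at a time.
lambdas = [4.0, 3.0, 2.0, 1.0]
coef_path = [
    {"v1": 0.0, "v2": 0.0},
    {"v1": 0.3, "v2": 0.0},
    {"v1": 0.5, "v2": 0.0},
    {"v1": 0.6, "v2": -0.2},
]
knots = find_knots(lambdas, coef_path)
for lam, entered, removed in knots:
    print(lam, "entered:", entered, "removed:", removed)
```

Only two of the four lambda values are knots here, which mirrors why the default output is shorter than the full lambda grid.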
To obtain the logistic lasso estimate for a user-specified scalar lambda or a list of lambdas, the lambda(numlist) option can be used. Note that the output and the objects stored in e() depend on whether lambda is a single value or a list of more than one value.
Information criteria #
To estimate the model selected by one of the information criteria, use the lic() option:
. lassologit v58 v1-v57
. lassologit, lic(ebic)
. lassologit, lic(aicc)
In the above example, we use the replay syntax, which works similarly to a post-estimation command. lassologit reports the logistic lasso estimates and the post-logit estimates (from applying logit estimation to the model selected by the logistic lasso) for the value of lambda selected by the specified information criterion. Note that lic() does not change the estimation results in memory. The advantage is that lic() can be used multiple times to compare results without re-estimating the model.
To store the model selected by one of the information criteria, add the postresults option:
. lassologit, lic(ebic) postresults
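As a rough guide to what the information criteria trade off, the following Python sketch evaluates the textbook forms of AIC, BIC, AICc and EBIC for a fitted model with log-likelihood `loglik`, `k` selected predictors, `n` observations and `p` candidate predictors. These are standard formulas, not code taken from lassologit, and the exact definitions and the default EBIC parameter used there may differ; the input values are hypothetical.

```python
import math

def information_criteria(loglik, k, n, p, xi):
    """Textbook AIC, BIC, AICc and EBIC; xi in [0, 1] is the EBIC tuning parameter."""
    if not 0.0 <= xi <= 1.0:
        raise ValueError("xi must lie in [0, 1]")
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    # Small-sample correction of the AIC.
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)
    # EBIC adds a penalty that grows with the number of candidate predictors;
    # xi = 0 gives back the BIC.
    ebic = bic + 2.0 * xi * k * math.log(p)
    return {"aic": aic, "bic": bic, "aicc": aicc, "ebic": ebic}

# Hypothetical values: 10 selected predictors out of 57, 4,601 observations.
ic = information_criteria(loglik=-1000.0, k=10, n=4601, p=57, xi=0.5)
for name, value in ic.items():
    print(name, round(value, 2))
```

Because the EBIC penalty is at least as large as the BIC penalty, it tends to select sparser models when there are many candidate predictors.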
Cross-validation with cvlassologit #
cvlassologit implements K-fold cross-validation, where the data is by default randomly partitioned into folds. Here, we use nfolds(3) to request three folds and seed(123) to set the seed for reproducibility. (Be patient, this takes a minute.)
. cvlassologit v58 v1-v57, nfolds(3) seed(123)
The output shows the prediction performance measured by deviance for each lambda value. To estimate the model selected by cross-validation we can specify lopt or lse using the replay syntax.
. cvlassologit, lopt
. cvlassologit, lse
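The random partition behind K-fold cross-validation can be sketched in a few lines of Python. This is an illustration of the general idea only: the helper name is hypothetical, and Python's random number generator is unrelated to Stata's, so seed 123 here will not reproduce the folds that cvlassologit draws.

```python
import random
from collections import Counter

def assign_folds(n_obs, n_folds, seed):
    """Randomly assign each observation to one of n_folds folds of near-equal size."""
    rng = random.Random(seed)
    # Balanced fold labels 1..n_folds, then shuffled across observations.
    folds = [i % n_folds + 1 for i in range(n_obs)]
    rng.shuffle(folds)
    return folds

folds = assign_folds(n_obs=4601, n_folds=3, seed=123)
fold_sizes = Counter(folds)
print(sorted(fold_sizes.items()))
```

Each fold is held out once while the model is fitted on the remaining folds, and the loss is averaged over the held-out folds for every lambda.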
Rigorous penalization with rlassologit #
Lastly, we consider the logistic lasso with rigorous penalization:
. rlassologit v58 v1-v57
rlassologit displays the logistic lasso solution and the post-logit solution. The rigorous lambda is returned in e(lambda) and, in this example, is equal to 79.207801:
. di e(lambda)
We get the same result when specifying the rigorous lambda manually using the lambda() option of lassologit:
. lassologit v58 v1-v57, lambda(79.207801)
After selecting a model, we can use predict to obtain predicted probabilities or linear predictions.
First, we select a model using lic() in combination with postresults as above:
. lassologit v58 v1-v57
. lassologit, lic(ebic) postresults
Then, we use predict:
. predict double phat, pr
. predict double xbhat, xb
pr saves the predicted probability of success and xb saves the linear predicted values. Note that the use of postresults is required. Without postresults, the results of the estimation with the selected penalty level are not stored.
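The relationship between the linear prediction and the predicted probability in a logit model can be sketched in Python as follows. The intercept, coefficients, and predictor values are hypothetical and not estimates from the spam data; the sketch only shows that the probability is the logistic transform of the linear prediction.

```python
import math

def linear_prediction(intercept, coefs, x):
    """Linear prediction: intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(coefs[name] * x[name] for name in coefs)

def predicted_probability(xb):
    """Logistic transform of the linear prediction, written to avoid overflow."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    z = math.exp(xb)
    return z / (1.0 + z)

# Hypothetical selected model with two predictors.
coefs = {"v7": 2.0, "v16": 1.0}
x = {"v7": 0.5, "v16": 1.0}
xb = linear_prediction(intercept=-2.0, coefs=coefs, x=x)
print(xb, predicted_probability(xb))
```

A linear prediction of zero corresponds to a predicted probability of one half, so positive values of the linear prediction indicate that spam is the more likely class.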
The approach for cvlassologit is very similar:
. cvlassologit v58 v1-v57
. cvlassologit, lopt postresults
. predict double phat, pr
In the case of rlassologit, we don't need to select a specific penalty level and we also don't need to specify postresults:
. rlassologit v58 v1-v57
. predict double phat, pr
Assessing prediction accuracy with holdout() #
We can leave one partition of the data out of the estimation sample and check the accuracy of prediction using the holdout() option. We first define a binary holdout variable:
. gen myholdout = (_n>4500)
There are 4,601 observations in the sample, and we exclude observations 4,501 to 4,601 from the estimation. These observations are used to assess classification accuracy. The holdout variable should be set to 1 for all observations that we want to use for assessing classification accuracy.
. lassologit v58 v1-v57, holdout(myholdout)
. mat list e(loss)
. rlassologit v58 v1-v57, holdout(myholdout)
. mat list e(loss)
The loss measure is returned in e(loss). As with cross-validation, deviance is used by default. lossmeasure(class) will return the average number of misclassifications.
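The two loss measures mentioned above, deviance and the misclassification rate, can be sketched in Python for a handful of hold-out observations. The outcomes and predicted probabilities are made up, the 0.5 cutoff is an assumption, and the scaling of the deviance reported in e(loss) may differ from the per-observation average used here.

```python
import math

def mean_deviance(y, p, eps=1e-12):
    """Average binomial deviance: -2 * log-likelihood per observation."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total += -2.0 * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
    return total / len(y)

def misclassification_rate(y, p, cutoff=0.5):
    """Share of observations whose predicted class differs from the outcome."""
    wrong = sum(1 for yi, pi in zip(y, p) if (pi > cutoff) != (yi == 1))
    return wrong / len(y)

# Hypothetical hold-out outcomes and predicted probabilities.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.6]
print(mean_deviance(y, p))
print(misclassification_rate(y, p))
```

Deviance rewards well-calibrated probabilities, whereas the misclassification rate only looks at which side of the cutoff each prediction falls on.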
Plotting with lassologit #
lassologit supports plotting of the coefficient path over $$\lambda$$. Here, we create the plot using the replay syntax, but the same can be achieved in one line:
. lassologit v58 v1-v57
. lassologit, plotpath(lambda) plotvar(v1-v5) plotlabel plotopt(legend(off))
In the above example, we use the following settings:
plotpath(lambda) plots estimates against lambda.
plotvar(v1-v5) restricts the set of variables plotted to v1-v5 (to keep the graph from becoming too cluttered).
plotlabel puts variable labels next to the lines.
plotopt(legend(off)) turns the legend off.
Plotting with cvlassologit #
The plotcv option creates a graph of the estimated loss as a function of lambda:
. cvlassologit v58 v1-v57, nfolds(3) seed(123)
. cvlassologit v58 v1-v57, plotcv
The vertical solid red line indicates the value of lambda that minimizes the loss function. The dashed red line corresponds to the largest lambda for which the loss is within one standard error of the minimum loss.
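The two lambda values marked in the graph can be derived from a table of cross-validated losses, as in the following Python sketch. It assumes that one value is the loss-minimizing lambda and the other is the largest lambda whose mean loss lies within one standard error of the minimum; the lambda grid, mean losses, and standard errors are made up for illustration.

```python
def select_lambdas(lambdas, mean_loss, se_loss):
    """Return (loss-minimizing lambda, largest lambda within one SE of the minimum)."""
    i_opt = min(range(len(lambdas)), key=lambda i: mean_loss[i])
    threshold = mean_loss[i_opt] + se_loss[i_opt]
    # Largest lambda whose mean loss does not exceed the one-standard-error threshold.
    lam_se = max(lam for lam, loss in zip(lambdas, mean_loss) if loss <= threshold)
    return lambdas[i_opt], lam_se

# Hypothetical cross-validation results for four lambda values.
lambdas = [8, 4, 2, 1]
mean_loss = [1.00, 0.80, 0.75, 0.78]
se_loss = [0.05, 0.06, 0.06, 0.05]
lam_opt, lam_se = select_lambdas(lambdas, mean_loss, se_loss)
print(lam_opt, lam_se)
```

Choosing the larger lambda trades a small increase in estimated loss for a stronger penalty and therefore a sparser model.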
More information can be found in the help file: