Example using Spam data

Logistic Lasso: Spam data #

For demonstration, we consider the Spambase Data Set from the UCI Machine Learning Repository. The data set includes 4,601 observations, 57 predictors and one binary outcome. The aim is to predict whether an email is spam (i.e., unsolicited commercial e-mail) or not. Each observation corresponds to one email.

Predictors    
  v1-v48    percentage of words in the e-mail that match a specific word, i.e.
              100 * (number of times the word appears in the e-mail) divided by
              total number of words in e-mail.  To see which word each predictor
              corresponds to, see link below.
  v49-v54   percentage of characters in the e-mail that match a specific
              character, i.e. 100 * (number of times the character appears in
              the e-mail) divided by total number of characters in e-mail.  To
              see which character each predictor corresponds to, see link below.
  v55       average length of uninterrupted sequences of capital letters
  v56       length of longest uninterrupted sequence of capital letters
  v57       total number of capital letters in the e-mail

Outcome       
  v58       denotes whether the e-mail was considered spam (1) or not (0).

. insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma
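Before estimating anything, it can be worth confirming that the import worked. The following is a minimal sketch using only built-in Stata commands; it assumes the insheet call above succeeded and that the columns were named v1 to v58 as described.

```stata
* Sanity checks after import (assumes the Spambase data are in memory)

* Number of observations and variables
describe, short

* Sample size stated above; stops with an error if it does not match
assert _N == 4601

* Frequency of spam (1) versus non-spam (0)
tabulate v58
```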

Introduction to lassologit #

The basic syntax for lassologit is to specify the dependent variable followed by a list of predictors:

. lassologit v58 v1-v57

The output of lassologit shows the penalty levels (lambda), the number of predictors included (s), the $\ell_1$ norm, one information criterion (EBIC by default), McFadden’s pseudo-$R^2$ and, in the last column, which predictors are included in or removed from the model.

By default, one line per knot is shown. Knots are points at which predictors enter or leave the model.

To obtain the logistic lasso estimate for a user-specified scalar lambda or a list of lambdas, the lambda(numlist) option can be used. Note that output and the objects stored in e() depend on whether lambda is only one value or a list of more than one value.
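As a brief illustration of the two cases, the sketch below passes first a single value and then a short list to lambda(). The numbers are arbitrary and chosen for illustration only; they are not recommended penalty levels. ereturn list is Stata's built-in command for inspecting what is stored in e().

```stata
* Single penalty level: estimates for one value of lambda
lassologit v58 v1-v57, lambda(40)
ereturn list

* Several penalty levels: estimates for each listed value
lassologit v58 v1-v57, lambda(80 40 20)
ereturn list
```

Comparing the two ereturn list outputs shows how the stored objects differ between the scalar and the list case.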

Information criteria #

To estimate the model selected by one of the information criteria, use the lic() option:

. lassologit v58 v1-v57
. lassologit, lic(ebic)
. lassologit, lic(aicc)

In the above example, we use the replay syntax, which works similarly to a post-estimation command. lassologit reports the logistic lasso estimates and the post-logit estimates (from applying logit estimation to the model selected by the logistic lasso) for the value of lambda selected by the specified information criterion.

NB: lic() does not change the estimation results in memory. The advantage is that lic() can be used multiple times to compare results without re-estimating the model.
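Since lic() leaves the stored results untouched, several criteria can be compared in a loop after a single estimation. This is only a sketch: ebic and aicc appear above, while aic and bic are assumed here to be accepted by lic() as well (see help lassologit to confirm).

```stata
* Estimate the full lasso path once
lassologit v58 v1-v57

* Replay with each criterion; the path is not re-estimated
* (aic and bic are assumed to be valid arguments of lic())
foreach ic in aic bic aicc ebic {
    display as text _newline "Information criterion: `ic'"
    lassologit, lic(`ic')
}
```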

To store the model selected by one of the information criteria, add postresults:

. lassologit, lic(ebic) postresults

Cross-validation with cvlassologit #

cvlassologit implements $K$-fold cross-validation. By default, the data are randomly partitioned into the $K$ folds.

Here, we use $K=3$ and seed(123) to set the seed for reproducibility. (Be patient, this takes a minute.)

. cvlassologit v58 v1-v57, nfolds(3) seed(123)

The output shows the prediction performance measured by deviance for each $\lambda$ value. To estimate the model selected by cross-validation we can specify lopt or lse using the replay syntax.

. cvlassologit, lopt
. cvlassologit, lse
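For reference, the deviance of a binary outcome is conventionally defined from the Bernoulli log-likelihood. The expression below is the textbook form for a held-out fold $V$ with predicted probabilities $\hat{p}_i$; the symbols $V$ and $\hat{p}_i$ are introduced here for illustration, and the exact normalization used internally by cvlassologit may differ.

$$
D = -\frac{2}{|V|} \sum_{i \in V} \left[ y_i \log \hat{p}_i + (1 - y_i) \log \left( 1 - \hat{p}_i \right) \right]
$$

Smaller values indicate better out-of-sample fit.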

Rigorous penalization with rlassologit #

Lastly, we consider the logistic lasso with rigorous penalization:

. rlassologit v58 v1-v57

rlassologit displays the logistic lasso solution and the post-logit solution.

The rigorous lambda is returned in e(lambda) and, in this example, is equal to 79.207801.

. di e(lambda)

We get the same result when specifying the rigorous lambda manually using the lambda() option of lassologit:

. lassologit v58 v1-v57, lambda(79.207801)
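Typing the value by hand invites rounding differences. A small sketch that instead passes the stored value along through a local macro (the macro name lam is arbitrary):

```stata
* Rigorous lasso; the penalty level is stored in e(lambda)
rlassologit v58 v1-v57
local lam = e(lambda)
display "rigorous lambda = `lam'"

* Re-use the same penalty level in lassologit
lassologit v58 v1-v57, lambda(`lam')
```

Local macros are only visible within the do-file or interactive session in which they are defined, so the lines should be run together.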

Prediction #

After selecting a model, we can use predict to obtain predicted probabilities or linear predictions.

First, we select a model using lic() in combination with postresults as above:

. lassologit v58 v1-v57
. lassologit, lic(ebic) postresults

Then, we use predict:

. predict double phat, pr
. predict double xbhat, xb

pr saves the predicted probability of success and xb saves the linear predicted values.

Note that the use of postresults is required. Without postresults the results of the estimation with the selected penalty level are not stored.
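The two predictions are linked through the logistic function, $\hat{p} = 1/(1+\exp(-x'\hat{\beta}))$, so one can be recovered from the other. The sketch below checks this and then tabulates a simple classification. The variable names phat_check, pdiff and spamhat and the 0.5 cut-off are illustrative choices, not part of lassologit.

```stata
* Assumes phat and xbhat were created by the predict calls above

* Recover the probability from the linear prediction
generate double phat_check = invlogit(xbhat)
generate double pdiff = abs(phat - phat_check)

* pdiff should be zero up to rounding error
summarize pdiff

* Classify as spam when the predicted probability exceeds 0.5
generate byte spamhat = (phat > 0.5) if !missing(phat)
tabulate v58 spamhat
```

The resulting two-way table is a confusion matrix: off-diagonal cells count the misclassified e-mails at the chosen cut-off.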

The approach for cvlassologit is very similar:

. cvlassologit v58 v1-v57
. cvlassologit, lopt postresults
. predict double phat, pr

In the case of rlassologit, we don’t need to select a specific penalty level and we also don’t need to specify postresults.

. rlassologit v58 v1-v57
. predict double phat, pr

Assessing prediction accuracy with holdout() #

We can leave one partition of the data out of the estimation sample and check the accuracy of prediction using the holdout(varname) option.

We first define a binary holdout variable:

. gen myholdout = (_n>4500)

There are 4,601 observations in the sample, and we exclude observations 4,501 to 4,601 from the estimation. These observations are used to assess classification accuracy. The holdout variable should be set to 1 for all observations that we want to use for assessing classification accuracy.

. lassologit v58 v1-v57, holdout(myholdout)
. mat list e(loss)

. rlassologit v58 v1-v57, holdout(myholdout)
. mat list e(loss)

The loss measure is returned in e(loss). As with cross-validation, deviance is used by default. lossmeasure(class) will return the average number of misclassifications.
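Taking the last rows as the holdout sample is only sensible if the file is in no particular order. If the rows happen to be sorted by outcome, a randomly drawn holdout sample is a safer choice. A sketch, assuming an illustrative holdout share of roughly 20% and the hypothetical variable name myholdout2:

```stata
* Check how the outcome is distributed in the original holdout sample
tabulate v58 myholdout

* Draw a random holdout sample of roughly 20% of observations
set seed 123
generate byte myholdout2 = (runiform() < 0.2)

* Holdout loss measured by deviance (the default)
lassologit v58 v1-v57, holdout(myholdout2)
mat list e(loss)

* Holdout loss measured by misclassification
lassologit v58 v1-v57, holdout(myholdout2) lossmeasure(class)
mat list e(loss)
```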

Plotting with lassologit #

lassologit supports plotting of the coefficient path over $\lambda$. Here, we create the plot using the replay syntax, but the same can be achieved in one line:

. lassologit v58 v1-v57
. lassologit, plotpath(lambda) plotvar(v1-v5) plotlabel plotopt(legend(off))

In the above example, we use the following settings: plotpath(lambda) plots the estimates against lambda; plotvar(v1-v5) restricts the set of plotted variables to v1-v5, which keeps the graph from becoming cluttered; plotlabel puts variable labels next to the lines; and plotopt(legend(off)) turns the legend off.
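The one-line variant mentioned above simply appends the plotting options to the estimation call. In the sketch below, graph export is Stata's built-in command for saving the current graph, and the file name lassopath.png is an arbitrary example.

```stata
* Estimate and plot the coefficient path in a single call
lassologit v58 v1-v57, plotpath(lambda) plotvar(v1-v5) plotlabel plotopt(legend(off))

* Save the current graph to disk
graph export lassopath.png, replace
```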

Plotting with cvlassologit #

The plotcv option creates a graph of the estimated loss as a function of lambda:

. cvlassologit v58 v1-v57, nfolds(3) seed(123)
. cvlassologit v58 v1-v57, plotcv

The vertical solid red line indicates the value of lambda that minimizes the loss function. The dashed red line corresponds to the largest lambda for which the loss is within one standard error of the minimum loss.

More #

More information can be found in the help file:

help lassologit