## Logistic Lasso: Spam data #

For demonstration, we consider the Spambase Data Set from the UCI Machine Learning Repository. The data set includes 4,601 observations and 57 predictors plus a binary outcome. The aim is to predict whether an email is spam (i.e., an unsolicited commercial e-mail) or not. Each observation corresponds to one email.

```
Predictors
  v1-v48    percentage of words in the e-mail that match a specific word, i.e.
            100 * (number of times the word appears in the e-mail) divided by
            total number of words in e-mail. To see which word each predictor
            corresponds to, see link below.
  v49-v54   percentage of characters in the e-mail that match a specific
            character, i.e. 100 * (number of times the character appears in
            the e-mail) divided by total number of characters in e-mail. To
            see which character each predictor corresponds to, see link below.
  v55       average length of uninterrupted sequences of capital letters
  v56       length of longest uninterrupted sequence of capital letters
  v57       total number of capital letters in the e-mail
Outcome
  v58       denotes whether the e-mail was considered spam (1) or not (0).

. insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma
```
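The commands used below belong to the `lassopack` package, which is distributed through SSC. The following do-file fragment is a minimal sketch for installing the package and sanity-checking the import; it assumes the `insheet` call above succeeded and named the columns `v1` to `v58`.

```stata
* Install lassopack from SSC (only needed once; requires internet access)
ssc install lassopack, replace

* Confirm the sample size and that the outcome is coded 0/1
count
assert r(N) == 4601
assert inlist(v58, 0, 1)

* Share of spam vs. non-spam e-mails
tabulate v58
```

If either `assert` fails, the download was probably incomplete or the delimiter was not recognized.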

### Introduction to lassologit #

The basic syntax for lassologit is to specify the dependent variable followed by a list of predictors:

```
. lassologit v58 v1-v57
```

The output of `lassologit` shows the penalty levels (`lambda`), the number of
predictors included (`s`), the \(\ell_1\) norm, one information criterion
(\(EBIC\) by default), McFadden’s Pseudo-\(R^2\) and, in the last column,
which predictors are included in or removed from the model.
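For orientation, the quantities in this output can be written out. The display below is a sketch of the standard logistic lasso objective, not a statement of the package's exact definition: the \(\lambda/N\) scaling of the penalty and the unpenalized intercept \(\beta_0\) are assumptions about the convention, and any predictor-specific penalty loadings are omitted. The help file gives the authoritative form.

\[
\hat\beta(\lambda) = \arg\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i(\beta_0 + x_i'\beta) - \log\big(1+\exp(\beta_0 + x_i'\beta)\big) \Big] + \frac{\lambda}{N}\,\lVert\beta\rVert_1,
\qquad \lVert\beta\rVert_1 = \sum_{j=1}^{p} \lvert\beta_j\rvert
\]

McFadden's pseudo-\(R^2\) compares the log-likelihood of the fitted model, \(\ln L(\hat\beta)\), with that of an intercept-only model, \(\ln L_0\):

\[
R^2_{\text{McFadden}} = 1 - \frac{\ln L(\hat\beta)}{\ln L_0}
\]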

By default, one line per *knot* is shown. Knots are points at which predictors
enter or leave the model.

To obtain the logistic lasso estimate for a user-specified scalar lambda or a
list of lambdas, the `lambda(numlist)` option can be used. Note that the output
and the objects stored in `e()` depend on whether lambda is a single value or a
list of more than one value.

### Information criteria #

To estimate the model selected by one of the information criteria, use the `lic()` option:

```
. lassologit v58 v1-v57
. lassologit, lic(ebic)
. lassologit, lic(aicc)
```

In the above example, we use the replay syntax, which works similarly to a
post-estimation command. `lassologit` reports the logistic lasso estimates and
the post-logit estimates (from applying logit estimation to the model selected
by the logistic lasso) for the value of lambda selected by the specified
information criterion.
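As a reminder of what criteria such as `ebic` and `aicc` trade off, the textbook definitions are sketched here. They are stated under assumptions: \(df\) is taken to be the number of selected predictors, \(p\) the number of candidate predictors, and \(\xi \in [0,1]\) the EBIC tuning parameter. The package's exact conventions (for example its default \(\xi\)) are not given in this section and may differ.

\[
\begin{aligned}
AIC &= -2\ln L + 2\,df \\
BIC &= -2\ln L + df\,\ln N \\
AIC_c &= AIC + \frac{2\,df\,(df+1)}{N - df - 1} \\
EBIC_\xi &= BIC + 2\,\xi\,df\,\ln p
\end{aligned}
\]

In each case the first term rewards fit and the remaining terms penalize model size, so a larger penalty term leads to a sparser selected model.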

**NB:** `lic()` does not change the estimation results in memory. The advantage
is that `lic()` can be used multiple times to compare results without
re-estimating the model.

To store the model selected by one of the information criteria, add `postresults`:

```
. lassologit, lic(ebic) postresults
```

### Cross-validation with cvlassologit #

`cvlassologit` implements \(K\)-fold cross-validation where the data is by
default randomly partitioned.

Here, we use \(K=3\) and `seed(123)` to set the seed for reproducibility. (Be
patient, this takes a minute.)

```
. cvlassologit v58 v1-v57, nfolds(3) seed(123)
```

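The output shows the prediction performance measured by deviance for each
\(\lambda\) value. To estimate the model selected by cross-validation, we can
specify `lopt` (the lambda that minimizes the loss) or `lse` (the largest lambda
for which the loss is within one standard error of the minimum) using the
replay syntax.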

```
. cvlassologit, lopt
. cvlassologit, lse
```
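To make the idea of a random \(K\)-fold partition concrete, the following do-file fragment assigns each observation to one of three folds. It is an illustration only: the variable names `u` and `fold` are arbitrary, and this is not necessarily how `cvlassologit` builds its folds internally, so `set seed 123` here will not reproduce the partition used by `seed(123)` above.

```stata
* Illustration only: assign each observation to one of K = 3 random folds
set seed 123
generate double u = runiform()
xtile fold = u, nquantiles(3)

* Folds should be of (nearly) equal size
tabulate fold
assert inrange(fold, 1, 3)
drop u
```

In each round of cross-validation, one fold is set aside for evaluation and the model is estimated on the remaining two.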

### Rigorous penalization with rlassologit #

Lastly, we consider the logistic lasso with rigorous penalization:

```
. rlassologit v58 v1-v57
```

`rlassologit` displays the logistic lasso solution and the post-logit solution.

The rigorous lambda is returned in `e(lambda)` and, in this example, is equal to `79.207801`.

```
. di e(lambda)
```

We get the same result when specifying the rigorous lambda manually using the
`lambda()` option of `lassologit`:

```
. lassologit v58 v1-v57, lambda(79.207801)
```
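Typing the value by hand truncates it to the displayed digits. A minimal sketch that avoids this stores `e(lambda)` in a local macro first; the macro name `lam` is arbitrary, and the fragment should be run in one go (for example from a do-file) so that the local stays in scope.

```stata
* Re-estimate with rigorous penalization so that e(lambda) is available
rlassologit v58 v1-v57

* Store the rigorous lambda (the local name lam is arbitrary)
local lam = e(lambda)
display "rigorous lambda = `lam'"

* Pass the stored value to lassologit instead of typing it
lassologit v58 v1-v57, lambda(`lam')
```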

### Prediction #

After selecting a model, we can use predict to obtain predicted probabilities or linear predictions.

First, we select a model using `lic()` in combination with `postresults` as above:

```
. lassologit v58 v1-v57
. lassologit, lic(ebic) postresults
```

Then, we use `predict`:

```
. predict double phat, pr
. predict double xbhat, xb
```

`pr` saves the predicted probability of success and `xb` saves the linear
predicted values.

Note that the use of `postresults` is required. Without `postresults`, the
results of the estimation with the selected penalty level are not stored.
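Predicted probabilities can be turned into class predictions with standard Stata commands. The sketch below assumes `phat` was created by the `predict` call above; the 0.5 cutoff and the variable names `spamhat` and `correct` are arbitrary choices made for illustration.

```stata
* Classify as spam when the predicted probability exceeds 0.5 (arbitrary cutoff)
generate byte spamhat = (phat > 0.5) if !missing(phat)

* Confusion table: observed outcome (rows) vs. predicted class (columns)
tabulate v58 spamhat

* In-sample share of correctly classified e-mails
generate byte correct = (spamhat == v58) if !missing(spamhat)
summarize correct
```

Because the same observations are used for estimation and evaluation, this share is an in-sample measure and will tend to overstate out-of-sample accuracy.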

The approach for `cvlassologit` is very similar:

```
. cvlassologit v58 v1-v57
. cvlassologit, lopt postresults
. predict double phat_cv, pr
```

In the case of `rlassologit`, we don’t need to select a specific penalty level
and we also don’t need to specify `postresults`.

```
. rlassologit v58 v1-v57
. predict double phat_r, pr
```

### Assessing prediction accuracy with holdout() #

We can leave one partition of the data out of the estimation sample and check
the accuracy of prediction using the `holdout(varname)` option.

We first define a binary holdout variable:

```
. gen myholdout = (_n>4500)
```

There are 4,601 observations in the sample, and we exclude observations 4,501 to 4,601 from the estimation. These observations are used to assess classification accuracy. The holdout variable should be set to 1 for all observations that we want to use for assessing classification accuracy.

```
. lassologit v58 v1-v57, holdout(myholdout)
. mat list e(loss)
. rlassologit v58 v1-v57, holdout(myholdout)
. mat list e(loss)
```
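Holding out the last rows of the file is only representative if the file order is unrelated to the outcome, which is worth checking. The following sketch tabulates the outcome in the holdout sample and, as an alternative, builds a random holdout indicator; the variable name `rholdout` and the 20% share are arbitrary choices for illustration.

```stata
* Check the outcome mix in the original holdout sample
tabulate v58 if myholdout == 1

* Alternative: hold out a random 20% of observations (share is arbitrary)
set seed 123
generate byte rholdout = (runiform() < 0.2)
tabulate v58 rholdout

* Use the random holdout sample to assess out-of-sample loss
lassologit v58 v1-v57, holdout(rholdout)
mat list e(loss)
```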

The loss measure is returned in `e(loss)`. As with cross-validation, deviance is
used by default. `lossmeasure(class)` will return the average number of
misclassifications.
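For reference, the two loss measures can be sketched as follows, with \(H\) the set of holdout observations, \(n_H\) its size and \(\hat p_i\) the predicted probability for observation \(i\). These are the textbook forms; whether the package reports the deviance as a sum or an average, and the 0.5 classification cutoff, are assumptions not confirmed by this section.

\[
\text{Deviance} = -2 \sum_{i \in H} \Big[\, y_i \ln \hat p_i + (1 - y_i)\ln(1 - \hat p_i) \Big]
\]

\[
\text{Misclassification rate} = \frac{1}{n_H} \sum_{i \in H} \mathbf{1}\big\{\, y_i \neq \mathbf{1}\{\hat p_i > 0.5\} \big\}
\]

Deviance penalizes confident but wrong probabilities heavily, whereas the misclassification rate only counts whether the predicted class is right.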

### Plotting with lassologit #

`lassologit` supports plotting of the coefficient path over \(\lambda\). Here, we
create the plot using the replay syntax, but the same can be achieved in one
line:

```
. lassologit v58 v1-v57
. lassologit, plotpath(lambda) plotvar(v1-v5) plotlabel plotopt(legend(off))
```

In the above example, we use the following settings: `plotpath(lambda)` plots
estimates against lambda. `plotvar(v1-v5)` restricts the set of variables
plotted to `v1-v5` (to keep the graph from becoming too cluttered). `plotlabel`
puts variable labels next to the lines. `plotopt(legend(off))` turns the legend
off.

### Plotting with cvlassologit #

The `plotcv` option creates a graph of the estimated loss as a function of lambda:

```
. cvlassologit v58 v1-v57, nfolds(3) seed(123)
. cvlassologit v58 v1-v57, plotcv
```

The vertical solid red line indicates the value of lambda that minimizes the loss function. The dashed red line corresponds to the largest lambda for which the loss is within one standard error of the minimum loss.

## More #

More information can be found in the help file:

```
help lassologit
```