Data arising in actuarial science are typically non-normal: extremely large losses occur with small probability, whereas small losses occur with very high probability. For this reason, heavy-tailed distributions have been used in the literature to capture the nature of highly skewed insurance loss data. One flexible class of distributions for modeling heavy-tailed data is the GB2 (Generalized Beta of the Second Kind) family. The density function for the GB2 distribution is given in Section 3.
In relation to loss models, regression is a useful technique for understanding the influence of explanatory variables on a response variable. For example, an actuary may use regression to estimate the average building and contents loss amount for different entity types, by using entity type indicators as explanatory variables. The actuary may then use the estimated regression coefficients to determine the relationship between the explanatory variables and the response. Performing a regression typically requires an assumption on the distribution of the response variable, and for this reason loss models and regression are closely related in the actuarial science context.
Now consider the case where there are many explanatory variables, or in other words where x is high dimensional. In this case, estimation of the regression coefficients becomes a challenging task. In the statistics literature, this problem is addressed using so-called high dimensional statistical methods, an example of which is the lasso estimation method; see Tibshirani (1996). The lasso approach penalizes the log-likelihood of a regression problem with a penalty term, which reduces the number of nonzero coefficients to be estimated. In the statistics literature, lasso methods have been applied extensively to linear regression models, generalized linear models, and recently to the Tweedie distribution (see Qian, Yang, and Zou 2016), which is useful for modeling the pure premium of insurance losses. For example, let ℓ(β0,β) be the log-likelihood function for a linear model, where β0 and β are the regression coefficients to be estimated. In this case, the penalized negative log-likelihood −ℓ(β0,β)/n+P(β0,β,λ) may be minimized in order to obtain a sparse set of coefficients β, given a tuning parameter λ and a penalty function P. If the group lasso penalty function is used, the regression coefficient estimate becomes

(β̂0, β̂) = argmin_{β0,β} { −ℓ(β0,β)/n + λ ∑_{j=1}^{g} ‖β_j‖_{S*_j} },
where x_i = (x_{i1}^T, x_{i2}^T, …, x_{ig}^T)^T, β = (β_1^T, …, β_g^T)^T, and g is the number of groups in the model. Here, ‖β_j‖_{S*_j} = √(β_j^T S*_j β_j), where S*_j is a square matrix. Often an identity matrix is used for S*_j, in which case the norm reduces to the Euclidean norm ‖β_j‖_2 = √(β_j^T β_j). The group lasso has been the subject of extensive recent work in the statistical learning literature, and the method has become a popular choice for analyzing high-dimensional data in the biomedical context.
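As a concrete numerical illustration of the penalty term (not part of the original derivation), the quantity λ ∑_j ‖β_j‖_{S*_j} can be evaluated directly; the following Python sketch uses function and variable names of our own choosing:

```python
import numpy as np

def group_lasso_penalty(beta_groups, S_list, lam):
    """Compute lam * sum_j sqrt(beta_j^T S*_j beta_j) over coefficient groups."""
    total = 0.0
    for beta_j, S_j in zip(beta_groups, S_list):
        total += np.sqrt(beta_j @ S_j @ beta_j)
    return lam * total

# With identity S*_j, the penalty reduces to lam times the sum of
# Euclidean norms of the groups: here 0.5 * (5 + 0) = 2.5.
beta = [np.array([3.0, 4.0]), np.array([0.0, 0.0])]
S = [np.eye(2), np.eye(2)]
print(group_lasso_penalty(beta, S, lam=0.5))  # 2.5
```

Note that a group whose coefficients are all zero contributes nothing to the penalty, which is what makes the group lasso select or drop entire groups at once.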
One can easily imagine plugging the log-likelihood for a heavy-tailed loss model, such as the log-likelihood for the GB2 distribution, in place of ℓ(β0,β), in order to obtain a sparse set of coefficients β for a high-dimensional problem in actuarial science. This is precisely the motivation for this paper. The lasso approach has not yet been applied to heavy-tailed loss distributions in the statistics literature, nor in the actuarial literature. In this paper, we will demonstrate how fast regression routines with group-lasso penalties (a variable selection approach for selecting groups of coefficients, as in Yuan and Lin 2006) can be built for important loss models including the GB2 distribution.
2.1. One dimensional LASSO
The easiest way to understand a concept is to look at a simple example. Let’s suppose we are interested in minimizing a function as simple as

ℓ(b) = b².
By inspection, the minimum is at b=0. To converge numerically to this solution, we may use Newton’s method. A good overview of numerical methods, including Newton’s method, is Heath (2002). For the method to work, we need the first and second derivatives of the function ℓ(b) evaluated at a particular point, say b̃. The expressions for the first and second derivatives are simple: ℓ′(b)=2b and ℓ′′(b)=2. Then, the update scheme for Newton’s method becomes

b_new = b̃ − ℓ′(b̃)/ℓ′′(b̃) = b̃ − 2b̃/2 = 0,
where we have converged in one step, because Newton’s method iteratively solves quadratic approximations to the original function, and in this case the original function is itself quadratic. This procedure can in general be easily implemented on a computer using the following algorithm:
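A minimal sketch of such a Newton iteration, written in Python for illustration (the function names are our own; the implementations discussed later in the paper are R packages):

```python
def newton_1d(grad, hess, b0, tol=1e-8, max_iter=100):
    """Newton's method in one dimension: b <- b - l'(b)/l''(b)."""
    b = b0
    for _ in range(max_iter):
        step = grad(b) / hess(b)
        b = b - step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return b

# For l(b) = b^2 we have l'(b) = 2b and l''(b) = 2,
# so the iteration converges in a single step from any start.
b_hat = newton_1d(lambda b: 2 * b, lambda b: 2.0, b0=5.0)
print(b_hat)  # 0.0
```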
In statistics and actuarial science practice, the objective function may be defined as a negative log-likelihood function whose arguments are the coefficients of a model to be estimated. Because the likelihood function in such cases is defined as a summation of log-likelihood values evaluated at the observed data, we may imagine that there is some spurious noise in the data related to the argument, perturbing the function to

ℓϵ(b) = (b − ϵ)²,
where we imagine ϵ>0 is a small number. The minimum of ℓϵ(b) is at b=ϵ. Yet, we may not be interested in the effects of the spurious noise. Let’s suppose we are actually interested in the “true” solution 0. In order to remove the noise, we may use the LASSO technique, introduced in the statistics literature by Tibshirani (1996). The idea is to penalize the likelihood function, so that we minimize the expression

ℓP(b) = (b − ϵ)² + λ|b|,
for some tuning parameter λ≥0. Doing this results in a so-called shrinkage of the parameter b, so that if λ is large enough and ϵ is small enough, the solution returned by the algorithm is indeed the desired 0. The tuning parameter tells us how strong the penalty should be, where a stronger penalty corresponds to the removal of larger values of ϵ. Let’s first suppose b≥0. Then, equation (4) reduces to

ℓP(b) = (b − ϵ)² + λb.
In this case, the first and second derivatives of the function ℓP(b) are ℓ′P(b)=2(b−ϵ)+λ and ℓ′′P(b)=2. Thus, the first order condition gives b=ϵ−λ/2, subject to b≥0, whose solution is b=0 if λ≥2ϵ, and a positive number otherwise. If b≤0, then a similar thought process gives b=ϵ+λ/2>0, which is a contradiction. So the solution satisfies

b̂ = max(ϵ − λ/2, 0).
If we plot this solution for λ=1, we obtain the graph in Figure 1. In the figure, notice that if the spurious noise is large, then the LASSO routine converges to some positive number, but if the noise is small, then it is successful at converging to zero.
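The closed-form solution above is simply a soft-thresholding rule, and can be sketched as follows (illustrative Python; the function name is our own):

```python
def lasso_1d(eps, lam):
    """Closed-form minimizer of (b - eps)^2 + lam*|b| for eps > 0:
    the soft-thresholded value max(eps - lam/2, 0)."""
    return max(eps - lam / 2.0, 0.0)

# With lam = 1: small spurious noise eps is shrunk exactly to zero,
# while large noise survives, reduced by lam/2.
print(lasso_1d(0.3, 1.0))  # 0.0
print(lasso_1d(2.0, 1.0))  # 1.5
```

This reproduces the behavior shown in Figure 1: the estimate is exactly zero whenever ϵ ≤ λ/2, and grows linearly in ϵ beyond that threshold.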
2.2. Two dimensional problem
This time, let’s consider the following function:

ℓ(a,b) = a² + b².
By inspection, the minimum is clearly at (a,b)=(0,0). For Newton’s method to work, we need the gradient and the Hessian matrix of the function evaluated at a particular point (ã,b̃)ᵀ. We denote the function value at this point by ℓ(ã,b̃). The gradient and Hessian at the point can be obtained by differentiation:

∇ℓ(ã,b̃) = (2ã, 2b̃)ᵀ,   ∇²ℓ(ã,b̃) = [[2, 0], [0, 2]].
Then, given an initial guess (ã,b̃)ᵀ, the updated values may be obtained by

(a_new, b_new)ᵀ = (ã,b̃)ᵀ − [∇²ℓ(ã,b̃)]⁻¹ ∇ℓ(ã,b̃) = (0,0)ᵀ.
This procedure can be easily implemented using the following algorithm.
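A generic sketch of this two-dimensional Newton iteration, again in Python with NumPy for illustration (names are our own):

```python
import numpy as np

def newton_2d(grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton's method in two dimensions: x <- x - H(x)^{-1} g(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Solve the linear system H step = g rather than inverting H.
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# For l(a,b) = a^2 + b^2: gradient (2a, 2b), Hessian 2*I,
# so the method converges in one step.
grad = lambda x: 2.0 * x
hess = lambda x: 2.0 * np.eye(2)
print(newton_2d(grad, hess, x0=[3.0, -1.0]))  # [0. 0.]
```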
Let’s see what happens now when we have spurious noise in both of the variables a and b. With spurious noise, the function may be perturbed to

ℓϵ(a,b) = (a − ϵ1)² + (b − ϵ2)².
We hope to use the LASSO technique to minimize the expression

ℓP(a,b) = (a − ϵ1)² + (b − ϵ2)² + λ(|a| + |b|).
The hope is that we can apply first order conditions similar to those in Section 2.1 separately to each of a and b, but we immediately see that this is not so easy, because the update scheme now involves a matrix multiplication by the Hessian inverse in Equation (9). The hope is that we can ignore the off-diagonal terms of the Hessian matrix and use coordinate-wise update schemes like

a_new = ã − (∂ℓϵ(ã,b̃)/∂a) / (∂²ℓϵ(ã,b̃)/∂a²),   b_new = b̃ − (∂ℓϵ(ã,b̃)/∂b) / (∂²ℓϵ(ã,b̃)/∂b²),
which would allow us to shrink each parameter separately using the same technique as in Section 2.1. Doing this directly, however, results in a divergent algorithm, because it is no longer a correct Newton’s method. The good news is that, by using a double loop, it “kind of” becomes possible to do this! The way this is done is explained in the next section.
2.3. Double looping
Let’s try to understand how the approach taken in Yang and Zou (2015) and Qian, Yang, and Zou (2016) for the implementation of the gglasso and HDtweedie R packages really works. To keep the discussion as simple as possible, let us continue with the simple function from Section 2.2. Let

Ũϵ = ∇ℓϵ(ã,b̃)   and   H̃ϵ = ∇²ℓϵ(ã,b̃)

denote the gradient vector and Hessian matrix of ℓϵ evaluated at the current iterate.
The idea is to allow the Ũϵ vector and H̃ϵ matrix to be updated in the outer loop, while running an inner loop that allows the algorithm to converge to each Newton iteration solution for each of the parameters separately. Let’s denote the parameter values in the outer loop by (ã,b̃). At the current parameter value, we can obtain the Ũϵ vector and H̃ϵ matrix. Then, consider the quadratic surface

ℓQ(a,b) = ℓϵ(ã,b̃) + Ũϵᵀ [(a,b)ᵀ − (ã,b̃)ᵀ] + ½ [(a,b)ᵀ − (ã,b̃)ᵀ]ᵀ H̃ϵ [(a,b)ᵀ − (ã,b̃)ᵀ].
We can think of each Newton iteration as solving a quadratic minimization problem, and Equation (15) is what this surface looks like in functional form. Remember that values with the tilde sign are held fixed in the outer loop. We have the liberty to solve each Newton iteration step in whatever way we want. So let’s consider taking a quadratic approximation to (15) one more time. What this will do is create an inner loop, which will run within each Newton iteration of the outer loop. We have