1. Introduction^{[1]}
Many claim data sets are modeled, and estimates of loss reserve produced, by means of simple statistical structures. The chain ladder model, and its simple derivatives, such as BornhuetterFerguson, Cape Cod, etc. (Taylor 2000; Wüthrich and Merz 2008) may be singled out for special mention in this respect.
Other data sets are modeled by means of more complex statistical structures. For example, Taylor and McGuire (2016) describe in detail the application of generalized linear models (GLMs) to claims data. This approach is especially suitable for data sets that contain features such that the chain ladder model is inapplicable.
More recently, interest has been growing in machine learning (Harej, Gächter, and Jamal 2017; Jamal et al. 2018). This category of model includes the artificial neural net (ANN), which has been studied in earlier literature (Mulquiney 2006), and shown to be well adapted to data sets with complex features, such as those modeled with GLMs.
The drawback of a GLM is that its fitting to a complex data set requires a skilled resource, and is timeconsuming. Many diagnostics will require programming, much time will be absorbed by their review, and many iterations of the model will be required.
For a data set such as that described in Section 4.4.1, a total of 15 to 25 hours may be required, even assuming that all diagnostics have already been programmed, and excluding documentation. The cost of this will vary from place to place, but a reasonable estimate of the consultancy fee in a developed country might be US$5,00010,000.
The time and cost might be cut if an ANN could be applied. However, the ANN model will not be transparent. It will provide only limited revelation of the claim processes and mechanics (e.g., superimposed inflation (SI)) generating the data set. Although the ANN might produce a good fit to past data, the absence of this information on claim processes might render their extrapolation to the future difficult. This can induce considerable uncertainty in any estimate of loss reserve.
A GLM, on the other hand, is a structured and transparent model that rectifies these shortcomings, but, as mentioned above, at a cost.
The question that motivates this work is whether it is possible to produce a model similar in structure, transparency and interpretability to a GLM, but in a mostly automatic fashion. For this, we turn to regularized regression, specifically the lasso, which provides a compromise between the GLM and the ANN, with the potential to achieve the best of both worlds. Given a set of input features and a degree of regularization or penalty, the lasso will select a subset of the features to generate a model that resembles a GLM, without any further intervention.
A key phrase above is “given a set of input features.” The lasso will perform feature selection to construct a model, but will not generate the initial set of features. Therefore, much of the focus of this paper is in determining a framework to construct a set of initial features for consideration by the lasso. Specifically, our aim here is to establish a framework in which the lasso may be automated for application to claim modeling in such a way as to cut the modeling cost to a small fraction of that required by a GLM, but with an output equivalent to that of a GLM, and with all of the latter’s transparency.
In order to achieve these objectives, the model is to be selfassembling in the sense that, on submission of a data set to the lassobased software, the model will assemble itself in a form that may be read, validated, understood and extrapolated exactly as would a GLM.
Beyond this very practical motivation, regularization is possessed of some very useful theoretical properties. Hoerl and Kennard (1970), in their discussion of ridge regression (a special case of regularization), noted that there is always a shrinkage constant which gives lower estimation and prediction variances than the maximum likelihood estimate (which is the minimum variance unbiased estimate). The same observation applies to the lasso.
This biasvariance tradeoff property is an important justification of regularization techniques. The result is that while estimates are biased relative to the overall mean, they are also closer to the individual means. This is a concept well understood by actuaries familiar with credibility theory.
The lasso has found its way into recent actuarial literature, though not always in application to loss reserving. Venter (2018) uses a Bayesian lasso for loss reserving where, for the most part, the model structure is one of multiplicative accident and development year effects. A later section of the paper considers some variations of this structure, such as the inclusion of some additive effects or a multiplicative payment year effect.
Li, O’Hare, and Vahid (2017) and Venter and Şahın (2018) apply the lasso to ageperiodcohort models, a classical lasso in the first case and a Bayesian lasso in the second. The “ageperiodcohort” terminology is used in mortality studies, but, as those authors point out, precisely the same concept arises in insurance claim modeling, where the translation is accident yeardevelopment yearpayment year.
On the other hand, the model structures required for mortality and claim modelling are quite different, so the mortality papers do not consider some of the issues required in loss modeling.
A somewhat related paper is that of Gao and Meng (2018), who construct a Bayesian model of a loss reserving triangle. This resembles the present paper to the extent that it uses natural cubic splines as basis functions for the construction of curves describing the development of an accident year’s claim cost over development years. The present paper uses linear splines as basis function for describing this type of development and much else.
Neither of the loss reserving papers mentioned here considers the issue of loss development patterns that vary from one accident period to another, an issue that arises in practice. Gao and Meng obtain their data from a particular NAIC data base (Meyers 2015) and demonstrate the reasonableness of their assumption of a common development pattern for all accident years. However, it is also true that examples can be found in the same data set that fail this condition (Taylor and Xu 2016). The example of Section 4.4 is of this type.
In short, although the lasso has been introduced into the loss reserving literature, it has as yet been applied to some of the less challenging problems. The present paper endeavors to fill that gap, illustrating that it can also be applied successfully to more complex loss reserving exercises.
The innovations of the paper to lasso implementation include:

the use of an individual claim data base, aggregated only to a low level at which little of the individual claim information is lost;

the incorporation of claim development patterns that vary substantially from one accident period to another;

the incorporation of claim finalization data;

the use of explanatory variables other than accident year, development year and payment year;

exploration of standardization of explanatory variables.
Section 3 describes the construction of a lasso model aimed at these objectives, after Section 2 deals with notational and other preliminaries. Section 4 illustrates numerically the application of the model to both synthetic and real data, and Section 5 discusses the prediction error associated with this type of forecasting. Section 6 examines a couple of possible areas of further investigation, and Section 7 closes with some concluding remarks.
2. Framework and notation
Many simple claim models use the conventional data triangle, in which some random variable of interest Y is labeled by accident period i=1,2,…,I and development period j=1,2,…,I−i+1. In this setup, the combination (i,j) is referred to as a cell of the triangle, and the quantity Y observed in this cell denoted Yij. The durations of the accident and development periods will be assumed equal, but need not be years. As a matter of notation, E[Yij]=μij, Var[Yij]=σ2ij. A realization of Yij will be denoted yij.
The real data set used in this paper consists of individual claim histories. These are capable of representation in the above triangular form but, for most purposes, this is unnecessary. However, some aggregation of claims occurs, as explained in Section 4.4.1, simply in order to reduce computation. The quantity of interest throughout will be individual claim size at claim finalization, converted to constant dollars on the basis of wage inflation. A claim often involves multiple payments on various dates. The indexation takes account of the different dates.
Claim size for individual claim k will be denoted Y[k]. A vector v[k] of labeling quantities will also be associated with claim k. These quantities may include i,j,t=i+j−1= calendar period in cell (i,j), and others.
One nonroutine quantity of interest later is operational time (OT) at the finalization of a claim. OT was introduced to the loss reserving literature by Reid (1978), and is discussed by Taylor (2000) and Taylor and McGuire (2016).
Let the OT for claim k be denoted τ[k], defined as follows. Assume that claim k belongs to accident period i[k], and that ˆNi[k] is an estimator of the number of claims incurred in this accident period. Let Fi[k]:[k] denote the number of claims from the accident period finalized up to and including claim k. Then τ[k]=Fi[k]:[k]ˆNi[k]. In other words, τ[k] is the proportion of claims from the same accident period as claim k that are finalized up to and including claim k.
There will be frequent reference to openended ramp functions. These are singleknot linear splines with zero gradient in the lefthand segment and unit gradient in the righthand segment. Let RK(x) denote the openended ramp function with knot at K. Then
RK(x)=max(0,x−K).
For a given condition c, define the indicator function I(c)=1 when c is true, and I(c)=0 when c is false.
3. Regularized regression
3.1. In general
Consider an ordinary least squares regression model
y=Xβ+ε
where y=(y1,…,yn)T is the response vector, β=(β1,…,βp)T is the parameter vector, ε=(ε1,…,εn)T is an error vector subject to ε∼N(0,I), and X is the n×p design matrix. The parameter vector β is estimated by that ˆβ which minimizes the loss function L(y;ˆβ)=(y−Xˆβ)2. It will be convenient to express this as L(y;ˆβ)=(‖y−Xˆβ‖2)2, where, for vector x with components xm, ‖x‖q denotes the Lqnorm (q≥1)
‖x‖q=(∑mxmq)1q
Regularized regression includes in the loss function a penalty for large regression coefficients (components of ˆβ). The regularized loss function is
L(y;ˆβ)=(‖y−Xˆβ‖2)2+λ(‖ˆβ‖q)q
for some selected q, where λ>0 is a tuning constant.
The generalization of (3.1) to a GLM assumes the form
y=h−1(Xβ)+ε
where h is the (possibly nonlinear) link function, h−1 operates componentwise on the vector Xβ, and ε may now be nonnormal, but still with E[ε]=0 and independence between observations still assumed. The parameter vector β is estimated by the maximum likelihood estimator ˆβ, which minimizes the loss function (negative loglikelihood)
L(y;ˆβ)=−n∑m=1l(ym;ˆβ)
where the summand is the loglikelihood of observation yi when β=ˆβ.
The regularized form of (3.4) is similar to (3.2):
L(y;ˆβ)=−n∑m=1l(ym;ˆβ)+λ(‖ˆβ‖q)q
The effect of variation of λ is as follows. As λ→0, (3.5) tends to revert to the unregularized form of GLM. As λ→∞, the penalty for any nonzero coefficient increases without limit, forcing the model towards one with a single coefficient, in which case the model consists of just an intercept term and all predictions are equal to ¯y, the mean response.
3.2. The lasso
3.2.1. Loss function
There are two recognizable special cases of (3.2):
(a) q=0. Here the second member of (3.2) reduces to just λ multiplied by the number of nonzero terms in the model. This is a computationally intractable problem, although it is noteworthy that setting λ=1 produces a metric equivalent to the Akaike information criterion.
(b) q=2. In this case, (3.2) is recognized as the ridge regression (Bibby and Toutenburg 1977) loss function. All included covariates will be nonzero in this approach.
A further case of interest arises when q=1. This is the least absolute shrinkage and selection operator (lasso) (Tibshirani 1996), which is the modeling device used throughout this paper. Its loss function (3.5) may be written explicitly as
L(y;ˆβ)=−n∑m=1l(ym;ˆβ)+λ‖ˆβ‖1=−n∑m=1l(ym;ˆβ)+λp∑r=1ˆβr
The appearance of absolute values of coefficients in the loss function will generate many corner solutions when that function is minimized. Hence, the covariates included in the model will be a subset of those included in the regression design, and the lasso acts (as its full name suggests) as both a selector of covariates and as a shrinker of coefficient size. Efficient algorithms such as least angle regression (Efron et al. 2004), which generates a path of solutions across all λ, make the lasso computationally feasible.
The strength of the shrinkage of covariate set can be controlled with the tuning parameter λ, as discussed at the end of Section 3.1, with the number of covariates decreasing as λ increases.
An obvious generalization of loss function (3.6) is
L(y;ˆβ)=−n∑m=1l(ym;ˆβ)+p∑r=1λrˆβr
where the λr(>0) may differ with r. This generalization will be discussed further in Sections 3.2.2 and 4.4.3.
3.2.2. Lasso interpretations
The lasso has been presented in Section 3.2.1 rather in heuristic terms. However, it is useful for some purposes (see, e.g., Section 4.4.3) to consider specific models for which the lasso is the optimal estimator in some sense. There are two main interpretations (Tibshirani 1996).
Bounded coefficients. Consider the optimization problem:
ˆβ=arg min∑pr=1βr≤Bn∑m=1l(ym;β)
i.e., maximum likelihood subject to the bound ∑pr=1βr≤B.
This problem is soluble by the method of Lagrange multipliers, then (3.6) is the Lagrangian, and λ the Lagrange multiplier. So the solution is as for the lasso.
It is straightforward to show by a parallel argument that, if the problem is modified to the following:
ˆβ=arg minβr≤Br,r=1,…,pn∑m=1l(ym;β)
then the solution is as for the lasso with loss function (3.7).
Bayesian interpretation. Suppose that each parameter βr,r=1,…,p is subject to a Bayesian prior that follows a centred Laplace distribution with density
π(βr)=(2s)−1exp(−βrs)
where s is a scale parameter. In fact, E[βr]=0,Var[βr]=2s2. All priors are supposed stochastically independent.
The posterior negative logdensity of β is
−n∑m=1l(ym;β)+s−1p∑r=1βr
up to a normalizing coefficient, and this is the same as (3.6) with λ=1s=(½Var[βr])−½. The lasso, in minimizing this function, maximizes the posterior density of β, so that ˆβ is the maximum a posteriori (MAP) estimator of β.
It is straightforward to show by a parallel argument that if the prior (3.8) is replaced by
π(βr)=(2sr)−1exp(−βrsr)
then the posterior negative logdensity of β becomes
−n∑