1. Introduction
The U.S. Federal Crop Insurance Program (FCIP) started as a federal support program for agriculture in the 1930s. Administered by the U.S. Department of Agriculture (USDA) Risk Management Agency (RMA), the program provides insurance coverage for losses due to various causes. Participation in the program has grown steadily since its inception thanks to legislative changes supporting premium subsidies and the introduction of new insurance products. The aggregate crop coverage level recently reached 74%, according to the Economic Research Service of the USDA.
Figure 1 shows a time series plot of the distribution of commodities covered under the FCIP. The primary commodity category is row crops, whose share has increased dramatically in recent years. Figure 2 shows the loss cost ratio (the ratio of losses to premium) over time. A loss cost ratio greater than 1 typically indicates that the insurer is losing money on underwriting. The figure shows that the loss cost ratio was very high until about the year 2000; legislative changes mandating crop insurance have helped bring the ratio down in more recent years.
The program costs of the FCIP are shown in Figure 3. One can see that premium subsidies have increased dramatically in recent years (presumably due to this cost’s correlation with the increase in liability). Premium subsidies have been the primary means of increasing program participation, and the reader may observe that the program does not come without costs. We hope that improvements in the rating engine can help alleviate the costs associated with the crop insurance program and improve its sustainability.
In actuarial science, we often assume that random variables are defined over positive real numbers. Some examples are the age of a policyholder, the total amount of expenses incurred due to a policy, or the total loss amount from an insured property. Such an assumption simplifies the actuarial theory, and it is a reasonable one to make. For example, we may use distributions that are defined over positive real numbers (take the gamma distribution, for example) to model such variables.
Yet when working with real data, we often discover data that are not consistent with our theories and intuitions. For example, messy real data may contain negative values for a variable that can theoretically only be positive. One example is the company expense variable in National Association of Insurance Commissioners data, where spurious negative values are observed. Another example is the indemnity amount variable in RMA crop insurance data, where negative indemnity amounts can arise when total production amounts are greater than the liability amounts. Formally, the liability amounts in this data set are defined by
$$\text{Liability}=\text{Acres planted}\times\text{Expected yield (Actual Production History (APH) yield)}\times\text{Selected coverage level}\times\text{Base price}\times\text{Price election percentage},$$
and the indemnity is defined as
$$\text{Indemnity}=\text{Liability}-\text{Value of production},$$
where the value of production is calculated by
$$\text{Value of production}=\text{Actual yield}\times\text{Base price}\times\text{Price election percentage}.$$
The premium paid is defined by
$$\text{Premium}=\text{Liability}\times\text{Rate}\times\text{Adjustment factor}.$$
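For concreteness, the short R sketch below applies these definitions with invented input values (none of these numbers come from the RMA data) and shows how a negative indemnity arises when the value of production exceeds the liability:

```r
# Hypothetical illustration of the liability, indemnity, and premium formulas;
# all input values below are invented for exposition.
acres          <- 100    # acres planted
aph_yield      <- 150    # expected (APH) yield per acre, in bushels
coverage_level <- 0.75   # selected coverage level
base_price     <- 4.00   # base price per bushel, in dollars
price_election <- 1.00   # price election percentage
rate           <- 0.05   # premium rate
adjustment     <- 1.00   # adjustment factor

liability <- acres * aph_yield * coverage_level * base_price * price_election
premium   <- liability * rate * adjustment

# Indemnity for two total-production scenarios (in bushels):
production    <- c(9000, 12000)
value_of_prod <- production * base_price * price_election
indemnity     <- liability - value_of_prod
cbind(production, value_of_prod, indemnity)
# The second scenario gives a negative indemnity (-3000), because the value of
# production exceeds the liability of 45000.
```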
Authors such as Shi (2012) and Yu and Zhang (2005) have modeled this type of response using an asymmetric Laplace distribution (ALD). They faced a problem, however: the ALD does not have a heavy enough tail to model the expense variable well. To work around that limitation, these authors relied on transformations of the expense variable before fitting the ALD.
In this paper, we revisit the illustrated problem and use an alternative approach to solve it. Specifically, we introduce a longer-tailed version of the ALD, which still allows the response to be negative yet fits the data better than the original ALD. Long-tailed modeling of insurance claims has been thoroughly researched in the literature. For example, Sun, Frees, and Rosenberg (2008) studied the properties of a generalized beta 2 (GB2) model to predict nursing home utilization. The GB2 model has since been a popular regression model for long-tailed insurance claims (see Frees and Valdez 2008; Frees, Shi, and Valdez 2009; or Frees, Lee, and Lu 2016).
An added feature of our approach is the ability to model data with high-dimensional features. For that reason, the paper is related to the high-dimensional data analysis literature in statistics. In such settings, estimation of the regression coefficients becomes a challenging task. The lasso approach to estimating model coefficients with high-dimensional features was introduced into the literature by Tibshirani (1996). The lasso adds an L1 penalty to the negative log-likelihood of a regression problem, shrinking some coefficients exactly to zero and thereby reducing the number of coefficients that must be estimated. In the statistics literature, lasso methods have been applied extensively to linear regression models, generalized linear models, and recently to the Tweedie distribution (see Qian, Yang, and Zou 2016). The Tweedie distribution is useful for modeling the pure premium of insurance losses, where the pure premium is the product of the frequency and severity means of the insurance claims.
Meanwhile, ridge regression—first introduced by Hoerl and Kennard (1970)—is a form of regularized regression, where an L2 penalty term is added to the regression objective function to prevent overfitting and improve the stability of the parameter estimates. Combining the lasso and ridge regression approaches, Zou and Hastie (2005) introduced the elastic net penalty, whereby they proposed a penalized regression approach that combines the L1 and L2 penalties. They demonstrated that the elastic net penalty is particularly effective in handling high-dimensional data with correlated predictors, where either L1 or L2 regularization alone may not perform well.
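To fix notation for later sections, a generic elastic net penalty on coefficients $\beta_1,\ldots,\beta_p$ can be written as follows (the symbol $P_{\lambda,\xi}$ is ours; the weighted version actually used in this paper appears in Section 3):

$$P_{\lambda,\xi}(\boldsymbol\beta)=\lambda\sum_{j=1}^{p}\left(\xi\,|\beta_j|+\frac{1-\xi}{2}\,\beta_j^2\right),\qquad \lambda\ge 0,\ \xi\in[0,1].$$

Setting $\xi=1$ recovers the lasso penalty, and $\xi=0$ recovers the ridge penalty.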
In this paper, we demonstrate that fast regression routines with elastic net penalties can be built for the modified ALD distribution. Thus the approach would be suitable for situations with a great many explanatory variables, as is the case in the data set used for our empirical study. The high dimensionality of the features results from there being many different categories for the commodity type, unit structure, coverage level percentage, insurance plan type, and coverage type.
The paper proceeds as follows: In Section 2, we summarize the data set used for the empirical analysis, to give the reader a sense of the problem at hand and the motivation for using the modified ALD distribution. In Section 3, we explain the details of the modified ALD distribution and the rationale behind its distributional form. Section 4 describes the details of the high-dimensional estimation approach for the modified ALD distribution. Section 5 describes a simulation study, in which the modified ALD is found to fit longer-tailed data better than the original ALD. In Section 6, we show the results of the estimation; there the reader will see that the approach with a lasso penalty outperforms the basic model without the penalty. Section 7 concludes the paper.
2. Data description
From the state/county/crop summary of business on the USDA RMA website, we use the type/practice/unit structure data files for our analyses. We use 2018 data to form the training sample (in-sample), and 2019 data for the validation sample (out-of-sample). Only Michigan data are used for the purposes of this paper. Within the data set, one may observe that there are different commodities grown at the unit level. As an example, Figure 4 shows the areas in which soybeans, cherries, and sugar beets are grown.
Along with the geographical locations and commodity types, the liability amounts and the indemnity amounts are recorded. A histogram of the indemnity amounts appears in the left panel of Figure 5. The right-hand panel of the figure shows a magnified version of the plot for the observations below zero. Here, the reader can clearly see that some of the indemnity values are negative. Note that for our analysis we retain only observations with nonzero liability values, because a meaningful indemnity value can arise only when there is a nonzero liability.
Within the state of Michigan data one sees various commodity types, with corn being the most popular, and soybeans the second most. Table 1 summarizes the number of farms, total liability, proportion of nonzero indemnity farms, and average indemnity amounts for each farm in the category. Note that the indemnity severities display significant commodity-level heterogeneity, with apples having the highest average severity.
Tables 2 through 5 show observed heterogeneity across unit structure, coverage level percentage, insurance plan category, and coverage-type code. These features of the data set motivate us to use these categories as rating variables in our regression model for the indemnity amount. The motivation behind lasso estimation is to reduce overfitting in the presence of many rating variables, so ours is precisely the situation in which a lasso estimation technique has advantages.
The coverage level percentage is an interesting variable in the data set. Different coverage levels correspond to different amounts of insurance purchased by the policyholder. Oftentimes, the amount of insurance purchased is used as a variable to detect adverse selection in the insurance market. In these data, the most common coverage percentages were the 50% and 75% levels, with total liabilities of $314 million and $518 million, respectively. The average size of the indemnity was largest in the 50% category, as can be seen in Table 3.
Table 4 shows the insurance plan categories. The most common insurance plan code is RP, which stands for “revenue protection,” where the policy protects the total revenue earned by the policyholder. The next most common is YP, or “yield protection,” where the yield from the commodity is protected instead of the revenue. The revenue protection category alone corresponds to $1.3 billion in total liability, and the yield protection $155 million. Note that the YDO (yield-based dollar amount) category corresponds to the highest average indemnity, and the DO (dollar amount) category the second highest. The insurance plan code exhibit available on RMA’s website defines the other insurance plan codes.
Table 5 shows summary statistics by coverage-type code. The C category stands for CAT (catastrophic) coverage, while all other policies are categorized as A (standing for buy-up) coverage. We observe that the CAT coverage policies have a lesser amount of total liability but a higher average indemnity.
3. Model
Consider the distribution function
$$F(y;\mu,\sigma,\tau)=\begin{cases}\tau\exp\left[-\dfrac{1-\tau}{\sigma}\log\left(\dfrac{1+\mu-y}{1+\mu}\right)\right], & y\le 0,\\[2ex]1-(1-\tau)\exp\left[-\dfrac{\tau}{\sigma}\log\left(\dfrac{1+\mu+y}{1+\mu}\right)\right], & y>0.\end{cases}\tag{1}$$

The reader may understand this as a modified version of the three-parameter asymmetric Laplace distribution. According to Koenker and Machado (1999), Yu and Moyeed (2001), Shi and Frees (2010), Shi (2012), and Yu and Zhang (2005), the distribution function for the three-parameter ALD is given by

$$F_{\mathrm{ALD}}(y;\mu,\sigma,\tau)=\begin{cases}\tau\exp\left[\dfrac{1-\tau}{\sigma}(y-\mu)\right], & y\le \mu,\\[2ex]1-(1-\tau)\exp\left[-\dfrac{\tau}{\sigma}(y-\mu)\right], & y>\mu.\end{cases}\tag{2}$$

Comparing equations (1) and (2), we see that the modified distribution in (1) looks similar to the unmodified version in (2) but differs in that it involves a logarithm: equation (2) involves the term $y-\mu$ in the exponent, whereas equation (1) involves $\log\big((1+\mu-y)/(1+\mu)\big)$ or $\log\big((1+\mu+y)/(1+\mu)\big)$, with the two branches split at zero rather than at $\mu$.
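To see why the logarithm produces a longer tail, note the right-tail survival functions implied by equations (1) and (2):

$$1-F(y;\mu,\sigma,\tau)=(1-\tau)\left(\frac{1+\mu+y}{1+\mu}\right)^{-\tau/\sigma}\ \text{for }y>0,\qquad 1-F_{\mathrm{ALD}}(y;\mu,\sigma,\tau)=(1-\tau)\,e^{-\tau(y-\mu)/\sigma}\ \text{for }y>\mu.$$

The modified distribution therefore has a Pareto-type (polynomial) right tail with index $\tau/\sigma$, whereas the original ALD has an exponentially decaying tail.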
The motivation for the new distribution is this: When the ALD is used for a regression analysis of heavily skewed insurance loss data, the fit of the distribution is not as good as one hopes. For that reason, Shi and Frees (2010) and Shi (2012) recommend either rescaling or transforming the response before fitting the distribution. The transformation approach introduces an additional parameter, which needs to be estimated, and the analysis becomes somewhat more complicated. The approach is an elegant workaround, but a practicing actuary may find it desirable to use a simpler model that fits the response directly.
So in this paper, instead of using the transformation approach, we propose the alternative distribution function in equation (1). In the distribution, the logarithmic function handles the skewness in the response variable. Notice that if $y=0$, then the fraction inside the logarithm becomes 1, making the entire term inside the exponential function zero. The density may be obtained by differentiating equation (1):

$$f(y;\mu,\sigma,\tau)=\begin{cases}\dfrac{\tau(1-\tau)}{\sigma(1+\mu-y)}\exp\left[-\dfrac{1-\tau}{\sigma}\log\left(\dfrac{1+\mu-y}{1+\mu}\right)\right], & y\le 0,\\[2ex]\dfrac{\tau(1-\tau)}{\sigma(1+\mu+y)}\exp\left[-\dfrac{\tau}{\sigma}\log\left(\dfrac{1+\mu+y}{1+\mu}\right)\right], & y>0.\end{cases}\tag{3}$$
Clearly, the expression in equation (3) is a density, because $F(y;\mu,\sigma,\tau)\to 0$ as $y\to-\infty$ and $F(y;\mu,\sigma,\tau)\to 1$ as $y\to\infty$. Also, the density is continuous at zero by design. Compare this with the original density function of the ALD:

$$f_{\mathrm{ALD}}(y;\mu,\sigma,\tau)=\frac{\tau(1-\tau)}{\sigma}\exp\left(-\frac{y-\mu}{\sigma}\left[\tau-I(y\le\mu)\right]\right).\tag{4}$$
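For illustration, a minimal R transcription of the two density functions in equations (3) and (4) is given below; the function names dmald and dald are ours:

```r
# Density of the modified ALD in equation (3); the function name dmald is ours.
dmald <- function(y, mu, sigma, tau) {
  out <- numeric(length(y))
  neg <- y <= 0
  out[neg] <- tau * (1 - tau) / (sigma * (1 + mu - y[neg])) *
    exp(-(1 - tau) / sigma * log((1 + mu - y[neg]) / (1 + mu)))
  out[!neg] <- tau * (1 - tau) / (sigma * (1 + mu + y[!neg])) *
    exp(-tau / sigma * log((1 + mu + y[!neg]) / (1 + mu)))
  out
}

# Density of the original three-parameter ALD in equation (4).
dald <- function(y, mu, sigma, tau) {
  tau * (1 - tau) / sigma * exp(-(y - mu) / sigma * (tau - (y <= mu)))
}

# Comparing the right tails at a few points shows the slower (polynomial) decay
# of the modified density relative to the exponential decay of the original.
y <- c(-5, 0, 5, 50, 500)
cbind(y, modified = dmald(y, mu = 1, sigma = 1, tau = 0.5),
         original = dald(y, mu = 1, sigma = 1, tau = 0.5))
```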
Figure 6 shows the density function for selected parameter values. Notice that when $\tau/\sigma$ is small, the right tail of the density decays like a power function with exponent $1+\tau/\sigma$, so the density becomes highly skewed, allowing for a potentially good fit for heavily skewed insurance loss amounts. The density also has the desired property of being defined for all real $y$, allowing for negative indemnity amounts.
For regression purposes, we parametrize $\mu$ as a linear combination of covariates, with a log link. Specifically, let $\mu_i=\exp(\beta_0+\mathbf{x}_i'\boldsymbol\beta)$, where $\mathbf{x}_i$ is the vector of covariates for policyholder $i$, and where $\beta_0$ and $\boldsymbol\beta=(\beta_1,\ldots,\beta_p)'$ are regression coefficients to be estimated. Also, let $\sigma=\exp(s)$ and $\tau=\exp(t)/(1+\exp(t))$, with $s,t\in\mathbb{R}$. Then, for a given policyholder $i$, define

$$\Phi_i=\Phi(y_i,\beta_0,\boldsymbol\beta,s,t)=-\log f(y_i;\mu_i,\sigma,\tau)=-\log f\!\left(y_i;\exp(\beta_0+\mathbf{x}_i'\boldsymbol\beta),\exp(s),\frac{\exp(t)}{1+\exp(t)}\right),$$

where $f$ is the density in equation (3), for $i=1,\ldots,n$. In order to reduce overfitting by inducing zeros into the coefficients, we seek to find

$$\hat{\boldsymbol\theta}=(\hat\beta_0,\hat{\boldsymbol\beta}^T,\hat s,\hat t)^T=\underset{\beta_0,\boldsymbol\beta,s,t}{\operatorname{argmin}}\left[\sum_{i=1}^n\Phi_i+\lambda\sum_{j=1}^p\left(\xi w_j|\beta_j|+\frac{1}{2}(1-\xi)\beta_j^2\right)\right],\tag{5}$$

where $\lambda\ge 0$ is a tuning parameter, the $w_j\ge 0$ are weights, and $\xi\in[0,1]$ defines an elastic net penalty. If $\xi=1$, then the penalty term reduces to a lasso penalty, but if $\xi=0$, then the penalty becomes a ridge penalty.

4. Estimation
For the density in equation (3), the explicit form of the contribution $\Phi_i$ of each policy to the negative log-likelihood is given in the appendix. For this specific form of $\Phi_i$, it is simple to obtain the gradient and Hessian with respect to the parameter vector $\boldsymbol\theta=(\beta_0,\boldsymbol\beta^T,s,t)^T$, and those derivations are shown in the appendix as well. The estimation is performed iteratively, starting from an initial estimate of the regression coefficients and shape parameters. Assume we have a current estimate $\tilde{\boldsymbol\theta}$ and would like to update it. We take the second-order Taylor series approximation of the negative log-likelihood around the point $\tilde{\boldsymbol\theta}$:

$$\ell_Q(\boldsymbol\theta)=\ell(\tilde{\boldsymbol\theta})+\sum_{i=1}^n\left(\frac{\partial\tilde\Phi_i}{\partial\boldsymbol\theta}\right)^T(\boldsymbol\theta-\tilde{\boldsymbol\theta})+\frac{1}{2}\sum_{i=1}^n(\boldsymbol\theta-\tilde{\boldsymbol\theta})^T\left(\frac{\partial^2\tilde\Phi_i}{\partial\boldsymbol\theta^2}\right)(\boldsymbol\theta-\tilde{\boldsymbol\theta}).\tag{6}$$

Then, after some algebra, we can verify that the gradient of $\ell_Q$ has the following form:

$$\frac{\partial\ell_Q}{\partial\boldsymbol\theta}=\sum_{i=1}^n\frac{\partial\tilde\Phi_i}{\partial\boldsymbol\theta}+\sum_{i=1}^n\left(\frac{\partial^2\tilde\Phi_i}{\partial\boldsymbol\theta^2}\right)(\boldsymbol\theta-\tilde{\boldsymbol\theta}).$$
With this expression, an inner loop is constructed so that the current estimate $\breve{\boldsymbol\theta}$ is updated to $\breve{\boldsymbol\theta}_{(\mathrm{new})}$ in each iteration. First, we attempt to perform the following update for each regression coefficient $\beta_j$ within the inner loop:

$$\breve\beta_{j,(\mathrm{new})}=\underset{\beta_j}{\operatorname{argmin}}\left[\ell_Q(\breve{\boldsymbol\theta})+\breve U_{Q,j}(\beta_j-\breve\beta_j)+\frac{\tilde\gamma_{Q,j}}{2}(\beta_j-\breve\beta_j)^2+\lambda\left(\xi w_j|\beta_j|+\frac{1}{2}(1-\xi)\beta_j^2\right)\right],\tag{7}$$
where

$$\breve U_{Q,j}=\frac{\partial\ell_Q}{\partial\beta_j}=\sum_{i=1}^n\breve g_{1,i}\,x_{ij}\tag{8}$$

and

$$\tilde\gamma_{Q,j}=\frac{\partial^2\ell_Q}{\partial\beta_j^2}=\sum_{i=1}^n\tilde\Phi_{11,i}\,x_{ij}^2,\tag{9}$$

with $\breve g_{1,i}$ and $\tilde\Phi_{11,i}$ denoting first- and second-order derivative terms of $\Phi_i$ (defined in the appendix), evaluated at the current estimates.
The problem in equation (7) is a one-dimensional one, whose first-order condition gives the following:
$$\breve\beta_{j,(\mathrm{new})}=\frac{\left(\tilde\gamma_{Q,j}\breve\beta_j-\breve U_{Q,j}\right)\left(1-\dfrac{\lambda\xi w_j}{\left|\tilde\gamma_{Q,j}\breve\beta_j-\breve U_{Q,j}\right|}\right)_+}{\tilde\gamma_{Q,j}+\lambda(1-\xi)},\tag{10}$$

where $(\cdot)_+$ denotes the positive part.
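For concreteness, a minimal R sketch of this coordinate-wise update is given below, assuming the gradient, curvature, and penalty settings for coordinate $j$ have already been computed (the helper name update_beta_j is ours):

```r
# Elastic net coordinate update from equation (10), written with the usual
# soft-threshold operator S(z, a) = sign(z) * max(|z| - a, 0).
update_beta_j <- function(beta_j, U, gamma, lambda, xi, w_j) {
  z    <- gamma * beta_j - U                            # working quantity for coordinate j
  soft <- sign(z) * max(abs(z) - lambda * xi * w_j, 0)  # lasso part: soft-thresholding
  soft / (gamma + lambda * (1 - xi))                    # ridge part: extra shrinkage
}

# Example with made-up inputs: the working quantity falls below the threshold,
# so the coefficient is set exactly to zero.
update_beta_j(beta_j = 0.1, U = 0.05, gamma = 2, lambda = 1, xi = 0.5, w_j = 1)
```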
The reason the first-order condition gives this update scheme is explained clearly in the lasso literature. The reader is referred to Tibshirani (1996), Yang and Zou (2015), and Qian, Yang, and Zou (2016) for the details. For the intercept and the two shape parameters, we attempt to solve as follows:
$$\left(\breve\beta_{0,(\mathrm{new})},\breve s_{(\mathrm{new})},\breve t_{(\mathrm{new})}\right)=\underset{\beta_0,s,t}{\operatorname{argmin}}\left[\ell_Q(\breve{\boldsymbol\theta})+\breve{\mathbf U}_{Q,\mathrm{shape}}^T\begin{pmatrix}\beta_0-\breve\beta_0\\ s-\breve s\\ t-\breve t\end{pmatrix}+\frac{\tilde\gamma_{Q,\mathrm{shape}}}{2}\begin{pmatrix}\beta_0-\breve\beta_0\\ s-\breve s\\ t-\breve t\end{pmatrix}^T\begin{pmatrix}\beta_0-\breve\beta_0\\ s-\breve s\\ t-\breve t\end{pmatrix}\right],\tag{11}$$
where
$$\breve{\mathbf U}_{Q,\mathrm{shape}}=\left(\frac{\partial\ell_Q(\breve{\boldsymbol\theta})}{\partial\beta_0},\frac{\partial\ell_Q(\breve{\boldsymbol\theta})}{\partial s},\frac{\partial\ell_Q(\breve{\boldsymbol\theta})}{\partial t}\right)^T\tag{12}$$
and $\tilde\gamma_{Q,\mathrm{shape}}$ is the largest eigenvalue of the matrix

$$\mathbf H_{Q,\mathrm{shape}}=\sum_{i=1}^n\frac{\partial^2\tilde\Phi_i}{\partial\boldsymbol\theta^2}.\tag{13}$$
With these, the update for the intercept and the shape parameters is given by
$$\begin{pmatrix}\breve\beta_{0,(\mathrm{new})}\\ \breve s_{(\mathrm{new})}\\ \breve t_{(\mathrm{new})}\end{pmatrix}=\begin{pmatrix}\breve\beta_0\\ \breve s\\ \breve t\end{pmatrix}-\tilde\gamma_{Q,\mathrm{shape}}^{-1}\,\breve{\mathbf U}_{Q,\mathrm{shape}}.\tag{14}$$
The reader may wonder why we multiply by the inverse of the largest eigenvalue rather than by the inverse of the Hessian itself. This approach is called majorization minimization, as explained in Lange, Hunter, and Yang (2000), Hunter and Lange (2004), Wu and Lange (2010), and Yang and Zou (2015). In a nutshell, the majorization minimization approach constructs a stable algorithm by iteratively minimizing a function that majorizes (lies above) the original objective function. Computing the largest eigenvalue takes some time, but nowadays there are fast library routines to perform this task.
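As a small illustration of this majorization step, suppose the $3\times 3$ Hessian block and gradient for $(\beta_0,s,t)$ have already been computed (the objects H and U below are made up for the example); then equations (13) and (14) amount to the following:

```r
# Majorization-minimization update for (beta0, s, t) per equations (13)-(14);
# H and U are made-up stand-ins for the precomputed Hessian block and gradient.
H <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)
U <- c(0.2, -0.1, 0.05)
theta_shape <- c(0, 0, 0)   # current values of (beta0, s, t)

gamma_shape <- max(eigen(H, symmetric = TRUE, only.values = TRUE)$values)  # eq. (13)
theta_shape - U / gamma_shape                                              # eq. (14)
```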
This results in Algorithm 1. Given a raw design matrix, in order for the algorithm to run smoothly, one must scale the design matrix using the method described in the appendix, so as to obtain a standardized design matrix whose columns have mean zero and unit standard deviation. If there are zero columns in the raw design matrix, then the standard deviation of the entries in those columns is zero (see the appendix for details). For those columns, we set the standardized entries to zero for every observation. When running Algorithm 1, these zero columns result in zero coefficient values, so for these columns we set the coefficient to zero, which is just the initial value for the $\beta_j$'s.

Algorithm 1:

- Initialize $\tilde{\boldsymbol\theta}$ and set $\breve{\boldsymbol\theta}$ equal to it.
- Repeat the following until convergence of $\tilde{\boldsymbol\theta}$:
  - Repeat the following for $j=1,\ldots,p$:
    - Compute $\tilde\gamma_{Q,j}$ from equation (9).
    - Repeat the following until convergence of $\breve\beta_j$:
      - Compute $\breve U_{Q,j}$ using equation (8).
      - Update $\breve\beta_j$ using equation (10).
    - Set $\tilde\beta_j=\breve\beta_j$.
  - Compute $\tilde\gamma_{Q,\mathrm{shape}}$ from equation (13).
  - Repeat the following until convergence of $(\breve\beta_0,\breve s,\breve t)$:
    - Compute $\breve{\mathbf U}_{Q,\mathrm{shape}}$ using equation (12).
    - Update $(\breve\beta_0,\breve s,\breve t)$ using equation (14).
  - Set $\tilde{\boldsymbol\theta}=\breve{\boldsymbol\theta}$.
Implementation of this algorithm is a simple but tedious exercise in coding. Writing the routine in R results in a slow algorithm, because of the double looping and the computation of the eigenvalue. For that reason, we recommend using a lower-level language such as Fortran or C++ to implement it. The results we present in Section 6 are obtained using an implementation of Algorithm 1 in C++. Nowadays the Rcpp library allows for convenient integration of C++ code into the R environment, so that the routine can be implemented in C++ but called from within R.
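As a sketch of that workflow, the C++ routine can be compiled and called from R roughly as follows; the file name mald_enet.cpp and the exported function fit_mald_enet() are placeholders, not part of an existing package:

```r
# Hypothetical Rcpp workflow: compile a C++ implementation of Algorithm 1 and
# call it from R. File and function names are placeholders.
library(Rcpp)
sourceCpp("mald_enet.cpp")            # compiles and exposes fit_mald_enet()

fit <- fit_mald_enet(X = X_train,     # scaled design matrix
                     y = y_train,     # indemnity severities
                     lambda = 0.01,   # tuning parameter
                     xi = 0.1)        # elastic net mixing parameter
```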
5. Simulation
This section describes a simulation study showing that for long-tailed data the modified ALD fits the data better than the original ALD. For the study, synthetic data were generated from a Lomax (Pareto Type II) distribution, with the scale parameter fixed to 10 and the shape parameter varying from 10 down to 2. As the shape parameter decreases, the tail of the data becomes heavier and heavier. Both the ALD and the modified ALD distributions are fit to the data using maximum likelihood estimation (MLE), and the Q–Q plots for the fit are shown. Figure 7 shows the case when the true shape parameter for the data is 10, and Figure 8 shows the case when the shape parameter is 2 (the heavy-tailed case).
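A minimal R sketch of this exercise is given below; the Lomax draws use inverse transform sampling, and the fitting step via optim is our own simplified stand-in for the MLE routine used in the study:

```r
# Simulate Lomax (Pareto Type II) data and fit the modified ALD by numerically
# maximizing its log-likelihood; a simplified stand-in for the study's MLE step.
set.seed(1)
rlomax <- function(n, scale, shape) scale * ((1 - runif(n))^(-1 / shape) - 1)
y <- rlomax(2000, scale = 10, shape = 2)   # heavy-tailed case shown in Figure 8

# Negative log-likelihood of the modified ALD (equation (3)), with
# mu = exp(m), sigma = exp(s), tau = plogis(t) to keep parameters in range.
negloglik <- function(par, y) {
  mu <- exp(par[1]); sigma <- exp(par[2]); tau <- plogis(par[3])
  pos <- y > 0
  ll <- numeric(length(y))
  ll[!pos] <- log(tau * (1 - tau)) - log(sigma * (1 + mu - y[!pos])) -
    (1 - tau) / sigma * log((1 + mu - y[!pos]) / (1 + mu))
  ll[pos] <- log(tau * (1 - tau)) - log(sigma * (1 + mu + y[pos])) -
    tau / sigma * log((1 + mu + y[pos]) / (1 + mu))
  -sum(ll)
}

fit <- optim(c(0, 0, 0), negloglik, y = y)
c(mu = exp(fit$par[1]), sigma = exp(fit$par[2]), tau = plogis(fit$par[3]))
```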
From the figures, one can see that the modified ALD fits the data better than the original ALD, especially when the data become heavy-tailed. Although not shown here, we ran the simulation using several other shape parameters, and observed that the fit for the original ALD suffers more and more as the data become more heavy-tailed. This shows precisely when the modified ALD outperforms the original ALD. Note that we have not transformed the data at all, and we are directly modeling the response using the distributions.
As the shape parameter for the Lomax distribution decreased below 1.1 (with values less than 1 formally constituting the heavy-tailed case), the original ALD distribution started to produce error messages during MLE estimation. This indicates that when the data are too heavy-tailed, one cannot use the original ALD distribution without transforming the data.
6. Empirical Results
The model described in Section 3 is estimated using the approach in Section 4, with $w_j=1$ for every $j$ and a small value of $\xi$, in order to estimate the severity scores. By setting $w_j=1$, the same weight is applied to all of the $\beta_j$'s. We believe this simplifies the analysis, and the results are not influenced much for our purposes. One motive for selecting a small value for $\xi$, as in this paper, could be ease of computation, since the ridge penalty is easier to optimize. We made the choice after trial and error, attempting to find a value that generates reasonable predictions. The optimal $\lambda$ is selected by choosing the one that gives the fitted values with the smallest mean squared error (MSE) among a sequence of candidate $\lambda$'s. In addition to the regularized indemnity model, a simple logistic regression model is used to predict the indicator of indemnities. The overall indemnity score is obtained by multiplying the predicted indemnity indicator by the severity score predicted by the algorithm.

Because the modified ALD distribution is long-tailed and may not have a finite mean, we cannot rely on the mean for prediction purposes. Instead, we use the lowest-MSE quantile as the predictor for the indemnity severities. For this, we need to determine which quantile gives the lowest MSE; to do so, we rely on the in-sample MSEs and select the quantile that minimizes them. The out-of-sample predictions are then obtained at that quantile level.
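For reference, the quantile function follows from inverting equation (1); for a level $q\in(0,1)$,

$$F^{-1}(q;\mu,\sigma,\tau)=\begin{cases}(1+\mu)\left[1-\left(\dfrac{q}{\tau}\right)^{-\sigma/(1-\tau)}\right], & q\le\tau,\\[2ex](1+\mu)\left[\left(\dfrac{1-q}{1-\tau}\right)^{-\sigma/\tau}-1\right], & q>\tau.\end{cases}$$

This expression, derived here directly from equation (1), can be evaluated at the selected quantile level (with the fitted $\mu_i$, $\sigma$, and $\tau$) to produce the quantile predictors described above.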
For comparison, we predict the severity score using a simple MLE regression model (without the lasso penalty) and multiply it by the same predicted indemnity indicator. The indemnity score obtained in this way is called the base model score. The out-of-sample correlations with the actual indemnities are then computed and compared. The result is shown in Table 6.
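Once the out-of-sample scores are in hand, the comparison reduces to rank correlations, as in the short R sketch below (the object names are placeholders for the 2019 validation data):

```r
# Spearman rank correlations of the 2019 indemnities with the two scores;
# object names are placeholders.
cor(actual_indemnity, score_elastic_net, method = "spearman")
cor(actual_indemnity, score_base,        method = "spearman")
```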
Table 6 shows that the out-of-sample Spearman correlation improves with the elastic net approach. Thus, we have effectively modeled a response variable with possible negative indemnity values for a ratemaking application, using the modified ALD distribution.
Note that the premiums had a Spearman correlation of 48.63% with the out-of-sample indemnities. This means that, as a ranking of the out-of-sample indemnities, our ALD approach outperforms the premiums recorded in the RMA data.
7. Conclusion
In this paper, we present a solution to an indemnity modeling problem in the form of a case study using the public RMA data set. The problem we address is that in some data sets, such as the public RMA data set, indemnity amounts may be negative if production levels exceed the liability amounts. The modified ALD distribution gives us an easy way to model the response under such circumstances. In addition, the elastic net approach allows us to incorporate high-dimensional features into our model—the high dimensionality arises because there are many categories for the explanatory variables for the farms.
Compared with other approaches to modeling responses with negative values, the approach introduced here is simple and flexible enough to be used widely in practice. Future work may further analyze the properties of the modified ALD distribution. Application of the modified ALD distribution to the insurance company expense modeling problem may constitute another interesting avenue for future work.