In this paper, we propose a market-based model selection method that focuses on the goal of the prediction exercise: optimally pricing auto policies. To show the potential benefits of this method, we apply it in a commercial auto insurance ratemaking exercise. We fit many different predictive models to a data set of more than 77,000 commercial auto policies and compare the models' performance.
The ratemaking literature is vast. When modeling pure premium in a property/casualty setting, there are two main options supported by different streams of research. One option is to model the pure premium directly. Observed pure premium has a point mass at zero (for all the policies with no claims), while the positive values are relatively continuous. The Tweedie distribution (Tweedie 1984) is commonly chosen in this setting because (for a power parameter between 1 and 2) it has a point mass at zero and continuous support over the positive reals. Jørgensen and Paes De Souza (1994) first applied the Tweedie distribution to insurance claim data as a generalized linear model (GLM). The other option is to model the frequency (count of claims per policy) and severity (cost per claim) separately and then combine them to get the total cost. Some more recent work (Shi 2016) also incorporates the type of claim (liability or physical damage, for example). We model frequency, severity, and pure premium, but focus on pure premium for our market-based method; the method can also be applied to frequency and severity models but is most natural when comparing pure premium models.
After choosing how to model the total losses, we compare the different available models. When examining many potential ratemaking models, it can be difficult to choose the best one. Some models are more accurate by one measure, while other models are more accurate under a different measure. Even if a model is consistently more accurate, what if it is much more computationally intense, or if its optimization is more unstable? In this paper, we propose a market-based model comparison technique that can give the user a better understanding of the potential impact of different ratemaking models on the results. We compare eight different models of pure premium, both using standard accuracy and complexity measures and using our market-based model comparison metric. The market-based metric provides a much clearer picture of the differences between the models.
Many papers use likelihood-based methods, such as the Akaike information criterion or Bayesian information criterion to choose the best models (Bermúdez and Karlis 2011; Shi 2016; Tang, Xiang, and Zhu 2014; Yip and Yau 2005). Others use prediction error metrics, such as mean squared or median absolute prediction error, or the Gini index (e.g., Frees, Meyers, and Cummings 2014; Guelman 2012). Still others use both (e.g., Klein et al. 2014). Our market-based model selection method provides additional information to help answer the questions of which model is best and how much better it is.
In the next section, we outline market-based model selection and the other comparable metrics. In Section 3, we describe our data. In Section 4, we outline the predictive models that we use for comparison. In Section 5, we detail the results of our analysis. In Section 6, we conclude and discuss limitations and next steps.
2. Market-Based Model Selection Metric
The main contribution of this paper is a market-based model selection metric. Imagine that each of the candidate models is an insurance provider. Each of the pure premium predictions is the price the provider offers for each policy. The company with the lowest price earns the policy. We then compare market share and loss ratios for the providers to see which model best differentiates the good risks from the poor ones. For example, we compare five different models and their (cross-validated) predictions for a single observation:
Because Model 2 has the lowest prediction, it writes the policy: it books 27 in premium and 24 in losses. All of the other models are unaffected. This outcome does not depend on the actual earned premium for this policy in our data. The process is repeated for all of the observations. Then, to compare the models, we compute each model's loss ratio (total losses/total premium written) and market share (policies written by the model/total policies). We repeat this comparison for each of the 100 cross-validation iterations (described below) to understand the uncertainty.
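The mechanics of the simulation can be sketched in a few lines. The following is an illustrative implementation only, not code from the paper; the function and variable names are ours, and we use Python here purely for exposition (the paper's models are fitted in R):

```python
from collections import defaultdict

def market_simulation(predictions, losses):
    """Run the market-based model comparison described above.

    predictions: dict mapping model name -> list of predicted pure
                 premiums, one (out-of-sample) prediction per policy.
    losses: list of actual losses, one per policy.
    Returns: dict mapping model name -> (market_share, loss_ratio).
    """
    models = list(predictions)
    n = len(losses)
    premium = defaultdict(float)  # premium booked by each model
    paid = defaultdict(float)     # losses booked by each model
    wins = defaultdict(int)       # policies written by each model

    for i in range(n):
        # The model offering the lowest price writes the policy
        # (ties go to the first model listed).
        winner = min(models, key=lambda m: predictions[m][i])
        wins[winner] += 1
        premium[winner] += predictions[winner][i]
        paid[winner] += losses[i]

    return {
        m: (wins[m] / n,
            paid[m] / premium[m] if premium[m] > 0 else float("nan"))
        for m in models
    }
```

For example, with two models competing over two policies, each model writes the policy it prices lowest, and its loss ratio is computed only over the policies it wins.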
In addition to the market-based model comparison, we compare the models using median absolute prediction error (MAPE), the median of the absolute differences between the predicted and actual pure premium; the mean squared prediction error (MSPE), the mean of the squared differences; and the model’s runtime.
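For concreteness, the two prediction-error metrics can be written as follows (a minimal Python sketch; the function names are ours):

```python
import statistics

def median_absolute_prediction_error(predicted, actual):
    """MAPE as defined above: the median of |predicted - actual|."""
    return statistics.median(abs(p - a) for p, a in zip(predicted, actual))

def mean_squared_prediction_error(predicted, actual):
    """MSPE: the mean of (predicted - actual)^2.

    Squaring places much more weight on individual large misses than
    the median absolute error does.
    """
    return statistics.fmean((p - a) ** 2 for p, a in zip(predicted, actual))
```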
All of the computing was performed on an Intel Xeon E5-2689 v4 (3.10GHz) with 192 GB of RAM and 40 available threads. To get a better understanding of the prediction metrics, we ran the cross-validation 100 times. Intervals of the middle 95 iterations are also displayed.
3. Commercial Ratemaking Data
Our data set consists of bodily injury and property damage claims from 77,229 commercial auto policyholders in California in 1999. We use the following covariates to model frequency, severity, and pure premium:
Program (standard, substandard, assigned risk, direct excess, and motorcycle)
Good driver discount (true/false)
Limits (basic, greater than basic)
Coverage (bodily injury or property damage)
Policy-level covariates (five, named Var1 through Var5, including some continuous and some categorical)
Some of the policy-level variables are included not because we think it is likely they will accurately predict losses but, rather, because we hope that they will be removed from the model (or their impact will be minimized) by some of the methods we try below. In this paper, we are mainly interested in different model families in order to measure which perform best. Discussion of the chosen covariates (and their impact on losses) is deliberately avoided to maintain that focus.
4. Predictive Models
As mentioned in the previous section, we model frequency, severity, and pure premium. We chose the following models after reviewing the most common ratemaking models in the literature. We hope to focus on the model comparison process and also make our models easily replicable by practitioners who would like to apply our methods in their work. To that end, we will use only models that have a published R package. The methods we use in this paper are detailed in the remainder of this section.
Basic GLMs. Generalized linear models are a natural starting point; they extend the linear model to settings where Gaussian additive errors do not make sense. Using the glm function in base R, we fit basic GLMs as a benchmark for all of the loss metrics. For frequency, we use a Poisson GLM with an exposure offset. For severity, we use a gamma GLM. Finally, for pure premium, we use a Tweedie GLM with an exposure offset. For more information about GLMs, please see Nelder and Wedderburn (1972).
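In symbols, the frequency and pure premium models both enter exposure through an offset on the log link. For policy $i$ with exposure $e_i$ and covariates $x_i$, the Poisson frequency model is (standard GLM notation, not taken verbatim from the paper):

```latex
% Poisson frequency GLM with exposure offset (log link)
\log \mathbb{E}[N_i] = \log e_i + x_i^{\top} \beta
% The gamma severity GLM uses the same linear predictor without an
% offset; the Tweedie pure premium GLM likewise includes \log e_i.
```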
Bayesian GLMs. While the glm function chooses parameters through likelihood-based optimization, Bayesian GLMs place a prior distribution on the parameters and then sample from their posterior distributions using Markov chain Monte Carlo. These models are ideal for incorporating expert information not contained in the data and for properly accounting for all types of uncertainty. In this application, we want a relatively fair comparison between our Bayesian and frequentist models, so we use basic noninformative priors on all the parameters. Using the arm package, we fit basic Bayesian GLMs for all of the loss metrics. As with the basic GLMs, we use Poisson (with an exposure offset) for frequency, gamma for severity, and Tweedie (with an exposure offset) for pure premium. For more details, please see Dey, Ghosh, and Mallick (2000).
GLMMs. Using the lme4 package, we fit generalized linear mixed models (GLMMs) for all of the loss metrics. The main reason to model a covariate as a random effect instead of a fixed effect is to take advantage of partial pooling (Gelman and Hill 2006): when a categorical variable has levels with only a few observations, a random effect shrinks those levels' coefficient estimates toward the estimates for the other levels. As in the GLMs, we use Poisson for frequency, gamma for severity, and Tweedie for pure premium. We first tried to fit all the categorical variables (program, good driver discount, limits, coverage, and city type) as fixed effects and all the continuous variables as random effects, but the likelihood function became so flat that the model was unable to converge. We then simplified the model, keeping all of the fixed effects but treating program as a random effect. This is important to note when reading the results section: the model we use fits slightly faster, and may perform more poorly than the other models, because of the limited covariates. For more information on GLMMs, please see McCulloch and Searle (2004).
Regularization (LASSO, elastic net, and ridge). Regularized methods aim to automatically find the balance between overly complex and overly simple models. In all cases—least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), elastic net (Zou and Hastie 2005), and ridge (Hoerl and Kennard 1970)—the models are penalized for having large regression coefficients. This makes the models as simple as they can be, but no simpler. If the signal supporting a large regression coefficient is strong enough, it will overwhelm the penalty term and remain in the model. Elastic net and LASSO can also perform variable selection by penalizing the coefficients of the insignificant variables all the way to zero.
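All three penalties can be expressed as a single objective with a mixing parameter $\alpha$; this is the elastic net family, in which $\alpha = 1$ recovers the LASSO and $\alpha = 0$ recovers ridge regression (standard notation, following the formulation used by glmnet):

```latex
% Penalized negative log-likelihood; \alpha = 1 is LASSO, \alpha = 0 is ridge
\hat{\beta} = \arg\min_{\beta}\;
  -\ell(\beta)
  + \lambda \left[ \alpha \lVert \beta \rVert_{1}
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_{2}^{2} \right]
```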
Using the glmnet package, we fit Poisson LASSO, elastic net, and ridge regression models on the frequency data with an exposure offset. To fit the severity and pure premium models (gamma and Tweedie with an exposure offset), we use the HDTweedie package. When testing and scrutinizing the HDTweedie package, however, we found some concerning results. Setting the shrinkage parameter (λ) to 0 removes the regularization from the model, making it a standard GLM. With the power parameter set to 1 (implying a Poisson distribution), all of the HDTweedie parameter estimates match those from glm; but when we adjust the power parameter, the estimates no longer match, though they should. We decided to include these results in the paper, but they should be taken with a grain of salt.
Spike and slab. As an alternative to regularized regression for determining which variables are important, we fit a Bayesian spike and slab model. A standard regression model depends on some function of ∑_k X_ik β_k for observation i. The spike and slab model (Kuo and Mallick 1998) adds a binary inclusion variable to that linear predictor, giving ∑_k γ_k X_ik β_k: when γ_k = 1, X_k is included in the model, and when γ_k = 0, X_k is left out. The model provides a posterior probability that β_k is significantly different from 0 (the probability that γ_k = 1). In contrast to the other regularization methods above, the spike and slab model shrinks the variables that stay in the model very little, but it shrinks the others all the way to zero.
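Written out, the spike and slab model augments the standard linear predictor with inclusion indicators (standard notation, following Kuo and Mallick 1998):

```latex
% Standard linear predictor vs. spike-and-slab predictor for observation i
\eta_i = \sum_k X_{ik} \beta_k
\qquad \longrightarrow \qquad
\eta_i = \sum_k \gamma_k X_{ik} \beta_k,
\quad \gamma_k \in \{0, 1\}
% P(\gamma_k = 1 \mid \text{data}) is the posterior probability that
% X_k belongs in the model.
```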
To fit the spike and slab model, we use the SpikeSlabGAM package. That package implements the Poisson model, but not the gamma or Tweedie models. The boral package states that it implements Tweedie spike and slab models, but we were unable to get reasonable results. Because of this, we decided to use the spike and slab model only in the frequency predictions.
Random forest. Random forest models are an ensemble of many relatively simple models run on subsets of the data. By combining the estimates from many simple models, we can get better results than those from a single complicated model. Each of the individual models is a regression tree (Breiman et al. 1984), built on a random sample of the observations and a random sample of the possible covariates. We use the randomForest package to fit random forest models. For the frequency model, we include the exposure as a covariate. That is similar to including it as an offset in a standard GLM, although the offset constrains the coefficient to be exactly 1, while the random forest is free to learn the relationship from the data. The severity values are already per claim and do not need to be adjusted for exposure. In the pure premium model, we first divide the total losses by the exposure and then model that per-exposure loss amount. When fitting a random forest model, several main hyperparameters need to be set:
Number of trees to grow: This is the number of simple models (trees) to fit. Because each model is fitted to a subset of the explanatory variables and a subset of the observations, a sufficient number of trees need to be built. This is only really limited by your computational power. As we will show, random forest models can be computationally intense.
Number of explanatory variables randomly sampled for each split in the tree: This is the number of variables that should be used in each simple model. The default value is one-third of the total number of variables.
Sample size of each draw: This is the number of observations sampled to build each tree.
Minimum size of the terminal nodes: Making this value larger will cause smaller trees to be grown. The default value is 5.
We set these four hyperparameters through cross-validation, as described in Section 4.1. For the rest of the settings, we simply use the defaults. For more details on random forest models, please see Breiman (2001).
4.1. Model Fitting
In summary, we fit the following models:
We fit all of the above models using 10-fold cross-validation: we divide the data set into 10 parts and hold out 1 at a time while fitting the model on the other 9. For the models without hyperparameters to tune (GLMs, GLMMs, and spike and slab), we simply fit the model to 9/10 of the data and use it to predict the held-out 1/10, iterating until each 1/10 has been held out once. That way, all of the predictions are made on out-of-sample data.
For the other models (LASSO, ridge, elastic net, and random forest), we use nested cross-validation, in which an inner level of cross-validation determines the hyperparameter settings. Specifically, we use the 9/10 of the data to first tune the hyperparameters (λ for the regularization methods and the various hyperparameters listed above for the random forest model) through a further 10-fold cross-validation on that 9/10 of the original data set. That way, we choose the hyperparameters that perform best on held-out data and then use those hyperparameters to fit the model on the entire 9/10. Finally, we use that fitted model to predict the values for the held-out 1/10 that the model has not yet seen. This method, while computationally expensive (because of the nested cross-validations), allows us to use the data set more fully while still maintaining proper holdout samples.
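The structure of this procedure can be summarized in a short sketch. The code below is illustrative only (Python, with placeholder fit/predict/tune functions standing in for the R model fitting); its point is that the inner tuning loop only ever sees the outer training folds:

```python
def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs; each observation is
    held out exactly once across the k folds."""
    folds = [list(range(j, n, k)) for j in range(k)]
    for j, test in enumerate(folds):
        train = [i for jj, fold in enumerate(folds) if jj != j for i in fold]
        yield train, test

def nested_cv_predict(X, y, fit, predict, tune, k=10):
    """Out-of-sample predictions via nested cross-validation.

    fit(X, y, hyper) -> model; predict(model, x) -> prediction;
    tune(X, y, k) -> hyperparameters chosen by an inner k-fold CV
    that sees only the outer training 9/10 (its internals are model
    specific, so it is left abstract here).
    """
    preds = [None] * len(y)
    for train, test in k_fold_splits(len(y), k):
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        hyper = tune(X_tr, y_tr, k)          # inner CV on the 9/10 only
        model = fit(X_tr, y_tr, hyper)       # refit on the whole 9/10
        for i in test:
            preds[i] = predict(model, X[i])  # score the unseen 1/10
    return preds
```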
5. Results
5.1. Frequency
The frequency models produced the following results:
The Poisson GLM was the most computationally efficient model, followed closely by the ridge and Bayesian models. The LASSO and elastic net were next, with the GLMM and spike and slab models a number of times slower. The random forest model was orders of magnitude slower than the other models because of the many individually optimized trees.
Because of its long runtime, we were able to run the random forest model only once when predicting. The simplest models (GLM and Bayesian GLM) performed best in terms of MAPE, while the random forest model performed best in terms of MSPE. After the random forest, the GLM, Bayesian GLM, LASSO, and elastic net all performed similarly. The GLMM did not do well, but it also did not use all of the same covariates as the other models. MSPE places more weight on individual predictions with large errors, in this case exceptionally large predicted (or actual) frequencies. It is interesting that the ridge regression performed worse than the LASSO and elastic net in this situation. The random forest model may predict large values better than the GLM, which would explain why the GLM has a consistently smaller MAPE but a larger MSPE. In choosing the best frequency model for these data, it is difficult to differentiate between the GLMs (Bayesian or not) and the random forest model.
5.2. Severity
The severity models produced the following metrics:
As with the frequency models, the simple models were very effective. Under MAPE, the random forest model was best, with the GLM, Bayesian GLM, and GLMM close behind. The ordering was similar under MSPE, but with the simpler models outperforming the random forest model.
The GLMM was slower than all the other models, except for the random forest model, which, again, was orders of magnitude slower than the other models.
Note that these results are similar to the frequency results, except that here, the random forest model outperformed in terms of MAPE and was worse in MSPE. Again, it is difficult to choose a best model between the GLMs and the random forest models.
5.3. Pure Premium
The pure premium models produced the following results:
As with the severity models, the random forest outperformed the rest of the models under MAPE but was much worse in terms of MSPE, likely because it severely overpredicted (or underpredicted) some of the policies. This distinction points to another potential benefit of the market-based model selection metric: in actual practice, underpredicting premium is much more detrimental than overpredicting it. Also, continuing the theme, the random forest model took much longer to fit than any of the other models.
5.4. Market-Based Metric
Looking only at these metrics leaves a difficult decision: Which model is best? The GLMs were the most computationally efficient and performed best under some of the metrics, but the random forest model was best under others. Is the increased computational complexity of the random forest model worthwhile? Why was its performance so different between frequency and severity? Does the dramatic improvement in MAPE cancel out the poor performance in MSPE? The market-based model comparison addresses these questions directly by simulating what would happen to our market share and loss ratio if we were to price with a simpler model while a competitor priced with a random forest. This helps us decide more concretely which model to choose and to understand the value of a more complicated or computationally intense method.
In our application, the simulation produced the following results:
The random forest model was the only one with a loss ratio below 1.0 (implying more premium than losses), and it captured almost half of the market share. The simpler models were next best, and the regularization methods performed worst. Even though the random forest was the least accurate in terms of pure premium MSPE, being the most accurate in terms of MAPE made it dominant in the market simulation. One possible explanation is that MSPE excessively penalizes large misses: if the random forest model severely overpredicted some policies, that would greatly affect MSPE, but the effect on MAPE would be muted. The model would not win those policies, so the overprediction would not hurt it in the market simulation. These results show that the increased computational time to fit a random forest is well worth the advantage gained in the market.
One other nice feature of the market-based model comparison is the ability to look at the results and quickly interpret possible reasons for them. About half of the market went to the random forest model and the other half went to the remaining models. What if the other models simply split the rest of the policies? How would the random forest perform if compared with just one of the other six models? What would happen if we allowed the models to contain interaction or polynomial terms? What about an ensemble of many of the models explored? All of these questions and more can be answered by market-based model selection.
With a loss ratio less than 1.0, a potential issue emerges. Because there is no accounting for profit or expenses, is a loss ratio less than 1.0 actually desirable? If a model were perfect, its predicted pure premium would always equal the actual losses. Therefore, any of the policies that it earns would have a loss ratio of exactly 1.0. Any loss ratio less than 1.0 would be due to overprediction, which is less desirable than an accurate prediction.
This turns out to be another scenario in which the market-based model selection method aligns statistical techniques with business realities. We can see that by looking at two separate scenarios:
A perfect model exists. It would write only the policies for which all other models overpredict the losses. All of the other models would write only policies for which they underpredicted losses, making their loss ratios greater than 1.0 and showing that the perfect model is the best.
As in reality, no perfect model exists. The best model is the one that makes the least costly mistakes. Most common model selection procedures are symmetric, implying that overpredictions and underpredictions are equally penalized. In the market-based method (and in actual practice), an overprediction simply means that you aren’t likely to write the policy (because the premium is so high), but an underprediction exposes you to severe risk because you are likely to write the policy and to have more losses than premiums if you do.
6. Conclusion and Future Research
When looking at standard metrics, it can be difficult to determine which model is best, especially when the different metrics provide conflicting information. The market-based model comparison method helps an analyst choose a model based more directly on the desired business outcomes. For stakeholders, it can provide better justification for the capital expenditure to roll out a new model, or at least more data for making that decision. In our commercial auto data set, it is difficult to decide between GLMs and a random forest when looking at standard model comparison metrics; the market-based comparison shows that the random forest model is dramatically superior to the other models in the metrics that actually affect insurance profits: loss ratio and market share.
There are a few shortcomings in—and cautions for using—the results of this paper. The most important issue has to do with the amount of time necessary to fit a quality model in each of the model families. While the random forest model can do much of its own tuning, variable selection, and even a little bit of feature engineering, the other model families require a good bit of testing and adjustment to their variables to get a good fit.
Other considerations include the following:
Some (or even all) of the advantage of the random forest model may evaporate if certain interaction terms or other covariates are added to the other models.
While the random forest model won convincingly in our application, this was just one application. Care should be taken when using these results in a similar situation.
All of the methods in this paper are limited by the implementation in their respective R packages. A different implementation of some of the models could allow them to perform better.
This paper does not provide an exhaustive list of possible models. Two-part models (in which frequency and severity are modeled separately and then combined) are a great example of other possible models to consider.
It is possible that the increased flexibility of the random forest model, especially when it comes to the relationships between the covariates, helped the random forest to outperform the others. If that is true, some of the benefits of the random forest could be muted by allowing more complicated (with interactions, polynomials, etc.) model structures in the GLMs.
Legal and regulatory requirements can limit which ratemaking models are even available to an insurer.
Losses in other lines of business can behave very differently than losses in commercial auto.
All of these factors should be taken into account when comparing your own set of models.
These models also did not incorporate the number of exposures as a weight in the regression models, which would properly account for the fact that an observation with 4 exposures is effectively 4 observations with identical characteristics. Essentially all of the models compared have a weight variable that can be set; the notable exception is the random forest model. To incorporate exposures as a weight in that model, the observations in the data set would need to be expanded into multiple rows with identical characteristics. For example, an observation with 6 exposures could be expanded to 6 rows, each with a single exposure. This also works with decimal exposures (dividing a 2.3-exposure observation into 5 observations each with 0.46 exposures, or into 23 observations each with 0.1 exposures). The more finely the observations are divided, the larger the computational impact. Because this paper focuses on comparative and predictive (not inferential) results, the results are unchanged by this simplification.
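The row-expansion workaround can be sketched as follows (our own hypothetical helper, not code from the paper; the `unit` parameter controls the target exposure per expanded row):

```python
def expand_exposures(records, unit=1.0):
    """Split each (features, exposure) record into multiple rows whose
    exposures sum to the original exposure.

    Smaller `unit` values divide the observations more finely, which is
    more precise but produces more rows and more computation.
    """
    expanded = []
    for features, exposure in records:
        n_rows = max(1, round(exposure / unit))
        expanded.extend((features, exposure / n_rows) for _ in range(n_rows))
    return expanded
```

With unit=1.0, a 6-exposure observation becomes six rows of exposure 1.0; with unit=0.1, a 2.3-exposure observation becomes 23 rows of exposure 0.1 each.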
Additionally, it would be better if this metric were used over multiple years, allowing the different models to adjust their base rates and drop out of the market completely depending on their performance. Unfortunately, we had access to only a single year of data in this project.
It would be beneficial to continue to test the hypothesis that the random forest outperforms the other models. This can be done with other data sets, other lines of business, more comparison models (e.g., two-part models, ensembles of the current models, more complicated design matrices), and different comparison metrics (e.g., Gini or information criteria).
The authors are grateful for the financial support of the Casualty Actuarial Society; to the anonymous reviewers for their thoughtful comments, which greatly improved the manuscript; and to Jim King for planting the seeds of this project.