Maximum Likelihood Approaches to Misrepresentation Models in GLM ratemaking: Model Comparisons

Michelle Xia; Rexford Akakpo; Matthew Albaugh

1. Introduction

Misrepresentation is a type of insurance fraud in which a policy applicant makes a false statement about a risk factor (or rating factor) that determines insurance eligibility or rates. In insurance ratemaking, information about certain rating factors may be acquired through self-reporting, possibly due to the high cost of verifying the information through alternative channels. For instance, in auto insurance, major rating factors such as vehicle use and annual mileage are usually self-reported by insurance applicants. In health and life insurance, information about smoking status and preexisting medical conditions is acquired mainly through voluntary disclosures. Due to the financial incentive for insureds, self-reported rating factors are usually subject to misrepresentation, giving rise to additional costs for insurance companies and unfair insurance rates for consumers.

In the era of big data, predictive analytics has attracted enormous attention from both the insurance industry and academia. In the insurance industry, predictive analytics has been widely applied in the areas of marketing, underwriting, and claims modeling (Nyce 2007). In actuarial literature, Frees, Derrig, and Meyers (2014) gave a comprehensive review of traditional statistical learning methods and their applications in actuarial science. Quan and Valdez (2018) proposed multivariate decision trees for predictive analytics of insurance claims. With the availability of big data on demographics, socioeconomic status, social media usage, and geolocation, predictive analytics has proven very useful in identifying and predicting insurance claims fraud. There are abundant examples of predictive analytics use for healthcare fraud detection (e.g., Ekin et al. 2018; Bauder and Khoshgoftaar 2017; Bayerstadler, van Dijk, and Winter 2016; Joudaki et al. 2015; and Thornton et al. 2013).

When it comes to underwriting misrepresentation, however, predictive modeling is deterred by the unobserved feature of misrepresentation at the policy level, and the literature available on quantitative methods tailored to the problem is very limited. Recently, Xia and Gustafson (2016) formulated the problem of unidirectional misrepresentation under the regression setting and proved the identifiability of the conditional distribution of the loss outcome given the observed risk status for nonbinary loss outcomes. They showed that consistent statistical inference can be obtained on the parameters of interest, including the risk effect (i.e., the relativity) and the misrepresentation probability, using regular ratemaking data without information on the misrepresentation status. For mathematical convenience, they only studied situations in which there was only one risk factor in the ratemaking models. Xia, Hua, and Vadnais (2018) extended the misrepresentation models to more realistic situations in which there were multiple risk factors in the generalized linear model (GLM) ratemaking framework. Moreover, they relaxed the assumption that the misrepresentation probability does not change with other risk factors and thus allowed for predictive analytics on the misrepresentation probability. With the aid of the aforementioned models, insurance companies can predict the misrepresentation probability at the policy level based on historical ratemaking data.

For the models’ implementation, Xia and Gustafson (2016) and Xia, Hua, and Vadnais (2018) relied on Bayesian inference based on Markov chain Monte Carlo simulations. To facilitate the industrial application of the misrepresentation models, Akakpo, Xia, and Polansky (2019) considered the maximum likelihood inference and proposed using the expectation maximization (EM) algorithm for the lognormal model from Xia and Gustafson (2016) with only one risk factor. (Maximum likelihood inference is more commonly used for GLM ratemaking models due to advantages such as computational efficiency.) Owing to the simplicity of the problem and the tractability of the normal likelihood function, they derived explicit formulas for iterating between the expectation step (E-step) and the maximization step (M-step), along with the observed Fisher information matrix for large-sample inference on the risk effect (relativity) and normal variance parameters. Recently, Chen, Su, and Xia (2021) extended the normal model and its EM algorithm to include zero claims based on a two-part process, and Li, Song, and Su (2021) proposed robust estimation of insurance misrepresentation based on kernel quantile regression mixtures. For the sophisticated misrepresentation predictive models from Xia, Hua, and Vadnais (2018), we aim to incorporate a more general exponential family of distributions under the GLM ratemaking framework. Thus, the maximum likelihood inference will be more involved and the explicit formulas from Akakpo, Xia, and Polansky (2019) will no longer be available for the EM algorithm and large-sample inference.

In this paper, we accept the challenge and use the maximum likelihood method to implement the misrepresentation predictive models from Xia, Hua, and Vadnais (2018) that allow latent binary regression on the prevalence of misrepresentation. Specifically, we introduce a set of latent binary factors on the occurrence of misrepresentation for policies reporting a negative risk status, and we derive general forms of the partial and complete data likelihood functions for the misrepresentation predictive models from Xia, Hua, and Vadnais (2018). We use the EM algorithm (Louis 1982; McLachlan and Krishnan 2008) based on the complete data likelihood function to estimate the parameters in the misrepresentation predictive models. The proposed estimation algorithm enables the frequentist predictive analytics on the misrepresentation status by allowing the prevalence of misrepresentation to vary with certain risk factors. For the sake of illustration, we implement the EM algorithm with some commonly used loss severity and frequency distributions such as the lognormal, gamma, Poisson, and negative binomial. The standard errors of the estimators are obtained for large-sample inference using numerical approximation of the Hessian matrix (Yang et al. 2005) associated with the partial data log-likelihood function. We perform simulation studies to compare the parameter estimates of the proposed EM algorithm and its Bayesian counterpart while assessing the ability of the proposed misrepresentation models to learn parameters, including the misrepresentation prevalence. The simulation studies demonstrate the computational efficiency of the EM algorithm as well as the importance of misrepresentation modeling in practical scenarios where misrepresentation is possible.

Applying the EM algorithm based on the aforementioned loss frequency and severity distributions, we perform model comparisons for the purposes of assessing misrepresentation and predicting claims at the policy level. The application study uses healthcare expenditure and utilization data from the Medical Expenditure Panel Survey (MEPS) to demonstrate the usefulness of the proposed methodology for claim prediction and misrepresentation risk assessment. Based on the loss severity models for total medical expenditures and the loss frequency models for the number of office-based visits, the analysis yields statistically significant tests for the presence of misrepresentation and for the effects of risk factors on the prevalence of misrepresentation. All the criteria, including goodness-of-fit statistics, in-sample and out-of-sample prediction, and statistical tests, are consistent with the significant test results on misrepresentation. For insurance applications, including claims analytics and ratemaking, the application study reveals the importance of model selection when misrepresentation models fitted to historical data are used to predict future losses.

The rest of this paper is organized as follows: Section 2 reviews the embedded misrepresentation predictive models from Xia, Hua, and Vadnais (2018). Section 3 considers the EM algorithm for maximum likelihood estimation of the model’s parameters, including those concerning the risk effects (relativities) on the loss frequency/severity and the misrepresentation prevalence. Section 4 discusses the large-sample inference with respect to the risk effects and the misrepresentation parameters and compares models with different distribution choices. Section 5 evaluates the EM algorithm and its Bayesian counterpart based on the parameter estimates and computation speed. Section 6 applies the proposed algorithm and model comparison tools for assessing misrepresentation risk in the healthcare expenditure and utilization variables collected in the MEPS. (For the MEPS data, model validation is performed using various methods, including those commonly adopted for ratemaking models.) Section 7 concludes the paper.

2. Predictive models for misrepresentation

We first review the notation and models from Xia, Hua, and Vadnais (2018) that enable embedded predictive analytics on misrepresentation risk at the policy level based on observed data from regular ratemaking.

2.1. Misrepresentation

We first formulate the statistical problem of misrepresentation for a binary rating factor $V$ (e.g., on smoking). For the risk factor, $(V=1)$ denotes an insured with a positive risk status and $(V=0)$ denotes a negative risk status. The corresponding reported status that is subject to misrepresentation is denoted as $V^*$ . Due to the financial incentive of denying a positive risk status, misrepresentation is assumed to happen only in the direction that is beneficial to the applicant. This unidirectional property translates mathematically to $\mathbb{P}[V^*=0\,|\,V=0]=1$ and $\mathbb{P}[V^*=0\,|\,V=1]=p$ with $0\leq p\leq 1$ . Here, the conditional probability $p$ is referred to as the misrepresentation probability. Thereby, the conditional probability distribution of $(V^*\,|\,V)$ quantifies the severity of misrepresentation.

In regular ratemaking data, the reported status $V^*$ is the surrogate version of the risk status that is observable. From a practical point of view, it is more meaningful for insurance companies to study the conditional distribution of $(V\,|\,V^*)$ (i.e., given an observed status, the true risk attribute of the insured). This conditional probability distribution can be obtained using Bayes’ theorem. Denoting by $\theta=\mathbb{P}[V=1]$ the true probability of a positive risk status, it is straightforward to obtain the observed probability of a positive risk status as $\theta^*=\mathbb{P}[V^*=1]=\mathbb{P}[V^{*}=1,\,V=0]+\mathbb{P}[V^{*}=1,\,V=1]=\theta(1-p)$. Application of Bayes’ theorem yields $\mathbb{P}[V=1\,|\,V^*=1]=1$; that is, given that an applicant reports a positive status, the true risk status must be positive. Another quantity of particular interest is $\lambda=\mathbb{P}[V=1\,|\,V^*=0]$, the percentage of misrepresented cases within those applicants who reported a negative status. In Xia, Hua, and Vadnais (2018), the conditional probability $\lambda$ is referred to as the prevalence of misrepresentation. Using Bayes’ theorem, the misrepresentation prevalence can be obtained as

$\mathbb{P}\left[V=1 \mid V^*=0\right]=\lambda=\frac{\theta p}{1-\theta(1-p)}. \tag{1}$

Note that the misrepresentation prevalence quantifies the misrepresentation risk given the reported negative status and determines the total number of misrepresented cases in the whole book of policies. Due to the unobserved feature of $V$ , the conditional probability $\lambda$ cannot be estimated directly from regular ratemaking data. Even if fraud investigations are conducted to identify the misrepresented cases, the information may not be credible for estimating the misrepresentation prevalence in the overall book of business due to the selection bias inherent in the investigation process.
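As a numerical illustration of Equation 1, the following minimal Python sketch computes the observed probability $\theta^*$ and the prevalence $\lambda$ from $\theta$ and $p$. The function names and the input values ($\theta=0.3$, $p=0.25$) are hypothetical and chosen only for illustration; they do not come from the data analyzed in this paper.

```python
def observed_positive_prob(theta, p):
    """theta* = P[V* = 1] = theta * (1 - p) under unidirectional misrepresentation."""
    return theta * (1.0 - p)


def misrepresentation_prevalence(theta, p):
    """lambda = P[V = 1 | V* = 0] = theta * p / (1 - theta * (1 - p)), Equation 1."""
    return theta * p / (1.0 - theta * (1.0 - p))


# Hypothetical inputs: 30% of insureds are truly positive, 25% of whom misreport.
theta, p = 0.3, 0.25
theta_star = observed_positive_prob(theta, p)
lam = misrepresentation_prevalence(theta, p)
```

With these hypothetical inputs, 22.5% of applicants report a positive status, and roughly 9.7% of those reporting a negative status are misrepresented cases.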

2.2. GLM ratemaking

Under the GLM ratemaking framework (Bermúdez and Karlis 2011; Frees 2009; Brockman and Wright 1992), Xia, Hua, and Vadnais (2018) proposed a class of embedded predictive analytics models based on regular ratemaking data for studying misrepresentation risk. For the purpose of ratemaking, we use the random variable $Y$ to denote the loss outcome that can be either continuous (e.g., for modeling loss severity on the mean of the claim amount given that a claim has occurred) or discrete (e.g., for modeling loss frequency on the average number of claims per policy period). Assume there are $k\in \mathbb{N}$ additional risk factors that are predictive of the loss outcome $Y$, denoted by $\boldsymbol{X}=(X_1,\,X_2,\,\cdots,\,X_k)'$. Examples of risk factors that are commonly used in personal property and casualty insurance include age, sex, credit rating, and claim history. For specific product lines such as personal auto insurance, commonly used risk factors include vehicle use, annual mileage, traffic violations, and geographic location.

In GLM ratemaking, a generalized linear model is assumed for the relationship between the mean of $Y$ and the rating factors in $\{V,\boldsymbol{X}\}$ . We use the set notation $\mathcal{S},\mathcal{T}\subseteq \{1,\ldots,k\}$ to describe the sets of rating factors in the GLM ratemaking model and the misrepresentation prevalence model, respectively. Thus, we have $\boldsymbol{X}^{\square}=\{X_{j}:j\in \square \}$ where $\square$ can be one of $\{\mathcal{S},\mathcal{T}\}$ . For a GLM ratemaking model, we may specify the conditional probability distribution of ( $Y\,|\,V,\,\boldsymbol{X}^{\mathcal{S}}$ ) as

$\begin{aligned} Y \mid V, \boldsymbol{X}^{\mathcal{S}} &\sim D\left(\mu_{V, \boldsymbol{X}^{\mathcal{S}}}, \varphi\right) \\ g\left(\mu_{V, \boldsymbol{X}^{\mathcal{S}}}\right)&=\alpha_0+\sum_{j \in \mathcal{S}} \alpha_j X_j+\alpha_{k+1} V, \end{aligned} \tag{2}$

where $D(\mu_{V,\boldsymbol{X}^{\mathcal{S}}},\,\varphi)$ denotes a distribution in the exponential family with the mean $\mu_{V,\boldsymbol{X}^{\mathcal{S}}}$ and a dispersion parameter $\varphi$ . The function $g(\cdot)$ is referred to as the link function that connects the mean to the linear predictor $\alpha_0+\sum_{j\in\mathcal{S}}\alpha_jX_j+\alpha_{k+1}V$ . In GLM ratemaking, $D(\cdot,\cdot)$ may take the form of a gamma or lognormal (i.e., normal for the logarithm of the loss) distribution for loss severity modeling, and it may take the form of a Poisson or negative binomial distribution for frequency models. For ease of interpretation and rate calculation, a log link function $\log(\cdot)$ is commonly assumed in both severity and frequency models. Under a log link function, the exponential of the coefficient $\exp(\alpha_j)$ , $j\in\mathcal{S}$ quantifies the relative effect on the mean of $Y$ with one unit increase in the risk factor $X_j$ , and is thus referred to as the relativity associated with $X_j$ . Interested readers may refer to standard references such as Goldburd, Khare, and Tevet (2020) for commonly used GLM ratemaking models based on lognormal, gamma, Poisson, and negative binomial distributions. In addition, Xia, Hua, and Vadnais (2018) give examples of distributional specifications for GLM ratemaking models incorporating misrepresentation.
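To make the relativity interpretation concrete, the short Python sketch below evaluates the mean in Equation 2 under a log link and checks that the ratio of means for $V=1$ versus $V=0$ equals $\exp(\alpha_{k+1})$, whatever the values of the other risk factors. The function name and all coefficient values are hypothetical.

```python
import math


def glm_mean_log_link(alpha0, alphas, x, alpha_v, v):
    """Mean of Y in Equation 2 with g = log:
    exp(alpha0 + sum_j alpha_j * x_j + alpha_{k+1} * v)."""
    eta = alpha0 + sum(a * xj for a, xj in zip(alphas, x)) + alpha_v * v
    return math.exp(eta)


# Hypothetical coefficients for two additional risk factors and the status V.
alpha0, alphas, alpha_v = 1.0, [0.2, -0.1], 0.5
x = [1.0, 3.0]

mu_v0 = glm_mean_log_link(alpha0, alphas, x, alpha_v, v=0)
mu_v1 = glm_mean_log_link(alpha0, alphas, x, alpha_v, v=1)

# Relativity of V: the ratio of means, which does not depend on x.
relativity_v = mu_v1 / mu_v0
```

Because the linear predictor is additive on the log scale, the same ratio is obtained for any choice of `x`, which is what makes relativities convenient for rate calculation.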

For the implementation of the aforementioned GLM ratemaking models, the difficulty lies in the fact that the true risk status $V$ cannot be directly observed due to the presence of misrepresentation in self-reported rating factors. Ignoring the problem of misrepresentation usually leads to bias in the estimate of the risk effect $\alpha_{k+1}$ (Xia and Gustafson 2016), resulting in unfairness in the calculated premiums for consumers. Furthermore, the misrepresented status may cause the insurance company to accept high-risk customers who would be ineligible based on their true status. In order to study the severity of the misrepresentation problem and perform predictive modeling at the policy level, we consider the conditional loss distribution given the observed variables treating $V$ as a latent variable depending on $V^*$ .

2.3. Predictive models on misrepresentation

Denote by $f_Y(y\,|\,\boldsymbol{\alpha},\,\varphi,\,V,\,\boldsymbol{X}^{\mathcal{S}})$ the conditional probability function of ( $Y\,|\,V,\,\boldsymbol{X}^{\mathcal{S}}$ ), where $\boldsymbol{\alpha}=\{\alpha_j\}$ , $j\in\mathcal{S}\bigcup\,\{0,\,k+1\}$ , which contains the set of regression coefficients. In loss severity modeling, the conditional distribution function $f_Y(y\,|\,\boldsymbol{\alpha},\,\varphi,\,V,\,\boldsymbol{X}^{\mathcal{S}})$ usually takes the form of the probability density function of gamma and lognormal distributions. In loss frequency modeling, the function $f_Y(y\,|\,\boldsymbol{\alpha},\,\varphi,\,V,\,\boldsymbol{X}^{\mathcal{S}})$ usually takes the form of the probability mass function of Poisson and negative binomial distributions.

Regarding the conditional distribution of the observed variables, ( $Y\,|\,V^*,\,\boldsymbol{X}^{\mathcal{S}}$ ), Xia and Gustafson (2016) and Xia, Hua, and Vadnais (2018) derived a general form

$\scriptsize{ \begin{aligned} f_Y\left(y \mid \boldsymbol{\alpha}, \varphi, \lambda, V^*=1, \boldsymbol{X}^{\mathcal{S}}\right)&=f_Y\left(y \mid \boldsymbol{\alpha}, \varphi, V=1, \boldsymbol{X}^{\mathcal{S}}\right) \\ f_Y\left(y \mid \boldsymbol{\alpha}, \varphi, \lambda, V^*=0, \boldsymbol{X}^{\mathcal{S}}\right)&=(1-\lambda) f_Y\left(y \mid \boldsymbol{\alpha}, \varphi, V=0, \boldsymbol{X}^{\mathcal{S}}\right)\\ &\quad +\lambda f_Y\left(y \mid \boldsymbol{\alpha}, \varphi, V=1, \boldsymbol{X}^{\mathcal{S}}\right), \end{aligned} } \tag{3}$

where $\lambda$ is the prevalence of misrepresentation defined in Equation 1 that directly determines the total number of misrepresented cases in the book of policies and thus is of particular interest to insurance companies. From Equation 1, the prevalence of misrepresentation is determined by the prevalence of the risk factor $\theta=\mathbb{P}[V=1]$ , and the misrepresentation probability $p=\mathbb{P}[V^*=0\,|\,V=1]$ . Note that the mixture structure in Equation 3 cannot be caused by a binary confounding factor unless individuals with a positive reported status $V^*=1$ all have the same status for this confounding factor. Here the unique feature of misrepresentation data is that the conditional distribution in the first line of Equation 3 is a single distribution instead of a mixture distribution that would typically arise in the case of a binary confounding factor not captured in the model.
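The structure of Equation 3 can be sketched in a few lines of Python. The example below assumes a gamma loss distribution parameterized by its mean and dispersion (shape $1/\varphi$, scale $\mu\varphi$); the function names are hypothetical, and the means `mu0` and `mu1` stand for the values implied by Equation 2 when $V=0$ and $V=1$.

```python
import math


def gamma_pdf(y, mu, phi):
    """Gamma density with mean mu and dispersion phi (shape 1/phi, scale mu*phi)."""
    shape, scale = 1.0 / phi, mu * phi
    log_pdf = ((shape - 1.0) * math.log(y) - y / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def observed_density(y, v_star, mu0, mu1, phi, lam):
    """Density of (Y | V*) in Equation 3 for a given reported status v_star."""
    if v_star == 1:
        # Reported positives are truly positive: a single component.
        return gamma_pdf(y, mu1, phi)
    # Reported negatives: mixture of true negatives and misrepresented positives.
    return (1.0 - lam) * gamma_pdf(y, mu0, phi) + lam * gamma_pdf(y, mu1, phi)
```

Setting `lam` to zero recovers the usual single-component model for reported negatives, which is the model one would fit when misrepresentation is ignored.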

Under real insurance settings, either of the probabilities may vary with certain risk factors in $\boldsymbol{X}$ . That is, it is reasonable to assume that the probability of a true positive status and/or the occurrence of misrepresentation depends on certain risk factors. Hence, we may assume that the prevalence of misrepresentation $\lambda$ depends on the risk factors in $\boldsymbol{X}^{\mathcal{T}}$ . This can be achieved by assuming a latent binomial regression model on $\lambda$ . For the purpose of predictive modeling on the misrepresentation prevalence, Xia, Hua, and Vadnais (2018) proposed a latent binomial regression structure given by

$\begin{aligned} \left(V \mid V^*=0, \boldsymbol{X}^{\mathcal{T}}\right) & \sim \text{Bernoulli}\left(\lambda_{\boldsymbol{X}^{\mathcal{T}}}\right) \\ g\left(\lambda_{\boldsymbol{X}^{\mathcal{T}}}\right) & =\beta_0+\sum_{j \in \mathcal{T}} \beta_j X_j, \end{aligned} \tag{4}$

where the prevalence of misrepresentation $\lambda_{\boldsymbol{X}^{\mathcal{T}}}=\mathbb{P}[V =1\,|\,V^*=0,\,\boldsymbol{X}^{\mathcal{T}}]$ depends on the risk factors in $\boldsymbol{X}^{\mathcal{T}}$, and $g(\cdot)$ is a link function that may take the logit, log, or probit form. Here, the logit link function enables us to estimate the risk effect in terms of the odds ratio, while the log link function gives the risk effect in terms of the relative risk. The parameters of the logit model, $\boldsymbol{\beta}=\{\beta_j\}$, $j\in\mathcal{T}\bigcup\,\{0\}$, quantify the effects the additional risk factors have on the prevalence of misrepresentation and thus describe the mechanism of misrepresentation. Since the true status $V$ is unobserved, the binomial regression model in Equation 4 is a latent model.
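As a small illustration of Equation 4 under the logit link, the Python sketch below maps a linear predictor to a policy-level prevalence and verifies the odds-ratio interpretation of a coefficient. The function name and the coefficient values are hypothetical.

```python
import math


def prevalence_logit(beta0, betas, x_t):
    """Prevalence lambda = 1 / (1 + exp(-eta)) with eta = beta0 + sum_j beta_j * x_j,
    i.e., the inverse of the logit link in Equation 4."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x_t))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients with a single binary covariate in X^T.
beta0, betas = -2.0, [0.8]
lam_x0 = prevalence_logit(beta0, betas, [0.0])
lam_x1 = prevalence_logit(beta0, betas, [1.0])

# Under the logit link, exp(beta_j) is the odds ratio for a one-unit increase in X_j.
odds_ratio = (lam_x1 / (1.0 - lam_x1)) / (lam_x0 / (1.0 - lam_x0))
```

The computed `odds_ratio` coincides with $\exp(0.8)$ up to floating-point error, regardless of the intercept.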

From Equation 3, the conditional distribution of the observed variables is a single distribution for applicants who reported a positive risk status (i.e., $V^*=1$), and it takes the form of a mixture distribution for applicants who reported a negative risk status. The mixture model in the second line of Equation 3 is called a mixture regression model (Grün and Leisch 2007) when there are additional covariates (risk factors) $\boldsymbol{X}^{\mathcal{S}}$ in the model. When the prevalence of misrepresentation (i.e., the mixture weights) depends on some additional covariates $\boldsymbol{X}^{\mathcal{T}}$, as in the case of Equation 4, the model in the second line of Equation 3 is called a mixture of experts model (Jiang and Tanner 1999), or a mixture regression model with concomitant variables (Grün and Leisch 2008).

For distributions in the exponential family, including gamma, normal, Poisson, and negative binomial distributions, Jiang and Tanner (1999) proved that the mixture of experts model is identifiable up to permutation when the true risk factor $V$ has a nonzero effect. From the first line of Equation 3, parameters in the conditional distribution $(Y\,|\,V=1,\,\boldsymbol{X}^{\mathcal{S}})$ can be learned separately from the data with $V^*=1$. The permutation of the two components in the second line of Equation 3 can be learned from the combined data, and the model given in Equations 3 and 4 is thus identifiable. This implies that all the parameters, including those for the risk effects on the loss distribution in Equation 2 and those for the logit model on the prevalence of misrepresentation in Equation 4, can be consistently estimated from regular ratemaking data $(Y,\,V^*,\,\boldsymbol{X})$ without observing the true risk status $V$.

The estimated regression coefficients from the loss model in Equation 2 allow actuaries to understand the true risk effects on the loss outcome, while those from the misrepresentation model in Equation 4 enable actuaries to assess the prevalence of misrepresentation at the policy level for the purpose of predictive modeling. Based on the models fitted to the historical data, the misrepresentation risk can be predicted for new policies. The underwriting department can then undertake a cost-benefit analysis for potential misrepresentation investigations while making informed decisions on the selection of policies for investigation.

3. Maximum likelihood estimation

Because of the unobserved feature of the misrepresentation status, we may implement the misrepresentation predictive models using either Bayesian inference based on Markov chain Monte Carlo simulations or maximum likelihood inference based on the EM algorithm. Both types of implementation use the complete data likelihood function that includes the latent status on the occurrence of misrepresentation at the policy level. Compared with the Bayesian approach, the maximum likelihood method seems to have gained more popularity for ratemaking purposes in the insurance industry. Hence, in this paper, we focus on the maximum likelihood inference for the predictive misrepresentation models presented in Section 2.

3.1. Partial and complete data likelihood

In order to derive the EM algorithm, we first introduce the partial and complete data likelihood for the misrepresentation models of interest. For notational convenience, we denote

$\begin{aligned} f_0(y;\boldsymbol{\alpha},\,\varphi)&=f_Y(y\,|\,\boldsymbol{\alpha},\,\varphi,\,V=0,\,\boldsymbol{X}^{\mathcal{S}})\\ f_1(y;\boldsymbol{\alpha},\,\varphi)&=f_Y(y\,|\,\boldsymbol{\alpha},\,\varphi,\,V=1,\,\boldsymbol{X}^{\mathcal{S}}). \end{aligned}$

Suppose there is a random sample of size $n$ for the observed variables ( $Y$ , $V^*$ , $\boldsymbol{X}$ ), denoted respectively by $\boldsymbol{y}=(y_1,\,y_2,\,\cdots,\,y_n)$ , $\boldsymbol{v^*}=(v_1^*,\,v_2^*,\,\cdots,\,v_n^*)$ , and $\boldsymbol{x}=(\boldsymbol{x}_1,\,\boldsymbol{x}_2,\,\cdots,\,\boldsymbol{x}_n)$ with $\boldsymbol{x}_i=(x_{i1},\,x_{i2},\,\cdots,\,x_{ik})$ .

Set $\boldsymbol{\theta}=(\boldsymbol{\alpha},\varphi,\boldsymbol{\beta})$. The prevalence of misrepresentation for the $i$th observation is $\lambda_{i1}(\boldsymbol{\beta})=\lambda_{\boldsymbol{X}^{\mathcal{T}}=\boldsymbol{x}_i^{\mathcal{T}}}$, with $\lambda_{i0}(\boldsymbol{\beta})=1-\lambda_{i1}(\boldsymbol{\beta})$, where $\boldsymbol{\beta}=\{\beta_j\}$, $j\in\mathcal{T}\bigcup\,\{0\}$, contains the regression coefficients from the misrepresentation model in Equation 4. The partial data likelihood function can be written as

$\scriptsize{ L^p\left(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{x}\right)=\prod_{i=1}^n\left[v_i^* f_1\left(y_i ; \boldsymbol{\alpha}, \varphi\right)+\left(1-v_i^*\right) \sum_{j=0}^1 \lambda_{i j}(\boldsymbol{\beta}) f_j\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\right] }, \tag{5}$

with the corresponding partial log-likelihood function given by

$\small{ \begin{aligned} l^p\left(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{x}\right)&=\sum_{i=1}^n v_i^* \log f_1\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\\ &\quad +\sum_{i=1}^n\left(1-v_i^*\right) \log \sum_{j=0}^1 \lambda_{i j}(\boldsymbol{\beta}) f_j\left(y_i ; \boldsymbol{\alpha}, \varphi\right). \end{aligned} } \tag{6}$
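The partial log-likelihood in Equation 6 translates directly into code. The following Python sketch is distribution-agnostic: it assumes the component densities $f_0(y_i)$ and $f_1(y_i)$ and the prevalences $\lambda_{i1}$ have already been evaluated for each policy. The function name and argument layout are hypothetical.

```python
import math


def partial_log_likelihood(v_star, f0, f1, lam):
    """Partial data log-likelihood of Equation 6.

    v_star[i]: reported status of policy i (0 or 1)
    f0[i], f1[i]: component densities f_0 and f_1 evaluated at y_i
    lam[i]: prevalence lambda_{i1} for policy i
    """
    total = 0.0
    for vi, f0i, f1i, li in zip(v_star, f0, f1, lam):
        if vi == 1:
            # Reported positive: single component f_1.
            total += math.log(f1i)
        else:
            # Reported negative: two-component mixture.
            total += math.log((1.0 - li) * f0i + li * f1i)
    return total
```

For very small density values, a log-sum-exp formulation on the log scale would be numerically safer; the direct form is kept here to mirror the equation.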

In order to obtain the complete data likelihood for the EM algorithm, we need to introduce the latent status on the occurrence of misrepresentation. For observations where $v_i^*=0$ , denote by $z_{i1}$ the unobserved binary indicator on whether the $i$ th observation is misrepresented (i.e., whether the observation comes from the second component distribution). Similarly, we use $z_{i0}=1-z_{i1}$ to indicate whether the observation comes from the first component distribution. Denote by $\boldsymbol{z}$ the vector containing all $z_{ij}$ for different values of $i\in\{1,\ldots,n\}$ and $j\in\{0,1 \}$ . The complete data likelihood function can be written as

$\scriptsize{ \begin{aligned} L\left(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{x}, \boldsymbol{z}\right) \propto \prod_{i=1}^n \biggl\{&v_i^* f_1\bigl(y_i ; \boldsymbol{\alpha}, \varphi\bigr) \\ & +\bigl(1-v_i^*\bigr) \prod_{j=0}^1\left[\lambda_{i j}(\boldsymbol{\beta}) f_j\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\right]^{z_{i j}}\biggr\}, \end{aligned} } \tag{7}$

with the corresponding complete data log-likelihood function given by

$\scriptsize{ \begin{aligned} l\left(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{x}, \boldsymbol{z}\right) = C &+ \sum_{i=1}^n v_i^* \log f_1\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\\ &+\sum_{i=1}^n\left(1-v_i^*\right) \sum_{j=0}^1 z_{i j} \log \left[\lambda_{i j}(\boldsymbol{\beta}) f_j\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\right], \end{aligned} } \tag{8}$

where $C$ is a constant that can be ignored for the numerical procedures in the later sections.

3.2. EM algorithm

In order to obtain the maximum likelihood estimates (MLEs) of the simplified misrepresentation models in Xia and Gustafson (2016), Akakpo, Xia, and Polansky (2019) proposed using the expectation maximization (EM) algorithm that has been commonly used for mixture models in which there are no analytical forms for the MLEs. Due to the tractability of the normal likelihood function, Akakpo, Xia, and Polansky (2019) obtained explicit forms for the iterative formulas involved in the EM algorithm. For other loss frequency and severity distributions such as the ones considered in this paper, there is no explicit form available for the iterative steps in the algorithm. Using numerical optimization methods such as the Newton-Raphson algorithm, we derive the EM algorithm for the predictive models of interest where the prevalence of misrepresentation depends on additional risk factors. The algorithm can be conveniently implemented in the statistical software R for the loss frequency and severity models mentioned previously.

3.2.1. Expectation step

In the expectation step (E-step), the algorithm calculates the expectation of the complete data log-likelihood function with respect to the conditional distribution of $\boldsymbol{z}$ given the observations $(\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x})$ , and the latest estimates for $\boldsymbol{\theta}$ . Specifically, in iteration $s\in \mathbb{N}$ , we have the conditional expectation

$\small{ \begin{aligned} l^*\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(s)}\right) & =\mathbb{E}_{\boldsymbol{z} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{X}, \boldsymbol{\theta}^{(s)}}\left[l\left(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{x}, \boldsymbol{z}\right)\right] \\ & =\sum_{i=1}^n v_i^* \log f_1\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\\ &\quad +\sum_{i=1}^n\left(1-v_i^*\right) \sum_{j=0}^1 \mathbb{E}_{\boldsymbol{z} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{X}, \boldsymbol{\theta}^{(s)}}\left[z_{i j}\right] \\ &\quad \cdot \log \left[\lambda_{i j}(\boldsymbol{\beta}) f_j\left(y_i ; \boldsymbol{\alpha}, \varphi\right)\right], \end{aligned} } \tag{9}$

where $f_j(y_i;\,\boldsymbol{\alpha},\,\varphi)$ is the density function of the loss distribution evaluated at $\boldsymbol{\theta}$. By Bayes’ theorem, we have, for $i=1,\ldots,n$ and $j=0,1$,

$\small{ \begin{aligned} \mathbb{E}_{\boldsymbol{z} \mid \boldsymbol{y}, \boldsymbol{v}^*, \boldsymbol{X}, \boldsymbol{\theta}^{(s)}}\left[z_{i j}\right] & =\mathbb{P}\left[z_{i j}=1 \mid \boldsymbol{\theta}^{(s)}, y_i\right] \\ & =\frac{f\left(y_i \mid z_{i j}=1, \boldsymbol{\theta}^{(s)}\right) \mathbb{P}\left[z_{i j}=1 \mid \boldsymbol{\theta}^{(s)}\right]}{f\left(y_i \mid \boldsymbol{\theta}^{(s)}\right)}\\ &=\frac{f_j\left(y_i ; \boldsymbol{\alpha}^{(s)}, \varphi^{(s)}\right) \lambda_{i j}\left(\boldsymbol{\beta}^{(s)}\right)}{\sum_{l=0}^1 f_l\left(y_i ; \boldsymbol{\alpha}^{(s)}, \varphi^{(s)}\right) \lambda_{i l}\left(\boldsymbol{\beta}^{(s)}\right)}, \end{aligned} } \tag{10}$

where $f_j\left(y_i;\,\boldsymbol{\alpha}^{(s)},\,\varphi^{(s)}\right)$, $j\in \{0,1\}$, is the density function evaluated at $\boldsymbol{\theta}^{(s)}$ and $\lambda_{ij}\left(\boldsymbol{\beta}^{(s)}\right)$ is the prevalence of misrepresentation at the policy level evaluated at $\boldsymbol{\theta}^{(s)}$. Note that the posterior probability $\mathbb{E}_{\boldsymbol{z}|\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{X},\boldsymbol{\theta}^{(s)}}[z_{ij}]$ only appears in the second part of Equation 9, so we only need to calculate it for observations with $v_i^*=0$. Based on Equations 9 and 10, we can derive an explicit expression for the expected complete data log-likelihood function that will be used to update the parameters in the maximization step of the EM algorithm.
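The E-step in Equation 10 amounts to a per-policy posterior probability. A minimal Python sketch follows, assuming the component densities and prevalences at the current estimate $\boldsymbol{\theta}^{(s)}$ have been precomputed; the function name and the use of `None` for reported positives are illustrative choices, not part of the original implementation.

```python
def e_step(v_star, f0, f1, lam):
    """Posterior probability that z_{i1} = 1 (Equation 10) for each policy.

    f0[i], f1[i] and lam[i] are evaluated at the current estimate theta^(s).
    Policies with v_star[i] == 1 need no latent status, so None is returned for them.
    """
    post = []
    for vi, f0i, f1i, li in zip(v_star, f0, f1, lam):
        if vi == 1:
            post.append(None)
        else:
            num = li * f1i                        # lambda_{i1} * f_1(y_i)
            post.append(num / ((1.0 - li) * f0i + num))
    return post
```

The posterior probability for $j=0$ is one minus the returned value, so only the $j=1$ component needs to be stored.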

3.2.2. Maximization step

In the maximization step (M-step) of the EM algorithm, an update is obtained for the parameters in $\boldsymbol{\theta}$ by maximizing the expected complete data log-likelihood in Equation 9. In iteration $(s+1)$ of the M-step, the update of the parameters $\boldsymbol{\theta}^{(s+1)}$ is obtained by

$\begin{aligned} \boldsymbol{\theta}^{(s+1)}&=\mathop{\text{argmax}}_{\boldsymbol{ \theta}}\Big[l^*\left(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}^{(s)}\right)\Big]\notag\\ &=\mathop{\text{argmax}}_{\boldsymbol{ \theta}}\mathbb{E}_{\boldsymbol{z}|\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{X},\boldsymbol{\theta}^{(s)}} \Big[l(\boldsymbol{\theta}\,|\,\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x},\boldsymbol{z})\Big]\notag. \end{aligned}$

For the M-step, Akakpo, Xia, and Polansky (2019) derived analytical forms for the lognormal severity model when neither the loss severity nor the prevalence of misrepresentation depends on any other risk factor (i.e., when the model does not involve the logit regression structure in Equation 4, or, alternatively, when $\lambda_{ij}(\boldsymbol{\beta})$ does not depend on $i$, $\boldsymbol{\beta}$, or $x_{ih}$ for any $h\in\{1,2,\cdots,k\}$). For commonly used loss frequency and severity models, including Poisson, negative binomial, gamma, and lognormal models with a logit structure on the prevalence of misrepresentation, no explicit formulas can be derived for estimating the parameters in the M-step. In such situations, we may resort to numerical optimization based on Newton-type methods. For example, according to our extensive numerical experiments, the Newton-type algorithm implemented in the R function nlm() performs well for the misrepresentation predictive models based on the aforementioned loss distributions of interest.

Starting with the initial values from a regression analysis without adjusting for misrepresentation, the EM algorithm iterates between the E-step and M-step until the change in the partial data log-likelihood is below a prespecified threshold. For illustration purposes, in the Appendix, we provide the R implementation for a gamma model when $\lambda$ does not depend on any risk factor. For the R implementation, we specify 0.4 and 0 as the default starting values for the misrepresentation parameter $\lambda$ and each element in $\boldsymbol{\beta}$, respectively. The starting values are specified as input variables of the R functions to allow for changes when needed (e.g., for checking algorithm convergence).
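The iteration between the two steps can be sketched in Python for the simplest case, a lognormal model with no additional risk factors and a constant $\lambda$. This is an illustration under stated assumptions, not the Appendix code: the function `em_lognormal`, the simulated data, and the weighted closed-form updates (the standard weighted MLEs for a two-component normal mixture with common variance on the log scale) are ours, and the convergence check uses the log-likelihood of the log-losses, which differs from the partial data log-likelihood only by terms that do not involve the updated parameters.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Normal density, applied to log-losses (lognormal model on the log scale)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_lognormal(log_y, v_star, lam=0.4, tol=1e-6, max_iter=500):
    """EM for log(y) ~ N(mu_v, sigma^2), with true status v hidden when v_star == 0."""
    n = len(log_y)
    n0 = sum(1 for v in v_star if v == 0)
    # Starting values from the unadjusted fit (group means by reported status)
    g0 = [x for x, v in zip(log_y, v_star) if v == 0]
    g1 = [x for x, v in zip(log_y, v_star) if v == 1]
    mu0, mu1 = sum(g0) / len(g0), sum(g1) / len(g1)
    sigma = math.sqrt(sum((x - (mu1 if v == 1 else mu0)) ** 2
                          for x, v in zip(log_y, v_star)) / n)
    old_ll = -math.inf
    for _ in range(max_iter):
        # E-step: posterior probability of true status 1 (fixed at 1 if v_star == 1)
        w, ll = [], 0.0
        for x, v in zip(log_y, v_star):
            f1 = normal_pdf(x, mu1, sigma)
            if v == 1:
                w.append(1.0)
                ll += math.log(f1)
            else:
                mix = lam * f1 + (1.0 - lam) * normal_pdf(x, mu0, sigma)
                w.append(lam * f1 / mix)
                ll += math.log(mix)
        if abs(ll - old_ll) < tol:  # stop when the log-likelihood stabilizes
            break
        old_ll = ll
        # M-step: weighted closed-form updates
        lam = sum(wi for wi, v in zip(w, v_star) if v == 0) / n0
        sw = sum(w)
        mu1 = sum(wi * x for wi, x in zip(w, log_y)) / sw
        mu0 = sum((1.0 - wi) * x for wi, x in zip(w, log_y)) / (n - sw)
        sigma = math.sqrt(sum(wi * (x - mu1) ** 2 + (1.0 - wi) * (x - mu0) ** 2
                              for wi, x in zip(w, log_y)) / n)
    return mu0, mu1, sigma, lam

# Hypothetical simulated data: 30% true positives, 25% of whom report 0
random.seed(1)
log_y, v_star = [], []
for _ in range(5000):
    v = 1 if random.random() < 0.3 else 0
    reported = 0 if (v == 1 and random.random() < 0.25) else v
    log_y.append(random.gauss(1.0 + 2.0 * v, 0.65))
    v_star.append(reported)

mu0, mu1, sigma, lam = em_lognormal(log_y, v_star)
print(round(mu0, 3), round(mu1, 3), round(sigma, 3), round(lam, 3))
```

The default `lam=0.4` mirrors the default starting value mentioned above. With a logit structure on the prevalence or with gamma, Poisson, or negative binomial losses, the closed-form M-step would be replaced by a numerical optimizer.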

4. Statistical inference and model selection

In this section, we discuss the statistical inference on the risk effects of the rating factors (including the misrepresented ones), the statistical tests on the presence and mechanism of misrepresentation, and model selection with regard to loss distributions and model structures.

4.1. Inference on risk effects

For the purpose of statistical inference on the risk effects of the rating factors, the MLEs have a large sample multivariate normal distribution with the mean converging to the true values of the parameters and the variance-covariance matrix converging to the inverse of the Fisher information matrix, which is the negative of the expected Hessian matrix of the partial data log-likelihood function. For the lognormal loss severity model, Akakpo, Xia, and Polansky (2019) derived explicit forms of the observed Fisher information using the complete data log-likelihood function. For the predictive misrepresentation models of interest, we may evaluate the Hessian matrix of the partial data log-likelihood function using numerical derivatives. Note that numerical derivatives are widely used for obtaining the Hessian matrix in situations where the MLEs do not have explicit forms (e.g., in the case of GLM ratemaking).

Denote by $\widehat{\boldsymbol{\theta}}$ the MLEs of $\boldsymbol{\theta}$ and by $l^p(\boldsymbol{\theta})=l^p(\boldsymbol{\theta}\,|\,\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x})$ the partial data log-likelihood function evaluated at $\boldsymbol{\theta}$ . The observed Fisher information evaluated at the MLEs is defined as

$\small{ \mathcal{I}_n(\widehat{\boldsymbol{\theta}})=-\left(\begin{array}{cccc} \frac{\partial^2}{\partial \alpha_0^2} & \frac{\partial^2}{\partial \alpha_0 \partial \beta_0} & \cdots & \frac{\partial^2}{\partial \alpha_0 \partial \varphi} \\ \frac{\partial^2}{\partial \beta_0 \partial \alpha_0} & \frac{\partial^2}{\partial \beta_0^2} & \cdots & \frac{\partial^2}{\partial \beta_0 \partial \varphi} \\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2}{\partial \varphi \partial \alpha_0} & \frac{\partial^2}{\partial \varphi \partial \beta_0} & \cdots & \frac{\partial^2}{\partial \varphi^2} \end{array}\right) l^p(\boldsymbol{\theta})|_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}}. } \tag{11}$

When the sample size $n$ is large, the MLEs $\widehat{\boldsymbol{\theta}}$ have an approximate multivariate normal distribution $N(\boldsymbol{\theta},\,\boldsymbol{\Omega})$ , with $\boldsymbol{\Omega}$ being the inverse of the observed Fisher information matrix evaluated at the MLEs (i.e., $\mathcal{I}_n(\widehat{\boldsymbol{\theta}})^{-1}$ ). Thus, for the purpose of statistical inference, the standard error of the $j$ th parameter can be approximated by $\sqrt{\omega_{jj}}$ , the square root of the $j$ th diagonal element of $\boldsymbol{\Omega}$ . Interested readers may refer to Chapter 7 of Frees (2022) for details regarding the asymptotic properties of MLEs that justify the use of the aforementioned normal distribution for large-sample inference on MLEs.

For the partial data log-likelihood function in Equation 6, there are no analytical forms available for the partial second derivatives. For loss frequency and severity distributions, including Poisson, negative binomial, lognormal, and gamma distributions, we resort to numerical derivatives for obtaining the observed Fisher information for the misrepresentation predictive models that can involve a logit structure on the prevalence of misrepresentation. There are a variety of methods implemented in R that can be used to approximate the Hessian matrix. Examples include the complex-step derivative approximation (Martins, Sturdza, and Alonso 2003), implemented in the hessian() function of the numDeriv package and the hessian_csd() function of the pracma package, along with various versions of finite difference approximation (Yang et al. 2005) (e.g., the secant method in the optim() function, Richardson’s method implemented in the hessian() function of the numDeriv package, and the three-point central difference formula in the hessian() function of the pracma package). For the misrepresentation predictive models with Poisson, negative binomial, lognormal, and gamma loss distributions, for example, the hessian() function based on the complex-step derivative approximation performs well: it provides reasonable standard errors under various data generation mechanisms and in comparison with the standard errors from the corresponding unadjusted models.
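To illustrate how standard errors follow from a numerical Hessian, the sketch below uses a plain central finite-difference approximation in Python. It is not the complex-step or Richardson method from the R packages named above, and a two-parameter normal log-likelihood stands in for the partial data log-likelihood. The function names and the data are hypothetical.

```python
import math

def loglik(theta, y):
    """Normal log-likelihood with theta = (mu, sigma); a stand-in for l^p."""
    mu, sigma = theta
    return sum(-math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
               - (v - mu) ** 2 / (2.0 * sigma ** 2) for v in y)

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    k = len(x)
    H = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            def shifted(di, dj):
                z = list(x)
                z[i] += di * h
                z[j] += dj * h
                return f(z)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

def standard_errors_2x2(H):
    """Square roots of the diagonal of the inverse observed information (-H)."""
    a, b, c, d = -H[0][0], -H[0][1], -H[1][0], -H[1][1]
    det = a * d - b * c
    return math.sqrt(d / det), math.sqrt(a / det)

# Hypothetical data and closed-form MLEs for the normal model
y = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 3.9, 2.5]
n = len(y)
mu_hat = sum(y) / n
sigma_hat = math.sqrt(sum((v - mu_hat) ** 2 for v in y) / n)

H = hessian_fd(lambda t: loglik(t, y), [mu_hat, sigma_hat])
se_mu, se_sigma = standard_errors_2x2(H)
print(se_mu, sigma_hat / math.sqrt(n))            # numerical vs. analytical
print(se_sigma, sigma_hat / math.sqrt(2.0 * n))   # numerical vs. analytical
```

For this toy model the observed information is known in closed form, so the numerical standard errors can be checked against $\hat{\sigma}/\sqrt{n}$ and $\hat{\sigma}/\sqrt{2n}$. For the misrepresentation models no such check is available, which is why the choice of approximation method matters.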

Using the standard errors obtained from the observed Fisher information, we can perform statistical inference, including the Wald tests and Wald confidence intervals, on the risk effects of the rating factors that are predictive of the loss distribution and the misrepresentation prevalence.

4.2. Inference on misrepresentation

Regarding inference on the misrepresentation, our first question is whether we can construct a test to assess statistical evidence on the presence of misrepresentation based on observed data. For assessing the presence of misrepresentation, we cannot use the standard error of $\widehat{\lambda}$ to perform a statistical test, as the null value $\lambda=0$ lies on the boundary of the parameter space $[0,1]$ . Instead, we may perform a likelihood ratio test (LRT) between the unadjusted model assuming $\lambda=0$ (the null model) and the adjusted model with $\lambda\in[0,1]$ (the alternative model). The null model is simply a regular GLM ratemaking model that is nested within the misrepresentation model of interest with $\lambda\in[0,1]$ . Denote by $l^p_a=l^p(\widehat{\boldsymbol{\theta}}\,|\,\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x})$ the partial data log-likelihood function (5) evaluated at the MLEs and $l^p_0=l^p(\widehat{\boldsymbol{\alpha}},\widehat{\varphi},\lambda=0\,|\,\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x})$ evaluated at the MLEs obtained with the restriction $\lambda=0$ . The LRT statistic is given by

$G^2=2(l^p_a-l^p_0).$

For mixture models, the LRT does not have an asymptotic chi-squared distribution for the test on the order of the mixture (i.e., the number of mixture components) (Hartigan 1985). Due to the involvement of a mixture model structure in the partial data likelihood function (5), the property is also true for the misrepresentation models that contain a mixture regression model under a hybrid structure. Hence, we resort to the parametric bootstrap from McLachlan (1987) to obtain the null distribution of the LRT statistic for hypothesis testing on the presence of misrepresentation. The parametric bootstrap approach fits a null model (i.e., an unadjusted model assuming $\lambda=0$ ) using the original data and obtains the empirical distribution of the LRT statistic based on pseudo data generated repeatedly (e.g., $s$ times) from the fitted null model. For each simulated data set, the null and alternative models are refitted and an LRT statistic is calculated. The empirical distribution of the $s$ LRT statistics is then used to obtain the critical value, or the $p$ -value, of the LRT on the original data.
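The bootstrap procedure can be sketched in Python on a toy boundary problem that stands in for the misrepresentation model: testing $\mu=0$ against $\mu\ge 0$ for $y_i\sim N(\mu,1)$, where the LRT statistic has the closed form $n\max(\bar{y},0)^2$ and, as in the test on $\lambda=0$, the null value lies on the boundary. The function names and the $p$-value convention $(1+\#\{G^2_b\ge G^2\})/(s+1)$ are assumptions made for illustration. In the actual application, each bootstrap replicate would require refitting both the unadjusted and adjusted models.

```python
import random

def lrt_stat(y):
    """LRT statistic 2(l_a - l_0) for H0: mu = 0 vs. H1: mu >= 0, y_i ~ N(mu, 1)."""
    n = len(y)
    ybar = sum(y) / n
    return n * max(ybar, 0.0) ** 2

def bootstrap_p_value(y, n_boot=999, seed=0):
    """Parametric bootstrap p-value for the LRT on the original data y."""
    rng = random.Random(seed)
    n = len(y)
    g_obs = lrt_stat(y)
    exceed = 0
    for _ in range(n_boot):
        # Simulate pseudo data from the fitted null model (here N(0, 1), which
        # has no free parameters) and recompute the LRT statistic
        y_sim = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if lrt_stat(y_sim) >= g_obs:
            exceed += 1
    return g_obs, (1 + exceed) / (n_boot + 1)

# Hypothetical data generated under the alternative (mu = 0.5)
data_rng = random.Random(42)
y = [data_rng.gauss(0.5, 1.0) for _ in range(50)]
g_obs, p_value = bootstrap_p_value(y)
print(g_obs, p_value)
```

With this convention the smallest attainable $p$-value is $1/(s+1)$, so the number of bootstrap replicates bounds the precision with which a small $p$-value can be reported.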

If the LRT on the presence of misrepresentation is significant, then we can perform predictive modeling on the misrepresentation risk by including a logit regression structure on the misrepresentation prevalence according to Equation 4. Since the logit model in Equation 4 is a latent model that can have weak identification (Xia and Gustafson 2016), we recommend a forward selection procedure on risk factors that are significant in predicting the loss distribution. For such variable selection, we may use the $p$ -value from either the Wald test based on the observed Fisher information or the LRT based on the parametric bootstrap.

4.3. Model selection with different loss distributions

For selecting models based on different loss frequency or severity distributions, we may resort to model selection criteria, including the Akaike information criterion (AIC), Bayesian information criterion (BIC), and their extended versions. Unlike the LRT that requires the models to be nested (see Section 6.2 of Goldburd, Khare, and Tevet 2020), Lindsey and Jones (1998) and Burnham and Anderson (2002, Section 6.7) noted that AIC and BIC are valid for GLM model comparisons involving different response distributions when the likelihood functions are calculated based on the same sets of observed variables/records. Based on the type of likelihood functions given in the previous section, we may use these criteria to compare the goodness of fit of the different loss models considered in this paper.

The AIC is defined based on the maximized partial data log-likelihood and the number of parameters in the model. It penalizes the complexity of the model when comparing the goodness of fit (i.e., the maximized partial data log-likelihood). In particular, the AIC is given by

$AIC=2m-2l,$

where $m$ is the number of parameters in the model and $l$ is the maximized log-likelihood: $l=l^p(\widehat{\boldsymbol{\alpha}},\widehat{\varphi},\lambda=0\,|\,\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x})$ for the regular GLM ratemaking models and $l=l^p(\widehat{\boldsymbol{\theta}}\,|\,\boldsymbol{y},\boldsymbol{v^*},\boldsymbol{x})$ for the misrepresentation models. To avoid overfitting in small sample scenarios, a penalty term ${(2m^2+2m)}/{(n-m-1)}$ is added to the AIC, giving rise to the corrected AIC (AICc).

An alternative to the AIC is the BIC, in which the penalty depends on both the sample size and the number of parameters in the model. The BIC can be computed via $BIC=\log(n)m-2l.$
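The three criteria can be computed from a maximized log-likelihood as in the short Python sketch below. The function names and the numeric inputs are hypothetical; only the formulas come from the text.

```python
import math

def aic(loglik, m):
    """AIC = 2m - 2l."""
    return 2.0 * m - 2.0 * loglik

def aicc(loglik, m, n):
    """AIC plus the small-sample penalty (2m^2 + 2m) / (n - m - 1)."""
    return aic(loglik, m) + (2.0 * m * m + 2.0 * m) / (n - m - 1)

def bic(loglik, m, n):
    """BIC = log(n) m - 2l."""
    return math.log(n) * m - 2.0 * loglik

# Hypothetical maximized log-likelihood, parameter count, and sample size
loglik, m, n = -1200.0, 9, 13301
print(aic(loglik, m), aicc(loglik, m, n), bic(loglik, m, n))
print(aicc(loglik, m, n) - aic(loglik, m))  # size of the AICc correction
```

For a sample size in the thousands and fewer than ten parameters, the AICc correction is a small fraction of one unit, so the AIC and AICc are practically indistinguishable, whereas the BIC penalty per parameter, $\log(n)$, is several times the AIC penalty of 2.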

For AIC, AICc, and BIC, a lower value is associated with better goodness of fit, that is, a larger value of the maximized partial data log-likelihood function. These three types of criteria may yield different suggestions for the model selection. There is no golden rule regarding which one is superior to the others. When they favor different models, we may choose the one that has the most recommendations among different criteria. Interested readers may refer to papers by Burnham and Anderson (2002), Fonseca and Cardoso (2007), and Vrieze (2012) for detailed comparisons of the three under various model settings.

4.4. In-sample and out-of-sample prediction

In addition to the AIC/BIC, we may compare and validate predictive models based on in-sample and out-of-sample prediction. Similar to the AIC/BIC that are valid for non-nested models, out-of-sample prediction may be used to compare the predictive power of traditional rating plans as well as regular ratemaking models and those adjusting for misrepresentation.

In predictive analytics and statistical learning, commonly used criteria for evaluating out-of-sample prediction include mean squared error (MSE) and mean absolute error (MAE), which have been widely used for continuous outcomes (see, e.g., Section 2.2 of James et al. 2021). For loss severity and frequency models, including those adjusting for misrepresentation, we use $\boldsymbol{y}=(y_1,\,y_2,\,\cdots,\,y_n)$ to denote a total of $n$ observed values of the claim outcome in Section 3. For defining the MSE, we denote the corresponding predicted values of the claim outcome by $\hat{\boldsymbol{y}}=(\hat{y}_1,\,\hat{y}_2,\,\cdots,\,\hat{y}_n)$ , where $\hat{y}_i$ denotes the conditional expectation of claim outcome given the observed risk factors and estimated parameters. For the misrepresentation models, for example, $\hat{y}_i= \mathbb{E}[Y\,|\,\boldsymbol{\hat{\alpha}},\hat{\varphi},\hat{\lambda},\,V^{*}=v^*_i,\,\boldsymbol{X}^{\mathcal{S}}=\boldsymbol{x}_i]$ is the conditional expectation of the $i$ th observation based on the fitted model with distributional form given in Equation 3. For the sample, the MSE is defined as

$M S E=\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2, \tag{12}$

which is a measure of the average squared distance between the observed values and predicted values of the claim outcome. Alternatively, we may use the root-mean-square error (RMSE), defined as the square root of the MSE.

This formula gives the training MSE since the estimated parameters $(\boldsymbol{\hat{\alpha}},\hat{\varphi},\hat{\lambda})$ are fitted from the same observations (i.e., training data) $\boldsymbol{y}=(y_1,\,y_2,\,\cdots,\,y_n)$ , $\boldsymbol{v^*}=(v_1^*,\,v_2^*,\,\cdots,\,v_n^*)$ and $\boldsymbol{x}=(\boldsymbol{x}_1,\,\boldsymbol{x}_2,\,\cdots,\,\boldsymbol{x}_n)$ . The training MSE measures the quality of in-sample prediction. For predictive models, such as those used in ratemaking and claim analytics, it is important to ensure the quality of out-of-sample prediction since the models are usually fitted from historical data. Thus, in practice, it is common to divide the data into two subsets: training data for estimating the model and test data for evaluating the model’s predictive power in unseen data. Assuming we have an additional $t$ samples as test data, then the test MSE is simply the MSE of the additional test samples given the earlier parameters $(\boldsymbol{\hat{\alpha}},\hat{\varphi},\hat{\lambda})$ from the $n$ training samples.

From Equation 12, we note that the MSE (RMSE) is an appropriate measure of model fit for normal responses in linear regression settings where the variance does not change with the mean. A similar criterion that works for response variables with constant variance is the MAE, defined as $MAE=\sum_{i=1}^n\left|y_i-\hat{y}_i\right|/n$ . Compared with MSE (RMSE), MAE is less impacted by outliers with large magnitudes of residuals. MSE and MAE criteria, however, may not be appropriate for ratemaking models, including those based on lognormal, gamma, Poisson, and negative binomial distributions. In such ratemaking models, as well as in real loss severity/frequency data, the variance typically increases with the mean, causing larger observations to carry higher weights in the MSE and MAE formulas.
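The following Python sketch computes the three criteria and shows, on made-up numbers, how a single large residual dominates the MSE more than the MAE. The function names and data are illustrative assumptions.

```python
import math

def mse(y, y_hat):
    """Mean squared error (Equation 12)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root-mean-square error, the square root of the MSE."""
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Hypothetical observed and predicted values; the last observation is an outlier
y = [1.0, 2.0, 3.0, 100.0]
y_hat = [1.0, 2.0, 3.0, 4.0]
print(mse(y, y_hat), rmse(y, y_hat), mae(y, y_hat))
```

The same functions give the test MSE or MAE when `y` and `y_hat` are taken from held-out records while the parameters behind `y_hat` are estimated from the training records only.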

In order to alleviate such concerns, ratemaking actuaries typically assess prediction accuracy using rank-based methods based on actual and fitted values for nonbinary loss responses that exhibit the nonconstant variance feature (e.g., from residual checking). Such assessments usually involve visual examination of plots, including double lift charts (quintile plots) and Lorenz curves (Gini index values), introduced in Chapter 7 of Goldburd, Khare, and Tevet (2020). Interested readers may refer to their book for a comprehensive review of methods that can be used for validating ratemaking plans, including those based on GLM.

5. Simulation studies

Using simulation studies, we compare the proposed EM algorithm for the misrepresentation models with the Bayesian implementation from Xia, Hua, and Vadnais (2018) in terms of parameter estimation and computation speed. For correctly specified models, the simulation studies serve as an internal consistency test of the ability of the proposed estimation procedures to learn the parameters used to generate the data.

For data generation, we consider severity models based on lognormal and gamma distributions, as well as frequency models based on Poisson and negative binomial distributions. We assume multiplicative rating models with the frequency/severity mean structure given in Equation 2 with a log link function. We generate the true status of the risk factor $V$ from a Bernoulli trial with probability $\theta=0.3$ . Using a misrepresentation probability $p=0.25$ , we generate the corresponding samples of $V^*$ by modifying the samples of $V$ . These parameter values imply a prevalence of misrepresentation of $\lambda = 0.0968$ . Regarding the correctly measured risk factors, we assume there are three additional risk factors denoted by the vector $\boldsymbol{X}^{\mathcal{S}}=(X_1,X_2,X_3)$ . We generate these risk factors based on the following distributions: $X_1\sim\mathrm{Bernoulli}(0.4)$ , $X_2\sim\mathrm{N}(0,1)$ and $X_3\sim\mathrm{beta}(2,1)$ . For Equation 2, we assume the regression coefficients $\alpha_j$ to be different for the severity and frequency models. Table 1 presents the values of parameters used for generating the loss severity and frequency outcomes.
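The implied prevalence can be checked with Bayes' theorem, assuming (as in the one-directional mechanism described above) that only individuals with $V=1$ misreport, so that $\mathbb{P}[V^*=0\,|\,V=1]=p$ and $\mathbb{P}[V^*=0\,|\,V=0]=1$:

$\lambda=\mathbb{P}[V=1\,|\,V^*=0]=\frac{\theta p}{\theta p+(1-\theta)}=\frac{0.3\times 0.25}{0.3\times 0.25+0.7}=\frac{0.075}{0.775}\approx 0.0968.$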

Table 1. Regression coefficients used for generating loss severity and frequency outcomes

Parameter	Lognormal	Gamma	Poisson	Negative binomial
Intercept ( $\alpha_0$ )	1	1	-6	-6
$X_1$ ( $\alpha_1$ )	2	2	2	2
$X_2$ ( $\alpha_2$ )	0	0	2	2
$X_3$ ( $\alpha_3$ )	4	4	-2	-2
$V$ ( $\alpha_4$ )	2	2	4	4
Dispersion	0.427	0.2	—	1
Misrepresentation ( $\lambda$ )	0.097	0.097	0.097	0.097
Overall mean*	581	581	0.33	0.33
Overall SD*	1371	1371	6.10	9.14

* The overall mean and SD are estimated based on a Monte Carlo sample of 100,000,000. Note: For the Poisson data generating mechanism, we observe that the overall (unconditional) variance is much larger than the overall (unconditional) mean. In regression settings, the Poisson conditional mean equals the conditional variance when we fix the values of the risk factors. With the same mean structure, the negative binomial mechanism with a dispersion parameter of 1 gives an unconditional variance that is roughly twice as large as that of the Poisson model.

Using data simulated from the above mechanisms, we compare the proposed EM algorithms with the Bayesian approaches from Xia and Gustafson (2016) and Xia, Hua, and Vadnais (2018) in terms of parameter estimates and computation time. For the lognormal, gamma, Poisson, and negative binomial misrepresentation models, we use the settings of parameter values from Table 1 for generating samples of size $n=10,000$ . In Tables 2 and 3, we report the parameter estimates, standard errors (in parentheses), and computation times in seconds for the loss severity and frequency models, respectively, using the Bayesian and proposed maximum likelihood approaches based on the EM algorithm. In order to assess the impact of ignoring misrepresentation, the first column for each distribution provides MLEs from the unadjusted analysis using a regular GLM.

Table 2. Parameter estimates, standard errors, and computation times from Bayesian and EM estimation of loss severity models

	Lognormal			Gamma
Parameter	Unadjusted	Bayesian	EM	Unadjusted	Bayesian	EM
Int. ( $\alpha_0$ )	1.213 (0.021)	0.997 (0.014)	0.997 (0.014)	1.478 (0.039)	0.980 (0.016)	0.982 (0.015)
$X_1$ ( $\alpha_1$ )	1.996 (0.014)	1.996 (0.009)	1.996 (0.009)	2.005 (0.025)	2.011 (0.010)	2.011 (0.010)
$X_2$ ( $\alpha_2$ )	-0.010 (0.007)	-0.011 (0.004)	-0.011 (0.004)	-0.015 (0.012)	0.004 (0.005)	0.003 (0.005)
$X_3$ ( $\alpha_3$ )	3.990 (0.029)	4.011 (0.019)	4.011 (0.019)	4.044 (0.052)	4.018 (0.021)	4.015 (0.020)
$V$ ( $\alpha_4$ )	1.793 (0.016)	1.999 (0.009)	1.999 (0.009)	1.490 (0.029)	2.004 (0.010)	2.004 (0.010)
Dispersion	0.677	0.421 (0.003)	0.421 (0.003)	0.618 (0.003)	0.202 (0.003)	0.202 (0.003)
$\lambda$	—	0.100 (0.003)	0.100 (0.003)	—	0.104 (0.004)	0.104 (0.004)
Comp. time	0.0	1042.7	2.5	0.1	1909.7	8.7

Table 3. Parameter estimates, standard errors, and computation times from Bayesian and EM estimation of loss frequency models

	Poisson			Negative binomial
Parameter	Unadjusted	Bayesian	EM	Unadjusted	Bayesian	EM
Int. ( $\alpha_0$ )	-4.287 (0.081)	-6.089 (0.141)	-6.092 (0.134)	-4.359 (0.157)	-5.958 (0.221)	-5.951 (0.179)
$X_1$ ( $\alpha_1$ )	2.058 (0.048)	1.991 (0.050)	1.992 (0.050)	1.893 (0.097)	1.964 (0.093)	1.956 (0.092)
$X_2$ ( $\alpha_2$ )	1.915 (0.020)	1.979 (0.023)	1.979 (0.022)	1.972 (0.058)	2.038 (0.058)	2.038 (0.057)
$X_3$ ( $\alpha_3$ )	-1.814 (0.078)	-1.961 (0.080)	-1.959 (0.080)	-1.508 (0.184)	-1.773 (0.190)	-1.790 (0.134)
$V$ ( $\alpha_4$ )	2.304 (0.042)	4.119 (0.114)	4.119 (0.108)	2.135 (0.094)	3.817 (0.140)	3.827 (0.077)
Dispersion	—	—	—	0.306 (0.022)	0.833 (0.084)	0.844 (0.096)
$\lambda$	—	0.099 (0.009)	0.100 (0.008)	—	0.094 (0.011)	0.093 (0.011)
Comp. time	0.1	2404.3	46.9	0.6	5016.6	47.0

From Tables 2 and 3, we observe that the estimates of the parameters are essentially the same for the Bayesian and EM approaches in the current large-sample scenarios, in which the vague priors from the Bayesian approach have minimal effects on the estimation. When compared with the true values in Table 1, both the Bayesian and EM approaches provide estimates of parameters that are close to the true values used to generate the sample (with most differences within about twice the standard error), demonstrating the ability of the proposed misrepresentation models to learn all parameters without requiring labeled data on misrepresentation. The unadjusted GLM approach, on the other hand, provides biased estimates of parameters. In particular, the estimates of $\alpha_4$ from the unadjusted analysis are biased toward zero for all the models, confirming the attenuation effect expected from ignoring the misrepresentation. The similarity of the estimates is consistent with the asymptotic theories on the equivalences of estimators from Bayesian and maximum likelihood approaches (Strasser 1975). In terms of computation time, we observe that the proposed EM approach is roughly 50 to 400 times faster for estimating the misrepresentation models of concern when compared with the Bayesian approaches from Xia and Gustafson (2016) and Xia, Hua, and Vadnais (2018). This conclusion about the computation speed is consistent with that of the lognormal algorithm based on closed forms proposed by Akakpo, Xia, and Polansky (2019).
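The speed-up range quoted above can be reproduced from the computation times reported in Tables 2 and 3. The short Python sketch below is only an arithmetic check; the dictionary layout and variable names are ours.

```python
# Computation times in seconds (Bayesian, EM) taken from Tables 2 and 3
times = {
    "Lognormal": (1042.7, 2.5),
    "Gamma": (1909.7, 8.7),
    "Poisson": (2404.3, 46.9),
    "Negative binomial": (5016.6, 47.0),
}

# Ratio of Bayesian to EM computation time for each model
ratios = {name: bayes / em for name, (bayes, em) in times.items()}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} times faster")
```

The smallest ratio comes from the Poisson model and the largest from the lognormal model.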

6. MEPS misrepresentation analysis

In this section, we perform an empirical study using the 2014 full year consolidated data from the MEPS (AHRQ 2016). In particular, we apply the gamma and lognormal severity models to total medical expenditures and the negative binomial and Poisson models to the number of office-based visits in order to assess the presence and mechanism of misrepresentation in the data. We perform risk effect assessment, misrepresentation testing, model comparison, and model validation using the methods introduced in Section 4.

6.1. Data and background

The MEPS comprises a series of large-scale surveys on the U.S. population regarding medical expenditures, healthcare utilization, and health insurance coverage that are conducted by the Agency for Healthcare Research and Quality. The MEPS data have been used extensively in the actuarial literature for studying the loss severity and frequency patterns of healthcare expenditures and utilization. For example, Hua (2015) used the MEPS data to study the dependence between the medical loss frequency and severity, and Frees (2009, Chapter 16) demonstrated the fitting of the two-part model combining loss severity and frequency (occurrence) with the MEPS data.

In the misrepresentation literature, earlier papers such as those by Xia and Gustafson (2018) and Akakpo, Xia, and Polansky (2019) assessed the presence of misrepresentation in self-reported uninsured status in the 2012, 2013, and 2014 MEPS data. Because the individual insurance mandate of the Patient Protection and Affordable Care Act (PPACA) introduced a tax penalty in 2014, the authors suspected that MEPS respondents had a financial incentive to misrepresent their insurance status. Based on an empirical analysis treating uninsured status as a response variable, Xia and Gustafson (2018) did not find statistically significant evidence of misrepresentation in the 2012 data. Using a simplified lognormal model including only the risk factor subject to misrepresentation, Akakpo, Xia, and Polansky (2019) performed an LRT on the existence of misrepresentation and found the test to be insignificant with the 2013 data but significant with the 2014 data (when the individual mandate took effect). The simplified model from Akakpo, Xia, and Polansky (2019) ignored other risk factors in the MEPS data that are predictive of total medical expenditures and overlooked the potential influence other risk factors might have on the prevalence of misrepresentation.

Here, we conduct a more extensive empirical study on the presence of misrepresentation in uninsured status from the 2014 MEPS data using loss severity and frequency models based on lognormal, gamma, Poisson, and negative binomial distributions. Moreover, in an effort to understand the mechanism of the misrepresentation, we study whether the prevalence of misrepresentation varies in response to some of the risk factors. The response variables of interest are the total medical expenditures variable for the loss severity model and the number of office-based visits for the loss frequency model. In addition to uninsured status, risk factors we consider in the analysis include sex, age, smoking, and health status. For the analysis, we exclude adults over 65 who are eligible for Medicare.

For the loss severity misrepresentation model on total medical expenditures, the total sample size is 13,301 after excluding records with missing values and zero total expenditures. For the loss frequency model on total number of office-based visits, we use the same 13,301 records in order to obtain comparable results from the two models with different loss outcomes. The health status variable contains five levels, representing a range of excellent to poor health. The summary statistics of the variables are given in Table 4.

Table 4. Summary statistics for the variables used in the misrepresentation models on total medical expenditures and total number of office-based visits

Variable	Mean	Standard Deviation
Uninsured	0.12	0.33
Sex (Male)	0.42	0.49
Age	41.7	13.3
Smoking	0.17	0.37
Health	2.43	1.07
Total expenditures	5042	13585
Office-based visits	6.23	12.72
Zeros in office-based visits	0.16	0.37

From the table, we observe that a representative individual in the data is an insured nonsmoking female of about 42 years of age who has an average health condition. From Figure 7 in Akakpo, Xia, and Polansky (2019), the logarithm of the total medical expenditures variable has an empirical distribution that is approximately normal. For the frequency analysis, the variance of the number of office-based visits variable is substantially larger than the mean, raising a concern of potential overdispersion even after adjusting for the risk factors we consider. There are a total of 2,122 zeros (16%), consisting of individuals with no office-based visits but with other types of medical expenditures such as outpatient care and emergency services.

6.2. Severity misrepresentation analysis

We first perform a loss severity analysis and thus set the total medical expenditures variable as the response variable $Y$ . The first set of models we consider are the unadjusted GLM ratemaking models based on the gamma and lognormal loss severity distributions. We specify the uninsured status as the variable $V$ that is subject to misrepresentation. The additional risk factors of concern, $\boldsymbol{X}^{\mathcal{S}}$ , include sex, age, smoking, and health status. We then consider two types of adjusted models from Section 2.3 that account for the misrepresentation. The first type assumes that the prevalence of misrepresentation is fixed, whereas the second type assumes that the prevalence changes with one of the aforementioned risk factors under consideration. For each of the loss severity distributions, we consider a total of $1+1+4=6$ models for the analysis on the total medical expenditures variable. For the two distributions, the MLEs of the parameters for the $6\times2=12$ models are obtained using the proposed EM algorithm implemented in R, with the inference and model comparison tools introduced previously.

Since the models we consider assume different loss distributions, we first perform a model selection based on the AIC, AICc, and BIC introduced in the previous section. Based on the AIC, AICc, and BIC, we choose the lognormal distribution over the gamma distribution because it yields smaller values for all the adjusted models (Models II to VI); only for the unadjusted model (Model I) does the gamma distribution give smaller values. In Table 5, we present the goodness-of-fit statistics for the six severity models based on lognormal and gamma distributions. From the table, we observe that according to the BIC, the model with the best goodness of fit is the lognormal misrepresentation model without a latent logit structure on the prevalence of misrepresentation (Model II). According to the AIC and AICc, however, the model with a logit structure relating the prevalence of misrepresentation to age has a slightly better goodness of fit (Model IV).

Table 5. Model selection based on AIC, AICc, and BIC for the severity misrepresentation models on total medical expenditures

Model		I	II	III	IV	V	VI
	Criteria	Unadj.	No.logit	Logit.sex	Logit.age	Logit.smoke	Logit.health
Gamma	AIC	243090	241156	241141	241129	241137	241082
	AICc	243090	241156	241141	241129	241137	241082
	BIC	243142	241216	241209	241197	241204	241149
Lognormal	AIC	244255	239834.8	239837	239834.6	239838	239839
	AICc	244255	239834.8	239837	239834.6	239838	239839
	BIC	244308	239887	239904	239902	239906	239906

Note: Owing to the large sample size, the AIC and AICc values are almost identical for all the models.

Note that the AIC and BIC consider the goodness of fit of the data based on the maximized (partial data) log-likelihood as well as the complexity of the models. They serve a different purpose than the test on the presence of misrepresentation. In order to obtain statistical conclusions that take into consideration the uncertainty in the estimation, we may perform statistical tests on the presence and mechanism of the misrepresentation and the risk effects using the lognormal misrepresentation models favored by the AIC and BIC.

In Table 6, we present the results from the LRT on the presence of misrepresentation and the Wald test on the effect of each risk factor on the prevalence of misrepresentation for the lognormal misrepresentation model.

Table 6. Statistical inference on the presence and mechanism of misrepresentation for the lognormal misrepresentation models on total medical expenditures

Test	$\mathbf{H_0}$	Models	Estimate	S.E.	$p$ -value	OR.LB	OR.UB
LRT $G^2$	$\lambda=0$	I vs. II	25.73	—	$<\mathbf{0.002}$	—	—
Sex (male)	$\beta_1=0$	II vs. III	0.458	0.320	0.1523	0.834	2.997
Age	$\beta_2=0$	II vs. IV	-0.231	0.108	0.0318	0.640	0.984
Smoking	$\beta_3=0$	II vs. V	0.325	0.372	0.3822	0.658	2.909
Health	$\beta_4=0$	II vs. VI	0.027	0.087	0.7530	0.864	1.222

Note: The first row gives the LRT statistic and p-value, with an estimate of $\hat{\lambda}=0.0249$ for the prevalence of misrepresentation.

From Table 6, the LRT on the presence of misrepresentation is significant at the 5% level. According to the parameter estimates, the odds of misrepresentation are multiplied by $\exp(-0.231)\approx 0.79$ (a decrease of about 21%) when the age of the participant increases by one standard deviation (i.e., 13.3 years). The $95\%$ confidence interval for the relativity is (0.640, 0.984) for the age variable, with the effect being significant at the 5% level. None of the other risk factors is significant in predicting the prevalence of misrepresentation. After combining results from Tables 5 and 6, we compare the estimates from the unadjusted lognormal model, the adjusted misrepresentation model without a logit structure, and the adjusted model with the prevalence of misrepresentation depending on age. Using the three models, we can then compare the estimates of the true risk effects that uninsured status, sex, age, smoking, and health have on average total medical expenditures.
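The Wald quantities in Table 6 can be approximated from the reported estimate and standard error. The sketch below assumes a two-sided normal-theory Wald test and a confidence interval for the odds ratio obtained by exponentiating the interval for the coefficient. The function name `wald_summary` is hypothetical. Because the inputs are the rounded table entries, the output is expected to match the tabulated values only up to rounding.

```python
import math
from statistics import NormalDist

def wald_summary(estimate, std_error, level=0.95):
    """Two-sided Wald test of H0: beta = 0, plus the confidence
    interval for the odds ratio exp(beta)."""
    z = estimate / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(estimate - z_crit * std_error)
    upper = math.exp(estimate + z_crit * std_error)
    return {
        "odds_ratio": math.exp(estimate),
        "z": z,
        "p_value": p_value,
        "ci": (lower, upper),
    }

# Age row of Table 6 (rounded estimate and standard error).
age = wald_summary(-0.231, 0.108)
print(age)
```

An odds ratio below 1 whose interval excludes 1 corresponds to a significant decrease in the odds of misrepresentation per standard deviation of age.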

In Figure 1, we present the estimated relativity and the 95% confidence intervals on the relative effect of the risk factors on the odds of misrepresentation and the loss severity as measured by average total medical expenditures. Figure 1a presents the relative effect of the age variable on the odds of misrepresentation $\lambda/(1-\lambda)$ for the adjusted model with a logit structure. We set age 18 as the reference level associated with a relativity of 1. We observe that the odds (prevalence) of misrepresentation decreases with age for the individuals with positive medical expenditures. Figure 1b presents the estimated relativity and the 95% confidence intervals of the relativity, that is, the relative effect of each risk factor on average total medical expenditures for the six models of interest. For the age variable, relativity corresponds to an increase of one standard deviation (13.3 years). For the health variable, relativity corresponds to an increase of one unit of health level (from 1 to 5, increasing as the health condition deteriorates).

Figure 1. The 95% confidence intervals of the relative effects of the risk factors on the odds of misrepresentation (a) and average total medical expenditures (b)

Note: The odds ratio represents the relativity on the odds of misrepresentation, $\lambda/(1-\lambda)$ .

From Figure 1, we observe that the adjustment to the severity and mechanism of misrepresentation seems to result in a difference in the estimated relativity of uninsured status (a larger negative effect with relativity further away from 1) for all adjusted models. Note that the driver of incentive to deny an uninsured status differs from that of insurance misrepresentation, in which policyholders tend to not disclose risk-increasing statuses to avoid higher premiums. For insurance misrepresentation in the ratemaking context, the $V$ effect would be a mirror image of what is shown in Figure 1, with the unadjusted estimate showing attenuation regardless of the direction of error (Xia and Gustafson 2016). Misrepresentation modeling also results in smaller age and smoking effects (with relativity closer to 1) estimated from the adjusted model with a prevalence of misrepresentation based on the age. For all adjusted models, modeling misrepresentation seems to result in noticeable differences in estimated standard errors of sex, age, smoking, and health effects. Whereas earlier papers, such as Xia and Gustafson (2016), have demonstrated the identifiability of misrepresentation models guaranteeing statistical consistency, the unobserved feature of misrepresentation does result in an efficiency loss manifested by wider confidence intervals for most parameters. Apart from the smoking variable, whose relativity is close to 1, the risk factors show clear effects: uninsured individuals and males have substantially lower average total medical expenditures, while older individuals and those with worse health conditions have much higher average total medical expenditures after adjusting for other risk factors in the model.

Based on the significance of the LRT on $\lambda$ , we may conclude that the test on the presence of misrepresentation is statistically significant, suggesting potential misrepresentation. Regarding the misrepresentation mechanism, however, we select the simpler misrepresentation model without a logit structure based on BIC.

6.3. Frequency misrepresentation analysis

For the loss frequency analysis, we set the number of office-based visits as the response variable $Y$ . Since the response variable is a count variable, we consider unadjusted GLM ratemaking models based on the Poisson and negative binomial loss frequency distributions. The risk factors $V$ (subject to misrepresentation), $\boldsymbol{X}^{\mathcal{S}}$ , and $\boldsymbol{X}^{\mathcal{T}}$ are the same as those for the severity models. For each of the loss frequency distributions, we consider the same six unadjusted and adjusted models for the analysis of the number of office-based visits variable. For the two distributions, the MLEs of the parameters for the $6\times2=12$ models are obtained using the proposed EM algorithms implemented in R, with the inference and model comparison tools introduced previously.

Similar to the misrepresentation severity analysis, we first perform a model selection based on the AIC, AICc, and BIC introduced in the previous section. In Table 7, we present the goodness-of-fit statistics for the six frequency misrepresentation models based on the Poisson and negative binomial distributions. From the table, we observe that according to all three types of criteria, the model with the best goodness of fit is the negative binomial misrepresentation model without a latent logit model on the prevalence of misrepresentation (Model II), followed by the model with a logit structure between the prevalence of misrepresentation and the age of the participant (Model IV). Consistent with the lognormal model on total medical expenditures, the goodness-of-fit statistics favor Models II and IV.

Table 7. Model selection based on AIC, AICc, and BIC for the frequency misrepresentation models on the number of office-based visits

Model		I	II	III	IV	V	VI
	Criteria	Unadj.	No.logit	Logit.sex	Logit.age	Logit.smoke	Logit.health
Poisson	AIC	163059	99883	99808	99884	99866	99881
	AICc	163059	99883	99808	99884	99866	99881
	BIC	163127	99937	99868	99944	99926	99941
Negative binomial	AIC	66656	65338.0	65338.4	65338.2	65339	65341
	AICc	66656	65338.0	65338.4	65338.2	65339	65341
	BIC	66709	65398	65405.9	65405.7	65407	65408

Note: Owing to the large sample size, the AIC and AICc values are almost identical for all the models.

Similar to the severity misrepresentation model, we perform statistical inference on the presence and mechanism of the misrepresentation and the risk effects using the negative binomial misrepresentation models. In Table 8, we present the results from the LRT on the presence of misrepresentation and the Wald test on the effect of each risk factor on the prevalence of misrepresentation based on the negative binomial misrepresentation models. The better goodness of fit for the negative binomial models (when compared with the Poisson models according to the AIC and BIC) is consistent with the fact that the variance of the number of office-based visits variable is substantially larger than the mean.
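The overdispersion mentioned above can be screened with a simple variance-to-mean ratio before any model is fitted. The sketch below is only an illustration: the function name `dispersion_ratio` and the count vector are hypothetical, and the data are not MEPS values.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean. Values well above 1
    point to overdispersion relative to the Poisson distribution."""
    return variance(counts) / mean(counts)

# Hypothetical visit counts for illustration only.
visits = [0, 0, 1, 1, 2, 3, 5, 8, 14, 40]
ratio = dispersion_ratio(visits)
print(f"variance / mean = {ratio:.2f}")
```

A ratio near 1 would be compatible with a Poisson model, while a much larger ratio motivates a distribution with an extra dispersion parameter such as the negative binomial.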

Table 8. Statistical inference on the presence and mechanism of misrepresentation for the negative binomial misrepresentation models on the number of office-based visits

Test	$\mathbf{H_0}$	Models	Estimate	S.E.	$p$ -value	OR.LB	OR.UB
LRT $G^2$	$\lambda=0$	I vs. II	1320	—	$\mathbf{<}$ 0.002	—	—
Sex (male)	$\beta_1=0$	II vs. III	0.251	0.118	0.0331	1.016	1.628
Age	$\beta_2=0$	II vs. IV	0.045	0.062	0.4698	0.924	1.185
Smoking	$\beta_3=0$	II vs. V	-0.087	0.145	0.5514	0.686	1.227
Health	$\beta_4=0$	II vs. VI	-0.231	0.056	$<$ 0.0001	0.710	0.888

Note: The first row gives the LRT statistic and $p$ -value, with an estimate of $\hat{\lambda}=0.0022$ for the prevalence of misrepresentation.

From Table 8, the LRT on the presence of misrepresentation is significant at the 5% level. The sex and health variables are significant based on the Wald tests on their effects on the prevalence of misrepresentation. According to the model estimates, each increase in age of one standard deviation (13.3 years) multiplies the odds of misrepresentation by about $\exp(0.045)\approx 1.05$, with the $95\%$ confidence interval of the relativity being (0.924, 1.185). Despite the difference in the sign of the age effect, the confidence interval overlaps with that of the severity analysis using the lognormal model, meaning that the difference in the estimates is not significant for the two models. Indeed, the confidence intervals of all the variables of sex, age, smoking, and health status overlap, revealing that the estimates of the effects are not significantly different for the lognormal and negative binomial models on the mechanism of misrepresentation when two different response variables are used. Based on the results from the goodness-of-fit and the Wald tests, we compare the unadjusted negative binomial model with the adjusted misrepresentation models with and without a logit model on the sex, age, and health variables. Using these models, we then compare the estimates of the effects of uninsured status, sex, age, smoking, and health status on the average number of office-based visits.

In Figure 2, we present the estimated relativity and the 95% confidence intervals on the relative effect of the risk factors on the odds of misrepresentation and the average number of office-based visits. Figure 2a presents the relative effect of age on the odds of misrepresentation $\lambda/(1-\lambda)$ . We set age 18 as the reference level associated with a relativity of 1. Figure 2b presents the estimated relativity and the 95% confidence intervals of the relativity of each risk factor on the average number of office-based visits for the six models under consideration.

Figure 2. The 95% confidence intervals of the relative effects of the risk factors on the odds of misrepresentation (a) and the average number of office-based visits (b)

Note: The odds ratio represents the relativity on the odds of misrepresentation, $\lambda/(1-\lambda)$ .

The results on the frequency relativities (Figure 2b) are very similar to those from the lognormal misrepresentation models on the total medical expenditures. From Figure 2a, we observe that the estimated odds (prevalence) of misrepresentation slightly increases with age, according to the misrepresentation model on the number of office-based visits. The confidence region (the shaded area), however, overlaps with that from the lognormal model, indicating that the estimates are not significantly different for the two models with different loss outcomes. Regarding the risk effects on the loss frequency (Figure 2b), we observe that the adjustment to the severity and mechanism of misrepresentation seems to result in a difference in the estimated relativity of uninsured status (a larger negative effect with relativity further away from 1), age (a larger effect with relativity further away from 1), sex (a slightly larger negative effect), and health status (a smaller relativity). Due to the aforementioned difference in the misrepresentation incentive, the $V$ effect would be a mirror image of what is shown in Figure 2 in a ratemaking context. Modeling of misrepresentation seems to result in noticeable differences in estimated standard errors of sex, age, smoking, and health effects. The wider intervals we observe for most parameters are a manifestation of the efficiency loss caused by not observing the misrepresentation status at the policy level. Except for the smoking variable (associated with relativity around 1), uninsured individuals and males have a substantially lower average number of office-based visits, while older individuals and those with worse health conditions have a much higher average number of office-based visits.

Based on the significance of the LRT on $\lambda$ , we conclude that the statistical test on the presence of misrepresentation is significant, suggesting potential misrepresentation. Due to the inconsistency of results from the loss severity and frequency models concerning the misrepresentation mechanism, we select the misrepresentation models without a logit structure based on the AIC/BIC that penalize model complexity.

6.4. In-sample and out-of-sample validation

For the MEPS analysis, the conclusions on the presence of misrepresentation and risk effects from the previous subsections are made using statistical tests and goodness-of-fit criteria based on (in-sample) data used to fit the models. For the severity models on total medical expenditures and the frequency models on the number of office-based visits, it would be helpful to conduct model validation based on both in-sample and out-of-sample prediction. For predictive analytics purposes, we evaluate each of the six models under each loss distribution, including models that perform less favorably based on in-sample tests. Using the actual and predicted values from the models, we conduct comprehensive in-sample and out-of-sample assessments based on RMSE and MAE criteria, as well as double lift charts and Lorenz curves (Gini index values) typically adopted for ratemaking (Goldburd, Khare, and Tevet 2020).

For both loss severity and frequency modeling, we use the first 10,000 observations to train each model and the remaining 3,301 observations to evaluate out-of-sample prediction. For the lognormal models, we fit normal models on the logarithm of total medical expenditures, $\log(Y)$ , and calculate a first set of RMSE and MAE based on the normal residuals using the predicted normal mean $\mu$ and the log-transformed actual expenditures at the individual level. The corresponding lognormal residuals are obtained based on the original medical expenditures and the predicted lognormal mean given by $\exp(\mu+\sigma^2/2)$ . The gamma model residuals are obtained from the original expenditures and the predicted gamma mean $\mu$ at the individual level, while the gammalog residuals (comparable to normal ones) are obtained from the logarithm of expenditures and the predicted mean calculated from $\mathbb{E}\left[\log(Y)\right]=\log(\mu)+\log(\phi)+\mathrm{digamma}(1/\phi)$ , where $\mathrm{digamma}(\cdot)$ denotes the logarithmic derivative of the gamma function. We plot the normal and gammalog residuals against the corresponding predicted values, and the variance seems to stay constant across different predicted values. In Tables 9 and 10, we report, respectively, in-sample (train) and out-of-sample (test) RMSE and MAE for the severity and frequency models. Due to the equivalence of MLEs and least squares estimates (LSEs) under linear regression settings, we may expect the in-sample RMSE to be smaller for normal residuals that correspond to the lognormal models selected from LRT and AIC/BIC based on maximized likelihood functions.
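The predicted means and error criteria described above can be sketched as follows. This is an illustration only, assuming a gamma parameterization with mean $\mu$ and dispersion $\phi$ (shape $1/\phi$, scale $\mu\phi$). All function names are hypothetical, the digamma function is approximated with a recurrence and an asymptotic series to keep the block free of external libraries, and the numerical inputs are made up.

```python
import math

def digamma(x):
    """Digamma function for x > 0: shift x upward with the recurrence
    psi(x) = psi(x + 1) - 1/x, then apply an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def lognormal_mean(mu, sigma2):
    """Mean of Y when log(Y) is normal with mean mu and variance sigma2."""
    return math.exp(mu + sigma2 / 2)

def gamma_log_mean(mu, phi):
    """E[log(Y)] for a gamma response with mean mu and dispersion phi."""
    return math.log(mu) + math.log(phi) + digamma(1 / phi)

def rmse(actual, predicted):
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical parameter values and observations for illustration only.
print(lognormal_mean(mu=7.0, sigma2=2.5))
print(gamma_log_mean(mu=3000.0, phi=1.8))
log_actual = [6.2, 7.9, 8.4]
log_predicted = [7.0, 7.5, 8.0]
print(rmse(log_actual, log_predicted), mae(log_actual, log_predicted))
```

By Jensen's inequality, `gamma_log_mean(mu, phi)` is always below `log(mu)`, which is why the gammalog residuals use the adjusted expectation rather than the logarithm of the predicted gamma mean.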

Table 9. Model validation based on in-sample (train) RMSE and MAE criteria for severity models on total medical expenditures and frequency models on office-based visits

Model		I	II	III	IV	V	VI
	Criteria	Unadj.	No.logit	Logit.sex	Logit.age	Logit.smoke	Logit.health
Gammalog	RMSE	1.94592	1.66299	1.65933	1.66505	1.66080	1.65389
	MAE	1.55227	1.31218	1.30954	1.31316	1.31100	1.30577
Normal	RMSE	1.59843	1.59847	1.59831	1.59829	1.59861	1.59852
	MAE	1.26593	1.26589	1.26593	1.26565	1.26593	1.26593
Gamma	RMSE	12905.3	12922.5	12933.1	12918.6	12920.8	12887.3
	MAE	5339.97	5407.44	5391.06	5363.73	5397.26	5439.88
Lognormal	RMSE	12927.9	12923.4	12920.9	12921.4	12923.3	12923.3
	MAE	5684.49	5627.46	5684.49	5614.66	5626.81	5628.08
Poisson	RMSE	12.99710	13.0504	12.9978	12.9979	12.9979	12.9979
	MAE	6.30351	6.38090	6.24280	6.24283	6.24286	6.24282
Neg. bin.	RMSE	12.99711	13.0134	13.0127	13.0156	13.0170	13.0165
	MAE	6.30358	6.32468	6.32410	6.32957	6.32877	6.33022

Note: For lognormal and gamma models, RMSE and MAE criteria seem to be appropriate for normal and gammalog residuals under the log scale. For the frequency models, even the in-sample results contradict those from the LRT and BIC/AIC on nested model comparisons of loss distributions and misrepresentation mechanisms, suggesting RMSE/MAE may not be appropriate.

Table 10. Model validation based on out-of-sample (test) RMSE and MAE for severity models on total medical expenditures and frequency models on office-based visits

Model		I	II	III	IV	V	VI
	Criteria	Unadj.	No.logit	Logit.sex	Logit.age	Logit.smoke	Logit.health
Gammalog	RMSE	1.94611	1.66337	1.65932	1.66721	1.66066	1.65745
	MAE	1.54705	1.30200	1.29880	1.30443	1.30047	1.29746
Normal	RMSE	1.60605	1.60633	1.60606	1.60586	1.60650	1.60637
	MAE	1.26088	1.26105	1.26088	1.26057	1.26117	1.26109
Gamma	RMSE	13415.1	13421.0	13427.7	13421.9	13421.6	13431.5
	MAE	5336.41	5385.68	5367.97	5351.56	5380.26	5441.96
Lognormal	RMSE	13445.8	13439.6	13436.8	13437.3	13439.9	13439.8
	MAE	5676.81	5621.45	5676.81	5610.13	5621.89	5622.24
Poisson	RMSE	11.8415	11.9033	11.8386	11.8386	11.8385	11.8386
	MAE	6.05976	6.13997	5.99765	5.99754	5.99779	5.99757
Neg. bin.	RMSE	11.8413	11.8573	11.8588	11.8620	11.8610	11.8628
	MAE	6.05944	6.08480	6.08671	6.09231	6.09063	6.09272

Note: For lognormal and gamma models, RMSE and MAE criteria seem to be appropriate for normal and gammalog residuals under the log scale. For the frequency models, both the in-sample and out-of-sample results contradict those from the LRT and BIC/AIC on nested model comparisons of loss distributions and misrepresentation mechanisms, suggesting RMSE/MAE may not be appropriate.

Based on the residual assessment, the RMSE and MAE criteria seem to be appropriate for the normal and gammalog residuals, with variance staying constant across different predicted values. Hence, we may use the results from the top block of each table with better confidence. For the severity models, the normal model on the transformed expenditures (i.e., lognormal model evaluated using residuals from the log scale) with a logit structure on age seems to give the best prediction based on both in-sample and out-of-sample RMSE/MAE criteria. The gammalog results are also consistent with those from AIC/BIC regarding the mechanism of misrepresentation under gamma models. The results are consistent with the in-sample results from Tables 5 and 6 on the selection of loss distribution and mechanism of misrepresentation per the AIC/AICc and LRT. For the MEPS expenditure data, AIC/AICc seems to perform better than BIC for anticipating out-of-sample performance. The in-sample and out-of-sample performance is consistent, confirming that the selected lognormal misrepresentation model provides a reasonable representation of MEPS total expenditures.

From the two blocks at the bottom of Tables 9 and 10, we observe that the conclusions on the misrepresentation mechanism and loss distributions contradict those from the LRT and BIC/AIC reported in the previous subsections. Particularly for the frequency models, even the in-sample results contradict those from the LRT and BIC/AIC regarding nested model comparisons on the choice of distribution and mechanism of misrepresentation (e.g., the Poisson and negative binomial models are nested given the same model structure in each column, and those considered in the LRT are nested in each row). When we experiment by replacing the raw residuals with standardized (working or Pearson’s) residuals for the frequency models, the conclusions are contradictory to those from Tables 7 and 8. This suggests that the RMSE/MAE criteria may not be appropriate for loss frequency models, probably owing to the discrete nature of data.

Hence, we may resort to graphical methods commonly used for ratemaking plans that may be less impacted by the residual magnitudes. We use the double lift charts and Lorenz curves along with the Gini index (Goldburd, Khare, and Tevet 2020) to compare the in-sample and out-of-sample prediction of the severity and frequency models under consideration. We implement the graphical methods based on the R code from the study note for Exam 3 for the Certified Specialist in Predictive Analytics certification offered by the Casualty Actuarial Society Institute (iCAS 2019).

We first create double lift charts comparing unadjusted models with misrepresentation models without a logit structure on the prevalence of misrepresentation. The sort ratio is calculated based on the predicted values from two models from the same loss distribution. The observations are divided into deciles to plot the average observed values and the average predicted values from the models for comparing prediction accuracy within each decile. Since the misrepresentation models seem to provide similar predictions regardless of the misrepresentation mechanism, we decide to plot models within each row of Tables 9 and 10 in the same plot in order to reduce the total number of plots. Figure 3 reports the double lift charts for comparing the unadjusted and adjusted severity models on the total medical expenditures.
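The construction of a double lift chart described above can be sketched in a few lines. This is not the iCAS study-note code; it is a minimal illustration in which the function name `double_lift_table` and the synthetic data are assumptions. The sort ratio is taken as the adjusted prediction divided by the unadjusted prediction, and the observations are split into equal-count bins.

```python
import random

def double_lift_table(actual, pred_unadj, pred_adj, n_bins=10):
    """Sort observations by the ratio pred_adj / pred_unadj, split them into
    equal-count bins, and average the actual and predicted values per bin."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: pred_adj[i] / pred_unadj[i])
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:  # fewer observations than bins
            continue
        rows.append({
            "bin": b + 1,
            "actual": sum(actual[i] for i in idx) / len(idx),
            "unadjusted": sum(pred_unadj[i] for i in idx) / len(idx),
            "adjusted": sum(pred_adj[i] for i in idx) / len(idx),
        })
    return rows

# Synthetic data for illustration only (not MEPS data).
random.seed(0)
pred_unadj = [random.uniform(1.0, 10.0) for _ in range(200)]
pred_adj = [p * random.uniform(0.8, 1.2) for p in pred_unadj]
actual = [p * random.expovariate(1.0) for p in pred_adj]

table = double_lift_table(actual, pred_unadj, pred_adj)
for row in table:
    print(row["bin"], round(row["actual"], 2),
          round(row["unadjusted"], 2), round(row["adjusted"], 2))
```

Plotting the three per-bin averages against the bin number gives the chart: the model whose line tracks the actual averages more closely in the outer bins, where the two models disagree most, is the better predictor.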

Figure 3. In-sample and out-of-sample double lift charts for comparing the unadjusted and adjusted severity models on total medical expenditures

Note: The sort ratio is calculated as the adjusted model predictions divided by the unadjusted model predictions.

From Figures 3a and 3b, we observe that the results are similar for the unadjusted model and adjusted models with different misrepresentation mechanisms when the lognormal distribution is assumed for the total medical expenditures. In Figures 3c and 3d, the adjusted gamma models seem to capture the actual expenditures better than the unadjusted models at the lower and upper bins where the two types of models disagree the most. Figure 4 reports the double lift charts for comparing the unadjusted and adjusted frequency models on the number of office-based visits.

Figure 4. In-sample and out-of-sample double lift charts for comparing the unadjusted and adjusted frequency models on office-based visits

Note: The sort ratio is calculated as the adjusted model predictions divided by the unadjusted model predictions.

From Figures 4a and 4b, we observe that the adjusted models seem to work better in capturing the actual numbers of office-based visits under the Poisson distribution, except for the one without a logit structure. In Figures 4c and 4d, the unadjusted negative binomial model gives predictions that are relatively flat and better at capturing the slight decreasing trend at upper bins, while the adjusted counterparts give predictions that better capture the overall increasing trend at the lower bins. Due to the volatility in the frequency data, it seems to be harder to make a clear decision on model performance.

We further resort to the Gini index and Lorenz curves to measure the lift of severity and frequency models. For each model, the Lorenz curve plots the cumulative percentage of predicted values (in expenditures or numbers of visits) against the total percentage of individuals after sorting the individuals based on their predicted values. The Gini index is defined as twice the area between the Lorenz curve and the diagonal line (the line of equality). For model validation, the Gini index quantifies the ability of the model to segment the worst and best risks based on predicted values. In order to facilitate comparisons of Gini index values and to reduce the total number of figures, we plot the Lorenz curves of different severity (or frequency) models in the same figure, along with the one based on the actual experience data. Figure 5 presents the Lorenz curves for comparing loss distributions and misrepresentation mechanisms under the severity models on total medical expenditures.
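The Lorenz curve and Gini index as defined above can be computed directly from a vector of predicted (or actual) values. The sketch below follows the description in the text, sorting individuals by the same values that are being cumulated, and approximates the area under the curve by the trapezoidal rule. The function names and the example values are hypothetical.

```python
def lorenz_curve(values):
    """Cumulative share of `values` against cumulative share of individuals,
    after sorting individuals in increasing order of `values`."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    xs, ys, running = [0.0], [0.0], 0.0
    for i, v in enumerate(ordered, start=1):
        running += v
        xs.append(i / n)
        ys.append(running / total)
    return xs, ys

def gini_index(values):
    """Twice the area between the line of equality and the Lorenz curve,
    with the area under the curve computed by the trapezoidal rule."""
    xs, ys = lorenz_curve(values)
    area_under = sum(
        (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs))
    )
    return 1.0 - 2.0 * area_under

# Hypothetical predictions for illustration only.
flat_predictions = [5.0, 5.0, 5.0, 5.0]        # no segmentation
spread_predictions = [1.0, 2.0, 4.0, 13.0]     # stronger segmentation
print(gini_index(flat_predictions), gini_index(spread_predictions))
```

A model that assigns nearly the same prediction to everyone produces a Lorenz curve close to the diagonal and a Gini index near 0, while predictions that separate low and high risks more sharply push the index upward.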

Figure 5. In-sample and out-of-sample Lorenz curves for comparing loss distributions and misrepresentation mechanisms under severity models on total medical expenditures

Note: To make the curves comparable in terms of Gini index values, each curve is created using its corresponding order of individuals based on the predicted/actual values of concern.

From Figure 5, we observe that all the models seem to provide Gini index values of about half of those of actual expenditures, suggesting a larger number of risk factors may be needed for achieving better risk segmentation. The best performing model seems to be the gamma misrepresentation model with the health variable affecting the prevalence of misrepresentation. The conclusions are consistent for in-sample and out-of-sample prediction. In Figure 6, we present the Lorenz curves for comparing loss distributions and misrepresentation mechanisms under the frequency models on the number of office-based visits.

Figure 6. In-sample and out-of-sample Lorenz curves for comparing loss distributions and misrepresentation mechanisms under frequency models on office-based visits

Note: To make the curves comparable in terms of Gini index values, each curve is created using its corresponding order of individuals based on the predicted/actual values of concern.

From Figure 6, we observe that all of the frequency models seem to provide Gini index values much smaller than those of actual numbers of office-based visits, indicating additional difficulty of achieving good risk segmentation for the discrete frequency models. The best performing model seems to be the Poisson misrepresentation model without a logit structure on the prevalence of misrepresentation. The conclusions are consistent for in-sample and out-of-sample prediction. Table 11 presents the corresponding Gini index values associated with the Lorenz curves for comparing loss distributions and misrepresentation mechanisms under the severity and frequency models.

Table 11. Model validation based on Gini index for severity models on total medical expenditures and frequency models on number of office-based visits

Model		I	II	III	IV	V	VI
	Data	Unadj.	No.logit	Logit.sex	Logit.age	Logit.smoke	Logit.health
Gamma	Train	0.34679	0.36303	0.36296	0.34594	0.36295	0.39198
	Test	0.34709	0.36427	0.36467	0.34640	0.36421	0.39299
Lognormal	Train	0.36318	0.36682	0.36617	0.36617	0.36700	0.36702
	Test	0.36448	0.36792	0.36713	0.36713	0.36794	0.36812
Poisson	Train	0.02124	0.09830	0.01933	0.01923	0.01903	0.01932
	Test	0.02098	0.09478	0.01904	0.01894	0.01874	0.01903
Neg. bin.	Train	0.02067	0.05566	0.05738	0.06363	0.06048	0.06379
	Test	0.02042	0.05357	0.05489	0.06104	0.05815	0.06152

Note: The Gini index values are 0.7521 and 0.6644, respectively, for the actual severity and frequency data.

From Table 11, the Gini index criteria indicate that the severity model with the best ability to segment worst and best risks based on in-sample and out-of-sample prediction is the gamma misrepresentation model with a logit structure (with health variable) on the prevalence of misrepresentation. Among the frequency models, the Poisson misrepresentation model without a logit structure seems to have the best ability to differentiate risks according to the Gini index for both in-sample and out-of-sample prediction. For the lognormal and negative binomial models, the misrepresentation models (with or without a logit structure) have better ability to segment risks compared with their unadjusted counterparts. The ranks of predictive ability are consistent for in-sample and out-of-sample prediction. Although the model validation criteria considered in this subsection seem to favor distributions and mechanisms different from those in the previous subsections, all the criteria seem to reveal the advantage of misrepresentation models in enhancing predictive power, both in-sample and out-of-sample.

6.5. Practical considerations

Using the 2014 MEPS data, we have illustrated the use of proposed algorithms in estimating parameters on the risk effects as well as those related to the existence and mechanism of misrepresentation under models with different loss frequency and severity distributions. For model comparison purposes, we have provided various quantitative methods based on goodness of fit and in-sample and out-of-sample prediction. In addition to these quantitative methods, we would like to emphasize the importance of professional judgment from actuaries and predictive modelers for making the ultimate decision and interpretation of misrepresentation models using their practical knowledge. The following are some important practical considerations related to misrepresentation modeling.

Regarding inference on the presence of misrepresentation, the statistical significance of the LRT indicates that the data favor a mixture distribution in Equation 3 over a regular GLM without accounting for misrepresentation. The LRT and model comparison criteria, however, cannot provide information on the potential source of heterogeneity suggested by the mixture structure. In the case of the MEPS analysis, for example, individuals with high-deductible healthcare plans might behave more like uninsured individuals. Hence, the significance of the test might be caused by both potential misrepresentation on the insurance status and the low protection level of some insurance policies. To assess potential evidence on misrepresentation, we fit the lognormal misrepresentation models using the 2012 and 2013 MEPS data collected before the individual insurance mandate. The analysis showed that the LRT statistics are reduced by 78% and 82%, respectively, for the two models when compared with the value from Table 6, with the BIC favoring the unadjusted lognormal severity models for the 2012 and 2013 data. This suggests misrepresentation to be a potential major contributor to the significance of the 2014 misrepresentation tests.

Like all statistical models, the ability of misrepresentation models to reveal the true values of parameters, including the risk effects and those related to misrepresentation, also relies on the assumption that the model is correctly specified. Model misspecification, such as incorrect distributional assumptions or the omission of important risk factors or interaction terms, can result in potential bias in the estimation of such parameters. For the MEPS analysis, we tried estimating the prevalence of misrepresentation on the sex indicator, which is unlikely to be subject to misrepresentation, and obtained an insignificant LRT on the presence of misrepresentation, with both the AIC and BIC favoring the regular GLM without accounting for misrepresentation. In specific actuarial applications, we caution that it may be possible for the LRT or model comparison criteria to favor misrepresentation models concerning risk factors unlikely to be subject to misrepresentation, in which case we recommend trying to improve ratemaking models by refining the existing risk factor(s) or adding predictors, polynomials, and interaction terms.

Regarding the latent logit model on the prevalence of misrepresentation, we found that the age variable has a significant effect on the prevalence of misrepresentation for the lognormal severity model based on the 2014 MEPS data. According to Equations 3 and 4, the mixture of experts model under misrepresentation has a different distributional form when compared with an unadjusted GLM with an interaction effect between uninsured status and age. In particular, the latent logit model features the prevalence of misrepresentation (i.e., mixture weights) that varies with the individual’s age, whereas the interaction model features constant prevalence of misrepresentation but with the difference of the $V=0$ and $V=1$ distributions (i.e., component distributions) varying with age. For the misrepresentation models considered in this paper, we could include interaction terms between $V$ and other risk factors from $\boldsymbol{X}^{\mathcal{S}}$ and $\boldsymbol{X}^{\mathcal{T}}$ . Although the misrepresentation models would still possess the general mixture forms given in Equations 3 and 4, the EM algorithms would need to be redesigned since the interaction terms involving $V$ are additional latent factors that need to be addressed in the E-step and M-step. Given that interaction terms are commonly included in ratemaking models, this would be an interesting future research topic for misrepresentation modeling. For the MEPS analysis, we would then be able to compare the unadjusted GLM to the corresponding misrepresentation model with the interaction of age and uninsured status in the component distributions, with and without a latent logit model with age as a predictor. The model comparison tools introduced previously could be used for assessing the presence and mechanism of misrepresentation in the presence of an interaction effect between $V$ and another risk factor.

7. Conclusions

In this paper, we proposed EM algorithms for obtaining the MLEs of the parameters of GLM ratemaking models that embed predictive analytics on the misrepresentation risk. The EM algorithms were developed and implemented in R for commonly used loss severity and frequency distributions, including the lognormal, gamma, Poisson, and negative binomial distributions, with the models allowing the prevalence of misrepresentation to change with certain risk factors. Furthermore, we proposed statistical techniques for inference on the presence of misrepresentation and on the effects of various risk factors on the prevalence of misrepresentation and on the loss severity or frequency (i.e., ratemaking relativities). The comprehensive simulation studies we performed demonstrated the advantages of the maximum likelihood approach over its Bayesian counterpart and the importance of misrepresentation modeling.
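For readers who want a feel for the mechanics, the following Python sketch runs an EM iteration for a deliberately stripped-down case. It is not the authors' R implementation and makes several simplifying assumptions: the severity is lognormal, the only rating factor is the binary status $V$, the prevalence $q$ (the share of reported negatives whose true status is positive) is constant, reported positives are taken to be truthful, and all data are simulated. The function names `simulate` and `em_fit` are hypothetical.

```python
import math
import random

def norm_pdf(z, mu, sigma):
    """Normal density, applied on the log-loss scale."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def observed_loglik(z, v_star, mu0, mu1, sigma, q):
    """Partial-data log-likelihood of the log losses (the lognormal Jacobian
    term is constant in the parameters and is omitted)."""
    total = 0.0
    for zi, vi in zip(z, v_star):
        f1 = norm_pdf(zi, mu1, sigma)
        if vi == 1:                       # reported positive: true status known
            total += math.log(f1)
        else:                             # reported negative: two-component mixture
            total += math.log(q * f1 + (1.0 - q) * norm_pdf(zi, mu0, sigma))
    return total

def em_fit(z, v_star, max_iter=300, tol=1e-9):
    n = len(z)
    z_neg = [zi for zi, vi in zip(z, v_star) if vi == 0]
    z_pos = [zi for zi, vi in zip(z, v_star) if vi == 1]
    # Crude starting values from the reported groups.
    mu0, mu1 = sum(z_neg) / len(z_neg), sum(z_pos) / len(z_pos)
    zbar = sum(z) / n
    sigma = math.sqrt(sum((zi - zbar) ** 2 for zi in z) / n)
    q = 0.5
    path = [observed_loglik(z, v_star, mu0, mu1, sigma, q)]
    for _ in range(max_iter):
        # E-step: posterior probability that the true status is positive.
        w = []
        for zi, vi in zip(z, v_star):
            if vi == 1:
                w.append(1.0)
            else:
                a = q * norm_pdf(zi, mu1, sigma)
                b = (1.0 - q) * norm_pdf(zi, mu0, sigma)
                w.append(a / (a + b))
        # M-step: weighted means, pooled variance, and prevalence.
        sw = sum(w)
        mu1 = sum(wi * zi for wi, zi in zip(w, z)) / sw
        mu0 = sum((1.0 - wi) * zi for wi, zi in zip(w, z)) / (n - sw)
        sigma = math.sqrt(sum(wi * (zi - mu1) ** 2 + (1.0 - wi) * (zi - mu0) ** 2
                              for wi, zi in zip(w, z)) / n)
        q = sum(wi for wi, vi in zip(w, v_star) if vi == 0) / len(z_neg)
        path.append(observed_loglik(z, v_star, mu0, mu1, sigma, q))
        if path[-1] - path[-2] < tol:
            break
    return {"mu0": mu0, "mu1": mu1, "sigma": sigma, "q": q, "loglik_path": path}

def simulate(n, mu0, mu1, sigma, theta, p, seed):
    """Simulated log losses; a true positive reports negative with probability p."""
    rng = random.Random(seed)
    z, v_star = [], []
    for _ in range(n):
        v = 1 if rng.random() < theta else 0
        reported = 0 if (v == 1 and rng.random() < p) else v
        z.append(rng.gauss(mu1 if v == 1 else mu0, sigma))
        v_star.append(reported)
    return z, v_star

z, v_star = simulate(n=3000, mu0=7.0, mu1=8.5, sigma=0.5, theta=0.5, p=0.4, seed=1)
fit = em_fit(z, v_star)
print({k: round(val, 3) for k, val in fit.items() if k != "loglik_path"})
```

With covariates, the weighted means in the M-step would be replaced by a weighted regression fit, and a latent logit structure would replace the single update for `q` with a weighted logistic regression; both extensions are omitted here to keep the sketch short.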

By fitting different loss severity and frequency models to the MEPS data, we performed model selection based on goodness-of-fit statistics that penalize model complexity. For the selected models, the tests on the presence of misrepresentation in the self-reported uninsured status were statistically significant, a conclusion consistent with the previous literature. Regarding the effects of the risk factors on the loss outcomes, the lognormal loss severity model and the negative binomial loss frequency model selected for the two loss outcomes give consistent statistical conclusions. Regarding the misrepresentation mechanism, the two models are inconsistent concerning the significance and effect of the age variable on misrepresentation. Thus, we resorted to other criteria such as the BIC and selected the lognormal and negative binomial models without a logit structure on the prevalence of misrepresentation. For model validation, we adopted various numeric and graphical methods to evaluate the in-sample and out-of-sample prediction of unadjusted ratemaking models and of those adjusting for misrepresentation. Such model validation procedures can be useful for assessing the predictive ability of claims analytics and ratemaking models that are fitted to historical data.

For the current study, the severity and frequency models were fitted separately to the two response variables. For future studies, it would be interesting to study, from a methodological standpoint, joint learning of the statistical evidence and mechanism of misrepresentation by combining the frequency and severity models at the policy level. This would be particularly helpful in assessing the overall effect and significance of the risk factors on the prevalence of misrepresentation. The extension would lead to different structures of the complete data and partial data likelihood functions required for the EM algorithm. By resorting to numerical optimization and numerical derivatives, the implementation of the EM algorithms and inference techniques for joint frequency and severity misrepresentation models should be feasible, and it is a promising route to integrated conclusions regarding the presence and mechanism of misrepresentation.

Acknowledgments

The authors are grateful to the editor, anonymous referees and Dr. Jianxi Su for their valuable comments and suggestions that helped significantly improve the quality of this paper. The authors are grateful to the Casualty Actuarial Society for its generous support at the 2018 Individual Grant Competition.
