1. Introduction
Ever since insurance became a commercial product, insurance fraud has been wreaking havoc on the insurance industry. Insurance fraud has been a major problem in the US since the last century (e.g., Derrig 2002). At the time of writing, the Coalition Against Insurance Fraud estimates that insurance fraud costs Americans billions of dollars each year (CAIF 2025). Since honest policyholders ultimately foot the bill, this means hundreds of dollars in premium increases for an average American family. Therefore, detecting insurance fraud is one of the most important problems for the insurance industry.
While different insurers have different fraud-detecting systems, the general process can be described as follows. When a new claim arrives, it will first go through an initial screening process, which is often an automated system based on a statistical or machine learning method. If a claim is flagged, it will be singled out for further investigation; otherwise, it will be paid immediately. The process of evaluating a potentially fraudulent claim can be both complicated and costly. It often involves many human components, such as adjusters, special investigators, prosecutors, lawyers, and judges; see Derrig (2002) for a detailed review.
A fraud detection procedure should have two key elements. On the one hand, the insurer faces a large number of claims every year; it is practically impossible to investigate every incoming claim. However, when a fraudulent claim passes the initial screening (i.e., a false negative case), it will be paid as a valid claim. Therefore, it is critical that the initial screening should detect as many fraudulent claims as possible. On the other hand, if a valid claim is mistakenly flagged during the initial screening (i.e., a false positive case), the insurer will waste resources investigating it. Hence, the insurer wants to have as few false positive cases as possible. Taking these two aspects into consideration, we see that controlling the probability of prediction error is crucial in fraud detection.
Researchers have been investigating the problem of insurance fraud detection for decades, and several statistical methods have been proposed; see Ai et al. (2009), Ai, Brockett, and Golden (2013), Brockett and Derrig (2002), Frees, Derrig, and Meyers (2014), Gomes, Jin, and Yang (2021), Tumminello et al. (2023), and references therein. In particular, Ai et al. (2009) and Ai, Brockett, and Golden (2013) developed a method based on ridit analysis—a statistical method for assigning numerical scores to categorical data. The key advantage of this method is that it has rigorous theoretical support (Brockett and Levine 1977; Brockett 1981). Recently, Gomes, Jin, and Yang (2021) proposed a method based on autoencoders and variational autoencoders—two deep-learning models. Their method is applicable to a wider range of situations than the method developed in Ai et al. (2009) and Ai, Brockett, and Golden (2013), but the theoretical foundations for these methods are yet to be established. Similar to Gomes, Jin, and Yang (2021), Tumminello et al. (2023) also adapted a machine learning method to detect insurance fraud, but their method is mostly applicable to auto insurance only.
When it comes to insurance fraud detection, a machine learning method leaves at least two things to be desired. First, a machine learning method often has some tuning parameters, and its performance depends on them. In practice, the actuary cannot know the value of a tuning parameter needed for an automated fraud-detecting mechanism to achieve a given error rate. Second, most machine learning methods for insurance fraud detection do not have theoretical guarantees. An ideal fraud-detecting method should allow the insurer to control the probability of prediction error at a predetermined level each time the insurer makes a prediction. However, it has been proved that no method can ever achieve this goal (e.g., Lemma 1 of Lei and Wasserman (2014) and Theorem 2 of Hong (2023)). Therefore, we seek an attainable goal that is still very desirable: to find a fraud-detection method that allows the insurer to control the coverage probability of prediction at a preassigned level (see Section 2 for a detailed discussion on the difference between these two goals).
To our knowledge, no extant fraud-detecting methods provide the insurer with such an option. In addition, an ideal fraud-detecting method should have a provable guarantee of this desirable property. The purpose of this article is to propose a method for detecting insurance fraud that guarantees finite-sample validity. The proposed method is based on conformal prediction—a general machine learning strategy. For a general discussion of conformal prediction, see Shafer and Vovk (2008) and Vovk, Gammerman, and Shafer (2005); for applications of conformal prediction to insurance, see Hong and Martin (2021) and Hong (2023). Our method has several desirable properties: (1) it is distribution-free, (2) it has no tuning parameter, (3) it guarantees finite-sample validity, (4) it is applicable regardless of whether the features are continuous or categorical, and (5) it can be used to detect types of fraud other than insurance fraud.
An automated fraud-detecting system based on a statistical or machine learning method only serves as an initial screening mechanism for the insurer. Such a system might face several challenges. First, real insurance fraud data can be highly imbalanced. As a result, the performance of an automated fraud-detecting system can be unreliable. Moreover, the nature of fraud varies from case to case. For example, a medical claim for a skiing accident in Texas in August would be a glaring red flag. However, imagine the following situation: a family physician claims several charges for a patient's visit, and one of the charges is fraudulent while all the others are legitimate. In such a case, the fraud is so subtle that even an excellent automated fraud-detecting system may not be able to detect it, because this type of fraud may not have a numerical threshold. Finally, fraudsters keep changing their tricks in response to the latest fraud-detecting procedures of the insurer. Therefore, the statistical characteristics of insurance fraud data might change over a short period, rendering some existing fraud-detecting methods useless. The proposed method generally overcomes the first and third challenges. However, like other fraud-detecting methods, the proposed method may not be able to detect the aforementioned type of subtle fraud.
The remainder of the paper proceeds as follows. Section 2 provides readers with necessary background by giving a high-level overview of conformal prediction. Section 3 details the proposed method for detecting insurance fraud based on conformal prediction. Section 4 gives several numerical examples to show the excellent performance of the proposed method. Finally, Section 5 concludes the paper with some remarks.
2. Conformal prediction
Conformal prediction is a general machine learning approach for producing provably valid predictions; see Shafer and Vovk (2008) for a review and Vovk, Gammerman, and Shafer (2005) for a monograph treatment. There are two versions: an unsupervised version and a supervised version. For insurance applications of the unsupervised version, we refer to Hong and Martin (2021) and Hong (2023). Because the problem of detecting insurance fraud is a supervised learning problem, we will focus on the supervised version. To this end, we assume the data take the form of exchangeable pairs $Z_i = (X_i, Y_i)$ for $i = 1, \ldots, n$, where $X_i$ is a vector of features and $Y_i$ is the corresponding label. Our goal is to predict the next label $Y_{n+1}$ at a randomly sampled feature $X_{n+1}$ based on the observed data $Z^n = \{Z_1, \ldots, Z_n\}$.
Conformal prediction starts with a deterministic mapping $M(B, z)$ of two arguments, where the first argument $B$ is a bag, i.e., a collection, of observed data, and the second argument $z$ is a provisional value of a future observation to be predicted based on the data in $B$. The mapping $M$ measures the degree of nonconformity of the provisional value $z$ with the data in $B$. That is, when the provisional value $z$ agrees with the data in $B$, $M(B, z)$ will be relatively small; otherwise, it will be relatively large. Therefore, we call $M$ a nonconformity measure. For example, if $Y$ is real-valued and $z = (x, y)$, then we can take $M(B, z) = |y - \hat{\mu}_B(x)|$, where $\hat{\mu}_B$ is an estimate of the conditional mean function based on the bag $B$. The choice of nonconformity measure is not unique and is at the discretion of the actuary—in general, according to the problem at hand. Once a nonconformity measure is specified, the actuary implements the conformal prediction algorithm—Algorithm 1—to predict the value of the next label $Y_{n+1}$ at a randomly sampled feature $X_{n+1}$. In Algorithm 1, $I(A)$ denotes the indicator function of an event $A$. The quantity $\mu_i = M(Z^{n+1} \setminus \{Z_i\}, Z_i)$, called the $i$-th nonconformity score, assigns a numerical value to show how much $Z_i$ agrees with the data in the augmented bag $Z^{n+1} = Z^n \cup \{(x, y)\}$, where $Z_i$ itself is excluded to avoid biases, as in leave-one-out cross-validation. The function $\mathrm{pl}_{Z^n}(x, y) = (n+1)^{-1}\sum_{i=1}^{n+1} I(\mu_i \geq \mu_{n+1})$, termed the plausibility function, summarizes these nonconformity scores and outputs a value between $0$ and $1$ to indicate how plausible $(x, y)$ is as a value of $Z_{n+1}$ based on the available data $Z^n$. Based on the plausibility function output, the actuary can construct a conformal prediction band
$$C_\alpha(x; Z^n) = \{\, y : \mathrm{pl}_{Z^n}(x, y) > \alpha \,\}, \tag{1}$$

where $\alpha \in (0, 1)$ is a preassigned level.
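To make Algorithm 1 concrete, here is a minimal R sketch of generic conformal prediction (our illustration, not the authors' code); for simplicity, the nonconformity measure is the $|y - \hat{\mu}_B|$ example above with $\hat{\mu}_B$ taken to be the bag's sample mean, ignoring the features, and all function names are ours.

```r
# A minimal sketch of Algorithm 1 (illustrative only).
# Nonconformity measure: |y - mu_hat(B)|, with mu_hat(B) the bag mean.
nonconf <- function(bag_y, y) abs(y - mean(bag_y))

# Plausibility of a provisional value y given observed responses y1..yn:
# augment the bag with y, compute the leave-one-out nonconformity scores
# mu_i, and return (1/(n+1)) * sum I(mu_i >= mu_{n+1}).
plausibility <- function(yn, y) {
  aug <- c(yn, y)                                   # augmented bag Z^{n+1}
  mu  <- vapply(seq_along(aug),
                function(i) nonconf(aug[-i], aug[i]), numeric(1))
  mean(mu >= mu[length(aug)])
}

# Conformal prediction band (1) over a grid of candidate values:
conformal_set <- function(yn, grid, alpha) {
  grid[vapply(grid, function(y) plausibility(yn, y) > alpha, logical(1))]
}

# Example: a 90% prediction set from 50 standard normal observations.
set.seed(1)
conformal_set(rnorm(50), grid = seq(-4, 4, by = 0.1), alpha = 0.1)
```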
Moreover, we have the following theorem.

Theorem 1. If $P$ denotes the distribution of an exchangeable sequence $Z_1, Z_2, \ldots$, then write $P^{n+1}$ for the corresponding joint distribution of $Z^{n+1} = (Z_1, \ldots, Z_{n+1})$. For $\alpha \in (0, 1)$, define $t_n(\alpha) = \lfloor (n+1)\alpha \rfloor / (n+1)$, where $\lfloor a \rfloor$ denotes the greatest integer less than or equal to $a$. Then

$$\sup_P P^{n+1}\{\mathrm{pl}_{Z^n}(Z_{n+1}) \leq t_n(\alpha)\} \leq \alpha \quad \text{for all } n \text{ and all } \alpha \in (0, 1), \tag{2}$$

where the supremum is over all distributions $P$ for the exchangeable sequence.

Proof. The proof is similar to that of Theorem 1 in Hong and Martin (2021). Since $Z_1, \ldots, Z_{n+1}$ are exchangeable, we know $\mu_1, \ldots, \mu_{n+1}$, as functions of $Z_1, \ldots, Z_{n+1}$, are exchangeable, too. Therefore, the rank of $\mu_{n+1}$ among $\mu_1, \ldots, \mu_{n+1}$ is uniformly distributed on the set $\{1, \ldots, n+1\}$. By its definition, the plausibility function $\mathrm{pl}_{Z^n}(Z_{n+1})$ is proportional to this rank. Therefore, $\mathrm{pl}_{Z^n}(Z_{n+1})$ follows the discrete uniform distribution on the set $\{1/(n+1), 2/(n+1), \ldots, 1\}$. For a given $0 < \alpha < 1$, if $(n+1)\alpha$ is an integer, then $P^{n+1}\{\mathrm{pl}_{Z^n}(Z_{n+1}) \leq t_n(\alpha)\} = \lfloor (n+1)\alpha \rfloor/(n+1) = \alpha$. Otherwise, we will have $P^{n+1}\{\mathrm{pl}_{Z^n}(Z_{n+1}) \leq t_n(\alpha)\} = \lfloor (n+1)\alpha \rfloor/(n+1) < \alpha$. Therefore, (2) always holds.
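To see the two cases at the end of the proof concretely, consider a small numerical illustration (ours): with $n = 19$, the plausibility $\mathrm{pl}_{Z^n}(Z_{n+1})$ is uniform on $\{1/20, 2/20, \ldots, 1\}$, so

$$\begin{aligned}
\alpha = 0.10:&\quad (n+1)\alpha = 2 \text{ is an integer},\ t_n(\alpha) = \tfrac{2}{20} = 0.10,\ P^{n+1}\{\mathrm{pl}_{Z^n}(Z_{n+1}) \leq 0.10\} = \tfrac{2}{20} = \alpha;\\
\alpha = 0.07:&\quad (n+1)\alpha = 1.4 \text{ is not},\ t_n(\alpha) = \tfrac{\lfloor 1.4 \rfloor}{20} = 0.05,\ P^{n+1}\{\mathrm{pl}_{Z^n}(Z_{n+1}) \leq 0.05\} = \tfrac{1}{20} < \alpha.
\end{aligned}$$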
It follows immediately from Theorem 1 that the prediction band given by (1) is jointly valid in the sense that
$$P^{n+1}\{Y_{n+1} \in C_\alpha(X_{n+1}; Z^n)\} \geq 1 - \alpha \quad \text{for all } (n, P), \tag{3}$$

where $P^{n+1}$ is the joint distribution for $Z^{n+1}$. That is, the coverage probability of prediction using the conformal prediction band is at least $1 - \alpha$ for all sample sizes $n$ and all distributions $P$. This coverage probability result is joint, and its associated joint validity of the prediction band is different from a more desirable conditional validity property, namely,

$$P^{n+1}\{Y_{n+1} \in C_\alpha(X_{n+1}; Z^n) \mid X_{n+1} = x\} \geq 1 - \alpha \quad \text{for all } (n, P) \text{ and almost all } x.$$
Conditional validity implies joint validity because
$$P^{n+1}\{Y_{n+1} \in C_\alpha(X_{n+1}; Z^n)\} = E\left[ P^{n+1}\{Y_{n+1} \in C_\alpha(X_{n+1}; Z^n) \mid X_{n+1}\} \right], \tag{4}$$

where the expectation is taken with respect to the distribution of $X_{n+1}$. Vovk (2012) and Lei and Wasserman (2014) show that it is impossible to achieve the conditional validity property with a bounded prediction region in supervised learning; see also Foygel-Barber et al. (2021) and Guan (2019). Hong (2023) establishes a similar result for unsupervised learning. Therefore, no practically useful fraud-detecting method can ever achieve conditional validity. This suggests that the finite-sample joint validity guaranteed by conformal prediction is the best we can do.

Though joint validity is a nice theoretical property, its practical meaning needs to be interpreted carefully. Conditional validity says that the probability of accurate prediction is at least $1 - \alpha$ for each prediction. In contrast, joint validity means that the long-run rate of accurate prediction is at least $1 - \alpha$; i.e., if the insurer performs an infinite sequence of independent predictions, then at least $100(1-\alpha)\%$ of them are accurate. To see this, note that the strong law of large numbers implies (4) can be written as

$$P^{n+1}\{Y_{n+1} \in C_\alpha(X_{n+1}; Z^n)\} = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} P^{n+1}\{Y_{n+1} \in C_\alpha(W_i; Z^n) \mid W_i\}, \tag{5}$$

where $W_1, W_2, \ldots$ is a sequence of independent random variables that have the same distribution as $X_{n+1}$. Clearly, conditional validity would be an ideal property for any fraud-detecting method. But since no insurer can conduct infinitely many predictions, joint validity only means that if the insurer performs a sufficiently large number of independent predictions, then about $100(1-\alpha)\%$ of them will be accurate, because the right-hand side of (5) may or may not have converged for finitely many predictions. Also, this interpretation does not contradict the finite-sample validity of the conformal prediction band: the latter refers to the fact that the inequality in (3) holds for any finite sample size $n$. Furthermore, for two different coverage probability levels $\alpha_1$ and $\alpha_2$, the corresponding conformal prediction regions $C_{\alpha_1}$ and $C_{\alpha_2}$ are different. Therefore, convergence in (5) depends not only on the training data but also on $\alpha$. Finally, finite-sample validity, given by (3), is not to be confused with a finite-sample generalization error bound: the former does not depend on any loss function, while the latter depends on the choice of a loss function.

It bears noting that conformal prediction has three potential drawbacks: (1) one may not be able to implement Algorithm 1 for all possible values of $y$, jeopardizing finite-sample validity; (2) the shape of the conformal prediction region could be irregular, rendering it useless in practice; and (3) the computation required for implementing Algorithm 1 could be prohibitively expensive. In a regression problem, (1) and (3) are major concerns for conformal prediction. Fortunately, we do not need to worry about them for the problem of detecting insurance fraud, because there are only two possible values of $Y_{n+1}$, and the resulting conformal prediction region can take only four possible shapes; see the next section for details. However, (2) is a challenge we must overcome in applying conformal prediction to insurance fraud detection. To circumvent this difficulty, we will propose a nonconformity measure and derive closed-form formulas for the resulting conformal prediction region.
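The long-run interpretation of (5) can also be checked by simulation. The following R sketch (ours; it reuses the plausibility function from the sketch earlier in this section) estimates the rate of accurate prediction over repeated independent experiments; Theorem 1 guarantees this rate is at least $1 - \alpha$.

```r
# Empirical check of joint validity (illustrative): coverage of Y_{n+1}
# is equivalent to pl_{Z^n}(Z_{n+1}) > alpha, so we estimate the long-run
# rate of accurate prediction by repeated train-then-predict experiments.
set.seed(1)
alpha <- 0.10
hits  <- replicate(2000, {
  z <- rnorm(21)                       # 20 exchangeable draws plus one more
  plausibility(z[1:20], z[21]) > alpha # the new point is covered iff pl > alpha
})
mean(hits)                             # about 0.905 here, i.e., >= 1 - alpha
```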
3. Proposed method
In this article, we consider two cases: (1) all features are continuous and (2) all features are categorical. It is evident that the aforementioned conformal prediction band $C_\alpha$ depends on the choice of the nonconformity measure $M$. Therefore, the proposed nonconformity measures are different for these two cases. The choice of the nonconformity measure is not unique. In practice, the actuary may choose other appropriate nonconformity measures.

3.1. Continuous features
Suppose $Z_1 = (X_1, Y_1), \ldots, Z_n = (X_n, Y_n)$ are observed data, where $X_i \in \mathbb{R}^d$ and $Y_i \in \{0, 1\}$. Define a bag of data $Z^n = \{Z_1, \ldots, Z_n\}$. For the $(n+1)$-th observation $Z_{n+1} = (X_{n+1}, Y_{n+1})$, $X_{n+1}$ is known. The goal is to determine $Y_{n+1}$ based on $X_{n+1}$ and the data in the bag $Z^n$. Without loss of generality, we may assume $Y_i = 0$ for $i = 1, \ldots, m$ and $Y_i = 1$ for $i = m+1, \ldots, n$, for some integer $m$.

The nonconformity measure we choose here is

$$M(B, z) = \left\| \bar{X}_{B \cup \{(x, y)\},\, y} - x \right\|,$$

where $z = (x, y)$, $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^d$, and $\bar{X}_{B \cup \{(x, y)\},\, y}$ denotes the vector obtained by averaging all the $X_i$'s whose labels equal $y$ in the bag $B \cup \{(x, y)\}$. For example, if $d = 1$, $B = \{(1, 0), (5, 0), (8, 1)\}$, and $z = (3, y)$, then

$$M(B, z) = \left| \bar{X}_{B \cup \{(3, y)\},\, y} - 3 \right| = \begin{cases} |(1 + 5 + 3)/3 - 3| = 0, & \text{if } y = 0; \\ |(8 + 3)/2 - 3| = 2.5, & \text{if } y = 1. \end{cases}$$
Since all norms are equivalent on a finite-dimensional Euclidean space, our choice of the Euclidean norm is made without loss of generality.
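In R, this nonconformity measure can be sketched as follows (our illustration; the function and variable names are ours). The code reproduces the worked example above.

```r
# Nonconformity measure of Section 3.1 (sketch): Euclidean distance from x
# to the average of all features sharing the provisional label y in the
# augmented bag B u {(x, y)}.
M <- function(X, Y, x, y) {
  Xy <- rbind(X[Y == y, , drop = FALSE], x)  # features labeled y, plus x itself
  sqrt(sum((colMeans(Xy) - x)^2))            # Euclidean distance to their mean
}

# Worked example: d = 1, B = {(1,0), (5,0), (8,1)}, x = 3.
X <- matrix(c(1, 5, 8), ncol = 1)
Y <- c(0, 0, 1)
M(X, Y, x = 3, y = 0)  # |(1+5+3)/3 - 3| = 0
M(X, Y, x = 3, y = 1)  # |(8+3)/2  - 3| = 2.5
```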
To derive the conformal prediction band $C_\alpha(X_{n+1}; Z^n)$, we need to consider two cases: (I) $Y_{n+1} = 0$ and (II) $Y_{n+1} = 1$.

Case I: $Y_{n+1} = 0$. The $i$-th nonconformity score is given by

$$\mu_i = M(Z^{n+1} \setminus \{Z_i\}, Z_i) = \begin{cases} \left((\bar{X}_{Z^{n+1}, 0} - X_i)^T (\bar{X}_{Z^{n+1}, 0} - X_i)\right)^{1/2}, & \text{if } Y_i = 0; \\ \left((\bar{X}_{Z^{n+1}, 1} - X_i)^T (\bar{X}_{Z^{n+1}, 1} - X_i)\right)^{1/2}, & \text{if } Y_i = 1, \end{cases}$$

for $i = 1, \ldots, n$, and

$$\mu_{n+1} = \|\bar{X}_{Z^{n+1}, 0} - X_{n+1}\| = \left((\bar{X}_{Z^{n+1}, 0} - X_{n+1})^T (\bar{X}_{Z^{n+1}, 0} - X_{n+1})\right)^{1/2},$$

where the superscript $T$ used in the last equality denotes matrix transpose. When $Y_i = 0$, we can rewrite $\mu_i \geq \mu_{n+1}$ as

$$\begin{aligned}
\mu_i \geq \mu_{n+1} &\Leftrightarrow (\bar{X}_{Z^{n+1}, 0} - X_i)^T (\bar{X}_{Z^{n+1}, 0} - X_i) \geq (\bar{X}_{Z^{n+1}, 0} - X_{n+1})^T (\bar{X}_{Z^{n+1}, 0} - X_{n+1}) \\
&\Leftrightarrow X_{n+1}^T X_{n+1} - X_i^T X_i - 2 X_{n+1}^T \bar{X}_{Z^{n+1}, 0} + 2 X_i^T \bar{X}_{Z^{n+1}, 0} \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T (X_{n+1} + X_i) - 2 (X_{n+1} - X_i)^T \bar{X}_{Z^{n+1}, 0} \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left( X_{n+1} + X_i - \frac{2\left(\sum_{k=1}^{m} X_k + X_{n+1}\right)}{m+1} \right) \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left( \frac{m-1}{m+1}\, X_{n+1} + X_i - \frac{2\sum_{k=1}^{m} X_k}{m+1} \right) \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left[ X_{n+1} - \left( \frac{2\sum_{k=1}^{m} X_k}{m-1} - \frac{m+1}{m-1}\, X_i \right) \right] \leq 0.
\end{aligned}$$
When $Y_i = 1$, we can rewrite $\mu_i \geq \mu_{n+1}$ as

$$\begin{aligned}
\mu_i \geq \mu_{n+1} &\Leftrightarrow \|\bar{X}_{Z^{n+1}, 1} - X_i\| \geq \|\bar{X}_{Z^{n+1}, 0} - X_{n+1}\| \\
&\Leftrightarrow \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m} - X_i \right\| \geq \left\| \frac{\sum_{k=1}^{m} X_k + X_{n+1}}{m+1} - X_{n+1} \right\| \\
&\Leftrightarrow \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m} - X_i \right\|^2 \geq \left\| \frac{\sum_{k=1}^{m} X_k}{m+1} - \frac{m}{m+1}\, X_{n+1} \right\|^2 \\
&\Leftrightarrow \left( \left\| \frac{m}{m+1}\, X_{n+1} - \frac{\sum_{k=1}^{m} X_k}{m+1} \right\| + \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m} - X_i \right\| \right) \left( \left\| \frac{m}{m+1}\, X_{n+1} - \frac{\sum_{k=1}^{m} X_k}{m+1} \right\| - \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m} - X_i \right\| \right) \leq 0.
\end{aligned}$$
Therefore, the plausibility value $\mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1})$ is given by

$$\begin{aligned}
\mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1}) &= \frac{1}{n+1} \sum_{i=1}^{n+1} I(\mu_i \geq \mu_{n+1}) \\
&= \frac{1}{n+1} \Bigg\{ 1 + \sum_{i=1}^{m} I\left\{ (X_{n+1} - X_i)^T \left[ X_{n+1} - \left( \frac{2\sum_{k=1}^{m} X_k}{m-1} - \frac{m+1}{m-1}\, X_i \right) \right] \leq 0 \right\} \\
&\qquad + \sum_{i=m+1}^{n} I\left\{ \left( \left\| \frac{m}{m+1}\, X_{n+1} - \frac{\sum_{k=1}^{m} X_k}{m+1} \right\| + \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m} - X_i \right\| \right) \left( \left\| \frac{m}{m+1}\, X_{n+1} - \frac{\sum_{k=1}^{m} X_k}{m+1} \right\| - \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m} - X_i \right\| \right) \leq 0 \right\} \Bigg\}, \tag{6}
\end{aligned}$$

where the leading $1$ is the $i = n+1$ term, since $\mu_{n+1} \geq \mu_{n+1}$ always holds.
Case II: $Y_{n+1} = 1$. As in Case I, we can derive that

$$\mu_i = M(Z^{n+1} \setminus \{Z_i\}, Z_i) = \begin{cases} \left((\bar{X}_{Z^{n+1}, 0} - X_i)^T (\bar{X}_{Z^{n+1}, 0} - X_i)\right)^{1/2}, & \text{if } Y_i = 0; \\ \left((\bar{X}_{Z^{n+1}, 1} - X_i)^T (\bar{X}_{Z^{n+1}, 1} - X_i)\right)^{1/2}, & \text{if } Y_i = 1, \end{cases}$$

for $i = 1, \ldots, n$, and

$$\mu_{n+1} = \|\bar{X}_{Z^{n+1}, 1} - X_{n+1}\| = \left((\bar{X}_{Z^{n+1}, 1} - X_{n+1})^T (\bar{X}_{Z^{n+1}, 1} - X_{n+1})\right)^{1/2}.$$
When $Y_i = 0$, we can rewrite $\mu_i \geq \mu_{n+1}$ as

$$\begin{aligned}
\mu_i \geq \mu_{n+1} &\Leftrightarrow \|\bar{X}_{Z^{n+1}, 0} - X_i\| \geq \|\bar{X}_{Z^{n+1}, 1} - X_{n+1}\| \\
&\Leftrightarrow \left\| \frac{\sum_{k=1}^{m} X_k}{m} - X_i \right\| \geq \left\| \frac{\sum_{k=m+1}^{n} X_k + X_{n+1}}{n-m+1} - X_{n+1} \right\| \\
&\Leftrightarrow \left\| \frac{\sum_{k=1}^{m} X_k}{m} - X_i \right\|^2 \geq \left\| \frac{\sum_{k=m+1}^{n} X_k}{n-m+1} - \frac{n-m}{n-m+1}\, X_{n+1} \right\|^2 \\
&\Leftrightarrow \left( \left\| \frac{n-m}{n-m+1}\, X_{n+1} - \frac{\sum_{k=m+1}^{n} X_k}{n-m+1} \right\| + \left\| \frac{\sum_{k=1}^{m} X_k}{m} - X_i \right\| \right) \left( \left\| \frac{n-m}{n-m+1}\, X_{n+1} - \frac{\sum_{k=m+1}^{n} X_k}{n-m+1} \right\| - \left\| \frac{\sum_{k=1}^{m} X_k}{m} - X_i \right\| \right) \leq 0.
\end{aligned}$$
When $Y_i = 1$, we can rewrite $\mu_i \geq \mu_{n+1}$ as

$$\begin{aligned}
\mu_i \geq \mu_{n+1} &\Leftrightarrow (\bar{X}_{Z^{n+1}, 1} - X_i)^T (\bar{X}_{Z^{n+1}, 1} - X_i) \geq (\bar{X}_{Z^{n+1}, 1} - X_{n+1})^T (\bar{X}_{Z^{n+1}, 1} - X_{n+1}) \\
&\Leftrightarrow X_{n+1}^T X_{n+1} - 2 X_{n+1}^T \bar{X}_{Z^{n+1}, 1} - X_i^T X_i + 2 X_i^T \bar{X}_{Z^{n+1}, 1} \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left( X_{n+1} + X_i - 2 \bar{X}_{Z^{n+1}, 1} \right) \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left( X_{n+1} + X_i - \frac{2\left(\sum_{k=m+1}^{n} X_k + X_{n+1}\right)}{n-m+1} \right) \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left( \frac{n-m-1}{n-m+1}\, X_{n+1} + X_i - \frac{2\sum_{k=m+1}^{n} X_k}{n-m+1} \right) \leq 0 \\
&\Leftrightarrow (X_{n+1} - X_i)^T \left[ X_{n+1} - \left( \frac{2\sum_{k=m+1}^{n} X_k}{n-m-1} - \frac{n-m+1}{n-m-1}\, X_i \right) \right] \leq 0.
\end{aligned}$$
It follows that the plausibility value $\mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1})$ is given by

$$\begin{aligned}
\mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1}) &= \frac{1}{n+1} \sum_{i=1}^{n+1} I(\mu_i \geq \mu_{n+1}) \\
&= \frac{1}{n+1} \Bigg\{ 1 + \sum_{i=1}^{m} I\left\{ \left( \left\| \frac{n-m}{n-m+1}\, X_{n+1} - \frac{\sum_{k=m+1}^{n} X_k}{n-m+1} \right\| + \left\| \frac{\sum_{k=1}^{m} X_k}{m} - X_i \right\| \right) \left( \left\| \frac{n-m}{n-m+1}\, X_{n+1} - \frac{\sum_{k=m+1}^{n} X_k}{n-m+1} \right\| - \left\| \frac{\sum_{k=1}^{m} X_k}{m} - X_i \right\| \right) \leq 0 \right\} \\
&\qquad + \sum_{i=m+1}^{n} I\left\{ (X_{n+1} - X_i)^T \left[ X_{n+1} - \left( \frac{2\sum_{k=m+1}^{n} X_k}{n-m-1} - \frac{n-m+1}{n-m-1}\, X_i \right) \right] \leq 0 \right\} \Bigg\}. \tag{7}
\end{aligned}$$
Thus, the conformal prediction band $C_\alpha(X_{n+1}; Z^n)$ is given by

$$C_\alpha(X_{n+1}; Z^n) = \begin{cases} \{0\}, & \text{if } \mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1}) > \alpha \text{ and } \mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1}) \leq \alpha; \\ \{1\}, & \text{if } \mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1}) \leq \alpha \text{ and } \mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1}) > \alpha; \\ \{0, 1\}, & \text{if both } \mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1}) > \alpha \text{ and } \mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1}) > \alpha; \\ \varnothing, & \text{if } \mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1}) \leq \alpha \text{ and } \mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1}) \leq \alpha, \end{cases} \tag{8}$$
where $\mathrm{pl}_{Z^n}(Y_{n+1} = 0, X_{n+1})$ and $\mathrm{pl}_{Z^n}(Y_{n+1} = 1, X_{n+1})$ are calculated using (6) and (7).

A few remarks are in order. The first two cases in (8), where $C_\alpha = \{0\}$ and $C_\alpha = \{1\}$, denote the classification results of a valid claim and a fraudulent claim, respectively. When $C_\alpha = \{0, 1\}$, the conformal prediction classifier says the claim at hand is either a valid claim or a fraudulent claim, which is always true and not helpful for practitioners. Such a result will be deemed "noninformative." If $C_\alpha$ turns out to be the empty set $\varnothing$, then the conformal prediction classifier cannot generate any conformal prediction band based on the given data. This means the coverage probability level $1 - \alpha$ is too high for the given information $Z^n$. In practice, a claim that results in $C_\alpha = \varnothing$ will automatically be flagged for further investigation according to the insurance fraud detecting process described in Section 1. In addition, any claim leading to $C_\alpha = \{0, 1\}$ or $C_\alpha = \varnothing$ should be further examined by other classification methods or investigated by the fraud-detecting staff of the insurer.
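For readers who prefer code to formulas, the following R sketch (ours, with illustrative names) implements the classifier in (8). It computes the plausibilities (6) and (7) directly from the definition of the nonconformity scores, which is equivalent to the closed-form expressions above; it is a transcription of the logic, not optimized production code.

```r
# Plausibility of the provisional label y_new at feature x_new, computed
# directly from the definition: mu_i is the distance from X_i to the mean
# of the augmented-bag features sharing label Y_i (note that
# (Z^{n+1} \ {Z_i}) u {Z_i} = Z^{n+1}, so no point is ever dropped).
plausibility_label <- function(X, Y, x_new, y_new) {
  Xa <- rbind(X, x_new)                       # augmented features
  Ya <- c(Y, y_new)                           # augmented labels
  k  <- nrow(Xa)
  mu <- vapply(seq_len(k), function(i) {
    ctr <- colMeans(Xa[Ya == Ya[i], , drop = FALSE])
    sqrt(sum((ctr - Xa[i, ])^2))
  }, numeric(1))
  mean(mu >= mu[k])                           # (6) or (7), by definition
}

# The conformal prediction band (8): which labels are plausible at level alpha?
conformal_classify <- function(X, Y, x_new, alpha) {
  band <- c(`0` = plausibility_label(X, Y, x_new, 0) > alpha,
            `1` = plausibility_label(X, Y, x_new, 1) > alpha)
  names(band)[band]   # "0", "1", both, or character(0) for the empty set
}
```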
3.2. Categorical features

To consider the case where features are categorical, we let $Z_1 = (X_1, Y_1), \ldots, Z_n = (X_n, Y_n)$ be observed data for fraud detection, where $X_i = (X_{i1}, \ldots, X_{id})$ is a vector of categorical features and $Y_i \in \{0, 1\}$. Define a bag of data $Z^n = \{Z_1, \ldots, Z_n\}$. For the $(n+1)$-th observation $Z_{n+1} = (X_{n+1}, Y_{n+1})$, $X_{n+1}$ is known. As in the case of continuous features, our task here is to predict the value of $Y_{n+1}$ given $X_{n+1}$, based on the bag $Z^n$. To this end, we will first transform all categorical features to numerical values using frequency encoding. That is, for each categorical value a feature takes, we replace it with the relative frequency of the occurrence of that category in the data: for $i = 1, \ldots, n$ and $j = 1, \ldots, d$, we replace the value $X_{ij}$ in the original data with the proportion of observations whose $j$-th feature falls in the same category as $X_{ij}$. Once we finish this frequency encoding, we apply our conformal prediction classifier to the encoded data. Though frequency encoding converts the features to numbers, it does not change the categorical nature of these features, and the resulting features still take only finitely many values in $[0, 1]$. Therefore, fraud detection is expected to be more challenging here than in the above case of continuous features; see Section 4.2 for a concrete example.
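A minimal R sketch of the frequency encoding step (ours; we use relative frequencies, consistent with the encoded values lying in $[0, 1]$):

```r
# Frequency encoding (sketch): replace each categorical value with the
# relative frequency of its category within that feature.
freq_encode <- function(x) {
  tab <- table(x) / length(x)          # category -> relative frequency
  as.numeric(tab[as.character(x)])
}

# Example: c("A", "A", "B", "C") becomes 0.50 0.50 0.25 0.25.
freq_encode(c("A", "A", "B", "C"))
```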
4. Examples

4.1. Continuous features
Example 1. The data used in this example are the same as the data in Gomes, Jin, and Yang (2021). They consist of credit card transactions made by European cardholders in September 2013. The dataset is available at https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud. It was collected during a research collaboration between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. In total, there are 284,807 transactions; 492 of them are fraudulent. The raw data have been anonymized for confidentiality. The resulting dataset contains 30 continuous features and one categorical label (i.e., the fraud indicator). The label is tagged as "class"; it takes the value 1 if the claim is fraudulent and 0 otherwise. Of the 30 features, 28 (labeled V1 to V28) are the principal components obtained from the raw data, and the remaining two, labeled "time" and "amount," respectively, are the claim time and the claim amount. The features "time" and "amount" and the label "class" have not been transformed and are the same as in the raw data. This dataset is highly imbalanced: the positive class (frauds) accounts for only about 0.17% of all transactions.
We take the validation set approach, splitting the original dataset into a training dataset and a test dataset; both contain legitimate and fraudulent claims. Here we take the first six principal components to be our features. For several preassigned levels $\alpha$, we train our conformal prediction classifier on the training dataset and then test it on the test dataset. Table 1 reports the results, where the test error rate is calculated as

$$\text{test error rate} = \frac{\text{FN} + \text{FP}}{\text{test data sample size}}.$$
Here "noninformative" corresponds to the case where the prediction band is $\{0, 1\}$; that is, the conformal prediction classifier can only tell the user that a claim must be either valid or fraudulent, which is not informative. Alternatively, "empty" is the case where $C_\alpha = \varnothing$; this means that the conformal prediction classifier is unable to tell whether a claim is valid or fraudulent. Neither case is useful.

For each chosen value of $\alpha$, the false positive counts are higher than the false negative counts, but the test error rate is below the coverage probability level $\alpha$. As we mentioned in Section 2, we must exercise caution in interpreting the numerical results produced by conformal prediction. Although the prediction band is provably valid, the test error rate in any real-world example may or may not be less than $\alpha$, because we can only test our conformal prediction classifier a finite number of times in a real-world example, and there is no way to guarantee that convergence in (5) has been achieved in such a case. In this example, we are confident that the convergence has been achieved. In addition, any statistics-based or data science–based fraud-detecting tool only serves as an initial screening tool in practice. For this reason, a test error rate of this magnitude is already considered to be very good. Moreover, for each value of $\alpha$ considered, our conformal prediction classifier is able to label a claim in the test dataset as fraudulent or valid a large majority of the time. Given these observations, the performance of our method is excellent.

It is customary to compare a new method to some existing methods. However, we are not aware of any existing methods in the insurance literature that can provide finite-sample validity. Therefore, we compare the performance of our method with the two latest methods along these lines. The two methods, proposed by Gomes, Jin, and Yang (2021), are (a) the variational autoencoder (VAE) and (b) the autoencoder (AE). We will take (a) as the baseline model. Table 2 summarizes the results. Note that AE and VAE both have a parameter called the reconstruction error threshold (RE-T), but neither provides information about possible noninformative or empty cases. Also, the preassigned coverage probability level $\alpha$ does not apply to AE or VAE. However, TP, FN, FP, TN, and the test error rate apply to all fraud-detecting methods. The second column of Table 2 refers to the level $\alpha$ for the conformal prediction region or the tuning parameter RE-T of VAE and AE. RE-T is a tuning parameter, but $\alpha$ is not. Therefore, one must exercise caution in interpreting Table 2. In particular, an RE-T value of 40 for VAE or AE is not comparable to any value of $\alpha$ for conformal prediction.
Table 2 shows that both VAE and AE perform better as RE-T increases. Specifically, TP and TN increase as RE-T increases, while FN, FP, and the test error rate decrease. Regarding TP, FN, FP, and TN, conformal prediction seems to be relatively conservative compared to VAE and AE. This does not mean VAE or AE is better than conformal prediction. First, neither VAE nor AE guarantees finite-sample validity, while conformal prediction achieves validity at every chosen level $\alpha$, which is the key strength of the proposed method. Second, there is no established relationship between RE-T and the test error rate. In particular, the choice of RE-T is subjective, and the actuary cannot know the exact RE-T value needed to achieve a test error rate below a given level. Third, the fraud data are highly imbalanced: the claims are predominantly nonfraudulent. Hence, when the actuary raises the RE-T level of VAE or AE, TN will increase and FP will decrease, which generally reduces the test error rate. Therefore, it is difficult to distinguish this general effect from the performance of VAE and AE in any empirical study. In sum, VAE or AE might perform better than the proposed method in some cases, but this possibility does not provide any useful information for our purpose: to design an automated fraud-detecting method for initial screening. In particular, given a level $\alpha$, an actuary cannot know beforehand what value of RE-T is needed for an automated fraud-detecting method based on VAE or AE to perform better than conformal prediction. The proposed method tends to be conservative, but it provably guarantees finite-sample validity.
Example 2. The insurance fraud dataset used in this example consists of insurance claims provided by a major insurer in Spain from 2015 to 2016. Like the credit card fraud data in the previous example, the dataset has been anonymized due to its confidential nature. The dataset is available at https://data.mendeley.com/datasets/g3vxppc8k4/2. It contains a total of 163,182 claims, of which 13,037 are fraudulent. There are 325 continuous features, one categorical feature, and one categorical label. The categorical feature represents the claim ID, and the categorical label, which only takes values in the set $\{0, 1\}$, is the fraud indicator. It is unclear from the data what the continuous features stand for. However, this does not affect the applicability of our method. To make the data manageable to analyze, we apply the principal component analysis transformation across all features except the categorical feature. Then we select the first six principal components as the features for our conformal prediction classifier. As in the previous example, we take the validation set approach and split the data into a training dataset (75%, or 122,387 claims) and a test dataset (25%, or 40,795 claims); each contains both valid and fraudulent claims. For several preassigned levels $\alpha$, we train our conformal prediction classifier on the training set and then test it on the test dataset. Table 3 summarizes the key quantities from the results. As in the previous example, the test error rate is below the coverage probability level $\alpha$ across four different values of $\alpha$. This again demonstrates the excellent performance of our conformal prediction classifier.
Finally, we point out that no fraud-detecting method, including our conformal prediction classifier, will work well if the given data are not informative enough, in the sense that the correlation coefficient between each feature and the label is very low. This is not a drawback of our method. When all features are barely correlated with the label, the information supplied by the features has little bearing on the label; in this case, no method is expected to provide a reasonable solution. For example, the highest absolute correlation coefficient between the six principal components and the label in Example 1 is modest. Though the correlation coefficient is a bit low, the huge sample size compensates for it, and the result is satisfying. The same is true of the highest absolute correlation coefficient between the six principal components and the label in the Mendeley data. To further illustrate this point, we consider leaving out all features except the six features with the lowest absolute correlation coefficients and keeping the label, and we then apply our method to this modified dataset. Table 4 shows that the results are completely unsatisfactory.

In practice, an actuary should first check whether at least one feature is correlated with the label to a reasonable extent. In the case of continuous features, this can be done by calculating the correlation coefficient between each feature and the label. If the answer is affirmative, then the actuary can proceed further to select the right tool for fraud detection. Otherwise, the task of fraud detection may be too challenging to be completed without more data. Also, a large sample size can compensate for a relatively weak correlation.
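This screening step is straightforward to automate; a minimal R sketch (ours, with illustrative names):

```r
# Pre-screening sketch: compute the correlation of each continuous feature
# with the 0/1 label and keep the k features with the largest |correlation|.
screen_features <- function(X, Y, k = 6) {
  r <- apply(X, 2, cor, y = Y)                  # point-biserial correlations
  order(abs(r), decreasing = TRUE)[seq_len(k)]  # column indices of the top k
}
```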
4.2. Categorical features
Intuitively, the requirement that at least one feature is associated with the label to a reasonable extent should also apply when the features are categorical. In this case, the correlation coefficient is no longer appropriate for measuring the association between categorical variables. Instead, we should use Cramér’s V.
Let $X$ and $Y$ be two categorical variables such that $X$ takes $s$ categorical values $x_1, \ldots, x_s$ and $Y$ takes $t$ categorical values $y_1, \ldots, y_t$. For a sample of $(X, Y)$ pairs with size $n$, we put

$$\begin{aligned}
n_{i\cdot} &= \text{the number of times } x_i \text{ is observed in the data}, \\
n_{\cdot j} &= \text{the number of times } y_j \text{ is observed in the data}, \\
n_{ij} &= \text{the number of times } (x_i, y_j) \text{ is observed in the data}.
\end{aligned}$$

Then the corresponding chi-squared statistic is given by

$$\chi^2 = \sum_{i=1}^{s} \sum_{j=1}^{t} \frac{(n_{ij} - n_{i\cdot} n_{\cdot j}/n)^2}{n_{i\cdot} n_{\cdot j}/n}.$$
Cramér's V between $X$ and $Y$, denoted as $V(X, Y)$, is defined as

$$V(X, Y) = \sqrt{\frac{\chi^2 / n}{\min\{s - 1, t - 1\}}}.$$
Like the correlation coefficient, $V(X, Y)$ takes values in $[0, 1]$, where a higher value of $V(X, Y)$ means a higher degree of association between $X$ and $Y$, and vice versa. In particular, $V(X, Y) = 0$ and $V(X, Y) = 1$ denote total lack of association and perfect association, respectively.

Example 3. Here we consider a public dataset that contains auto insurance claims over a year in a given territory. The data are available at https://www.kaggle.com/code/buntyshah/insurance-fraud-claims-detection. The dataset consists of categorical features and one categorical label that denotes whether a claim is legitimate or fraudulent. Like the previous two datasets, this auto insurance claim dataset displays a typical imbalance: legitimate claims far outnumber fraudulent claims.
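Cramér's V as defined above is easy to compute in R via the (uncorrected) chi-squared statistic; a small helper (ours) that can be applied to each feature and the label:

```r
# Cramér's V between two categorical vectors, per the formula above.
cramers_v <- function(x, y) {
  tab  <- table(x, y)                                        # the n_ij counts
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt((chi2 / sum(tab)) / (min(dim(tab)) - 1)))
}
```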
To apply our conformal prediction classifier, we first encode all the categorical features using the relative frequency of observations within each category. Take the feature WitnessPresent as an example: the two categorical values "no witness" and "witness present" are encoded as their respective proportions among all claims. For the encoded data, we check Cramér's V between each feature and the label. It turns out that each feature is only very weakly associated with the label. Moreover, the additional challenge for categorical features mentioned at the end of Section 3.2 makes things even worse. Thus, our method is not expected to perform well on such a dataset. To see this, we still follow the validation set approach to split the encoded data into a training dataset and a test dataset, and we train our conformal prediction classifier on the training dataset before testing it on the test dataset, using the six features having the highest association with the label. Table 5 displays the results.

The performance of our conformal prediction classifier is unsatisfying. This public dataset is not informative enough, in the sense that the association between each feature and the label is too weak. In particular, Table 5 shows that our conformal prediction classifier discovers many noninformative cases. This is not a flaw of our method. Quite the opposite—it shows that our conformal prediction classifier can tell the actuary that the data are noninformative when they are.
Next, we use the R package GenOrd to simulate a new categorical variable $X^*$ such that Cramér's V between $X^*$ and the label equals 0.75. Then we form a simulated dataset using this simulated feature, the label from the original data, and the five features with the highest values of Cramér's V with the label in the original data. Finally, we apply our conformal prediction classifier to the simulated data for several levels $\alpha$. The results are summarized in Table 6. Since the association between $X^*$ and the label is high, the results are satisfactory, and the test error rate is lower than each chosen level $\alpha$.

To further investigate the effect of the degree of association between the features and the label on prediction accuracy, we repeat the aforementioned simulation for lower values of Cramér's V; Tables 7 and 8 demonstrate the results. For the case where the association between $X^*$ and the label is medium, the results are mixed: when the targeted level $\alpha$ is not too demanding, the test error rate is lower than $\alpha$, but for the more demanding (smaller) values of $\alpha$, the test error rate is unacceptably higher than $\alpha$. When the association between $X^*$ and the label is low, the results are unacceptable for each given level $\alpha$. Recall that the highest value of Cramér's V between a feature and the label in the original data is very small. If the association of $X^*$ with the label is also low, then the highest Cramér's V between the label and any of the six features in the simulated data will still be small. Therefore, the unsatisfying performance of the conformal prediction classifier here is no surprise.
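We do not reproduce the exact GenOrd procedure here, but the idea can be sketched with a simpler stand-in (ours, reusing the cramers_v helper above): for a balanced binary label, a binary feature that matches the label with probability $p$ has Cramér's V of approximately $|2p - 1|$, so $p = 0.875$ targets $V = 0.75$.

```r
# Simplified stand-in for the GenOrd step (illustrative only): simulate a
# binary feature that agrees with the label Y with probability p.
simulate_feature <- function(Y, p) ifelse(runif(length(Y)) < p, Y, 1 - Y)

set.seed(1)
Y      <- rbinom(10000, 1, 0.5)            # a balanced synthetic label
x_star <- simulate_feature(Y, p = 0.875)   # targets Cramer's V of about 0.75
cramers_v(x_star, Y)                       # check the achieved association
```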
5. Concluding remarks
We have proposed a new fraud-detecting method based on a general machine learning strategy called conformal prediction. Our method has three desirable properties: (1) it guarantees finite-sample validity, (2) it is model-free, and (3) it has solid theoretical support. For practical purposes, when actuaries apply our method to predict possibly fraudulent claims, the test error rate is expected to be below the preassigned level $\alpha$ when at least one feature is reasonably associated with the label and the sample size is sufficiently large. We have demonstrated that our method applies to both continuous and categorical features. In practice, the actuary may also encounter the case where the data contain both numerical and categorical features. This "mixed" case can be handled as in Section 3: the actuary can first encode all the categorical features into numerical values and then apply our method as described in Section 3.1.
The proposed method may not be applicable when the sample size $n$ is too small relative to a given coverage probability level $\alpha$. Specifically, if $(n+1)\alpha$ is less than 1, then every plausibility value exceeds $\alpha$, and the prediction region will be $\{0, 1\}$; in this case, the result is noninformative. In addition, the proposed method only guarantees finite-sample validity. Though empirical evidence shows that it yields low false positive and false negative rates, like many other machine learning methods, the proposed method does not guarantee provably low false positive and false negative rates.

Acknowledgments
We thank the anonymous reviewer for many helpful comments and suggestions.
Funding
Liang Hong is grateful to the CAS for its support of his research.