Simulation Engine for Adaptive Telematics Data

Banghee So; Himchan Jeong

1. Introduction and motivation

Recent innovations in technology have made it possible to use telematics devices to track individual automobile vehicle usage data, such as mileage, speed, and acceleration. Telematics refers to “the use or study of technology that allows information to be sent over long distances using computers” (Oxford English Dictionary). Telematics are widely used in many fields. The insurance industry has been applying telematics to some practices, such as usage-based insurance, which allows them to collect additional information from driving records beyond traditionally observable policy and driver characteristics. Using a plug-in device or mobile application to collect data, insurers can easily gather relevant information from specific policyholders, providing a more sophisticated method for classifying risk.

Inspired by much interest in this practice, telematics use for automobile insurance has been actively studied in the actuarial literature. Ayuso et al. (2014) published one of the earliest studies to address using telematics for pay-as-you-drive insurance. The authors analyzed telematics data and found that vehicle usage differed substantially between novice and experienced young drivers.

Other research has also demonstrated that telematics provide valuable additional data that supplements traditional rating variables. According to Ayuso et al. (2016), gender—a widely used traditional rating variable—is not a significant rating factor after controlling for observed driving factors available through telematics. In a later study, Ayuso et al. (2019) found a significant effect of driving habits on the expected number of claims. Guillen et al. (2019) found that telematics devices help to more precisely predict excess zeros in claim frequencies. Other studies that demonstrated the efficacy of telematics information for improving traditional risk classification models include Boucher (2017), Gao et al. (2019), Pesantez-Narvaez et al. (2019), and Pérez-Marín et al. (2019). Recently, Guillen et al. (2020; 2021) used telematics data to analyze near miss events.

Despite ongoing academic and industry research, access to telematics data has been limited because of privacy concerns; this has made it difficult for actuarial and insurance community members to apply more diverse analytics to telematics data. To address this barrier, So, Boucher, and Valdez (2021) proposed a novel algorithm that generates a synthetic dataset of driver telematics by emulating an existing insurance portfolio. However, the authors’ algorithm replicates the distributions of independent and dependent variables almost identically, which might make the use of the resulting synthetic dataset less practical if overall characteristics of the target portfolio differ substantially from the original telematics data. Therefore, we proposed an advanced algorithm of adaptive telematics data generation that incorporates prior knowledge of an insurance market’s or company’s specific characteristics. We used the synthetic dataset from So, Boucher, and Valdez (2021) as the source data for our algorithm and generated an output dataset with modified distributions of the observed responses and covariates. This approach could provide a concrete method for generating synthetic telematics datasets based on insurers’ various needs.

The remaining sections are organized as follows. Section 2 introduces the general concept and characteristics of telematics information and synthetic datasets. Section 3 covers our main work, which includes constructing the simulation engine for adaptive telematics (SEAT) data, along with relevant theoretical foundations and actual implementation. Section 4 illustrates an empirical application of the proposed method by generating a portfolio that matches the South Korean insurance market. We also discuss the characteristics of the generated portfolio compared with the original portfolio. We conclude with remarks in Section 5.

2. Telematics and synthetic data

As mentioned in Section 1, vehicle usage telematics data are collected and transmitted on a real-time basis; therefore, telematics data collection and processing differs substantially from that of traditional data. According to Johanson et al. (2014), a driving signal dataset covering 1,000 vehicles will generate about 560 gigabytes per day. Since a typical automobile insurance portfolio includes millions of drivers, problems may arise regarding the effectiveness of data recording, data privacy (Duri et al. 2002, 2004), and feature extraction. Telematics data feature extraction is illustrated in Figure 1.

Figure 1.Feature extraction from telematics data (Weidner, Transchel, and Weidner 2016).

Conventionally, telematics data have been feature engineered into specific forms of attributes that are meaningful and ready to use for predicting auto insurance claims. Table 1 introduces a brief list of summary statistics that can be used in practice (Gerardo and Lee 2009).

Table 1.Examples of extracted features from telematics data.

Category	GPS	Mileage	Operation
Items	Car location	Weekend mileage	Average speed
	Running status	Night mileage	Average acceleration
	Travel time	Average monthly mileage	Average RPM

Recently, Wüthrich (2017) and Gao and Wüthrich (2018) suggested a different framework for telematics data feature extraction and dimension reduction. Since telematics data inherently form a continuum of observations, the authors considered the velocity-acceleration heatmaps and analyzed them via a k-means clustering algorithm to classify car drivers’ risk. Henceforth, we focus on emulation and analysis of a summarized form of telematics data rather than of raw data, as found in most actuarial studies.

However, finding a publicly accessible telematics dataset, even in a summarized form, has been extremely difficult, owing to privacy concerns of insurers who are reluctant to provide their driving records to the public. Given this, insurers are increasingly interested in using synthetic data, because it is free from privacy issues yet maintains the original dataset’s essential characteristics. Synthetic data can also be easily implemented for construction and validation of statistical models that improve risk classifications. For example, Gan and Valdez (2018) created a synthetic dataset of a large variable annuity portfolio that can be used to develop annuity valuation or hedging techniques such as metamodeling. Gabrielli and Wüthrich (2018) proposed an individual claims history simulation machine that helps researchers calibrate their own individual or aggregate reserving models. Cote et al. (2020) applied generative adversarial networks to synthesize a property and casualty ratemaking dataset, which could be used for predictive analytics in the ratemaking process. Avanzi et al. (2021) introduced an individual insurance claims simulator with feature control, which allows modelers to assess the validity of their reserving methods by back-testing.

To the best of our knowledge, So, Boucher, and Valdez (2021) is the first study to address synthesizing a dataset that includes features engineered from an actual telematics dataset. The authors used a three-stage process that applies various machine learning algorithms. First, they generated feature variables of a synthetic portfolio via an extended version of the synthetic minority oversampling technique (Chawla et al. 2002). Second, they simulated a corresponding number of claims via binary classifications using feedforward neural networks. Finally, they simulated corresponding aggregated amounts of claims via regression feedforward neural networks.

Despite its novelty, the extended synthetic minority oversampling algorithm for feature generation produced almost identical distributions of the traditional and telematics feature variables from the synthetic and original datasets, as shown in the Appendix of So, Boucher, and Valdez (2021). This poses a concern for researchers and practitioners wishing to use a synthetic dataset, given disparities in target market and original data feature portfolios. Therefore, we propose an innovative simulation engine, SEAT, that uses target market prior information to synthesize a feature portfolio similar to that of the target market.

3. Simulation of adaptive telematics data

3.1. Desirable characteristics of a synthetic telematics dataset

Before introducing SEAT in detail, we will discuss the agreed-upon desirable characteristics of a synthetic dataset. First, the dataset should be accessible to the public. As previously mentioned, limited access to telematics data is a major obstacle to developing and back-testing ratemaking methods with telematics features. Second, the dataset should be flexible enough to satisfy the needs of modelers with diverse and specific interests. Finally, dataset granularity should be assured so that the data can be used to train, test, and apply various features related to individual risk classification via predictive modeling.

Therefore, we aimed to develop a publicly available method that allows anyone to access the sample dataset, data generation routine, and source codes from the following link (https://github.com/bheeso/SEAT.git). The resulting dataset is fully granular and flexible in that it can match various target market profiles of interest. In future research, we hope to incorporate other important characteristics, such as longitudinality and multiple lines of business to emulate serial and/or between-coverage dependence.

3.2. Description of the source data

As discussed earlier, SEAT uses synthetic insurance claims data from So, Boucher, and Valdez (2021; http://www2.math.uconn.edu/~valdez/data.html, accessed on April 30, 2025) as the data source, which consists of traditional and telematics features and two response variables.

The source dataset contains 52 variables, which can be categorized into three groups:

11 traditional features, such as insurance exposure, age of driver, and main use of the vehicle.
39 telematics features, such as total distance driven and number of sudden accelerations.
Two response variables that describe the claim frequency and aggregated claim amounts.

Table 2 shows the name, description, and data attributes of the available features in the dataset. Note that feature attributes are important because they affect the data generation scheme that involves random perturbation, as elaborated in Step 4 of Section 3.3. For detailed information and preliminary analysis of the dataset, see Section 3 of So, Boucher, and Valdez (2021).

Table 2.Source dataset variable names and descriptions.

Type	Variable	Description	Attributes
Traditional	`Duration`	Duration of the insurance coverage of a given policy, in days	Ordinal
	`Insured.age`	Age of insured driver, in years	Ordinal
	`Insured.sex`	Sex of insured driver: male, female	Categorical
	`Car.age`	Age of vehicle, in years	Ordinal
	`Marital`	Marital status (single/married)	Categorical
	`Car.use`	Use of vehicle: private, commute, farmer, commercial	Categorical
	`Credit.score`	Credit score of insured driver	Ordinal
	`Region`	Type of region where driver lives: rural, urban	Categorical
	`Annual.miles.drive`	Annual miles expected to be driven declared by driver	Continuous
	`Years.noclaims`	Number of years without any claims	Ordinal
	`Territory`	Territorial location of vehicle	Categorical
Telematics	`Annual.pct.driven`	Annualized percentage of time on the road	Continuous
	`Total.miles.driven`	Total distance driven in miles	Continuous
	`Pct.drive.xxx`	Percent of driving day xxx of the week: mon/tue/…/sun	Continuous
	`Pct.drive.xhrs`	Percent vehicle driven within x hrs: 2hrs/3hrs/4hrs	Continuous
	`Pct.drive.xxx`	Percent vehicle driven during xxx: wkday/wkend	Continuous
	`Pct.drive.rushxx`	Percent of driving during xx rush hours: am/pm	Continuous
	`Avgdays.week`	Mean number of days used per week	Continuous
	`Accel.xxmiles`	Number of sudden accelerations 6/8/9/…/14 mph/s per 1,000 miles	Ordinal
	`Brake.xxmiles`	Number of sudden brakes 6/8/9/…/14 mph/s per 1,000 miles	Ordinal
	`Left.turn.intensityxx`	Number of left turns per 1,000 miles with intensity 08/09/10/11/12	Ordinal
	`Right.turn.intensityxx`	Number of right turns per 1,000 miles with intensity 08/09/10/11/12	Ordinal
Response	`NB_Claim`	Number of claims during observation	Ordinal
	`AMT_Claim`	Aggregated amount of claims during observation	Continuous

3.3. Adaptive data generation scheme

The proposed algorithm in SEAT uses predetermined distributions of the traditional covariates such as sex, region, age, and main use of the insured vehicle, which are easily accessible in the target market. In our source dataset, they are named Insured.sex, Region, Insured.age, and Car.use, respectively.

Here we assume that we know the benchmark ratios of classes in four covariates (Insure.age, Insure.sex, Region, Car.use) and set these ratios as the inputs to the algorithm. Table 3 shows how these inputs are defined. For example, \(P^*(A1)\) (in Table 3) indicates the ratio of insured who are between ages 16 and 30. We produced an adaptive portfolio by applying the SEAT algorithm, based on the ratios of the four variables.

Table 3.Description of inputs associated with the benchmark covariates.

Variable	Classes	Inputs
Insured.age	(16,30) / (30,40) / (40,50) / (50,60) / (60,103)	\(P^(A1)\), \(P^(A2)\), \(P^(A3)\), \(P^(A4)\), \(P^*(A5)\)
Insured.sex	male / female	\(P^(M)\), \(P^(F)\)
Region	rural / urban	\(P^(R)\), \(P^(U)\)
Car.use	private / commute / farmer / commercial	\(P^(C1)\), \(P^(C2)\), \(P^(C3)\), \(P^(C4)\)

When the input values are provided, the following algorithm generates data that are adapted from the source data:

Step 1: Calculate conditional distributions of the benchmark covariates (see Figure 2) using source data and inputs. The four variables may be colinear, so we used modified conditional distributions to reflect such collinearities in the algorithm. To obtain the modified conditional distributions, we used the following probability rules where \(P^*(\cdot)\) represents the probabilities for the target distribution. (For simplicity, we write \(P^*(\texttt{Variable 1}=x)\) , \(P^*(\texttt{Variable 2}=y\,|\,\texttt{Variable 1}=x)\), and \(P^*(\{\texttt{Variable 1}=x\} \cap \{\texttt{Variable 2}=y\})\) as \(P^*(x)\), \(P^*(y|x)\), and \(P^*(x \cap y)\), respectively.)

\[P^*(R|F)=\frac{P^*(R\cap F)}{P^*(R\cap F)+P^*(U\cap F)}\] and \[P^*(R|M)=\frac{P^*(R\cap M)}{P^*(R\cap M)+P^*(U\cap M)},\] where

\[\scriptsize \begin{aligned} P^*(R\cap F)&=P(R\cap F)\frac{P^*(R)}{P(R)}, \ P^*(U\cap F)=P(U\cap F)\frac{P^*(U)}{P(U)}, \\ P^*(R\cap M)&=P(R\cap M)\frac{P^*(R)}{P(R)}, \ P^*(U\cap M)=P(U\cap M)\frac{P^*(U)}{P(U)}, \end{aligned} \tag{1}\]

and \(P(\cdot)\) is the sample ratio calculated from the source data. One can easily check that (1) is equivalent to \[P^*(F|R)=P(F|R) \text{ and } P^*(F|U)=P(F|U).\] We may call \(\displaystyle\frac{P^*(R)}{P(R)}\) and \(\displaystyle\frac{P^*(U)}{P(U)}\) adjustment factors since they play an important role in adjusting ratios of the original portfolio to ratios in the adaptive portfolio. Other conditional ratios are subsequently calculated in the same way.
Step 2: Create a random sample from the standard uniform distribution and categorize it based on each conditional distribution in the order of Insured.sex, Region, Insured.age, and Car.use. First, one can sample a random observation for Insured.sex from its benchmark distribution [\(P^*(\texttt{Insured.sex}=F)\) and \(P^*(\texttt{Insured.sex}=M)\)]. If a uniform random number is categorized into female (F), then, in the next stage, a newly generated uniform random number is categorized based on the conditional distribution of Insured.sex [\(P^*(\texttt{Region}=R\,|\,\texttt{Insured.sex}=F)\) and \(P^*(\texttt{Region}=U\,|\,\texttt{Insured.sex}=F)\)].
Step 3: By repeating Step 2 \(N\) times, obtain \(N\) configurations of the four traditional covariates \(\{\mathcal{C}_i\}_{i=1,\ldots, N}\) where \(N\) is the number of total observations in the adaptive portfolio to be generated.
Step 4: From the original portfolio, sample \(N\) observations of the remaining variables (telematics and claims information) from their empirical distributions with Gaussian white noise, conditional on each configuration of the benchmark covariates. Note that the number of possible combinations of the benchmark covariates increases exponentially, depending on the number of covariates and levels of a covariate. Therefore, many cases may have configurations with sparse empirical observations. To avoid repeatedly producing the same samples, we twisted samples slightly using Gaussian white noise. For example, for configuration \(\mathcal{C}_i=\{\texttt{Car.use}=F, \ \texttt{Region}=R,\) \(\texttt{Insured.age}=A1,\) \(\texttt{Insured.sex} =C1\}\), the corresponding independent variables (both telematics and remaining traditional features) \(\mathcal{T}_i=(T^{(1)}_i, \ldots, T^{(p_{\mathcal{T}})}_i)\), number of claims \(N_i\), and total amount of claims \(S_i\) are generated as follows: \[\begin{aligned} T^{(j)}_i &\simeq \tilde{F}^{-1}_{T^{(j)}}(U_i|\mathcal{C}_i) + \sigma Z_i, \\ N_i &\simeq \tilde{F}^{-1}_\mathcal{N}(U_i|\mathcal{C}_i) + \sigma Z_i, \\ S_i &= \tilde{F}^{-1}_\mathcal{S}(U_i|\mathcal{C}_i) + \sigma Z_i, \\ \end{aligned}\] where \(Z_i\) is a random sample from the standard normal distribution, \(p_{\mathcal{T}}\) is the number of telematics and remaining traditional features, \(\tilde{F}_X(\cdot | \mathcal{C})\) is the empirical distribution of feature \(X\) given the configuration of the benchmark covariate \(\mathcal{C}\), and \(\sigma\) is the input that controls the degree of random perturbation. For simplicity and convenience of employing the SEAT algorithm, we use the same \(\sigma\) to all covariates. However, to give an equivalent degree of randomness to all covariates, their scales are adjusted by centering and scaling with the median and interquartile range. For example, for a variable \(X\), the scale-adjusted variable is defined as \(\frac{X-Q_2(X)}{Q_3(X)-Q_1(X)}\) where \(Q_1(X), Q_2(X)\), and \(Q_3(X)\) denote the first, second, and third quartile of the observed values of \(X\), respectively. Note that for ordinal variables (for example, number of claims or number of sudden brakes), \(T^{(j)}_i\) and/or \(N_i\) are rounded to the nearest ordinal number, respectively. We also set an upper limit and lower limit of each variable so that values are inside the expected interval.

Figure 2.A branch of generating conditional distributions with the benchmark covariates.

Algorithm 1 summarizes the data generation steps for given input values. Note that the proposed algorithm provides room for flexibility by (sub)data generation. For example, if a modeler is interested in analyzing policyholders in urban areas, one can impose \(P^*(R)=0\) so that the resulting dataset only contains policyholders in urban areas and their corresponding features.

Algorithm 1.SEAT.

\[ \mathbf { Input: } \begin{aligned} & {\left[P^*(M), P^*(F)\right], } \\ & {\left[P^*(R), P^*(U)\right], } \\ & {\left[P^*(A 1), P^*(A 2), P^*(A 3), P^*(A 4), P^*(A 5)\right], } \\ & {\left[P^*(C 1), P^*(C 2), P^*(C 3), P^*(C 4)\right] } \\ & N, \sigma \end{aligned} \]
Output: an adaptive dataset matching the target feature portfolio (size is \(N\) )

Import the data introduced in So et al. (2021) as the source dataset;
Calculated conditional distributions, \(P^*(R \mid F), P^*(U \mid F), P^*(R \mid M), P^*(U \mid M), P^*(A 1 \mid R F)\),
\[ P^*(A 2 \mid R F), \cdots, P^*(A 5 \mid U M), P^*(C 1 \mid A 1 R F), \cdots, P^*(C 4 \mid A 5 U M) ; \]
for \(i=1, \ldots, N\) do
Create a random sample, \(u_i\), from \(U(0,1)\);
Categorize \(u_i\) based on conditional distributions for Insured.sex, Region,
Insured.age, and Car.use, subsequently, and create a configuration, \(\mathcal{C}_i\);
From the source data, randomly sample an observation having \(\mathcal{C}_i\) with white Gaussian noise ( \(\sigma\) ) as follows:
\[ \begin{aligned} T_i^{(j)} & \simeq \tilde{F}_{T^{(j)}}^{-1}\left(U_i \mid \mathcal{C}_i\right)+\sigma Z_i \\ N_i & \simeq \tilde{F}_{\mathcal{N}}^{-1}\left(U_i \mid \mathcal{C}_i\right)+\sigma Z_i \\ S_i & =\tilde{F}_{\mathcal{S}}^{-1}\left(U_i \mid \mathcal{C}_i\right)+\sigma Z_i \end{aligned} \]
end
Return the adaptive dataset: \(\left\{\left(\mathcal{C}_1, \mathcal{T}_1, N_1, S_1\right),\left(\mathcal{C}_2, \mathcal{T}_2, N_2, S_2\right), \cdots,\left(\mathcal{C}_N, \mathcal{T}_N, N_N, S_N\right)\right\}\).

4. Empirical application: Synthetized telematics data for the South Korean insurance market

To test the effectiveness of SEAT, we generated a synthetized telematics dataset tailored to the South Korean insurance market with predetermined inputs and analyzed its properties compared with the original telematics dataset. We used the South Korean insurance market as a case study, but the proposed algorithm is applicable to any national or regional market, as well as to any industry or company, as it can be adjusted based on the specific interests of the modeler.

The South Korean insurance market has grown rapidly in recent years, ranking seventh in total premium volume in 2020 (Korean Insurance Research Institute 2021), with a global market share of 3.1%. Nevertheless, the use of telematics data both in actuarial practice and in research is still developing in South Korea. Although Han (2016) discussed regulatory and legal issues regarding telematics data for usage-based insurance in the South Korean insurance market, follow-up research on implementation of ratemaking methods with telematics data is lacking, partially due to the scarcity of publicly available data. Furthermore, only one company, Carrot Insurance, actively uses driver telematics information in their ratemaking scheme. We therefore extracted basic profiles of the South Korean insurance market from various sources (The Korean National Police Agency 2020; Korean Statistical Information Service 2020a, 2020b; The Korean Ministry of Land, Infrastructure and Transport 2020) and used these as the inputs to generate the adaptive portfolio. Table 4 provides the input specification of the benchmark covariates used in dataset generation.

Table 4.Specification of inputs for the South Korean insurance market.

Variable	Inputs
Insured.age	\(P^(A1)=0.16\), \(P^(A2)=0.21\), \(P^(A3)=0.24\), \(P^(A4)=0.22\), \(P^*(A5)=0.17\)
Insured.sex	\(P^(M)=0.6\), \(P^(F)=0.4\)
Region	\(P^(R)=0.1\), \(P^(U)=0.9\)
Car.use	\(P^(C1)=0.175\), \(P^(C2)=0.517\), \(P^(C3)=0.005\), \(P^(C4)=0.303\)

Running the algorithm (as described in Section 3.3) with the input specifications produced an adaptive portfolio; the resulting ratios of target variables are summarized in Table 5 and Figure 3. The ratios confirm that the proposed algorithm effectively replicated specified benchmark covariates to mimic the target market accordingly, as we expected.

Table 5.Realized ratio of the benchmark covariates in the generated portfolio.

Variable	Classes	Realized ratio
Insured.age	(16,30) / (30,40) / (40,50) / (50,60) / (60,103)	0.16 : 0.21 : 0.24 : 0.22 : 0.17
Insured.sex	male / female	0.59 : 0.41
Region	rural / urban	0.1 : 0.9
Car.use	private / commute / farmer / commercial	0.117 : 0.591 : 0.008 : 0.284

Figure 3.Target ratios (inputs) and original ratios.

While the target and original ratios are quite close in the benchmark covariates, the other variable distributions are not identical to those of the originals, which distinguishes the generated portfolio from the original. Figure 4 shows that the distributions of the two response variables differ between the generated and original portfolios (NB_Claim, AMT_Claim), where the variability increases as the perturbation input \(\sigma\) increases.

Figure 4.Comparisons of NB_Claim (left) and AMT_Claim (middle and right) in target and original data.

Similarly, Figure 4 shows how the distributions of the covariates’ generated values may differ from those of the covariates’ original values. As the level of perturbation increases, the influence of random noise in the generated covariates also increases. For example, in the case of Years.noclaims, the distribution becomes almost uniform when a large perturbation parameter is applied.

5. Concluding remarks

This article explored a new algorithm, SEAT, which allows users to generate insurance claims datasets with telematics features that are tailored to match the policy characteristics of a target market. The proposed algorithm employs both the ratios of selected policyholder characteristics and collinearity of these benchmark covariates in the original data as inputs. This generates a synthetized claims dataset that matches the target market where the resulting dataset is not identical to the original. The proposed simulation engine, SEAT, is also fully accessible to the public and can generate granular datasets with telematics features.

We acknowledge that the proposed algorithm uses the original feature portfolio as the main source; therefore the behavior of the synthetic portfolio is heavily dependent on the source dataset. Nevertheless, this research is meaningful in that it provides a novel method of producing telematics claims datasets with flexibility and accessibility, which encourages both practitioners and researchers to deepen their understanding of telematics data, and provides insight into the potential uses of existing telematics data. Finally, the procedure described in this paper could be expanded to many different types of data in addition to telematics. For instance, it could be applied as an extended version of the Casualty Actuarial Society (2018) CAS loss simulator to simulate data for a claims-level loss reserving model with individual policy characteristics.

Acknowledgments

The authors thank Daesan Shin Yong Ho Memorial Society for financial support through its Insurance Research Grant program.

Simulation Engine for Adaptive Telematics Data

Abstract

1. Introduction and motivation

2. Telematics and synthetic data

3. Simulation of adaptive telematics data

3.1. Desirable characteristics of a synthetic telematics dataset

3.2. Description of the source data

3.3. Adaptive data generation scheme

4. Empirical application: Synthetized telematics data for the South Korean insurance market

5. Concluding remarks

Acknowledgments

References

Simulation Engine for Adaptive Telematics Data

Abstract

1. Introduction and motivation

2. Telematics and synthetic data

3. Simulation of adaptive telematics data

3.1. Desirable characteristics of a synthetic telematics dataset

3.2. Description of the source data

3.3. Adaptive data generation scheme

4. Empirical application: Synthetized telematics data for the South Korean insurance market

5. Concluding remarks

Acknowledgments

References

This website uses cookies