Côté, Marie-Pier, Brian Hartman, Olivier Mercier, Josh Meyers, Jared Cummings, and Elijah Harmon. 2025. “Synthesizing Property & Casualty Ratemaking Datasets Using Generative Adversarial Networks.” Variance 18 (September).
  • Figure 1. WGAN schema. The arrows represent the flow of the training process.
  • Figure 2. Architecture inside the generator (orange) and the critic (blue) for our multicategorical and continuous WGAN. The dimensions \(d_1, \ldots, d_p\) represent the number of levels in categorical variables \(1, \ldots, p\). FC stands for fully connected and BN stands for batch normalization.
  • Figure 3. CTGAN schema. The model flow is illustrated for the case when the selected column is Fuel Type and the selected value for that column is Diesel.
  • Figure 4. Architecture inside the CTGAN generator (orange) and critic (blue). The dimensions \(d_1, \ldots, d_p\) represent the number of levels in categorical variables \(1, \ldots, p\). The generator takes two inputs: Gaussian random noise and the condition cond, which encodes the randomly selected feature value (see Figure 3). FC stands for fully connected, BN for batch normalization, Gumbel for the Gumbel softmax activation, and drop for dropout.
  • Figure 5. MNCDP-GAN schema. Orange relates to the generated data and green to the original data. The colored arrows represent the flow into the loss for the training of each network, with the autoencoder in red, the generator in orange, and the discriminator in blue. For DP training, noise is added when training the decoder and the critic.
  • Figure 6. Comparison of univariate categorical variable distributions.
  • Figure 7. Comparison of response variable distributions.
  • Figure 8. Comparison of univariate categorical variable distributions for MNCDP-GAN models with \(\epsilon \in\{5,10000,100000, \infty\}\).
  • Figure 9. Comparison of the synthesized and real group frequencies for each class of the four categorical variables.
  • Figure 10. Comparison of univariate numerical variable distributions.
  • Figure 11. Comparison of the synthesized and real empirical probability of a claim for each group.
  • Figure 12. Comparison of the Poisson regression coefficients for the synthesized and real data.

Abstract

Confidentiality constraints often make it difficult to access or share datasets for methodological development in actuarial science and other fields that rely on personal data. We show how to design three types of generative adversarial networks (GANs) that build a synthetic insurance dataset from a confidential original dataset. The goal is to obtain synthetic data that no longer contain sensitive information but retain the structure and multivariate relationships of the original dataset. To model the specific characteristics of insurance data, we use GAN architectures adapted for multicategorical data: a Wasserstein GAN with gradient penalty (MC-WGAN-GP), a conditional tabular GAN (CTGAN), and a Mixed Numerical and Categorical Differentially Private GAN (MNCDP-GAN). For transparency, we illustrate the approaches on a public dataset, the French motor third-party liability data. We compare the three GANs on their ability to reproduce the original data structure and predictive models, on privacy, and on ease of use. We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the easiest to use, and the MNCDP-GAN guarantees differential privacy.

This work was supported by an Individual Grant from the Casualty Actuarial Society.

Accepted: January 21, 2022 EDT