# 1. Introduction

Recommender systems (RSs) are machine learning algorithms that suggest to loyal clients the best next product to buy or propose to prospective customers the most appropriate set of products (Ricci et al. 2011).

Currently, these algorithms constitute one of the most active research areas of machine learning (ML), partly because of their successful employment in various business sectors, from the entertainment industry to e-commerce. Netflix’s movie suggestions (Netflix 2009), Amazon’s recommended site sections, and Twitter’s “users to follow” are just a few well-known examples of RS use cases.

However, despite this great popularity, the insurance industry (II) has paid relatively little attention to them. In particular, by mid-2018 only a few articles (Mitra, Chaudhari, and Patwardhan 2014; Rokach et al. 2013; Qazi et al. 2017) in the actuarial literature had focused on the application of RSs to the II. The reasons behind the paucity of scientific papers have mainly to do with the intrinsic nature of the II when compared to the fields where RSs have been rewardingly applied (Rokach et al. 2013). Building an RS capable of bringing value to both companies and consumers, using canonical techniques, requires either a large set of similar products to recommend or a vast amount of unwittingly provided profiling data characterizing clients and prospects. In contrast, in the insurance industry the number of products available is relatively small, limiting the effectiveness of classical RS models. Moreover, many techniques adopted in RS analyses are quite different from those routinely used by actuaries. Nevertheless, we believe that unlocking the potential of RSs in the II could benefit insurers, for example by proposing the purchase of nonregulated coverages, which typically offer a higher underwriting margin than regulated ones.

This work aims at providing an overview of the state-of-the-art RS algorithms from theory to code implementation in a real-case scenario, showing that such techniques could find fruitful application in actuarial science. Furthermore, we suggest a way to mathematically model the recommendations in the II field and we build an RS approach based on supervised learning algorithms. For our data, this suggested method shows promising results in overcoming the difficulties of the classical RS models in the II.

# 2. Brief literature review

According to Sarwar et al. (2001), RSs consist of the application of statistical and knowledge discovery techniques to the problem of making product recommendations based on previously recorded data. The objective of making such recommendations is the development of a value-added relationship between the seller and the consumer, by promoting cross-selling and conversion rates.

RSs date back to the early '90s with the Apriori algorithm (Agrawal and Srikant 1994). Basically, this technique returns a set of “if-then” rules that suggests the next purchase given the history of previous transactions (“market basket”) using simple probability rules. Nowadays, RSs are grouped into two macro categories: those using “collaborative filtering” (CF) reasoning and those using “content-based filtering” (CBF) (Ansari, Essegaier, and Kohli 2000). Whereas the CF approach computes its predictions on the behavior of similar customers (Breese, Heckerman, and Kadie 1998), the CBF approach uses descriptions of both items and clients (or prospects) to understand the kinds of items a person likes (Aggarwal 2016). For an exhaustive discussion, Ricci et al. (2011) provide a broad overview.

As one may note from the increasing relevance of the RecSys conference (Association for Computing Machinery 2018), the literature regarding RSs is vast and growing. Nevertheless, it is a topic little explored in actuarial and insurance-related research. In Mitra, Chaudhari, and Patwardhan (2014), we found an analysis of the potential of RSs in the II, but mostly from a business perspective and without an empirical approach. In contrast, Rokach et al. (2013) underscore the difficulty of applying RSs in the II, addressing the issue of the small number of items to recommend. Nevertheless, their paper shows that adopting a simple CF algorithm to recommend riders could significantly benefit sales department productivity.

Qazi et al. (2017) propose the use of Bayesian networks to predict upselling and downselling; however, the paper provides scarce details of the methodology.

It is important to stress that none of the techniques discussed in Mitra, Chaudhari, and Patwardhan (2014), Rokach et al. (2013), and Qazi et al. (2017) belongs to any model class with which actuaries would be familiar. For that reason, we offer an overview of the classical RS algorithms in the following sections.

# 3. The framework

In this section we introduce how to build and evaluate a model based on machine learning algorithms, focusing our attention on the development of a recommendation system for insurance marketing. In particular, we stress that our final goal is the construction of an RS that can predict a client’s propensity to buy an insurance coverage in order to support business decisions.

## 3.1. Methodology

### 3.1.1. Modeling ratings data

Most classical RSs base their suggestions on a ratings matrix $R$ that encodes information about customers and products. The name “ratings matrix” comes from the fact that originally each row contained customer ratings for sold products. In practice, the element $r_{ij}$ could be expressed either on a numeric-ordinal scale (for example, a one-to-five-star rating) or as 0–1 data, sometimes referred to as “implicit rating” (IR). In other words, IR represents data gathered from user behavior where no explicit rating has been assigned.

IR is the data format most relevant for our application: a value of 1 means that the policyholder has expressed a preference for (i.e., purchased) that specific insurance coverage, while a zero value can mean that the customer does not like the product, does not need it now, or is unaware of it. Thus, while the 1 class contains only positive samples, the zero class is a mixture of both positive and negative samples. In fact, it is not known whether the policyholder dislikes the item or would appreciate it, if he or she were aware of it. For the sake of completeness, other forms of IRs can be represented by the number of times the user purchased that specific product or by the times that product has been used (in the entertainment industry, for example).

A frequently used metric to compute product similarities in the implicit data context is the Jaccard index (Jaccard 1901), $J = \frac{|A \cap B|}{|A \cup B|}$, with $|A \cap B|$ representing the number of occurrences where products $A$ and $B$ are jointly held.
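As an illustration, the Jaccard index between two products can be computed directly from the sets of customers holding each of them. The following minimal Python sketch uses hypothetical customer IDs:

```python
def jaccard(a, b):
    """Jaccard index between two sets (e.g., customer IDs holding each product)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy example: hypothetical holders of product A and product B.
holders_a = {1, 2, 3, 5}
holders_b = {2, 3, 6}

print(jaccard(holders_a, holders_b))  # 2 joint holders out of 5 distinct -> 0.4
```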

ML predictive algorithms are usually identified by a family of models. Such algorithms typically depend on the choice of one or more hyperparameters, within specific ranges, whose optimal value cannot usually be known a priori. The process of determining the best set of hyperparameters, known as “model tuning,” requires the testing of the algorithm over different choices of hyperparameters. This procedure is a fundamental step for the success of most ML models.

The simplest approach to this problem consists of the adoption of a Cartesian grid to sample the hyperparameter space. Because this issue is crucial and deserves much thought, we refer the interested reader to the work of Claesen and Moor (2015) for more refined techniques.
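As a sketch of the Cartesian-grid approach, the snippet below enumerates every combination of a hypothetical hyperparameter grid and retains the best-scoring one; `cv_score` is a placeholder standing in for an actual cross-validated model fit:

```python
from itertools import product

# Hypothetical hyperparameter grid for an item-based CF model.
grid = {"k_neighbors": [5, 10, 20], "min_support": [0.01, 0.05]}

def cv_score(params):
    # Placeholder: in practice, fit the model and return a cross-validated AUC.
    return -abs(params["k_neighbors"] - 10) - params["min_support"]

# Cartesian product of all grid values, keeping the best-scoring combination.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=cv_score,
)
print(best)  # the grid point with the highest (placeholder) score
```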

### 3.1.2. Assessing performance

An important consideration is the assessment of model performance. In particular, we need to understand how good a developed model is and how to compare alternative models.

Typically, the performance of a model is evaluated on a holdout data set, e.g., splitting customer data (usually the data set rows) into training and testing groups, or using a more computationally intensive *k*-fold cross-validation approach, which is a standard in predictive modeling (Kuhn and Johnson 2013).

When dealing with RS data, Hahsler (2018) suggests the following approach for the test group recommendations: withhold one or more purchases from their profiles, apply the chosen model, and assess how well the predicted items match those of the withheld ones. For binary and IR data, this reduces to verifying that the withheld items are contained within the top-*N* predicted items.
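This withhold-and-match check can be sketched in a few lines; the users, withheld purchases, and top-3 recommendation lists below are hypothetical:

```python
def hit_rate(withheld, top_n):
    """Fraction of test users whose withheld item appears in their top-N list."""
    hits = sum(1 for user, item in withheld.items() if item in top_n.get(user, []))
    return hits / len(withheld)

# Hypothetical withheld purchases and model top-3 recommendations per user.
withheld = {"u1": "home", "u2": "life", "u3": "auto"}
top3 = {"u1": ["home", "auto", "life"], "u2": ["auto", "health", "home"], "u3": ["auto"]}
print(hit_rate(withheld, top3))  # 2 of the 3 withheld items are recovered
```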

When ratings are produced for implicit data, binary classification metrics (such as accuracy, AUC, log loss, and so forth) can be used to score and rank RS algorithms. In particular, the log loss metric measures the discrepancy between the estimated probability $\hat{y}_i$ and the actual occurrence $y_i$ of the target event for the *i*-th observation:

$$\text{logloss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right].$$
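For concreteness, the log loss can be computed with a plain-Python sketch; the clipping constant `eps` is a common numerical safeguard against `log(0)`, not part of the formula itself:

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))  # ≈ 0.184
```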

On the other hand, the area under the receiver operating characteristic (ROC) curve (known as the AUC) measures the model’s ability to classify the target (binary) outcomes. In fact, a model’s output can be used in a decision-making process by defining a certain threshold and classifying cases as belonging to the purchaser category when their score exceeds that threshold. Actual and predicted groups can then be cross-tabulated in the so-called confusion matrix, which can be summarized as follows, where $N = TP + FP + FN + TN$ is the total scored sample size:

| Predicted | Observed: Event | Observed: Non-Event |
|---|---|---|
| Event | True Positive (TP) | False Positive (FP) |
| Non-Event | False Negative (FN) | True Negative (TN) |

Performance metrics such as the following can be computed from this matrix:

- Accuracy, $\frac{TP+TN}{N}$
- Sensitivity (recall), $\frac{TP}{TP+FN}$
- Precision (positive predictive value), $\frac{TP}{TP+FP}$
- Specificity (true negative rate), $\frac{TN}{TN+FP}$
- Negative predictive value, $\frac{TN}{TN+FN}$
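The metrics above follow directly from the four confusion-matrix counts; the counts in this sketch are hypothetical:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # recall
        "precision": tp / (tp + fp),     # positive predictive value
        "specificity": tn / (tn + fp),   # true negative rate
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts from scoring a purchase-propensity model.
m = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
print(m["accuracy"])     # (40 + 130) / 200 = 0.85
print(m["sensitivity"])  # 40 / 60 ≈ 0.667
```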

As Kuhn and Johnson (2013, 262–65) detail, these metrics depend on the selected threshold value; in particular, there is a trade-off between sensitivity and specificity, and the AUC represents the area under the curve traced by all possible combinations of sensitivity and specificity as the threshold varies.

Briefly, the log loss metric is used when it is necessary to estimate the probability as precisely as possible, while the AUC score is more appropriate when it is more important to distinguish between two classes (such as purchasers and non-purchasers for marketing strategy).

Furthermore, if two periods of data are available, a possible strategy is to build the RS based on the initial-period data in the training set and to assess how well the algorithm predicts purchases in the next period in the test set. This work adopts this assessment approach. We remark that the scoring is clearly performed on a product basis and is evaluated only on customers that do not hold that precise product in the initial period.

We stress that although splitting the data set into the training set and the test set is not necessary for all the approaches presented in this paper, for the sake of comparison of the algorithms, it has been shared across all the techniques.

We decided to use the ROC AUC as the metric to test the algorithms because it is, in essence, a ranking performance measure (Flach and Kull 2015; Hand and Till 2001). Our recommendation approach is not based on a classification task (i.e., whether client A is going to buy a specific product or not) but rather on a ranking task (i.e., client A is more inclined to buy a specific product than client B).

For that reason, we prefer the ROC AUC, although our data set presented highly unbalanced classes, as usually happens for some kinds of insurance coverages. The appendix briefly addresses this specific issue.

## 3.2. The available data and modeling limitations

The chosen algorithms will be trained on an anonymized insurance marketing data set containing about 1 million policyholders’ purchased coverages in two subsequent periods, as well as some demographic covariates. The examined approach’s ability to predict next-period purchases (from the set of insurance coverages still not held) is measured. Supplementary binary variables indicate whether the policyholder purchased the coverage in period one and whether the policyholder purchased the coverage in period two; we use this supplementary information to assess the performance of the analyzed RSs.

We exclude CBF algorithms from our analysis, because the number of possible kinds of insurance coverage is lower by an order of magnitude than the number of items typically traded in online industries (e.g., Netflix, Amazon, Spotify, etc.). In fact, our application deals with a small number of products—the different kinds of insurance coverages—that show relative heterogeneity of characteristics. For example, a life or health policy is quite different from auto coverage. For this reason, if one client has both kinds of policies, while another one has just a life policy, the implication that the latter might also like an auto policy does not necessarily hold as easily as it might in other industries where the variety of products is much greater.

In addition, only two periods’ worth of portfolio composition, rather than the full history of previous purchases, is available. For that reason, sequence-based algorithms, which have been found to be extremely useful in the literature, are not considered in this study.

## 3.3. Mathematical formulation

With regard to the issue of recommendations in the insurance field, we present a mathematical framework capable of describing the functional setting between customers and products. We assume that the willingness of a person $i$ to buy a product $j$ may be represented by a probability function $P_j$ depending on the demographic characteristics of the person, $d_i$; the social pressure, $s_i$; and the timing, $t_i$.

It is reasonable to endow the aforementioned variables with the following characteristics:

- The demographic features (e.g., age, gender, wealth) identify the client’s state and account for most of the $P_j$ value.
- Strong social pressure might increase (or decrease) the likelihood of buying a product $j$; hence $s_i$ can be regarded as a quantity capturing factors external to the client (e.g., advertising campaigns or the client’s network).
- The timing is a random variable that models phenomena related to unpredictable events (e.g., a tennis player who loses his racket might need a new one); thus, this variable is usually not observable.

In an RS framework, a product-similarity bias is also typically taken into account. However, as pointed out previously, in the insurance market the number of distinct products is small, and hence this bias effect, if present, can be neglected. For this reason, the initial assumption of a probability function $P_j$ depending only on the single product $j$, rather than on the entire set of products, is reasonable.

Notwithstanding this framework, finding the analytic formulation of the function $P_j$ or discovering the dynamics of the problem is a complex task.

In the latter part of the following section, we propose a way of exploiting supervised learning algorithms to learn the function $P_j$ explicitly from the client's demographic features $d_i$, disregarding the timing $t_i$ and the social pressure $s_i$ variables because of the lack of data. In particular, the next section is devoted to the implementation of all the classical RS approaches, for the sake of completeness and comparison.

# 4. Statistical techniques

Here we provide an overview of different RS algorithms. In particular, we underscore that the power of these techniques lies in their generality: some of the algorithms presented have been, and could further be, employed to tackle other problems in the actuarial sciences.

## 4.1. arules

Association rules (arules) are a well-established approach to finding interesting relationships in large databases (Agrawal and Srikant 1994). Following the notation of Hahsler, Grün, and Hornik (2005), let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of $n$ binary attributes called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions called the database. Each transaction in $D$ has a unique transaction ID and contains a subset of the items in $I$. A rule is defined as an implication of the form $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The item sets $X$ and $Y$ are called the antecedent (left-hand side, LHS) and consequent (right-hand side, RHS) of the rule. Various measures of significance have been defined on rules, the most important of which are these:

- *Support*, $\text{supp}(X \cup Y) = \Pr(X \cap Y)$, is defined as the proportion of transactions in the data set that contain that specific item set;
- *confidence* is the probability of finding the RHS given that the transaction contains the LHS: $\text{conf}(X \Rightarrow Y) = \Pr(Y \mid X)$; and
- *lift* is a measure of association strength, in terms of departure from independence: $\text{lift}(X \Rightarrow Y) = \frac{\Pr(X \cap Y)}{\Pr(X)\Pr(Y)}$.

The process of identifying rules can yield many of them, and thus we usually set a minimum constraint either on confidence or on support, or on both, in order to select “meaningful” rules. In addition, rules are often ranked according to the lift metrics to highlight more important associations. Nevertheless, the selection of the confidence and support thresholds is subjective given that no optimal metric was defined in the original arules model. The interested reader can consult Tan, Kumar, and Srivastava (2004) for a deeper discussion of the topic.
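To make support, confidence, and lift concrete, the sketch below evaluates single-antecedent rules on a toy set of market baskets; the coverage names and transactions are invented for illustration:

```python
from itertools import combinations

# Toy market baskets of insurance coverages (hypothetical data).
transactions = [
    {"auto", "home"}, {"auto", "home", "life"}, {"auto", "life"},
    {"home", "life"}, {"auto", "home"},
]
n = len(transactions)

def support(itemset):
    """Proportion of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t) / n

# One-antecedent rules X => Y with their support, confidence, and lift.
items = ["auto", "home", "life"]
for x, y in combinations(items, 2):
    X, Y = frozenset({x}), frozenset({y})
    supp = support(X | Y)
    conf = supp / support(X)
    lift = supp / (support(X) * support(Y))
    print(f"{x} => {y}: supp={supp:.2f} conf={conf:.2f} lift={lift:.2f}")
```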

The two most used algorithms to mine frequent item sets are Apriori and Eclat, the latter being the less well known; Borgelt (2003) provides the interested reader more details about their actual implementation. The Apriori algorithm has also been extended to work on supervised learning problems, as Hahsler (2018) discusses.

A great advantage of the arules approach is that it provides easily interpretable model output that can be easily communicated to nontechnical audiences, like senior management or producers, in the form of simple directives. A drawback of this technique is that there does not exist a direct measure of predictive performance, especially on one-class data.

## 4.2. User-based and item-based collaborative filtering

User-based collaborative filtering (UBCF) provides recommendations for a generic user $u_a$ by identifying the $k$ users most similar to $u_a$ and averaging their ratings for the items not present in the $u_a$ profile. It is assumed that because users with similar preferences rate items similarly, missing ratings can be obtained by first finding a neighborhood of similar users and then aggregating their ratings to form a prediction (Hahsler 2018). A common measure of user similarity, at least for non-implicit data, is the cosine similarity (CS) coefficient,

$$s_{x,y} = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert},$$

which, clearly, is calculated only on the dimensions (items) that both users rated. For a given user $a$ it is possible to define a neighborhood $N(a)$ by selecting, for example, the $k$ nearest neighbors by CS or another (dis)similarity metric. Then the estimated rating of the $j$-th item for user $a$ is simply $\hat{r}_{aj} = \frac{1}{\sum_{i \in N(a)} s_{ai}} \sum_{i \in N(a)} s_{ai} r_{ij}$, where each $i$-th neighbor's rating $r_{ij}$ is weighted by its similarity with user $a$, $s_{ai}$. A serious drawback of UBCF is that it requires holding the users' similarity matrix in memory, which makes it unfeasible on large real data sets.

Another approach, item-based collaborative filtering (IBCF), computes similarities within the set of items, whose dimensionality is usually orders of magnitude lower than that of the users. Then, given the similarity $s_{ij}$ between the $i$-th and $j$-th items and the neighborhood $S(i)$ of the $k$ items most similar to the $i$-th, it is possible to estimate $\hat{r}_{ai} = \frac{1}{\sum_{j \in S(i) \cap \{l;\, r_{al} \neq ?\}} s_{ij}} \sum_{j \in S(i) \cap \{l;\, r_{al} \neq ?\}} s_{ij} r_{aj}$. As described further on, this paper applies IBCF using the Jaccard index as the measure of product distance. In fact, the CS coefficient is not suitable when only implicit data are available at the user level, as jointly purchased products are simply flagged with ones. In general, the Jaccard index appears to be more frequently used with binary or implicit data.

Whereas UBCF requires keeping in memory a similarity matrix of size $u^2$ (with $u$ the number of users), IBCF requires one of size $m^2$ (with $m$ the number of items); an additional memory reduction is obtained by keeping only the $k$ most similar distances. Hence, when the customer base is large, only IBCF is used, and, empirically, moving from UBCF to IBCF appears to cause only small performance losses according to Hahsler (2018).
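A minimal IBCF sketch on implicit data, using the Jaccard index between products as the similarity measure, could look as follows. The binary ratings matrix is a toy example, and unheld products are scored by summing their similarities to the products the customer already holds, a common aggregation for binary data:

```python
import numpy as np

# Toy binary ratings matrix R (rows = customers, columns = products), hypothetical.
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

def jaccard_item_sim(R):
    """Item-item Jaccard similarity matrix from a binary ratings matrix."""
    inter = R.T @ R                       # co-purchase counts
    held = R.sum(axis=0)                  # number of holders per product
    union = held[:, None] + held[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        S = np.where(union > 0, inter / union, 0.0)
    np.fill_diagonal(S, 0.0)              # ignore self-similarity
    return S

S = jaccard_item_sim(R)
# Score products for customer 0 by summing similarities to held products.
scores = R[0] @ S
scores[R[0] == 1] = -np.inf               # do not recommend products already held
print(int(np.argmax(scores)))             # index of the top recommended product
```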

## 4.3. Matrix factorization approaches

Matrix factorization techniques, such as nonnegative matrix factorization and singular value decomposition (SVD), are the mathematical foundation of many RSs belonging to the CF family (Koren, Bell, and Volinsky 2009). The matrix factorization approach approximates the whole ranking matrix into a smaller representation:

$$R_{m \times n} \approx X_{m \times k} \, Y_{k \times n}.$$

This factorization both reduces the problem’s dimensionality and increases computational efficiency. $X$ has the same number of rows as $R$, but typically a much smaller number of columns, $k \ll n$. $Y$ is a $k \times n$ matrix that represents the archetypal feature set generated by combining columns of $R$, while $X$ represents the projection of the original users onto the $Y$ space. The result is twofold: first, each product is mapped onto a reduced space representing its key dimensions (archetypes); second, it is possible to compute each customer's preference along those product dimensions. For example, if we think of products as combinations of a few “essential” tastes, the $Y$ matrix represents all items projected onto the taste dimensions, while the $X$ matrix represents the projection of all users onto the same dimensions. Therefore, the rating given by subject $i$ to product $j$ is estimated as $\hat{r}_{ij} = x_i \, y_j$, the product of the $i$-th row of $X$ and the $j$-th column of $Y$. Our analysis considers four algorithms in this category, representing different variations of the matrix decomposition approach: LIBFM, GLRM, ALS, and SLIM.
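As a minimal illustration of the decomposition, the snippet below builds a rank-$k$ approximation of a toy binary ratings matrix via truncated SVD, one of the factorization methods mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
R = (rng.random((6, 4)) > 0.5).astype(float)  # toy 6x4 binary ratings matrix

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
X = U[:, :k] * s[:k]               # m x k user factors
Y = Vt[:k, :]                      # k x n item (archetype) factors
R_hat = X @ Y                      # rank-k approximation of R

# The estimated affinity of user i for product j is the dot product x_i . y_j.
print(R_hat.shape)                 # (6, 4)
print(np.linalg.norm(R - R_hat))   # reconstruction error of the rank-2 fit
```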

### 4.3.1. LIBFM

A typical solution for X and Y can be found solving

$$\min_{X,Y} \sum_{(u,v) \in R} \left[ f(x_u, y_v; r_{u,v}) + \mu_X \lVert x_u \rVert_1 + \mu_Y \lVert y_v \rVert_1 + \frac{\lambda_X}{2} \lVert x_u \rVert_2^2 + \frac{\lambda_Y}{2} \lVert y_v \rVert_2^2 \right]$$

as shown in Chin et al. (2015), where $f$ is the loss function and $\mu_X$, $\mu_Y$, $\lambda_X$, and $\lambda_Y$ are penalty coefficients analogous to the lasso and ridge ones of the elastic net regression model (Zou and Hastie 2005). For binary matrices, the logarithmic loss is typically used, as in binary matrix factorization (Z. Zhang et al. 2007). This algorithm is efficiently implemented in the LIBFM open-source library, from which it takes its name (Chin et al. 2016).
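The penalized minimization can be sketched with plain stochastic gradient descent; for simplicity this toy example keeps only the squared loss and the L2 (ridge) penalties, dropping the L1 terms:

```python
import numpy as np

rng = np.random.default_rng(42)
R = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]], dtype=float)
obs = [(u, v) for u in range(3) for v in range(3)]  # treat every cell as observed

k, lam, lr = 2, 0.05, 0.1
X = rng.normal(scale=0.1, size=(3, k))  # user factors
Y = rng.normal(scale=0.1, size=(3, k))  # item factors

for _ in range(200):                    # plain SGD over observed entries
    for u, v in obs:
        err = R[u, v] - X[u] @ Y[v]
        X[u] += lr * (err * Y[v] - lam * X[u])   # gradient step with L2 penalty
        Y[v] += lr * (err * X[u] - lam * Y[v])

print(np.round(X @ Y.T, 2))             # approximate reconstruction of R
```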

### 4.3.2. GLRM

Udell et al. (2016) recently introduced the generalized low rank model (GLRM) approach, which is another approach to effectively perform matrix factorization and dimensionality reduction. The compelling feature of this approach is that it can handle mixed-type data matrices, which means that the GLRM can handle matrices filled not only with numerical data but also with some columns of categorical, ordinal, binary, or count data. This is achieved by imposing not just a single loss function but many according to the category of the columns. This enables the GLRM to encompass and generalize many well-known matrix decomposition methods, such as PCA, robust PCA, binary PCA, SVD, nonnegative matrix factorization, and so on. As for LIBFM, the GLRM approach supports regularization, thus allowing the estimation process to encourage some structure such as sparsity and nonnegativity in X and Y matrices. Notable applications for GLRMs lie in missing values imputation, features embedding, and dimensionality reduction using latent variables.

### 4.3.3. ALS

The alternating least squares (ALS) algorithm (Hu, Volinsky, and Koren 2008) was specifically developed for IR data. Used by Facebook, Spotify, and others, it has gained increasing popularity. The fact that Apache Spark, a well-known open-source big-data environment, chose to implement ALS as the reference recommendation algorithm for its machine learning engine (Meng et al. 2016) further attests to its practical relevance.
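A compact sketch of ALS for implicit feedback, following the confidence-weighting idea of Hu, Volinsky, and Koren (2008), alternates closed-form ridge solves for the user and item factors; the data and parameter values below are toy choices:

```python
import numpy as np

rng = np.random.default_rng(1)
R = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1]], dtype=float)
m, n = R.shape
k, lam, alpha = 2, 0.1, 10.0
C = 1.0 + alpha * R                        # confidence weights (Hu et al. 2008)

X = rng.normal(scale=0.1, size=(m, k))     # user factors
Y = rng.normal(scale=0.1, size=(n, k))     # item factors

for _ in range(10):                        # alternate closed-form least squares
    for u in range(m):                     # solve the weighted ridge for user u
        A = (Y * C[u][:, None]).T @ Y + lam * np.eye(k)
        b = (Y * C[u][:, None]).T @ R[u]
        X[u] = np.linalg.solve(A, b)
    for v in range(n):                     # solve the weighted ridge for item v
        A = (X * C[:, v][:, None]).T @ X + lam * np.eye(k)
        b = (X * C[:, v][:, None]).T @ R[:, v]
        Y[v] = np.linalg.solve(A, b)

print(np.round(X @ Y.T, 2))                # preference scores for all user-item pairs
```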