1. Introduction
Regression analysis is one of the most widely used statistical methods, dating back to Francis Galton’s 1875 discovery of the regression slope (Stanton 2001). The key concept behind this methodology is identifying a relationship between a set of variables. The premise is that a dependent or response variable, labeled Y, is assumed to be related in some functional form to several covariates (also referred to as explanatory or independent variables), labeled X_{1}, X_{2}, . . . , X_{k}. The response variable, Y, measures a random quantity of interest. Examples include the annual claims on an automobile insurance policy and the age at death of an insured life. The covariates, X_{1}, X_{2}, . . . , X_{k}, could be a set of characteristics of the insured. For example, in automobile insurance the variables might be age, gender, accident history, credit score, and so on. In life insurance the variables might be age, gender, smoking status, blood pressure, height, weight, and so on. If it is possible to express how the distribution of the response variable is related to the covariates, the result is often a reduction in the uncertainty of insured events that depend on Y.
The basic OLS regression model presents a specific model for the relationship. The distribution of
given the covariates is assumed to be normal with a variance that is constant (that is, not related to the covariates) and a mean that is related to the covariates as The equivalent (in this case) techniques of maximum likelihood and leastsquares are used to estimate the unknown coefficients. The covariates can be considered as random observations (in which case it is assumed the complete set of variables has a multivariate normal distribution) or as fixed quantities, for example, where the levels are set in a designed experiment. A measure of the success of the model is the reduction in the variance of through the use of the covariates. Basic OLS regression is covered in most introductory statistics texts.Insurance losses rarely follow a normal distribution. Models for the age at death are usually skewed to the left, while models for casualty losses are usually skewed to the right. Furthermore, there is no logical argument, other than mathematical simplicity, to support a linear relationship between the covariates and the mean of the response variable. Embrechts, McNeil, and Straumann (2002) show how the Pearson correlation coefficient can be misleading when the underlying distributions are not normal. They advise using copulas to model data that are not normal because such models capture a greater variety of relationships (essentially being nonparametric).
A significant advance was provided by the development of the generalized linear model (GLM). The standard reference is McCullagh and Nelder (1989). The generalization comes in two places. First, the distribution of Y must be a member of the exponential family (which includes many commonly used distributions such as binomial, Poisson, negative binomial, normal, gamma, inverse Gaussian, and lognormal, see pages 26–28 of McCullagh and Nelder). Second, the relationship between the mean of Y and the covariates is expressed as E(Y  X_{1} = x_{1}, . . . , X_{k}= X_{k}) = g^{–1}(β_{0} + β_{1}x_{1} + ··· β_{k}X_{k}) where g(y) is called the link function. The conditional variance of Y is no longer constant but will depend on the conditional mean. The nature of the dependence depends on the choice of distribution for Y. Modern statistical software includes GLM analysis for specific combinations of distributions and link functions. Estimation is done by maximum likelihood. This model has been widely used in automobile pricing and there are numerous articles providing actuarial applications (Aitkin et al. 1989; Mosley 2004).
The next step in generalizing this model is to remove the restriction that the distribution of Y be from the exponential family. There is nothing inherent in the GLM that demands the use of the exponential family, although some of the calculations become more difficult and there is no convenient expression for the conditional variance. This generalization is briefly discussed by Klugman, Panjer, and Willmot (2008) and is further expanded in Venter (2007).
We are proposing an alternative approach to generalizing the OLS regression model. We are not claiming that it is superior to GLM or OLS, but rather that it provides an alternative that has the possibility of providing a better fit to observed data. Begin by assuming that all the variables (response and covariates) are random observations from some joint probability distribution. Because some of the variables are dependent (if there are no dependencies, then possessing the covariates provides no useful information about the response variable) a model with dependence must be used. A popular method for creating a multivariate model with dependence is the copula model. There are numerous papers with actuarial applications of copulas. A good introduction is provided by Frees and Valdez (1998). In this paper, we will use the normal copula for the reasons described in Section 2.
This paper will develop copula regression as follows. Section 2 introduces copula functions and the normal copula. Section 3 covers parameter estimation and computation of the conditional mean. Section 4 concludes the paper with comprehensive examples.
2. Copulas
Copulas are useful for describing multivariate nonnormal distributions. They describe the dependence structure between the variables. Marginal distribution functions are used as inputs to the copula and these can be any set of disparate distributions. Thus, a copula is a realistic way of describing the multivariate distributions, as the analyst normally has a good idea about each marginal distribution and seldom has a good idea about the joint distribution of these variables.
A copula model can be constructed as follows. Begin with
random variables with distribution functions The joint distribution function is created by applying the copula function, asFX1,…,Xn(x1,…,xn)=C[FX1(x1),…,FXn(xn)]
Only certain functions can be used as copula functions in order to ensure that regardless of the n input distribution functions, the resulting function is a legitimate multivariate distribution function. The key is that the copula creates the dependence structure independently of the marginal distributions. This concept of defining a multivariate dependence structure (and hence copula) is based on Sklar’s theorem, which states:
Sklar’s Theorem (Sklar 1959): Let F be an ndimensional joint cumulative distribution function for the random variables X_{1}, X_{2}, . . . , X_{n} with marginal distribution functions F_{1}(x_{1}), F_{2}(x_{2}), . . . , F_{n}(x_{n}). Then there exists a Copula function, C, such that
FX1,…,Xn(x1,…,xn)=C[F1(x1),…,Fn(xn)]
If the marginal distributions are continuous, then C is unique. If any distribution is discrete, then C is uniquely defined on the support of the joint distribution.
In the case of continuous distributions, the multivariate dependence structure and marginal distributions can be separated and the copula can be considered “independent” of the margins (Joe 1997, 12–13). The copula thus allows arbitrary continuous marginal distributions to be combined and describes their dependence structure.
There are several bivariate and multivariate copula distributions discussed in the literature. Hutchinson and Lai (1990), Nelson (1999), Joe (1997) and Klugman, Panjer, and Willmot (2008) are excellent texts that discuss various properties of bivariate copula distributions. Joe (1997) also discusses the properties of several multivariate copula distributions. Embrechts, McNeil, and Straumann (2002) use copulas to show how the Pearson correlation coefficient can be misleading. Klugman and Parsa (1999), and Frees and Valdez (1998) use copulas in modeling insurance data. Others who have contributed to the copula literature are Genest and MacKay (1986) and Miller and Liu (2002).
While many copula functions have been identified, we believe only two are useful for building a regression model with several covariates. One of the features of any useful regression model is that the degree of association (e.g., Spearman’s Rank Correlation) between the response variable and each covariate need not be constant. We are aware of only two copula models that allow for this, the normal copula and its generalization, the tcopula (which is based on the multivariate Student’s t distribution. For example, see the list of copulas in Klugman, Panjer, and Willmot (2008, Chapter 7), where it can be seen that the other copulas do not allow for variations in the association measure. In a recent paper by Crane and Van Der Hoek (2008) they use arbitrary bivariate copulas for building regression models. Because their methods do not extend to multivariate situations, their approach is not useful for most applications. Also, they do not address the case of discrete variables, which are common in practice.
Computing predicted values of Y using copula regression is a threestep process.

Assume a model for the joint distribution of all the variables (response and covariates),

Estimate the parameters of the model (the parameters for the selected marginal distributions and the parameters of the copula), and

Compute the predicted values of Y given a set of covariates by using the conditional mean of Y given the covariates.
In this paper, we assume a normal copula model for the joint distribution of the variables and maximum likelihood for parameter estimation. For the normal copula explicit formulas for the likelihood function and conditional distribution are available. The conditional mean will usually need to be obtained by numerical integration. Because this is a onedimensional integration, there are wellestablished accurate algorithms for obtaining the answer. Thus, this ease of use is one of the reasons we have selected the normal copula. The methodology presented in this paper can be used with other multivariate copulas (such as t or elliptical), but they would require extensive numerical analysis.
Note that if all the variables are assigned the normal distribution and the normal copula is used, the results of maximum likelihood estimation will match those from OLS. Thus, like GLM, this method is a (different) generalization of the OLS model.
3. Parameter estimation
The normal (multivariate normal) copula function is defined as (Song 2000)
CR(u1,…,un)=G[Φ−1(u1),…,Φ−1(un)]
where Φ(u) is the standard normal cumulative distribution function and G is the multivariate normal cumulative distribution with zero means, unit variances, and correlation matrix R. The matrix R has ones on the diagonal and the off diagonal elements are correlations. However, it is important to note that when the normal copula is used these are not the correlations of the modeled variables.
Thus, for arbitrary marginals, the distribution function induced by the normal copula is
F(x1,…,xn)=G{Φ−1[F1(x1)],…,Φ−1[Fn(xn)]}
If the marginal distributions are continuous then the density function is given by (Clemen and Reilly 1999; Nelsen 1999)
f(x1,…,xn)=f1(x1)f2(x2)⋯fn(xn)exp{−vT(R−1−I)v2}×R−0.5
where
is a vector with th element and is the identity matrix. Note that if the correlations are all zero, then is the identity matrix and the joint density is the product of the marginal densities and all the variables are independent.This density function can then be used to determine the likelihood function.
Since we will be estimating predicted values using the conditional mean of Y given the independent variables, we will need the conditional distribution of x_{n} given x_{1}, . . . , x_{n−1}. It is
f(xn∣x1,…,xn−1)=fn(xn)exp{−0.5[{Φ−1[Fn(xn)]−rTR−1n−1v∗}21−rTR−1n−1r−{Φ−1[Fn(xn)]}2]}×(1−rTR−1n−1r)−1/2
where
andwhere
is an vector that is the rightmost column of with the last element removed.In the regression context, X_{n} will be the response variable (Y). Thus, this is the density function of the response variable given the covariates. The expected value is the predicted value (the median could also be used) and the standard deviation is equivalent to the standard error in OLS regression. However, the standard deviation will not be constant as it is in OLS regression.
Maximum likelihood estimates are asymptotically normal with covariance matrix given by the inverse of Fisher’s information matrix (I(θ)^{–1}). To use this result, we would have to calculate the second derivatives of the likelihood function and take their expected values. In this problem, it is difficult to calculate second derivatives and their expected values, so we instead use the observed information matrix (Klugman, Panjer, and Willmot 2008). Once again, this is a wellunderstood problem. Most commercially available numerical analysis software provides the Hessian matrix, which is used to estimate the asymptotic covariance matrix (inverse of the Hessian matrix) for the estimates. We present this for Example 1 in Section 4.
When all the marginal distributions are continuous, maximum likelihood estimation is performed by using the density function in (3.3). Note that if there are, for example, eight marginal distributions, each with two parameters, there will be 8(2) + 8(8 – 1)/2 = 44 parameters to estimate. Often it is easier to use the marginal data to obtain estimates of the parameters of the marginal distributions and use the joint data to estimate the correlation parameters in the normal copula. These can be used as is, producing a suboptimal result. On the other hand, one can use the estimates from the marginal distributions as starting values for the global maximization (which is optimal). We used the latter procedure for estimating the parameters. This was done numerically. One author used the solver tool in Excel and the other author used SAS IML and the NLPTR optimization procedure (NLPFDD to compute the Hessian matrix).
When any of the marginal distributions are discrete, there is an additional problem. This is illustrated with an example. Let X take on the values 0, 1, and 2 with probabilities 0.3, 0.4, and 0.3 respectively. Let Y be the other marginal with a uniform (0,1) distribution. Consider the independence copula. Now consider the contribution to the likelihood function for the observation x = 1 and y = 0.6. It is the product of the probability function of X at 1 and the density function of Y at 0.6, which is 0.4(1) = 0.4. But now suppose the normal copula is being used. This requires calculation of cumulative probabilities from the multivariate normal distribution (in order to obtain the discrete probability), a nontrivial task. In addition, if there are several discrete variables, the probabilities for higher dimensional cubes will be required.
In OLS and GLM regressions, distributions for the covariates are usually not specified. In a copula model these distributions must be specified. However, there is an easy way to replicate this situation—for the marginal models, use the empirical distribution. Thus, in many applications all of the covariates will be modeled as discrete variables. The same problem affects conditional distributions.
Our recommended solution is to replace each discrete distribution with a continuous distribution. The simplest choice is to use a kernel density with a uniform kernel and a small bandwidth. In particular, select the bandwidth so that it is less than half the distance separating the two closest discrete probability points.
The likelihood function requires the kernel density pdf and cdf at each data point. Let b be the bandwidth. For any observed value, let p(x) be the probability function and P(x) be the distribution function of the discrete distribution. For writing the likelihood function, for an observation at x, the kernel density pdf is p(x)/(2b) and the cdf is P(x) – p(x)/2. Note that the value of b has no impact on the MLE.
For the conditional distribution, the only case that needs further study is when the response variable is discrete. The formula for the conditional density function (for continuous variables) can be written
f(xn∣w)=fn(xn)g[Fn(xn),w]
where
represents all the other variables, and is defined by removing from (3.4). To obtain the discrete conditional probability it makes sense to integrate the continuous conditional probability over the bandwidth:p(xn∣w)=∫xn+bxn−bp(xn)(2b)−1×g[Pn(xn)+p(xn)(t−xn−b)/(2b),w]dt=p(xn)2b∫b−bg[P(xn)+p(xn)(z/2b−0.5),w]dz.
The second line follows from the transformation z = t − x_{n}.
The integral must be evaluated numerically; the value of b has no impact on the result.
4. Examples
To illustrate our methodology, we present six examples using simulated data. In all of the examples, we have three variables—one dependent and two independent and the sample size is 50 in all the cases. We use the same correlation matrix in all the examples as given below.
R=[10.70.70.710.70.70.71]
To simulate the 50 trivariate data points (see the Appendix for the simulated data), we used the methodology described in Clemen and Reilly (1999). It uses the following process:
Let
be specified along with the correlation matrix.
First generate a vector (u_{1}, u_{2}, u_{3}) from a multivariate normal distribution with zero means and unit variances and the specified correlation matrix.

Calculate
for each of the three variables. 
Calculate
The resulting vector from step (3) will have the specified marginals and copula correlation structure (see Appendix 1 for one of the data sets).
We measured the error in estimation using
and compared the results to OLS and GLM (where appropriate). We chose OLS and GLM for comparison as they are the procedures commonly used in practice. Nowadays, GLM is commonly used whenever the distribution of the dependent variable is nonnormal.Example 1: The distributions of the three variables^{[1]} are: Table 1.
Pareto Pareto and is the dependent variable. The parameters were estimated in two stages, using the method of maximum likelihood. In stage 1, only the data from the marginal distribution was used to estimate the parameters. In the second stage, the joint density was maximized using estimates from stage 1 as starting values. In this example, parameter estimates for did not converge, so we approximated it using a gamma distribution. The copula regression model performed much better than OLS. The MLEs and errors are given inIn Table 2 are the asymptotic standard deviations of the estimates and correlations among them (diagonal terms are standard deviations and offdiagonal terms are correlations). The SAS IML procedure NLPFDD was used to calculate the Hessian matrix. It uses a finite difference approximation procedure to calculate the Hessian. We used the Hessian to calculate the observed information matrix (asymptotic variance matrix).
From this it is possible to construct confidence intervals. For example, the 95% confidence interval for alpha of the Pareto variable (X_{1}) is given by 3.44 ± 1.96*0.266606.
The maximum likelihood estimate of the correlation matrix is given below.
ˆR=[10.7110.6990.71110.7130.6990.7131]
As a reminder, while this is called a correlation matrix, it is merely the parameters of the normal copula. These values do not represent the correlations of the marginal variables.
Example 2: Once again, the distributions of the three variables were: X_{1} ~ Pareto ( = 3, θ = 100), X_{2} ~ Pareto (4, 300), X_{3} ~ Gamma (3,100) and X_{3} was the dependent variable. This time, instead of assuming a distribution for X_{1} and X_{2} and estimating the parameters using MLE, we estimate the distributions empirically. This is similar to OLS and GLM wherein we make no assumption on the distribution of the independent variables. Since the empirical distribution is discrete, we approximated it by a continuous distribution and in particular we used the kernel density with uniform kernel as previously described. We then took the limiting case of the uniform kernel as band width goes to zero, resulting in the following equations for the distribution and density functions (where x is one of the data points).
F(x)=#obs≤xn−#obs=x2n
and
f(x)=1n
We also analyzed the data using a generalized linear model using the gamma distribution and log link. Once again, the copula regression model performed better than OLS and, much better than GLM which performed very poorly. The results are given in Table 3.
The number of parameters in the various models is five for the copula model (two gamma parameters and three correlation coefficients, four for OLS (three regression parameters and the standard error), and four for GLM (three regression parameters and the gamma shape parameter). The large differences in the sums of squares are not likely due to the one parameter difference. A more formal comparison might use the loglikelihood functions and an information criterion.
Example 3: The three variables are: X_{1} ~ Poisson (λ = 5), X_{2} ~ Pareto (4, 300), X_{3} ~ Gamma (3,100) and X_{3} is the dependent variable. We estimated the parameters using the method of maximum likelihood in two stages as in Example 1. The distribution of X_{2} was approximated by an exponential distribution. Here again, the copula regression model did better than OLS. The results are given in Table 4.
Example 4: The situation is same as in Example 3 but the distributions of X_{1} and X_{2} were estimated empirically as in Example 2. Since X_{1} is discrete, its distribution was approximated by a continuous distribution. Once again, we used the uniform kernel as in Example 2.
We also analyzed the data using GLM using gamma distribution and log link. The copula regression model performed better than OLS and GLM. Surprisingly, GLM performed the worst. The results are given in Table 5.
Example 5: The distribution of the three variables are X_{1} ~ Poisson (λ = 5), X_{2} ~ Pareto (4, 300), X_{3} ~ Gamma (3,100), and X_{1} is the dependent variable. In this example, the dependent variable is discrete. The distribution of X_{2} was approximated by an exponential distribution. In this example, the copula regression model performed better than OLS. The results are given in Table 6.
Example 6: The situation is same as in Example 5 but the distributions of X_{2} and X_{3} are estimated empirically as in Example 2. The data were also analyzed using GLM using the Poisson distribution and log link. The copula regression model performed the best and surprisingly, GLM performed the worst. The results are given in Table 7.
Summary
In this paper, we have provided an alternative approach to generalizing OLS regression using a multivariate copula. When compared to GLM, the approach seems to be different rather than better or worse. Its strength lies in the ability to choose distributions for dependent variables that are not members of the exponential family. Also, it allows the researcher to arbitrarily choose distributions for the marginals (GLM only requires specification of the distribution of the dependent variable). As skewed heavytailed distributions are common in insurance, this method provides a way to incorporate arbitrary distributions into regression models. Like GLM, this method allows nonlinear dependence to be modeled. Because correlation is not a useful measure of dependence in a nonnormal world (Embrechts 2002), it is vital to describe the relationship between variables appropriately. In this regard, copula regression provides a good alternative to OLS and GLM.