# 1. Introduction

In collective risk theory, the total amount of losses an insurance company incurs during a time period is modeled by a compound random variable

$$S=\sum_{i=1}^{M}X_i, \tag{1.1}$$

where $M$ is a discrete random variable representing the number of claims and $X_1, X_2, \dots$ are nonnegative independent and identically distributed (i.i.d.) claim size random variables independent of $M$. The tail probability,

$$\theta=P[S>c]$$

for some specified value $c$, and the tail mean (mean excess loss), defined by

$$\tau=E[(S-c)_+],$$

where

$$(S-c)_+=\begin{cases}0, & S\le c,\\ S-c, & S>c,\end{cases}$$

are important risk measures of $S$ because they are closely related to insurance/reinsurance pricing and capital requirements.

Evaluation of the tail probability and the tail mean is not easy, even when the distributions of $M$ and $X$ are known. One approach is to resort to recursive formulas, such as those proposed by Panjer (1981) for the case when the distribution of $M$ belongs to the $(a,b,0)$ class. There is extensive literature on further developments related to Panjer's recursive formula. For details, see the comprehensive book by Sundt and Vernic (2009).

Transform-based techniques, such as the fast Fourier transform (FFT), are also widely used in calculating the distribution of aggregated claims. For an introduction, see the papers by Robertson (1992) and Wang (1998) and the references therein. Embrechts and Frei (2008) provided an excellent comparison of the recursive and FFT methods.

Simulation methods are flexible and can be handy in estimating the tail probability/moments of compound distributions. However, they are subject to sampling errors. That is, different runs of the same simulation method will give different results.

For example, the one-sample crude estimator for $\theta$ is

$$\hat{\theta}_0=I(S>c),$$

where $I(\cdot)$ is an indicator function that takes the value of one if the argument is true and zero otherwise. $\hat{\theta}_0$ is an unbiased estimator of $\theta$ because

$$E[\hat{\theta}_0]=\theta.$$

The variance of $\hat{\theta}_0$ is

$$\operatorname{Var}[\hat{\theta}_0]=\theta(1-\theta),$$

and the coefficient of variation (CoV) is

$$\operatorname{CoV}(\hat{\theta}_0)=\frac{\sqrt{\theta(1-\theta)}}{\theta}=\sqrt{\frac{1-\theta}{\theta}}.$$

Quite often, we are interested in a tail probability of $S$, so that $\theta$ is close to zero. Then the coefficient of variation is huge, which makes the crude method inefficient. Notice that when the simulation is conducted $n$ times, the estimator for $\theta$ is

$$\hat{\theta}_{0,n}=\frac{1}{n}\sum_{j=1}^{n}I\left(S^{(j)}>c\right), \tag{1.2}$$

which has variance $\theta(1-\theta)/n$ and CoV

$$\operatorname{CoV}(\hat{\theta}_{0,n})=\frac{\sqrt{\theta(1-\theta)/n}}{\theta}=\frac{1}{\sqrt{n}}\cdot\sqrt{\frac{1-\theta}{\theta}}\simeq\frac{1}{\sqrt{n\hat{\theta}_{0,n}}}=\left(\sum_{j=1}^{n}I\left(S^{(j)}>c\right)\right)^{-1/2}. \tag{1.3}$$

When $n$ is large, the distribution of $\hat{\theta}_{0,n}$ is approximately normal by the central limit theorem, based on which the confidence interval of $\theta$ is given by

$$\left(\hat{\theta}_{0,n}-z_{1-\alpha/2}\sqrt{\operatorname{Var}[\hat{\theta}_{0,n}]},\ \hat{\theta}_{0,n}+z_{1-\alpha/2}\sqrt{\operatorname{Var}[\hat{\theta}_{0,n}]}\right).$$

In other words,

$$P\left(\frac{|\hat{\theta}_{0,n}-\theta|}{\theta}\le z_{1-\alpha/2}\operatorname{CoV}(\hat{\theta}_{0,n})\right)=1-\alpha.$$

Taking $\alpha$ to be a small value, this means that with probability $1-\alpha$, the relative error of $\hat{\theta}_{0,n}$ is less than $z_{1-\alpha/2}\operatorname{CoV}(\hat{\theta}_{0,n})$. Therefore, $\operatorname{CoV}(\hat{\theta}_{0,n})$ can be a measure for relative error.

In fact, for any estimator $\hat{\psi}$ of some parameter $\psi$, similar reasoning applies. As a result, $\operatorname{CoV}(\hat{\psi})$ is a measure of relative error for $\hat{\psi}$. Therefore, we say that an estimator (simulation method) with a smaller CoV is more efficient.

For our case, the relative error of $\hat{\theta}_{0,n}$ is given in equation (1.3). When $\theta$ is close to zero, $I(S>c)$ rarely equals one, so $\sum_{j=1}^{n}I(S^{(j)}>c)$ is small, resulting in a large CoV; for a moderate sample size, the relative error of the crude estimator is therefore not very satisfactory. Recalling that $\operatorname{CoV}(\hat{\theta}_{0,n})$ is proportional to $1/\sqrt{n}$, we see that in order to decrease the relative error (CoV) of an estimator tenfold, the sample size has to increase a hundredfold.

**Remark 1.1.** *This discussion may remind actuaries of the situation in limited fluctuation credibility theory, in which the sample size of loss data is determined such that the sample mean will provide a credible (low relative error) estimate of the population mean. The idea there is similar and the same square rule applies. In limited fluctuation credibility, to double the credibility factor, the sample size has to be quadrupled.*
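To make the discussion concrete, the following sketch implements the crude estimator (1.2) and the empirical CoV from (1.3). The model is a hypothetical compound Poisson–exponential sum ($\lambda=5$, exponential claims with rate $1$); these assumptions and all parameter values are illustrative only and are not the paper's numerical example.

```python
import math
import random

def poisson_sample(rng, lam):
    # Knuth's multiplicative method; adequate for the small means used here.
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def simulate_S(rng, lam, beta):
    # One draw of S = X_1 + ... + X_M, M ~ Poisson(lam), X_i ~ Exponential(beta).
    m = poisson_sample(rng, lam)
    return sum(rng.expovariate(beta) for _ in range(m))

def crude_tail_estimate(n, c, lam=5.0, beta=1.0, seed=1):
    rng = random.Random(seed)
    hits = sum(simulate_S(rng, lam, beta) > c for _ in range(n))
    theta_hat = hits / n
    # Empirical CoV from (1.3): (sum of indicators)^(-1/2).
    cov_hat = hits ** (-0.5) if hits > 0 else float("inf")
    return theta_hat, cov_hat
```

For a far tail threshold, most runs produce zero hits, so the estimated CoV blows up, illustrating why the crude method is inefficient for small $\theta$.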

In this paper, we study simulation methods (estimators) of the tail probability $\theta$ of compound random variables. As shown, increasing the sample size of the crude estimator is one way to reduce relative error. However, because of the square rule, it is costly, inefficient, and sometimes not feasible. Therefore, instead of increasing the sample size, we propose several more efficient estimators of $\theta$ that, given the same sample size, are still unbiased yet have a lower CoV (relative error).

Reducing the CoV requires reducing the variance of the estimator. In the simulation literature, various variance reduction methods exist. Commonly used techniques include importance sampling, stratified sampling, the conditioning method, and the control variates method, which we review briefly in the next section. For detailed introductions to variance reduction methods, see, for example, the books by Ross (2012) and Asmussen and Glynn (2007).

Needless to say, simulation methods for evaluating the tail probability and tail mean of compound variables are very important for actuaries. However, quite surprisingly, to our knowledge, the literature in this area is very thin. In addition to some examples from Ross (2012), one of the most relevant references is Peköz and Ross (2004), in which the conditioning method was introduced specifically for compound variables. Blanchet and Li (2011) introduced an efficient importance sampling algorithm for estimating the tail distribution of heavy-tailed compound sums. Glasserman, Heidelberger, and Shahabuddin (2000) proposed a method for simulating tail probability (value at risk) of the returns of investment portfolios by combining the importance sampling and stratified sampling methods.

In this paper, we first briefly review some commonly used general variance reduction methods for simulation, then we propose several novel combinations of variance reduction methods specifically for compound distributions. This includes the combination of importance sampling and the conditioning method and the combination of importance sampling and stratified sampling.

Secondly, we extend our methods to simulate the tail probability and tail mean of bivariate compound variables, defined by

$$(S_1,S_2)=\left(\sum_{i=1}^{M}X_i,\ \sum_{j=1}^{N}Y_j\right), \tag{1.4}$$

where $(M,N)$ is a vector of (dependent) random variables representing the numbers of claims in two lines of business. The claim size random variables $X_i$ and $Y_j$ are mutually independent and are independent of the claim numbers $M$ and $N$.

The remaining parts of the paper are organized as follows. Section 2 reviews commonly used variance reduction methods. Section 3 applies them in estimation of the tail probability. Section 4 studies the simulation methods for the tail mean. Sections 5 and 6 extend the results to bivariate compound variables. Section 7 concludes.

# 2. Review of variance reduction methods

We start by briefly reviewing several variance reduction methods that will be used in this paper. For more comprehensive introductions to the topic, see books such as those by Ross (2012) and Asmussen and Glynn (2007).

## 2.1. Importance sampling

Let $\mathbf{Z}=(Z_1,\dots,Z_n)$ denote a vector of random variables having a joint density function $f$, and suppose that we want to estimate

$$\theta=E[g(\mathbf{Z})]=\int g(\mathbf{z})f(\mathbf{z})\,d\mathbf{z},$$

where the integral is $n$-dimensional and over the support of $f$.

Importance sampling finds a probability density function (PDF) $f^*$ such that $f^*(\mathbf{z})>0$ whenever $g(\mathbf{z})f(\mathbf{z})\neq 0$. Since

$$\theta=\int g(\mathbf{z})\frac{f(\mathbf{z})}{f^*(\mathbf{z})}f^*(\mathbf{z})\,d\mathbf{z}=E\left[g(\mathbf{Z}^*)\frac{f(\mathbf{Z}^*)}{f^*(\mathbf{Z}^*)}\right],$$

where $\mathbf{Z}^*$ has density $f^*$,

$$\hat{\theta}_I=g(\mathbf{Z}^*)\frac{f(\mathbf{Z}^*)}{f^*(\mathbf{Z}^*)}$$

is an unbiased estimator for $\theta$.

The importance sampling approach aims to choose an appropriate $f^*$ so that $\hat{\theta}_I$ has a smaller variance compared with the crude estimator. Since

$$\operatorname{Var}\left[g(\mathbf{Z}^*)\frac{f(\mathbf{Z}^*)}{f^*(\mathbf{Z}^*)}\right]=E\left[\left(g(\mathbf{Z}^*)\frac{f(\mathbf{Z}^*)}{f^*(\mathbf{Z}^*)}-\theta\right)^2\right]=\int\frac{\left(g(\mathbf{z})f(\mathbf{z})-\theta f^*(\mathbf{z})\right)^2}{f^*(\mathbf{z})}\,d\mathbf{z},$$

in order to achieve a smaller variance, $f^*$ should be chosen such that the numerator in the integrand is close to zero. That is, $f^*(\mathbf{z})$ should be approximately proportional to $g(\mathbf{z})f(\mathbf{z})$.

If $\mathbf{Z}$ has a finite moment generating function (MGF)

$$M_{\mathbf{Z}}(\mathbf{t})=E[e^{\mathbf{Z}\cdot\mathbf{t}}]=E[e^{Z_1t_1+\cdots+Z_nt_n}]$$

for some $\mathbf{t}\neq\mathbf{0}$, then it is usually handy to choose $f^*$ to be the Esscher transform (exponential tilting) of $f$. That is, we let $\mathbf{Z}^*$ have PDF

$$f^*(\mathbf{z})=\frac{e^{\mathbf{h}\cdot\mathbf{z}}f(\mathbf{z})}{M_{\mathbf{Z}}(\mathbf{h})}$$

for some tilting parameter $\mathbf{h}$. In addition, the MGF of $\mathbf{Z}^*$ is given by

$$M_{\mathbf{Z}^*}(\mathbf{t})=E[e^{\mathbf{t}\cdot\mathbf{Z}^*}]=\frac{E[e^{(\mathbf{t}+\mathbf{h})\cdot\mathbf{Z}}]}{E[e^{\mathbf{h}\cdot\mathbf{Z}}]}=\frac{M_{\mathbf{Z}}(\mathbf{t}+\mathbf{h})}{M_{\mathbf{Z}}(\mathbf{h})}.$$

Notice that if $\mathbf{Z}$ is a discrete random variable, we interpret $f$ and $f^*$ as probability mass functions (PMFs).

For presentation simplicity, in the sequel we assume that the goal is to estimate the mean of some random variable $Z$,

$$\theta=E[Z],$$

instead of $E[g(\mathbf{Z})]$ for some function $g$. Doing so does not lead to loss of generality because $Z$ can have any form, such as $Z=g(Z_1,\dots,Z_n)$.

## 2.2. The conditioning method

Suppose that $W$ is a random variable that is correlated with $Z$, that can be simulated efficiently, and for which $E[Z\mid W]$ can be evaluated. Then

$$\hat{\theta}_{CD}=E[Z\mid W]$$

is a more efficient estimator of $\theta$ than the crude estimator because it is unbiased; that is,

$$E[\hat{\theta}_{CD}]=E[E[Z\mid W]]=E[Z],$$

and it has a smaller variance, as shown in the following line:

$$\operatorname{Var}[Z]=\operatorname{Var}[E[Z\mid W]]+E[\operatorname{Var}[Z\mid W]]\ge\operatorname{Var}[E[Z\mid W]].$$
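As a toy illustration (not from the paper), the sketch below conditions on $W=X$ when estimating $P[X+Y>c]$ for independent rate-one exponentials, a case where $E[Z\mid W]$ is available in closed form; it returns the sample mean and variance of both the crude and the conditional estimators.

```python
import math
import random

def conditioning_demo(n=10000, c=4.0, seed=1):
    # Target: theta = P[X + Y > c] with X, Y independent Exponential(1).
    # Crude estimator: Z = I(X + Y > c).  Conditioning on W = X gives
    # E[Z | W] = P[Y > c - X] = exp(-max(c - X, 0)), a computable function,
    # so the randomness in Y is integrated out exactly.
    rng = random.Random(seed)
    crude, cond = [], []
    for _ in range(n):
        x, y = rng.expovariate(1.0), rng.expovariate(1.0)
        crude.append(1.0 if x + y > c else 0.0)
        cond.append(math.exp(-max(c - x, 0.0)))

    def mean_var(v):
        m = sum(v) / len(v)
        return m, sum((t - m) ** 2 for t in v) / len(v)

    return mean_var(crude), mean_var(cond)
```

In line with the variance decomposition above, the conditional estimator's sample variance comes out markedly below the crude one.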

## 2.3. Stratified sampling

Stratified sampling resembles the conditioning method in the sense that a random variable $W$ can help simulate the mean of a random variable $Z$. Suppose that $W$ takes values in strata $\mathcal{W}_1,\dots,\mathcal{W}_k$ with probabilities $p_i=P[W\in\mathcal{W}_i]$ for $i=1,\dots,k$. Then

$$E[Z]=\sum_{i=1}^{k}E[Z\mid W\in\mathcal{W}_i]p_i.$$

Suppose that $n$ is the total number of samples and $n_i$ is the number of samples from stratum $i$. Then the stratified estimate of $\theta$ is

$$\hat{\theta}_{S,n}=\sum_{i=1}^{k}\bar{Z}_ip_i,$$

where $\bar{Z}_i$, for $i=1,\dots,k$, is the sample average of the $Z$s conditional on $W\in\mathcal{W}_i$. Let $\sigma_i^2=\operatorname{Var}[Z\mid W\in\mathcal{W}_i]$; then the variance of $\hat{\theta}_{S,n}$ is

$$\operatorname{Var}[\hat{\theta}_{S,n}]=\sum_{i=1}^{k}\frac{p_i^2\sigma_i^2}{n_i}.$$

If we choose $n_i=np_i$, then $\hat{\theta}_{S,n}$ must have a smaller variance than the crude simulation estimator because

$$\operatorname{Var}[\hat{\theta}_{S,n}]=\frac{1}{n}\sum_{i=1}^{k}p_i\sigma_i^2=\frac{1}{n}E[\operatorname{Var}[Z\mid W]]\le\frac{1}{n}\operatorname{Var}[Z]=\operatorname{Var}[\hat{\theta}_{0,n}].$$

It is worthwhile to note that $n_i=np_i$ is not necessarily the optimal number of simulations in stratum $i$. Particularly, if $n_i$ is chosen to be proportional to $p_i\sigma_i$ (Fishman 1995), then $\operatorname{Var}[\hat{\theta}_{S,n}]$ is minimized.

## 2.4. The control variates method

When using a control variable to reduce the variance of the estimator of $\theta=E[Z]$, we select a control variate $W$ that is strongly positively or negatively correlated with $Z$. Then an unbiased estimator for $\theta$ is

$$\hat{\theta}_{CV}=Z-\gamma(W-E[W])$$

for some constant $\gamma$. Then we have

$$\operatorname{Var}[\hat{\theta}_{CV}]=\operatorname{Var}[Z]+\gamma^2\operatorname{Var}[W]-2\gamma\operatorname{Cov}(Z,W),$$

which is minimized if $\gamma=\operatorname{Cov}(Z,W)/\operatorname{Var}[W]$. Note that the parameter $\gamma$ can be estimated using the simulated values of $Z$ and $W$. In fact, it is the least squares estimate of the slope of the simple linear regression

$$Z=\gamma_0+\gamma_1W+\epsilon.$$

Obviously, there are many other general-purpose simulation variance reduction methods in the literature. We have only introduced the four that we use for simulating the compound random variables in this paper. In the following sections, we illustrate how these methods can be used in combination to reduce the simulation variance even further.

# 3. Simulating tail probability

In this section, we apply the four variance reduction methods to the estimation of the tail probability $\theta=P[S>c]$, where the compound random variable $S$ is defined in (1.1). The one-sample crude estimator is simply $\hat{\theta}_0=I(S>c)$; the $n$-sample estimator is $\hat{\theta}_{0,n}$ defined in (1.2).

The problem with the crude method is that when $c$ is large, the sample size needs to be very large in order for $S>c$ to occur. Therefore, as shown in equation (1.3), the crude estimator is not efficient.

## 3.1. Importance sampling

We assume that $S$ has a finite MGF $M_S(t)=E[e^{tS}]$ for $t$ in an interval containing the tilting parameter.^{[1]}

In importance sampling, instead of sampling $S$, we sample $S^*$, where $S^*$ has PDF $f^*$. Since we assume that $S$ has a finite MGF, we may choose $f^*$ to be the Esscher transform (exponentially tilted transform) of the PDF $f$ of $S$. That is, we let $S^*$ have PDF

$$f^*(x)=\frac{e^{hx}f(x)}{M_S(h)}$$

for some tilting parameter $h$. Then we have

$$\theta=E[I(S>c)]=E\left[I(S^*>c)\frac{M_S(h)}{e^{hS^*}}\right]. \tag{3.1}$$

Hence, the importance sampling estimator of $\theta$ is given by

$$\hat{\theta}_I=I(S^*>c)M_S(h)\exp(-hS^*).$$

If the sample size is $n$, then we have

$$\hat{\theta}_{I,n}=\frac{M_S(h)}{n}\sum_{j=1}^{n}I\left(S^{*(j)}>c\right)\exp\left(-hS^{*(j)}\right),$$

where $S^{*(j)}$ is the $j$th simulated value of $S^*$.

In order to apply the importance sampling method with exponential tilting, we need to (1) select an appropriate value for the tilting parameter $h$ and (2) simulate $S^*$ with the tilted distribution.

As mentioned in Section 1, the problem with the crude simulation method is that when $\theta$ is small, $I(S>c)$ is zero for most rounds of simulation, which results in a large value of the CoV of $\hat{\theta}_{0,n}$. In importance sampling, we choose $h$ such that $S^*>c$ is more likely to occur than $S>c$. An intuitively natural choice is the value of $h$ for which $E[S^*]=c$. This choice is indeed optimal, as the following note shows.

**Note 3.1.** *As pointed out by Ross (2012),*

$$\hat{\theta}_I=I(S^*>c)M_S(h)e^{-hS^*}\le M_S(h)e^{-hc}.$$

*Thus, one can stabilize the results of every round of simulation by choosing the value of $h$ to minimize the upper bound $M_S(h)e^{-hc}$. Setting the derivative of it to zero, we get*

$$M_S'(h)-cM_S(h)=0.$$

*Thus, the optimal $h$ should satisfy*

$$c=\frac{M_S'(h)}{M_S(h)}=E\left[\frac{Se^{hS}}{M_S(h)}\right]=E[S^*].$$
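For a concrete case, the equation $E[S^*]=c$ can be solved numerically. The sketch below assumes, hypothetically, a compound Poisson–exponential model, for which $M_S'(h)/M_S(h)=\lambda\beta/(\beta-h)^2$ is increasing in $h$, and uses bisection; it requires $c>E[S]=\lambda/\beta$.

```python
def tilt_parameter(c, lam=5.0, beta=1.0, tol=1e-10):
    # Compound Poisson(lam) with Exponential(beta) claims (hypothetical):
    # M_S(t) = exp(lam * (M_X(t) - 1)) with M_X(t) = beta / (beta - t), so
    # E[S*] = M_S'(h) / M_S(h) = lam * beta / (beta - h)^2, increasing in h.
    # Solve E[S*] = c for h in (0, beta) by bisection.
    mean_tilted = lambda h: lam * beta / (beta - h) ** 2
    lo, hi = 0.0, beta - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_tilted(mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For these parameters the root is also available in closed form, $h=\beta-\sqrt{\lambda\beta/c}$, which the bisection reproduces; for models without a closed form, the same bisection (or Newton's method) applies.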

We next study the distribution of $S^*$ (in order to simulate it). The result is stated in the following note.

**Note 3.2.** *The Esscher transform of $S$ with parameter $h$ is*

$$S^*=\sum_{i=1}^{M^*}X_i^*,$$

*where $M^*$ is the Esscher transform of $M$ with parameter $\ln M_X(h)$ and $X_i^*$ is the Esscher transform of $X_i$ with parameter $h$.*

*Proof.* Firstly, the MGF of $S^*$ is

$$M_{S^*}(t)=\frac{M_S(t+h)}{M_S(h)}=\frac{P_M(M_X(t+h))}{P_M(M_X(h))}=\frac{P_M(M_X(h)M_{X^*}(t))}{P_M(M_X(h))}, \tag{3.2}$$

where $P_M$ represents the probability generating function (PGF) of $M$ and $X^*$ is the Esscher transform of $X$ with parameter $h$, so that $M_{X^*}(t)=M_X(t+h)/M_X(h)$.

Let $c(h)=\ln M_X(h)$, and let $M^*$ be the Esscher transform of $M$ with parameter $c(h)$. Then the PGF of $M^*$ is

$$P_{M^*}(z)=\frac{E[e^{c(h)M}z^M]}{E[e^{c(h)M}]}=\frac{P_M(e^{c(h)}z)}{P_M(e^{c(h)})}=\frac{P_M(M_X(h)z)}{P_M(M_X(h))}. \tag{3.3}$$

Comparing (3.2) and (3.3) leads to

$$M_{S^*}(t)=P_{M^*}(M_{X^*}(t)),$$

which completes the proof. $\square$

Note 3.2 suggests that $S^*$ is a compound sum of $M^*$ and $X^*$. Therefore, it can be easily simulated if the distributions of the latter two are known. This is actually the case for many commonly used distributions; the following are Esscher transforms (with parameter $h$) of some commonly used distributions.

- If $M$ is binomial with the PMF $p_M(k)=\binom{n}{k}p^k(1-p)^{n-k}$, $k=0,1,2,\dots,n$, then $M^*$ is binomial with the PMF $p_{M^*}(k)=\binom{n}{k}(p^*)^k(1-p^*)^{n-k}$, where $p^*=\dfrac{pe^h}{1-p+pe^h}$.
- If $M$ is negative binomial with the PMF $p_M(k)=\binom{k+r-1}{k}(1-p)^rp^k$, $k=0,1,2,\dots$, then $M^*$ is negative binomial with the PMF $p_{M^*}(k)=\binom{k+r-1}{k}(1-pe^h)^r(pe^h)^k$, provided $pe^h<1$.
- If $M$ is Poisson with the PMF $p_M(k)=\dfrac{\lambda^k}{k!}e^{-\lambda}$, $k=0,1,2,\dots$, then $M^*$ is Poisson with the PMF $p_{M^*}(k)=\dfrac{(\lambda e^h)^k}{k!}e^{-\lambda e^h}$.
- If $X$ is gamma with the PDF $f_X(x)=\beta^\alpha x^{\alpha-1}e^{-\beta x}/\Gamma(\alpha)$, $x>0$, then $X^*$ is gamma with the PDF $f_{X^*}(x)=(\beta-h)^\alpha x^{\alpha-1}e^{-(\beta-h)x}/\Gamma(\alpha)$, provided $h<\beta$.

**Example 3.1.** *Assume that $M$ is Poisson with mean $\lambda$ and the $X_i$ have a common gamma distribution with parameters $\alpha$ and $\beta$. Then*

$$S^*=\sum_{i=1}^{M^*}X_i^*,$$

*where $M^*$ is Poisson with mean $\lambda M_X(h)$ and the $X_i^*$ are gamma with parameters $\alpha$ and $\beta-h$.*
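The following sketch implements the resulting importance sampling estimator $\hat{\theta}_I$ for a hypothetical special case with $\alpha=1$ (exponential claim sizes), where the $h$ solving $E[S^*]=c$ has the closed form $\beta-\sqrt{\lambda\beta/c}$; the parameter values are illustrative only.

```python
import math
import random

def poisson_sample(rng, lam):
    # Knuth's multiplicative method; adequate for the means used here.
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def importance_tail_estimate(n, c, lam=5.0, beta=1.0, seed=1):
    # M ~ Poisson(lam), X ~ Exponential(beta).  With tilting parameter h,
    # M* ~ Poisson(lam * M_X(h)) and X* ~ Exponential(beta - h), and each
    # run contributes I(S* > c) * M_S(h) * exp(-h * S*).
    h = beta - math.sqrt(lam * beta / c)    # solves E[S*] = c; needs c > lam/beta
    mx_h = beta / (beta - h)                # M_X(h)
    ms_h = math.exp(lam * (mx_h - 1.0))     # M_S(h) for a compound Poisson sum
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        m_star = poisson_sample(rng, lam * mx_h)
        s_star = sum(rng.expovariate(beta - h) for _ in range(m_star))
        if s_star > c:
            total += ms_h * math.exp(-h * s_star)
    return total / n
```

Because each nonzero contribution is bounded by $M_S(h)e^{-hc}$, the estimator's CoV stays small even deep in the tail, in contrast with the crude estimator.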

## 3.2. Combining importance sampling and stratified sampling

A method for applying stratified sampling in simulating quantities related to compound distributions was introduced by Ross (2012, Section 9.5.3). This is done by treating the claim number $M$ as a stratifying variable. Specifically, in order to simulate

$$\theta=E[g_M(X_1,\dots,X_M)],$$

we choose a number $m$ such that $P[M>m]$ is small and make use of the fact that

$$\theta=\sum_{n=0}^{m}E[g_n(X_1,\dots,X_n)]p_n+E[g_M(X_1,\dots,X_M)\mid M>m]\left(1-\sum_{n=0}^{m}p_n\right),$$

where $p_n=P[M=n]$.

This method can be applied to estimate the tail probability $\theta=P[S>c]$ if we let, for $m=0,1,2,\dots$,

$$g_m(X_1,\dots,X_m)=I(X_1+\cdots+X_m>c).$$

However, directly applying the method is not efficient because when $c$ is large and $\theta$ is small, the function $g_m$ is likely to be zero on most strata, and more so on strata with small values of $m$.

Therefore, we propose here a method to combine importance sampling and stratified sampling. For this purpose, we let $M^*$ be the Esscher transform of $M$ with parameter $\ln M_X(h)$ and $X_i^*$ the Esscher transform of $X_i$ with parameter $h$, as in Note 3.2. Define

$$g_m^*(x_1,\dots,x_m)=M_S(h)I\left(\sum_{i=1}^{m}x_i>c\right)\exp\left(-h\sum_{i=1}^{m}x_i\right) \tag{3.4}$$

and

$$p_m^*=P[M^*=m],\quad m=1,2,\dots.$$

Then we have, for a given value of $m_l$ such that $P[M^*>m_l]$ is small,

$$\begin{aligned}
\theta&=E[I(S>c)]=E\left[I(S^*>c)\frac{M_S(h)}{e^{hS^*}}\right]=E\left[g^*_{M^*}(X^*_1,\dots,X^*_{M^*})\right]\\
&=\sum_{m=0}^{m_l}E\left[g^*_{M^*}(X^*_1,\dots,X^*_{M^*})\mid M^*=m\right]p^*_m+E\left[g^*_{M^*}(X^*_1,\dots,X^*_{M^*})\mid M^*>m_l\right]P[M^*>m_l]\\
&=\sum_{m=0}^{m_l}E\left[g^*_m(X^*_1,\dots,X^*_m)\right]p^*_m+E\left[g^*_{M^*}(X^*_1,\dots,X^*_{M^*})\mid M^*>m_l\right]\left(1-\sum_{m=0}^{m_l}p^*_m\right).
\end{aligned}$$

With this, we can follow the procedure from Ross (2012) to simulate $\theta$. Firstly, we generate a value of $M^*$ conditional on it exceeding $m_l$; suppose the generated value is $m'$. Then we generate $m'$ independent random variables $X^*_1,\dots,X^*_{m'}$. The estimator of $\theta$ from this run is

$$E=\sum_{m=1}^{m_l}g^*_m(X^*_1,\dots,X^*_m)p^*_m+g^*_{m'}(X^*_1,\dots,X^*_{m'})\left(1-\sum_{m=0}^{m_l}p^*_m\right). \tag{3.5}$$

As pointed out by Ross (2012), since it is easy to compute the functions $g^*_m$, we can use the data in reverse order to obtain a second estimator and then average the two estimators. That is, we let

$$E'=\sum_{m=1}^{m_l}g^*_m(X^*_{m'},\dots,X^*_{m'-m+1})p^*_m+g^*_{m'}(X^*_{m'},\dots,X^*_1)\left(1-\sum_{m=0}^{m_l}p^*_m\right),$$

then use

$$\hat{\theta}_{I+S}=\frac{1}{2}(E+E')$$

as the estimator for $\theta$.

**Remark 3.1.** *By equation (3.4), the second term in equation (3.5) satisfies*

$$g^*_{m'}(X^*_1,\dots,X^*_{m'})\left(1-\sum_{m=0}^{m_l}p^*_m\right)\le M_S(h)P[M^*>m_l].$$

*Thus, if we select a value for $m_l$ sufficiently large that $P[M^*>m_l]$ is negligible, this term can be omitted.*
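One possible implementation of the combined I+S estimator is sketched below, again for a hypothetical compound Poisson–exponential model; following Remark 3.1, $m_l$ is taken large enough that the second term of the estimator can be dropped.

```python
import math
import random

def strat_is_tail_estimate(n, c, lam=5.0, beta=1.0, m_l=60, seed=1):
    # Hypothetical model: M ~ Poisson(lam), X ~ Exponential(beta), with the
    # same exponential tilting as in Section 3.1.  With m_l large,
    # P[M* > m_l] is negligible, so only the stratified sum over
    # m = 1..m_l is kept; one path of X*'s is reused across all strata.
    h = beta - math.sqrt(lam * beta / c)
    mx_h = beta / (beta - h)
    ms_h = math.exp(lam * (mx_h - 1.0))
    lam_star = lam * mx_h                    # M* ~ Poisson(lam_star)
    p_star = [math.exp(-lam_star) * lam_star ** m / math.factorial(m)
              for m in range(m_l + 1)]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s, run = 0.0, 0.0
        for m in range(1, m_l + 1):
            s += rng.expovariate(beta - h)   # partial sum X*_1 + ... + X*_m
            if s > c:                        # otherwise g*_m is zero
                run += ms_h * math.exp(-h * s) * p_star[m]
        total += run
    return total / n
```

Averaging the strata over one shared path is what makes the per-run cost modest; the reverse-order averaging of $E$ and $E'$ described above would further reduce the variance.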

## 3.3. The conditioning method

Peköz and Ross (2004) introduced a very effective way of simulating the tail probability of compound distributions based on the conditioning method. Let

$$T(c)=\min\left(m:\sum_{i=1}^{m}X_i>c\right). \tag{3.6}$$

Then, we have

$$S>c\iff M\ge T(c). \tag{3.7}$$

Equation (3.7) agrees with proposition 1.1 of Lin (1996).

Therefore,

$$E[I(S>c)]=E[E[I(S>c)\mid T(c)]]=E[E[I(M\ge T(c))\mid T(c)]].$$

Thus, an estimator for $\theta$ applying the conditioning method is given by

$$\hat{\theta}_{CD}=E[I(M\ge T(c))\mid T(c)]=P[M\ge T(c)\mid T(c)]=P[M\ge T(c)],$$

where the last equality holds because $M$ and the $X_i$ (and hence $T(c)$) are independent.

To implement the method, we generate samples of $X_i$ in sequence until the sum of the generated values exceeds $c$. If the generated value of $T(c)$ is $t$, then $P[M\ge t]$ is the estimate of $\theta$ for this run of simulation.

## 3.4. Combining the conditioning method and the control variates method

As shown by Peköz and Ross (2004), the estimator $\hat{\theta}_{CD}$ can be further improved by using a control variate. The idea is to select a control variate that is strongly positively or negatively correlated with $\hat{\theta}_{CD}$. Peköz and Ross (2004) suggested that one such choice is

$$W=\sum_{i=1}^{T(c)}(X_i-E[X]),$$

which has a mean of zero.

$W$ is positively correlated with $\hat{\theta}_{CD}$ because when $T(c)$ is large, (1) $\hat{\theta}_{CD}=P[M\ge T(c)]$ is small, and (2) the $X_i$s are likely to be small, so that $W$ will be small.

With this choice, when combining the conditioning method with the control variates method, the estimator for $\theta$ is

$$\hat{\theta}_{CD+CV}=\hat{\theta}_{CD}-\gamma W,$$

where $\gamma=\operatorname{Cov}(\hat{\theta}_{CD},W)/\operatorname{Var}[W]$, which may be estimated using the simulated values of $\hat{\theta}_{CD}$ and $W$.

## 3.5. Combining importance sampling and the conditioning method

In the conditioning method, we need to simulate $T(c)$, defined in equation (3.6), which may be time consuming if $c$ is large relative to the $X_i$s. In this section, we introduce a method to improve this. The key is to replace $T(c)$ with

$$T^*(c)=\min\left(m:\sum_{i=1}^{m}X_i^*>c\right),$$

where $X_i^*$ is the Esscher transform of $X_i$ with tilting parameter $h$. This way, we combine the conditioning and importance sampling methods. The following estimator of $\theta$ is obtained:

$$\hat{\theta}_{I+CD}=P[M\ge T^*(c)]\,\frac{M_X(h)^{T^*(c)}}{e^{h\sum_{i=1}^{T^*(c)}X_i^*}}. \tag{3.8}$$

The unbiasedness of this estimator is shown in Section A.1 of the Appendix.

To carry out the simulation, we generate samples of $X_i^*$ in sequence until the sum of the generated values exceeds $c$. The values of $T^*(c)$ and $\sum_{i=1}^{T^*(c)}X_i^*$ are stored and used in (3.8).

**Remark 3.2.** *A control variate*

$$W^*=\sum_{i=1}^{T^*(c)}(X_i^*-E[X^*])$$

*can be used to improve the estimator $\hat{\theta}_{I+CD}$. This results in*

$$\hat{\theta}_{I+CD+CV}=\hat{\theta}_{I+CD}-\gamma W^*,$$

*where $\gamma=\operatorname{Cov}(\hat{\theta}_{I+CD},W^*)/\operatorname{Var}[W^*]$.*
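A sketch of the combined I+CD estimator of Section 3.5, for a hypothetical compound Poisson–exponential model (exponential claim sizes keep the tilted claim distribution exponential as well):

```python
import math
import random

def is_cond_tail_estimate(n, c, lam=5.0, beta=1.0, seed=1):
    # Hypothetical model: M ~ Poisson(lam), X ~ Exponential(beta), so
    # X* ~ Exponential(beta - h).  Each run draws X*_i until the sum
    # exceeds c, then contributes
    #   P[M >= T*] * M_X(h)^(T*) * exp(-h * sum of the X*_i),
    # as in (3.8).
    h = beta - math.sqrt(lam * beta / c)   # E[S*] = c, as in Section 3.1
    mx_h = beta / (beta - h)               # M_X(h)
    rng = random.Random(seed)

    def pois_tail(t):
        # P[M >= t] for M ~ Poisson(lam)
        return 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j)
                         for j in range(t))

    total = 0.0
    for _ in range(n):
        s, t = 0.0, 0
        while s <= c:
            s += rng.expovariate(beta - h)
            t += 1
        total += pois_tail(t) * mx_h ** t * math.exp(-h * s)
    return total / n
```

Because the tilted claims are larger on average, each run reaches the threshold $c$ after far fewer draws than the plain conditioning method would need.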

## 3.6. Numerical experiments

In this section, we compare the different estimators of $\theta$ introduced in Section 3 through a numerical example with specified distributions for $M$ and the $X_i$. Note that the distributional assumptions are entirely hypothetical and only for illustration purposes. Naturally, in actual applications, the aggregate loss model and the parameters need to be estimated from data.

For each method, the tail probabilities at several values of $c$ are estimated using 1000 simulated samples (one round). The standard deviation (SD) of each estimator is calculated based on 100 rounds of simulations. We will follow this convention in all the numerical examples in the rest of the paper. In addition, for estimator 3, in which stratified sampling is applied, we set the stratification bound $m_l$ accordingly.

The simulation results are summarized in Table 3.1. A visual comparison of the CoV of different estimators is given in Figure 3.1.

To provide a reference point for comparing the estimators, we calculate the analytical result of the tail probability by

$$P(S>c)=\sum_{m=1}^{\infty}P(M=m)P(X_1+\cdots+X_m>c).$$

This formula is calculable for this simple example because the distribution of $X_1+\cdots+X_m$ is available in closed form, so the probabilities $P(X_1+\cdots+X_m>c)$ can be computed easily for any $m$. For the infinite sum, we can truncate it at some value $m_u$ such that $P[M>m_u]$ is small enough; in the example, $m_u$ is selected so that the truncation error is negligible relative to the reported decimal points.

In addition, for comparison, we provide the normal approximation of the tail probability, which is calculated by

$$P(S>c)\simeq 1-\Phi\left(\frac{c-E[S]}{\sqrt{\operatorname{Var}(S)}}\right),$$

where $\Phi$ stands for the standard normal cumulative distribution function. We can see that the normal approximation is not very accurate even for this simple example, especially when $c$ is large.

As explained previously, estimators with smaller CoVs have smaller relative errors and, therefore, are better. Table 3.1 indicates that the combinations of importance sampling and stratified sampling (I+S); the conditioning method and the control variates method (CD+CV); importance sampling and the conditioning method (I+CD); and importance sampling, the conditioning method, and the control variates method (I+CD+CV) performed well. Methods involving importance sampling tend to have a small CoV when $c$ is large. Therefore, they are recommended for simulating small tail probabilities.
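For a compound Poisson model with exponential claim sizes (a hypothetical stand-in for the example's model), both the truncated analytical sum and the normal approximation can be computed as follows; the closed-form Erlang survival function supplies $P(X_1+\cdots+X_m>c)$.

```python
import math

def compound_poisson_exp_tail(c, lam=5.0, beta=1.0, m_max=150):
    # P(S > c) = sum_m P(M = m) P(X_1 + ... + X_m > c).  For Exponential(beta)
    # claims, X_1 + ... + X_m ~ Erlang(m, beta), whose survival function is
    # exp(-beta*c) * sum_{j=0}^{m-1} (beta*c)^j / j!.  Both the Poisson PMF
    # and the Erlang terms are updated iteratively to avoid huge factorials.
    bc = beta * c
    erlang_term = math.exp(-bc)   # j = 0 term of the Erlang survival sum
    erlang_tail = 0.0
    pm = math.exp(-lam)           # P[M = 0]
    theta = 0.0
    for m in range(1, m_max + 1):
        erlang_tail += erlang_term        # now equals P(Erlang(m, beta) > c)
        erlang_term *= bc / m
        pm *= lam / m                     # now equals P[M = m]
        theta += pm * erlang_tail
    return theta

def normal_approx_tail(c, lam=5.0, beta=1.0):
    # E[S] = lam/beta and Var(S) = lam * E[X^2] = 2 * lam / beta^2.
    mean, sd = lam / beta, math.sqrt(2.0 * lam) / beta
    return 0.5 * math.erfc((c - mean) / (sd * math.sqrt(2.0)))
```

Comparing the two deep in the tail shows the normal approximation falling well below the true value, consistent with the inaccuracy noted above.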

# 4. Simulating mean excess loss

In this section, we introduce variance reduction methods for simulating the mean excess loss

$$\tau=E[(S-c)_+].$$

## 4.1. Combining importance sampling and stratified sampling

The importance sampling method for simulating $\tau$ is similar to that for $\theta$: we simply replace $I(S^*>c)$ with $(S^*-c)_+$ in equation (3.1). This yields

$$\hat{\tau}_I=(S^*-c)_+M_S(h)\exp(-hS^*).$$

To combine importance and stratified sampling, we replace the function $g_m^*$ in (3.4) with

$$g_m^*(x_1,\dots,x_m)=\left(\sum_{i=1}^{m}x_i-c\right)_+M_S(h)e^{-h\sum_{i=1}^{m}x_i}.$$

## 4.2. The conditioning method

The following method for simulating the mean excess loss using the conditioning method was discussed by Peköz and Ross (2004). Define

$$A=\sum_{i=1}^{T(c)}X_i-c.$$

Then the conditional expectation estimator is constructed as

$$\begin{aligned}
\hat{\tau}_{CD}&=E[(S-c)_+\mid T(c),A]\\
&=\sum_{i\ge T(c)}\left(A+(i-T(c))E[X]\right)P[M=i]\\
&=(A-T(c)E[X])P[M\ge T(c)]+E[X]E[MI(M\ge T(c))]\\
&=(A-T(c)E[X])P[M\ge T(c)]+E[X]\left(E[M]-\sum_{i<T(c)}iP[M=i]\right).
\end{aligned}$$
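The estimator above can be sketched as follows for a hypothetical compound Poisson–exponential model; each run draws claims until the running sum exceeds $c$ and then evaluates the conditional expectation exactly.

```python
import math
import random

def cond_mean_excess_estimate(n, c, lam=5.0, beta=1.0, seed=1):
    # Hypothetical model: M ~ Poisson(lam), X ~ Exponential(beta), so
    # E[X] = 1/beta and E[M] = lam.  Per run: draw X_i until the sum
    # exceeds c, record T(c) and A, then evaluate the last line of the
    # conditional expectation above.
    rng = random.Random(seed)
    ex, em = 1.0 / beta, lam
    pmf = lambda i: math.exp(-lam) * lam ** i / math.factorial(i)
    total = 0.0
    for _ in range(n):
        s, t = 0.0, 0
        while s <= c:
            s += rng.expovariate(beta)
            t += 1
        a = s - c
        p_ge = 1.0 - sum(pmf(i) for i in range(t))      # P[M >= T(c)]
        partial = sum(i * pmf(i) for i in range(t))     # sum_{i<T(c)} i P[M=i]
        total += (a - t * ex) * p_ge + ex * (em - partial)
    return total / n
```

A quick sanity check: at $c=0$ every run has $T(c)=1$ and the estimator's mean reduces to $E[S]=\lambda/\beta$.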

The steps for implementing this simulation method are similar to those in Section 3.3; the only difference is that the value of $A$ must be recorded in addition to $T(c)$.

## 4.3. Combining the conditioning method and the control variates method

The following two control variates can be used to enhance the performance of the conditioning method without increasing computational efforts:

$$W_1=\sum_{i=1}^{T(c)}(X_i-E[X])$$

and

$$W_2=A-E[A].$$

This yields the estimator

$$\hat{\tau}_{CD+CV}=\hat{\tau}_{CD}-\gamma_1W_1-\gamma_2W_2,$$

where $\gamma_1$ and $\gamma_2$ are chosen to minimize the variance of $\hat{\tau}_{CD+CV}$. This can be achieved by setting the values of $\gamma_1$ and $\gamma_2$ to the least squares estimates of the corresponding coefficients of the linear regression

$$\hat{\tau}_{CD}=\gamma_0+\gamma_1W_1+\gamma_2W_2+\epsilon,$$

where the values of $\hat{\tau}_{CD}$, $W_1$, and $W_2$ are generated in simulation using the conditioning method.

For general discussions on least squares or regression-based methods for using control variates, see the papers by Lavenberg and Welch (1981) and Davidson and MacKinnon (1992).

## 4.4. Combining importance sampling and the conditioning method

Similar to Section 3.5, importance sampling and the conditioning method can be combined. This results in the estimator

$$\hat{\tau}_{I+CD}=\Big[(A^*-T^*(c)E[X])P[M\ge T^*(c)]+E[X]E[MI(M\ge T^*(c))]\Big]\frac{M_X(h)^{T^*(c)}}{e^{h\sum_{i=1}^{T^*(c)}X_i^*}},$$

where

$$A^*=\sum_{i=1}^{T^*(c)}X_i^*-c.$$

The proof of the unbiasedness of this estimator is given in Section A.2 of the Appendix.

**Remark 4.1.** *The quantity $E[MI(M\ge T^*(c))]$ is related to the size-biased transform (see Denuit 2020). We have*

$$E[MI(M\ge T^*(c))]=E[M]P[\tilde{M}\ge T^*(c)], \tag{4.1}$$

*where $\tilde{M}$ is the size-biased transform of $M$, with distribution function*

$$P[\tilde{M}=k]=\frac{kP[M=k]}{E[M]},\quad k=0,1,\dots.$$

*Equation (4.1) can be calculated efficiently because the distributions of $M$ and $\tilde{M}$ are often closely related. For example, as shown by Ren (2021), if $M$ belongs to a familiar counting distribution class, then $\tilde{M}$ is in a related class with transformed parameters. Particularly, if $M$ follows the Poisson distribution with mean $\lambda$, then $\tilde{M}-1$ also follows the Poisson distribution with mean $\lambda$. Therefore, for this case, (4.1) becomes*

$$E[MI(M\ge T^*(c))]=\lambda P[M\ge T^*(c)-1],$$

*which is straightforward to evaluate.*
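The Poisson case of the identity can be checked numerically (a deterministic verification, not a simulation):

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def size_biased_sides(lam=5.0, t=8, n_max=120):
    # For M ~ Poisson(lam), the size-biased transform satisfies
    # P[M_tilde = k] = k P[M = k] / lam = P[M = k - 1], hence
    # E[M I(M >= t)] = lam * P[M >= t - 1].  Compute both sides directly,
    # truncating the sums at n_max, where the terms are negligible.
    lhs = sum(k * poisson_pmf(lam, k) for k in range(t, n_max))
    rhs = lam * sum(poisson_pmf(lam, k) for k in range(t - 1, n_max))
    return lhs, rhs
```

The two sides agree to machine precision, confirming that the size-biased tail probability in (4.1) costs no more than an ordinary Poisson tail.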

**Remark 4.2.** *Control variates can be utilized to improve the results further. For example, let*

$$W_1^*=\sum_{i=1}^{T^*(c)}(X_i^*-E[X^*])$$

*and*

$$W_2^*=A^*-E[A^*].$$

*Then the estimator is*

$$\hat{\tau}_{I+CD+CV}=\hat{\tau}_{I+CD}-\gamma_1W_1^*-\gamma_2W_2^*,$$

*where the values of $\gamma_1$ and $\gamma_2$ can be estimated by running the linear regression*

$$\hat{\tau}_{I+CD}=\gamma_0+\gamma_1W_1^*+\gamma_2W_2^*+\epsilon.$$

## 4.5. Numerical experiments

In this section, we compare the different methods for simulating $\tau$ with the example set forth in Section 3.6. The results are shown in Table 4.1 and Figure 4.1.

In order to provide a reference point for the comparison, we list in the table values of $\tau$ estimated using the importance sampling (I) method, but with a huge sample size. These values are proxies for the "analytical value" and are labeled as "Target" in the table.

For this example, we did not provide the results for the normal approximation of $\tau$ because the approximation would not be accurate. The reason is that the normal approximation results for the tail probabilities are already inaccurate.

Table 4.1 and Figure 4.1 show that the combinations I+S, I+CD, and I+CD+CV have small CoVs and thus are efficient for estimating mean excess losses. In particular, the I+CD+CV method has by far the smallest CoV.

**Remark 4.3.** *When carrying out the importance sampling methods, we have set the value of the tilting parameter to be the same as that for estimating the tail probability; that is, $h$ is such that $E[S^*]=c$. However, this may not be the optimal choice.*

*To explore the optimal value of $h$, we experimented with one case from the previous example. Figure 4.2 plots the SD of the estimator (based on repeating each method 100 times) against the values of $h$. The figure shows that, compared with the tail probability case, it may be preferable to set $h$ to a greater value when estimating mean excess losses. Determining theoretical results for the optimal value of the tilting parameter would be a great topic for future research.*

# 5. Simulating tail probability: The two-dimensional case

In this section, we study methods for simulating the tail probability

$$\theta=P[S_1>c,S_2>d]$$

for the two-dimensional compound variable $(S_1,S_2)$ defined in equation (1.4). The crude estimator is simply

$$\hat{\theta}_0=I(S_1>c,S_2>d).$$

As in the one-dimensional case, this crude estimator has a large CoV and is inefficient. Therefore, we introduce several variance reduction methods to improve it.

## 5.1. Importance sampling

When using importance sampling to simulate $\theta$, instead of sampling $(S_1,S_2)$, we sample $(S_1^*,S_2^*)$ and use the estimator

$$\hat{\theta}_I=I(S_1^*>c,S_2^*>d)\frac{f(S_1^*,S_2^*)}{f^*(S_1^*,S_2^*)}, \tag{5.1}$$

where $(S_1^*,S_2^*)$ has the joint PDF $f^*$. We assume that $(S_1,S_2)$ has a finite joint moment generating function and choose $f^*$ to be the Esscher transform of $f$. That is,

$$f^*(x_1,x_2)=\frac{f(x_1,x_2)e^{h_1x_1+h_2x_2}}{M_{S_1,S_2}(h_1,h_2)},$$

where $M_{S_1,S_2}$ is the joint moment generating function of $(S_1,S_2)$. With this, (5.1) becomes

$$\hat{\theta}_I=I(S_1^*>c,S_2^*>d)\frac{M_{S_1,S_2}(h_1,h_2)}{e^{h_1S_1^*+h_2S_2^*}}.$$

As in Section 3.1, we need to determine the values of the tilting parameters $h_1$ and $h_2$ and a method to simulate $(S_1^*,S_2^*)$ in order to apply importance sampling.

**Note 5.1.** *Similar to the one-dimensional case, we have*

$$\hat{\theta}_I=I(S_1^*>c,S_2^*>d)\frac{M_{S_1,S_2}(h_1,h_2)}{e^{h_1S_1^*+h_2S_2^*}}\le M_{S_1,S_2}(h_1,h_2)e^{-h_1c-h_2d}.$$

*Then, a good choice for $(h_1,h_2)$ is the one that minimizes the upper bound. To this end, taking the derivatives of the upper bound on the right-hand side of the above equation with respect to $h_1$ and $h_2$ and setting them to zero, we have*

$$c=\frac{\frac{\partial}{\partial h_1}M_{S_1,S_2}(h_1,h_2)}{M_{S_1,S_2}(h_1,h_2)}=\frac{E[S_1e^{h_1S_1+h_2S_2}]}{E[e^{h_1S_1+h_2S_2}]}=E[S_1^*] \tag{5.3}$$

*and*

$$d=E[S_2^*]. \tag{5.4}$$

*The values of $h_1$ and $h_2$ can be determined from (5.3) and (5.4).*

The next note provides a representation of the distribution of $(S_1^*,S_2^*)$. The proof of this statement is given in Section A.3 of the Appendix.

**Note 5.2.** *The Esscher transform of $(S_1,S_2)$ with parameter $(h_1,h_2)$ is*

$$(S_1^*,S_2^*)=\left(\sum_{i=1}^{M^*}X_i^*,\ \sum_{j=1}^{N^*}Y_j^*\right),$$

*where $X_i^*$ and $Y_j^*$ are the Esscher transforms of $X_i$ and $Y_j$ with parameters $h_1$ and $h_2$, respectively, and $(M^*,N^*)$ is the Esscher transform of $(M,N)$ with parameter $(\ln M_X(h_1),\ln M_Y(h_2))$.*

In many important special cases, the distributions of $(S_1^*,S_2^*)$ and $(S_1,S_2)$ have similar structures, as the following examples show.

**Example 5.1.** *Let $\Lambda$ be a gamma-distributed common mixture variable with parameters $\alpha$ and $\beta$. Conditional on $\Lambda$, assume claim frequencies $M\sim\text{Poisson}(\lambda_1\Lambda)$ and $N\sim\text{Poisson}(\lambda_2\Lambda)$. Claim sizes $X_i$ are i.i.d. and follow a gamma distribution, and the $Y_j$ are i.i.d. and follow a gamma distribution. With this setup, we have*

$$\begin{aligned}
M_{S_1,S_2}(t_1,t_2)&=E[e^{t_1S_1+t_2S_2}]=E_\Lambda\left[E[e^{t_1S_1+t_2S_2}\mid\Lambda]\right]\\
&=E_\Lambda\left[P_{M\mid\Lambda}(M_X(t_1))\cdot P_{N\mid\Lambda}(M_Y(t_2))\right]\\
&=E_\Lambda\left[e^{\lambda_1\Lambda(M_X(t_1)-1)+\lambda_2\Lambda(M_Y(t_2)-1)}\right]\\
&=M_\Lambda\left(\lambda_1(M_X(t_1)-1)+\lambda_2(M_Y(t_2)-1)\right)\\
&=\left(1-\frac{\lambda_1M_X(t_1)+\lambda_2M_Y(t_2)-\lambda_1-\lambda_2}{\beta}\right)^{-\alpha}.
\end{aligned}$$

*Therefore,*

$$\begin{aligned}
M_{S_1^*,S_2^*}(t_1,t_2)&=\frac{M_{S_1,S_2}(t_1+h_1,t_2+h_2)}{M_{S_1,S_2}(h_1,h_2)}\\
&=\left(1-\frac{\lambda_1M_X(t_1+h_1)+\lambda_2M_Y(t_2+h_2)-\lambda_1M_X(h_1)-\lambda_2M_Y(h_2)}{\beta+\lambda_1+\lambda_2-\lambda_1M_X(h_1)-\lambda_2M_Y(h_2)}\right)^{-\alpha}\\
&=\left(1-\frac{\lambda_1M_X(h_1)\left(\frac{M_X(t_1+h_1)}{M_X(h_1)}-1\right)+\lambda_2M_Y(h_2)\left(\frac{M_Y(t_2+h_2)}{M_Y(h_2)}-1\right)}{\beta+\lambda_1+\lambda_2-\lambda_1M_X(h_1)-\lambda_2M_Y(h_2)}\right)^{-\alpha},
\end{aligned}$$

*which implies that $(S_1^*,S_2^*)$ has the representation*

$$(S_1^*,S_2^*)=\left(\sum_{i=1}^{M^*}X_i^*,\ \sum_{j=1}^{N^*}Y_j^*\right),$$

*where $\Lambda^*$ is gamma with parameters $\alpha$ and $\beta+\lambda_1+\lambda_2-\lambda_1M_X(h_1)-\lambda_2M_Y(h_2)$ and, conditional on $\Lambda^*$, $M^*\sim\text{Poisson}(\lambda_1M_X(h_1)\Lambda^*)$ and $N^*\sim\text{Poisson}(\lambda_2M_Y(h_2)\Lambda^*)$. For the claim sizes, $X_i^*$ and $Y_j^*$ are the Esscher transforms of $X_i$ and $Y_j$ with parameters $h_1$ and $h_2$, respectively.*

*This result means that $(S_1^*,S_2^*)$ is still a bivariate compound Poisson variable with a common gamma mixture and can easily be simulated.*

**Example 5.2.** *Suppose that the claim frequencies $M$ and $N$ have an additive common shock. That is, $M=M_1+M_0$ and $N=M_2+M_0$, where $M_0$, $M_1$, and $M_2$ are independent. Let $a=\ln M_X(h_1)$ and $b=\ln M_Y(h_2)$; then*

$$(S_1^*,S_2^*)=\left(\sum_{i=1}^{M^*}X_i^*,\ \sum_{j=1}^{N^*}Y_j^*\right),$$

*where $M^*=M_1^*+M_0^*$ and $N^*=M_2^*+M_0^*$, and where $M_1^*$ and $M_2^*$ are the Esscher transforms of $M_1$ and $M_2$ with parameters $a$ and $b$, respectively. In addition, $M_0^*$ is the Esscher transform of $M_0$ with parameter $a+b$, and $X_i^*$ and $Y_j^*$ are the Esscher transforms of $X_i$ and $Y_j$ with parameters $h_1$ and $h_2$.*

*This statement can be proved as follows.*

*Firstly,*

$$M_{S_1^*,S_2^*}(t_1,t_2)=\frac{P_{M,N}(e^aM_{X^*}(t_1),e^bM_{Y^*}(t_2))}{P_{M,N}(e^a,e^b)}=\frac{P_{M_1}(e^aM_{X^*}(t_1))}{P_{M_1}(e^a)}\cdot\frac{P_{M_2}(e^bM_{Y^*}(t_2))}{P_{M_2}(e^b)}\cdot\frac{P_{M_0}(e^{a+b}M_{X^*}(t_1)M_{Y^*}(t_2))}{P_{M_0}(e^{a+b})}.$$