1. Introduction
The threat of cyber risk is ubiquitous and increasing. The FBI notifies over 3,000 U.S. companies each year, from financial institutions to defense contractors to mega retailers, that they were victims of cyber security breaches (Segal 2016). In a public statement on December 14, 2016, Yahoo’s chief information security officer reported a security breach that is “associated with more than one billion user accounts,” subsequent to a separate security breach report back in September 2016, in which 500 million accounts were affected. According to PwC’s 2014 Global Economic Crime Survey, an astounding 19% of U.S. organizations have claimed losses between $50,000 and $1 million, and 7% of U.S. organizations lost over $1 million due to cybercrime in the previous year. The Center for Strategic and International Studies has estimated the annual cost of cybercrime and economic espionage to the world economy at more than $445 billion, or almost 1 percent of the global GDP.[1]
One important feature of cyber risk is that it is potentially contagious. Analogous to its original meaning of communication of disease from one to another by close contact, the study of contagion risk has long been extended beyond physical proximity in many areas such as financial market contagion (Kodres and Pritsker 2002), where market disturbances spread from one asset or market to another through co-movements in exchange rates, stock prices, sovereign spreads, and capital flows. Applied to the internet, cyber risk is likely contagious, given the increasing interconnectedness of the web-based global economy and the resulting probability of affecting multiple entities. Using information security data on threats to various components of a firm’s information system, Baldwin et al. (2017) demonstrate the presence of contagion between cyber risks against the different critical services (such as email, databases, name and directory servers, website operations, and shared storage) provided by the organization’s systems. In its 2016 Financial Stability Report to Congress, the Office of Financial Research (OFR) of the U.S. Department of the Treasury also states that “cybersecurity incidents” introduced specific risks of contagion due to the complexity of internet systems. The leading technology research firm Gartner forecast about 21 billion connected devices worldwide by 2020, up more than 300 percent from 2015 (Gartner 2015). Every organization in a more connected world is increasingly vulnerable to cyber risks. Examples include the Y2K problem,[2] and contagious attacks from viruses and hackers, such as the WannaCry ransomware attack, motivated by malice, monetary, or political incentives, inflicting physical, financial, and reputational damages.
Not surprisingly, cyber risk and its management have significant implications for the global (re)insurance industry (Gordon, Loeb, and Sohail 2003; Bodin, Gordon, and Loeb 2008). According to the 2016 RIMS Cyber Survey, risk transfer by insurance is the primary risk management method used by organizations worldwide: nearly 70% of the respondents chose to transfer their cyber risks and over 80% of those purchased a standalone cyber insurance policy, a near 30% increase from 2015. The spread of cyber risks across organizations that are interrelated will cause the spread of losses. As greater understanding of cyber risk is being accumulated, the potential systemic risk and risk aggregation caused by cyber contagion is becoming an important concern for (re)insurers because the interconnectedness of cyber risk exposures can be a major impediment to the insurability and market formation of cyber security risks. If contagion is indeed a concern, its impact on cyber insurance product design, actuarial pricing and risk management is paramount for (re)insurance companies. Given the high demand for cyber insurance products, it is essential for the (re)insurance industry to fully understand the nature of the risk exposure.
1.1. Challenges of cyber risk and its management
In the early history of cyber security, typical hackers’ focus was mostly on fame and recognition from the hackers’ circles. That focus has quickly shifted to achieving financial gains or political goals against targeted organizations. Professional and elite hackers often maintain global operations and many belong to well-organized and profit-motivated groups hired and paid to perform illegal hacking (Romanosky, Hoffman, and Acquisti 2014). To achieve these goals, a large variety of sophisticated methods and tactics have been developed to exploit vulnerability in the targets’ cyber systems.
Social engineering and phishing are perhaps the most commonly reported forms of cyber attacks through ostensibly legitimate email attachments, links, software downloads or other operating system vulnerabilities. With a single casual click from the victim, hackers may be able to breach the computer system, evade detection tools, and exploit its vulnerabilities. In this process, malware, spyware and ransomware are often introduced into the target’s system. Malware is an all-encompassing term for a variety of malicious software, including Trojans, viruses, and worms that are created with the intent to steal data or destroy something on the computer. Spyware specializes in tracking keystrokes to get passwords or electronically spying in order to gain unauthorized access to confidential information, sometimes staying undetected for weeks or longer. A worldwide ransomware attack in 2017 employed the WannaCry ransomware to lock down Microsoft Windows operating systems and demanded ransom payments in the Bitcoin cryptocurrency before victims could regain access to their systems. Some hostile cyber attacks do not even require any type of malicious software to run on the system. For instance, hackers may launch brute force attacks using sophisticated algorithms to simply crack the password protection of the target system. Another popular form of attack, the distributed denial of service (DDoS) attack, focuses on overloading the server with high volumes of data in order to disrupt the website or bring down the network.
Due to the heterogeneous, sophisticated, and dynamic nature of cyber risks, it is increasingly challenging for organizations to effectively reduce the risk of being compromised and protect their own cyber integrity (Gordon, Loeb, and Zhou 2011). The extent of cyber losses can range from nuisance damage to catastrophic damage that seriously erodes data integrity, compromises host and client information, and reduces system availability. More importantly, while cyber risk contagion is a very real threat through emails, mobile apps, website operations, operating systems, electronic payment systems, online databases, cloud servers, and shared online storage (Baldwin et al. 2017), the true extent of the contagion risk has yet to be assessed, let alone fully managed. Ignorance of such contagious interrelationships may underestimate the underlying risk, and thus undermine the firm’s value to stakeholders.
Cyber risk contagion also has important implications for cyber risk management through the use of insurance. Very recently, Kwon (2018) examines how the current insurance market has been dealing with cyber risk and concludes that the industry is still in dire need of basic infrastructure support to continue operations in the physical-cyber world of risk. According to an NAIC report,[3] while there is no standard insurance underwriting form for cyber coverage, the available cyber liability policies often include coverage for both contagious and noncontagious risks, which may have very different implications for insurance pricing. We independently reviewed popular cyber insurance contracts provided by insurers such as AIG and Farmers and confirmed such observations. The noncontagious cyber risks are more subject to the law of large numbers and therefore can be priced in a fashion similar to many conventional property and casualty insurance products. However, because of the potential systemic risk and risk aggregation, the contagious cyber risks are more likely to be dependent, and they remain difficult to quantify because actuarial data specifically identifying such risk exposures and proper models of their interdependence are lacking. Therefore, in this paper we attempt to fill this gap by thoroughly examining the presence and the extent of cyber risk contagion and by developing practical modeling tools for assessing and managing it.
1.2. Literature review
Despite its growing importance, cyber risk has been the subject of very limited academic research in the insurance literature. Eling and Schnell (2016) provide an overview of existing literature on cyber risks. They summarize seven core topics for cyber risk and cyber risk insurance, including definition and categorization, costs and consequences, data availability, risk management strategies, contagion and systemic risk nature, and cyber risk modeling. Their study, among others, noted that one particularly important challenge is the lack of cyber risk modeling frameworks that can capture the various unique aspects of cyber risk exposures and facilitate subsequent empirical, practical, and policy discussions.
Most of the existing literature on cyber risk focuses on the economic incentives of self-protection and insurance risk transfer in light of important issues such as moral hazard, adverse selection, and interdependent risks (Hofmann and Ramaj 2011; Öğüt, Raghunathan, and Menon 2010). Böhme and Schwartz (2010) provide a critical survey of existing economic models for cyber insurance, discussing the challenges of a viable insurance market for cyber risks and encouraging further theoretical and empirical research to improve the understanding of this important topic. Existing empirical studies are mostly restricted to the use of aggregate survey data and rely upon conceptual frameworks to identify and organize the sources of operational cyber risk (Mukhopadhyay et al. 2013; Marotta et al. 2015). In particular, Biener, Eling, and Wirfs (2014) study the insurability of cyber risks. In addition to a comprehensive review of cyber risk insurability, they suggest that cyber risk losses differ substantially from other operational risk losses and that more research is needed to better understand cyber risks in order to develop cyber insurance products. Due to the ever-increasing importance of cybersecurity, actuarial societies have also conducted extensive research on cyber threats to businesses and the opportunities and challenges for the insurance market, and have jointly produced a series of essays (Joint Risk Management Section 2017) and practical guidelines to help actuarial professionals consider this issue (see, for example, Solomon 2017; Maxwell 2017; Shang 2017; Dionisi 2017).
A strand of research also discusses the correlated nature of cyber risk exposures. Böhme and Kataria (2006) make use of “honeypots” (data placed on the internet to attract malicious activities) from 2003 to 2005 to provide some evidence of cyber risk contagion. Using data from the SANS Institute, a cooperative research and education organization, on threats to various components of a firm’s information system from 2003 to 2011, Baldwin et al. (2012) develop and estimate a vector equation system of threats to ten important IP services and find strong evidence of cyber risk contagion. Similarly, Wang and Kim (2009) conduct an empirical study of cyber attacks across 62 countries from 2003 to 2007 and find strong evidence for the spatial autocorrelation of cyber attacks across countries over time. Shang (2017) points out that cyber risk is more contagious than traditional operational risk and poses new challenges to the insurance industry. These studies shed light on the interconnected nature of cyber risk exposures and suggest that any cyber risk modeling approach should capture this feature.
Despite these preliminary efforts, there has been little attempt to examine the patterns and implications of cyber risk contagion that are practically relevant for insurance companies. Zurich Insurance Group and Atlantic Council (2014) in their insightful white paper compare cyber risk contagion to the subprime mortgage contagion that prompted the most recent financial crisis in the U.S. As with subprime mortgages, the heavily interconnected nature of cyber risk exposures and the common underlying driving forces make them highly susceptible to the domino effects of failures. Although the recent financial crisis has heightened awareness of risk contagion and promoted abundant academic research on systemic risk in financial institutions (cf. Duan and Wei 2009; Cummins and Weiss 2014), similar research on cyber risk contagion in the insurance industry is scarce. Eling and Pankoke (2016) review extant research on systemic risk in the insurance context from both academia and practitioner organizations. Their study reveals virtually no theoretical or empirical research on cyber risk contagion as a potential source of systemic risk in the insurance industry. This paper adapts methodologies and empirical approaches from the existing literature on data science and financial systemic risks, such as clustering methods and the factor copula method (cf. Billio et al. 2012; Oh and Patton 2017), to develop a framework for modeling and empirically analyzing cyber risk contagion.
1.3. Objective
The aim of this research is to provide the first systematic discussion of cyber risk contagion and to contribute a general framework of contagious cyber risk to the risk management and insurance literature. We propose a model framework that can serve as a stepping-stone for businesses, insurers, regulators, and academics in developing their own models. Specifically, we propose and illustrate a two-step method for modeling cyber risk contagion that is flexible to accommodate specific concerns of the end users. As such, this research can serve as a critical starting component for organizations and (re)insurers to gradually build cyber risks into a broader ERM framework. We also benefit from a unique dataset, the SAS OpRisk Global Data, to analyze cyber risk and empirically examine contagion among cyber attacks.
The remainder of this paper is organized as follows. Section 2 introduces the SAS OpRisk Global Data and describes our data and variables. Section 3 discusses how to refine the dataset for cyber risk contagion analysis. Section 4 builds the empirical method and presents a case study. Model and analysis insights are subsequently discussed. Section 5 concludes the paper and discusses future research.
2. Data description
2.1. Introduction to SAS OpRisk Global Data
The SAS OpRisk Global Data is the world’s largest and most comprehensive collection of publicly reported operational losses in excess of US$100,000 (Wei, Li, and Zhu 2018). In our analysis, we use the database as of October 2017. It documents more than 34,360 events across all industries worldwide. These events span many industry sectors, ranging from agriculture to manufacturing to financial services. Relevant background information is provided for the events, such as the name, country, and industry sector of the business and the specific business lines involved. The starting and ending years of the event together with the year of settlement are also documented. Important financial features, such as the value of the loss (both original and CPI adjusted), are also available in the data set, including a finer breakdown into items such as legal liability and restitution. Despite these rich characteristics, the SAS OpRisk Global Data is still new to the insurance literature.[4] This data set provides the broadest possible statistical sample from which to model cyber risk factors, probabilities, and costs. In addition, we use financial market data from the Center for Research in Security Prices (CRSP) in the Wharton Research Data Services (WRDS) to supplement the main data in our analysis.
Following the existing literature (cf. Cebula and Young 2010; Biener, Eling, and Wirfs 2014; Eling and Wirfs 2015) and consistent with the operational risk frameworks in Basel II and Solvency II, we define cyber risk as a subgroup of "operational risk to information and technology assets that have consequences affecting the confidentiality, availability or integrity of information or information systems" (Cebula and Young 2010), with detailed categories shown in Table 1. By structuring cyber risk as a subcategory of operational risk, we can clearly identify cyber risks separately from the established framework of operational risks in SAS OpRisk Global Data.
Box 1 provides a sample description of a recent prominent cyber attack on the insurance industry that we extracted from the SAS OpRisk Global Data.
As the example shows, the event description is textual in nature. Therefore, we applied a text mining method, one of the recent developments in big data analysis, implemented in the Python programming language, to identify cyber risk events from keyword strings. This method ensures the accuracy, repeatability, and scalability of our identification.
2.2. Initial identification method based on Biener, Eling, and Wirfs (2014)
Because prior empirical analysis using this type of cyber risk data is very limited, we first applied an initial identification method from the recent literature to extract cyber risk events from the SAS OpRisk Global Data for validation. More specifically, we closely follow Biener, Eling, and Wirfs (2014) and use a set of broadly defined keyword strings. Three criteria (critical asset, actor, and outcome) in combination are deemed to be relevant for a cyber risk incident. We searched the description of each observation in our sample data for a combination of keywords, where each combination consisted of one word from each criterion (i.e., three-word combinations for the keyword strings). We then checked all identified observations individually for their affiliation with cyber risk and, where necessary, excluded irrelevant incidents from the cyber risk data set. Table 2 describes the keywords for each of the criteria and further groups the keywords under “actor” into four categories (categories 1-4) to capture the different natures of the events, including “actions by people,” “systems and technical failure,” “failed internal processes,” and “external events.”
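The three-criteria search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the keyword lists are short hypothetical stand-ins for the full lists in Table 2, and the function names are chosen here for illustration only. A description is flagged only when it contains at least one keyword from each of the three criteria; flagged events would still need the individual review described in the text.

```python
import re

# Hypothetical, abbreviated keyword groups; the full lists are in Table 2.
KEYWORDS = {
    "critical_asset": ["computer", "server", "website", "online", "database"],
    "actor": ["hacker", "virus", "phishing", "malfunction", "outage"],
    "outcome": ["loss", "stolen", "unavailable", "breach", "disrupt"],
}


def _contains_any(text, keywords):
    """Return True if any keyword starts a word in text (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(k), text, re.IGNORECASE) for k in keywords
    )


def matches_cyber_criteria(description):
    """Flag a description that has at least one keyword from every criterion."""
    return all(_contains_any(description, kws) for kws in KEYWORDS.values())


# Hypothetical event descriptions for illustration.
events = [
    "Bank reported a loss after an online phishing scam targeted customers.",
    "Company fined for mis-selling insurance products to retail clients.",
]
flagged = [e for e in events if matches_cyber_criteria(e)]
```

Matching on word prefixes (so that "disrupt" also catches "disrupted") is a design choice made for this sketch; it widens recall at the cost of more false positives, which is why a manual second screening remains necessary.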
To better understand the quality of the identified cyber risk data set, especially in the context of cyber risk contagion analysis, we provide two examples to illustrate the strengths and weaknesses of the keyword search string method previously described. In the first example, a typical incident of cyber risk involving the Bank of Brazil was identified and included in this data set. The event description in the original database is as follows. “In October 2004, Banco do Brasil, a Brazilian financial institution, reported an estimated loss of $0.1M (0.29M BRL) due to an online phishing scam that used a Trojan horse virus to attack the company’s online ecommerce site. Typically, in a phishing scam internet users enter their user name and password into a fake website that looks identical to the company’s site that they are trying to access. This fake website is only online for several days. It records all the user names and passwords that were entered into it so that whoever runs the site can access the real site to transfer money out of the visiting persons’ accounts. A Trojan horse virus is a derivative of a typical phishing scam. However, a person does not have to enter their information into a fake site. Instead a program is unknowingly downloaded onto a computer when internet users click on a bogus site and scroll through the page to find out what the site is about. The virus downloaded then monitors the activity of the internet users and records all of their user names and passwords using a key logger. The key logger then sends the information back to the scammer so that the persons’ bank accounts can be accessed. Fifty-three people have been arrested by the police who are believed to have been involved in the scam. Eighteen people charged had previously been charged with similar offenses. Banks have lost $30M overall from the Brazilian Trojan horse virus. Other banks that also experienced losses were Banco Bradesco SA, Banco Itau Holding Financeira SA, Caixa Economica Federal, HSBC, and Unibanco.” This event clearly has implications for studying cyber risk contagion across different types of financial institutions and beyond.
Another instance identified in the data set exhibits different characteristics because while there is contagion within the (large) network of the specific financial institution impacted, there is unlikely to be any contagion across different companies in a larger setting. The description of this event from the original database is reproduced below. “In March 2005, Barclays Bank, a UK financial institution and the primary subsidiary of Barclays PLC, reported that it lost an estimated $0.33M (0.18M GBP) due to ATMs malfunctioning. On March 27, 2005, from 2am to 5pm, customers were unable to withdraw money from approximately 1500 cash machines after a computer breakdown stopped them from accessing their accounts. Telephone and internet banking was also out of service, but customers were able to make purchases with their cards. The cause of the computer glitch was unknown; however, speculations were that it was caused by the clocks going forward or by a piece of IT hardware. Although the glitch was resolved by 4pm, internet banking remained offline until 5pm. The 1500 Barclays cash points that were out of service represented half of the bank’s southern network. The northern network was not affected, as it resided on a different server.”
These two examples show that not all identified cyber risk incidents are equally relevant for the purpose of our study. There is clearly a trade-off in this identification method. On the one hand, the keyword search string method has the advantage of comprehensively including all aspects of a possible cyber incident. On the other hand, because of its all-encompassing nature, many different types of cyber events are mixed together, resulting in a highly heterogeneous sample. More importantly, upon a second screening of the outcomes of the initial keyword search, we find that certain identified events, while appropriately classified as cyber risk, do not seem to have implications for cyber risk contagion. For example, some identified incidents result from fraudulent activities of one particular employee or physical damage to certain computers and network equipment. While these qualify as cyber risk events according to the widely accepted broad definition in Biener, Eling, and Wirfs (2014), they do not necessarily have any potential for contagion within a company or across many different companies. This calls for a more sophisticated process of defining and screening for cyber incidents for use in our study. We address this concern by creating a more elaborate method to identify cyber risk incidents and further refine the data set to study contagion.
It is worth noting that we also considered an alternative identification method that uses the event category and subcategory definitions provided in the SAS OpRisk Global Data to identify cyber risk incidents. Three event categories (or subcategories) are considered relevant for identifying possible cyber risk incidents: (1) Event = internal fraud and subcategory = unauthorized activity, (2) Event = external fraud and subcategory = systems security, and (3) Event = business disruption and system failures, in which (2) and (3) better identify contagious cyber risk than (1). This method has its own limitations. The event category classification is predetermined by the data vendor and hence is subject to any potential bias therein. In addition, only a relatively small set of activities were described and employed when classifying incidents into event categories and subcategories, which might lead to an incomplete set of relevant cyber risk incidents. For these reasons, we do not focus on this identification method in the subsequent analysis.[5]
3. Refined data set for cyber risk contagion
Because cyber risk is a very broadly defined concept, there has not been much work as of yet focusing on narrowing down its definition for the purpose of studying cyber risk contagion. To help advance understanding of contagion, we propose a refined data extraction method to identify a more accurate set of data points pertaining to the contagious cyber risks and base all subsequent analysis upon this refined data set.
Recall that the Biener, Eling, and Wirfs (2014) method relies on the set of broadly defined keyword strings based on the three criteria (critical asset, actor, and outcome) to identify the relevant cyber risk events as described in Table 1. While all of the actor categories are related to cyber risk, we consider only a subset of them to be potentially contagious. More specifically, we consider the “systems and technical failure,” “external events,” and the highlighted “actions by people” keywords to be more prone to contagious cyber risks and hence decompose the actor category into noncontagious and contagious groups. We again applied the Python-based text mining analysis to conduct the new keyword string extraction. Only the keywords under “Actor Category” in Table 3 are considered to pertain to contagious cyber risks and hence are included in the refined keyword search.
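The decomposition of the actor criterion into contagious and noncontagious groups can be illustrated with a short Python sketch. The keywords and their contagion flags below are hypothetical placeholders rather than the contents of Table 3, and the function name is assumed for illustration. The point is only that the refined search reuses the same three-criteria logic with a restricted actor keyword list.

```python
# Hypothetical actor keywords by category. The boolean marks whether a keyword
# is treated as potentially contagious; the actual lists are in Table 3.
ACTOR_KEYWORDS = {
    "actions by people": {"hacker": True, "phishing": True, "embezzle": False},
    "systems and technical failure": {"malfunction": True, "outage": True},
    "failed internal processes": {"misrouted": False},
    "external events": {"blackout": True},
}


def contagious_actor_keywords(actor_keywords):
    """Collect only the actor keywords flagged as potentially contagious."""
    return sorted(
        kw
        for keywords in actor_keywords.values()
        for kw, is_contagious in keywords.items()
        if is_contagious
    )


# Restricted actor list that would replace the full list in the refined search.
refined_actor_list = contagious_actor_keywords(ACTOR_KEYWORDS)
```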
3.1. Descriptive statistics
While the SAS OpRisk Global Data covers events from earlier years, we decided to use only extracted events from 1990 onward, when the use of computers and the internet became more prevalent. Focusing on recent data also helps us obtain a more relevant and homogeneous data set for our analysis. Our descriptive analysis confirms that cyber incidents since 1990 account for over 95% of all identified cyber events in our data set.
With the refined cyber risk data set identified by the new keyword search, we conducted an analysis of firm characteristics to understand the differences between companies in the SAS OpRisk Global Data that have had a cyber risk incident and those that have had other operational risk events but no cyber risk event. From Table 4, we can see that companies with cyber risk incidents experienced larger loss amounts and tend to have more employees, which may proxy for increased complexity of the business.
Overall, our descriptive analysis of the cyber risk incidents shows that (consistent with our intuitive understanding) cyber risk exposures are widespread across industries, business lines, and countries. To conserve space, tables containing the full set of descriptive analysis results are relegated to the Appendix (Tables A1-A4). Based on the keyword search method, we identified a total of 491 cyber risk incidents that are potentially relevant for our analysis, about 30% of which are results of “actions by people,” i.e., category 1 as defined in Table 3, 63% of which are results of systems and technical failure, i.e., category 3, and 6.3% of which are results of external events, i.e., category 4. We can observe an inverted U-shaped trend in the frequency of cyber risk incidents over the 28-year period (i.e., 1990-2017) for which we have collected data, with a larger number of events being recorded during the years 1997 to 2010. This pattern may be due to the fact that cyber incidents can sometimes take years to uncover (e.g., the infamous hack of email accounts at Yahoo!), so some incidents in more recent years have not yet been discovered and included in the database. Indeed, when we examine the length of the documented cyber events, we find that over 60% of the events last more than one year and nearly 15% of the events last five years or longer.
Our data set also shows that cyber events happen to companies in a wide array of countries, although the intensity differs greatly across these countries. A total of 82 countries have at least one recorded cyber event in our data set and sample period. The United States by far has the largest number of recorded events, with other developed economies such as the United Kingdom, Germany, Japan, France, Canada, and Australia closely following behind. It is interesting to note that the largest developing countries are also quite susceptible to cyber risk, with India ranked third in the number of cyber incidents. While our study mainly focuses on the U.S. market, the nature of many cyber risks suggests cross-country contagion. Our model framework can potentially extend to analyzing this type of contagion; detailed analysis of this cross-country contagion might be an interesting research avenue for future studies.
Cyber risk also affects many different industry sectors. Our descriptive analysis shows that 17 industry sectors have experienced at least one cyber incident in our sample period. However, cyber risk exposure seems to be extremely unevenly distributed across different industry sectors. The financial services industry has a much greater exposure to cyber risk than any other industry. In fact, the financial services industry accounts for almost 50% of all cyber risk incidents in our data set. Our finding is consistent with the Verizon (2017) survey and Kopp, Kaffenberger, and Wilson (2017), which also show that the finance industry has by far the most incidents with confirmed data losses. Retail, manufacturing, information services, professional services, and utilities also account for a large portion of the cyber risk events. A larger number of cyber risk events can naturally lead to a higher chance of cyber risk contagion.
A unique variable available in the data set is whether multiple firms are impacted by the same event. Out of the 491 identified cyber risk events, 129 or 26% of all events have impacted more than one firm. This finding provides initial evidence on the potential contagion effects of cyber risk incidents. While this variable alone is insufficient to characterize fully how cyber risk contagion is formed or what types of companies/incidents are most susceptible to contagion, it can be used to validate our analysis of cyber risk contagion, which will be presented in a subsequent section. The data set also provides information on whether a single event has resulted in multiple losses. About 16.5% of all incidents result in multiple losses. In our analysis, we aggregate all the losses associated with the same event to correctly identify the impact of one event.
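The event-level aggregation and the multi-firm share described above can be sketched as follows. The records and field layout are hypothetical and do not reflect the actual SAS OpRisk Global Data schema; the last line simply reproduces the arithmetic behind the 26% figure reported in the text.

```python
from collections import defaultdict

# Hypothetical loss records: (event_id, firm, loss in US$ millions).
records = [
    ("E1", "Bank A", 0.10),
    ("E1", "Bank B", 0.25),
    ("E2", "Retailer C", 1.50),
    ("E3", "Insurer D", 0.40),
    ("E3", "Insurer D", 0.60),
]

total_loss = defaultdict(float)
firms = defaultdict(set)
for event_id, firm, loss in records:
    total_loss[event_id] += loss  # aggregate all losses tied to the same event
    firms[event_id].add(firm)     # track the distinct firms hit by the event

# Events that affected more than one firm, and their share of all events.
multi_firm = [e for e, f in firms.items() if len(f) > 1]
multi_firm_share = len(multi_firm) / len(firms)

# Share reported in the text: 129 of 491 events affected more than one firm.
reported_share = 129 / 491
```

In this toy sample, event E3 has two losses at a single firm, so it counts as a multiple-loss event but not as a multi-firm event, which mirrors the distinction between the two variables in the data set.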
4. Empirical models
Due to the nature of cyber risk exposures, cyber risk contagion is an important concern for many companies. Different firms are subject to different types of cyber risks at different levels and to different extents. For instance, they may be connected by being situated within the same supply chain or value net (i.e., contagion as shared attack vector). In addition, companies may use the same underlying technology platform and/or security software (i.e., contagion due to technology monoculture/systemic risk), so cyber attacks on one system may lead to simultaneous attacks on many different companies. Therefore, it is important for companies and insurers to understand the types of cyber risk exposures before further examining the contagion of cyber risks.
As complexity grows exponentially with the number of entities, the cyber risk dataset needs to be broken down into smaller, more manageable clusters in order to examine the formation of the contagion risk and explore its implications for the (re)insurance industry. Additionally, a critical component of our analysis is to evaluate the dependence (or co-integration) between entities, which also presents a tremendous challenge because the cyber risk data are extremely sparse and the large number of companies makes the problem high-dimensional. To address this challenge, we propose to first use clustering techniques to reduce dimensionality and then use the factor copula model to assess interdependence among entities for their cyber risk exposures.
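To give some intuition for the second step, the sketch below simulates a one-factor latent structure of the kind that underlies factor copula models: each firm's latent variable is a loading times a common factor plus an independent idiosyncratic term. It is an illustrative simplification under stated assumptions, not the specification estimated in this paper: the factor and the idiosyncratic terms are taken to be standard normal, and the loadings are hypothetical. The example shows that firms loading on the common factor are dependent, whereas a firm with a zero loading is not.

```python
import math
import random


def simulate_one_factor(loadings, n_draws, seed=0):
    """Draw latent variables X_i = loading_i * Z + eps_i with a shared factor Z."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)  # common factor shared by all firms
        draws.append([b * z + rng.gauss(0.0, 1.0) for b in loadings])
    return draws


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical loadings: firms 0 and 1 load on the factor, firm 2 does not.
sample = simulate_one_factor([1.0, 1.0, 0.0], n_draws=5000)
cols = list(zip(*sample))
corr_linked = pearson(cols[0], cols[1])    # population correlation is 0.5
corr_unlinked = pearson(cols[0], cols[2])  # population correlation is 0.0
```

Because a single factor drives the dependence of all firms within a cluster, the number of parameters grows linearly rather than quadratically with the number of firms, which is what makes this type of structure attractive for sparse, high-dimensional data.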
4.1. The clustering model for cyber risk data
One natural categorization of the cyber risk exposures is to use the subcategories as defined in Cebula and Young (2010) and Biener, Eling, and Wirfs (2014), or the event categories and subcategories as defined in SAS OpRisk Global Data and listed in Table 1. However, upon careful examination of the events, it is clear that these categories are too broadly defined and do not take many firm characteristics into consideration. Consequently, they cannot accurately capture the unique features of different types of cyber risks that may have different implications for firm risk managers and/or insurers.
Therefore, we propose to adopt clustering techniques from the machine learning literature to simplify the complex cyber risk graphs and reduce dimensionality. There is a wide set of classification techniques to select from, including cluster analysis, logistic regression, decision tree, support vector machine, and neural network methods, readily available from various statistical packages (such as SAS and R) for robustness and validation. Due to the fast-changing nature of cyber attacks, we need to place a heavier weight on unsupervised learning methods (i.e., methods that do not need to rely on a known “left-hand-side” variable, such as cluster analysis) so that our modeling framework can be easily updated to accommodate new data and patterns in practice on an ongoing basis.
For this reason, we build our classification model using the unsupervised method cluster analysis. Cluster analysis is perhaps one of the most commonly used unsupervised learning methods (cf. Gan 2013). Under this class of models, data is partitioned according to certain “similarity” and “dissimilarity” measures. The choice of specific measures of “similarity” (or “dissimilarity”) is critical. Commonly used clustering methods include Ward’s method (which minimizes the within-cluster variance), the K-means method (reassignment to the nearest centroid at each iteration), and average linkage (where distance is defined as the average distance between all pairs of members of the two clusters), among others. Many simulations and empirical analyses have shown that there is often no rule of thumb for choosing the exact type of similarity measure, as the performance of different cluster analysis methods depends heavily on the nature of the underlying data to be classified. When one is unsure of the underlying shape and distribution of the clusters in the data set, a nonparametric method is more conservative, such as the “density linkage” method that uses nonparametric probability density estimates to find the clusters. For this reason, we chose to use the two-stage density linkage method available in the SAS statistical package to conduct our cluster analysis of the cyber risk data. The two-stage density linkage is a modification of density linkage that ensures all points are assigned to modal clusters before the modal clusters are permitted to join.
While cluster analysis is a popular and easy-to-implement method, and software packages (e.g., SAS) have ready-to-use programs to implement it, there are several common considerations in applying cluster analysis. First, cluster analysis can result in an uneven split of the sample, which may not be desirable in certain applications requiring either a predefined number of clusters or a more even sample split. Second, in applications where there are prespecified classes (such as the classes of fraud and non-fraud in the context of fraud detection), cluster analysis does not suggest a correspondence between these prespecified classes and the identified clusters. However, these considerations are not causes for concern in the context of cyber risk contagion modeling because we do not have a predetermined set of clusters. Our main purpose is to group cyber risk incidents by their characteristics and reduce dimensionality, which is precisely what cluster analysis does. In addition to being flexible enough to more easily adapt to the ever-changing landscape of cyber risks, the unsupervised nature of cluster analysis also sidesteps constraints imposed by the rather limited understanding of properly defining the subcategories within the domain of cyber risk exposure.
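To make the reassignment idea concrete, the following is a minimal sketch of the K-means method mentioned above, written with the Python standard library only. It is not the two-stage density linkage procedure used in this paper (that analysis was run in SAS); K-means is swapped in here purely because it is short to write out. The feature values, the function names, and the farthest-point initialization are all illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """K-means: reassign each point to its nearest centroid, then update centroids."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical standardized features per incident: (firm size, profitability).
incidents = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),      # small firms, average profits
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),      # mid-size firms, high profits
    (10.0, 0.0), (10.2, 0.1), (9.9, -0.1),   # large firms, average profits
]
labels, centroids = kmeans(incidents, k=3)
```

On well-separated toy data like this, the three groups are recovered exactly. Real incident data are far noisier, which is one reason a nonparametric method may be preferable when cluster shapes are unknown.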
4.2. The within and in-between dependence of cyber risk clusters
Upon examining the available characteristics for their relevance, the cyber risk incidents are grouped into three clusters based on the following characteristics: Country of legal entity, Country of incident, First year of event, Industry sector code, Assets (size), and Net income (profitability). Since most of the companies are based in the U.S. and most events have occurred in the U.S., we code the country of legal entity and country of incident as 1 if in the U.S., and 0 otherwise.
The cluster analysis results show that by properly identifying relevant characteristics in cyber risk events, one can effectively group similar events while distinguishing between groups of events that exhibit significantly different patterns. In Table 5 we provide a set of descriptive statistics for each cluster and comparisons of the clusters. We can easily see that the three clusters are indeed quite different in terms of firm size, industry, and many firm-based characteristics. For example, cluster 3 contains larger, longer-lasting events that occur at much less profitable companies. These differences are evident in the two-sample t-tests we have conducted across clusters, which are available upon request. Table 6 confirms that while there is indeed some correspondence between the three clusters identified based on event and firm characteristics and the event risk categories and/or activities defined in the SAS OpRisk Global Data, the cluster analysis reveals more subtle differences between different cyber risk events. For example, while “external fraud” [a loss-producing event that involves at least one criminal act aimed at benefiting the perpetrator(s) and aimed at causing a loss for the firm and/or some other associated party (generally a client); no perpetrator may be an employee] is considered to be one general event risk category, these events are unevenly distributed across the three identified clusters, suggesting that when modeling these external fraud risks, one should also consider the specific risk characteristics, such as those revealed by our cluster analysis.
Cyber risk events may exhibit dependence within clusters and/or in-between clusters. We hypothesize that firms within each cluster would be more subject to contagious cyber risks than firms between different clusters. One natural way to validate our hypothesis is to take advantage of the available information contained in the variable “multiform impacted” in the SAS OpRisk Global Data and test if multiple firms that are impacted by the same event are indeed included in the same cluster. As described previously, since 1990, there have been 491 events identified as cyber risk incidents, 129 of which have impacted multiple firms. Our validation results confirm that firms within each cluster would be more subject to contagious cyber risks than firms between different clusters at 91.5% accuracy.
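One way such a validation could be computed is sketched below: for every event that touches more than one firm, check whether all affected firms were assigned to the same cluster, and report the share of events for which they were. The exact procedure behind the accuracy figure above is not spelled out in the text, so this is an assumption, and the record layout, identifiers, and toy data are hypothetical.

```python
from collections import defaultdict


def same_cluster_share(records):
    """Share of multi-firm events whose firms all fall in a single cluster.

    records: iterable of (event_id, firm_id, cluster_label) tuples.
    """
    clusters_by_event = defaultdict(list)
    for event_id, _firm_id, cluster in records:
        clusters_by_event[event_id].append(cluster)

    # Only events that impacted more than one firm are informative here.
    multi_firm = [c for c in clusters_by_event.values() if len(c) > 1]
    if not multi_firm:
        raise ValueError("no multi-firm events in the input")

    hits = sum(1 for c in multi_firm if len(set(c)) == 1)
    return hits / len(multi_firm)


# Hypothetical toy records: (event, firm, assigned cluster).
toy_records = [
    ("e1", "firm_a", 1), ("e1", "firm_b", 1),
    ("e2", "firm_c", 2), ("e2", "firm_d", 2), ("e2", "firm_e", 2),
    ("e3", "firm_f", 1), ("e3", "firm_g", 3),   # firms split across clusters
    ("e4", "firm_h", 3), ("e4", "firm_i", 3),
    ("e5", "firm_j", 2),                         # single-firm event, ignored
]
share = same_cluster_share(toy_records)
```

In this toy input, three of the four multi-firm events keep all their firms in one cluster, so the share is 0.75.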
4.3. The factor copula model
Based on our results, we can capture cyber risk dependence among entities within each of the clusters developed. A very common and intuitive way to model the dependence is to use copulas. Copulas have been studied in both actuarial science and finance to examine dependencies among risks (Frees and Valdez 1998; Venter et al. 2007; Ai, Brockett, and Wang 2017). In this paper, we propose to extract useful information from financial prices to enrich the sparse cyber risk data and to take advantage of a new statistical development in factor copula models based on a latent factor structure (Zhang and Jiao 2012; Oh and Patton 2013). A factor copula model is generated by the following structural equation
$$X_i = \gamma_i Z + \varepsilon_i, \quad \varepsilon_i \sim \text{iid}, \quad Z \perp \varepsilon_i \;\; \forall i,$$
$$X \equiv [X_1, \ldots, X_N]' \sim F_X = C\big(F_{X_1}(x_1), \ldots, F_{X_N}(x_N); \theta\big),$$
where the $X_i$ are latent variables, $Z$ is the common factor, and the $\varepsilon_i$ are idiosyncratic factors.
The majority of our sample consists of public firms with financial prices that contain valuable information on cyber risks. Recent research by Lange and Burger (2017) shows that data breaches have an impact on the total returns and volatility of the affected companies’ stock. In addition, Ahern (2013) and Foucault and Fresard (2014) illustrate that a firm’s stock price learns and incorporates available market information from its network, such as peers or supply chain streams. Recently, Smith et al. (2019) examined stock market response to cyber breaches and confirmed that stock prices were negatively affected. Amir, Levi, and Livne (2018) find evidence that managers have incentives to withhold negative information and investors may use related firms’ information to infer the likelihood of an attack. They also find that withheld information on cyber-attacks is associated with a decline of approximately 3.6% in equity values in the month the attack is discovered. Accordingly, we hypothesize that cyber risk is reflected in the risk premium and hence stock price of the company, and a cyber risk event may affect the risk premium and stock price of its network as suggested in Ahern (2013) and Foucault and Fresard (2014). Therefore, we can match the SAS OpRisk Global Data with financial data of public firms for our subsequent analysis. Our proposed factor copula model approach is particularly attractive for cyber risk contagion modeling because it significantly enhances the available sparse data with public information from the market[6] and has a very flexible dependence structure for modeling systemic risks in the high dimensional space. The available real-world data will be used to tune and validate the model.
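The structural equation above can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that the common factor and the idiosyncratic factors are independent standard normal draws (the model itself leaves these distributions open), and that each latent variable loads on the common factor with a hypothetical loading. Under those assumptions the correlation between two latent variables is the product of their loadings divided by the product of their standard deviations, which the simulation approximately reproduces.

```python
import math
import random


def simulate_one_factor(loadings, n_draws, seed=0):
    """Draw latent variables X_i = loading_i * Z + eps_i with Gaussian Z and eps_i.

    Gaussian factors are an illustrative assumption, not a model requirement.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)  # common factor shared by every entity
        draws.append([g * z + rng.gauss(0.0, 1.0) for g in loadings])
    return draws


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def implied_corr(g1, g2):
    """Theoretical correlation when Z and eps_i are independent N(0, 1)."""
    return g1 * g2 / math.sqrt((g1 ** 2 + 1.0) * (g2 ** 2 + 1.0))


loadings = (1.0, 2.0)  # hypothetical factor loadings for two entities
sample = simulate_one_factor(loadings, n_draws=20000)
sample_corr = pearson([row[0] for row in sample], [row[1] for row in sample])
theory_corr = implied_corr(*loadings)
```

The larger the loadings on the shared factor, the stronger the dependence between entities, which is the mechanism the factor copula uses to capture contagion within a cluster.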
4.4. A simple case study of the Target and Home Depot data breaches
We now illustrate the use of factor copulas with a simple case study of the Target and Home Depot data breaches to examine the impact of contagious cyber risk exposures. Without loss of generality and for ease of presentation, we showcase the features of the factor copula model using this simplified, representative, and tractable setup. Generalizations and extensions to more firms can be derived relatively easily.
We chose retail companies for our case study due to their similarity and susceptibility to cyber attacks, as shown in the previous descriptive analyses. Retail companies tend to share similar cyber risks because they use the same payment card systems, and they are therefore exposed to cyber risk contagion. Indeed, the previous cluster analysis placed both companies in the same cluster. As described below, the Target and Home Depot data breaches are a typical example of cyber risk contagion. The two companies’ Point-of-Sale systems were compromised by similar exploitation methods, and the use of stolen third-party vendor credentials and RAM (random access memory) scraping malware was instrumental in the success of both data breaches. The Target data breach was disclosed by Brian Krebs on December 18, 2013, with 40 million payment cards stolen (Krebs 2014b). Since then, similar retail data breaches have been on the rise, including Neiman Marcus and Michael’s in January 2014; Sally Beauty Supply in March 2014; P.F. Chang’s in June 2014; Goodwill Industries in July 2014; and SuperValu and The UPS Store in August 2014. Until the Home Depot data breach, the Target breach was the largest retail breach in U.S. history. The Home Depot breach, which came to light on September 2, 2014, when law enforcement and some banks contacted the company about signs of a compromise, topped that with 56 million payment cards stolen (Krebs 2014a). The impact of these data breaches on each of the companies was significant. After the Target data breach, the company’s posted quarterly profits were 46 percent below the expected profits (Gertz 2014). Target and Home Depot stock prices both took significant hits as well when the breaches happened.
We make use of information in the stock returns of these publicly traded companies around the cyber attacks to study the contagion risk. Following the financial systemic risk literature (Zhang and Jiao 2012), we examine the dependence relationships between fluctuations in stock returns by using copula models conditional on the common factors found through the factor analysis and the marginal impact due to cyber risk. We focus on the Principal Component Analysis (PCA) method to extract the common factors that are responsible for the covariation among the observed variables. The principal components are able to account for most of the variation in the observed stock returns. More specifically, we define $R_1$ and $R_2$ as the daily returns on Target and Home Depot during our sample period from December 18, 2013 (Target event date) to September 2, 2014 (Home Depot event date). If these returns were normally distributed, their joint distribution should be bivariate normal. However, a well-documented observation in the academic literature is that the probability distribution of financial series tends not to be normal. Therefore, we followed the suggestions of Hull (2009) to transform the returns into normalized variables $X_i$ ($i = 1, 2$) using $X_i = \Phi^{-1}(Q_i(R_i))$, where $\Phi^{-1}$ is the inverse of the cumulative standard normal distribution and the $Q_i$ are the cumulative distribution functions of the respective returns. In this transformation, the new variables are constructed to have a standard normal distribution with mean equal to zero and standard deviation equal to one. After transforming the non-Gaussian returns into normally distributed variables, we find the common factors of these variables with the factor loadings and the percentages that the common factors account for in the underlying data. These results are exhibited in Table 7.
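The percentile-to-percentile transformation described above can be sketched in a few lines. The version below estimates each return series' cumulative distribution function by its empirical ranks, scaled by n + 1 so that no observation maps to probability 0 or 1. That estimator is an assumption made for illustration, since the text does not state how the distribution functions were estimated, and the return values are hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map each observation to the standard normal quantile of its empirical rank.

    Uses rank / (n + 1) as the empirical CDF, an illustrative choice that keeps
    probabilities strictly inside (0, 1). Ties are broken by input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf(rank / (n + 1))
    return scores


# Hypothetical daily returns for one stock.
returns = [0.03, -0.01, 0.00, 0.02, -0.05]
scores = normal_scores(returns)
```

Because the mapping is monotone, the ordering of the returns is preserved, so rank-based dependence between two transformed series matches that of the original returns.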
Note that this transformation is percentile to percentile, so the correlations among the returns can be measured by the ones among the new variables. In the two-factor model,
$$X_i = \alpha_i F_1 + \beta_i F_2 + \sqrt{1 - \alpha_i^2 - \beta_i^2}\, Z_i,$$
where $F_1$ and $F_2$ are two common factors (latent factors from PCA or other factor analysis) affecting returns for both Target and Home Depot, which include the impact of cyber risks, and the $Z_i$ have independent standard normal distributions. The $\alpha_i$ and $\beta_i$ are constant parameters between -1 and +1. The correlation between $X_1$ and $X_2$ is thus $\alpha_1 \alpha_2 + \beta_1 \beta_2$. The calculated correlations from the factor copula model are referred to as the copula correlations. Both unconditional correlations and copula correlations between Target and Home Depot returns are reported in Table 8. The difference between the two-factor copula correlation and the unconditional correlation thus represents the isolated marginal effect of cyber risk. Based on our results, we can see that without using the factor copula model, the unconditional linear correlation of the Target returns and the Home Depot returns is about 0.4027. When using the factor copula model, we identified a significant increase in the copula correlation to 0.7580, which indicates the impact of contagious cyber risks during the sample period. Combined with empirical evidence that cyber attacks may negatively affect stock prices and firm value (Smith et al. 2019; Amir, Levi, and Livne 2018), these findings suggest that ignoring the increase in correlation due to cyber risk contagion may have a significant impact on insurers and investors.
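For reference, the copula correlation follows directly from the two-factor structure. The short derivation below assumes, as in Hull's formulation, that the two common factors and the idiosyncratic terms are mutually independent with zero mean and unit variance, and that the loadings enter squared under the square root so that each normalized variable has unit variance.

```latex
\begin{aligned}
\operatorname{Var}(X_i) &= \alpha_i^2 + \beta_i^2 + \left(1 - \alpha_i^2 - \beta_i^2\right) = 1, \\
\operatorname{Cov}(X_1, X_2) &= \alpha_1 \alpha_2 \operatorname{Var}(F_1) + \beta_1 \beta_2 \operatorname{Var}(F_2)
  = \alpha_1 \alpha_2 + \beta_1 \beta_2, \\
\operatorname{Corr}(X_1, X_2) &= \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}}
  = \alpha_1 \alpha_2 + \beta_1 \beta_2 .
\end{aligned}
```

As a purely hypothetical illustration (these are not the loadings in Table 7), loadings of (0.8, 0.3) for one firm and (0.7, 0.4) for the other would give a copula correlation of 0.8 × 0.7 + 0.3 × 0.4 = 0.68.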
In this section, we used the example of Target and Home Depot for the purpose of illustrating our modeling approach and providing insights into possible cyber risk contagion with these well-publicized events. Our illustrative example builds on prior literature on financial market contagion and the financial impact of cyber risks, with the hypothesis that information regarding cyber risk is contained in financial prices and dependence among these possibly connected cyber risk exposures can be modeled by a factor copula model, as shown in this section. As supported by the prior literature (e.g., Lange and Burger 2017; Smith et al. 2019), in this illustrative example, the cyber attacks experienced by Target and Home Depot are codependent (i.e., representing contagion) and reflected in their stock prices during that period.
The two-stage model framework we have proposed in this paper is easily adaptable by insurers and businesses to build their own model for cyber risk contagion, based on either actual cyber risk incidents or simulated incidents for the analysis. In essence, selected characteristics and clustering techniques can be used to group the large set of cyber risk incidents into a desired number of groups. A factor copula model can then be applied to each of these clustered groups to further model the dependence among different entities. Lastly, the difference between the unconditional correlations and the copula correlations can be used to assess the existence and the extent of the contagion effects. While the modeling process does involve choices that should be made according to the specific scenarios, the two-stage modeling approach is general and flexible enough to be used in many different settings. Because this is a first step toward such a modeling framework, there are naturally numerous avenues that future research can explore. We discuss some of these future research opportunities in the Conclusion.
4.5. Discussion on alternative data sources
In this research, we have focused on the SAS OpRisk Global Data, which offers the world’s largest and most comprehensive, accurate repository of information on publicly reported operational losses in excess of US $100,000. Prior studies have also used a few alternative data sets. One alternative data source, very similar to the SAS OpRisk Global Data, is Algo OpData provided by IBM Algorithmics. The Algo OpData database, also called the OpVar database, was originated by Fitch Group and later acquired and renamed by IBM. Algo OpData provides operational loss events that occurred worldwide in the financial and non-financial industries beginning from 1920 with a loss reporting threshold of US $1 million. While it has not yet been applied to cyber risk research, Algo OpData has been used in the operational risk literature, such as Goldstein, Chernobai, and Benaroch (2011) and Moosa and Silvapulle (2012). Wei, Li, and Zhu (2018) provide a comprehensive overview of the worldwide operational loss datasets, including SAS OpRisk Global Data and Algo OpData.
In addition to the comprehensive datasets on operational risks with cyber risk as a category, there are also some datasets dedicated to cyber risks. For instance, Privacy Rights Clearinghouse (PRC), a California nonprofit corporation, maintains records of data breaches that have been made public in the US since 2005. In addition, Wikipedia maintains a page for the list of data breaches that involve theft or compromise of 30,000 or more records. Makridis and Dean (2017) and Akey, Lewellen, and Liskovich (2018) utilize the Privacy Rights Clearinghouse (PRC) and the U.S. Department of Health and Human Services (HHS) databases of cyber security data breach incidents. Unfortunately, these datasets can only identify a small fraction of total information security incidents. For example, the PRC and HHS data only collect data breach events (Makridis and Dean 2017).
The modeling framework presented in this research is general and can be applied to alternative datasets from another vendor, and can be further enhanced by additional information, such as an internal dataset of an insurance company. In practice, when a cyber risk event impacts multiple firms covered by different insurers, it may be possible that each insurer does not have a complete picture of who in the population was impacted by that event. A comprehensive dataset such as the SAS OpRisk Global Data will help provide a more complete set of information for insurers to consider and analyze. In addition, with the increasing importance of cyber risk and cyber insurance, insurance associations such as SOA and CAS may play an important role in collecting cyber risks data from insurers for the collaborative benefit of the entire insurance industry.
5. Conclusion
The cyber risk landscape is evolving rapidly and cyber security is one of the key concerns of modern organizations. As the complexity and severity of cyber risk continue to expand, businesses face greater systemic risk from cyber threats. Modeling and empirically examining the interconnected risk exposures will help reduce the vulnerability of individual organizations and hence the entire economic system. At the same time, it represents a great opportunity as well as a significant challenge for cyber insurance providers.
In this paper, we provide new modeling insights on cyber risk contagion and illustrate a two-step method based on cluster analysis and the factor copula approach. The proposed framework is simple and flexible enough to accommodate specific concerns of the end users and can serve as a stepping-stone for businesses, insurers, regulators, and academics to develop their own models. This research can also serve as a critical starting component for organizations and (re)insurers to gradually build cyber risks into a broader ERM framework.
There are many intuitive ways to extend the current modeling framework. For example, in the current analysis, we use a sample of identified cyber risk incidents. However, further research can also adopt the propensity score matching method to include entities that have yet to experience a cyber attack, or use Monte Carlo simulation based on the initial analysis to increase the amount of usable data for actuarial pricing and risk management purposes. The proposed factor copula model is also flexible enough to allow fat tail dependence and asymmetric dependence during a recession or market boom and can be combined with semiparametric marginal distributions.
Acknowledgments
This work was supported by the Casualty Actuarial Society through a research grant in 2017–2018. The authors thank the Casualty Actuarial Society for their very helpful support and feedback on improving this paper.