Mulugeta, Dagmawi, Ben Goodman, and Steven Weber. 2022. “Security Posture-Based Incident Forecasting.” Variance 15 (1).
• Figure 1. Internet-facing and private network segments of an organization
• Figure 2. Contemporary approach in problem domain: collect victim and nonvictim organizations, attribute their assets, and compare rules that discern victim configurations
• Figure 3. Novel contributions to the contemporary approach
• Figure 4. Pipeline collects victim and nonvictim data to map, manage, and extract features from their assets
• Figure 5. Victim organization collection pipeline
• Figure 6. Nonvictim organization collection pipeline
• Figure 7. Host collection stage using Censys [@127206]
• Figure 8. Feature engineering stage to extract features from 26 protocols
• Figure 9. Challenge with the experimental setup: Features and target label at different levels
• Figure 10. Isolation forest [@127211]: Isolating xi (left) and xo (right)
• Figure 11. Isolation forest [@127211]: Average depths converge
• Figure 12. Outlier versus inlier classification ROC curve
• Figure 13. Outlier versus inlier classification for different-sized organizations
• Table 1. Outlier versus inlier using all attributions
• Table 2. Outlier versus inlier using only certificate attributions
• Figure 14. Host classification using outlier and inlier hosts
• Figure 15. Randomly sampled nonvictim versus victim host classification using all attributions: (a) DNS inliers; (b) DNS outliers; (c) CERT inliers; (d) CERT outliers; (e) SEC500 inliers; (f) SEC500 outliers
• Table 3. Victim versus nonvictim inlier host using all attributions
• Table 4. Victim versus nonvictim outlier host using all attributions
• Figure 16. Organization classification using the probability distributions from outlier and inlier classifications
• Figure 17. Victim versus nonvictim organization classification using all attributions: (a) DNS, (b) CERT, (c) SEC500
• Table 5. Victim versus nonvictim organization classification using all attributions
• Figure 18. Outlier versus inlier classification feature importance chart using all attributions
• Figure 19. Outlier versus inlier classification feature importance chart using only certificate attributions
• Figure 20. Victim versus nonvictim inlier host classification using all attributions
• Figure 21. Victim versus nonvictim inlier host classification using only certificate attributions
• Figure 22. Victim versus nonvictim outlier host classification using all attributions
• Figure 23. Victim versus nonvictim outlier host classification using only certificate attributions
• Figure 24. Victim versus nonvictim organization classification using all attributions
• Figure 25. Victim versus nonvictim organization classification using only certificate attributions
• Table 6. Performance comparison with contemporary methods
• Figure 26. [c@127213] model performance of separate features

## Abstract

The frequency and impact of cybersecurity incidents increase every year, with data breaches, often caused by malicious attackers, among the most costly, damaging, and pervasive. Although our ability to quantify this risk for organizations remains frustratingly low, the cyber insurance industry has grown rapidly over the past several years and is expected to continue this growth into the foreseeable future, elevating the importance of developing new techniques for organizational risk assessment. This paper presents a method of using machine learning to conduct security posture-based forecasting, which offers certain improvements over current methods of establishing the probability of cybersecurity incidents. Furthermore, we introduce a novel method of building a network configuration-centric feature space while reducing both the data space and the processing cost of this sort of analysis.

Accepted: November 23, 2020 EDT

# APPENDIX

## Design Decisions

Because the problem is relatively new, there are no established data collection standards, and that gap becomes more pronounced when dealing with a data set as large as the one used here. Our design process involved a number of assumptions and challenges. The decisions are grouped by the stage in which the challenge arose: cohort selection, host attribution, host collection, feature engineering, and general.

### Design Decision 1: Cohort Selection

The design decisions made in the cohort selection stage are as follows:

1. We make the assumption that organizations that have not reported a security incident are exemplary of a “good” security posture. As Y. Liu, Sarabi, et al. (2015) have shown, there are numerous cases where this does not hold. Three cases are likely. First, organizations might have experienced an incident but it went unreported due to intentional or unintentional negligence. An example of that could be a data breach where the data involved are not sufficient (number of records, exposed credit card info, etc.) to report an incident. Second, organizations might have reported an incident close to the report window but not within it. One of the requirements to be considered a victim organization is to have reported an incident in the incident report window. Third, some Cybersecurity 500 organizations had hosts that are honeypots. Honeypots are hosts that are intentionally left vulnerable to lure attackers away from other real hosts. This can lead to incorrect learning as these vulnerable hosts will be associated with the nonvictim subset.

2. This analysis assumes that the three victim data sources are an exhaustive representation of all victims that have reported security incidents. However, the lists are certainly not exhaustive.

3. As mentioned earlier, the selection of organizations with an international presence has certain issues. To mitigate this, we selected only organizations with a major office in the United States, regardless of where most of their business was conducted. We leave dealing with incidents outside of the United States as a direction of future work.

4. During the nonvictim cohort selection, this analysis avoided cloud providers (Amazon Web Services, GoDaddy, etc.) for the reason that the “ownership” for the configurations that appear in Censys would be uncertain. Again, analyzing the risk profiles of different cloud providers is left as a direction for future work.

As the above list shows, it is quite hard to collect sample organizations that have a “good” external network posture. However, there is one source of ground truth, which is that victim organizations have reported an incident. Therefore, a significant amount of scrutiny is necessary to ensure that the victim organizations are exemplary of “bad” network postures.
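The report-window requirement described in item 1 can be illustrated with a minimal sketch. The function name `is_victim` and the window dates are hypothetical and chosen only for illustration; the actual window used in the analysis is not restated here.

```python
from datetime import date
from typing import Iterable


def is_victim(incident_dates: Iterable[date],
              window_start: date,
              window_end: date) -> bool:
    """Label an organization a victim only if at least one reported
    incident falls inside the (inclusive) incident report window."""
    return any(window_start <= d <= window_end for d in incident_dates)


# Hypothetical report window, for illustration only.
WINDOW_START = date(2019, 1, 1)
WINDOW_END = date(2019, 12, 31)

# An incident reported just outside the window does not count, which is
# the second failure case of the nonvictim assumption.
inside = is_victim([date(2019, 6, 15)], WINDOW_START, WINDOW_END)
outside = is_victim([date(2020, 1, 2)], WINDOW_START, WINDOW_END)
print(inside, outside)
```

Under this rule, an organization with an incident shortly before or after the window would land in the nonvictim cohort, which is one reason the nonvictim labels are noisier than the victim labels.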

### Design Decision 2: Host Attribution

The design decisions made during the host attribution stage are as follows. First, the domain name attribution step does not account for organizations that have more than one domain. Incorporating more than one domain name per organization is left as a direction for future work. Second, during the subdomain enumeration step, no check is done to determine whether the subdomains found were updated or registered before or after the lookup date. Given the scope of the time frame and resources, we did not conduct a historical attribution of the IP addresses. A direction for future work would be to use Rapid7 Open Data (https://opendata.rapid7.com/) to check for historical data points. Third, during the subdomain resolution step, the list of resolvers used as a parameter for massdns included some malicious servers. These are name servers that intentionally return incorrect IPv4 addresses for the supplied subdomains. To mitigate this issue, we collected 10 reliable name server lists and ran each subdomain against five randomly selected groups. If four or more of the five groups return the same IP address for a subdomain, we take that value; otherwise we drop that subdomain, similar to a majority vote. Even though this back-of-the-envelope approach yielded good results, a more rigorous treatment is left as a direction for future work.
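The four-of-five agreement rule for subdomain resolution can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: `resolve_with_vote`, the `lookup` callback, and `fake_lookup` are hypothetical names, and a real `lookup` would wrap a resolver tool such as massdns. The addresses come from documentation-reserved ranges.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence


def resolve_with_vote(
    subdomain: str,
    resolver_groups: Sequence[Sequence[str]],
    lookup: Callable[[str, Sequence[str]], Optional[str]],
    sample_size: int = 5,
    min_agree: int = 4,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Query `sample_size` randomly chosen resolver groups and return the
    IPv4 address that at least `min_agree` of them agree on, else None."""
    rng = rng or random.Random()
    k = min(sample_size, len(resolver_groups))
    sampled = rng.sample(list(resolver_groups), k)
    answers = [lookup(subdomain, group) for group in sampled]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None  # no group returned an answer
    ip, count = votes.most_common(1)[0]
    return ip if count >= min_agree else None  # drop the subdomain otherwise


def fake_lookup(subdomain: str, group: Sequence[str]) -> Optional[str]:
    """Hypothetical stand-in for a real DNS query: a group containing the
    resolver named 'bad' returns a wrong address."""
    return "203.0.113.66" if "bad" in group else "192.0.2.10"


# Four honest groups and one malicious group: the honest answer wins.
groups = [["r1"], ["r2"], ["r3"], ["r4"], ["bad"]]
print(resolve_with_vote("www.example.com", groups, fake_lookup))

# Only three of five agree: the subdomain is dropped (None).
split = [["r1"], ["r2"], ["r3"], ["bad"], ["bad", "r9"]]
print(resolve_with_vote("www.example.com", split, fake_lookup))
```

The threshold trades coverage for reliability: requiring four of five agreeing groups discards some legitimately resolvable subdomains (for example, those served by round-robin DNS) in exchange for fewer poisoned answers.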

Ideally, this footprinting technique should efficiently convert an organization name into a domain, or set of domains, that has a high level of confidence of correctly attributing all the public digital resources. However, even with the methods described here, there is still a possibility of false attributions and an even greater likelihood of missed detection when it comes to accurately identifying organization assets. This is because locating all the assets for an organization is a difficult endeavor as there is no ground truth to compare to.

### Design Decision 3: Host Collection

Design decisions made in the host collection stage are as follows:

1. We make the assumption that the hosts that have been attributed to the organizations exist in Censys. However, that might not be the case as the company might have blacklisted Censys’s probes.

2. There is a cost associated with running a query against BigQuery’s API. Due to fiscal constraints, running daily queries was not a feasible option. To make the step more affordable, we did an aggregated weekly lookup instead of daily lookups. The assumption here is that weekly lookups are close enough to daily lookups not to skew the results; however, we ran no formal tests to confirm that this is the case.

### Design Decision 4: Feature Engineering

During the feature engineering stage, we made the following design decisions:

1. There lies a subtle issue in our feature engineering technique. The lack of features that describe certain protocols as deeply as others leaves a certain imbalance in the way in which protocols dominate the feature space. Building a balanced scheme to extract reliable features from outside-in network posture data is left as a direction for future work.

2. This analysis focused on host-based features. However, inter-host-based (organization-level) features, such as the ratio of hosts that are on the cloud, could prove useful and are a good direction for future work.
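As an illustration of the organization-level feature mentioned in item 2, the ratio of cloud-hosted hosts per organization could be computed along the following lines. This is a sketch only: the function name `cloud_host_ratio` and the `(organization, is_cloud)` input shape are assumptions, and deciding whether a host is on the cloud is itself a nontrivial attribution step that is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cloud_host_ratio(hosts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Given (organization, is_cloud) pairs, one per attributed host,
    return the fraction of each organization's hosts that are cloud-hosted."""
    totals: Dict[str, int] = defaultdict(int)
    cloud: Dict[str, int] = defaultdict(int)
    for org, is_cloud in hosts:
        totals[org] += 1
        cloud[org] += int(is_cloud)
    return {org: cloud[org] / totals[org] for org in totals}


# Hypothetical host attributions for two organizations.
example_hosts = [
    ("org-a", True),
    ("org-a", False),
    ("org-a", True),
    ("org-a", False),
    ("org-b", False),
]
ratios = cloud_host_ratio(example_hosts)
print(ratios)
```

A feature of this kind is defined at the same level as the target label (the organization), so it would not need to be aggregated from host-level predictions.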

### Design Decision 5: General

Here are two general design decisions that are not mentioned above:

1. A major challenge, first noted by Liu et al. (2015), is acquiring high-quality incident data. The scarcity of such data leaves us with a feature space that we cannot analyze more deeply with confidence. This would not be an issue if a more systematic incident reporting model were in place.

2. Given the resource constraints, we had neither the time nor the resources to do a full reconnaissance on the organizations. However, with more tools and techniques we could provide visibility into much more than just their external network posture.

## GLOSSARY

This glossary provides definitions for security and privacy terminology and acronyms used within the analysis.

ACCESS: The ability or the means necessary to read, write, modify, or communicate data/information or otherwise use any system resource.

ACCESS CONTROL: A security mechanism used to grant users access to a system, based upon the identity of the user, and prevent access to unauthorized users. The user is commonly predefined to the system by the systems administrator with a user ID and password.

ASSETS: These include information, software, personnel, hardware, and physical resources (such as the computer facility).

AUTHENTICATION: The corroboration that an entity (user, process, etc.) is the one claimed.

AUTHORIZE/AUTHORIZATION: A document signed and dated by the individual who authorizes use and disclosure of their protected health information (PHI) for reasons other than treatment, payment, or health care operations. An authorization must contain a description of the PHI, the names or class of persons permitted to make a disclosure, the names or class of persons to whom the covered entity may disclose, an expiration date or event, an explanation of the individual’s right to revoke and how to revoke, and a statement about potential re-disclosure.

AVAILABILITY: Assurance that there exists timely, on demand, reliable access to data by authorized entities, commensurate with mission requirements.

BREACH: In general, the term “breach” means the unauthorized acquisition, access, use, or disclosure of protected health information that compromises the security or privacy of such information, except where an unauthorized person to whom such information is disclosed would not reasonably have been able to retain such information.

CLOUD ACCESS: To make contact with or gain access to cloud services.

CLOUD CONSUMER: Person or organization that maintains a business relationship with, and uses services from, cloud service providers.

CLOUD PROVIDER: Person, organization, or entity responsible for making a service available to service consumers.

COMPUTER SECURITY / CYBERSECURITY: The concepts, techniques, technical measures, and administrative measures used to protect the hardware, software, and data of an information processing system from deliberate or inadvertent unauthorized acquisition, damage, destruction, disclosure, manipulation, modification, use, or loss.

COMPUTER SYSTEM: Any equipment or interconnected system or subsystems of equipment used in the automatic acquisition, storage, manipulation, management, movement, control, display, switching, interchange, transmission, or reception of data or information, including computers; ancillary equipment; software, firmware, and similar procedures; services, including support services; and related resources.

CONFIDENTIALITY: Assurance that data are protected against provision or disclosure to unauthorized individuals, entities, or processes.

CRYPTOGRAPHY: A collection of tools and techniques to encrypt information to make it secure and maintain its confidentiality.

CRYPTOGRAPHIC HASH: A common function of computer security applications, used most frequently in digital signatures, passwords, message authentication codes, and other types of authentication. Most cryptographic hash functions take a string of any length as input (the “message”) and produce a fixed-length hash value (the “digest”) that is, for practical purposes, unique to that input.
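The fixed-length property can be seen in a short sketch using SHA-256 from Python's standard `hashlib` module. SHA-256 is chosen here only as a familiar example; the helper name `sha256_hex` is hypothetical. Note that uniqueness of the digest holds only in a practical sense, since collisions must exist in principle when arbitrarily long inputs map to a fixed-length output.

```python
import hashlib


def sha256_hex(message: bytes) -> str:
    """Return the SHA-256 digest of `message` as a hexadecimal string."""
    return hashlib.sha256(message).hexdigest()


short_digest = sha256_hex(b"a")
long_digest = sha256_hex(b"a" * 10_000)

# Both digests are 64 hex characters (256 bits), whatever the input length.
print(len(short_digest), len(long_digest))

# Different messages yield different digests; the same message always
# yields the same digest.
print(short_digest != long_digest, sha256_hex(b"a") == short_digest)
```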

CYBER INCIDENT: Actions taken through the use of computer networks that result in a compromise of or an actual or potentially adverse effect on a protected information system and/or the protected information residing therein.

DATABASE: A collection of interrelated data, often with controlled redundancy, organized according to a schema to serve one or more applications; data are stored so as to be used by different programs without concern for the data structure or organization. A common approach is used to add new data and to modify and retrieve existing data.

DIGITAL SIGNATURE: An electronic signature based upon cryptographic methods of originator authentication, computed by using a set of rules and a set of parameters, such that the identity of a signer and the integrity of the data can be verified.

DISCLOSURE: The release, transfer, provision of access to, or divulging of information outside the entity holding the information.

DNS: Domain Name System.

ELECTRONIC HEALTH RECORD: Electronic records of patient encounters in a healthcare delivery setting. An electronic health record typically consists of information including patient demographics, progress notes, medication history, vital signs, and laboratory results.

ENCRYPTION: The process of making information indecipherable by means of an algorithmic process to protect it from unauthorized viewing or use, especially during transmission, or when it is stored on a transportable magnetic medium. A confidential key is required to decrypt the data.

EXTERNAL MEDIA: Any digital media that can be transported separately from the information systems on which it is created, edited, or read. This includes optical media such as CD-ROMs and DVDs, as well as floppy disks, USB drives (or flash drives and thumb drives), and tapes.

FIREWALL(S): Hardware and software components that protect one set of system resources (e.g., computers, networks) from attack by outside network users (e.g., Internet users) by blocking and checking all incoming network traffic. Firewalls permit authorized users to access and transmit privileged information and deny access to unauthorized users.

FIXED ENDPOINTS: A physical device, fixed in its location, that provides a man/machine interface to cloud services and applications. A fixed endpoint typically uses one method and protocol to connect to cloud services and applications.

FTP: File Transfer Protocol.

HACKER: A person who invades others’ computers, inspecting or tampering with the programs or data stored on them.

HIPAA: The Health Insurance Portability and Accountability Act, enacted in 1996, under which the HIPAA Security Rule, the HIPAA Privacy Rule, and other HIPAA rules were created. Provisions of the act address issues of security and privacy with regard to protected health information (PHI). Subsequent revisions of the law as well as the HITECH (Health Information Technology for Economic and Clinical Health) Act expanded the privacy requirements and enforcement regulations.

HIPAA BREACH: Unauthorized acquisition, access, use, or disclosure of unsecured protected health information.

HIPAA BREACH NOTIFICATION RULE: A rule, published in the United States Federal Register as 45 CFR §§ 164.400-414, that requires HIPAA-covered entities and their business associates to provide notification following a breach of unsecured protected health information. Similar breach notification provisions implemented and enforced by the Federal Trade Commission apply to vendors of personal health records and their third-party service providers, pursuant to section 13407 of the HITECH Act.

HIPS: Host intrusion prevention system.

IaaS: See INFRASTRUCTURE AS A SERVICE

IAP: Internet access point.

IDENTITY PROTECTION: Establishing appropriate administrative, technical, or physical safeguards to ensure the security and confidentiality of records and to protect against any anticipated threats or hazards to their security or integrity that could result in substantial harm, embarrassment, inconvenience, or unfairness to any individual on whom the information is maintained.

ILLEGAL ACCESS AND DISCLOSURE: Activities of employees that involve improper systems access and sometimes disclosure of information found thereon, but not serious enough to warrant criminal prosecution.

INCIDENT: A reported adverse event or group of adverse events. An incident may also be an identified violation or imminent threat of violation of information technology (IT) security policies, acceptable use policies, standard security practices, or a threat to the security of system assets (per NIST SP 800-61 Rev2). The following are some examples of possible IT security incidents:

1. Loss of confidentiality of information

2. Compromise of integrity of information

3. Loss of system availability

4. Denial of service

5. Misuse of service, systems, or information

6. Damage to systems from malicious code attacks, such as viruses, Trojan horses, or ransomware.

INFORMATION: Any communication or reception of knowledge, such as facts, data, or opinions; including numerical, graphic, or narrative forms, whether oral or maintained in any other medium, including computerized databases, paper, microform, or magnetic tape.

INFORMATION SYSTEM(S): An interconnected set of information resources organized for the collection, processing, maintenance, use, sharing, dissemination, or disposition of information, under the same direct management control that shares common functionality. A system normally includes hardware, software, information, data, applications, communications, and people.

INFORMATION TECHNOLOGY (IT) SECURITY INCIDENT: An event involving an IT resource that has had an adverse effect on the confidentiality, integrity, or availability of that resource or connected resources (data or network devices). Prompt detection and the appropriate handling of such security incidents is necessary to protect (an organization’s) information technology and data assets. Events involving sensitive or protected health information contained in nonelectronic systems, data, and media must also be investigated to determine whether an actual breach has occurred.

INFRASTRUCTURE AS A SERVICE (IaaS): The capability provided to the consumer to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

INTEGRITY: Assurance that data are protected against unauthorized, unanticipated, or unintentional modification and/or destruction.

INTERNET: A worldwide electronic system of computer networks that provides communications and resource-sharing services to government employees, businesses, researchers, scholars, librarians, and students as well as the general public.

INTEGRITY: The property that data or information have not been altered or destroyed in an unauthorized manner. To ensure this, it may involve the use of programs and/or systems, and their monitoring, to detect attempts at improper system usage or information access and potential security breaches.

INTRUSION DETECTION: The use of programs and/or systems, and their monitoring, to detect attempts at improper system usage or information access and potential security breaches.

INTRUSION PREVENTION: The use of programs and/or systems, and their monitoring, to prevent attempts at improper system usage or information access and potential security breaches.

IPv4: Internet Protocol version 4.

IPv6: Internet Protocol version 6.

ISP: Internet service provider.

IT: Information technology.

LOCAL AREA NETWORK (LAN): A group of computers and other devices dispersed over a relatively limited area and connected by a communications link that enables any device to interact with any other on the network.

MALICIOUS SOFTWARE, MALICIOUS CODE (also known as MALWARE): The collective name for a class of programs intended to disrupt or harm systems and networks. The most widely known example of malicious software is the computer virus; other examples are Trojan horses and worms.

NETWORK: A group of computers and associated devices that are connected by communications facilities. A network can involve permanent connections, such as cables, or temporary connections made through telephone or other communications links, wired or wireless. A network can be as small as a LAN consisting of a few computers, printers, and other devices, or it can consist of many small and large computers distributed over a vast geographic area.

OSINT: Open source intelligence.

OWASP: Open Web Application Security Project.

PASSWORD(S): A confidential character string used to authenticate an identity or prevent unauthorized access. Passwords are most often associated with user authentication. However, they are also used to protect data and applications on many systems, including PCs. Password-based access controls for PC applications are often easy to circumvent if the user has access to the operating system (and knowledge of what to do).

PATCH/PATCHES: Vendor-supplied system and software updates designed to correct faults or vulnerabilities in installed systems and devices.

PERIMETER SECURITY: The use of technical means to protect the boundaries of systems and networks used to maintain or transmit personal or private information. Such means may include the use of firewalls and devices to detect or prevent unauthorized intrusion into systems and networks.

PERSONALLY IDENTIFIABLE INFORMATION (PII): Information in any form that consists of a combination of an individual’s name and one or more of the following: Social Security number, driver’s license or state ID, account numbers, credit card numbers, debit card numbers, personal code, security code, password, personal ID number, photograph, fingerprint, or other information that could be used to identify an individual.

PERSONAL INFORMATION: Personal information has many definitions including definitions by statute that may vary from state to state. Most generally, personal information is a combination of data elements that could uniquely identify an individual. Review applicable federal and state data breach statutes to determine what definition of personal information is applicable for purposes of any particular policy, procedure, or document.

PLATFORM AS A SERVICE (PaaS): The capability provided to the consumer to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

POLICY: A high-level statement of enterprise beliefs, goals, and objectives and the general means for their attainment for a specified subject area.

PRIVACY: Information privacy is the assured, proper, and consistent collection, processing, communication, use, and disposition of personal information and personally identifiable information throughout its life cycle.

PRIVATE CLOUD: Cloud infrastructure provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

PRIVATE INFORMATION: Information protected by the Privacy Act, referring to personally identifiable information, personal information, and protected health information collectively.

PROTECTED HEALTH INFORMATION: Individually identifiable health information except for education records covered by FERPA and employment records.

PUBLIC CLOUD: The cloud infrastructure provisioned for open use by the general public. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider.

RANSOMWARE: Ransomware is a type of malware that attempts to deny access to a user’s data, usually by encrypting the data using a key known only to the hacker who deployed the malware. The hacker then directs the user to pay a ransom (usually in an untraceable cryptocurrency such as Bitcoin) in order to be provided with the decryption key. Users also run the risk of having their data destroyed, exfiltrated, or restored only in part, to facilitate further ransom demands.

REMOTE ACCESS: The ability for an organization’s users to access its nonpublic computing resources from locations external to the organization’s facilities.

RISK: An ongoing or impending concern that has a significant probability of adversely affecting business continuity. The potential for harm or loss. Risk is best expressed as the answers to these four questions:

1. What could happen? (What is the threat?)

2. How bad could it be? (What is the impact or consequence?)

3. How often might it happen? (What is the frequency?)

4. How certain are the answers to the first three questions? (What is the degree of confidence?)

The key element among these is the issue of uncertainty captured in the fourth question. If there is no uncertainty, there is no “risk” per se.

RISK ANALYSIS: A process whereby cost-effective security/control measures may be selected by balancing costs of various security control measures against the losses that would be expected if those measures were not in place.

RISK ASSESSMENT: The identification and study of the vulnerabilities of a system and the possible threats to its security.

SaaS: Software as a service.

SECURITY INCIDENT: HIPAA definition: Any attempted or successful unauthorized access, use, disclosure, modification, or destruction of information, or interference with operations in an information system. FIPS (Federal Information Processing Standards) Publication 200 definition: An occurrence that actually or potentially jeopardizes the confidentiality, integrity, or availability of an information system or the information the system processes, stores, or transmits or that constitutes a violation or imminent threat of violation of security policies, security procedures, or acceptable use policies.

SHARED CONTROL: A control that is managed and implemented partially by the cloud service provider and partially by the customer.

THREAT: An entity or event with the potential to harm the system. Typical threats are errors, fraud, disgruntled employees, fires, water damage, hackers, and viruses.

THREAT ACTOR: An entity that initiates the launch of a threat agent.

THREAT AGENT: An element that provides the delivery mechanism for a threat.

USER: The person with authorized access who uses a computer system and its application programs to perform tasks and produce results.

VIRTUAL LOCAL AREA NETWORK (also VLAN or Virtual LAN): A logical network, typically created within a network device, usually used to segment network traffic for administrative, performance, and/or security purposes. Because VLANs are based on logical instead of physical connections, they are extremely flexible.

VIRTUAL PRIVATE NETWORK (VPN): A connection that allows an organization to extend its internal/private network to a remote location through an untrusted network (e.g., the Internet).

VIRUS: A program that “infects” computer files, usually executable programs, by inserting a copy of itself into the file. These copies are usually executed when the “infected” file is loaded into memory, allowing the virus to infect other files. Unlike the computer worm, a virus requires human involvement (usually unwitting) to propagate.

VULNERABILITY: A condition or weakness in (or absence of) security procedures, technical controls, physical controls, or other controls that could be exploited by a threat.

WIDE AREA NETWORK (WAN): A group of computers and other devices dispersed over a wide geographical area (usually regional or national) that are connected by communications links. A WAN is a communications network that connects geographically separated areas.

WORLDWIDE WEB (WWW or WEB): The collection of electronic pages (documents) that are developed in accordance with the HTML (HyperText Markup Language) web format standard and may be accessed via Internet connections.

WORM: A program that propagates itself across computers, usually by spawning copies of itself in each computer’s memory. A worm might duplicate itself in one computer so often that it causes the computer to crash.