The Data Dilemma in Cybersecurity

Last week, the Wall Street Journal reported that the “scarcity of data needed to train models is slowing progress” toward the promise of fortifying cybersecurity for private and public sector customers alike. Anna Trikalinou, a lead security researcher at Intel, warns that attackers are using machine learning and AI, but cybersecurity professionals are in danger of falling woefully behind 1.

We can’t blame security leaders who are drowning in data overload and alert fatigue for looking to automation, machine learning and AI as a life raft out of the whack-a-mole world of their security programs. But what gives? How can we have too many alerts and not enough data at the same time? Is there too much data or not enough?

As the WSJ article suggests but falls short of emphasizing, the core of the issue is not a general lack of data, but rather a lack of centralized and normalized data that can be used to train models.

It’s in our nature to covet the ‘easy button’, but despite what that AI vendor’s marketing literature tells you or what the analyst oracles promise in their most recent magic report, the fundamental truth is that automation, machine learning and AI are only as effective as the data behind them is plentiful and clean.

As the 2020 Ponemon Study of Cyber Resilience points out, 70%+ of enterprise security leaders report that data silos, fragmented tools and lack of integration are the key barriers to cyber resilience. It’s not that there isn’t enough data; it’s that the data sits locked inside a fragmented ecosystem of detection and response tools. These silos have their own data models, data structures and data formats. If the average security team has dozens of tools spitting out alerts in their own formats, who is going to do the centralizing and normalizing of that data?

The external threat intel provider universe is no different. In 2020, researchers from the Hasso Plattner Institute and the Delft University of Technology in the Netherlands collaborated on a study that found significant disparities across intelligence providers in data structures, context and scoring, even among expensive providers that claimed to be tracking the same attackers 2.

The only path forward for an industry that is drowning in threat data and still seems to lack enough data to automate and machine-learn our way out of the challenge is to take a data-centric approach. As security leaders, we have to start valuing our internal and external sources of alerts and threat data as the primary elements of our ecosystem. Unsexy but critical missions and projects must first point toward centralizing, normalizing and integrating this data across an ecosystem of tools before we can expect real progress in automation, ML and AI.  

As TruSTAR co-founder Paul Kurtz pointed out last week in a blog post with some spot-on pop-culture references, enterprises will not progress forward without an ability to recall the past. Similarly, the Cloud Security Alliance research group for Cloud-based Intelligent Ecosystems called for a move toward “Cyber Memory”:

Rather than moving from one event to another we need to absorb what we learn from past events and build "cyber memory" with the ability to recall and connect event data gathered from across security systems. Creating a "virtual memory" will enable machine learning (ML) to more effectively and efficiently address evolving malicious activity. 

As an industry, we have to stop looking for tools promising automation, ML, or AI in a box. We have to start embracing solutions that focus on the critical prerequisites for progress like centralizing, normalizing and integrating. These solutions will rely on a different set of hyphenated buzzwords that all security leaders would do well to commit to their own cyber memory:

  • Cloud-native: The cloud is the right place to centralize and scale historical event collection and threat intelligence aggregation. TruSTAR Enclaves are the easily integrated storage solution to help our enterprise, MSP and ISAC/ISAO customers centralize events while preserving privacy and properly controlling access. See here how LogMeIn is using Enclaves for this today.

  • Data-centric: After centralization, the priority is normalization of structures, schemas and scoring frameworks. We have to be able to normalize ServiceNow cases, Splunk notable events, CrowdStrike alerts, and suspicious emails forwarded to the SOC, while also normalizing data across different threat intel sources and their various schemas and scoring frameworks.

  • API-first: The right solutions to drive your program toward automation, ML and eventually AI will have a user interface, but it is not a ‘single-pane-of-glass’ for all your problems. Alerts and insights will pass through normalization and enrichment pipelines and integrate with your detection, response, automation and business intelligence dashboards of choice.
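
To make the normalization step above a little more concrete, here is a minimal Python sketch of what mapping two differently shaped alerts onto one shared schema might look like. Everything in it is an assumption made for illustration: the `NormalizedEvent` schema, the field names (`epoch`, `ioc`, `urgency`, `timestamp`, `indicator_value`, `score`) and the severity scales are hypothetical and do not reflect the actual export formats of any product named in this post, nor how TruSTAR implements normalization.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NormalizedEvent:
    """Hypothetical shared schema that every source is mapped onto."""
    source: str            # which tool produced the event
    observed_at: datetime  # always timezone-aware UTC
    indicator: str         # e.g. a domain, IP address or file hash
    severity: float        # rescaled to the range 0.0 - 1.0


# Assumed mapping from text labels to a common 0-1 scale.
SEVERITY_LABELS = {"low": 0.25, "medium": 0.5, "high": 0.75, "critical": 1.0}


def from_siem(raw: dict) -> NormalizedEvent:
    """Hypothetical SIEM alert: epoch seconds and a text severity label."""
    return NormalizedEvent(
        source="siem",
        observed_at=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        indicator=raw["ioc"].strip().lower(),
        severity=SEVERITY_LABELS[raw["urgency"].lower()],
    )


def from_edr(raw: dict) -> NormalizedEvent:
    """Hypothetical EDR alert: ISO 8601 timestamp and a 0-100 score."""
    observed = datetime.fromisoformat(raw["timestamp"]).astimezone(timezone.utc)
    return NormalizedEvent(
        source="edr",
        observed_at=observed,
        indicator=raw["indicator_value"].strip().lower(),
        severity=raw["score"] / 100,
    )


def seen_by_multiple_sources(events: list) -> dict:
    """Group events by indicator; keep indicators reported by 2+ sources."""
    by_indicator = defaultdict(list)
    for event in events:
        by_indicator[event.indicator].append(event)
    return {
        indicator: group
        for indicator, group in by_indicator.items()
        if len({event.source for event in group}) > 1
    }


if __name__ == "__main__":
    events = [
        from_siem({"epoch": 1622548800, "ioc": " EVIL.example.com ", "urgency": "High"}),
        from_edr({"timestamp": "2021-06-01T12:05:00+00:00",
                  "indicator_value": "evil.example.com", "score": 80}),
    ]
    for indicator, group in seen_by_multiple_sources(events).items():
        print(indicator, [(e.source, e.severity) for e in group])
```

Once both alerts share one shape, the same indicator reported by two tools can be recalled and connected with a simple lookup, which is the kind of question that is hard to ask while each tool keeps its own format and scoring scale.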

To learn more about how these principles and others can drive the future of your cyber resilience, check out our Data-Centric Security Automation white paper here.


