How to Build a Data-Centric SOC

Managing intelligence in enterprise security is about managing data to drive automation. 

While the traditional Intelligence Cycle is used in national security, we recommend shifting security operations closer to data management and data science cycles. For security operations, the faster we disabuse ourselves of the notion of the art of intelligence and embrace the science of data, the faster security leaders will achieve their automation goals. 

While there are many different frameworks and life cycles for data science, they all have some flavor of the below. 

Data-Centric Security Automation Cycle

  • Define - Plan and define the problem to be solved
  • Collect - Ingest relevant data sets
  • Enrich - Normalize, transform, enrich, and prepare the data 
  • Prioritize - Deploy logic models to surface insights
  • Connect - Integrate insights into an ecosystem
  • Evaluate - Evaluate feedback and effectiveness 



Just like in traditional intelligence operations, it’s important to start with the problem or requirement and not the data.

If the core challenge is accelerating operational efficiency through automation, we can turn to the two widely accepted success metrics for evaluating security operations effectiveness:

Mean-Time-to-Detect (MTTD) - how long it takes to find out that something is bad

Mean-Time-to-Respond (MTTR) - how long it takes to stop it

Different companies measure things in different ways. Your MTTD and MTTR depend on a number of factors, including the size and complexity of your network, the size and expertise of your IT staff, and the complexity of your industry. There are no industry-standard approaches to measuring MTTD and MTTR, so granular comparisons between organizations can be like comparing apples to oranges.

According to the SANS 2019 Incident Response survey, 52.6% of organizations had an MTTD of less than 24 hours, while 81.4% had an MTTD of 30 days or less.

Once an incident is detected, 67% of organizations report an MTTR of less than 24 hours, with that number increasing to 95.8% when measuring an MTTR of less than 30 days. However, according to the Verizon Data Breach Investigations Report, 56% of breaches took months or longer to discover at all. 

Plenty of resources and formulas exist to help you track these figures, and we point to them in our upcoming blog series. More important than how you calculate them, or what your MTTR and MTTD are today, is agreeing that these are the north star metrics and the targets for improvement for automation in enterprise security operations.
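Because there is no standard way to measure these metrics, the following is only a minimal sketch of one possible calculation, assuming each incident record carries three timestamps. The field names (`occurred`, `detected`, `resolved`) and the sample incidents are hypothetical, as is the choice to start the MTTR clock at detection.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the activity began, when it was
# detected, and when it was stopped. Field names are illustrative only.
incidents = [
    {"occurred": "2020-03-01T08:00", "detected": "2020-03-01T20:00", "resolved": "2020-03-02T02:00"},
    {"occurred": "2020-03-05T10:00", "detected": "2020-03-06T10:00", "resolved": "2020-03-06T22:00"},
]


def hours_between(start, end):
    """Elapsed hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def mean_time_to_detect(records):
    """Average hours from the start of the activity to its detection."""
    return mean(hours_between(r["occurred"], r["detected"]) for r in records)


def mean_time_to_respond(records):
    """Average hours from detection to resolution (an assumed definition)."""
    return mean(hours_between(r["detected"], r["resolved"]) for r in records)


mttd = mean_time_to_detect(incidents)
mttr = mean_time_to_respond(incidents)
```

With the two sample incidents, this yields an MTTD of 18 hours and an MTTR of 9 hours. Whatever start and stop points you choose, applying them consistently is what makes the trend over time meaningful.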


For Data-Centric Security Automation, your ecosystem is viewed through the lens of sources and destinations for your intel. Before you can start accelerating automation, you have to take an inventory of your intel sources. 

There are many different ways to organize your data sources for Data-Centric Security Automation, but the simplest place to start is identifying internal vs. external sources.

Internal Intel Sources

Internal sources of intelligence are the historical events unique to your enterprise, such as incident reports, tickets, cases, and suspicious emails: anything that captures technical information about a historical event that can inform how to prioritize a future event. 

The most valuable, and most often overlooked, intelligence an enterprise has is its own historical data about previous events.

It is not uncommon for a seemingly sophisticated SOC at a global Fortune 500 enterprise to be buying more external sources while it is not properly collecting its own historical incidents and events for future enrichment. Too often, the suspicious IP address in an escalated incident report today is the same one that was closed as a false positive yesterday. Now multiply this scenario across all your apps for SIEM, EDR, Vulnerability, Email Gateways, and IR, and you can quickly see how internal intel is the quickest way to answer, “Has my team seen this before?”
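To make the “Has my team seen this before?” check concrete, here is a minimal sketch that indexes closed cases by indicator so prior verdicts can be looked up when a new alert arrives. The case records, field names, and function names are all hypothetical; in practice the data would come from your ticketing, SIEM, or IR tools.

```python
from collections import defaultdict

# Hypothetical closed cases exported from internal tools.
closed_cases = [
    {"case_id": "IR-101", "indicator": "198.51.100.7", "verdict": "false-positive"},
    {"case_id": "IR-102", "indicator": "bad.example.com", "verdict": "malicious"},
    {"case_id": "IR-117", "indicator": "198.51.100.7", "verdict": "false-positive"},
]


def build_history(cases):
    """Index prior (case_id, verdict) pairs by lowercased indicator value."""
    history = defaultdict(list)
    for case in cases:
        history[case["indicator"].lower()].append((case["case_id"], case["verdict"]))
    return history


def seen_before(history, indicator):
    """Return prior (case_id, verdict) pairs, or an empty list if unseen."""
    return history.get(indicator.lower(), [])


history = build_history(closed_cases)
prior = seen_before(history, "198.51.100.7")
```

Here `prior` holds two earlier false-positive verdicts for the same IP address, which is exactly the context an analyst would otherwise have to hunt for by hand.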

External Intel Sources

External intel sources provide signals about maliciousness through feeds and reports on actors, campaigns, and malware based on external knowledge and often proprietary techniques. These external intel sources are useful for calibrating ‘ground truth’ on maliciousness. 

External sources typically come in two types:

  • Closed Sources - These are gated sources that require a commercial intelligence subscription or membership in a group, such as an ISAC/ISAO. The purveyors of these sources should be intelligence specialists and abide by a more traditional view of tradecraft and the intelligence lifecycle to effectively curate and disseminate valuable enrichment data and finished intelligence. 
  • Open Sources - These are ungated sources available to anybody and include blogs, RSS feeds, and Open APIs. Because these sources are ‘open’, there are fewer resources to invest in curation and preparation, and they are often considered the noisiest sources, with the lowest fidelity for the enterprise. This shifts the burden of preparation and prioritization to the enterprise. 
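One way to start the inventory is a simple record per source that captures the categories above. This is a sketch only: the `IntelSource` fields and the three example sources are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntelSource:
    name: str
    origin: str  # "internal" or "external"
    access: str  # "closed" or "open"; "n/a" for internal sources


# Hypothetical inventory entries.
inventory = [
    IntelSource("IR case history", "internal", "n/a"),
    IntelSource("Sector ISAC feed", "external", "closed"),
    IntelSource("Public blog RSS", "external", "open"),
]


def by_origin(sources, origin):
    """List the names of sources with the given origin."""
    return [s.name for s in sources if s.origin == origin]


internal_sources = by_origin(inventory, "internal")
external_sources = by_origin(inventory, "external")
```

Even a list this small makes it obvious when external sources outnumber the internal ones you are actually collecting.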

The most valuable intelligence an enterprise has is its own historical data about previous events, because it holds a record of opinions about maliciousness specific to your enterprise.


Now that we have a good understanding of the ‘why’ and the ‘what’ behind intelligence for security automation, we can focus on the ‘how’. Data preparation is a core stage in any data science project and a core mission of Data-Centric Security Automation.

Preparation is about ‘cleaning’ data and transforming it so that it can be used in automation. Preparation is a key part of Data-Centric Security Automation because it improves your data quality and, in doing so, increases overall productivity. When you clean your data, all outdated or incorrect elements get eliminated – leaving you with the highest fidelity intelligence.

For Data-Centric Security Automation, this often comes down to looking at the level of structure in the data in the intel source. Structured intel sources are easier to prepare, whereas unstructured sources require more inference and logic and therefore leave more room for error.
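As an illustration of preparing an unstructured source, the sketch below pulls IPv4 addresses out of free text, undoes common “defanging”, discards invalid matches, and removes duplicates. It is a minimal example under stated assumptions: the function names are hypothetical, and only two defanging conventions (`[.]` and `hxxp`) are handled.

```python
import ipaddress
import re

# Loose pattern for dotted quads; candidates are validated afterwards.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def refang(text):
    """Undo two common defanging conventions before extraction."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def extract_ipv4(text):
    """Return unique, valid IPv4 addresses in order of first appearance."""
    found = []
    for candidate in IPV4_CANDIDATE.findall(refang(text)):
        try:
            normalized = str(ipaddress.IPv4Address(candidate))
        except ValueError:
            continue  # matches the pattern but is not a valid address
        if normalized not in found:
            found.append(normalized)
    return found


note = "Beacon to 198.51.100[.]7 and 198.51.100.7, also 999.1.1.1 seen."
addresses = extract_ipv4(note)
```

The inference step is where errors creep in: the regular expression alone would accept `999.1.1.1`, which is why each candidate is validated before it is kept.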

In addition to the level of structure in the intel source, security leaders must also consider the format, identifying the types of objects to extract and normalize and the types of attributes to support, such as the Diamond Model, kill chain, and MITRE ATT&CK. Here is where STIX is helpful as a standard ontology for expressing cybersecurity events in machine-readable language.
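For a sense of what a machine-readable expression looks like, here is a sketch that wraps an IPv4 address in a STIX 2.1-style Indicator object with an ATT&CK tactic attached as a kill chain phase. The object is hand-built with the standard library rather than produced by a STIX library, and it has not been validated against the specification. The function name and its parameters are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def to_stix_indicator(ipv4, attack_tactic):
    """Build a STIX 2.1-style Indicator dict for one IPv4 address."""
    # Millisecond-precision UTC timestamp, e.g. 2020-03-01T08:00:00.000Z
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "pattern": f"[ipv4-addr:value = '{ipv4}']",
        "pattern_type": "stix",
        "valid_from": now,
        "kill_chain_phases": [
            {"kill_chain_name": "mitre-attack", "phase_name": attack_tactic}
        ],
    }


indicator = to_stix_indicator("198.51.100.7", "command-and-control")
serialized = json.dumps(indicator, indent=2)
```

Once every source is normalized into a shape like this, downstream logic can treat an indicator from an ISAC feed and one from an old incident report the same way.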

Below is a diagram of how TruSTAR's data science team weights objects and attributes from internal and external sources. As you read the diagram from left to right, you see objects increase in relevancy for enterprise security detection and response:


In a data-centric view, you’re thinking about how quickly you can translate the heuristics an analyst uses to tag data into patterns and models that can be coded, deployed, and tuned over time, and how quickly you can get the outcomes of those heuristics into downstream apps for automation.

Many threat intelligence sources come with built-in prioritization frameworks, aka ‘scores’. Prioritization is first about applying, weighting, and leveraging scores, often across multiple sources, to surface the most relevant data for detection and response. 

But if the data isn’t scored, prioritization reverts to something closer to an art.

Different sources can contribute to automation at different levels of immediacy. Structured and scored intel feeds should be operationalized immediately, while internal historical data carries a lot of signal but is harder to deploy for automation.

For your external threat intel sources from vendors and ISACs/ISAOs, part of the value from their intelligence is prioritization through their scores. You cannot expect them all to have the same scoring framework or methodology, but if they are not applying a score then you will have to find a way to infer some concept of scores or other attribute labels for the data to be useful in automation. Similarly, you will want to apply a scoring framework for your internal, historical events to ensure these rich sources are contributing to your operational outcomes. 
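Since sources will not share a scoring framework, one simple approach is to rescale each source's score to a common 0 to 1 range and then take a weighted average. The sketch below assumes this approach; the source names, scales, and trust weights are hypothetical and would need tuning for a real environment.

```python
# Hypothetical sources, each with its own score scale and an assumed weight.
SOURCE_SCALES = {
    "vendor_a": {"min": 0, "max": 100, "weight": 0.5},
    "isac_feed": {"min": 1, "max": 5, "weight": 0.3},
    "internal_history": {"min": 0, "max": 1, "weight": 0.2},
}


def normalize(source, raw_score):
    """Rescale a source's raw score to the range 0 to 1."""
    scale = SOURCE_SCALES[source]
    return (raw_score - scale["min"]) / (scale["max"] - scale["min"])


def priority(scores):
    """Weighted average over only the sources that scored this indicator."""
    if not scores:
        raise ValueError("no scores to combine")
    total_weight = sum(SOURCE_SCALES[s]["weight"] for s in scores)
    weighted = sum(
        SOURCE_SCALES[s]["weight"] * normalize(s, raw) for s, raw in scores.items()
    )
    return weighted / total_weight


# An indicator scored 80/100 by one vendor and 3/5 by an ISAC feed.
combined = priority({"vendor_a": 80, "isac_feed": 3})
```

Dividing by the weight of only the sources that actually scored the indicator keeps an indicator from being penalized just because some feeds have never seen it.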

In data-centric terms, there are only sources and destinations, and data moves from sources to destinations through pipelines that are governed by prioritization models. These are your Intel Workflows. 

Your Intel Workflows are automated processes that transform intelligence and apply it to business processes. For example, intelligence workflows can take labels from different, independent intel sources to create a higher-fidelity data set that is published to a destination, often a downstream App to help automate a decision. 

Intel Workflows move away from preparation and prioritization as something only a highly-trained human analyst can do, and embrace preparation and prioritization as a configuration that can be applied to your sources at scale to automatically serve up insights that will accelerate automation across various detection and response apps downstream. 
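To show what “prioritization as a configuration” could look like, here is a minimal sketch of a workflow that selects indicators by source and score threshold and publishes them to a destination. The `Workflow` and `Destination` classes, their fields, and the sample indicators are hypothetical; `publish` simply collects items where a real workflow would call a downstream app's API.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    name: str
    sources: set        # which intel sources feed this workflow
    min_score: float    # prioritization threshold on a 0 to 1 scale
    destination: str    # key of the downstream destination


@dataclass
class Destination:
    name: str
    received: list = field(default_factory=list)

    def publish(self, indicator):
        # Stand-in for a call to a downstream app's API.
        self.received.append(indicator)


def run(workflow, indicators, destinations):
    """Publish indicators that match the workflow's sources and threshold."""
    target = destinations[workflow.destination]
    published = 0
    for ind in indicators:
        if ind["source"] in workflow.sources and ind["score"] >= workflow.min_score:
            target.publish(ind)
            published += 1
    return published


indicators = [
    {"value": "198.51.100.7", "source": "isac_feed", "score": 0.9},
    {"value": "bad.example.com", "source": "open_blog", "score": 0.95},
    {"value": "203.0.113.9", "source": "internal_history", "score": 0.4},
]
destinations = {"siem": Destination("siem")}
siem_watchlist = Workflow("siem-watchlist", {"isac_feed", "internal_history"}, 0.7, "siem")
published = run(siem_watchlist, indicators, destinations)
```

Only the first indicator reaches the destination: the second comes from a source the workflow does not include, and the third falls below the threshold. Changing that behavior is a configuration edit, not an analyst task.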


With Intel Workflows surfacing prioritized indicators from a mix of internal and external intel sources, the next lifecycle stage for Data-Centric Security Automation is about integrating intelligence into your ecosystem of detection, response and SOAR applications. 

The Ponemon Study underscored this need:

An excessive use of disconnected tools creates complex environments, which can inhibit efficiency. The study revealed that the number of security solutions and technologies an organization used had an adverse effect on its ability to detect, prevent, contain and respond to a cybersecurity incident. 

It’s not that these apps are unimportant. They are critical to accelerating automation, but they need to be fed prepared and prioritized data from the intel sources upstream to become ‘intelligent’ enough for automation. These apps, or destinations, are the places in your enterprise intelligence ecosystem where human behavioral automation or actions in your playbooks happen.

Robust APIs are how you get connectivity into the App world. APIs give you extensibility, helping you connect applications that don't have out-of-the-box integrations. However, the push for “tool rationalization” would serve security automation goals better if it were rebranded as “data rationalization.”


Consciously or not, security and intelligence practitioners in the enterprise have fended off automation and ROI for longer than most other departments in the enterprise. 

Traditional notions of intelligence and tradecraft have led us to legacy ways of demonstrating value. Too often, we use a fancy-looking report on APT “X” to convince others that the work we do is complex enough and scary enough to give us resources and leave us alone to do our jobs. They leave the meeting feeling sufficiently scared and grateful it’s not their job, and we grab our cloak and dagger and leave with the budget we wanted.

Dave McComb, in his book The Data-Centric Revolution: Restoring Sanity to Enterprise Information Systems, references this as the classic technique used by the “High Priests” of complexity. Whether you're an engineering leader, a data scientist, or a cybersecurity professional, you deal with a significant level of complexity, and it’s easier to default to elevating the complexity as a way to gain short-term trust instead of surfacing measurable outcomes. But, as McComb points out, these “High Priests further entrench silos” rather than punching through them. 

If incidents are increasing in both volume and severity, then it’s the enterprise security leader’s job to allocate resources efficiently to keep up. Any senior executive from any department can relate to that.

Data-Centric Security Automation is focused on demonstrating that you are detecting and responding to threats faster through improving MTTD and MTTR as your north star metrics. 

Once you’ve established these as your targets, the knobs you will twist begin with understanding how your primary assets - your intelligence sources - are performing for you in this mission.

If you are still early in your journey to adopting and implementing these metrics, then you can default to qualitative analysis by asking your security analysts: how much time are you spending ‘hunting and pecking’ for enrichment data?

For a more in-depth look at how to build a data-centric SOC, read our white paper in its entirety here.