Securing the Supply Chain: Data Risks in a Connected World

Learn how to secure hyperconnected supply chains by segmenting telemetry pipelines, enforcing data masking, and adding visibility. Traditional vendor risk management isn’t enough.

January 9, 2026
Blog - Securing the Supply Chain: Data Risks in a Connected World | Databahn

Modern enterprises depend on a complex mesh of SaaS tools, observability agents, and data pipelines. Each integration, whether a cloud analytics SDK, IoT telemetry feed, or on–prem collector, can become a hidden entry point for attackers. In fact, recent incidents show that breaches often begin outside core systems. For example, OpenAI’s November 2025 disclosure revealed that a breach of their third party analytics vendor Mixpanel exposed customers’ names, emails and metadata. This incident wasn’t due to a flaw in OpenAI’s code at all, but to the telemetry infrastructure around it. In an age of hyperconnected services, traditional security perimeters don’t account for these “data backdoors.” The alarm bells are loud, and we urgently need to rethink supply chain security from the data layer outwards.

Why Traditional Vendor Risk Management Falls Short

Most organizations still rely on point-in-time vendor assessments and checklists. But this static approach can’t keep up with a fluid, interconnected stack. In fact, SecurityScorecard found that 88% of CISOs are concerned about supply chain cyber risk, yet many still depend on passive compliance questionnaires. As GAN Integrity notes, “historically, vendor security reviews have taken the form of long form questionnaires, manually reviewed and updated once per year.” By the time those reports are in hand, the digital environment has already shifted. Attackers exploit this lag: while defenders secure every connection, attackers “need only exploit a single vulnerability to gain access”.

Moreover, vendor programs often miss entire classes of risk. A logging agent or monitoring script installed in production seldom gets the same scrutiny as a software update, yet it has deep network access. Legacy vendor risk tools rarely monitor live data flows or telemetry health. They assume trusted integrations remain benign. This gap is dangerous: data pipelines often traverse cloud environments and cross organizational boundaries unseen. In practice, this means today’s “vendor ecosystem” is a dynamic attack surface that traditional methods simply weren’t designed to cover.

Supply Chain Breaches: Stats and Incidents

The scale of the problem is now clear. Industry data show supply chain attacks are becoming common, not rare. The 2025 Verizon Data Breach Investigations Report found that nearly 30% of breaches involved a third party, up sharply from the prior year. In a SecurityScorecard survey, over 70% of organizations reported at least one third party cybersecurity incident in the past year  and 5% saw ten or more such incidents. In other words, it’s now normal for a large enterprise to deal with multiple vendor-related breaches per year.

Highprofile cases make the point vividly. Classic examples like the 2013 Target breach (via an HVAC vendor) and 2020 SolarWinds attack demonstrate how a single compromised partner can unleash devastation. More recently, attackers trojanized a trusted desktop app in 2023: a rogue update to the 3CX telecommunications software silently delivered malware to thousands of companies. In parallel, the MOVEit Transfer breach of 2023 exploited a zero-day in a file transfer service, exposing data at over 2,500 organizations worldwide. Even web analytics are not safe: 2023’s Magecart attacks injected malicious scripts into ecommerce payment flows, skimming card data from sites like Ticketmaster and British Airways. These incidents show that trusted data pipelines and integrations are attractive targets, and that compromises can cascade through many organizations.

Taken together, the data and stories tell us: supply chain breaches are systemic. A small number of shared platforms underpin thousands of companies. When those are breached, the fallout is widespread and rapid. Static vendor reviews and checklists clearly aren’t enough.

Telemetry Pipelines as an Attack Surface

The modern enterprise is drowning in telemetry: logs, metrics, traces, and events flowing continuously from servers, cloud services, IoT devices and business apps. This “data exhaust” is meant for monitoring and analysis, but its complexity and volume make it hard to control. Telemetry streams are typically high volume, heterogeneous, and loosely governed. Importantly, they often carry sensitive material: API keys, session tokens, user IDs and even plaintext passwords can slip into logs. Because of this, a compromised observability agent or analytics SDK can give attackers unintended visibility or access into the network.

Without strict segmentation, these pipelines become free-for-all highways. Each new integration (such as installing a SaaS logging agent or opening a firewall for an APM tool)  expands the attack surface. As SecurityScorecard puts it, every vendor relationship “expands the potential attack surface”. Attackers exploit this asymmetry: defending hundreds of telemetry connectors is hard, but an attacker needs only one weak link. If a cloud logging service is misconfigured or a certificate is expired, an adversary could feed malicious data or exfiltrate sensitive logs unnoticed. Even worse, an infiltrated telemetry node can act as a beachhead: from a log agent living on a server, an attacker might move laterally into the production network if there are no micro-segmentation controls.

In short, modern telemetry pipelines can greatly amplify risk if not tightly governed. They are essentially hidden corridors through which attackers can slip. Security teams often treat telemetry as “noise,” but adversaries know it contains a wealth of context and credentials. The moment a telemetry link goes unchecked, it may become a conduit for data breaches.

Securing Telemetry with a Security Data Fabric

To counter these risks, organizations are turning to the concept of a security data fabric. Rather than an adhoc tangle of streams, a data fabric treats telemetry collection and distribution as a controlled, policy-driven network. In practice, this means inserting intelligence and governance at the edges and in - flight, rather than only at final destinations. A well implemented security data fabric can reduce supply chain risk in several ways:

  • Visibility into third - party data flows. The fabric provides full data lineage, showing exactly which events come from which sources. Every log or metric is tagged and tracked from its origin (e.g. “AWS CloudTrail from Account A”) to its destination (e.g. “SIEM”), so nothing is blind. In fact, leading security data fabrics offer full lifecycle visibility, with “silent device” alerts when an expected source stops sending data. This means you’ll immediately notice if a trusted telemetry feed goes dark (possibly due to an attacker disabling it) or if an unknown source appears.
  • Policy - driven segmentation of telemetry pipelines. Instead of a flat network where all logs mix together, a fabric enforces routing rules at the collection layer. For example, telemetry from Vendor X’s devices can be automatically isolated to a dedicated stream. DataBahn’s architecture, for instance, allows “policy-driven routing” so teams can choose that data goes only to approved sinks. This micro-segmentation ensures that even if one channel is compromised, it cannot leak data into unrelated systems. In effect, each integration is boxed to its own lane unless explicitly allowed, breaking the flat trust model.
  • Real-time masking and filtering at collection. Because the fabric processes data at the edge, it can scrub or redact sensitive content before it spreads. Inline filtering rules can drop credentials, anonymize PII, or suppress noisy events in real time. The goal is to “collect smarter” by shedding high risk data as early as possible. For instance, a context-aware policy might drop repetitive health - check pings while still preserving anomaly signals. Similarly, built -in “sensitive data detection” can tag and redact fields like account IDs or tokens on the fly. By the time data reaches the central tools, it’s already compliance safe, meaning a breach of the pipeline itself exposes far less.
  • Alerting on silent or anomalous telemetry. The fabric continuously monitors its own health and pipelines. If a particular log source stops reporting (a “silent integration”), or if volumes suddenly spike, security teams are alerted immediately. Capabilities like schema drift tracking and real-time health metrics detect when an expected data source is missing or behaving oddly. This matters because attackers will sometimes try to exfiltrate data by quietly rerouting streams; a security data fabric won’t miss that. By treating telemetry streams as security assets to be monitored, the fabric effectively adds an extra layer of detection.

Together, these capabilities transform telemetry from a liability into a defense asset. By making data flows transparent and enforceable, a security data fabric closes many of the gaps that attackers have exploited in recent breaches. Crucially, all these measures are invisible to developers: services send their telemetry as usual, but the fabric ensures it is tagged, filtered and routed correctly behind the scenes.

Actionable Takeaways: Locking Down Telemetry

In a hyperconnected architecture, securing data supply chains requires both visibility and control over every byte in motion. Here are key steps for organizations:

  • Inventory your telemetry. Map out every logging and monitoring integration, including cloud services, SaaS tools, IoT streams, etc. Know which teams and vendors publish data into your systems, and where that data goes.
  • Segment and policy-enforce every flow. Use firewalls, VPC rules or pipeline policies to isolate telemetry channels. Apply the principle of least privilege: e.g., only allow the marketing analytics service to send logs to its own analytics tool, not into the corporate data lake.
  • Filter and redact early. Wherever data is collected (at agents or brokers), enforce masking rules. Drop unnecessary fields or PII at the source. This minimizes what an attacker can steal from a compromised pipeline.
  • Monitor pipeline health continuously. Implement tooling or services that alert on anomalies in data collection (silence, surges, schema changes). Treat each data integration as a critical component in your security posture.

The rise in supply chain incidents shows that defenders must treat telemetry as a first-class security domain, not just an operational convenience. By adopting a fabric mindset, one that embeds security, governance and observability into the data infrastructure, enterprises can dramatically shrink the attack surface of their connected environment. In other words, the next time you build a new data pipeline, design it as a zero-trust corridor: assume nothing and verify everything. This shift turns sprawling telemetry into a well-guarded supply chain, rather than leaving it an open backdoor.

Ready to unlock full potential of your data?
Share

See related articles

Overall Incident Trends

  • 16,200 AI-related security incidents in 2025 (49% increase YoY)
  • ~3.3 incidents per day across 3,000 U.S. companies
  • Finance and healthcare: 50%+ of all incidents
  • Average breach cost: $4.8M (IBM 2025)

Source: Obsidian Security AI Security Report 2025

Critical CVEs (CVSS 8.0+)

CVE-2025-53773 - GitHub Copilot Remote Code Execution

CVSS Score: 9.6 (Critical) Vendor: GitHub/Microsoft Impact: Remote code execution on 100,000+ developer machines Attack Vector: Prompt injection via code comments triggering "YOLO mode" Disclosure: January 2025

References:

  • Attack Mechanism: Code comments containing malicious prompts bypass safety guidelines

Detection: Monitor for unusual Copilot process behavior, code comment patterns with system-level commands

CVE-2025-32711 - Microsoft 365 Copilot (EchoLeak)

CVSS Score: Not yet scored (likely High/Critical) Vendor: Microsoft Impact: Zero-click data exfiltration via crafted email Attack Vector: Indirect prompt injection bypassing XPIA classifier Disclosure: January 2025

References:

  • Attack Mechanism: Malicious prompts embedded in email body/attachments processed by Copilot

Detection: Monitor M365 Copilot API calls for unusual data access patterns, particularly after email processing

CVE-2025-68664 - LangChain Core (LangGrinch)

CVSS Score: Not yet scored Vendor: LangChain Impact: 847 million downloads affected, credential exfiltration Attack Vector: Serialization vulnerability + prompt injection Disclosure: January 2025

References:

  • Attack Mechanism: Malicious LLM output triggers object instantiation → credential exfiltration via HTTP headers

Detection: Monitor LangChain applications for unexpected object creation, outbound connections with environment variables in headers

CVE-2024-5184 - EmailGPT Prompt Injection

CVSS Score: 8.1 (High) Vendor: EmailGPT (Gmail extension) Impact: System prompt leakage, email manipulation, API abuse Attack Vector: Prompt injection via email content Disclosure: June 2024

References:

  • Attack Mechanism: Malicious prompts in emails override system instructions

Detection: Monitor browser extension API calls, unusual email access patterns, token consumption spikes

CVE-2025-54135 - Cursor IDE (CurXecute)

CVSS Score: Not yet scored (likely High) Vendor: Cursor Technologies Impact: Unauthorized MCP server creation, remote code execution Attack Vector: Prompt injection via GitHub README files Disclosure: January 2025

References:

  • Attack Mechanism: Malicious instructions in README cause Cursor to create .cursor/mcp.json with reverse shell commands

Detection: Monitor .cursor/mcp.json creation, file system changes in project directories, GitHub repository access patterns

CVE-2025-54136 - Cursor IDE (MCPoison)

CVSS Score: Not yet scored (likely High) Vendor: Cursor Technologies Impact: Persistent backdoor via MCP trust abuse Attack Vector: One-time trust mechanism exploitation Disclosure: January 2025

References:

  • Attack Mechanism: After initial approval, malicious updates to approved MCP configs bypass review

Detection: Monitor approved MCP server config changes, diff analysis of mcp.json modifications

OpenClaw / Clawbot / Moltbot (2024-2026)

Category: Open-source personal AI assistant Impact: Subject of multiple CVEs including CVE-2025-53773 (CVSS 9.6) Installations: 100,000+ when major vulnerabilities disclosed

What is OpenClaw? OpenClaw (originally named Clawbot, later Moltbot before settling on OpenClaw) is an open-source, self-hosted personal AI assistant agent that runs locally on user machines. It can:

  • Execute tasks on user's behalf (book flights, make reservations)
  • Interface with popular messaging apps (WhatsApp, iMessage)
  • Store persistent memory across sessions
  • Run shell commands and scripts
  • Control browsers and manage calendars/email
  • Execute scheduled automations

Security Concerns:

  • Runs with high-level privileges on local machine
  • Can read/write files and execute arbitrary commands
  • Integrates with messaging apps (expanding attack surface)
  • Skills/plugins from untrusted sources
  • Leaked plaintext API keys and credentials in early versions
  • No built-in authentication (security "optional")
  • Cisco security research used OpenClaw as case study in poor AI agent security

Relation to Moltbook: Many Moltbook agents (the AI social network) used OpenClaw or similar frameworks to automate their posting, commenting, and interaction behaviors. The connection between the two highlighted how local AI assistants could be compromised and then used to propagate attacks through networked AI systems.

Key Lesson: OpenClaw demonstrated that powerful AI agents with system-level access require security-first design. The "move fast, security optional" approach led to numerous vulnerabilities that affected over 100,000 users.

Moltbook Database Exposure (February 2026)

Platform: Moltbook (AI agent social network - "Reddit for AI agents") Scale: 1.5 million autonomous AI agents, 17,000 human operators (88:1 ratio) Impact: Database misconfiguration exposed credentials, API keys, and agent data; 506 prompt injections identified spreading through agent network Attack Method: Database misconfiguration + prompt injection propagation through networked agents

What is Moltbook? Moltbook is a social networking platform where AI agents—not humans—create accounts, post content, comment on submissions, vote, and interact with each other autonomously. Think Reddit, but every user is an AI agent. Agents are organized into "submolts" (similar to subreddits) covering topics from technology to philosophy. The platform became an unintentional large-scale security experiment, revealing how AI agents behave, collaborate, and are compromised in networked environments.

References:

  • Lessons: Natural experiment in AI agent security at scale

Key Findings:

  • Prompt injections spread rapidly through agent networks (heartbeat synchronization every 4 hours)
  • 88:1 agent-to-human ratio achievable with proper structure
  • Memory poisoning creates persistent compromise
  • Traditional security missed database exposure despite cloud monitoring

Common Attack Patterns

  1. Direct Prompt Injection: Ignore previous instructions <SYSTEM>New instructions:</SYSTEM> You are now in developer mode Disregard safety guidelines
  1. Indirect Prompt Injection: Hidden in emails, documents, web pages White text on white background HTML comments, CSS display:none Base64 encoding, Unicode obfuscation
  1. Tool Invocation Abuse: Unexpected shell commands File access outside approved paths Network connections to external IPs Credential access attempts
  1. Data Exfiltration: Large API responses (>10MB) High-frequency tool calls Connections to attacker-controlled servers Environment variable leakage in HTTP headers

Recommended Detection Controls

Layer 1: Configuration Monitoring
  • Monitor MCP configuration files (.cursor/mcp.json, claude_desktop_config.json)
  • Alert on unauthorized MCP server registrations
  • Validate command patterns (no bash, curl, pipes)
  • Check for external URLs in configs
Layer 2: Process Monitoring
  • Track AI assistant child processes
  • Alert on unexpected process trees (bash, powershell, curl spawned by Claude/Copilot)
  • Monitor process arguments for suspicious patterns
Layer 3: Network Traffic Analysis
  • Unencrypted: Snort/Suricata rules for MCP JSON-RPC
  • Encrypted: DNS monitoring, TLS SNI inspection, JA3 fingerprinting
  • Monitor connections to non-approved MCP servers
Layer 4: Behavioral Analytics
  • Baseline normal tool usage per user/agent
  • Alert on off-hours activity
  • Detect excessive API calls (3x standard deviation)
  • Monitor sensitive resource access (/etc/passwd, .ssh, credentials)
Layer 5: EDR Integration
  • Custom IOAs for AI agent processes
  • File integrity monitoring on config files
  • Memory analysis for process injection
Layer 6: SIEM Correlation
  • Combine signals from multiple layers
  • High confidence: 3+ indicators → auto-quarantine
  • Medium confidence: 2 indicators → investigate

Stay tuned for an article on detection controls!  

Standards & Frameworks

NIST AI Risk Management Framework (AI RMF 1.0)

Link: https://www.nist.gov/itl/ai-risk-management-framework

OWASP Top 10 for LLM Applications

Link: https://genai.owasp.org/ Updates: Annually (2025 version current)

Today’s SOCs don’t have a detection or an AI readiness problem. They have a data architecture problem. Enterprise today are generating terabytes of security telemetry daily, but most of it never meaningfully contributes to detection, investigation, or response. It is ingested late and with gaps, parsed poorly, queried manually and infrequently, and forgotten quickly. Meanwhile, detection coverage remains stubbornly low and response times remain painfully long – leaving enterprises vulnerable.

This becomes more pressing when you account for attackers using AI to find and leverage vulnerabilities. 41% of incidents now involve stolen credentials (Sophos, 2025), and once access is obtained, lateral movement can begin in as little as two minutes. Today’s security teams are ill-equipped and ill-prepared to respond to this challenge.

The industry’s response? Add AI. But most AI SOC initiatives are cosmetic. A conversational layer over the same ingestion-heavy and unreliable pipeline. Data is not structured or optimized for AI deployments. What SOCs need today is an architectural shift that restructures telemetry, reasoning, and action around enabling security teams to treat AI as the operating system and ensure that their output is designed to enable the human SOC teams to improve their security posture.

The Myth Most Teams Are Buying

Most “AI SOC” initiatives follow a similar pattern. New intelligence is introduced at the surface of the system, while the underlying architecture remains intact. Sometimes this takes the form of conversational interfaces. Other times it shows up as automated triage, enrichment engines, or agent-based workflows layered onto existing SIEM infrastructure.

This ‘bolted-on’ AI interface only incrementally impacts the use, not the outcomes. What has not changed is the execution model. Detection is still constrained by the same indexes, the same static correlation logic, and the same alert-first workflows. Context is still assembled late, per incident, and largely by humans. Reasoning still begins after an alert has fired, not continuously as data flows through the environment.

This distinction matters because modern attacks do not unfold as isolated alerts. They span identity, cloud, SaaS, and endpoint domains, unfold over time, and exploit relationships that traditional SOC architectures do not model explicitly. When execution remains alert-driven and post-hoc, AI improvements only accelerate what happens after something is already detected.

In practice, this means the SOC gets better explanations of the same alerts, not better detection. Coverage gaps persist. Blind spots remain. The system is still optimized for investigation, not for identifying attack paths as they emerge.

That gap between perception and reality looks like this:

Each gap above traces back to the same root cause: intelligence added at the surface, while telemetry, correlation, and reasoning remain constrained by legacy SOC architecture.

Why Most AI SOC Initiatives Fail

Across environments, the same failure modes appear repeatedly.

1. Data chaos collapses detection before it starts
Enterprises generate terabytes of telemetry daily, but cost and normalization complexity force selective ingestion. Cloud, SaaS, and identity logs are often sampled or excluded entirely. When attackers operate primarily in these planes, detection gaps are baked in by design. Downstream AI cannot recover coverage that was never ingested.

2. Single-mode retrieval cannot surface modern attack paths
Traditional SIEMs rely on exact-match queries over indexed fields. This model cannot detect behavioral anomalies, privilege escalation chains, or multi-stage attacks spanning identity, cloud, and SaaS systems. Effective detection requires sparse search, semantic similarity, and relationship traversal. Most SOC architectures support only one.

3. Autonomous agents without governance introduce new risk
Agents capable of querying systems and triggering actions will eventually make incorrect inferences. Without evidence grounding, confidence thresholds, scoped tool access, and auditability, autonomy becomes operational risk. Governance is not optional infrastructure; it is required for safe automation.

4. Identity remains a blind spot in cloud-first environments
Despite being the primary attack surface, identity telemetry is often treated as enrichment rather than a first-class signal. OAuth abuse, service principals, MFA bypass, and cross-tenant privilege escalation rarely trigger traditional endpoint or network detections. Without identity-specific analysis, modern attacks blend in as legitimate access.

5. Detection engineering does not scale manually
Most environments already process enough telemetry to support far higher ATT&CK coverage than they achieve today. The constraint is human effort. Writing, testing, and maintaining thousands of rules across hundreds of log types does not scale in dynamic cloud environments. Coverage gaps persist because the workload exceeds human capacity.

The Six Layers That Actually Work

A functional AI-native SOC is not assembled from features. It is built as an integrated system with clear dependency ordering.

Layer 1: Unified telemetry pipeline
Telemetry from cloud, SaaS, identity, endpoint, and network sources is collected once, normalized using open schemas, enriched with context, and governed in flight. Volume reduction and entity resolution happen before storage or analysis. This layer determines what the SOC can ever see.

Layer 2: Hybrid retrieval architecture
The system supports three retrieval modes simultaneously: sparse indexes for deterministic queries, vector search for behavioral similarity, and graph traversal for relationship analysis. This enables detection of patterns that exact-match search alone cannot surface.

Layer 3: AI reasoning fabric
Reasoning applies temporal analysis, evidence grounding, and confidence scoring to retrieved data. Every conclusion is traceable to specific telemetry. This constrains hallucination and makes AI output operationally usable.

Layer 4: Multi-agent system
Domain-specialized agents operate across identity, cloud, SaaS, endpoint, detection engineering, incident response, and threat intelligence. Each agent investigates within its domain while sharing context across the system. Analysis occurs in parallel rather than through sequential handoffs.

Layer 5: Unified case memory
Context persists across investigations. Signals detected hours or days apart are automatically linked. Multi-stage attacks no longer rely on analysts remembering prior activity across tools and shifts.

Layer 6: Zero-trust governance
Policies constrain data access, reasoning scope, and permitted actions. Autonomous decisions are logged, auditable, and subject to approval based on impact. Autonomy exists, but never without control.

Miss any layer, or implement them out of order, and the system degrades quickly.

Outcomes When the Architecture Is Correct

When the six layers operate together, the impact is structural rather than cosmetic:

  • Faster time to detection
    Detection shifts from alert-triggered investigation to continuous, machine-speed reasoning across telemetry streams. This is the only way to contend with adversaries operating on minute-level timelines.
  • Improved analyst automation
    L1 and L2 workflows can be substantially automated, as agents handle triage, enrichment, correlation, and evidence gathering. Analysts spend more time validating conclusions and shaping detection logic, less time stitching data together.
  • Broader and more consistent ATT&CK coverage
    Detection engineering moves from manual rule authoring to agent-assisted mapping of telemetry against ATT&CK techniques, highlighting gaps and proposing new detections as environments change.
  • Lower false-positive burden
    Evidence grounding, confidence scoring, and cross-domain correlation reduce alert volume without suppressing signal, improving analyst trust in what reaches them.

The shift from reactive triage to proactive threat discovery becomes possible only when architectural bottlenecks like fragmented data, late context, and human-paced correlation, are removed from the system.

Stop Retrofitting AI Onto Broken Architecture

Most teams approach AI SOC transformation backward. They layer new intelligence onto existing SIEM workflows and expect better outcomes, without changing the architecture that constrains how detection, correlation, and response actually function.

The dependency chain is unforgiving. Without unified telemetry, detection operates on partial visibility. Without cross-domain correlation, attack paths remain fragmented. Without continuous reasoning, analysis begins only after alerts fire. And without governance, autonomy introduces risk rather than reducing it.

Agentic SOC architectures are expected to standardize across enterprises within the next one to two years (Omdia, 2025). The question is not whether SOCs become AI-native, but whether teams build deliberately from the foundation up — or spend the next three years patching broken architecture while attackers continue to exploit the same coverage gaps and response delays.

The AI isn't broken. The data feeding it is.

The $4.8 Million Question

When identity breaches cost an average of $4.8 million and 84% of organizations report direct business impact from credential attacks, you'd expect AI-powered security tools to be the answer.

Instead, security leaders are discovering that their shiny new AI copilots:

  • Miss obvious attack chains because user IDs don't match across systems
  • Generate confident-sounding analysis based on incomplete information
  • Can't answer simple questions like "show me everything this user touched in the last 24 hours"

The problem isn't artificial intelligence. It's artificial data quality.

Watch an Attack Disappear in Your Data

Here's a scenario that plays out daily in enterprise SOCs:

  1. Attacker compromises credentials via phishing
  1. Logs into cloud console → CloudTrail records arn:aws:iam::123456:user/jsmith
  1. Pivots to SaaS app → Salesforce logs jsmith@company.com
  1. Accesses sensitive data → Microsoft 365 logs John Smith (john.smith@company.onmicrosoft.com)
  1. Exfiltrates via collaboration tool → Slack logs U04ABCD1234

Five steps. One attacker. One victim.

Your SIEM sees five unrelated events. Your AI sees five unrelated events. Your analysts see five separate tickets. The attacker sees one smooth path to your data.

This is the identity stitching problem—and it's why your AI can't trace attack paths that a human adversary navigates effortlessly.

Why Your Security Data Is Working Against You

Modern enterprises run on 30+ security tools. Here's the brutal math:

  • Enterprise SIEMs process an average of 24,000 unique log sources
  • Those same SIEMs have detection coverage for just 21% of MITRE ATT&CK techniques
  • Organizations ingest less than 15% of available security telemetry due to cost

More data. Less coverage. Higher costs.

This isn't a vendor problem. It's an architecture problem—and throwing more budget at it makes it worse.

Why Traditional Approaches Keep Failing

Approach 1: "We'll normalize it in the SIEM"

Reality: You're paying detection-tier pricing to do data engineering work. Custom parsers break when vendors change formats. Schema drift creates silent failures. Your analysts become parser maintenance engineers instead of threat hunters.

Approach 2: "We'll enrich at query time"

Reality: Queries become complex, slow, and expensive. Real-time detection suffers because correlation happens after the fact. Historical investigations become archaeology projects where analysts spend 60% of their time just finding relevant data.

Approach 3: "We'll train the AI on our data patterns"

Reality: You're training the AI to work around your data problems instead of fixing them. Every new data source requires retraining. The AI learns your inconsistencies and confidently reproduces them. Garbage in, articulate garbage out.

None of these approaches solve the root cause: your data is fragmented before it ever reaches your analytics.

The Foundation That Makes Everything Else Work

The organizations seeing real results from AI security investments share one thing: they fixed the data layer first.

Not by adding more tools. By adding a unification layer between their sources and their analytics—a security data pipeline that:

1. Collects everything once Cloud logs, identity events, SaaS activity, endpoint telemetry—without custom integration work for each source. Pull-based for APIs, push-based for streaming, snapshot-based for inventories. Built-in resilience handles the reliability nightmares so your team doesn't.

2. Translates to a common language So jsmith in Active Directory, jsmith@company.com in Azure, John Smith in Salesforce, and U04ABCD1234 in Slack all resolve to the same verified identity—automatically, at ingestion, not at query time.

3. Routes by value, not by volume High-fidelity security signals go to real-time detection. Compliance logs go to cost-effective storage. Noise gets filtered before it costs you money. Your SIEM becomes a detection engine, not an expensive data warehouse.

4. Preserves context for investigation The relationships between who, what, when, and where that investigations actually need—maintained from source to analyst to AI.

What This Looks Like in Practice

Article content

The 70% reduction in SIEM-bound data isn't about losing visibility—it's about not paying detection-tier pricing for compliance-tier logs.

More importantly: when your AI says "this user accessed these resources from this location," you can trust it—because every data point resolves to the same verified identity.

The Strategic Question for Security Leaders

Every organization will eventually build AI into their security operations. The question is whether that AI will be working with unified, trustworthy data—or fighting the same fragmentation that's already limiting your human analysts.

The SOC of the future isn't defined by which AI you choose. It's defined by whether your data architecture can support any AI you choose.

Questions to Ask Before Your Next Security Investment

Before you sign another security contract, ask these questions:

For your current stack:

  • "Can we trace a single identity across cloud, SaaS, and endpoint in under 60 seconds?"
  • "What percentage of our security telemetry actually reaches our detection systems?"
  • "How long does it take to onboard a new log source end-to-end?"

For prospective vendors:

  • "Do you normalize to open standards like OCSF, or proprietary schemas?"
  • "How do you handle entity resolution across identity providers?"
  • "What routing flexibility do we have for cost optimization?"
  • "Does this add to our data fragmentation, or help resolve it?"

If your team hesitates on the first set, or vendors look confused by the second—you've found your actual problem.

The foundation comes first. Everything else follows.

Stay tuned to the next article on recommendations for architecture of the AI-enabled SOC

What's your experience? Are your AI security tools delivering on their promise, or hitting data quality walls? I'd love to hear what's working (or not) in the comments.

Subscribe to DataBahn blog!

Get expert updates on AI-powered data management, security, and automation—straight to your inbox

Hi 👋 Let’s schedule your demo

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Trusted by leading brands and partners

optiv
mobia
la esfera
inspira
evanssion
KPMG
Guidepoint Security
EY
ESI