The DataBahn Blog

The latest articles, news, blogs and learnings from Databahn

All
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Cruz Banner Image
5 min read

Introducing Cruz: An AI Data Engineer In-a-Box

Read about why we built Cruz - an autonomous agentic AI to automate data engineering tasks to empower security and data teams
February 12, 2025

Introducing Cruz: An AI Data Engineer In-a-Box

Why we built it and what it does

Artificial Intelligence is perceived as a panacea for modern business challenges with its potential to unlock greater efficiency, enhance decision-making, and optimize resource allocation. However, today’s commercially-available AI solutions are reactive – they assist, enhance analysis, and bolster detection, but don’t act on their own. With the explosion of data from cloud applications, IoT devices, and distributed systems, data teams are burdened with manual monitoring, complex security controls, and fragmented systems that demand constant oversight. What they really need is more than an AI copilot, but a complementary data engineer that takes over all the exhausting work and freeing them up for more strategic data and security work.

That’s where we saw an opportunity. The question that inspired us: How do we transform the way organizations approach data management? The answer led us to Cruz—not just another AI tool, but an autonomous AI data engineer that monitors, detects, adapts, and actively resolves issues with minimal human intervention.

Why We Built Cruz

Organizations face unprecedented challenges in managing vast amounts of data across multiple systems. From integration headaches to security threats, data engineers and security teams are under immense pressure to keep pace with evolving data risks. These challenges extend beyond mere volume—they strike at the effectiveness, security, and real-time insight generation.

  1. Integration Complexity

Data ecosystems are expanding, encompassing diverse tools and platforms—from SIEMs to cloud infrastructure, data lakes, and observability tools. The challenge lies in integrating these disparate systems to achieve unified visibility without compromising security or efficiency. Data teams often spend days or even weeks developing custom connections, which then require continuous monitoring and maintenance.

  1. Disparate Data Formats

Data is generated in varied formats—from logs and alerts to metrics and performance data—making it difficult to maintain quality and extract actionable insights. Compounding this challenge, these formats are not static; schema drifts and unexpected variations further complicate data normalization.

  1. The Cost of Scaling and Storage

With data growing exponentially, organizations struggle with storage, retrieval, and analysis costs. Storing massive amounts of data inflates SIEM and cloud storage costs, while manually filtering out data without loss is nearly impossible. The challenge isn’t just about storage—it’s about efficiently managing data volume while preserving essential information.

  1. Delayed and Inconsistent Insights

Even after data is properly integrated and parsed, extracting meaningful insights is another challenge. Overwhelming volumes of alerts and events make it difficult for data teams to manually query and review dashboards. This overload delays insights, increasing the risk of missing real-time opportunities and security threats.

These challenges demand excessive manual effort—updating normalization, writing rules, querying data, monitoring, and threat hunting—leaving little time for innovation. While traditional AI tools improve efficiency by automating basic tasks or detecting predefined anomalies, they lack the ability to act, adapt, and prioritize autonomously.

What if AI could do more than assist? What if it could autonomously orchestrate data pipelines, proactively neutralize threats, intelligently parse data, and continuously optimize costs? This vision drove us to build Cruz to be an AI system that is context-aware, adaptive, and capable of autonomous decision-making in real time.

Cruz as Agentic AI: Informed, Perceptive, Proactive

Traditional data management solutions are struggling to keep up with the complexities of modern enterprises. We needed a transformative approach—one that led us to agentic AI. Agentic AI represents the next evolution in artificial intelligence, blending sophisticated reasoning with iterative planning to autonomously solve complex, multi-step problems. Cruz embodies this evolution through three core capabilities: being informed, perceptive, and proactive.

Informed Decision-Making

Cruz leverages Retrieval-Augmented Generation (RAG), to understand complex data relationships and maintain a holistic view of an organization’s data ecosystem. By analyzing historical patterns, real-time signals, and organizational policies, Cruz goes beyond raw data analysis to make intelligent, autonomous decisions enhancing efficiency and optimization.

Perceptive Analysis

Cruz’s perceptive intelligence extends beyond basic pattern detection. It recognizes hidden correlations across diverse data sources, differentiates between routine fluctuations and critical anomalies, and dynamically adjusts its responses based on situational context. This deep awareness ensures smarter, more precise decisions without requiring constant human intervention.

Proactive Intelligence

Rather than waiting for issues to emerge, Cruz actively monitors data environments, anticipating potential challenges before they impact operations. It identifies optimization opportunities, detects anomalies, and initiates corrective actions autonomously while continuously evolving to deliver smarter and more effective data management over time.

Redefining Data Management with Autonomous Intelligence

Modern data environments are complex and constantly evolving, requiring more than just automation. Cruz’s agentic capabilities redefine how organizations manage data by autonomously handling tasks traditionally consuming significant engineering time. For example, when schema drift occurs, traditional tools may only alert administrators, but Cruz autonomously analyzes the data pattern, identifies inconsistencies, and updates normalization in real-time.

Unlike traditional tools that rely on static monitoring, Cruz actively scans your data ecosystem, identifying threats and optimization opportunities before they escalate. Whether it's streamlining data flows, transforming data, or reducing data volume, Cruz executes these tasks autonomously while ensuring data integrity.

Cruz's Core Capabilities

  • Plug and Play Integration: Cruz automatically discovers data sources across cloud and on-prem environments, providing a comprehensive data overview. With a single click, Cruz streamlines what would typically be hours of manual setup into a fast, effortless process, ensuring quick and seamless integration with your existing infrastructure.
  • Automated Parsing: Where traditional tools stop at flagging issues, Cruz takes the next step. It proactively parses, normalizes, and resolves inconsistencies in real time. It autonomously updates schemas, masks sensitive data, and refines structures—eliminating days of manual engineering effort.
  • Real-time AI-driven Insights: Cruz leverages advanced AI capabilities to provide insights that go far beyond human-scale analysis. By continuously monitoring data patterns, it provides real-time insights into performance, emerging trends, volume reduction opportunities, and data quality enhancements, enabling better decision-making and faster data optimization.
  • Intelligent Volume Reduction: Cruz actively monitors data environments to identify opportunities for volume reduction by analyzing patterns and creating rules to filter out irrelevant data. For example, it identifies irrelevant fields in logs sent to SIEM systems, eliminating data that doesn't contribute to security insights. Additionally, it filters out duplicate or redundant data, minimizing storage and observability costs while maintaining data accuracy and integrity.
  • Automating Analytics: Cruz operates 24/7, continuously monitoring and analyzing data streams in real-time to ensure no insights are missed. With deep contextual understanding, it detects patterns, anticipates potential threats, and uncovers optimization. By automating these processes, Cruz saves engineering hours, minimizes human errors, and ensures data remains protected, enriched, and readily available for actionable insights.

Conclusion

Cruz is more than an AI tool—it’s an AI Data Engineer that evolves with your data ecosystem, continuously learning and adapting to keep your organization ahead of data challenges. By automating complex tasks, resolving issues, and optimizing operations, Cruz frees data teams from the burden of constant monitoring and manual intervention. Instead of reacting to problems, organizations can focus on strategy, innovation, and scaling their data capabilities.

In an era where data complexity is growing, businesses need more than automation—they need an intelligent, autonomous system that optimizes, protects, and enhances their data. Cruz delivers just that, transforming how companies interact with their data and ensuring they stay competitive in an increasingly data-driven world.

With Cruz, data isn’t just managed—it’s continuously improved.

Ready to transform your data ecosystem with Cruz? Learn more about Cruz here.

|
5 min read

Data Pipeline Management and Security Data Fabrics

Data Pipeline Management and Security Data Fabrics In the recent past, DataBahn has been featured in 3 different narratives focused on security data – Being mentioned by these experts is a welcome validation. It is also a recognition that we are solving a relevant problem for businesses – and for these mentions to come from...
November 14, 2024

Data Pipeline Management and Security Data Fabrics

In the recent past, DataBahn has been featured in 3 different narratives focused on security data -

  1. Cybersecurity experts like Cole Grolmus (Strategy of Security) discussing how DataBahn's "Security Data Fabric" solution is unbundling security data collection from SIEMs (Post 1, Post 2)
  2. VCs such as Eric Fett of NGP Capital talking about how DataBahn's AI-native approach to cybersecurity was revolutionizing enterprise SOCs attempts to combat alert fatigue and escalating SIEM costs (Blog)
  3. Most recently, Allie Mellen, a Principal Analyst at Forrester, shouted DataBahn out as a "Data Pipeline Management" product focusing on security use cases. (LinkedIn Post, Blog)

Being mentioned by these experts is a welcome validation. It is also a recognition that we are solving a relevant problem for businesses – and for these mentions to come from these different sources represents the perspectives from which we can consider our mission.

What are these experts saying?

Allie’s wonderful article (“If You’re Not Using Data Pipeline Management for Security and IT, You Need To”) expertly articulates how SIEM spending is increasing, and that SIEM vendors haven’t created effective tools for log size reduction or routing since it “… directly opposes their own interests: getting you to ingest more data into their platform and, therefore, spend more money with them.”

This aligns with what Cole alluded to, when he stated reasons why "Security Data Fabrics” shouldn’t be SIEMs", pointing to this same conflict of interest. He added that these misaligned incentives spilled over into interoperability, where proprietary data formats and preferred destinations would promote vendor lock-in, which he had previously mentioned Security Data Fabrics were designed to overcome.

Eric’s blog was focused on identifying AI-native cybersecurity disrupters, where he identified DataBahn as one of the leading companies whose architecture was designed to leverage and support AI features, enabling seamless integration into their own AI assets to “ … overcome alert fatigue, optimize ingestion costs, and allocate resources to the most critical security risks.”

What is our point of view?

The reflections of these experts resonate with our conception of the problem we are trying to solve—SOCs and Data Engineering teams overwhelmed by the laborious task of data management and the prohibitive cost of the time and effort involved in overcoming it.

  • SIEM ingest costs are too high. ~70% of the data being sent to SIEMs is not security-relevant. Logs have extra fields you don’t always need, and indexed data becomes 3-5x the original size. SIEM pricing data depends upon the volume of data being ingested and stored with them – which strains budgets and reduces the value that SOCs get from their SIEMs.

    We deliver a 35%+ reduction in SIEM costs by reducing log sizes in 2-4 weeks – and our AI-enabled platform enables ongoing optimization to continue to reduce log sizes.
  • SIEM being the source of data ingestion is also a problem. SIEMs are not very good at data ingestion. While some SIEM vendors have associated cloud environments (Sentinel, SecOps) with native ingestion tools, adding new sources – especially custom apps or sources with unstructured data – requires extensive data engineering effort and 4-8 weeks of integration. Additionally, managing these data pipelines is challenging, as these pipelines are single points of failure. Back pressure and spikes in data volumes can cause data loss.

    DataBahn ensures loss-less data ingestion via a mesh architecture that ensures a secondary channel to ingest data in case of any blockage. It also tracks and identifies sudden changes in volume, helping to identify issues faster.
  • Data Formats and Schemas are a challenge. SIEMs, UEBAs, Observability Tools, and different data storage destinations come with their proprietary formats and schemas, which add another manual task of data transformation onto data engineering teams. Proprietary formats and compliance requirements also create vendor lock-in situations, which add to your data team’s cost and effort.

    We automate data transformation, ensuring seamless and effortless data standardization, data enrichment, and data normalization before forking the data to the destination of your choice.

Our solution is designed for specific security use cases, including a library of 400+ connectors and integrations and 900+ volume reduction rules to reduce SIEM log volumes, as well as support for all the major formats and schemas – which puts it ahead of generic DPM tools, something which Allie describes in her piece.

Cole has been at the forefront of conversations around Security Data Fabrics, and has called out that DataBahn has built the most complete platform/product in the space, with capabilities across integration & connectivity, data handling, observability & governance, reliability & performance, and AI/ML support.

Conclusion

We are grateful to be mentioned in these vital conversations about security data management and its future, and we appreciate the time and effort being spent by these experts to drive these conversations. We hope that this increases the awareness of Data Pipeline Management, Security Data Fabrics, and AI-native data management tools - a venn diagram we are pleased to be at the intersection of - and look forward to continue our journey in solving the problems that these experts have identified.

5 min read

The Case for a Security Data Lake

Upgrade your SIEM with a Security Data Lake! Improve threat detection, simplify compliance, & gain SIEM migration best practices. Download our guide to learn more.
September 2, 2024

The Case for a Security Data Lake

Today’s business challenges require organizations to consume and store a vast quantity of raw security data in the form of logs, events, and datasets, as well as a substantial volume of security analytics information to support threat hunting and forensics. For some industries, there are regulatory and compliance requirements which also necessitate the storage of security data for extensive time periods. With organizations supporting a diverse set of users across geographies, managing multiple cloud-environments and on-premises networks, and supporting multiple devices – all while ensuring a centralized and cohesive security solution. In such cases, ensuring that all security data is collected and maintained in one accessible location is important. Enterprises are increasingly turning to security data lakes to perform this task.

Read how DataBahn helped a company forced to use 3 different SIEMs for data retention for compliance by giving them control of their own data     DOWNLOAD  

What is a Security Data Lake?

A security data lake is a central hub for storing, processing, and securing vast quantities of structured, unstructured, and semi-structured security-related data and information. By creating a specialized data repository for security data and threat intelligence, security data lakes enable organizations and their security teams to find, understand, and respond to cybersecurity threats more effectively.

Having a security data lake overcomes two distinct problems – security and access. IT and Security teams at organizations would want to ensure that their security data isn’t easily accessible and / or editable to ensure it isn’t modified or corrupted but ensure that it is quickly and easily accessible for SOCs for a faster MTTD / MTTR. Keeping all security-relevant data in a specific security data lake makes it easier for SOCs to access relevant logs and security information at a high velocity, while keeping it secure from malicious or unpermitted access.

How does a Security Data Lake work?

In traditional cybersecurity, legacy SIEMs were built to collect all security-related data from various sources in one place, and then analyze them by structuring the data into predefined schemas to flag anomalous patterns and identify potential threats through real-time analysis and examining historical data for patterns and trends. However, with the explosion in security data volume, legacy SIEMs struggled to identify relevant threats, while enterprises struggled with the ballooning SIEM costs.

Security Data Lakes emerged to solve this problem, becoming a scalable repository for all security data that could be connected to multiple cybersecurity tools so that threats could be identified, and incidents could be managed and documented easily. SOCs architect a network whereby security data is collected and ingested into the SIEM, and then stored in the Security Data Lake for analysis or threat hunting purposes.

Use cases of a Security Data Lake

  • Easy and fast analysis of data across large time periods – including multiple years! – without needing to go to multiple sources due to the centralization of security data.
  • Simplified compliance with easier monitoring and reporting on security data to ensure all relevant security data is stored and monitored for compliance purposes.
  • Smarter and faster security investigations and threat hunting with centralized data to analyze, leading to improved MTTD and MTTR for security teams.
  • Streamline access and security data management by optimizing accessibility and permissions for your security teams across hybrid environments.

Benefits of a Security Data Lake

As a centralized security data repository, there are numerous benefits to modern cybersecurity teams in leveraging a security data lake to manage and analyze security data. Here are some of the key advantages:

  • Enhanced Visibility and Contextual Insights: Security Data Lakes allow organizations to seamlessly blend security data with business information that provides contextual value. This enables more informed assessments and quicker responses by providing greater visibility and deeper insights into potential threats for security teams.
  • Cost-Effective Long-Term Storage: Security Data Lakes enable organizations to store large volumes of historical security data. This enables more effective threat hunting, helping cybersecurity teams identify anomalous behavior better and track incidents of concern without incurring very high SIEM licensing costs.
  • Accelerated Incident Investigation: By supporting a high number of concurrent users and substantial computing resources, security data lakes facilitate rapid incident investigations to ensure faster detection and faster response. This enables security teams to deliver improved MTTD and MTTR.
  • Advanced Security Analytics: Security Data Lakes provide a platform for advanced analytics capabilities, enabling collaboration between security and data science teams. This collaboration can lead to the development of sophisticated behavior models and machine learning analytics, enhancing the organization’s ability to detect and respond to threats.
  • Scalability and Flexibility: As data volumes and attack surfaces grow, security data lakes offer the scalability needed to manage and analyze vast amounts of data, especially as the volume of security data continues to rise and becomes troublesome for SOCs to manage. Using a Security Data Lake makes it easier for organizations to collect security data from multiple sources across environments with ease.
  • AI-Readiness for SOCs: Owning your own security data is critical for SOCs to integrate and adopt best-of-breed AI models to automate investigations and enhance data governance. Preparing aggregated security data for use in AI models and building applications and workflows that leverage AI to make SOCs more effective requires centralized data storage.

Read how DataBahn helped a company forced to use 3 different SIEMs for data retention for compliance by giving them control of their own data  DOWNLOAD  

5 min read

Cybersecurity Data Fabric: What on earth is security data fabric?

Understand what a Security Data Fabric is, and why an enterprise security team needs one to achieve better security while reducing SIEM and storage costs
March 12, 2024

What on earth is security data fabric, and why do we suddenly need one?

Every time I am at a security conference, a new buzzword is all over most vendors’ signage, one year it was UEBA (User Entity and Behavioral Analytics), next EDR (Endpoint Detection and Response), then XDR (Extended Detection and Response), then it was (ASM) Attack Surface Management. Some of these are truly new and valuable capabilities, some of these are rebranding of an existing capability. Some vendors have something to do with the new capability (i.e., buzzword), and some are just hoping to ride the wave of the hype. This year, we will probably hear a lot on GenAI and cybersecurity, and on the security data fabric. Let me tackle the latter in this article, with another article to follow soon on GenAI and Cybersecurity.

Problem Statement:

Many organizations are dealing with an explosion of security logs directed to the SIEM and other security monitoring systems, Terabytes of data every day!

  • How to better manage the growing cost of the security log data collection?
  • Do you know if all of this data clogging your SIEM storage has high security value?
  • Are you collecting the most relevant security data?

To illustrate, here is an example of windows security events and a view on what elements have high security value compared to the total volume typically collected:

  • Do you have genuine visibility into potential security log data duplication and underlying inconsistencies? Is your system able to identify missing security logs and security log schema draft fast enough for your SOC to avoid missing something relevant?
  • As SIEM and security analytics capabilities evolve, how do to best decouple security log integration from SIEM and other threat detection platforms to allow not only easier migration to lasted technology but provide cost-effective and seamless access of this security data for threat hunting and other user groups?
  • Major Next Gen SIEMs operate on a consumption-based model expecting end users to break down queries by data source and/or narrowed time range; which increases the total # of queries executed and increases your cost significantly!! Major Next-Gen SIEMs operate on a consumption-based model expecting end users to break down queries by data source and/or narrowed time range; which increases the total # of queries executed and increases your cost significantly!!

As security practitioners, we either accepted these issues as the cost of running our SOC, handled some of these issues manually, or hoped that either the cloud and/or SIEM vendors would one day have a better approach to deal with these issues, to no avail. This is why you need a security data fabric.

What is a Security Data Fabric (SDF)?

A data fabric is a solution that connects, integrates, and governs data across different systems and applications. It uses artificial intelligence and metadata automation to create flexible and reusable data pipelines and services. For clarity, a data fabric is simply a set of capabilities that allows you a lot more control of your data end to end, on how this data is ingested and where to forward it and stores it, in service of your business end goals, compared to just collecting and hoarding a heap of data in an expensive data lake, and hoping one day some use will come of it. The security data fabric is meant to tightly couple these principles with deep security expertise and the use of artificial intelligence to allow mastery of your security data and optimize your security monitoring investments and enable enhanced threat detection.

They key outcome of a security data fabric is to allow security teams to focus on their core function (i.e., threat detection) instead of spending countless hours tinkering with data engineering tasks, which means automation, seamless integration and minimal overhead on ongoing operations.

Components of a Security Data Fabric (SDF):

Smart Collection:

This is meant to decouple the collection of the security data logs from the SIEM/UEBA vendor you are using. This allows the ability to send the relevant security data to the SIEM/UEBA, sending a copy to a security data lake to create additional AI-enabled threat detection use cases (i.e., AI workbench) or to perform threat hunting, and send compliance-related logs to cold storage.

    Why important?         
  1. Minimize vendor lock-in and allow your system to leverage this data in various environments and formats, without needing to pay multiple times to use your own security data outside of the SIEM - particularly for requirements such as threat hunting and the creation of advanced threat-detection use cases using AI.
  1. Eliminate data loss with traditional SIEM log forwarders, syslog relay servers.
  1. Eliminate custom code/scripts for data collection.
  1. Reduced data transfer between cloud environments, especially in the case of having a hybrid cloud environment.

Security Data Orchestration:

This is where the security expertise in the security data fabric becomes VERY important. The security data orchestration includes the following elements:

  • Normalize, Parse, and Transform: Apply AI and security expertise for seamless normalization, parsing, and transforming of security data into the format you need for ingestion into your SIEM/UEBA tool, such as OCSF, CEF, CIM, or to a security data lake, or other data storage solutions.
  • Data Forking: Again, applying AI and security expertise to identify which security logs have the right fields and attributes that have threat detection value and should be sent to the SIEM, and which other logs should be sent straight to cold storage for compliance purposes, as an example.
  • Data Lineage and Data Observability: These are well-established capabilities in data management tools. We are applying it here to security data, so we no longer need to wonder if the threat detection rule is not firing because the log source is dead/MIA or because there are no hits. Existing collectors do not always give you visibility for individual log sources (at the level of the Individual device and log attribute/telemetry). This capability solves this challenge.
  • Data Quality: Ability to monitor and alert on schema drift and track the consistency, completeness, reliability, and relevance of the security data collected, stored, and used
  • Data Enrichment: This is where you start getting exciting value. The security data fabric uses its visibility to all your security data with insights using advanced AI such as:

    • Correlate with threat intel showing new CVEs or IoCs impacting your assets, here is how it looks in the MITRE Att&ck kill chain and provides a historical view of the potential presence of these indicators in your environment.
    • Recommendations on new threat detection use cases to apply based on your threat profile.
   Why important?
  1. Automation: At face value, existing tools promise some of these capabilities, but they usually need a massive amount of manual effort and deep security expertise to implement. This allows the SOC team to focus on their core function (i.e., threat detection) instead of spending countless hours tinkering with data engineering tasks.
  2. Volume Reduction: This is the most obvious value of using a security data fabric. You can reduce 30-50% of the data volume being sent to your SIEM by using a security-intelligent data fabric, as it will only forward data that has security value to your SIEM and send the rest to cheaper data storage. Yes, you read this correctly, 30-50% volume reduction! Imagine the cost savings and how much new useful security data you can start sending to your SIEM for enhanced threat detection.
  3. Enhanced Threat Detection: An SDF will enable the threat-hunting team to run queries more effectively and cheaply by giving them the ability to access a separate data lake, you get full control of your security data, and ongoing enrichments in how to improve your threat detection capabilities. Isn’t this what a security solution is about at the end of the day?
|
5 min read

Navigating the New Security Data Frontier: The Synergy of Databahn.ai, AWS Security Lake, and OCSF

Learn how OCSF's structured data hierarcy and security teams opting to build their own security lakes requires a security data fabric to maximize value
April 26, 2024

Navigating the New Security Data Frontier: The Synergy of Databahn.ai, Amazon Security Lake, and OCSF

In recent months, we've witnessed a paradigm shift where security teams are increasingly opting to build their own security data lakes. This trend isn't entirely new, as attempts have been made in the past with cloud storage systems and data warehouse solutions. Previously, the challenges of integrating data from disparate sources, normalizing it, and ensuring consistent usage through enterprise-wide security data models were significant barriers. However, the landscape is changing as more security teams embrace the idea of crafting their own data lakes. This isn't just about creating a repository for data; it's the beginning of a modular security operations stack that offers unprecedented flexibility. This new approach allows teams to integrate various tools into their stack seamlessly, without the complexities of data access, normalization, or the limitations imposed by incompatible data formats.

Driving Forces Behind the Shift

One pivotal factor propelling this shift is the development of the Open Cybersecurity Schema Framework (OCSF). Initiated in August 2022, OCSF aims to standardize security data across various platforms and tools and is now powered by a consortium of over 660 contributors from 197 enterprises. This framework strives to eliminate data silos and establish a unified language for security telemetry, promoting easier integration of products and fostering collaboration within the cybersecurity community. Achieving these benefits on a broad scale, however, requires ongoing cooperation among all stakeholders involved in cybersecurity.

The adoption of OCSF's structured data hierarchy significantly enhances security operations by enabling seamless communication through standardized data formats, which eliminates the need for extensive data normalization. This standardization also accelerates threat detection by facilitating quicker correlation and analysis of security events. Additionally, it improves overall security operations by streamlining data exchange, enhancing team collaboration, and simplifying the implementation of security orchestration, automation, and response (SOAR) strategies.

The Emergence of Amazon Security Lake

In tandem with the rise of OCSF, solutions like Amazon Security Lake have come to the forefront, offering specialized capabilities that address the limitations often encountered with traditional cloud SIEM vendors, such as data lock-in and restricted tool integration flexibility or traditional cloud data warehouses/data lakes that were often general purpose lacking the right foundations of managing security data. Amazon Security Lake acts as a central repository for security data from multiple sources—be it AWS environments, SaaS providers, on-premises data centers, or other cloud platforms. By consolidating this data into a dedicated data lake within the user’s AWS account, it enables a holistic view of security data across the organization.

Integrating Amazon Security Lake with OCSF facilitates the normalization and amalgamation of this data, crucial for consistent and efficient analysis and monitoring. One of the standout features of Amazon Security Lake is its ability to centralize vast amounts of data into Amazon S3 buckets, allowing security teams to utilize their chosen analytics tools freely. This capability not only circumvents vendor lock-in but also empowers organizations to adapt their analytics tools as security needs evolve and new technologies emerge.

The Rise of Security Data Fabrics - DataBahn.ai

DataBahn.ai plays a crucial role in this synergy, offering its Security Data Fabric platform. The platform enables AWS customers with the flexibility to select from an array of OCSF-enabled tools and services that best meet their needs, without the hassle of manually reformatting data. This capability enables teams to analyze security data from endpoints, networks, applications, and cloud sources in a standardized format. Quick identification and response to security events are facilitated, empowering organizations with enhanced access controls, cost-efficient data storage, and regulatory compliance.

DataBahn simplifies the process of enriching and shaping raw data from third-party sources to meet the specifications of Amazon Security Lake's Parquet schema. This transformation is facilitated by a repeatable process that minimizes the need for modifications, making data integration seamless and efficient.

Through DataBahn’s Security Data Fabric, Amazon Security Lake users can:

  • Simplify data collection and ingestion into Amazon Security Lake: DataBahn’s plug-and-play integrations and connectors, along with its native streaming integration, allow for hassle-free, real-time data ingestion into Amazon Security Lake without the need for manual reformatting or coding.
  • Convert logs into insights: Utilizing volume reduction functions like aggregation and suppression, DataBahn helps convert noisy logs (e.g., network traffic/flow) into manageable insights, which are then loaded into Amazon Security Lake to reduce query execution times.
  • Increase overall data governance and quality: DataBahn identifies and isolates sensitive data sets in transit, thereby limiting exposure.
  • Get visibility into the health of telemetry generation: The dynamic device inventory generated by DataBahn tracks devices to identify those that have gone silent, log outages, and detect any other upstream telemetry blind spots.

The greatest advantage of all is that it's your data, in your lake, formatted in OCSF, which allows you to layer any additional tools on top of this stack. This flexibility empowers your teams to achieve more and enhance your security posture.

Conclusion: A Unified Security Data Management Approach

This shift towards a more unified and flexible approach to security data management not only streamlines operations but also enables security teams to focus on strategic initiatives. With the combined capabilities of Databahn.ai, Amazon Security Lake, and OCSF, organizations are better positioned to enhance their security posture while maintaining the agility needed to respond to emerging threats. As the cybersecurity landscape continues to evolve, we are at the cusp of a new wave of Security operations powered by tools that will play a crucial role in shaping a more integrated, efficient, and adaptive security data management framework.

5 min read

Scaling Security Operations using Data Orchestration

Learn how decoupling data ingestion and collection from your SIEM can unlock exceptional scalability and value for your security and IT teams
February 28, 2024

Scaling Security Operations using Data Orchestration

Lately, there has been a surge in discussions through numerous articles and blogs emphasizing the importance of disentangling the processes of data collection and ingestion from the conventional SIEM (Security Information and Event Management) systems. Leading detection engineering teams within the industry are already adapting to this transformation. They are moving away from the conventional approach of considering security data ingestion, analytics (detection), and storage as a single, monolithic task.

Instead, they have opted to separate the facets of data collection and ingestion from the SIEM, granting them the freedom to expand their detection and threat-hunting capabilities within the platforms of their choice. This approach not only enhances flexibility to bring the best-of-breed technologies but also proves to be cost-effective, as it empowers them to bring in the most pertinent data for their security operations.

Staying ahead of threats requires innovative solutions. One such advancement is the emergence of next-generation data-focused orchestration platforms.

So, what is Security Data Orchestration?

Security data orchestration is a process or technology that involves the collection, normalization, and organization of data related to cybersecurity and information security. It aims to streamline the handling of security data from various sources, making it more accessible in destinations where the data is actionable for security professionals.

 

Why is Security Data Orchestration becoming a big deal now?

Not too long ago, security teams adhered to a philosophy of sending every bit of data everywhere. During that era, the allure of extensive on-premise infrastructure was irresistible, and organizations justified the sustained costs over time. However, in the subsequent years, a paradigm shift occurred as the entire industry began to shift its gaze towards the cloud.

This transformative shift meant that all the entities downstream from data sources—such as SIEM (Security Information and Event Management) systems, UEBA (User and Entity Behavior Analytics), and Data Warehouses—all made their migration to the cloud. This marked the inception of a new era defined by subscription and licensing models that held data as a paramount factor in their quest to maximize profit margins.

In the contemporary landscape, most downstream products, without exception, revolve around the notion of data as a pivotal element. It's all about the data you ingest, the data you process, the data you store, and, not to be overlooked, the data you search in your quest for security and insights.

This paradigm shift has left many security teams grappling to extract the full value they deserve from these downstream systems. They frequently find themselves constrained by the limitations of their SIEMs, struggling to accommodate additional valuable data. Moreover, they often face challenges related to storage capacity and data retention, hindering their ability to run complex hunting scenarios or retrospectively delve deeper into their data for enhanced visibility and insights.

It's quite amusing, but also concerning, to note the significant volume of redundant data that accumulates when companies simply opt for vendor default audit configurations. Take a moment to examine your data for outbound traffic to Office 365 applications, corporate intranets, or routine process executions like Teams.exe or Zoom.exe.


Sample data redundancy illustration with logs collected by these product types in your SIEM Upon inspection, you'll likely discover that within your SIEM, at least three distinct sources are capturing identical information within their respective logs. This level of data redundancy often flies under the radar, and it's a noteworthy issue that warrants attention. And quite simply, this hinders the value that your teams expect to see from the investments made in your SIEM and data warehouse.

Conversely, many security teams amass extensive datasets, but only a fraction of this data finds utility in the realms of threat detection, hunting, and investigations. Here's a snapshot of Active Directory (AD) events, categorized by their event IDs and the daily volume within SIEMs across four distinct organizations.

It is evident that, despite AD audit logs being a staple in SIEM implementations, no two organizations exhibit identical log profiles or event volume trends.

 

Adhering solely to vendor default audit configurations often leads to several noteworthy issues:

  1. Overwhelming Log Collection: In certain cases, such as Org 3, organizations end up amassing an astronomical number of logs from event IDs like EID 4658 or 4690, despite their detection teams rarely leveraging these logs for meaningful analysis.
  2. Redundant Event Collection: Org 4, for example, inadvertently collects redundant events, such as EID 5156, which are also gathered by their firewalls and endpoint systems. This redundancy complicates data management and adds little value.
  3. Blind spots: Standard vendor configurations may result in the omission of critical events, thereby creating security blind spots. These unmonitored areas leave organizations vulnerable to potential threats

On the other hand, it's vital to recognize that in today's multifaceted landscape, no single platform can serve as the definitive, all-encompassing detection system. Although there are numerous purpose-built detection systems painstakingly crafted for specific log types, customers often find themselves grappling with the harsh reality that they can't readily incorporate a multitude of best-of-breed platforms.

The formidable challenges emerge from the intricate intricacies of data acquisition, system management, and the prevalent issue of the ingestion layer being tightly coupled with their SIEMs. Frequently, data cascades into various systems from the SIEM, further compounding the complexity of the situation. The overwhelming burden, both in terms of cost and operational intricacies, can make the pursuit of best-of-breed solutions an impractical endeavor for many organizations.

Today’s SOC teams do not have the strength or capacity to look at each source that is logging to weed out these redundancies or address blind spots or take only the right and relevant data to expensive downstream systems like the SIEM or analytics platforms or even manage multiple data pipelines for multiple platforms.

This underscores the growing necessity for Security Data Orchestration, with an even more vital emphasis on Context-Aware Security Data Orchestration. The rationale is clear: we want the Security Engineering team to focus on security, not get bogged down in data operations.

So, how do you go about Security Data Orchestration?

In its simplest form, envision this layer as a sandwich, positioned neatly between your data sources and their respective destinations.

 

The foundational principles of a Security Data Orchestration platform are -

Centralize your log collection:-  Gather all your security-related logs and data from various sources through a centralized collection layer. This consolidation simplifies data management and analysis, making it easier for downstream platforms to consume the data effectively.

Decouple data ingestion:- Separate the processes of data collection and data ingestion from the downstream systems like SIEMs. This decoupling provides flexibility and scalability, allowing you to fine-tune data ingestion without disrupting your entire security infrastructure.

Filter to send only what is relevant to your downstream system:- Implement intelligent data orchestration to filter and direct only the most pertinent and actionable data to your downstream systems. This not only streamlines cost management but also optimizes the performance of your downstream systems with remarkable efficiency.

Enter DataBahn

At databahn.ai, our mission is clear: to forge the path toward the next-generation Data Orchestration platform. We're dedicated to empowering our customers to seize control of their data but without the burden of relying on communities or embarking on the arduous journey of constructing complex Kafka clusters and writing intricate code to track data changes.

We are purpose-built for Security, our platform captures telemetry once, improves its quality and usability, and then distributes it to multiple destinations - streamlining cybersecurity operations and data analytics.

DataBahn seamlessly ingests data from multiple feeds, aggregates compresses, reduces, and intelligently routes it. With advanced capabilities, it standardizes, enriches, correlates, and normalizes the data before transferring a comprehensive time-series dataset to your data lake, SIEM, UEBA, AI/ML, or any downstream platform.


DataBahn offers continuous ML and AI-powered insights and recommendations on the data collected to unlock maximum visibility and ROI. Our platform natively comes with

  • Out-of-the-box connectors and integrations:- DataBahn offers effortless integration and plug-and-play connectivity with a wide array of products and devices, allowing SOCs to swiftly adapt to new data sources.
  • Threat Research Enabled Filtering Rules:- Pre-configured filtering rules, underpinned by comprehensive threat research, guarantee a minimum volume reduction of 35%, enhancing data relevance for analysis.
  • Enrichment support against Multiple Contexts:- DataBahn enriches data against various contexts including Threat Intelligence, User, Asset, and Geo-location, providing a contextualized view of the data for precise threat identification.
  • Format Conversion and Schema Monitoring:- The platform supports seamless conversion into popular data formats like CIM, OCSF, CEF, and others, facilitating faster downstream onboarding. It intelligently monitors log schema changes for proactive adaptability.
  • Schema Drift Detection:- Detect changes to log schema intelligently for proactive adaptability.
  • Sensitive data detection:- Identify, isolate, and mask sensitive data ensuring data security and compliance.
  • Continuous Support for New Event Types:- DataBahn provides continuous support for new and unparsed event types, ensuring consistent data processing and adaptability to evolving data sources.

Data orchestration revolutionizes the traditional cybersecurity data architecture by efficiently collecting, normalizing, and enriching data from diverse sources, ensuring that only relevant and purposeful data reaches detection and hunting platforms. Data Orchestration is the next big evolution in cybersecurity, that gives Security teams both control and flexibility simultaneously, with agility and cost-efficiency.

5 min read

The Ultimate Guide to Microsoft Sentinel Optimization for Enterprises

Slash Microsoft Sentinel SIEM pricing & Cost Reduction! Master Microsoft Sentinel SIEM optimization! Learn how to Cost Reduction, improve threat detection & response, and maximize SIEM value. Download our guide for enterprises.
September 2, 2024

The Ultimate Guide to Microsoft Sentinel optimization for Enterprises

Are you struggling with inflating costs and increased time and effort in managing Microsoft Sentinel for your business? Is optimizing data ingestion cost, improving operational efficiency, and saving your team’s time and effort important for your business? With ~13% of the SIEM market according to industry sources, many enterprises across the world are looking for ways to unlock the full potential of this powerful platform.

What is Microsoft Sentinel?

Microsoft Sentinel (formerly known as “Azure Sentinel”) is a popular and scalable cloud-native next-generation security information and event management (“SIEM”) solution and a security orchestration, automation, and response (“SOAR”) platform. It combines a graphical user interface, a comprehensive analytics package, and advanced ML-based functions that help security analysts detect, track, and resolve cybersecurity threats faster.

It delivers a real-time overview of your security information and data movement across your enterprise, providing enhanced cyberthreat detection, investigation, response, and proactive hunting capabilities. Microsoft Sentinel natively incorporates with Microsoft Azure services and is a popular SIEM solution deployed by enterprises using Microsoft Azure cloud solutions.

Find out how using DataBahn’s data orchestration can help your Sentinel deployment – download our solution brief here.         DOWNLOAD  

Text Microsoft Sentinel is deployed by companies to manage increasingly sophisticated attacks and threats, the rapid growth of data volumes in alerts, and the long timeframe for resolution.

What is the Microsoft Sentinel advantage?

The four pillars of Microsoft Sentinel

Microsoft Sentinel is built around four pillars to protect your data and IT systems from threats: scalable data collection, enhanced threat detection, AI-based threat investigations, and rapid incident response.

Scalable data collection

Microsoft Sentinel enables multi-source data collection from devices, security sensors, and apps at cloud scale. It allows security teams to create per-user profiles to track and manage activity across the network with customizable policies, access, and app permissions. This enables single-point end-user management and can be used for end-user app testing or test environment with user-connected virtual devices.

Enhanced threat detection

Microsoft Sentinel leverages advanced ML algorithms to search the data going through your systems to identify and detect potential threats. It does this through “anomaly detection” to flag abnormal behavior across users, applications, or app activity patterns. With real-time analytics rules and queries being run every minute, and its “Fusion” correlation engine, it significantly reduces false positives and finds advanced and persistent threats that are otherwise very difficult to detect.

AI-based threat investigations

Microsoft Sentinel delivers a complete and comprehensive security incident investigation and management platform. It maintains a complete and constantly updated case file for every security threat, which are called “Incidents”. The Incidents page in Microsoft Sentinel increases the efficiency of security teams and offers automation rules to perform basic triage on new incidents and assign them to proper personnel, and syncs with Microsoft Defender XDR for simplified and consistent threat documentation.

Rapid incident response

The incident response feature in Microsoft Sentinel helps enterprises respond to incidents faster and increases their ability to investigate malicious activity by up to 50%. It creates advanced reports that make incident investigations easier, and also enables response automations in the form of Playbooks, which are collections of response and remediation actions and logics that are run from Sentinel as a routine.

Benefits of Microsoft Sentinel

Implementing Microsoft Sentinel for your enterprise has the following benefits:

  • Faster threat detection and remediation, reducing the mean time to respond (MTTR)
  • Improved visibility into the origins of threats, and stronger capability for isolating and stopping threats
  • Intelligent reporting that drives better and faster incident responses to improve outcomes
  • Security automation through analytics rules and automations to allow faster data access
  • Analytics and visualization tools to understand and analyze network data
  • Flexible and scalable architecture
  • Real-time incident management

What is Microsoft Sentinel Optimization?

Microsoft Sentinel Optimization is the process of fine-tuning the powerful platform to reduce ingestion costs, improve operational efficiency, and enhancing the overall efficiency, cost-effectiveness, and efficacy of an organization’s cybersecurity team and operations. It addresses how you can manage the solution to ensure optimal performance and security effectiveness while reducing costs and enhancing data visibility, observance, and governance. It involves configuration changes, automated workflows, and use-case driven customizations that help businesses and enterprises get the most value out of the use of Microsoft Sentinel.

Why Optimize your Microsoft Sentinel platform?

Despite the reduction in costs compared to legacy SIEM solutions, Microsoft Sentinel’s cost reduction in data ingestion is still subject to the incredible increase in security data and log volumes. With the volume of data being handled by enterprise security teams growing by more than 20% year-on-year, security and IT teams are finding it difficult to find critical data and information in their systems as mission-critical data is lost in the noise.

Additionally, the explosion in security data volumes also has an impact in terms of costs – SIEM API costs, storage costs, and the effort of managing and routing the data makes it difficult for security teams to allocate bandwidth and budgets to strategic projects.

With proper optimization, you can:

  • Make it faster and easier for security analysts to detect and respond to threats in real-time
  • Prioritize legitimate threats and incidents by reducing false positives
  • Secure your data and systems from cyberattacks more effectively

Benefits of using DataBahn for optimizing Sentinel

Using DataBahn’s Security Data Fabric enables you to improve Microsoft Sentinel ingest to ensure maximum value. Here’s what you can expect:

  • Faster onboarding of sources: With effortless integration and plug-and-play connectivity with a wide array of products and services, SOCs can swiftly integrate with and adapt to new sources of data
  • Resilient Data Collection: Avoid single-point of failures, ensure reliable and consistent ingestion, and manage occasional data volume bursts with DataBahn’s secure mesh architecture
  • Text BoxReduced Costs: DataBahn enables your team to manage the overall costs of your Sentinel deployment by providing a library of purpose-built volume reduction rules that can weed out and less relevant logs.

Find out how DataBahn helped a US Cybersecurity firm save 38% of your SIEM licensing costs in just 2 weeks on their Sentinel deployment.   DOWNLOAD  

Why choose DataBahn for your Sentinel optimization?

Optimizing Microsoft Sentinel requires extensive time and effort from your infrastructure and security teams. Some aspects of the platform also ensure that there will continue to be a requirement to allocate additional bandwidth (integrating new sources, transforming data from different destinations, etc.).

By partnering with DataBahn, you can benefit from DataBahn’s Security Data Fabric platform to create a future-ready security stack that will ensure peak performance and complete optimization of cost while maximizing effectiveness.

  DOWNLOAD  

Featured Authors

No items found.