Dezible

Hello! My name is Aditya Bhardwaj.

Dezible is a project where I build a SIEM from scratch.

What is a SIEM, you might ask? The following is a nice definition¹.


Security Information and Event Management (SIEM) is an essential tool that collects, stores, manages, and analyzes extensive volumes of log data from across an organization’s network, generating security alerts when needed.

Every SIEM does five fundamental things:

  • Data collection and normalization
  • Storage and indexing
  • Detection and analysis
  • Response and orchestration
  • Intelligence and visualization

With these capabilities, security teams can assess, triage, and respond to security events.

I do not want to reinvent the wheel, but rather gain insight into the components that make up a SIEM.

LLMs and AI agents will be useful in extending the capabilities of such a SIEM, though the first goal is to work out the basics.


Dezible is a work in progress.

From scratch?

Actually... not from scratch.

The five fundamental components of any SIEM remain the same (as listed above), but I will use MSTICPy as the starting point.

MSTIC, the Microsoft Threat Intelligence Center, created this library to query log data, enrich it, and then analyze it. From a Jupyter notebook, users can reuse these functionalities in security investigations and hunting.
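To give a flavour of what this looks like in practice, here is a minimal notebook cell based on msticpy’s documented entry points. The `MSSentinel` provider name is just one example; which providers you can actually use depends on your msticpy configuration.

```python
# Minimal msticpy bootstrap in a Jupyter notebook (msticpy 2.x style).
import msticpy as mp

# init_notebook() loads settings, checks the environment, and imports
# commonly used msticpy components into the notebook namespace.
mp.init_notebook()

# Create a query provider for a data source. "MSSentinel" is one of the
# built-in providers; swap in whichever source your config defines.
qry_prov = mp.QueryProvider("MSSentinel")
```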

Hence, I will shamelessly copy the documentation structure of MSTICPy, follow it, and improve upon it.


Connect with me here: LinkedIn and Google Scholar


Getting started

msticpy is a set of Python tools intended to be used for security investigations and hunting. I aim to extend it and use it with other open-source software (OSS) and data sources.

Installation

The first step is to install the msticpy library and get a feel for it.


Work in progress.

Overview


Work in progress. Links will be added when the section is ready.


Installation

According to the documentation, we need Python 3.8 or later.
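As a minimal sketch of that first step (run from a Jupyter notebook, which is how the msticpy docs expect you to work; from a plain shell the equivalent is `pip install msticpy`):

```python
# Install msticpy into the active notebook kernel.
%pip install msticpy

# Verify the installation (a kernel restart may be needed first).
import msticpy
print(msticpy.__version__)
```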

Motivation

This question was asked on Reddit:


As a SOC analyst, how do you effectively correlate data from multiple sources? It seems like too much manual work.

The poster goes on to explain the challenge with an example scenario of a malware infection, where one has to gather data such as details about the endpoint, the roles/permissions of the user, and Indicators of Compromise (IoCs), among other things.

You can see the full post on the r/cybersecurity subreddit.

While the original poster seems to mix up the concepts of enrichment and correlation, an observation can be made after going through the post and the answers:

If we understand the underlying data and have a common, unified model for it, our job becomes easier.

And of course, industry tools like Splunk address this with CIM (Common Information Model) - a schema that normalizes data from multiple vendors into consistent fields.

CIM provides a unified model for mapping fields across data sources and logs. The raw data remains untouched, but the user can search, browse, and query data from different vendors using consistent, common field names.
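To make the idea concrete, here is a toy sketch of field normalization in this spirit. The vendor names, field names, and mappings below are all invented for illustration; they are not Splunk’s actual CIM definitions.

```python
# Toy normalizer: rename vendor-specific log fields to common field names,
# in the spirit of a Common Information Model. All names are hypothetical.
VENDOR_FIELD_MAPS = {
    "vendor_a_firewall": {"src": "src_ip", "dst": "dest_ip", "act": "action"},
    "vendor_b_proxy": {"clientip": "src_ip", "serverip": "dest_ip", "verdict": "action"},
}

def normalize(event: dict, source: str) -> dict:
    """Return a copy of `event` with fields renamed to the common model."""
    mapping = VENDOR_FIELD_MAPS[source]
    return {mapping.get(key, key): value for key, value in event.items()}

# Two differently shaped raw events...
fw_event = {"src": "10.0.0.5", "dst": "203.0.113.9", "act": "blocked"}
px_event = {"clientip": "10.0.0.5", "serverip": "203.0.113.9", "verdict": "allowed"}

# ...become queryable with one consistent set of field names.
print(normalize(fw_event, "vendor_a_firewall")["src_ip"])  # 10.0.0.5
print(normalize(px_event, "vendor_b_proxy")["src_ip"])     # 10.0.0.5
```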

There might be other solutions like CIM that tackle the issue of standardized mapping, but I asked myself these questions:


Can we truly reason over these logs? What kind of theory would enable that? Are there underlying concepts in these events that are yet to be uncovered or explained?

This is what I explore in this project: Dezible.

A unified data model tells us how to structure fields in event logs, but it doesn’t tell us what these events mean or how they fundamentally relate.

Perhaps there is scope for building a model or framework that captures not just mappings, but also the relationships between such events.

and after that…

Once we understand our data better, we need a platform-agnostic way of using it to conduct security investigations.

SIEM tools are rigid and bloated; they demand significant overhead, including maintaining the infrastructure that supports them.

How can we reduce this overhead?

Can we build better, reusable functionality that analysts can share, use, and learn from?


My name is Aditya Bhardwaj, and I am currently a PhD candidate at the University of Twente. You can find me here: LinkedIn profile.