Modern SecOps
Posts
A Deep Dive into Parsing with KQL

A Deep Dive into Parsing with KQL

Parsing the same firewall log message in 4 different formats (JSON, CEF, BSD Syslog, and Syslog RFC 5424) with full KQL breakdown

Seyed Amirhossein Nouraie
August 19, 2024

What this article covers
- Show me the code!
JSON
- About the format
- Parsing KQL
  - Using parse_json()
  - Using bag_unpack
Common Event Format (CEF)
- About the format
- Parsing KQL
  - Separating headers with split()
  - Parsing the extensions with parse-kv
BSD Syslog (RFC 3164)
- About the format
- Parsing KQL
RFC 5424 Syslog
- About the format
- Parsing KQL
Glossary of KQL Functions

What this article covers

I’ll dive deep here. Explaining each log format and the logic behind the parsing KQL.

To keep things useful, the focus is on parsing real-world messages.

Each section is dedicated to a message format.

First, we break down the format itself. Then we build the KQL required to parse it from scratch.

Show me the code!

Wanna skip the explanation and get straight to the KQL?

I don’t blame you... Here it is on my GitHub:

https://github.com/seyed-nouraie/SecOps/blob/main/Sentinel/KQL/parsers/README.md#code

JSON

About the format

{
    "Version": "CEF:0",
    "Vendor": "Palo Alto Networks",
    "DeviceProduct": "PAN-OS",
    "DeviceVersion": "9.0",
    "DeviceEventClassId": "10001",
    "Name": "TRAFFIC",
    "Severity": "5",
    "src": "192.168.10.10",
    "dst": "192.168.20.20",
    "spt": "443",
    "dpt": "80",
    "proto": "TCP",
    "cat": "local_website",
    "act": "ALLOW",
    "msg": "Allowed traffic from source to destination"
}

Good ol’ JSON

JSON is semi-structured. So it has a strict key-value format but no schema.

The no-schema part just means that each variable can be any supported data type (i.e.: just because “Name” is a string in one row it doesn’t have to be a string in the next.)

Parsing KQL

JSON only takes two functions to parse in Sentinel.

Using parse_json()

First we need to tell Sentinel to interpret that RawData string column as a dynamic property bag.

A… what bag?

We can think of dynamic property bags like JSON. Mapping a string (the index) to a value.

Here’s how:

extend parse_json(RawData)

Using bag_unpack

Now that we have our bag, we need to convert it to a table.

To do that, we use the bag_unpack plug-in. This plug-in converts each index (our JSON keys) to a column. Like so:

evaluate bag_unpack(RawData)

And there you have it!

Two lines of KQL… only if life were always this easy.

my_table
| extend parse_json(RawData)
| evaluate bag_unpack(RawData)

Common Event Format (CEF)

About the format

CEF:0|Palo Alto Networks|PAN-OS|9.0|100001|TRAFFIC|5|src=192.168.10.10 dst=192.168.20.20 spt=443 dpt=80 proto=TCP cat=/Security/Application/Firewall act=ALLOW msg=Allowed traffic from source to destination

With CEF, we have to do a little bit more work.

CEF (common event format) is a structured data format. It has a strict format, and a schema.

The schema part means that the src field must always be a string, spt always an integer, and so on.

Here is a diagram breaking CEF down:

A diagram showing the example CEF message: "CEF:0|Palo Alto Networks|PAN-OS|9.0|100001|TRAFFIC|5|src=192.168.10.10 dst=192.168.20.20 spt=443 dpt=80 proto=TCP cat=local_website act=ALLOW msg=Allowed traffic from source to destination" and labels pointing to each part of the message as delimited by the pipe, | symbol. The parts are Version, Device Vendor, Device Product, Deivce Event Class ID, Name, and severity for the header and Key=Value pairs for the extensions.

Sample Common Event Format Log Explained in Depth

CEF has two components, the header and extensions. The header is 7 values separated by a | symbol.

The header has the following metadata about the event:

CEF version
Vendor
Product
Device Version
Event type
Name
Severity

The other part of CEF is the extensions. These are pairs of values in key-value (fieldname=fieldvalue) format.

Usually these fields are configurable on the device sending the logs.

Parsing KQL

Let’s start with separating the header and extensions.

Separating headers with split()

To do this, we will can use the KQL split() function. This function takes a column and a delimiter, and returns an array of values that were separated by that delimiter.

In our case, the column is RawData and the delimiter is the pipe symbol.

Here’s the KQL we need:

extend SplitData = split(RawData, "|")

We store the results in an array column called SplitData. Here is what SplitData looks like:

["CEF:0","Palo Alto Networks","PAN-OS","9.0","100001","TRAFFIC","5","src=192.168.10.10 dst=192.168.20.20 spt=443 dpt=80 proto=TCP cat=/Security/Application/Firewall act=ALLOW msg=Allowed traffic from source to"]

You’ll notice that the first 7 values contain each CEF header, and the last value is the extensions.

How convenient!

Now we can grab those CEF headers from the array using an extend. Since CEF’s header order is always the same, we always know which index each header element will be in.

extend
    Version=SplitData[0],
    Vendor=SplitData[1],
    DeviceProduct=SplitData[2]
...

We have the headers parsed, but we’re not done yet. Now we need to parse the CEF extensions.

Parsing the extensions with parse-kv

Lucky for us, there’s a KQL function we can use to do this. It’s called parse-kv.

What does parse-kv do? It parses key value pairs, of course.

Here’s how:

First, we tell it what column we want to parse.

Since extensions is the last element after using split(), we use the last element of SplitData. To get the last element we use -1 as the index (-2 would give us the second to last element, and so on…)

Next, we define the schema. This determines the column name and the data type of the column. The format is (name: type, name2:type…)

One last thing.

We want to tell the function what character separates each key value pair, and what character separates each key from each value. We do this with the pair and kv delimiters.

Here’s our final parse-kv statement:

parse-kv SplitData[-1] as (src: string, dst: string, spt: int, dpt: int, proto: string, cat: string, act: string, msg: string) with (pair_delimiter=" ", kv_delimiter="=")

And we’re done!

All we have left to do is project-away the unparsed columns, and we have beautiful, parsed, tables!

my_table
// Separate by the | delimiter
| extend SplitData = split(RawData, "|")
// Parse the header from SplitData
| extend
    Version=SplitData[0],
    Vendor=SplitData[1],
    DeviceProduct=SplitData[2],
    DeviceVersion=SplitData[3],
    DeviceEventClassId=SplitData[4],
    Name=SplitData[5],
    Severity=SplitData[6]
// Parse the Key-Value Extensions
| parse-kv SplitData[-1] as (src: string, dst: string, spt: int, dpt: int, proto: string, cat: string, act: string, msg: string) with (pair_delimiter=" ", kv_delimiter="=") 
// Remove un-parsed columns
| project-away SplitData, RawData

BSD Syslog (RFC 3164)

About the format

<34>Aug 13 12:34:56 edgefirewall01 paloalto[1234]: Allowed local_website traffic from source 192.168.10.10:443 to destination 192.168.20.20:80 using TCP

Here’s a diagram for BSD syslog:

A diagram showing the example Syslog message: "<34>Aug 13 12:34:56 edgefirewall01 paloalto[1234]: Allowed local_website traffic from source 192.168.10.10:443 to destination 192.168.20.20:80 using TCP" and labels pointing to each part of the message as follows: <34> as PRI, Aug 13 12:34:56 as the date edgefirewall01 as the hostname, paloalto as the tag, [1234] as the process id, Allowed local_website traffic from source 192.168.10.10:443 to destination 192.168.20.20:80 using TCP as the content unstructured data message.

Sample BSD Syslog Explained in Depth

BSD syslog is also broken up into a header and a message. The header contains the PRI, (combination of severity and facility), timestamp, and hostname while the message contains the tag, process ID, and content.

BSD syslog is tricky.

The trouble is in that unstructured data section, it can be (almost) anything! So there’s only so much we can parse without knowing the vendor’s formatting.

The best we can usually do is parse the headers into columns and the message into its own column.

Parsing KQL

Parsing the fields with parse

First we use the parse function to grab all the fields available.

How does parse work?

Parse works by taking column names and strings that come in between those columns. For example “<“ PRI “>” extracts the values between <34> as the PRI (in this case PRI=34)

Here’s how we use parse to extract all the syslog fields:

parse RawData with "<" Unparsed_PRI:int ">" Unparsed_Month " " Unparsed_Day " " Unparsed_Time " " Hostname " " AppName "[" ProccessId "]" ":" Msg

Not too bad, right?

Calculating the year with format_datetime() and math

This second step is to parse the datetime.

This one’s a little tricky.

The authors of RFC 3164 decided not to include the year in the time. Back then, keeping logs around for long periods of time wasn’t as popular.

Can you imagine!

So, we need to calculate the year. We use the following logic to do that:

Most of the time, the year the log was ingested was the same year the log was generated. To get that year, we use format_datetime(ingestion_time(), “yyyy”)

Let’s get the year and month the log was ingested (we will use month shortly):

extend
    Current_Year=format_datetime(ingestion_time(), "yyyy"),
    Current_Month=format_datetime(ingestion_time(), "M")

But that’s not the full story….

What if the log is generated a minute before new year’s, and there is a 2 minute delay?

Then our calculated datetime would be ahead by a full year!

Here’s the trick we use.

When that happens, the ingestion month is less than the syslog timestamp month. That means the event occurred before it was ingest. That’s impossible!

So we just dial the year back by one, and there we have it!

Here’s that logic in KQL:

extend Unparsed_Year=iff(toint(Current_Month) < toint(Unparsed_Month), tostring(toint(Current_Year) - 1), Current_Year)

Keep in mind this doesn’t work if there’s an ingestion delay of over 1 year. But then you have bigger concerns…

Now we have the full datetime and we can stitch it together using strcat().

Calculating the PRI with some more math

We also need to do some math to work our way back from the PRI to facility and severity

Just follow these formulas

Facility=floor(PRI) //We use the round() function
Severity=PRI - (Severity*8)

Here’s that in KQL:

extend Facility=round(Unparsed_PRI/8)
// Put together the datetime field
| extend Severity=Unparsed_PRI-(Facility*8)

And there you have it, you’ve just parsed BSD syslog!

my_table
// Extract all the fields available
| parse RawData with "<" Unparsed_PRI:int ">" Unparsed_Month " " Unparsed_Day " " Unparsed_Time " " Hostname " " AppName "[" ProccessId "]" ":" Msg
// Calculate the current year and current month
| extend
    Current_Year=format_datetime(ingestion_time(), "yyyy"),
    Current_Month=format_datetime(ingestion_time(), "M")
// Calculate the log year, accounting for ingestion delay
// Also calculate the severity and facility
| extend Unparsed_Year=iff(toint(Current_Month) < toint(Unparsed_Month), tostring(toint(Current_Year) - 1), Current_Year), Facility=round(Unparsed_PRI/8)
// Put together the datetime field
| extend DateTime=(strcat(Unparsed_Month, " ", Unparsed_Day, " ", Unparsed_Time, " ", Unparsed_Year)), Severity=Unparsed_PRI-(Facility*8)
// Get rid of unparsed content and put DateTime first
| project-away RawData, Unparsed_*, Current_*
| project-reorder DateTime

RFC 5424 Syslog

About the format

<165>1 2024-08-12T12:34:56.789Z edgefirewall01 paloalto 1234 ID47 [connection@32473 src="192.168.10.10" dst="192.168.20.20" spt="443" dpt="80" proto="TCP" cat="local_website" act="ALLOW"][protocol@32473 class="TCP"] Allowed traffic from source to destination

This syslog format is more flexible, since the authors were able to learn from the prior RFC.

The diagram below for reference:

A diagram showing the example CEF message: "<165>1 2024-08-12T12:34:56.789Z firewall01 paloalto 12 ID47 [connection@32473 src="192.168.10.10"] Allowed traffic from..." and labels pointing to each part of the message as follows: <34> as PRI, 1 as version, 2024-08-12T12:34:56.789Z as the timestamp, firewall01 as the hostname, paloalto as the appname, 12 as the process id, ID47 as the msg id, [connection@32473 src="192.168.10.10"] as the structured data elements, Allowed traffic from... as the message

Sample RFC 5424 Syslog Explained in Depth

There are a few key differences:

First, we have the full timestamp (finally!)
Second, RFC 5424 introduced structured elements

What are structured elements?

They’re very similar to the key-value data in CEF.

Each structured element starts with a SDID. The SDID is like a namespace for what goes inside that element.

IANA has their own reserved SDIDs but organizations can also make their own SDIDs by using their private enterprise number. When an SDID has an @ symbol, what comes before the symbol is the SDID and what comes after is the private enterprise number.

Why do SDIDs matter?

SDIDs define the schema of the data inside the structured data element, called structured data params. These params are the key-value data that we want to extract.

Now that we got that out of the way, let’s get to parsing.

Parsing KQL

This query should look familiar, because we’re just combining the two queries from before.

Look at the query below, everything in the query is something we’ve worked on before! We just used the parse function to extract our structured elements, and the parse-kv function to parse the columns inside.

No new functions are added here.. but you get to parse a lot more columns thanks to RFC 5424!

my_table
// Extract all fields available
| parse RawData with "<" Unparsed_PRI: int ">" Version " " DateTime: datetime " " Hostname " " AppName " " ProcessId " " EventId "[" * " " Unparsed_ConnectionEvent "][" * " " Unparsed_ProtocolEvent "] " Msg
// Parse structured data key-value pairs
| parse-kv Unparsed_ConnectionEvent as (src: string, dst: string, spt: string, dpt: string, proto: string, cat: string, act: string) with (pair_delimiter=" ", kv_delimiter="=")
| parse-kv Unparsed_ProtocolEvent as (protocol: string)  with (pair_delimiter=" ", kv_delimiter="=")
// Calculate facility and severity from PRI
| extend Facility=round(toint(Unparsed_PRI) / 8)
| extend Severity=Unparsed_PRI - (Facility* 8)
// Get rid of unparsed content
| project-away RawData, Unparsed_*, Current_*

Congratulations on getting through all that. It wasn’t easy, but you’ve leveled up your Sentinel parsing logic.

This is by no means the end of KQL’s parsing capabilities. Let me know if you want to see more advanced commands in part 2.

And.. happy parsing!

Don’t want to miss new content from Modern SecOps? Subscribe for free below:

Glossary of KQL Functions

Enjoyed the article (even a little bit)? Follow me on LinkedIn on to hear more of my rants: https://www.linkedin.com/in/nouraie/

Reply

or to participate.

A Deep Dive into Parsing with KQL

Parsing the same firewall log message in 4 different formats (JSON, CEF, BSD Syslog, and Syslog RFC 5424) with full KQL breakdown

Table of Contents

What this article covers

Show me the code!

JSON

About the format

Parsing KQL

Using parse_json()

Using bag_unpack

Common Event Format (CEF)

About the format

Parsing KQL

Separating headers with split()

Parsing the extensions with parse-kv

BSD Syslog (RFC 3164)

About the format

Parsing KQL

Parsing the fields with parse

Calculating the year with format_datetime() and math

Calculating the PRI with some more math

RFC 5424 Syslog

About the format

Parsing KQL

Glossary of KQL Functions

Reply