Advanced KQL Detection - Similarity Matching

Using the Jaccard index in KQL to find threats hidden in the patterns

The Power Behind Patterns

Patterns are present everywhere you look. You just have to know how to find them.

Today, I want to train your eyes to catch a specific type of pattern. A type of pattern that, when implemented through a few simple detections, will not only help you catch attackers, but do so even against evasive techniques.

That pattern is similarity through time.

Usually this pattern is a clue that attackers are walking a path, taking small steps through our environment. Each step is close to the previous one, and the distance between them is small.

Here’s a very basic example:

An attacker is enumerating our directories. Logs show us something like…

  • downloads\folder1

  • downloads\folder2

  • downloads\folder4

Obvious to our eyes, but do you have a rule for catching this?

Let’s build one with a real world example.

Phishing Domain Detection

A smart attacker tries to find your weakest entry point, and oftentimes that starts with the clicking of a link.

Who doesn’t love getting users to click links?

I might not click on a link with random numbers or letters… but what if someone stood up a website at m0dernsecops.com? How about modernssecops.com?

Even I’m tempted to click on that.

These examples should be basic enough for a simple rule to catch. We want to build a rule that measures how similar the URLs we see are to our own domain.

Will this rule catch everything? No.

Is this rule intelligent enough to catch visually similar characters? Unfortunately… definitely not.

What this rule is, though, is a first step towards better pattern-based detections.

Let’s start with web session data:

timestamp [UTC] | username | url
9/1/2024, 12:00:00.000 AM | Seyed | https://m0dernsecops.com/
9/1/2024, 12:00:00.000 AM | Seyed | https://modernsecops.com/

That first URL looks, to say the least, a tad bit suspicious.

To catch this we need a KQL function. A function that we will use throughout this article.

That function is jaccard_index.

Intro to Jaccard Index

Here is a fancy formula for Jaccard Similarity:

J(A, B) = |A ∩ B| / |A ∪ B|

Here’s what that means in English.

We have two sets, A and B.

For example, A is [1,2,3,4,5] and B is [1,2,3,4,4]. Because these are sets, the repeated 4 in B counts only once.

The Jaccard similarity is the size of the intersection of the two sets divided by the size of the union of the two sets.

In this example the intersection of the two sets is [1,2,3,4] and the union is [1,2,3,4,5]

Therefore the Jaccard index is 4/5, or 0.8.
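If you want to check that arithmetic yourself, here is a minimal sketch that builds the intersection and union by hand and compares the result with the built-in function. The column names (intersection, union_set, manual, builtin) are made up for illustration.

```kusto
// Sets from the example above; B contains a duplicate 4 on purpose.
print A = dynamic([1, 2, 3, 4, 5]), B = dynamic([1, 2, 3, 4, 4])
| extend
    intersection = set_intersect(A, B),   // distinct values present in both arrays
    union_set    = set_union(A, B)        // distinct values present in either array
| extend
    manual  = todouble(array_length(intersection)) / array_length(union_set),
    builtin = jaccard_index(A, B)
```

Both manual and builtin should come out as 0.8, since the duplicate 4 is ignored on both paths.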

Now let’s build the KQL for detecting this similar domain…

KQL

First, we need to convert the URLs into numbers. To do that, we can use the to_utf8 function:

let data = datatable(url: string)[
    "https://modernsecops.com",
    "https://m0dernsecops.com",
    "https://somerandomwebsite.com"
];
data
// name the new column so it is easy to reference later
| extend utf8 = to_utf8(url)

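This function outputs a dynamic array holding one numeric character code for each character of our string. For "https://modernsecops.com", the array begins 104, 116, 116, 112, 115, which are the codes for h, t, t, p, s.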
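To get a feel for the shape of that output, here is a small sketch on a short string. The value "m0dern" and the column names are only for illustration; make_string is used to turn the codes back into text.

```kusto
print fragment = "m0dern"
| extend codes = to_utf8(fragment)          // [109, 48, 100, 101, 114, 110]
| extend round_trip = make_string(codes)    // back to "m0dern"
```

Notice the 48 in second position: that is the digit 0 standing in for the letter o, whose code is 111.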

Now that we have an array of numbers, we can use that array to calculate the jaccard index. Here’s how:

let data = datatable(url: string)[
    "https://modernsecops.com",
    "https://m0dernsecops.com",
    "https://website.com"
];
data
| extend utf8 = to_utf8(url)
| project url, jaccard = jaccard_index(utf8, to_utf8("https://modernsecops.com"))

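Notice that in this case we compare each string to our main domain, and the results look reasonable: the exact match scores 1, the m0dern look-alike scores roughly 0.93, and the unrelated site scores roughly 0.65.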
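To turn that score into something closer to a detection, one option is to compare only the host part of each URL and keep rows that are similar to our domain without being our domain. This is a sketch, not a tuned rule: legit_host, threshold, and the sample rows are assumptions chosen for illustration.

```kusto
// Assumptions: legit_host and threshold are illustrative values, not tuned ones.
let legit_host = "modernsecops.com";
let threshold = 0.8;
let data = datatable(url: string)[
    "https://modernsecops.com/",
    "https://m0dernsecops.com/",
    "https://modernssecops.com/",
    "https://somerandomwebsite.com/"
];
data
| extend host = tostring(parse_url(url).Host)    // compare hosts only, not scheme or path
| extend similarity = jaccard_index(to_utf8(host), to_utf8(legit_host))
| where host != legit_host and similarity >= threshold   // similar, but not our own domain
| project url, host, similarity
```

On these sample rows the filter should keep only the two look-alike hosts. Dropping the scheme before comparing matters, because the shared "https://" characters would otherwise push every score up.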

Now that we have the fundamentals, let’s move on to a slightly more advanced example: a web crawler.

Web Directory Enumeration

Crawlers can crawl through your directories, looking for valid paths and valid files.

A crawler could take the following path:

  • https://modernsecops.com/index/file.html

  • https://modernsecops.com/index/file2.html

  • https://modernsecops.com/index/file3.html

Easy enough, right? Of course we have rules to catch this… right?

If you don’t, don’t worry, we can make one together.

Let’s start with a table that looks like this:

timestamp [UTC] | username | url
9/1/2024, 12:00:00.000 AM | Seyed | https://modernsecops.com/index/home.html
9/3/2024, 12:00:00.000 AM | Seyed | https://modernsecops.com/index/about.html
9/1/2024, 12:00:00.000 AM | Bad guy | https://modernsecops.com/index/file.html
9/2/2024, 12:00:00.000 AM | Bad guy | https://modernsecops.com/index/file1.html
9/7/2024, 12:00:00.000 AM | Bad guy | https://modernsecops.com/index/file3.html

In this case, we don’t want to compare every row with the same fixed URL, but with the row right after it. For that, we need the next function in KQL, which returns a column’s value from the following row. Here’s how to use it:

let data = datatable(timestamp: datetime, username: string, url: string)[
    datetime(2024-09-01), "Bad guy", "https://modernsecops.com/index/file.1.html",
    datetime(2024-09-01), "Seyed", "https://modernsecops.com/index/home.html",
    datetime(2024-09-02), "Bad guy", "https://modernsecops.com/index/file1.html",
    datetime(2024-09-03), "Seyed", "https://modernsecops.com/index/about.html",
    datetime(2024-09-07), "Bad guy", "https://modernsecops.com/index/file3.html"
];
data
// sort puts the rows in a defined order so next() has something to walk;
// username uses the default direction (descending), timestamp is ascending
| sort by username, timestamp asc
| project next_url = next(url)

Notice that we have to use sort before we use next. The next function needs a serialized (ordered) row set, which sort produces, and without a defined order the “next” row would not mean anything.
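If next is new to you, here is a tiny sketch of what it returns, alongside its sibling prev. The three-row table and the column names are hypothetical and exist only to show the behavior.

```kusto
// Hypothetical three-row sample, only to show what next() and prev() return.
datatable(step: int, url: string)[
    1, "/index/file.html",
    2, "/index/file1.html",
    3, "/index/file3.html"
]
| sort by step asc                           // serializes the rows
| extend
    next_url     = next(url),                // value from the following row
    prev_url     = prev(url),                // value from the preceding row
    next_or_none = next(url, 1, "none")      // offset 1, explicit default for the last row
```

On the last row next_url comes back empty because there is no following row, while next_or_none falls back to the supplied default.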

Now, let’s calculate the similarity:

data
| sort by username, timestamp asc
| project
    next_url = next(url),
    url_utf = to_utf8(url),
    url,
    similarity = jaccard_index(to_utf8(next(url)), to_utf8(url)),
    username

But when we look at the output, we have two issues. Can you spot them?

Need a hint? The issues are in the second and last rows…

Last chance before I spoil it…

Ok here they are:

In row 2 we are comparing the similarity of Seyed’s traffic with the Bad guy’s…

and in the last row we compare a URL with an empty URL.

We can fix that with a condition:

Only include the similarity when the next row’s username is the same as the current row’s and when the next URL is not empty. Here’s the KQL:

| extend
    similarity = iif(isnotempty(next_url) and username == next(username), similarity, real(null))

And our results look much better!

To put it all together, let’s gather some metrics. Here’s our final KQL:

let data = datatable(timestamp: datetime, username: string, url: string)[
    datetime(2024-09-01), "Bad guy", "https://modernsecops.com/index/file.1.html",
    datetime(2024-09-01), "Seyed", "https://modernsecops.com/index/home.html",
    datetime(2024-09-02), "Bad guy", "https://modernsecops.com/index/file1.html",
    datetime(2024-09-03), "Seyed", "https://modernsecops.com/index/about.html",
    datetime(2024-09-07), "Bad guy", "https://modernsecops.com/index/file3.html"
];
data
| sort by username, timestamp asc
| project
    next_url = next(url),
    url_utf = to_utf8(url),
    url,
    similarity = jaccard_index(to_utf8(next(url)), to_utf8(url)),
    username
| extend
    similarity = iif(isnotempty(next_url) and username == next(username), similarity, real(null))
| summarize similarity = avg(similarity) by username

And the final output: an average similarity of roughly 0.95 for Bad guy, compared with roughly 0.85 for Seyed.
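From here, one way to move from a metric to an alert is to require a minimum number of consecutive-URL pairs and a minimum average similarity. Treat this as a sketch under assumptions: min_pairs and min_similarity are hypothetical values that would need tuning on real traffic, and the cross-user guard is written as a where filter instead of an iif.

```kusto
let min_pairs = 2;           // hypothetical: need at least two consecutive-URL pairs
let min_similarity = 0.9;    // hypothetical threshold, not a tuned value
let data = datatable(timestamp: datetime, username: string, url: string)[
    datetime(2024-09-01), "Bad guy", "https://modernsecops.com/index/file.1.html",
    datetime(2024-09-01), "Seyed", "https://modernsecops.com/index/home.html",
    datetime(2024-09-02), "Bad guy", "https://modernsecops.com/index/file1.html",
    datetime(2024-09-03), "Seyed", "https://modernsecops.com/index/about.html",
    datetime(2024-09-07), "Bad guy", "https://modernsecops.com/index/file3.html"
];
data
| sort by username asc, timestamp asc
| extend next_url = next(url), next_user = next(username)
| where username == next_user                // drops cross-user pairs and the final row
| extend similarity = jaccard_index(to_utf8(url), to_utf8(next_url))
| summarize pairs = count(), avg_similarity = avg(similarity) by username
| where pairs >= min_pairs and avg_similarity >= min_similarity
```

On the sample rows only Bad guy should remain. Seyed’s single pair still scores fairly high because every URL shares the same scheme, host, and folder, which is a good reason to tune the thresholds before trusting them.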

Now, before you go converting this pattern into a high severity rule, it’d only be fair for me to make you aware of some limitations.

Some Limitations

First, Jaccard index isn’t a perfect similarity measure for what we’re doing.

As an example, can you guess the similarity between these two sets?

[4, 4]…

and [4, 4, 4, 4]?

It’s 100%, because a set ignores repeated values: both collapse to just [4].
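The same blind spot shows up with strings. Here is a short sketch of three comparisons that all come back as a perfect match; the column names and the "abc"/"cba" strings are made up for illustration.

```kusto
print
    repeated_numbers = jaccard_index(dynamic([4, 4]), dynamic([4, 4, 4, 4])),
    doubled_letter   = jaccard_index(to_utf8("modernsecops.com"), to_utf8("modernssecops.com")),
    reordered        = jaccard_index(to_utf8("abc"), to_utf8("cba"))
// all three columns evaluate to 1: repeats and character order are invisible to a set
```

That is why a similarity rule built this way should also exclude exact matches explicitly instead of relying on the score alone.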

And what if the attacker mixes things up?

So instead of taking small steps, they take random leaps through files?

There are some solutions for these problems:

We can always use more sophisticated functions (even vector embeddings) for similarity measures.

And we can still try to capture an attacker’s similarity across a whole session, not individual steps.

Both of those are topics for the future, which you’ll be ready for now that you’ve sharpened your KQL pattern catching skills.

To make sure you don’t miss those advanced articles, subscribe with the link below:
