Spectre and Meltdown | The Data Science Approach

Chapter 1

Introduction

Data science in cybersecurity is a rapidly growing field. At Capsule8, the data science team works in tandem with the security research team to build state-of-the-art detection models against the latest threats.

In machine learning, feature engineering is the secret sauce. The advantage for us, given the strength of our security research team, is that there is no dearth of interesting features for our models. Our target was one particular set of attacks for which Capsule8 released an open source detector the very next day (read our blog on this for more): Spectre and Meltdown. What if we could build a machine learning model that uses the features from the current deterministic detector, thereby providing better detection? Would the accuracy increase significantly? Could it keep false positives at an acceptable level? Is it production worthy?

So many questions! Well, the short answer to all of them is yes.

Background

In early 2018, details began to emerge about a vulnerability in Intel microprocessors, and not just one version of the chip: the weakness exists in effectively every Intel chip built since 1995. The root cause is a core design flaw in the chip itself, and nearly every computer on the planet has the vulnerability.

 

Chapter 2

The Actors

The flaw emerges as the result of several design decisions in the Intel microprocessor:

Shared memory cache. In contemporary microprocessors, several computational tasks must use the same hardware resources cooperatively. This sharing is made possible by a number of intricate mechanisms in the hardware and the operating system. Cooperative multitasking is essential for machines that run server workloads, since servers need to support a variety of different kinds of tasks at any one time. The memory cache is one hardware resource that all the tasks share, and it's a key player in the Spectre and Meltdown vulnerabilities.

Task isolation. In cooperative multi-tasking, the tasks need to be isolated from each other. This is especially important with regard to task memory. For example, a lower priority process should not be able to freely read memory contained in a higher priority process. The latter process could have access to sensitive data such as passwords, encryption keys, etc. A key component of cooperative multitasking is a mechanism that isolates and protects the memory of individual tasks.

Speculative execution. A task is composed of a sequence of instructions. At any point in time, a long-running task is executing a specific instruction in its linear timeline of instructions, and the processor has likely already pre-fetched the instructions the task will need next. Because the processor has spare capacity, it can "speculatively" execute some of that future code and keep the results around in the speculative execution engine. When the task eventually reaches those instructions, it first checks whether they were already executed speculatively; if so, it uses the pre-computed result instead of running the computation again. Speculative execution is an important speed-up and exists in almost every microprocessor on the planet.

Chapter 3

The Problem

Those three design features can be abused by malicious code, resulting in a set of vulnerabilities we now call Spectre and Meltdown. A hacker can craft a malicious program that uses the vulnerability to reveal secrets (passwords, encryption keys, etc.) contained in higher-priority tasks, such as the operating system task itself.

Here’s how it’s accomplished:

1. The malicious program attempts to read highly privileged memory. In this case, task isolation and memory protection kick in and cause the program to fail and exit (called a fault). It's a specific fault called a "memory protection fault."

2. The malicious program relies on the fact that the microprocessor has speculatively executed some of the program's code before the fault occurs and the program exits. The speculatively executed code can freely access higher-priority memory. When the actual fault and exit eventually happen, the speculatively executed results (the secrets) are cleared as well. But there is a small window of opportunity in which the task can leak the secrets, if it figures out how to do so before the exit.

3. It leaks the secrets through the shared cache. It cannot leak the data directly, because the cache respects task and memory isolation; in fact, once the faulting program exits, the task's entries are cleared from the cache. Instead, the malicious code instruments the cache to have a particular "shape." For example, it can cause certain cache elements to be filled while making sure others stay empty. The pattern is an encoding of the secret. Another part of the malicious code reads the pattern of the cache and decodes the secret. Thus, the data is leaked indirectly through the instrumented cache pattern. This is called a cache side-channel attack.


Chapter 4

Detection

As you can see, a lot of orchestration needs to happen in order to take advantage of the vulnerability and leak sensitive data. It would seem there is plenty of opportunity to detect an active attack, assuming you could monitor 1) the instructions the processor is running, 2) the code running in the speculative execution engine, or 3) even how the cache is being instrumented to encode and decode secrets.

The problem is that modern microprocessors are very fast. Today's chips run at several gigahertz, executing billions of instructions per second. Actively monitoring and logging any of these extremely low-level hardware activities would be too taxing for the system, in addition to producing far too much data if logged somewhere. Microprocessors have been designed and optimized to run instructions very, very fast, not to be actively monitored.

Fortunately, Intel chips have a feature called "hardware performance counters." The counters collect statistics about particular hardware operations, such as total counts of instructions run and even some aggregate statistics related to the cache. These are just coarse statistics, just counts. They were mainly put in place to help compiler writers optimize higher-level programs to use the hardware effectively. Fortunately, these counters remain present and enabled in most Intel chips, and they can be effective in detecting many types of Spectre and Meltdown attacks.

Chapter 5

Performance Counters

How do you get access to these performance counters? Fortunately, recent versions of the Linux kernel expose access to the hardware counters. This is very useful, because properly programming the counters can require a lot of advanced knowledge of hardware internals and very low-level code.

The following shows an example of using the "perf" utility at a command-line shell to acquire some statistics about the hardware cache:

perf stat -e cache-misses,cache-references -x, ./spectre_poc
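The raw counter values can then be turned into the statistics we care about, such as the cache-miss ratio. Here is a minimal sketch of that step in Python, assuming ./spectre_poc is the binary being profiled and using perf's CSV output from the -x, flag above; this is illustrative glue code, not Capsule8's detector:

import subprocess

def cache_miss_ratio(target="./spectre_poc"):
    # Run perf stat on the target; -x, produces CSV-like lines of the form
    # value,unit,event-name,... which perf stat writes to stderr.
    result = subprocess.run(
        ["perf", "stat", "-e", "cache-misses,cache-references", "-x", ",", target],
        capture_output=True, text=True,
    )
    counts = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    return counts["cache-misses"] / counts["cache-references"]

print("cache-miss ratio:", cache_miss_ratio())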

Chapter 6

Cache Statistics

When it comes to Spectre and Meltdown, which hardware counters can help us detect an active attack? We explored several hardware counters and settled on the following:

  1. cache-misses
  2. cache-references
  3. branch-misses

We recorded the hardware counters under two different conditions, one while under attack and one while running baseline programs:

Proof-of-concept attacks:

  1. https://github.com/IAIK/meltdown
  2. https://github.com/crozone/SpectrePoC
  3. https://gist.github.com/ErikAugust/724d4a969fb2c6ae1bbd7b2a9e3d4bb6
  4. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528

Example baseline programs:

  1. Libjit unit tests
  2. ML benchmarks
  3. Kernel Compile

We chose these programs for a few reasons. For one, the ML benchmarks showed performance counter ratios similar to those of the PoCs while being true negatives; these, coupled with the other normal programs, made for a good baseline dataset.
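To turn the recordings into training data, each sampling window of counter values becomes one labeled example: label 1 for windows recorded while a PoC ran, label 0 for the baseline programs. Here is a minimal sketch of that step; the directory layout and CSV column names are assumptions for illustration, not our actual pipeline:

import csv
import glob

def load_windows(pattern, label):
    # Each CSV row is assumed to hold the raw counter values for one sampling window.
    examples = []
    for path in glob.glob(pattern):
        with open(path) as f:
            for row in csv.DictReader(f):
                ratio = int(row["cache_misses"]) / int(row["cache_references"])
                examples.append(([ratio], label))
    return examples

data = load_windows("recordings/poc/*.csv", 1) + load_windows("recordings/baseline/*.csv", 0)
X = [features for features, _ in data]
y = [label for _, label in data]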

The following graphs plot the cache-miss ratio statistics during the execution of the tasks above, the first three from the attack PoCs and the last three from a few of the baseline programs. The green dotted lines in the plots show the threshold used by Capsule8's open source detector, which is a cache-miss ratio of 0.97. (Note that the time in ms on the x-axis is not on the same scale in all subplots because the programs run for different lengths of time.)
Figure: Cache-miss ratio for various programs

Chapter 7

Analysis

One thing that is clear is that the cache-miss ratio is much higher while the PoC tasks execute. This led us to believe that a simple threshold check might be enough to detect an active attack.

To formalize this concept, we decided to perform a simple linear separation of the data, trying both perceptrons and linear support vector machines.
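In scikit-learn terms, that step looks roughly like the sketch below. The single feature is the cache-miss ratio, and the feature values here are illustrative placeholders rather than our recorded measurements:

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

# One feature per window: the cache-miss ratio (placeholder values).
X = np.array([[0.30], [0.45], [0.55], [0.62],     # baseline windows
              [0.98], [0.985], [0.99], [0.995]])  # windows during a PoC run
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

for model in (Perceptron(), LinearSVC()):
    model.fit(X, y)
    # With a single feature, the decision boundary w*x + b = 0 gives the
    # learned threshold x = -b / w.
    threshold = -model.intercept_[0] / model.coef_[0][0]
    print(type(model).__name__, "threshold:", round(threshold, 3))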

We found that these simple models were highly effective ways to obtain the threshold given the training set. For example, a linear SVM learned the following threshold:

Figure: Linear SVM with cache-miss ratio as its feature


Chapter 8

Feature Engineering and Final Model

Another feature crucial to improving the model's performance is the ratio of cache-misses to branch-misses. The final model we chose was a support vector machine with an RBF kernel, and the SVM plot looks as follows (x and y axis labels/ticks removed, but you get the idea!):

Figure: SVM decision boundary (partial)

The blue shaded region is what the SVM learned as inliers, and the dark brown/red region as anomalies.
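The exact training setup isn't spelled out here (the inlier/anomaly framing could equally come from a one-class formulation), so the sketch below assumes a supervised two-class SVM with an RBF kernel over the two features; the numbers are placeholders, not recorded measurements:

import numpy as np
from sklearn.svm import SVC

# Columns: [cache-miss ratio, cache-misses / branch-misses] (placeholder values).
X = np.array([[0.30, 0.8], [0.45, 1.1], [0.55, 0.9], [0.62, 1.3],     # baseline windows
              [0.98, 4.0], [0.985, 5.2], [0.99, 6.1], [0.995, 4.8]])  # PoC windows
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict([[0.50, 1.0], [0.99, 5.0]]))  # expect one inlier, one anomaly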

Sampling rate

It's also important to consider the sampling rate, which was the most influential factor in the FP/FN balance. If the sampling rate is too high (perf counters measured too frequently), running the detector itself can produce a high cache-miss ratio, resulting in false positives. If the sampling rate is too low, we could miss the window in which the attack happens. So testing the detector under various sampling rates to find the optimal value was very important if this was to be deployed in production.
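To make the role of the sampling interval concrete, here is a rough sketch of a polling loop that samples system-wide counters every INTERVAL_MS milliseconds and applies the 0.97 cache-miss-ratio threshold. It is only an illustration (the system-wide perf invocation typically needs elevated privileges), not the production detector:

import subprocess

THRESHOLD = 0.97     # cache-miss ratio threshold used by the open source detector
INTERVAL_MS = 100    # the sampling interval to tune against the FP/FN balance

def sample_window(interval_ms=INTERVAL_MS):
    # Count cache events system-wide for one window by measuring a sleep;
    # perf stat -x, writes CSV lines (value,unit,event,...) to stderr.
    out = subprocess.run(
        ["perf", "stat", "-a", "-e", "cache-misses,cache-references",
         "-x", ",", "--", "sleep", str(interval_ms / 1000.0)],
        capture_output=True, text=True,
    ).stderr
    counts = {}
    for line in out.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    return counts

while True:
    counts = sample_window()
    ratio = counts["cache-misses"] / counts["cache-references"]
    if ratio > THRESHOLD:
        print("possible Spectre/Meltdown activity, cache-miss ratio:", round(ratio, 3))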

How model interpretability helped in getting it integrated into the product: Capsule8 Analytics

We hear a lot about the importance of interpretable machine learning models, and in cybersecurity it is of the highest importance. A black box with no explanation of what the detector is doing is a security threat in itself. Using a simple support vector machine, from which deterministic rules could be derived, meant that we knew exactly how the distinguishing features would behave when an attack happens, and that's very valuable to the user.

Chapter 9

What's Next?

Different variants of Spectre and Meltdown attacks keep materializing. So we'll continue to do more research and more feature engineering, keep improving the model to stay on top of the attacks, and share what we find.

If you have any additional questions about the Spectre and Meltdown vulnerabilities, or how our detection works, feel free to reach out.

