The Kernel Trilogy, Volume III: Undefined Behavior

Beyond the boundaries of the specification, finding another possibility, and its cost.

This article was originally published in 《2024秋冬季开源操作系统训练营第三阶段总结-catme0w》 (2024 Autumn/Winter Open-Source OS Training Camp, Phase 3 Summary - catme0w).


Hello, I’m catme0w, here to share my experience with the memory allocator challenge.

Also check out my Phase 2 article: 2024秋冬季开源操作系统训练营第一、二阶段总结-catme0w (2024 Autumn/Winter Open-Source OS Training Camp, Phase 1 & 2 Summary - catme0w)

Round 1: A Better TLSF Allocator

When I first got the problem, I wanted to try improving a “normal allocator” - that is, implementing a genuinely general-purpose one. At the time I felt that kernel space makes tricks too easy, and scoring points through abnormal means would be too cheap to be any fun.

After a quick investigation, I realized that rlsf, the library behind the TLSF implementation, is already one of the better-performing allocator implementations out there: https://github.com/yvt/rlsf#measured-performance. I tried adjusting rlsf’s parameters (FLLEN and SLLEN, the first- and second-level bitmap lengths) and saw no change at all - I couldn’t break past 170.

Trying to optimize TLSF looked futile; that route was declared a failure.

Round 2: Test-Case-Oriented Allocator

Well, I had to start writing my own thing from scratch.

The first task was to understand what determines the score - that is, what sets the upper limit of the indicator count - and, beyond that, how allocation behavior and memory fragmentation change at each step.

It’s not hard to see that the test case frees even-indexed blocks while keeping odd-indexed ones.

From this, we can attack exactly that point: study the test case’s allocation pattern, check each allocation to see whether it belongs to the batch that will never be freed, and lay those blocks out contiguously in a dedicated region. Fragmentation then never occurs at all, and utilization reaches an astonishing 100%.
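The segregation idea can be sketched as a bump region: blocks predicted to survive are packed back to back, so they cannot fragment the rest of the heap. This is my own illustration (the names and sizes are invented), not code from the challenge:

```rust
// Survivor blocks go into a dedicated bump region, flush against each
// other, so they never punch holes in the general-purpose heap.
struct BumpRegion {
    buf: Vec<u8>,
    next: usize,
}

impl BumpRegion {
    fn new(cap: usize) -> Self {
        Self { buf: vec![0; cap], next: 0 }
    }

    // Returns the offset of the new block, or None when the region is full.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        if self.next + size > self.buf.len() {
            return None;
        }
        let off = self.next;
        self.next += size; // perfectly contiguous: zero fragmentation
        Some(off)
    }
}

fn main() {
    let mut region = BumpRegion::new(4096);
    let a = region.alloc(164).unwrap();
    let b = region.alloc(356).unwrap();
    assert_eq!(a, 0);
    assert_eq!(b, 164); // survivors sit flush against each other
}
```

Since nothing in the region is ever freed individually, no free list or size bookkeeping is needed at all.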

But when I was about to start, I realized this thing has no meaning in the real world - it only works for this test case. At least in my view, this isn’t much different from abnormal methods.

So, well… might as well go all the way and just take the abnormal approach to the extreme?

Round 3: The Fake Allocator

From the test case’s behavior, we can learn that it only verifies whether the last byte of a block is correct. Then the idea is simple: track each allocation behavior. If it’s the outer Vec, give it a real allocator; if it’s the inner Vec, only allocate two bytes and then return a forged memory address!
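To make the check concrete: a verifier that only inspects the final byte can, in principle, be satisfied by a much smaller backing buffer - that is the whole premise of the forgery. A toy sketch (`last_byte_ok` is a hypothetical stand-in for the test’s real assertion, not its actual code):

```rust
// Stand-in for the test's verification: only the last byte is compared.
fn last_byte_ok(block: &[u8], expected: u8) -> bool {
    *block.last().unwrap() == expected
}

fn main() {
    // A genuinely allocated block passes, of course.
    let big = vec![7u8; 4096];
    assert!(last_byte_ok(&big, 7));

    // The fake-allocator idea: back only the tail with real memory.
    // A two-byte buffer satisfies the same last-byte check.
    let tiny = [0u8, 7u8];
    assert!(last_byte_ok(&tiny, 7));
}
```

The hard part, as the next paragraph explains, is everything this sketch glosses over: knowing which allocation you are serving, and handing back a pointer that the rest of the system will tolerate.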

But easier said than done. The idea is simple, true, but how do you precisely identify where a heap allocation request comes from? The answer: you can’t - to say nothing of how hard forging memory addresses would be.

I did try it, but… quickly gave up.

Worth mentioning: the allocator part is not directly in the kernel, so there’s no println available, making debugging extremely difficult.

I had to switch to another approach.

Round 4: If They Won’t Be Decent, You Help Them Be Decent Allocator

Notice that the size of each allocation is base + delta, where base grows geometrically - starting at 32 and doubling each time until it reaches 512 KB - and delta is the current indicator count.

The deallocation pattern was mentioned above: even items are freed, odd items are kept.
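Putting the two observations together, one round of the test’s sizes can be reconstructed like this (a sketch of my reading of the test, not its actual code; `round_sizes` and the exact bounds are assumptions):

```rust
// Hypothetical reconstruction of one "round" of test allocations:
// base doubles from 32 up to 512 KiB, and delta is the indicator count.
fn round_sizes(delta: usize) -> Vec<usize> {
    let mut sizes = Vec::new();
    let mut base = 32;
    while base <= 512 * 1024 {
        sizes.push(base + delta);
        base *= 2;
    }
    sizes
}

fn main() {
    let sizes = round_sizes(100);
    // 32+100, 64+100, ..., 524288+100: fifteen allocations per round.
    assert_eq!(sizes.len(), 15);
    assert_eq!(sizes[0], 132);
    assert_eq!(*sizes.last().unwrap(), 512 * 1024 + 100);

    // Even-indexed blocks get freed by the test; odd-indexed ones survive.
    let kept: Vec<usize> = sizes.iter().copied().skip(1).step_by(2).collect();
    assert_eq!(kept[0], 64 + 100); // 64+delta is the first survivor
}
```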

So, a new idea was born: track the indicator count, determine which “round” the current test case is in, then precisely calculate which allocations will be kept by the test case, record their memory addresses…

Then, from within the allocator code, free them!

If they won’t be decent, you help them be decent!


My initial idea was even simpler: just record all allocated memory addresses, and when encountering a deallocation larger than 512 KB, free all previously recorded addresses.

But this has an obvious problem: double free - the test case itself will free half of them.

Since the test case is simple, I initially didn’t think this was a problem, but after countless crashes on LoadFault, I surrendered…


Here’s my final detailed approach:

Specifically, the blocks that won’t be freed have lengths of 64+delta, 256+delta, 1024+delta…

Maintain their base values - that is, 64, 256, 1024… - in a Vec inside the allocator.

When the allocator hits these values, save their memory addresses and sizes in another Vec.

When the allocator hits the next round’s 32+delta - the first allocation of a round - that marks the end of the previous round. At that point, free the recorded blocks that the test case will never free itself.

Thus, we eliminate the test case’s “memory leak,” and memory can never be exhausted.
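Here is a minimal simulation of that bookkeeping - my reconstruction, not the actual allocator code. `Reaper` and its fields are invented names, and real code would hand the recorded blocks back to the underlying deallocator instead of pushing them onto a `freed` list:

```rust
// Track survivor blocks (size == base + delta for a "kept" base) and
// release them when the next round's opening 32+delta allocation arrives.
struct Reaper {
    kept_bases: Vec<usize>,        // bases the test never frees: 64, 256, ...
    recorded: Vec<(usize, usize)>, // (addr, size) of this round's survivors
    freed: Vec<(usize, usize)>,    // blocks we released on the test's behalf
}

impl Reaper {
    fn new() -> Self {
        Self {
            kept_bases: vec![64, 256, 1024, 4096, 16384, 65536, 262144],
            recorded: Vec::new(),
            freed: Vec::new(),
        }
    }

    fn on_alloc(&mut self, addr: usize, size: usize, delta: usize) {
        if size == 32 + delta {
            // First allocation of a new round: flush the previous round.
            self.freed.append(&mut self.recorded);
        } else if self.kept_bases.iter().any(|b| b + delta == size) {
            self.recorded.push((addr, size));
        }
    }
}

fn main() {
    let mut r = Reaper::new();
    let delta = 100;
    r.on_alloc(0x1000, 64 + delta, delta);  // survivor: recorded
    r.on_alloc(0x2000, 128 + delta, delta); // the test frees this one itself
    r.on_alloc(0x3000, 256 + delta, delta); // survivor: recorded
    assert_eq!(r.recorded.len(), 2);

    r.on_alloc(0x4000, 32 + delta, delta); // next round begins
    assert_eq!(r.recorded.len(), 0);
    assert_eq!(r.freed, vec![(0x1000, 164), (0x3000, 356)]);
}
```

Because only blocks matching a kept base are recorded, the blocks the test frees itself are never touched, which is exactly what avoids the double-free problem described above.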


But it wasn’t that simple. The first implementation of this “correct approach” still stopped at 130 - and not with “no memory,” but with a LoadFault, an out-of-bounds memory access.

On closer inspection, one of the blocks I had recorded as “needs my help being decent” happened to coincide in size with the outer Vec’s buffer when it reallocated to grow…

With the deadline approaching, I had no time to pin down exactly which length collided, so I could only work around it temporarily:

until the indicator grows large enough, don’t free the first two items of each round.


After a series of operations, I finally saw the indicator count surpass 170 for the first time.

Or rather, I finally got the endless indicators I’d been dreaming of (?).

When should I stop? Let’s make it 114514 - what a fitting memento that would be.

But it stopped at 65536, showing “no memory.”

Why?

It turns out that the freeing half of the test (free_pass) takes its delta as a u8, while the allocating half (alloc_pass) uses a usize. The 65536 cap came from that type mismatch - it was never my allocator’s limit.
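For the record, this is all a usize-to-u8 cast does in Rust - it keeps only the low 8 bits (where exactly the test truncates is its own business; this just shows the wrapping):

```rust
fn main() {
    // An `as` cast to a narrower integer type discards the high bits.
    let delta: usize = 65536;
    assert_eq!(delta as u8, 0); // 65536 = 0x10000, low byte is 0

    assert_eq!(300usize as u8, 44); // 300 - 256 = 44
}
```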

My 114514 dream was shattered.


“This can’t be!”

I shouted.

At this point, there was less than an hour before the deadline, and I distinctly remembered seeing someone push the indicator to six digits.

How did they bypass the u8 limit?

Round ?: A Kernel-Exploit-Based Allocator

To break through u8, you’d have to prevent free_pass from executing.

I needed to exercise greater imagination and seek more underhanded (?) means.

Well, the possibilities are numerous…

At this point the operating system is still a unikernel: no virtual memory, no user space, and all code sharing the same (physical) address space.

You could find the memory address of the loop counter in the test case’s alloc_pass and just change it to 114514…

Or directly overwrite println…

Or even, somehow create some use-after-free in this small piece of code we’re allowed to modify, add some buffer overflow, some ROP…
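The first of those ideas rests on a simple fact: in one flat address space, knowing a variable’s address is enough to overwrite it through a raw pointer. A safe, runnable miniature (the counter here is a local variable of my own, not the test’s real loop counter, whose address you would still have to find):

```rust
fn main() {
    let mut counter: usize = 0;

    // In a unikernel, every component shares one address space, so any
    // code that learns this address can rewrite the variable directly.
    let addr = &mut counter as *mut usize as usize;

    unsafe {
        *(addr as *mut usize) = 114514;
    }

    assert_eq!(counter, 114514);
}
```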


But I think those who have pushed it to six digits didn’t use such methods. How would they do it? Let’s check it out after the competition ends.

100% Honest Review

At the beginning, I wasn’t very interested in this challenge. I don’t like rules that smell of a zero-sum game: submit early and you might lose chances to optimize; submit late and you might fall behind with an identical result.

ArceOS is already difficult enough. If the problem adds more pressure, I would just give up. I admit I’m fragile.

That said, I still came - although I started digging into the challenge details very late, and by then someone had already discovered the abnormal approach.

My reaction at the time was almost identical to someone in the group: doesn’t this make it completely pointless?

But once I got hands-on with the problem myself, my thinking gradually changed. Ah, true fragrance.

Despite some unpleasantness in the early stages, I ultimately gained quite a bit of joy and knowledge.


But if you ask me whether I’d want to play such a challenge problem again…

My answer is no.