I didn't get to work on Calcite much during the Fall, but now I am spending some of my Winter break getting lots of work done on it. Most of the things I have done since the hello world entry are not very easy to show off. I'm going to spend this post talking about some of the behind the scenes things that I've gotten done. There might be more posts like this in the future, hence volume I. This post will be pretty long, but I've tried to split it so that each section can be read independently.
The source code for this version can be found at https://git.benjidial.net/benji/calcite/src/tag/weblog-invisible-stuff-i.
Previous post: Hello world!
I mentioned in the previous post that I wanted to get started on paging. The paging system certainly has some areas for expansion, but for now it does what it needs to do. I'm going to present some of the theory of paging on x86 computers, and then present how Calcite in particular handles it. The theory is mostly sourced from the AMD64 Architecture Programmer's Manual.
In the earliest IBM PCs, any running application could access any of the system's memory. This is a nonstarter for a modern consumer operating system for a few reasons. First, there's the security issue. If an application knows where another application is, it can read all of its memory and find out its secrets. Second, if two applications expect to be loaded at the same place, they can't both be run at the same time. This is less of a problem now, since applications are often position-independent and don't care exactly where they are loaded. The third problem is that if an application has a bug which causes it to corrupt memory, it can also corrupt the memory of other applications or even the operating system.
The Intel 80286 CPU of 1982, included in the IBM PC/AT of 1984, added a tool to solve all three of these problems: memory segmentation. With segmented memory, a running application has segments for code, data, and possibly more, which specify a base, a limit, and permissions (some combination of read, write, and execute). When an application attempts to access a memory address, the relevant segment base is added to that address to get a physical memory address. The address the application specified is called the virtual address. Any virtual address past the limit results in a CPU exception (more on those later), and trying to use an address in a way that the permissions don't allow also results in an exception.
By giving each application its own segments that don't overlap any other applications' segments, we solve all three of the problems. An application can't read another application's memory, since it would require using a virtual address outside the limit of the segment. Two applications can be loaded at the same virtual address, corresponding to distinct physical addresses. And if memory corruption happens, it can't touch anything outside the writable segments of the application.
The Intel 80386 CPU of 1985, included in some models of the IBM PS/2 of 1987, added a new tool: memory paging. Under this system, the operating system manages a table that translates virtual memory to physical memory. Contiguous regions of virtual memory do not have to translate to contiguous regions of physical memory. When an application attempts to access an address that isn't mapped to any physical memory, or attempts to use an address in a way that permissions do not allow, an exception happens.
On a modern 64-bit x86 CPU, virtual addresses are 64 bits long. Depending on the CPU, some number of upper bits are not fully meaningful, and a virtual address is really a sign-extension of a 48-bit or a 57-bit address. The virtual address space is split into pages of memory, and each page can be mapped to any page-aligned physical address. When an application attempts to access a memory address, the page of virtual memory is translated to a page of physical memory, and the offset into the virtual page serves as an offset into the physical page to read from / write to / execute at. In general, pages can be different sizes, and an operating system can even mix and match different size pages within the same mapping. To simplify the discussion, I will stick to the model that Calcite uses: virtual addresses are sign-extended 48-bit addresses, and every page is 4096 bytes long.
Using this model, we can split a 64-bit virtual address into parts as follows (from high bit to low bit): 16 bits of sign extension (copies of bit 47), a 36-bit page number, and a 12-bit address in page.
When addresses were 32-bit, we had 20 bits of page number and 12 bits of address in page. Then, the high 10 bits of the page number would correspond to an index into a “page directory” with a physical address of a “page table”, along with some flags for permissions and more. The low 10 bits of the page number would correspond to an index into that page table with a physical address of the actual page, and more flags.
Now, instead of this two-layer system, we have a four-layer system. The high 9 bits of the page number give an index into a “page map level 4”, the next 9 bits an index into a “page directory pointer table”, the next 9 bits an index into a page directory, and the bottom 9 bits an index into a page table.
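To make this concrete, here is a small sketch of how the four 9-bit indices and the 12-bit offset are extracted from a virtual address with shifts and masks. The helper names here are mine, not Calcite's.

```c
#include <stdint.h>

/* Split a canonical 48-bit virtual address into the four 9-bit
   table indices and the 12-bit offset within the page. */
static inline uint64_t pml4_index(uint64_t va)  { return (va >> 39) & 0x1ff; }
static inline uint64_t pdpt_index(uint64_t va)  { return (va >> 30) & 0x1ff; }
static inline uint64_t pd_index(uint64_t va)    { return (va >> 21) & 0x1ff; }
static inline uint64_t pt_index(uint64_t va)    { return (va >> 12) & 0x1ff; }
static inline uint64_t page_offset(uint64_t va) { return va & 0xfff; }
```

For example, the kernel's top 1 GiB starting at 0xFFFFFFFFC0000000 has page map level 4 index 511 and page directory pointer table index 511.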
Each layer of the page map includes the following flags (and some more that Calcite doesn't use): a present bit, a writable bit, a user-accessible bit, and an execute disable bit, alongside the physical address of the next layer down (or, in a page table entry, of the page itself).
The flags in layers above the page table apply to every page under that entry, in the sense that a page is only considered present / writable / accessible by user code if the corresponding bit is set in every level, and only considered executable if the execute disable bit is set in no level.
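This combination rule can be sketched as follows, assuming the bit positions from the AMD64 paging entry format (present is bit 0, writable is bit 1, user is bit 2, and execute disable is bit 63); the function names are mine:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_PRESENT  (1ULL << 0)
#define PAGE_WRITABLE (1ULL << 1)
#define PAGE_USER     (1ULL << 2)
#define PAGE_XD       (1ULL << 63)  /* execute disable */

/* A page is writable only if every level has the writable bit set. */
static bool effective_writable(uint64_t pml4e, uint64_t pdpte,
                               uint64_t pde, uint64_t pte) {
  uint64_t all = pml4e & pdpte & pde & pte;
  return (all & PAGE_PRESENT) && (all & PAGE_WRITABLE);
}

/* ...but non-executable if any level has the execute disable bit set. */
static bool effective_executable(uint64_t pml4e, uint64_t pdpte,
                                 uint64_t pde, uint64_t pte) {
  uint64_t all = pml4e & pdpte & pde & pte;
  uint64_t any = pml4e | pdpte | pde | pte;
  return (all & PAGE_PRESENT) && !(any & PAGE_XD);
}
```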
The implementation of paging in Calcite is spread across a few files, all in src/kernel: paging.asm, paging.c, process.c.
Calcite tracks the physical memory of the system with a simple bitmap. The bitmap itself is statically allocated, and Calcite will only care about physical addresses in the bottom 64GiB of the physical address space (the particular limit is easy to change by changing a #define directive in the source code). The bitmap uses one bit per 4096-byte page, so one byte per 32768 bytes of physical memory tracked. Tracking 64GiB of physical memory makes the bitmap 2MiB long. At startup, the memory map provided by Limine is used to fill out the bitmap. When a new physical page is needed, the kernel just does a linear search through the bitmap until it finds a free page, marks it not free, and returns its address.
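The allocation side of this can be sketched like so. This is a minimal model of the idea, not Calcite's actual code (which also fills the bitmap from the Limine memory map at startup); all names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE    4096
#define MAX_PHYS     (64ULL << 30)               /* track the bottom 64 GiB */
#define BITMAP_BYTES (MAX_PHYS / PAGE_SIZE / 8)  /* works out to 2 MiB */

static uint8_t bitmap[BITMAP_BYTES];  /* bit set = page in use */

/* Linear search for a free page; returns its physical address,
   or UINT64_MAX if physical memory is exhausted. */
static uint64_t alloc_phys_page(void) {
  for (size_t i = 0; i < BITMAP_BYTES; i++) {
    if (bitmap[i] == 0xff)
      continue;  /* all eight pages tracked by this byte are taken */
    for (int bit = 0; bit < 8; bit++) {
      if (!(bitmap[i] & (1u << bit))) {
        bitmap[i] |= (1u << bit);
        return ((uint64_t)i * 8 + bit) * PAGE_SIZE;
      }
    }
  }
  return UINT64_MAX;
}

static void free_phys_page(uint64_t addr) {
  uint64_t page = addr / PAGE_SIZE;
  bitmap[page / 8] &= ~(1u << (page % 8));
}
```

A linear search is slow in the worst case, but it is simple, and the bitmap's fixed 2 MiB footprint makes the memory cost predictable.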
Calcite reserves the top 1GiB (i.e. one page directory worth of pages) of virtual memory for kernel use. This page directory is shared across every process, but marked as only accessible by kernel code. This way when a system call happens, the CPU already has the kernel mappings cached. All 512 page tables under this page directory are statically allocated. When new kernel memory is needed, the kernel looks through each page table in that page directory until it finds a long enough contiguous sequence of entries in the page tables that are not marked present, and returns the corresponding region of virtual memory.
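The search for a contiguous run of non-present entries can be sketched as below, modeling the 512 statically allocated page tables as a 2D array of entries. Again, this is my simplified model rather than Calcite's code.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_PRESENT (1ULL << 0)
#define ENTRIES 512

/* Scan one page directory's worth of page tables (512 tables of
   512 entries) for `count` consecutive non-present entries.
   Returns the index of the first entry of the run, or -1 if no
   run is long enough. */
static long find_free_run(uint64_t tables[ENTRIES][ENTRIES], size_t count) {
  size_t run = 0;
  for (size_t t = 0; t < ENTRIES; t++) {
    for (size_t e = 0; e < ENTRIES; e++) {
      if (tables[t][e] & PAGE_PRESENT)
        run = 0;  /* entry in use: the run is broken */
      else if (++run == count)
        return (long)(t * ENTRIES + e - (count - 1));
    }
  }
  return -1;
}
```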
Calcite allows user processes to map pages into the bottom 512GiB (i.e. one page directory pointer table worth of pages) of their virtual address space. For now the only mechanism for mapping a user page that can be initiated by user code is the system call to map the framebuffer into user-accessible pages. More on system calls later.
The paging system could grow much more complex. I could allow the kernel to use more than one page directory, or processes to use more than one page directory pointer table for user pages.
I could allow some pages to be allocated with bigger sizes than 4096 bytes. In particular, one can use an entry in a page directory to map a 2MiB page, rather than to specify a page table with 512 4096-byte pages. This way, no memory needs to be used for the page table itself. Similarly, one can use an entry in a page directory pointer table to map a 1GiB page, which saves a page directory and 512 page tables.
Multiple processes could share pages. I am already doing this with the kernel pages, but user pages could be shared as well. This could be used for communication, to share application code across multiple processes with the same code, and other purposes.
A technique that modern consumer operating systems use when physical memory is exhausted is to save pages of memory somewhere like a disk, then unmap the virtual pages that used it and mark the physical memory as free. Later, if code attempts to access the unmapped virtual pages, the kernel can detect the exception, “swap” the page back in, and resume the code. At the moment, if Calcite runs out of physical memory, it simply gives up and halts.
The x86 architecture includes an interrupt system. When an interrupt happens, the CPU looks up the handler for that interrupt, stores some information about the current state of the CPU, and executes that handler. At the end of the handler, we execute an instruction that restores the previous state. Whatever code was running before might not have any idea that an interrupt happened.
Theory is sourced from the AMD64 manual mentioned in the paging section, and from the OSDev Wiki, particularly the pages on Interrupts, the Interrupt Descriptor Table, Interrupt Service Routines, the 8259 PIC, and Exceptions.
An interrupt can be caused by one of three things: a CPU exception, raised by the CPU itself when code does something it shouldn't; a hardware interrupt, sent by a device that needs attention; or a software interrupt, triggered explicitly with an int instruction. Calcite does not use software interrupts for system calls; it uses the dedicated syscall instruction, as do all of the major 64-bit x86 operating systems.

Each interrupt has an associated number. To find the handler for an interrupt, the CPU uses the number as an index into an “interrupt descriptor table” which has the address of the handler along with some other information for each interrupt number. This information includes a segment selector for the handler to run with, which stack to switch to (if any), whether hardware interrupts stay enabled while the handler runs, and the minimum privilege level allowed to trigger the interrupt from software.
When an interrupt happens, the CPU switches to a higher privilege level (and possibly a different stack) if necessary, saves the old stack pointer, flags register, and instruction pointer, and jumps to the handler given in the interrupt descriptor table. Some exceptions also push an error code with more information about what went wrong.
At the end of an interrupt, we execute an iretq instruction and the old stack pointer, instruction pointer, flags register, and privilege level are restored.
CPU exceptions have interrupt numbers in the range 0–31. For hardware interrupts, there are 16 “interrupt request” lines (IRQ lines), numbered 0–15. When a piece of hardware sets an IRQ line, the programmable interrupt controller (PIC) either ignores it or sends a hardware interrupt to the CPU. The operating system tells the PIC which IRQ lines to ignore and which interrupt numbers to send: a range of eight contiguous interrupt numbers for IRQs 0–7, and another range for IRQs 8–15.
Interrupts, including exceptions, are implemented in the files interrupts.asm and interrupts.c in the src/kernel directory.
We tell the PIC to not ignore any IRQ line, and we have it send interrupt number 32 + n for IRQ n. The interrupt component of the kernel has a list of 16 wrapped handlers for IRQs. Other components can set any one of them. The actual interrupt handlers for IRQs save the registers, call the wrapped handler for that IRQ if one has been set, tell the PIC that the interrupt has been handled, restore the registers, and execute iretq.

For exceptions, the kernel currently just halts. In the future, it would be better to do something else. For most exceptions, the best thing to do is probably to terminate the process that caused the exception. The page fault exception in particular is where we would do a lot of the paging tricks described in that section.
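The C side of the IRQ dispatch can be sketched as follows (hypothetical names; the register saving, end-of-interrupt, and iretq live in the assembly stubs):

```c
#include <stddef.h>

#define NUM_IRQS 16

typedef void (*irq_handler_t)(void);

static irq_handler_t irq_handlers[NUM_IRQS];  /* NULL = no handler set */

/* Another kernel component registers interest in an IRQ line. */
static void set_irq_handler(int irq, irq_handler_t handler) {
  if (irq >= 0 && irq < NUM_IRQS)
    irq_handlers[irq] = handler;
}

/* Called from the assembly stub for interrupt number 32 + irq,
   after the registers have been saved. */
static void dispatch_irq(int irq) {
  if (irq >= 0 && irq < NUM_IRQS && irq_handlers[irq])
    irq_handlers[irq]();
}

/* Example of a component's wrapped handler: count timer ticks
   from IRQ 0. */
static int ticks;
static void timer_tick(void) { ticks++; }
```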
Calcite uses the syscall instruction for system calls.
During startup, the kernel tells the CPU to enable the syscall instruction and tells it where to find the system call handler and some other relevant information like which segments to use for which privilege levels. When user code executes syscall, these things happen:
the flags register is saved into the r11 register, the address of the instruction after the syscall is saved into the rcx register, and the CPU raises its privilege level and jumps to the handler address it was told during startup. When kernel code executes sysret, these things happen: the flags register is restored from r11, the privilege level is lowered back, and execution jumps to the address in rcx. More details can be found in the AMD64 manual from the paging section.
In Calcite, the system call component has a list of wrapped functions, each with an associated ID. When the syscall handler runs, it sets the stack pointer to a special stack just for system calls and saves r11, rcx, and the old stack pointer to it. Next we look up the wrapped function using rax as the ID. If that ID has been set, it runs the function with rdi, rsi, rdx as arguments. Otherwise, it halts (probably it should either terminate the process or return an error to the process). After the wrapped function returns, we restore r11, rcx, and the old stack pointer, and then execute sysret.
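The lookup-and-call part of this can be sketched like so (hypothetical names; the r11/rcx/stack juggling happens in assembly around it, and this sketch returns an error for unknown IDs rather than halting, which is one of the alternatives mentioned above):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_SYSCALLS 64

/* A wrapped system call function takes the values of rdi, rsi,
   and rdx as its arguments. */
typedef uint64_t (*syscall_fn)(uint64_t, uint64_t, uint64_t);

static syscall_fn syscall_table[MAX_SYSCALLS];

static void register_syscall(uint64_t id, syscall_fn fn) {
  if (id < MAX_SYSCALLS)
    syscall_table[id] = fn;
}

/* Called from the syscall entry stub with rax as the ID. */
static uint64_t do_syscall(uint64_t id, uint64_t a, uint64_t b, uint64_t c) {
  if (id < MAX_SYSCALLS && syscall_table[id])
    return syscall_table[id](a, b, c);
  return UINT64_MAX;  /* unknown system call */
}

/* Example of a trivial wrapped function. */
static uint64_t sys_add(uint64_t a, uint64_t b, uint64_t c) {
  return a + b + c;
}
```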
Currently, the only system calls are “end thread” and “map framebuffer”. The user library calcite has wrappers for these system calls to make them nicer to perform from C code. The user application hello maps the framebuffer, fills it with a gradient, and then ends its only thread.
The repository has three top-level directories: disk, include, and src. The disk directory contains files to be copied directly to the disk image (currently just the Limine configuration file). The include directory contains files that can be included from any code, and the src directory contains C and assembly code that is directly compiled, as well as “private” C headers.
Building Calcite consists of three steps: running the shell script get-dependencies.sh, running the shell script make-build.sh, and running ninja. The first script gets the Limine bootloader, which is included on the disk. The second script generates a Ninja build file for the last step to use.
The generated build file will build the kernel by building every C and assembly file in src/kernel, and linking them together using src/kernel/link.ld. Each directory in src/user-libs defines a library which is built by building each C and assembly file in that directory and then linking them together as a relocatable object. Each directory in src/user-apps defines an application which is built by building each C and assembly file in that directory and then linking them together along with the libraries listed in libraries.txt in that directory if that file exists.
In the past for projects like this, I have used a Makefile. I am using a generated Ninja file this time partly as an experiment. My hope is that I don't end up needing to touch the build script very often, even if I add a new library or application.
That's it for this post. There are a few other components that I have been working on, but I don't think they are in a complete enough state to write about yet. Until next time!