I didn't get to work on Calcite much during the Fall, but now I am spending some of my Winter break getting lots of work done on it. Most of the things I have done since the hello world entry are not very easy to show off. I'm going to spend this post talking about some of the behind the scenes things that I've gotten done. There might be more posts like this in the future, hence volume I. This post will be pretty long, but I've tried to split it so that each section can be read independently.
The source code for this version can be found at https://git.benjidial.net/benji/calcite/src/tag/weblog-invisible-stuff-i.
Previous post: Hello world!
I mentioned in the previous post that I wanted to get started on paging. The paging system certainly has some areas for expansion, but for now it does what it needs to do. I'm going to present some of the theory of paging on x86 computers, and then present how Calcite in particular handles it. The theory is mostly sourced from the AMD64 Architecture Programmer's Manual.
In the earliest IBM PCs, any running application could access any of the system's memory. This is a nonstarter for a modern consumer operating system for a few reasons. First, there's the security issue. If an application knows where another application is, it can read all of its memory and find out its secrets. Second, if two applications expect to be loaded at the same place, they can't both be run at the same time. This is less of a problem now, since applications are often position-independent and don't care exactly where they are loaded. The third problem is that if an application has a bug which causes it to corrupt memory, it can also corrupt the memory of other applications or even the operating system.
The Intel 80286 CPU of 1982, included in the IBM PC/AT of 1984, added a tool to solve all three of these problems: memory segmentation. With segmented memory, a running application has segments for code, data, and possibly more, which specify a base, a limit, and permissions (some combination of read, write, and execute). When an application attempts to access a memory address, the relevant segment base is added to that address to get a physical memory address. The address the application specified is called the virtual address. Any virtual address past the limit results in a CPU exception (more on those later), and trying to use an address in a way that the permissions don't allow also results in an exception.
By giving each application its own segments that don't overlap any other applications' segments, we solve all three of the problems. An application can't read another application's memory, since it would require using a virtual address outside the limit of the segment. Two applications can be loaded at the same virtual address, corresponding to distinct physical addresses. And if memory corruption happens, it can't touch anything outside the writable segments of the application.
The Intel 80386 CPU of 1985, included in some models of the IBM PS/2 of 1987, added a new tool: memory paging. Under this system, the operating system manages a table that translates virtual memory to physical memory. Contiguous regions of virtual memory do not have to translate to contiguous regions of physical memory. When an application attempts to access an address that isn't mapped to any physical memory, or attempts to use an address in a way that permissions do not allow, an exception happens.
On a modern 64-bit x86 CPU, virtual addresses are 64 bits long. Depending on the CPU, some number of upper bits are not fully meaningful, and a virtual address is really a sign-extension of a 48-bit or a 57-bit address. The virtual address space is split into pages of memory, and each page can be mapped to any page-aligned physical address. When an application attempts to access a memory address, the page of virtual memory is translated to a page of physical memory, and the offset into the virtual page serves as an offset into the physical page to read from / write to / execute at. In general, pages can be different sizes, and an operating system can even mix and match different size pages within the same mapping. To simplify the discussion, I will stick to the model that Calcite uses: virtual addresses are sign-extended 48-bit addresses, and every page is 4096 bytes long.
Using this model, we can split a 64-bit virtual address into parts as follows (from high bit to low bit): 16 bits of sign extension (copies of bit 47), a 36-bit page number, and a 12-bit address in page.
When addresses were 32-bit, we had 20 bits of page number and 12 bits of address in page. Then, the high 10 bits of the page number would correspond to an index into a “page directory” with a physical address of a “page table”, along with some flags for permissions and more. The low 10 bits of the page number would correspond to an index into that page table with a physical address of the actual page, and more flags.
Now, instead of this two-layer system, we have a four-layer system. The high 9 bits of the page number give an index into a “page map level 4”, the next 9 bits an index into a “page directory pointer table”, the next 9 bits an index into a page directory, and the bottom 9 bits an index into a page table.
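To make this concrete, here is a small sketch of how the four 9-bit indices and the 12-bit offset are extracted from a virtual address with shifts and masks. The helper names here are mine, not Calcite's.

```c
#include <stdint.h>

/* Split a canonical 48-bit virtual address into the four 9-bit
   table indices and the 12-bit offset within the page. */
static inline uint64_t pml4_index(uint64_t va)  { return (va >> 39) & 0x1ff; }
static inline uint64_t pdpt_index(uint64_t va)  { return (va >> 30) & 0x1ff; }
static inline uint64_t pd_index(uint64_t va)    { return (va >> 21) & 0x1ff; }
static inline uint64_t pt_index(uint64_t va)    { return (va >> 12) & 0x1ff; }
static inline uint64_t page_offset(uint64_t va) { return va & 0xfff; }
```

For example, the kernel's top 1 GiB starting at 0xFFFFFFFFC0000000 has page map level 4 index 511 and page directory pointer table index 511.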
Each layer of the page map includes the following flags (and some more that Calcite doesn't use): a present bit, a writable bit, a user-accessible bit, and an execute disable bit, alongside the physical address of the next layer down (or, in a page table entry, of the page itself).
The flags in layers above the page table apply to every page under that entry, in the sense that a page is only considered present / writable / accessible by user code if the corresponding bit is set in every level, and only considered executable if the execute disable bit is set in no level.
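This combination rule can be sketched as follows, assuming the bit positions from the AMD64 paging entry format (present is bit 0, writable is bit 1, user is bit 2, and execute disable is bit 63); the function names are mine:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_PRESENT  (1ULL << 0)
#define PAGE_WRITABLE (1ULL << 1)
#define PAGE_USER     (1ULL << 2)
#define PAGE_XD       (1ULL << 63)  /* execute disable */

/* A page is writable only if every level has the writable bit set. */
static bool effective_writable(uint64_t pml4e, uint64_t pdpte,
                               uint64_t pde, uint64_t pte) {
  uint64_t all = pml4e & pdpte & pde & pte;
  return (all & PAGE_PRESENT) && (all & PAGE_WRITABLE);
}

/* ...but non-executable if any level has the execute disable bit set. */
static bool effective_executable(uint64_t pml4e, uint64_t pdpte,
                                 uint64_t pde, uint64_t pte) {
  uint64_t all = pml4e & pdpte & pde & pte;
  uint64_t any = pml4e | pdpte | pde | pte;
  return (all & PAGE_PRESENT) && !(any & PAGE_XD);
}
```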
The implementation of paging in Calcite is spread across a few files, all in src/kernel: paging.asm, paging.c, process.c.
Calcite tracks the physical memory of the system with a simple bitmap. The bitmap itself is statically allocated, and Calcite will only care about physical addresses in the bottom 64GiB of the physical address space (the particular limit is easy to change by changing a #define directive in the source code). The bitmap uses one bit per 4096-byte page, so one byte per 32768 bytes of physical memory tracked. Tracking 64GiB of physical memory makes the bitmap 2MiB long. At startup, the memory map provided by Limine is used to fill out the bitmap. When a new physical page is needed, the kernel just does a linear search through the bitmap until it finds a free page, marks it not free, and returns its address.
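The allocation side of this can be sketched like so. This is a minimal model of the idea, not Calcite's actual code (which also fills the bitmap from the Limine memory map at startup); all names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE    4096
#define MAX_PHYS     (64ULL << 30)               /* track the bottom 64 GiB */
#define BITMAP_BYTES (MAX_PHYS / PAGE_SIZE / 8)  /* works out to 2 MiB */

static uint8_t bitmap[BITMAP_BYTES];  /* bit set = page in use */

/* Linear search for a free page; returns its physical address,
   or UINT64_MAX if physical memory is exhausted. */
static uint64_t alloc_phys_page(void) {
  for (size_t i = 0; i < BITMAP_BYTES; i++) {
    if (bitmap[i] == 0xff)
      continue;  /* all eight pages tracked by this byte are taken */
    for (int bit = 0; bit < 8; bit++) {
      if (!(bitmap[i] & (1u << bit))) {
        bitmap[i] |= (1u << bit);
        return ((uint64_t)i * 8 + bit) * PAGE_SIZE;
      }
    }
  }
  return UINT64_MAX;
}

static void free_phys_page(uint64_t addr) {
  uint64_t page = addr / PAGE_SIZE;
  bitmap[page / 8] &= ~(1u << (page % 8));
}
```

A linear search is slow in the worst case, but it is simple, and the bitmap's fixed 2 MiB footprint makes the memory cost predictable.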
Calcite reserves the top 1GiB (i.e. one page directory worth of pages) of virtual memory for kernel use. This page directory is shared across every process, but marked as only accessible by kernel code. This way when a system call happens, the CPU already has the kernel mappings cached. All 512 page tables under this page directory are statically allocated. When new kernel memory is needed, the kernel looks through each page table in that page directory until it finds a long enough contiguous sequence of entries in the page tables that are not marked present, and returns the corresponding region of virtual memory.
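The search for a contiguous run of non-present entries can be sketched as below, modeling the 512 statically allocated page tables as a 2D array of entries. Again, this is my simplified model rather than Calcite's code.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_PRESENT (1ULL << 0)
#define ENTRIES 512

/* Scan one page directory's worth of page tables (512 tables of
   512 entries) for `count` consecutive non-present entries.
   Returns the index of the first entry of the run, or -1 if no
   run is long enough. */
static long find_free_run(uint64_t tables[ENTRIES][ENTRIES], size_t count) {
  size_t run = 0;
  for (size_t t = 0; t < ENTRIES; t++) {
    for (size_t e = 0; e < ENTRIES; e++) {
      if (tables[t][e] & PAGE_PRESENT)
        run = 0;  /* entry in use: the run is broken */
      else if (++run == count)
        return (long)(t * ENTRIES + e - (count - 1));
    }
  }
  return -1;
}
```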
Calcite allows user processes to map pages into the bottom 512GiB (i.e. one page directory pointer table worth of pages) of their virtual address space. For now the only mechanism for mapping a user page that can be initiated by user code is the system call to map the framebuffer into user-accessible pages. More on system calls later.
The paging system could grow much more complex. I could allow the kernel to use more than one page directory, or processes to use more than one page directory pointer table for user pages.
I could allow some pages to be allocated with bigger sizes than 4096 bytes. In particular, one can use an entry in a page directory to map a 2MiB page, rather than to specify a page table with 512 4096-byte pages. This way, no memory needs to be used for the page table itself. Similarly, one can use an entry in a page directory pointer table to map a 1GiB page, which saves a page directory and 512 page tables.
Multiple processes could share pages. I am already doing this with the kernel pages, but user pages could be shared as well. This could be used for communication, to share application code across multiple processes with the same code, and other purposes.
A technique that modern consumer operating systems use when physical memory is exhausted is to save pages of memory somewhere like a disk, then unmap the virtual pages that used it and mark the physical memory as free. Later, if code attempts to access the unmapped virtual pages, the kernel can detect the exception, “swap” the page back in, and resume the code. At the moment, if Calcite runs out of physical memory, it simply gives up and halts.
The x86 architecture includes an interrupt system. When an interrupt happens, the CPU looks up the handler for that interrupt, stores some information about the current state of the CPU, and executes that handler. At the end of the handler, we execute an instruction that restores the previous state. Whatever code was running before might not have any idea that an interrupt happened.
Theory is sourced from the AMD64 manual mentioned in the paging section, and from the OSDev Wiki, particularly the pages on Interrupts, the Interrupt Descriptor Table, Interrupt Service Routines, the 8259 PIC, and Exceptions.
An interrupt can be caused by one of three things: a CPU exception, raised by the CPU itself when code does something it shouldn't; a hardware interrupt, sent by a device that needs attention; or a software interrupt, triggered explicitly with an int instruction. Calcite does not use software interrupts for system calls; it uses the dedicated syscall instruction, as do all of the major 64-bit x86 operating systems.

Each interrupt has an associated number. To find the handler for an interrupt, the CPU uses the number as an index into an “interrupt descriptor table” which has the address of the handler along with some other information for each interrupt number. This information includes a segment selector for the handler to run with, which stack to switch to (if any), whether hardware interrupts stay enabled while the handler runs, and the minimum privilege level allowed to trigger the interrupt from software.
When an interrupt happens, the CPU switches to a higher privilege level (and possibly a different stack) if necessary, saves the old stack pointer, flags register, and instruction pointer, and jumps to the handler given in the interrupt descriptor table. Some exceptions also push an error code with more information about what went wrong.
At the end of an interrupt, we execute an iretq instruction and the old stack pointer, instruction pointer, flags register, and privilege level are restored.
CPU exceptions have interrupt numbers in the range 0–31. For hardware interrupts, there are 16 “interrupt request” lines (IRQ lines), numbered 0–15. When a piece of hardware sets an IRQ line, the programmable interrupt controller (PIC) either ignores it or sends a hardware interrupt to the CPU. The operating system tells the PIC which IRQ lines to ignore and which interrupt numbers to send: a range of eight contiguous interrupt numbers for IRQs 0–7, and another range for IRQs 8–15.
Interrupts, including exceptions, are implemented in the files interrupts.asm and interrupts.c in the src/kernel directory.
We tell the PIC to not ignore any IRQ line, and we have it send interrupt number 32 + n for IRQ n. The interrupt component of the kernel has a list of 16 wrapped handlers for IRQs. Other components can set any one of them. The actual interrupt handlers for IRQs save the registers, call the wrapped handler for that IRQ if one has been set, tell the PIC that the interrupt has been handled, restore the registers, and execute iretq.

For exceptions, the kernel currently just halts. In the future, it would be better to do something else. For most exceptions, the best thing to do is probably to terminate the process that caused the exception. The page fault exception in particular is where we would do a lot of the paging tricks described in that section.
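The C side of the IRQ dispatch can be sketched as follows (hypothetical names; the register saving, end-of-interrupt, and iretq live in the assembly stubs):

```c
#include <stddef.h>

#define NUM_IRQS 16

typedef void (*irq_handler_t)(void);

static irq_handler_t irq_handlers[NUM_IRQS];  /* NULL = no handler set */

/* Another kernel component registers interest in an IRQ line. */
static void set_irq_handler(int irq, irq_handler_t handler) {
  if (irq >= 0 && irq < NUM_IRQS)
    irq_handlers[irq] = handler;
}

/* Called from the assembly stub for interrupt number 32 + irq,
   after the registers have been saved. */
static void dispatch_irq(int irq) {
  if (irq >= 0 && irq < NUM_IRQS && irq_handlers[irq])
    irq_handlers[irq]();
}

/* Example of a component's wrapped handler: count timer ticks
   from IRQ 0. */
static int ticks;
static void timer_tick(void) { ticks++; }
```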
Calcite uses the syscall instruction for system calls.
During startup, the kernel tells the CPU to enable the syscall instruction and tells it where to find the system call handler and some other relevant information like which segments to use for which privilege levels. When user code executes syscall, these things happen:
the flags register is saved into the r11 register, the address of the instruction after the syscall is saved into the rcx register, and the CPU raises its privilege level and jumps to the handler address it was told during startup. When kernel code executes sysret, these things happen: the flags register is restored from r11, the privilege level is lowered back, and execution jumps to the address in rcx. More details can be found in the AMD64 manual from the paging section.
In Calcite, the system call component has a list of wrapped functions, each with an associated ID. When the syscall handler runs, it sets the stack pointer to a special stack just for system calls and saves r11, rcx, and the old stack pointer to it. Next we look up the wrapped function using rax as the ID. If that ID has been set, it runs the function with rdi, rsi, rdx as arguments. Otherwise, it halts (probably it should either terminate the process or return an error to the process). After the wrapped function returns, we restore r11, rcx, and the old stack pointer, and then execute sysret.
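The lookup-and-call part of this can be sketched like so (hypothetical names; the r11/rcx/stack juggling happens in assembly around it, and this sketch returns an error for unknown IDs rather than halting, which is one of the alternatives mentioned above):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_SYSCALLS 64

/* A wrapped system call function takes the values of rdi, rsi,
   and rdx as its arguments. */
typedef uint64_t (*syscall_fn)(uint64_t, uint64_t, uint64_t);

static syscall_fn syscall_table[MAX_SYSCALLS];

static void register_syscall(uint64_t id, syscall_fn fn) {
  if (id < MAX_SYSCALLS)
    syscall_table[id] = fn;
}

/* Called from the syscall entry stub with rax as the ID. */
static uint64_t do_syscall(uint64_t id, uint64_t a, uint64_t b, uint64_t c) {
  if (id < MAX_SYSCALLS && syscall_table[id])
    return syscall_table[id](a, b, c);
  return UINT64_MAX;  /* unknown system call */
}

/* Example of a trivial wrapped function. */
static uint64_t sys_add(uint64_t a, uint64_t b, uint64_t c) {
  return a + b + c;
}
```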
Currently, the only system calls are “end thread” and “map framebuffer”. The user library calcite has wrappers for these system calls to make them nicer to perform from C code. The user application hello maps the framebuffer, fills it with a gradient, and then ends its only thread.
The repository has three top-level directories: disk, include, and src. The disk directory contains files to be copied directly to the disk image (currently just the Limine configuration file). The include directory contains files that can be included from any code, and the src directory contains C and assembly code that is directly compiled, as well as “private” C headers.
Building Calcite consists of three steps: running the shell script get-dependencies.sh, running the shell script make-build.sh, and running ninja. The first script gets the Limine bootloader, which is included on the disk. The second script generates a Ninja build file for the last step to use.
The generated build file will build the kernel by building every C and assembly file in src/kernel, and linking them together using src/kernel/link.ld. Each directory in src/user-libs defines a library which is built by building each C and assembly file in that directory and then linking them together as a relocatable object. Each directory in src/user-apps defines an application which is built by building each C and assembly file in that directory and then linking them together along with the libraries listed in libraries.txt in that directory if that file exists.
In the past for projects like this, I have used a Makefile. I am using a generated Ninja file this time partly as an experiment. My hope is that I don't end up needing to touch the build script very often, even if I add a new library or application.
That's it for this post. There are a few other components that I have been working on, but I don't think they are in a complete enough state to write about yet. Until next time!