Micro Code

Michael's blog about teaching, hardware, software and the things in between

Computer History

April 05, 2023 — Michael Engel

As some of my regular readers (if these exist...) may already know, I have a certain fondness for old computers and their history.

In February, members of the German Computer Society's SIG on Computer History met at the University of Bonn. We had a number of very interesting talks from different angles of computer history - preservation and museums, a film project, the restoration process of Zuse's Z1 mechanical computer, and program listings in 1980s home computer magazines. I contributed my report on reproducing CS research from the 1990s.

A big thank you to Stefan Höltgen for organizing the event!

One highlight was the visit to the Arithmeum, the University's computer museum. In addition to an impressive collection of mechanical computing machines (including an Enigma!), the museum also features an exhibition of electronic computers (tours on demand). I thought some pictures might be interesting...

An original Enigma cipher device

Lots of working computers from the 1980s

IBM 5100 APL computer

Apple Lisa 2/Macintosh XL

NeXT Cube

Commodore SX64

Tags: GI, InfoHist, Computergeschichte, Bonn, Arithmeum

Bamberg at Night

April 03, 2023 — Michael Engel

And some shots of Bamberg at night...

Elisabeth church

Elisabeth church advent concert

Night sky

A large swarm of birds over Markusplatz

Regnitz river at night

Tags: Bamberg, photos, night

Some Impressions of Bamberg

April 03, 2023 — Michael Engel

Just some pictures of Bamberg I took throughout the past year.

View of the Regnitz river and the Little Venice quarter from one of the many bridges

Bamberg cathedral

A tiny alleyway in the UNESCO world heritage area

Cafe in the ERBA park

An old boat on the Regnitz

Just a beautiful blossoming tree

Tags: Bamberg, Photos

One Year Later...

April 03, 2023 — Michael Engel

Big news, a bit late. Almost exactly one year ago (in late March 2022), I returned to Germany to start my new job as full professor and head of the System Programming research group at Bamberg University in the Faculty of Information Systems and Applied Computer Sciences.

In Bamberg, for now I concentrate on teaching master level courses, projects and seminars on operating systems engineering and virtualization as well as research on microkernel design for modern architectures and related topics - with a focus on implementing systems on the RISC V architecture.

We also started to gain a lot of experience implementing system and bare-metal code in Rust with great results and will continue to offer Rust as an alternative implementation language to C (along with an optional Rust programming course).

Among many other things, during the previous year I was a co-organizer of the Third Winter School on Operating Systems (WSOS 2023) and acted as program chair for the 9th International Workshop on Plan 9, which takes place in Waterloo, Ontario, Canada, from April 21st-23rd, 2023.

Expect more updates soon including a report from iwp9 – I hope it won't take more than two years for another update...

Picture of the ERBA WE5 building and the playground in the ERBA park

Tags: Bamberg, University, WIAI, SYSNAP

Exploring Binary Translation on Apple Silicon (part 1)

November 21, 2020 — Michael Engel

You have probably read about Apple's new Macs, which switched from Intel CPUs to Apple's new in-house designed M1 "Apple Silicon" chip, which is based on the 64-bit ARM (AArch64) instruction set architecture (ISA).

When a company does a transition like this, one of the challenges is to ensure that the existing base of software products continues to work. Since the AArch64 ISA is not compatible with the old 64-bit x86 (x86-64) architecture, applications have to be recompiled to enable running them on ARM (this would not be an insurmountable problem if all applications were open source). However, some companies are slow to recompile their software, and some others might even have lost the source code to their software -- this happens far more often than you think -- or went out of business years ago. Nevertheless, there will be a customer out there who relies on such an old piece of binary-only x86 software, and you don't want to discourage them from buying your shiny new hardware...

Thus, for a transition period, other solutions to enable running old, binary incompatible software are required to keep your users happy. This is what this blog post series is about; we will focus on Rosetta 2 in upcoming posts. In this first part of the series, I will provide you with a bit of background on emulation and binary translation.

Interpreting Emulators

One very simple approach is to create an emulator that interprets the x86 machine instructions one after the other in order of the program's control flow and dynamically maps them to instructions the new CPU can understand, often building a model of the emulated CPU in C or assembly.

An example of how an interpreting emulator implements the x86 instruction "ADD AL, imm8" (add an 8-bit immediate value to the accumulator low byte) might look like this (caution, this is untested pseudo-code written for legibility):

while (!end) {
  instruction = get_opcode(regs.pc);
  switch (instruction) {
    case 0x04: // opcode byte for ADD AL, imm8
      // fetch operand byte
      operand = get_byte(regs.pc+1);

      // update result: only the low byte (AL) of the accumulator changes
      regs.ax = (regs.ax & 0xFFFFFF00) | (((regs.ax & 0xFF) + operand) & 0xFF);

      // set condition flags
      regs.flags.cf = ...; // carry
      regs.flags.of = ...; // overflow

      // update program counter (one opcode byte plus one operand byte)
      regs.pc = regs.pc + 2;
      break;

    case 0x05: // opcode byte for ADD EAX, imm32
      ...
  }
}

It's probably easy to see that the interpreted emulation of this simple opcode in C requires quite a bit of overhead. In the emulator code, things that run in parallel on real hardware, such as updating the processor flags and the PC while the result of the addition is calculated, have to be executed serially and thus take much more time compared to execution in hardware.

The interpreter also has to assume the worst case, namely that all of the calculated results are required. It has no knowledge of future instructions, so it always has to calculate the values of the carry and overflow flags, even if their values are never read or are immediately overwritten by subsequent code.
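To give an idea of the work hidden behind the two "..." flag assignments in the interpreter loop above, here is a minimal sketch in C of how carry and overflow could be derived for an 8-bit addition. The names add8 and struct add8_result are made up for this illustration; a real emulator would also have to update the other x86 flags (zero, sign, parity, auxiliary carry).

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of an emulated 8-bit addition, including the two flags
 * mentioned in the interpreter loop (hypothetical helper type). */
struct add8_result {
    uint8_t value; /* low 8 bits of the sum */
    bool cf;       /* carry: the unsigned result did not fit into 8 bits */
    bool of;       /* overflow: the signed result did not fit into 8 bits */
};

static struct add8_result add8(uint8_t a, uint8_t b)
{
    struct add8_result r;
    unsigned int wide = (unsigned int)a + (unsigned int)b;

    r.value = (uint8_t)wide;
    r.cf = wide > 0xFF;
    /* Signed overflow: both operands have the same sign bit and the
     * sign bit of the result differs from it. */
    r.of = ((~(a ^ b)) & (a ^ r.value) & 0x80) != 0;
    return r;
}
```

For example, add8(0x7F, 0x01) yields 0x80 with the overflow flag set and the carry flag clear, while add8(0xFF, 0x01) yields 0x00 with the carry flag set and the overflow flag clear. All of this has to be computed on every emulated ADD, whether or not a later instruction looks at the flags.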

This mode of emulation is especially inefficient when loops are to be executed. Here, the same set of cases of our switch statement would have to be executed over and over again.

Dynamic Binary Translation -- Just-in-Time

To improve the execution performance for code that is executed multiple times in one program run, it is useful to cache the results of a translation process, i.e. the instructions of the emulator that are executed while a program is emulated.

This is the basic idea behind just-in-time translation (JIT). Instead of executing emulator code whenever an instruction to be emulated is encountered, the JIT compiler generates a sequence of native machine instructions that emulates the instruction and stores this sequence in a translation buffer. After the translation of some instructions, the JIT system jumps to the generated code.

The problem with this is how many instructions are to be translated before the JIT system can start to execute them. To get an idea of what is happening here, we need to take a look at a fundamental concept of program structure: basic blocks.

Wikipedia defines a basic block as follows:

In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
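To make the definition a bit more concrete, the following sketch in C scans a byte stream from a given start offset and returns the length of the straight-line sequence up to and including the first branch. It only knows a handful of real x86 opcodes (ADD AL, imm8; ADD EAX, imm32; NOP; JMP rel8; RET), and the function names are invented for this example. Note that it only finds the end of a block; it does not check whether some other branch jumps into the middle of the sequence, which a complete basic block analysis would also have to consider.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the few x86 opcodes this sketch knows; 0 = unknown. */
static size_t insn_length(uint8_t opcode)
{
    switch (opcode) {
    case 0x04: return 2; /* ADD AL, imm8   */
    case 0x05: return 5; /* ADD EAX, imm32 */
    case 0x90: return 1; /* NOP            */
    case 0xEB: return 2; /* JMP rel8       */
    case 0xC3: return 1; /* RET            */
    default:   return 0;
    }
}

static int is_branch(uint8_t opcode)
{
    return opcode == 0xEB || opcode == 0xC3;
}

/* Returns the number of bytes from `start` up to and including the first
 * branch instruction, or 0 if an unknown opcode or the end of the buffer
 * is reached before a branch is found. */
static size_t basic_block_length(const uint8_t *code, size_t size, size_t start)
{
    size_t pos = start;

    while (pos < size) {
        uint8_t opcode = code[pos];
        size_t len = insn_length(opcode);

        if (len == 0 || pos + len > size)
            return 0;
        pos += len;
        if (is_branch(opcode))
            return pos - start;
    }
    return 0;
}
```

For the byte sequence 04 01 05 01 00 00 00 EB FE 90 C3, the block starting at offset 0 is 9 bytes long (two ADDs and the JMP), and the block starting at offset 9 is 2 bytes long (NOP and RET).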

Accordingly, the code execution inside of a basic block is predictable; since there is no control flow except at the end, we can always translate all of the instructions inside of a basic block at the same time (this can be made even more efficient through hyperblocks and methods such as trace scheduling [1]). So some pseudo-code for a JIT compiler might look like this (note this is not too dissimilar from our interpreter code above):

pc_type translate_basic_block(pc_type pc) {
  // mark basic block as translated

  // translate instructions from the start of the basic block up to and including the first branch instruction
  do {
    instruction = get_opcode(pc);
    switch (instruction) {
      case 0x04: // opcode byte for ADD AL, imm8
        // fetch operand byte
        operand = get_byte(pc+1);

        // generate code to calculate the result
        // (the low byte has to be extracted before it is cleared in AX)
        emit(INSN_ANDI, nativeregs[REG_TMP1], nativeregs[REG_AX], 0xFF);
        emit(INSN_ADD,  nativeregs[REG_TMP1], nativeregs[REG_TMP1], operand);
        emit(INSN_ANDI, nativeregs[REG_TMP1], nativeregs[REG_TMP1], 0xFF);
        emit(INSN_ANDI, nativeregs[REG_AX], nativeregs[REG_AX], 0xFFFFFF00);
        emit(INSN_OR,   nativeregs[REG_AX], nativeregs[REG_AX], nativeregs[REG_TMP1]);

        // emit code to calculate the condition flags
        emit(....); // carry
        emit(....); // overflow

        // update the program counter
        pc = pc + 2;
        break;

      case 0x05: // opcode byte for ADD EAX, imm32
        ...
    }
  } while (type(instruction) != BRANCH);

  // emit an instruction to return to the JIT system's main loop

  // return next emulated PC
  return pc;
}

In this code I make a number of assumptions which might be problematic in a real-world JIT compiler. One assumption is that the target machine has more registers than the machine to be emulated. In the piece of code above, a temporary register REG_TMP1 is used to hold an intermediate result. A real-world JIT compiler would try to apply some optimizations, e.g. using register allocation methods known from compiler construction, to reduce the number of registers used in the translated code.

Another simplification is that the code above does not show how the JIT system learns at which location in the emulated code execution should continue once the JITed piece of native code returns. This could be implemented so that the code emulating the branch instructions writes the following PC value to a special register.

The JIT compiler would then run a loop like this:

pc = entry_point();

while (!end) {
  // check if basic block is already translated
  if (! basic_block_is_translated(pc)) {
    translate_basic_block(pc);
  }

  // call the generated native instructions
  pc = call(translation_buffer_address(pc));
}

This code has no benefits if the program to be translated has no loops (or functions which are called multiple times); in fact, it would possibly imply some overhead since code is first translated and then executed only once. However, as soon as a basic block is executed multiple times, we only need to translate this basic block once and then only call the code repeatedly.
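The effect of "translate once, execute many times" can be illustrated with a small bookkeeping sketch in C. It does not generate any code; it only counts how often the (expensive) translator would run compared to how often translated code is called. The structure and function names (struct jit_stats, dispatch) are invented for this example, and the linear search stands in for whatever lookup structure a real JIT system would use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 64

/* Bookkeeping for a toy JIT dispatcher (hypothetical, for illustration). */
struct jit_stats {
    uint32_t translated_pc[MAX_BLOCKS]; /* start PCs of translated blocks */
    size_t used;                        /* number of valid entries */
    unsigned translations;              /* how often the translator ran */
    unsigned executions;                /* how often translated code ran */
};

static bool is_translated(const struct jit_stats *jit, uint32_t pc)
{
    for (size_t i = 0; i < jit->used; i++)
        if (jit->translated_pc[i] == pc)
            return true;
    return false;
}

/* One step of the main loop: translate the block at `pc` on first use,
 * then "execute" it. Returns false if the table is full. */
static bool dispatch(struct jit_stats *jit, uint32_t pc)
{
    if (!is_translated(jit, pc)) {
        if (jit->used == MAX_BLOCKS)
            return false;
        jit->translated_pc[jit->used++] = pc;
        jit->translations++;
    }
    jit->executions++;
    return true;
}
```

Dispatching the same block address 100 times results in 100 executions but only a single translation, which is exactly where the JIT approach saves time compared to the interpreter.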

Possible Optimizations

This approach can be optimized in a number of ways. The first problem with JIT translation is that the translation process requires additional memory to store the native code. One can reduce the memory overhead here by evicting translations of basic blocks from the translation buffer after some time. Of course, this brings along all the well-known problems of cache replacement algorithms; so if an evicted basic block translation is needed again, the related code has to be retranslated.
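A minimal sketch of such an eviction scheme, assuming a tiny direct-mapped translation buffer in which each emulated block address maps to exactly one slot. The names and the slot function are made up for this illustration, and real systems use more elaborate replacement policies; the point is only to show that a conflicting block evicts the previous translation, which then has to be redone later.

```c
#include <stdbool.h>
#include <stdint.h>

#define TB_SLOTS 4 /* deliberately tiny to provoke evictions */

struct tb_slot {
    bool valid;
    uint32_t guest_pc; /* start address of the translated block */
};

struct translation_buffer {
    struct tb_slot slots[TB_SLOTS];
    unsigned translations; /* number of (re)translations performed */
    unsigned evictions;    /* number of translations thrown away */
};

/* Hypothetical mapping of a block address to a slot. */
static unsigned slot_index(uint32_t pc)
{
    return (pc >> 4) % TB_SLOTS;
}

/* Makes sure a translation for `pc` is present.
 * Returns true on a hit, false if the block had to be (re)translated. */
static bool tb_lookup_or_translate(struct translation_buffer *tb, uint32_t pc)
{
    struct tb_slot *slot = &tb->slots[slot_index(pc)];

    if (slot->valid && slot->guest_pc == pc)
        return true;

    if (slot->valid)
        tb->evictions++; /* another block occupied this slot */

    slot->valid = true;
    slot->guest_pc = pc;
    tb->translations++;
    return false;
}
```

With this mapping, the blocks at 0x100 and 0x140 share a slot. Alternating between them means that every access misses and triggers a retranslation, the classic worst case of a direct-mapped cache.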

The JIT approach also has some benefits. One of them is that the translation process is dynamic, so it follows the execution of the currently emulated program instance. Thus, if a path through a program is never taken, the related code is never translated, saving time and memory space. For example, if you only edit a document in a word processor but do not print it, the printing code is never executed in this specific program run.

Practical Problems with JIT on Modern Computers and Operating Systems

Implementing a JIT compiler today is a bit more complex than the pattern described above, of course. I'll describe a selection of real-world problems below.

On modern operating systems, a process is normally not allowed to modify its executable code. This so-called "W^X" (write XOR execute -- a memory page can be either writable or executable, but not both at the same time) protection of code (text segment) memory pages serves as a protection against malware, which often tries to overwrite existing program code, e.g. by exploiting a buffer overflow, in order to change the instructions, and thus the behaviour, of the attacked program. Accordingly, some additional calls to the OS (e.g. mprotect on Unix) and possibly special capabilities are required so that a JIT compiler can also execute the code it generated.
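On a Unix-like system, the sequence of OS calls could look roughly like the following sketch in C, which uses the POSIX calls mmap and mprotect. The function name seal_generated_code is made up, MAP_ANONYMOUS is a widely available extension rather than part of POSIX itself, and the sketch deliberately stops before jumping into the buffer: on ARM, the instruction cache has to be synchronized first, and macOS on Apple Silicon imposes additional requirements on JIT memory that are not shown here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copies `len` bytes of generated code into a fresh mapping and then
 * switches that mapping from writable to executable.
 * Returns NULL on error (hypothetical helper, for illustration). */
static void *seal_generated_code(const unsigned char *code, size_t len)
{
    /* 1. Ask for memory that is writable but not executable. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* 2. "Emit" the code while the pages are still writable. */
    memcpy(buf, code, len);

    /* 3. Drop write permission and add execute permission, so the
     *    pages are never writable and executable at the same time. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

A JIT system that wants to add more code later has to switch the pages back to writable first (or use a second mapping), which is one of the reasons why translation buffers are usually filled in larger chunks.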

Exception handling is another problem. Whenever a program does something out of the ordinary, e.g. it tries to divide by zero or attempts to read from an invalid memory address, the JIT translated program has to detect this condition and handle it accordingly. This exception handling can cause significant overhead.

For the multicore processors available today, another problem is the semantics of parallel execution of threads of a process on multiple cores. I won't go into details here (this might be an interesting topic for a future blog post), but differences in memory access ordering for concurrent reads and writes of different cores create problems that might change the semantics of a translated multithreaded program that is being executed on multiple cores. A correct implementation of a different memory ordering in software requires significant overhead. Spoiler: Apple has implemented additional functionality for different store order semantics in Apple Silicon cores to make emulation more efficient.
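To hint at where the software overhead comes from: x86 guarantees a rather strong ordering of ordinary loads and stores, while ARM by default guarantees much less. A conservative translator therefore cannot map an emulated memory access to a plain native load or store. The following sketch in C expresses the idea with C11 atomics, treating every emulated store as a release operation and every emulated load as an acquire operation. This is only a rough approximation of the x86 model for illustration, and the names guest_word, guest_store and guest_load are invented; a real translator works on machine instructions, not on C11 atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A word of emulated memory as seen by translated code (hypothetical). */
typedef _Atomic uint32_t guest_word;

/* x86 does not reorder a store with earlier loads or stores, so the
 * sketch gives every emulated store release semantics ... */
static void guest_store(guest_word *addr, uint32_t value)
{
    atomic_store_explicit(addr, value, memory_order_release);
}

/* ... and every emulated load acquire semantics, since x86 does not
 * reorder a load with later loads or stores either. */
static uint32_t guest_load(guest_word *addr)
{
    return atomic_load_explicit(addr, memory_order_acquire);
}
```

On a weakly ordered machine, each of these calls turns into a special ordered access or an additional barrier instruction instead of a plain load or store, and this cost is paid on every emulated memory access.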

Much more information on approaches to JIT compilation and possible optimizations can be found in the great book on virtualization by Smith and Nair [2].

Static Binary Translation

One problem with JIT translation is that all the work invested to translate (parts of) the program is lost after the end of the program's execution. Some binary translation systems, such as Digital's FX!32 [3], which JIT translated x86 code to code for Digital's 64-bit Alpha AXP processor, cached translation results beyond the runtime of a program. The Alpha was essentially in the same position as Apple Silicon is today -- its performance was significantly higher than the performance of x86 processors of the time, so FX!32 enabled fast execution of JIT translated x86 binaries on that platform.

Can we improve this somehow? Let us compare the translation process of code to the translation of natural languages. On the one hand, you can hire a (human or AI) language translator to translate, for example Norwegian to German, as the words are spoken or read. This is interpretation which, of course, has significant overhead.

JIT translation for natural languages would require translating larger blocks, e.g. paragraphs, one at a time and caching the results. Since text does not tend to repeat that often in spoken or written language, this unfortunately breaks the analogy a bit ;-).

For natural languages, of course, the problem of efficiency has been solved for many millennia. What you can do is to translate the foreign language text once and write down the translated result in a book or essay. After that, you don't need the original any more and can refer to the translated text. However, there are some problems with this, for example if the translated text is imprecise or ambiguous (I think readers of the Bible will have quite some experience with this), which requires referencing the original text for clarity.

We can try to do the same to translate programs from one binary representation to another one. This is called static binary translation and comes with its own set of problems. For example, problems similar to those of translating books can also show up here and require referencing the original binary. We will take a look at static binary translation in an upcoming blog post.

There's more to emulation and translation

Emulating the CPU is usually not sufficient to execute a program. One important question is whether you only want to emulate user mode programs or also run a complete operating system in emulation.

For user mode emulation, the overhead of emulating the CPU itself is lower, since user mode programs only have access to a restricted subset of a CPU's functionality. However, all interaction of the program with the underlying OS has to be emulated. For similar systems (e.g. different Unix systems or even the same OS running on a different CPU platform, as is the case for macOS) this can be relatively straightforward. Supporting the system calls of a different OS is much more work. This is, for example, implemented in the Windows Subsystem for Linux (WSL), which allows the execution of Linux user-mode programs on Windows 10 (but does not perform binary translation, since the source and target platform are both x86-64).

In case you also want to run a complete OS (or other bare-metal code) for a different architecture, you need to reproduce the behaviour of the underlying hardware of the complete system, including I/O, the memory system, possible coprocessors, etc. This comes with a lot of overhead, but is routinely done e.g. to emulate vintage computer systems or game consoles.

The next blog post in this series will have a closer look at static binary translation and the related problems before we dig deeper into Rosetta 2.


  • [1] Joseph A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers. 30 (7): 478–490. doi:10.1109/TC.1981.1675827.
  • [2] Jim Smith and Ravi Nair. Virtual Machines -- Versatile Platforms for Systems and Processes. 1st Edition 2005. Morgan Kaufmann. ISBN-13: 978-1558609105
  • [3] DIGITAL FX!32: Combining Emulation and Binary Translation from the Digital Technical Journal, Volume 9 Number 1, 1997. (pdf)

Tags: Apple Silicon, M1, Rosetta 2, binary translation, ARM, emulation, JIT

RISC V operating systems

September 05, 2020 — Michael Engel

Some of my master project students will work on porting different operating systems (Oberon, Plan 9 and Inferno) to 32-bit RISC V-based systems. The limitation to 32-bit means that it should be possible to run the ported systems on small FPGA-based boards, such as the ultraembedded RISC V SoC running on a Digilent Arty board (based on a Xilinx Artix 7 FPGA) or a Radiona ULX3S, which uses a Lattice ECP5 FPGA and is supported by the open source symbiflow Verilog toolchain.

To provide my students with a bit of example code, an existing more-or-less complex OS running on RISC V would be nice to have. One well executed and documented teaching OS is MIT's xv6, a reimplementation of 6th edition Unix in modern C which is missing some features not that relevant for an OS course (or left as an exercise for the students).

There is a port of xv6 available for 64-bit RISC V. This doesn't work out of the box for 32-bit RISC V (RV32I), since the size of data types and registers is obviously different and the virtual memory management is different (sv32 instead of sv39). Thus, I created a RV32I port of xv6 on a rainy afternoon here in Trondheim. This version currently runs in qemu; I'm working on a port to the ultraembedded RISC V SoC. Here's the most boring screenshot ever ;-).

xv6-rv32 running in qemu
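To illustrate the virtual memory difference mentioned above: sv32 splits a 32-bit virtual address into two 10-bit page table indices plus a 12-bit page offset (two-level page table), whereas sv39 splits a 39-bit virtual address into three 9-bit indices plus the same 12-bit offset (three-level page table). The following C sketch extracts these fields; the function names are my own and are not taken from the xv6 sources.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4 KiB pages in both sv32 and sv39 */

/* sv32: two levels (0 and 1), 10 index bits per level. */
static uint32_t sv32_vpn(uint32_t va, int level)
{
    return (va >> (PAGE_SHIFT + 10 * level)) & 0x3FF;
}

/* sv39: three levels (0, 1 and 2), 9 index bits per level. */
static uint32_t sv39_vpn(uint64_t va, int level)
{
    return (uint32_t)((va >> (PAGE_SHIFT + 9 * level)) & 0x1FF);
}

/* The offset inside a page is the same in both schemes. */
static uint32_t page_offset(uint64_t va)
{
    return (uint32_t)(va & 0xFFF);
}
```

For the sv32 address 0x80400123, this gives index 0x201 at level 1, index 0 at level 0 and a page offset of 0x123. Code that walks page tables therefore needs a different loop count, different shift widths and different page table entry sizes in the 32-bit port.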

My small project seems to have found at least one interested person. Jim Huang has re-based my port to have proper diffs against the 64-bit xv6 port and might use it for one of his courses:

Nice to see this is useful for people on the other side of the world ;-).

Tags: RISC V, xv6, operating system

Tiny computers are great!

September 05, 2020 — Michael Engel

So far, I have been using my trusty old Macbook 12" from 2015 as my main office computer, still running MacOS X High Sierra (I don't agree with Apple's decisions to dumb down and lock down the more recent versions of what they now call macOS). However, the Macbook is getting a bit long in the tooth, buying a new x86-based Macbook doesn't really make sense, and the ARM-based Macs aren't out yet (and will only run macOS 11 "Big Sur" with all the downsides of the new version).

This means I am motivated to look for a new OS platform for the first time since 2000 - I have used MacOS X on a blue and white G3 PowerPC machine since then, starting with the developer previews. While running Raspberry Pi OS (formerly called Raspbian) on a Raspberry Pi 4 with 8 GB RAM is almost useful, especially web browsing on the RPi is a bit painful.

There are two major web browsers available on Raspberry Pi OS. Firefox (my preferred browser) is unfortunately rather slow on the RPi4. Chromium runs much faster, but is extremely crash-prone. Even though the previous session is restored after a restart, this is not really ideal. So, almost there, but not quite.

What other alternatives are available, then? The number of affordable ARM-based systems is rather low. I was thinking of buying a Honeycomb LX2K board made by SolidRun, which is based on an NXP LX2160A 16-core ARM Cortex-A72 SoC that is intended for use in the high-end communication market (e.g., it has several 10 Gbps Ethernet ports). This board can take up to 64 GB RAM, but has long lead times and is rather expensive here in Norway (around 10,000 NOK plus RAM, disk, case, power supply and video card).

RISC V systems able to run a desktop OS are not yet available; the HiFive Unleashed board by SiFive is sold out and the new FPGA board based on Microsemi's PolarFire SoC/FPGA only has 2 GB of RAM.

So, back to x86-64 for now. I try to avoid systems with Intel CPUs (had no choice with the Macbook, unfortunately) due to their handling of the Meltdown/Spectre fiasco and their creepy Management Engine. While AMD does not fare much better in both respects, it seemed like the less unattractive option. In addition, the new Ryzen Renoir systems seem rather attractive due to their price/performance relation.

However, I did not want a large tower-style PC, but something smaller. Luckily, Asus has recently announced the PN50, a mini PC (11x11x6 cm^3) with up to an 8-core Ryzen 4000, two DDR4 SO-DIMM slots, an M.2 2280 NVMe SSD slot and a slot for a regular 2.5" SATA drive. The PN50 comes as a barebones PC (bring your own RAM and disk; the Wifi PCIe card - an Intel Wi-Fi 6 AX200 - is included). All this for a reasonable price - RAM and SSD prices are rather affordable at the moment, too.

My new desktop system is now a PN50-BBR545MD-CSM with a six-core AMD Ryzen 5 4500U CPU (no hyperthreading), 64 GB DDR4-3200 SO-DIMM, a 1 TB Kingston SSD and a 2 TB ST2000LM015 spinning rust disk. All this for less than 10,000 NOK (ca. 1000 Eur). This machine is small enough to take home in the evening, though you need to remember to pack the (tiny) external power supply...

Itsy-bitsy six-core workstation PN50

So far, the system works quite well. A couple of things I noticed:

  • There seems to be no way to boot from the internal SATA disk
  • There is no boot device selection shortcut (but you can temporarily choose a different boot device in the BIOS)
  • The fan can get quite loud under load, but that's probably expected...
  • It could use more USB A ports, three (two in the back, one in the front) are tight for camera, keyboard and mouse (though it also has two USB-C ports)

However, the choice of an operating system was not easy. In Corona times, there are requirements to run commercial software such as:

  • zoom
  • Slack
  • Skype
  • and even Microsoft Teams (eek, absolutely horrible!)

This means that Linux is more or less the only option here. OpenBSD doesn't provide the Linux emulation any more. FreeBSD Linux emulation is an option I might try later, but I needed a system to work with... You didn't think I would consider running Windows, did you? ;) A Hackintosh is also out of the question for an official workstation.

Originally, I wanted to run a systemd(eek, more horrible!)-free distribution. There are not that many around nowadays, it seems. I first tried Alpine Linux, which is based on the musl libc instead of glibc. In general, this worked well after a kernel upgrade (the Renoir Ryzen needs a Linux kernel >= 5.5 to support DRI on the GPU); only the sound was problematic. However, getting commercial software products to run was a nightmare, since they are all linked against glibc (and, of course, no source code is available). Next, I tried Void Linux, which I could not get to support the graphics (the 5.8-1 kernel of void needs "nomodeset" to boot in the framebuffer console, which disables the AMD GPU DRI functionality).

I have been using Linux in one form or another since kernel version 0.12 in 1992 (on an AMD (!) 386DX40 with 8 MB RAM, an ET4000 VGA card, an Adaptec 1542 ISA SCSI controller and a Quantum 730 MB (!) SCSI disk). So it's really sad to see that Linux is in such a sorry state. Thus, a bad compromise currently is to run Ubuntu 20.04 with systemd. It works, the system is fast and stable, the only problematic thing is the audio output, but I got it to work (at least once...).

But I don't feel comfortable with all the (IMHO absolutely unnecessary) changes, configuration stupidity (have you tried enabling a getty on a UART with systemd? Yikes!) and complexity systemd brings along. I have been administering Unix systems for almost 30 years now (started with SunOS 4.1.1 on my trusty old 3/60) and this doesn't feel right. So, still looking for a good alternative here.

Oh, btw., the PN50 hangs at reboot with an error message: "Waiting for process: systemd-journal". Thanks, I guess. Switching off the system in that state has worked so far, fingers crossed.

Tags: AMD, Ryzen, Asus, PN50

Fifteen minutes of (Internet) fame

September 05, 2020 — Michael Engel

Since my previous post on the bare-metal Smalltalk-80 version I built, a number of things have happened.

First, the bare-metal runtime environments I used were rather dated: they only supported Raspberry Pi models based on the original ARM11 SoC (BCM2835, i.e. Raspberry Pi 1 and Zero/Zero W), and the old version of the uspi USB library did not support USB hubs. Since the Raspberry Pi Zero only has a single USB port, connecting keyboard and mouse requires a hub...

So I switched my implementation (github link), now called "crosstalk", to the much more recent circle library, which enabled support for more recent Raspberry Pi models (up to the Raspberry Pi 4, however there is no multi-core support).

Thus, my fifteen minutes of Internet fame started. Since there was some interest in my little project, I created a post on linkedin, which was mentioned in a tweet by Michael Haupt (thanks for all the great feedback!).

Subsequently, the story was picked up by some news outlets:

I got lots of positive feedback (thanks to all who commented, liked my article and starred my github repo!) and some people on the Internet even dared to give it a try. No haters so far, so there's still hope for the Internet ;-).

So, that's my fifteen minutes of fame. However, a number of problems remain:

  • Line drawing operations crash on BCM2835-based systems (this has worked with the old version...)
  • USB didn't work on the 8 GB version of the Raspberry Pi 4 - this seems to be fixed in the most recent version of circle (have to test it)

Also, there are a number of limitations:

  • The resolution must not exceed 2^20 pixels. 1280x720 works nicely, as shown in the picture below (5.5 inch display by seeed studio)
  • The system is still relatively slow, since a significant part of the performance-critical functionality is still implemented in Smalltalk (and there's no JIT compiler)

Smalltalk on Raspi Zero W + 5" display

Alas, more work to do - but there are also other interesting projects upcoming, together with my students. So... stay tuned!

Tags: Smalltalk, Raspberry, bare-metal