Micro Code

Michael's blog about teaching, hardware, software and the things in between

Exploring Binary Translation on Apple Silicon (part 1)

November 21, 2020 — Michael Engel

You have probably read about Apple's new Macs, which switched from Intel CPUs to Apple's in-house designed M1 "Apple Silicon" chip, which is based on the 64-bit ARM (AArch64) instruction set architecture (ISA).

When a company makes a transition like this, one of the challenges is to ensure that the existing base of software products continues to work. Since the AArch64 ISA is not compatible with the older x86-64 architecture, applications have to be recompiled to run on ARM (this would not be an insurmountable problem if all applications were open source). However, some companies are slow to recompile their software, and others might even have lost the source code to their software -- this happens far more often than you think -- or went out of business years ago. Nevertheless, there will be a customer out there who relies on such an old piece of binary-only x86 software, and you don't want to discourage them from buying your shiny new hardware...

Thus, for a transition period, other solutions are required to run old, binary-incompatible software and keep your users happy. This is what this blog post series is about; we will focus on Rosetta 2 in upcoming posts. In this first part of the series, I will provide you with a bit of background on emulation and binary translation.

Interpreting Emulators

One very simple approach is to create an emulator that interprets the x86 machine instructions one after the other in order of the program's control flow and dynamically maps them to instructions the new CPU can understand, often building a model of the emulated CPU in C or assembly.

An example of how an interpreting emulator implements the x86 instruction "ADD AL, imm8" (add an 8-bit immediate value to the accumulator low byte) might look like this (caution, this is untested pseudo-code written for legibility):

while (!end) {
  instruction = get_opcode(regs.pc);
  switch (instruction) { 
    case 0x04: // opcode byte for ADD AL, imm8
      // fetch operand byte
      operand = get_byte(regs.pc+1);

      // update result
      regs.ax = (regs.ax & 0xFFFFFF00) | (((regs.ax & 0xFF) + operand) & 0xFF);

      // set condition flags
      regs.flags.cf = ...; // carry
      regs.flags.of = ...; // overflow

      // update program counter
      regs.pc = regs.pc + 2;
      break;

    case 0x05: // opcode byte for ADD EAX, imm32
      ...
  }
}

It's probably easy to see that the interpreted emulation of this simple opcode in C requires quite a bit of overhead. In the emulator code, things that run in parallel on real hardware, such as updating the processor flags and the PC while the result of the addition is calculated, have to be executed serially and thus take much more time compared to execution in hardware.

The interpreter also has to assume the worst case, namely that all of the calculated results are required. It therefore always has to calculate the values of the carry and overflow flags, even if they are never read or are immediately overwritten by subsequent code -- it has no knowledge of future instructions.
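To make the elided flag updates a bit more concrete, here is a minimal sketch of how the carry and overflow flags for an 8-bit addition could be computed in C. The struct layout and the names (guest_regs, emulate_add_al_imm8) are made up for this illustration and do not come from any real emulator; the remaining x86 flags (zero, sign, parity, auxiliary carry) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal guest register state (names are illustrative). */
struct guest_regs {
    uint32_t eax;   /* AL is the low byte of EAX */
    uint32_t pc;    /* emulated program counter */
    bool cf;        /* carry flag */
    bool of;        /* overflow flag */
};

/* Emulate ADD AL, imm8: only the low byte of EAX changes. */
static void emulate_add_al_imm8(struct guest_regs *regs, uint8_t operand)
{
    assert(regs);

    uint8_t  al  = (uint8_t)(regs->eax & 0xFF);
    unsigned sum = (unsigned)al + operand;   /* up to 9 significant bits */
    uint8_t  res = (uint8_t)sum;

    /* Carry: the unsigned 8-bit addition produced a ninth bit. */
    regs->cf = sum > 0xFF;

    /* Overflow: both inputs have the same sign bit, but the result differs. */
    regs->of = ((al ^ res) & (operand ^ res) & 0x80) != 0;

    regs->eax = (regs->eax & 0xFFFFFF00u) | res;
    regs->pc += 2;                           /* opcode byte + immediate byte */
}
```

Even in this reduced form, every emulated ADD pays for the flag computation, whether or not a later instruction ever looks at the flags.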

This mode of emulation is especially inefficient when loops are to be executed. Here, the same set of cases of our switch statement would have to be executed over and over again.

Dynamic Binary Translation -- Just-in-Time

To improve the execution performance for code that is executed multiple times in one program run, it is useful to cache the results of a translation process, i.e. the instructions of the emulator that are executed while a program is emulated.

This is the basic idea behind just-in-time translation (JIT). Instead of executing emulator code whenever an instruction to be emulated is encountered, the JIT compiler generates a sequence of machine instructions to emulate the instruction and stores this sequence in a translation buffer. After the translation of some instructions, the JIT system jumps to the generated code.

The problem with this is how many instructions are to be translated before the JIT system can start to execute them. To get an idea of what is happening here, we need to take a look at a fundamental concept of program structure: basic blocks.

Wikipedia defines a basic block as follows:

In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.

Accordingly, the code execution inside of a basic block is predictable; since there is no control flow except at the end, we can always translate all of the instructions inside of a basic block at the same time (this can be made even more efficient through hyperblocks and methods such as trace scheduling [1]). So some pseudo-code for a JIT compiler might look like this (note this is not too dissimilar from our interpreter code above):

pc_type translate_basic_block(pc_type pc) {
  // mark basic block as translated
  mark_basic_block_translated(pc);

  // start translation of instructions from the start of the basic block until the first branch instruction
  do {
    instruction = get_opcode(pc);
    switch (instruction) {
      case 0x04: // opcode byte for ADD AL, imm8
        // fetch operand byte
        operand = get_byte(pc+1);

        // generate code to calculate the result
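        // extract AL first, then add and truncate to 8 bits,
        // and only then merge the result back into the register
        emit(INSN_ANDI, nativeregs[REG_TMP1], nativeregs[REG_AX], 0xFF);
        emit(INSN_ADD,  nativeregs[REG_TMP1], nativeregs[REG_TMP1], operand);
        emit(INSN_ANDI, nativeregs[REG_TMP1], nativeregs[REG_TMP1], 0xFF);
        emit(INSN_ANDI, nativeregs[REG_AX], nativeregs[REG_AX], 0xFFFFFF00);
        emit(INSN_OR,   nativeregs[REG_AX], nativeregs[REG_AX], nativeregs[REG_TMP1]);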

        // emit code to calculate the condition flags
        emit(....); // carry
        emit(....); // overflow

        // update the program counter
        pc = pc + 2;
        break;

      case 0x05: // opcode byte for ADD EAX, imm32
        ...
    }
  } while (type(instruction) != BRANCH);

  // emit an instruction to return to the JIT system's main loop
  emit(INSN_RET);

  // return the emulated PC that follows the translated basic block
  return pc;
}

In this code I make a number of assumptions which might be problematic in a real-world JIT compiler. One assumption is that the target machine has more registers than the machine to be emulated. In the piece of code above, a temporary register REG_TMP1 is used to hold an intermediate result. A real-world JIT compiler would try to apply some optimizations, e.g. using register allocation methods known from compiler construction, to reduce the number of registers used in the translated code.

Another simplification is that the code above leaves out how the JITed piece of native code indicates, before it returns, at which location in the emulated code the execution should continue. This could be implemented so that the code emulating the branch instructions writes the following PC value to a special register.

The JIT compiler would then run a loop like this:

pc = entry_point();

while (!end) {
  // check if basic block is already translated
  if (! basic_block_is_translated(pc)) {
    translate_basic_block(pc);
  }

  // call the generated native instructions
  pc = call(translation_buffer_address(pc));  
}

This code has no benefits if the program to be translated has no loops (or functions which are called multiple times); in fact, it would possibly imply some overhead since code is first translated and then executed only once. However, as soon as a basic block is executed multiple times, we only need to translate this basic block once and then only call the code repeatedly.
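The "translate once, execute many times" effect can be tried out with a small, self-contained sketch. Everything here is an assumption made for illustration: the three-instruction guest ISA is invented, and instead of real native code the "translation" produces an array of pre-decoded host function pointers as a stand-in for the translation buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy guest ISA, invented for this sketch. */
enum op { OP_ADDI, OP_BNEZ_DEC, OP_HALT };
struct insn { enum op op; int arg; };

struct cpu { int acc; int counter; bool halted; };

/* A "translated" instruction: host function plus pre-decoded operands. */
struct host_op {
    int (*fn)(struct cpu *c, int arg, int fallthrough);  /* returns next PC */
    int arg;
    int fallthrough;
};

#define MAX_PC    16
#define MAX_BLOCK 8
struct block { bool translated; int len; struct host_op ops[MAX_BLOCK]; };

static struct block cache[MAX_PC];   /* translation buffer, indexed by guest PC */
static int translations;             /* how often the translator had to run */

static int h_addi(struct cpu *c, int arg, int ft) { c->acc += arg; return ft; }
static int h_bnez_dec(struct cpu *c, int arg, int ft) { return --c->counter != 0 ? arg : ft; }
static int h_halt(struct cpu *c, int arg, int ft) { (void)arg; c->halted = true; return ft; }

/* Translate from pc up to and including the first branch (or halt). */
static void translate_basic_block(const struct insn *prog, int pc)
{
    struct block *b = &cache[pc];
    b->len = 0;
    for (;;) {
        struct insn in = prog[pc];
        assert(b->len < MAX_BLOCK);
        struct host_op *o = &b->ops[b->len++];
        o->arg = in.arg;
        o->fallthrough = pc + 1;
        switch (in.op) {
        case OP_ADDI:     o->fn = h_addi;     break;
        case OP_BNEZ_DEC: o->fn = h_bnez_dec; break;
        case OP_HALT:     o->fn = h_halt;     break;
        }
        pc++;
        if (in.op != OP_ADDI)   /* control flow ends the basic block */
            break;
    }
    b->translated = true;
    translations++;
}

/* Main loop: translate on first use, then run the cached block. */
static int run(const struct insn *prog, struct cpu *c)
{
    int pc = 0;
    while (!c->halted) {
        assert(pc >= 0 && pc < MAX_PC);
        if (!cache[pc].translated)
            translate_basic_block(prog, pc);
        const struct block *b = &cache[pc];
        for (int i = 0; i < b->len; i++)
            pc = b->ops[i].fn(c, b->ops[i].arg, b->ops[i].fallthrough);
    }
    return translations;
}
```

For a guest program consisting of an ADDI, a decrementing loop branch back to the start and a HALT, the loop body is executed once per iteration, but the translator only runs once for the loop block and once for the block containing the HALT.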

Possible Optimizations

This approach can be optimized in a number of ways. The first problem with JIT translation is that the translation process requires additional memory to store the native code. One can reduce the memory overhead here by evicting translations of basic blocks from the translation buffer after some time. Of course, this brings along all the well-known problems of cache replacement algorithms; so if an evicted basic block translation is needed again, the related code has to be retranslated.

The JIT approach also has some benefits. One of them is that the translation process is dynamic, so it follows the execution of the currently emulated program instance. Thus, if a path through a program is never taken -- for example, you only edit a document in a word processor but do not print it, so the printing code is never executed in this specific program run -- the related code is never translated, saving time and memory space.

Practical Problems with JIT on Modern Computers and Operating Systems

Implementing a JIT compiler today is a bit more complex than the pattern described above, of course. I'll describe a selection of real-world problems below.

On modern operating systems, a process is not allowed to modify its executable code. This so-called "W^X" (write XOR execute -- a memory page can be either writable or executable, but not both at the same time) protection of code (text segment) memory pages serves as a protection against malware, which often tries to overwrite existing program code, e.g. by exploiting a buffer overflow, in order to change the instructions, and thus the behaviour, of the attacked program. Accordingly, some additional calls to the OS (e.g. mprotect on Unix) and possibly special capabilities are required so that a JIT compiler can also execute the code it generated.

Exception handling is another problem. Whenever a program does something out of the ordinary, e.g. it tries to divide by zero or attempts to read from an invalid memory address, the JIT translated program has to detect this condition and handle it accordingly. This exception handling can cause significant overhead.

For the multicore processors available today, another problem is the semantics of parallel execution of threads of a process on multiple cores. I won't go into details here (this might be an interesting topic for a future blog post), but differences in memory access ordering for concurrent reads and writes of different cores create problems that might change the semantics of a translated multithreaded program that is being executed on multiple cores. A correct implementation of a different memory ordering in software requires significant overhead. Spoiler: Apple has implemented additional functionality for different store order semantics in Apple Silicon cores to make emulation more efficient.
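To give an impression of where the overhead comes from, here is a small sketch using C11 atomics. x86 makes stores visible in program order, so the classic "write data, then set a flag" pattern works with plain loads and stores there. On a weakly ordered host, a translator has to emit ordered accesses instead; in this sketch that is modeled with release stores and acquire loads. The helper names are hypothetical, and release/acquire is only claimed to be sufficient for this particular message-passing pattern, not to be a complete emulation of the x86 memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Guest memory cells shared between two emulated x86 threads (illustrative). */
static _Atomic uint32_t guest_data;
static _Atomic uint32_t guest_flag;

/* What a translator might emit for a plain x86 store on a weakly ordered
 * host: earlier stores must be visible before this one. */
static inline void guest_store32(_Atomic uint32_t *addr, uint32_t value)
{
    atomic_store_explicit(addr, value, memory_order_release);
}

/* Counterpart for a plain x86 load: later accesses must not be
 * performed before this one. */
static inline uint32_t guest_load32(_Atomic uint32_t *addr)
{
    return atomic_load_explicit(addr, memory_order_acquire);
}

/* Emulated thread A: "mov [data], 42; mov [flag], 1". */
static void guest_thread_a(void)
{
    guest_store32(&guest_data, 42);
    guest_store32(&guest_flag, 1);
}

/* Emulated thread B: spin until the flag is set, then read the data.
 * With relaxed accesses, a weakly ordered host could return stale data. */
static uint32_t guest_thread_b(void)
{
    while (guest_load32(&guest_flag) == 0) {
        /* busy wait */
    }
    return guest_load32(&guest_data);
}

/* Sequential smoke test; a real experiment would run the two functions
 * on separate host threads. */
static uint32_t run_sequentially(void)
{
    guest_thread_a();
    uint32_t value = guest_thread_b();
    assert(value == 42);
    return value;
}
```

The point is that every single guest memory access is affected, not just the few that actually synchronize threads, since the translator usually cannot tell them apart.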

Much more information on approaches to JIT compilation and possible optimizations can be found in the great book on virtualization by Smith and Nair [2].

Static Binary Translation

One problem with JIT translation is that all the work invested to translate (parts of) the program is lost after the end of the program's execution. Some binary translation systems, such as Digital's FX!32 [3], which JIT translated x86 code to code for Digital's 64-bit Alpha AXP processor, cached translation results beyond the runtime of a program. The Alpha was essentially in the same position as Apple Silicon is today -- its performance was significantly higher than the performance of x86 processors of the time, so FX!32 enabled fast execution of JIT translated x86 binaries on that platform.

Can we improve this somehow? Let us compare the translation process of code to the translation of natural languages. On the one hand, you can hire a (human or AI) language translator to translate, for example Norwegian to German, as the words are spoken or read. This is interpretation which, of course, has significant overhead.

JIT translation for natural languages would require translating larger blocks, e.g. paragraphs, one at a time and caching the results. Since text does not tend to repeat that often in spoken or written texts, this unfortunately breaks the analogy a bit ;-).

For natural languages, of course, the problem of efficiency has been solved for many millennia. What you can do is to translate the foreign language text once and write down the translated result in a book or essay. After that, you don't need the original any more and can refer to the translated text. However, there are some problems with this, for example if the translated text is imprecise or ambiguous (I think readers of the Bible will have quite some experience with this), which requires referencing the original text for clarity.

We can try to do the same to translate programs from one binary representation to another one. This is called static binary translation and comes with its own set of problems. For example, problems similar to those of translating books can also show up here and require referencing the original binary. We will take a look at static binary translation in an upcoming blog post.

There's more to emulation and translation

Emulating the CPU is usually not sufficient to execute a program. One important question is whether you only want to run user mode programs or also run a complete operating system in emulation.

For user mode emulation, the overhead of emulating the CPU itself is lower, since user mode programs only have access to a restricted subset of a CPU's functionality. However, all interaction of the program with the underlying OS has to be emulated. For similar systems (e.g. different Unix systems or even the same OS running on a different CPU platform, as is the case for macOS) this can be relatively straightforward. Supporting the system calls of a different OS is much more work. This is, for example, implemented in the Windows Subsystem for Linux (WSL), which allows the execution of Linux user-mode programs on Windows 10 (but does not perform binary translation, since the source and target platform are both x86-64).

In case you also want to run a complete OS (or other bare-metal code) for a different architecture, you need to reproduce the behaviour of the underlying hardware of the complete system, including I/O, the memory system, possible coprocessors, etc. This comes with a lot of overhead, but is routinely done e.g. to emulate vintage computer systems or game consoles.

The next blog post in this series will have a closer look at static binary translation and the related problems before we dig deeper into Rosetta 2.

References

  • [1] Joseph A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers. 30 (7): 478–490. 1981. doi:10.1109/TC.1981.1675827.
  • [2] Jim Smith and Ravi Nair. Virtual Machines -- Versatile Platforms for Systems and Processes. 1st Edition 2005. Morgan Kaufmann. ISBN-13: 978-1558609105
  • [3] DIGITAL FX!32: Combining Emulation and Binary Translation from the Digital Technical Journal, Volume 9 Number 1, 1997. (pdf)

Tags: Apple Silicon, M1, Rosetta 2, binary translation, ARM, emulation, JIT

RISC V operating systems

September 05, 2020 — Michael Engel

Some of my master project students will work on porting different operating systems (Oberon, Plan 9 and Inferno) to 32-bit RISC V-based systems. The limitation to 32-bit means that it should be possible to run the ported systems on small FPGA-based boards, such as the ultraembedded RISC V SoC running on a Digilent Arty board (based on a Xilinx Artix 7 FPGA) or a Radiona ULX3S, which uses a Lattice ECP5 FPGA and is supported by the open source symbiflow Verilog toolchain.

To provide my students with a bit of example code, an existing more-or-less complex OS running on RISC V would be nice to have. One well executed and documented teaching OS is MIT's xv6, a reimplementation of 6th edition Unix in modern C which is missing some features not that relevant for an OS course (or left as an exercise for the students).

There is a port of xv6 available for 64-bit RISC V. This doesn't work out of the box for 32-bit RISC V (RV32I), since the size of data types and registers is obviously different and the virtual memory management is different (sv32 instead of sv39). Thus, I created an RV32I port of xv6 on a rainy afternoon here in Trondheim. This version currently runs in qemu; I'm working on a port to the ultraembedded RISC V SoC. Here's the most boring screenshot ever ;-).

xv6-rv32 running in qemu
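To illustrate the virtual memory difference mentioned above: sv32 splits a 32-bit virtual address into two 10-bit page table indices and a 12-bit page offset, while sv39 uses three 9-bit indices and a 12-bit offset. The following sketch only shows this address split; the struct and function names are made up for this example and are not taken from the xv6 sources.

```c
#include <assert.h>
#include <stdint.h>

/* sv32: two-level page table, 10 index bits per level, 4 KiB pages. */
struct sv32_va { uint32_t vpn1, vpn0, offset; };

static struct sv32_va sv32_split(uint32_t va)
{
    struct sv32_va p = {
        .vpn1   = (va >> 22) & 0x3FF,   /* bits 31..22 */
        .vpn0   = (va >> 12) & 0x3FF,   /* bits 21..12 */
        .offset = va & 0xFFF            /* bits 11..0  */
    };
    return p;
}

/* sv39: three-level page table, 9 index bits per level, 4 KiB pages. */
struct sv39_va { uint32_t vpn2, vpn1, vpn0, offset; };

static struct sv39_va sv39_split(uint64_t va)
{
    /* sv39 requires bits 63..39 to be copies of bit 38. */
    uint64_t upper = va >> 38;          /* bits 63..38 */
    assert(upper == 0 || upper == 0x3FFFFFF);

    struct sv39_va p = {
        .vpn2   = (uint32_t)((va >> 30) & 0x1FF),   /* bits 38..30 */
        .vpn1   = (uint32_t)((va >> 21) & 0x1FF),   /* bits 29..21 */
        .vpn0   = (uint32_t)((va >> 12) & 0x1FF),   /* bits 20..12 */
        .offset = (uint32_t)(va & 0xFFF)            /* bits 11..0  */
    };
    return p;
}
```

For the address 0x80001234, sv32_split yields the indices 0x200 and 0x001 with offset 0x234, whereas sv39_split yields the indices 2, 0 and 1 with the same offset, so the page table walking code cannot simply be shared between the two ports.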

My small project seems to have found at least one interested person. Jim Huang has rebased my port to have proper diffs against the 64-bit xv6 port and might use it for one of his courses.

Nice to see this is useful for people on the other side of the world ;-).

Tags: RISC V, xv6, operating system

Tiny computers are great!

September 05, 2020 — Michael Engel

So far, I have been using my trusty old Macbook 12" from 2015 as my main office computer, still running MacOS X High Sierra (I don't agree with Apple's decisions to dumb down and lock down the more recent versions of what they now call macOS). However, the Macbook is getting a bit long in the tooth, buying a new x86-based Macbook doesn't really make sense, and the ARM-based Macs aren't out yet (and will only run macOS 11 "Big Sur" with all the downsides of the new version).

This means I am motivated to look for a new OS platform for the first time since 2000 - I have used MacOS X on a blue and white G3 PowerPC machine since then, starting with the developer previews. While running Raspberry Pi OS (formerly called Raspbian) on a Raspberry Pi 4 with 8 GB RAM is almost useful, especially web browsing on the RPi is a bit painful.

There are two major web browsers available on Raspberry Pi OS. Firefox (my preferred browser) is unfortunately rather slow on the RPi4. Chromium runs much faster, but is extremely crash-prone. Even though the previous session is restored after a restart, this is not really ideal. So, almost there, but not quite.

What other alternatives are available, then? The number of affordable ARM-based systems is rather low. I was thinking of buying a Honeycomb LX2K board made by SolidRun, which is based on an NXP LX2160A 16-core ARM Cortex-A72 SoC that is intended for use in the high-end communication market (e.g., it has several 10 Gbps Ethernet ports). This board can take up to 64 GB RAM, but has long lead times and is rather expensive here in Norway (around 10,000 NOK plus RAM, disk, case, power supply and video card).

RISC V systems able to run a desktop OS are not yet available; the HiFive Unleashed board by SiFive is sold out and the new FPGA board based on Microsemi's PolarFire SoC/FPGA only has 2 GB of RAM.

So, back to x86-64 for now. I try to avoid systems with Intel CPUs (had no choice with the Macbook, unfortunately) due to their handling of the Meltdown/Spectre fiasco and their creepy Management Engine. While AMD does not fare much better in both respects, it seemed like the less unattractive option. In addition, the new Ryzen Renoir systems seem rather attractive due to their price/performance relation.

However, I did not want a large tower-style PC, but something smaller. Luckily, Asus has recently announced the PN50, a mini PC (11x11x6 cm^3) with up to an 8-core Ryzen 4000, two DDR4 SO-DIMM slots, an M.2 2280 NVMe SSD slot and a slot for a regular 2.5" SATA drive. The PN50 comes as a barebones PC (bring your own RAM and disk; the Wifi PCIe card, an Intel Wi-Fi 6 AX200, is included). All this for a reasonable price - RAM and SSD prices are rather affordable at the moment, too.

My new desktop system is now a PN50-BBR545MD-CSM with a six-core AMD Ryzen 4500 CPU (no hyperthreading), 64 GB DDR4-3200 SO-DIMM, a 1 TB Kingston SSD and a 2 TB ST2000LM015 spinning rust disk. All this for less than 10,000 NOK (ca. 1000 Eur). This machine is small enough to take home in the evening, though you need to remember to pack the (tiny) external power supply...

Itsy-bitsy six-core workstation PN50

So far, the system works quite well. A couple of things I noticed:

  • There seems to be no way to boot from the internal SATA disk
  • There is no boot device selection shortcut (but you can temporarily choose a different boot device in the BIOS)
  • The fan can get quite loud under load, but that's probably expected...
  • It could use more USB A ports, three (two in the back, one in the front) are tight for camera, keyboard and mouse (though it also has two USB-C ports)

However, the choice of an operating system was not easy. In Corona times, there are requirements to run commercial software such as:

  • zoom
  • Slack
  • Skype
  • and even Microsoft Teams (eek, absolutely horrible!)

This means that Linux is more or less the only option here. OpenBSD doesn't provide the Linux emulation any more. FreeBSD Linux emulation is an option I might try later, but I needed a system to work with... You didn't think I would consider running Windows, did you? ;) A Hackintosh is also out of the question for an official workstation.

Originally, I wanted to run a systemd(eek, more horrible!)-free distribution. There are not that many around nowadays, it seems. I first tried Alpine Linux, which is based on the musl libc instead of glibc. In general, this worked well after a kernel upgrade (the Renoir Ryzen needs a Linux kernel >= 5.5 to support DRI on the GPU); only the sound was problematic. However, getting commercial software products to run was a nightmare, since they are all linked against glibc (and, of course, no source code is available). Next, I tried Void Linux, which I could not get to support the graphics (the 5.8-1 kernel of void needs "nomodeset" to boot in the framebuffer console, which disables the AMD GPU DRI functionality).

I have been using Linux in one form or another since kernel version 0.12 in 1992 (on an AMD (!) 386DX40 with 8 MB RAM, an ET4000 VGA card, an Adaptec 1542 ISA SCSI controller and a Quantum 730 MB (!) SCSI Disk). So it's really sad to see that Linux is in such a sorry state. Thus, a bad compromise currently is to run Ubuntu 20.04 with systemd. It works, the system is fast and stable, the only problematic thing is the audio output, but I got it to work (at least once...).

But I don't feel comfortable with all the (IMHO absolutely unnecessary) changes, configuration stupidity (have you tried enabling a getty on a UART with systemd? Yikes!) and complexity systemd brings along. I have been administering Unix systems for almost 30 years now (started with SunOS 4.1.1 on my trusty old 3/60) and this doesn't feel right. So, still looking for a good alternative here.

Oh, btw., the PN50 hangs at reboot with an error message: "Waiting for process: systemd-journal". Thanks, I guess. Switching off the system in that state has worked so far (fingers crossed).

Tags: AMD, Ryzen, Asus, PN50

Fifteen minutes of (Internet) fame

September 05, 2020 — Michael Engel

Since my previous post on the bare-metal Smalltalk-80 version I built, a number of things have happened.

First, the bare-metal runtime environments I used were rather dated: they did not support Raspberry Pi models more recent than those based on the original ARM11 (BCM2835, i.e. Raspberry Pi 1 and Zero/Zero W), and the old version of the uspi USB library did not support USB hubs. Since the Raspberry Pi Zero only has a single USB port, connecting keyboard and mouse requires a hub...

So I switched my implementation (github link), now called "crosstalk", to the much more recent circle library, which enabled support for more recent Raspberry Pi models (up to the Raspberry Pi 4, however there is no multi-core support).

Thus, my fifteen minutes of Internet fame started. Since there was some interest in my little project, I created a post on linkedin, which was mentioned in a tweet by Michael Haupt (thanks for all the great feedback!).

Subsequently, the story was picked up by some news outlets.

I got lots of positive feedback (thanks to all who commented, liked my article and starred my github repo!) and some people on the Internet even dared to give it a try. No haters so far, so there's still hope for the Internet ;-).

So, that's my fifteen minutes of fame. However, a number of problems remain:

  • Line drawing operations crash on BCM2835-based systems (this has worked with the old version...)
  • USB didn't work on the 8 GB version of the Raspberry Pi 4 - this seems to be fixed in the most recent version of circle (have to test it)

Also, there are a number of limitations:

  • The resolution must not exceed 2^20 pixels. 1280x720 works nicely, as shown in the picture below (5.5 inch display by seeed studio)
  • The system is still relatively slow, since a significant part of the performance-critical functionality is still implemented in Smalltalk (and there's no JIT compiler)

Smalltalk on Raspi Zero W + 5" display

Alas, more work to do - but there are also other interesting projects upcoming, together with my students. So... stay tuned!

Tags: Smalltalk, Raspberry, bare-metal

Smalltalk on a small computer

June 12, 2020 — Michael Engel

In my previous post, I mentioned the Smalltalk system developed at Xerox' PARC research center starting in the early 1970s.

Smalltalk is a system that pioneered one of the first object-oriented languages (Objective C by NeXT/Apple inherited Smalltalk's ideas of object orientation and using messages to communicate between objects) and also one of the first graphical user interfaces, starting with the highly influential Xerox Alto computer system. In addition, Smalltalk programs are compiled to virtual machine bytecodes, which are in turn interpreted or JIT translated by the VM system (you didn't think Java invented this, did you?).

I was recently exchanging emails with Michael Haupt, an old friend of mine since our time as students in Siegen. Michael mentioned that my work on getting students interested in Plan 9, Inferno and Oberon was incomplete without also taking a closer look at the Smalltalk system. Specifically, the Smalltalk-80 system, which has excellent documentation in a series of books (collected by Stéphane Ducasse, thanks a lot!) which nowadays are freely available. The Blue Book (Smalltalk-80: The Language and its Implementation by Adele Goldberg and David Robson) is especially relevant, since it describes the details of the VM operation and bytecodes.

Of course, Michael was completely right!

By coincidence, Dan Banay had published a version of the Smalltalk-80 VM implemented in C++ from information in the Blue Book a couple of days prior. His version of the VM uses SDL for graphical output and input management and usually runs on several different Unix systems as well as Windows.

So I thought that might be an interesting project for a rainy weekend to look at. But, of course, simply compiling and running the system on a Unix host is boring, so what else could I do?

Some of my students will be working on porting Plan 9, Inferno and Oberon to RISC V systems. Porting Smalltalk to RISC V was a bit too much for one weekend, so I looked for a challenge which was a bit less complex. I was already playing around with Plan 9 and Inferno on several Raspberry Pis here, so running Smalltalk bare-metal (without a supporting operating system) on the Raspberry sounded like a nice challenge.

I got lucky and found a bare-metal environment that supports running SDL on the Raspberry Pi GPU and also USB input devices using the uspi library. The most recent commit to the bare metal environment was made six years ago, so the platform it supports is the original Raspberry Pi (model 1B) - more recent Raspberry models have quite different processor cores. Luckily, the popular Raspberry Pi Zero models use the same system-on-chip (an ARM11 based Broadcom BCM2835) as the original Raspberry Pi, so these are supported by the library, too.

So I did not have to write too much code here - I combined the bare metal environment with the Smalltalk implementation and... nothing worked. Of course. A number of problems remained to be solved:

  • Debug output was definitely required. I didn't have a JTAG interface set up on the Raspberry (it would be so nice if Raspberrys included a standard JTAG connector...), so I had to set up good old printf-style debugging on the Raspberry's serial interface.
  • The Smalltalk snapshot image (Smalltalk uses a persistent memory image, you don't have to boot the system from scratch every time you power up the system) could not be loaded, as there was no FAT filesystem support in the Smalltalk VM implementation (it usually uses standard POSIX calls). So I had to adapt the Smalltalk VM file system interface to use the FatFs module by Chan which is included in the bare metal environment.
  • Of course, this would have caused (and did...) lots of frustration right at the start of the project, so I employed a trick and converted the Smalltalk snapshot file to a nice, large C header file which was compiled with the VM program. Of course, this required changing all routines to access the image. But this worked well (Hint: xxd --include is really useful for this!).
  • The Raspberry Pi Zero has only one (OTG) USB port. Smalltalk, of course, needs a keyboard and a mouse, so my options were to either use a USB hub or a combined USB transceiver for a wireless keyboard/mouse set. Unfortunately, the ancient version of the uspi library only recognized one device on the wireless transceiver and none connected to the hub. So I spent more than a day integrating the most recent version of uspi. This one detects keyboard and mouse nicely but I was unable to get any data from the devices. After about a day of bug hunting I found out that I had inserted a debug printf in one of the interrupt handlers. Bad idea... the UART runs at 115200 bps, so outputting a 20 character string already takes about 2 milliseconds. No wonder that this completely messes up interrupt handling. Ouch. Now keyboard and mouse work nicely.
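The back-of-the-envelope number for the debug printf in the last item can be checked with a few lines of C. This is only a sketch; the function name is made up, and it assumes the common 8N1 framing, i.e. 10 bits on the wire per character (one start bit, eight data bits, one stop bit).

```c
#include <assert.h>

/* Time in microseconds to send n_chars over a UART, assuming 8N1 framing. */
static double uart_tx_time_us(unsigned n_chars, unsigned baud)
{
    assert(baud > 0);
    const unsigned bits_per_char = 10;  /* 1 start + 8 data + 1 stop (assumed) */
    return (double)n_chars * bits_per_char * 1e6 / baud;
}
```

Under these assumptions, uart_tx_time_us(20, 115200) comes out at roughly 1736 microseconds, which matches the "about 2 milliseconds" above and is a very long time to spend inside an interrupt handler.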

So I was able to stand on the proverbial shoulders of giants and add my little bit of hacking to it. But it runs, as you can see in this video and in the picture below.

However, there are still some problems:

  • The system hangs when switching to a different file in the editor without saving ("accepting") the changes to the current file first. A yes/no requester comes up (implemented as a BinaryChoiceView in the Smalltalk system), the cursor changes to a nice "thumbs up/down" cursor (long before facebook...) and after moving the mouse a bit the buttons are inverted and the system no longer accepts mouse clicks (as shown in the picture below). However, the mouse still moves and the VM is executing instructions. I am currently reading the Smalltalk-80 sources to figure out what is happening there.
  • The current version only runs on systems using the original ARM11-based BCM2835 SoC, but not the more recent multicore Cortex-A based SoCs in the Raspberry Pi 2/3/4. So the only supported Raspberry Pi versions are the original Raspberry Pi 1B (thanks for lending me yours, Joseph!), the Raspberry Pi Zero and the Zero W (with WLAN and Bluetooth, which are both unsupported in Smalltalk as of now). All these systems have been tested successfully. It should also work on the original Raspberry Pi Compute Module (CM1), but I don't have one to test it on. Porting to more recent Raspberrys will at least require new startup code and probably new code for interfacing with the GPU.
  • The system is quite slow and takes about 20 seconds to boot. This is probably caused by the fact that the VM does a separate seek and read for every object of four bytes or so. Here, caching should obviously help.
  • Drawing the GUI is also slow; I will need to take a closer look at the bitblit routines and the SDL implementation. In the long run, getting rid of SDL and running directly on the framebuffer is an interesting option.

So, this is all a wild hack right now and the source code definitely needs some cleanup and tuning in addition to fixing the bugs. But it definitely was a very fun project already and it's great to have a system which enables the user to read the code and understand what's happening under the hood!

It will also serve as a nice basis for upcoming student projects, e.g. to implement a JIT, to work on multiprocessor support and to port to a completely open source (hard- and software) RISC V-based FPGA system.

So, stay tuned for updates and the publication of the code on GitHub... this will take a bit, since the weather here in Trondheim is really nice right now...

Smalltalk on Raspi Zero W screenshot

Summer in Trondheim

Tags: Smalltalk, Raspberry, bare-metal

What have I been up to?

June 12, 2020 — Michael Engel

No blog updates for almost half a year - so at least a summary of what I did since January is in order.

Thinking about all the changes since mid March due to Corona, I get the impression that I spent most of my time here in Trondheim installing countless different videoconferencing and communication tools (Zoom, Teams, Slack, ...) and using them for so many virtual meetings. But, luckily, that's not all.

In fact, I have been busy creating a new version of NTNU's compiler course, since March also in the form of YouTube videos. The semester here in Trondheim is over (and "spring semester" definitely was a misnomer this year; we had new snow until mid May here in Trondheim) and there is time to prepare for student projects.

I am happy to have found five highly motivated master students to work on interesting topics at the intersection of system software and compilers/programming languages. Some of the students will be working on the Plan 9 and Inferno operating systems from Bell Labs as well as Project Oberon by Niklaus Wirth.

Amazingly, all these projects share common ideas. Plan 9 was the successor of research Unix (8th to 10th editions) at Bell Labs, developed initially by the experienced Unix creators (Rob Pike, Ken Thompson, Dave Presotto and Phil Winterbottom). It is especially interesting since it tries to get rid of all the cruft that has accumulated in commercial (and open source - looking at you, Linux!) Unix variants and to build an OS that makes it easy to build distributed systems in a highly networked environment. Inferno is a project inspired by and forked from Plan 9 (around version 2, I think) which replaces native user-mode software by code executed in a virtual machine environment. The success of Java (which is itself already 25 years old now, yikes) seems to have had an influence here, too...

Oberon is the successor of Niklaus Wirth's Pascal and Modula-2 languages. Wirth started to build his own hardware and create the required system software at ETHZ to support his research and teaching. Oberon is the culmination of these projects; with Project Oberon Wirth also designed his own hardware description language (Lola) and RISC CPU (confusingly called RISC5 as the sixth in a series of iteratively more complex RISC designs - but Wirth was first) to run the Oberon language, operating system and integrated user interface. Readers of my blog might remember my previous experiments to run Project Oberon on an FPGA.

How does all of this fit together now? The relation between Plan 9 and Inferno (which is now commercially supported by Vita Nuova in York, UK) is obvious. But there is more to it, which you can see if you compare a screenshot of Project Oberon and Rob Pike's integrated Acme development system for Plan 9. In his paper, Rob mentions a lineage of influences between different systems. It all starts with the Cedar system at Xerox PARC, where Wirth spent sabbaticals in 1976/77 and 1984/85 (according to Wikipedia) and which was a major inspiration for Oberon. Acme, in turn, tries to apply some of the Oberon ideas to a development system for Plan 9 and, subsequently, Inferno with its own Limbo programming language as well as the predecessor Alef on Plan 9.

So we're talking ancient unsupported operating systems here, right? Not quite... some of the Unix developers (Rob Pike and Ken Thompson) now work at Google on the Go programming language. The third person in the Go design team is Robert Griesemer, who was a PhD student at ETH Zürich with Niklaus Wirth and Hanspeter Mössenböck. Small world, isn't it?

I already mentioned Xerox PARC (Palo Alto Research Center). One of the interesting projects at PARC was the Smalltalk system, developed by Alan Kay, Dan Ingalls, Adele Goldberg and many more talented researchers. Do you remember that Wirth spent some time at Xerox? So Smalltalk is the system missing in the overview here. It's very interesting in itself and the topic of a future blog post.

Tags: Trondheim, NTNU, Update, Plan9, Inferno, Oberon, Smalltalk

Oberon

Acme

Picture credits: Oberon - SomPost (license BSDU), Acme - unknown, LPL license

Old VAXen never die...

January 05, 2020 — Michael Engel

...they just move North (shamelessly stolen from the NetBSD/VAX web page).

No, I didn't take an actual VAX up North (my two VAXen are currently in storage in Germany), but a large suitcase full of other electronic parts went with me to Trondheim, Norway in early December. As expected, this resulted in a special luggage examination at Nuremberg airport. Luckily, the customs officers didn't require an explanation of every single one of the many PCBs...

So, as some of my readers already know, I have decided to end my commitment with Coburg University at the end of 2019 in order to accept a position as associate professor for compiler design at NTNU in Trondheim.

NTNU letters

Arriving at the airport in Værnes, I was greeted with the appropriate weather:

Snow at Værnes airport

Luckily, I got an upgrade to my rental car from a puny Polo to a nice Mercedes GLC. This helped quite a bit, considering the road conditions:

Trondheim street

So, last week I started my new job at the institutt for datateknologi og informatikk (Department of Computer Science). Stay tuned for more!

IDI sign

Tags: NTNU, Trondheim, moving, snow, IDI

A Glimpse of Things to Come

November 19, 2019 — Michael Engel

Stay tuned for some big changes ahead... see the teaser picture ;-).

The times they are a-changing

Tags: teaser