Micro Code — Posts tagged "Silicon"

Exploring Binary Translation on Apple Silicon (part 1)

November 21, 2020 — Michael Engel

You have probably read about Apple's new Macs which switched from Intel CPUs to Apple's new in-house designed M1 "Apple Silicon" chip, which is based on the 64-bit ARM (Aarch64) instruction set architecture (ISA).

When a company does a transition like this, one of the challenges is to ensure that the existing base of software products continue to work. Since the Aarch64 ISA is not compatible to the old x86-64 bit architecture, applications have to be recompiled (this would not be an insurmountable problem if all applications were open sourced) to enable running then on ARM. However, some companies are slow to recompile their software and some others might even have lost the source code to their software -- this happens far more often than you think -- or went out of business years ago. Nevertheless, there will be a customer out there who relies on such an old piece of binary-only x86 software and you don't want to discourage them from buying your shiny new hardware...

Thus, for a transition period, other solutions to enable running old, binary incompatible software are required to keep your users happy. This is what this blog post series is about, we will focus on Rosetta 2 in upcoming posts. In the first part of this blog post series, I will provide you with a bit of background on emulation and binary translation.

Interpreting Emulators

One very simple approach is to create an emulator that interprets the x86 machine instructions one after the other in order of the program's control flow and dynamically maps them to instructions the new CPU can understand, often building a model of the emulated CPU in C or assembly.

An example of how an interpreting emulator implements the x86 instruction "ADD AL, imm8" (add an 8-bit immediate value to the accumulator low byte) might look like this (caution, this is untested pseudo-code written for legibility):

while (!end) {
  instruction = get_opcode(regs.pc);
  switch (instruction) { 
    case 0x04: // opcode byte for ADD AL, imm8
      // fetch operand byte
      operand = get_byte(regs.pc+1);

      // update result
      regs.ax = (regs.ax & 0xFFFFFF00) | (((regs.ax & 0xFF) + operand) & 0xFF);

      // set condition flags
      regs.flags.cf = ...; // carry
      regs.flags.of = ...; // overflow

      // update program counter
      regs.pc = regs.pc + 2;
      break;

    case 0x05: // opcode byte for ADD EAX, imm32
      ...
  }
}

It's probably easy to see that the interpreted emulation of this simple opcode in C requires quite a bit of overhead. In the emulator code, things that run in parallel on real hardware, such as updating the processor flags and the PC while the result of the addition is calculated, have to be executed serially and thus take much more time compared to execution in hardware.

The interpreter also has to assume the worst case that all of the calculated results are required, so it always has to calculate the values of the carry and overflow flags even though their values are never read or directly overwritten in subsequent code -- it has no knowledge of future instructions.

This mode of emulation is especially inefficient when loops are to be executed. Here, the same set of cases of our switch statement would have to be executed over and over again.

Dynamic Binary Translation -- Just-in-Time

To improve the execution performance for code that is executed multiple times in one program run, it is useful to cache the results of a translation process, i.e. the instructions of the emulator that are executed while a program is emulated.

This is the basic idea behind just-in-time translation (JIT). Instead of executing instructions whenever an instruction to be emulated is encountered, the JIT compiler instead generates a sequence of machine instructions to emulate the instruction and stores this sequence in a translation buffer. After the translation of some instructions, the JIT system jumps to the code.

The problem with this is how many instructions are to be translated before the JIT system can start to execute them. To get an idea of what is happening here, we need to take a look at a fundamental concept of program structure: basic blocks.

Wikipedia defines a basic block as follows:

In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.

Accordingly, the code execution inside of a basic block is predictable; since there is no control flow except at the end, we can always translate all of the instructions inside of a basic block at the same time (this can be made even more efficient through hyperblocks and methods such as trace scheduling [1]). So some pseudo-code for a JIT compiler might look like this (note this is not too dissimilar from our interpreter code above):

pc_type translate_basic_block(pc_type pc) {
  // mark basic block as translated
  basic_block_is_translated(pc);

  // start translation of instructions from the start of the basic block until the first branch instruction
  do {
    instruction = get_opcode(pc);
    switch (instruction) {
      case 0x04: // opcode byte for ADD AL, imm8
        // fetch operand byte
        operand = get_byte(pc+1);

        // generate code to calculate the result
        emit(INSN_ANDI, nativeregs[REG_AX], 0xFFFFFF00);
        emit(INSN_ANDI, nativeregs[REG_TMP1], nativeregs[REG_AX], 0xFF);
        emit(INSN_ADD,  nativeregs[REG_TMP1], nativeregs[REG_TMP1], operand);
        emit(INSN_OR,   nativeregs[REG_AX], nativeregs[REG_AX], nativeregs[REG_TMP1);

        // emit code to calculate the condition flags
        emit(....); // carry
        emit(....); // overflow

        // update the program counter
        pc = pc + 2;
        break;

      case 0x05: // opcode byte for ADD EAX, imm32
        ...
    }
  } while (type(instruction) != BRANCH);

  // emit an instruction to return to the JIT system's main loop
  emit(INSN_RET);

  // return next emulated PC
  return(next_pc);
}

In this code I make a number of assumptions which might be problematic in a real-world JIT compiler. One assumption is that the target machine has more registers than the machine to be emulated. In the piece of code above, a temporary register REG_TMP1 is used to hold an intermediate result. A real-world JIT compiler would try to apply some optimizations, e.g. using register allocation methods known from compiler construction, to reduce the amount of registers used in the translated code.

Another simpification here is that before returning from the JITed piece of native code, there needs to be some sort of indication at which location in the emulated code the execution should continue. This could be implemented so that the code emulating the branch instructions would write the following PC value to a special register.

The JIT compiler would then run a loop like this:

pc = entry_point();

while (!end) {
  // check if basic block is already translated
  if (! basic_block_is_translated(pc)) {
    translate(pc);
  }

  // call the generated native instructions
  pc = call(translation_buffer_address(pc));  
}

This code has no benefits if the program to be translated has no loops (or functions which are called multiple times); in fact, it would possibly imply some overhead since code is first translated and then executed only once. However, as soon as a basic block is executed multiple times, we only need to translate this basic block once and then only call the code repeatedly.

Possible Optimizations

This approach can be optimized in a number of ways. The first problem with JIT translation is that the translation process requires additional memory to store the native code. One can reduce the memory overhead here by evicting translations of basic blocks from the translation buffer after some time. Of course, this brings along all the well-known problems of cache replacement algorithms; so if an evicted basic block translation is needed again, the related code has to be retranslated.

The JIT approach also has some benefits. One of them is that the translation process is dynamic, so it follows the execution of the currently emulated program instance. This, if a path through a program is never taken -- for example, you only edit a document in a word processor but do not print it, the printing code is never executed in this specific program run -- the related code is never translated, saving time and memory space.

Practical Problems with JIT on Modern Computers and Operating Systems

Implementing a JIT compiler today is a bit more complex than the pattern described above, of course. I'll describe a selection of real-world problems below.

On modern operating systems, a process is not allowed to modify its executable code. This so-called "W^X" (write XOR execute -- the CPU can either write to a page or execute code from it, but not both) protection of code (text segment) memory pages serves as a protection against malware, which often tries to overwrite existing program code, e.g. by exploiting a buffer overflow, in order to change the instruction, and thus the behaviour, of the attacked program. Accordingly, some additional calls to the OS (e.g. mprotect on Unix) and possibly special capabilities are required so that a JIT compiler can also execute the code it generated.

Exception handling is another problem. Whenever a program does something out of the ordinary, e.g. it tries to divide by zero or attempts to read from an invalid memory address, the JIT translated program has to detect this condition and handle it accordingly. This exception handling can cause significant overhead.

For the multicore processors available today, another problem is the semantics of parallel execution of threads of a process on multiple codes. I won't go into details here (this might be an interesting topic for a future blog post), but differences in memory access ordering for concurrent reads and writes of different cores create problems that might change the semantics of a translated multithreaded program that is being executed on multiple cores. A correct implementation of a different memory ordering in software required significant overhead. Spoiler Apple has implemented additional functionality for different store order semantics in Apple Silicon cores to make emulation more efficient.

Much more information on approaches to JIT compilation and possible optimizations can be found in the great book on virtualization by Smith and Nair [2].

Static Binary Translation

One problem with JIT translation is that all the work invested to translate (parts of) the program is futile after the end of the program's execution. Some binary translation systems, such as digital's FX!32 [3], which JIT translated x86 code to code for digital's 64-bit Alpha AXP processor, cached translation results beyond the runtime of a program. The Alpha was essentially in the same position as Apple Silicon is today -- its performance was significantly higher than the performance of x86 processors of the time, so FX!32 enabled fast execution of JIT translated x86 binaries on that platform.

Can we improve this somehow? Let us compare the translation process of code to the translation of natural languages. On the one hand, you can hire a (human or AI) language translator to translate, for example Norwegian to German, as the words are spoken or read. This is interpretation which, of course, has significant overhead.

JIT translation for natural languages would require translating larger blocks, e.g. paragraphs, one at a time and cache the results. Since text does not tend to repeat that often in spoken or written texts, this unfortunately breaks the analogy a bit ;-).

For natural languages, of course, the problem of efficiency has been solved for many millenia. What you can do is to translate the foreign language text once and write down the translated result in a book or essay. After that, you don't need the original any more and can refer to the translated text. However, there are some problems with this, for example if the translated text is imprecise or ambiguous (I think readers of the Bible will have quite some experience with this) which require referencing the original text for clarity.

We can try to do the same to translate programs from one binary representation to another one. This is called static binary translation and comes with its own set of problems. For example, similar problems to translating books can also show here and require referencing the original binary. We will take a look at static binary translation in an upcoming blog post.

There's more to emulation and translation

Emulating the CPU is usually not sufficient to execute a program. One important question is if you only want to implement user mode programs or also run a complete operating system in emulation.

For user mode emulation, the overhead emulating the CPU itself is lower, since user mode programs only have access to a restricted subset of a CPU's functionality. However, all interaction of the program with the underlying OS has to be emulated. For similar systems (e.g. different Unix systems or even the same OS running on a different CPU platform, as it is the case for macOS) this can be relatively straightforward. Supporting the system calls of a different OS is much more work. This is, for example, implemented in the Windows Subsystem for Linux (WSL) which allows the execution of Linux user-mode programs on Windows 10 (but does not perform binary translation, since the source and target platform are both x86-64)

In case you also want to run a complete OS (or other bare-metal code) for a different architecture, you need to reproduce the behaviour of the underlying hardware of the complete system, including I/O, the memory system, possible coprocessors, etc. This comes with a lot of overhead, but is routinely done e.g. to emulate vintage computer systems or game consoles.

The next blog post in this series will have a closer look at static binary translation and the related problems before we dig deeper into Rosetta 2.

References

[1] Joseph A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers. 30 (7): 478–490. doi:10.1109/TC.1981.1675827.
[2] Jim Smith and Ravi Nair. Virtual Machines -- Versatile Platforms for Systems and Processes. 1st Edition 2005. Morgan Kaufmann. ISBN-13: 978-1558609105
[3] DIGITAL FX!32: Combining Emulation and Binary Translation from the Digital Technical Journal, Volume 9 Number 1, 1997. (pdf)

Tags: Apple Silicon, M1, Rosetta 2, binary translation, ARM, emulation, JIT