Container Tools, Tips, and Tricks - Issue #5: Digging into Cross-Platform Containers


Let's continue on the topic of Desktop Container Environments. This issue will focus specifically on running cross-platform containers:

  • QEMU VMs vs QEMU user-space emulation - what's the difference?
  • Where do the Apple Virtualization Framework and Rosetta meet the container ecosystem?
  • What are the most common ways to run cross-platform containers on Windows, macOS, and Linux?
  • Did OrbStack, a shiny new Docker Desktop for Mac alternative, bring AMD64 VMs to Apple Silicon?
  • What are the options if the user-space emulation breaks your container - can Lima (again) save the day?

A quick recap

There are different types of containers, but the most widespread type is Linux containers. In fact, they are so predominant that people usually omit the Linux part of the name when referring to them. Running such a [Linux] container on macOS or Windows requires a virtual machine - simply because only a real Linux kernel can provide the container runtime with the required building blocks like namespaces and cgroups. Even on Linux, using a separate VM might be a good idea to isolate containers further from the host, especially when the host system is your personal laptop. Provisioning such a service VM is the responsibility of the Desktop Container Environment - that's why Docker Desktop, Rancher Desktop, Podman Desktop, Lima, and OrbStack all implement a very similar architecture:

Digging deeper

If you stare at the above diagram long enough, you may notice that QEMU is mentioned there twice - as a VM creation means and as a mysterious CPU emulator. Differentiating between these two QEMU modes is very important if you want to form a holistic understanding of the domain.

Forgetting about containers and VMs for a second, if you try running an ARM64 binary on an AMD64 Linux machine, most likely it'll fail with an error like "cannot execute binary file: Exec format error." It happens because the system doesn't understand the instructions from the ARM64 binary. However, there is a clever way around it that doesn't involve the "expensive" emulation of a full-blown ARM64 machine - translating the ARM64 instructions into AMD64 instructions while (or shortly before) executing the binary.

QEMU is not a single tool but rather a diverse collection of programs, and in particular, it has a family of commands known as qemu-user that can perform translations of a foreign instruction set into a native one:


$ cat > main.go <<EOF
> package main
>
> func main() {
>   println("Hello world")
> }
> EOF

$ GOOS=linux GOARCH=arm64 go build -o main_arm64 main.go
$ ./main_arm64
bash: ./main_arm64: cannot execute binary file: Exec format error

$ apt-get install qemu-user

$ qemu-aarch64 ./main_arm64
Hello world

$ ./main_arm64
Hello world

The above snippet shows that after installing the qemu-user package, the main_arm64 binary becomes directly invocable too - thanks to the special kernel capability called binfmt_misc that allows registering custom user-space interpreters for different types of executables.
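To make the binfmt_misc idea a bit more tangible, here is a minimal sketch of the kind of matching it performs: compare the first bytes of a file against a registered magic sequence, ignoring the bits that the mask clears. The rule type, the armRule helper, and the /usr/bin/qemu-aarch64 interpreter path are assumptions made up for this illustration, and the magic and mask values are modeled on a typical qemu-user registration rather than copied from a real system.

```go
package main

import "fmt"

const (
	emX8664   = 62  // EM_X86_64 value of the ELF e_machine field
	emAArch64 = 183 // EM_AARCH64 value of the ELF e_machine field
)

// rule is a hypothetical, simplified stand-in for one binfmt_misc registration.
type rule struct {
	name        string
	interpreter string
	magic       []byte
	mask        []byte
}

// matches reports whether the header equals the magic bytes
// in every bit position that the mask keeps.
func (r rule) matches(header []byte) bool {
	if len(r.magic) != len(r.mask) || len(header) < len(r.magic) {
		return false
	}
	for i := range r.magic {
		if (header[i]^r.magic[i])&r.mask[i] != 0 {
			return false
		}
	}
	return true
}

// pickInterpreter returns the interpreter of the first matching rule.
// A false result means no registered interpreter claims the file.
func pickInterpreter(rules []rule, header []byte) (string, bool) {
	for _, r := range rules {
		if r.matches(header) {
			return r.interpreter, true
		}
	}
	return "", false
}

// elfHeader builds the first 20 bytes of a 64-bit little-endian
// ELF executable header for the given e_machine value.
func elfHeader(machine uint16) []byte {
	h := make([]byte, 20)
	copy(h, []byte{0x7f, 'E', 'L', 'F', 2, 1, 1}) // magic, 64-bit, little-endian, version 1
	h[16] = 2                                      // e_type = ET_EXEC
	h[18], h[19] = byte(machine), byte(machine>>8) // e_machine, little-endian
	return h
}

// armRule is an illustrative ARM64 registration (the interpreter path is an assumption).
func armRule() rule {
	mask := []byte{
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, // ident bytes; the OS ABI byte is ignored
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, // ident padding
		0xfe, 0xff, // e_type: accept both ET_EXEC (2) and ET_DYN (3)
		0xff, 0xff, // e_machine must match exactly
	}
	return rule{
		name:        "qemu-aarch64",
		interpreter: "/usr/bin/qemu-aarch64",
		magic:       elfHeader(emAArch64),
		mask:        mask,
	}
}

func main() {
	rules := []rule{armRule()}
	for _, m := range []uint16{emAArch64, emX8664} {
		if interp, ok := pickInterpreter(rules, elfHeader(m)); ok {
			fmt.Printf("e_machine=%d: run via %s\n", m, interp)
		} else {
			fmt.Printf("e_machine=%d: no registered interpreter matches\n", m)
		}
	}
}
```

In this sketch, only the ARM64 header is claimed by the registered rule. On a real AMD64 host, an AMD64 binary never needs such a rule because the kernel's regular ELF loader handles it.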

Thus, we can:

  • run ARM64 binaries on AMD64 (or vice versa)
  • using QEMU as a user-space interpreter
  • ...meaning no VMs and ok-ish performance
  • ...and often, the program would work just fine 🙈

Of course, nothing should stop us from trying this trick with containers. A vanilla Docker Engine installation likely won't let you run cross-platform containers, but there is a well-known tonistiigi/binfmt image that brings cross-platform support to Docker Engine (or containerd), and it does something very similar to apt-get install qemu-user from above:


$ docker run --platform linux/arm64 nginx
exec /docker-entrypoint.sh: exec format error

$ docker run --privileged --rm tonistiigi/binfmt --install arm64

$ docker run --platform linux/arm64 nginx
...
2023/07/22 17:16:58 [notice] 1#1: using the "epoll" event method
2023/07/22 17:16:58 [notice] 1#1: nginx/1.25.1
2023/07/22 17:16:58 [notice] 1#1: built by gcc 12.2.0 (Debian 12.2.0-14)
2023/07/22 17:16:58 [notice] 1#1: OS: Linux 5.10.175

Back to VMs

Summarizing, there are two different problems - a) how to run cross-platform containers and b) how to launch a VM - and QEMU (well, different parts of it) just happens to be able to address both, but we should be clearly differentiating between a and b.

Why? Because thinking by analogy is a potent technique.

Apple's Virtualization Framework ≈ Microsoft's Hyper-V ≈ QEMU for VMs.

Rosetta ≈ QEMU for user space emulation.

The devil is in the details, of course, but conceptually I find this approximation practical. And understanding the nature of tools helps to predict what should be possible and what's not. For instance, if Apple's Virtualization Framework is for running VMs, it should be possible to have a non-QEMU VM with qemu-user emulation. And at the time of writing this (Jul 2023), Docker Desktop for Mac indeed supports such a mode.

Here is my take on the most common ways Desktop Container Environments do cross-platform today:

New kid in town

Now that we're done with the theory, let's take a look at OrbStack - a shiny new container runtime that claims to be a drop-in (and faster) replacement for Docker Desktop for Mac.

The OrbStack feature that actually caught my eye wasn't its performance. It wasn't even the fact that containers started with OrbStack can be accessed by their IP addresses from the macOS host (which is pretty cool, by the way). It was the promised support of AMD64 VMs on Apple Silicon.

Hypothetically, it should indeed be possible for a Desktop Container Environment to run not one but two or more VMs - one per requested container architecture. For instance, AMD64 containers could go to an AMD64 VM, and ARM64 containers could go to an ARM64 VM. However, full-blown hardware emulation is usually slow, so Desktop Container Environments typically start just one VM of the same architecture as the host system and use the user-space emulation trick for everything else.
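The single-VM approach boils down to a simple decision: if the requested container platform matches the VM's architecture, run it natively; otherwise, fall back to user-space emulation. Here is a rough sketch of that decision. The function names (normalizeArch, archOf, strategy) are invented for this example and don't correspond to any real tool's API, and the alias list is deliberately short.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeArch maps common architecture aliases to Docker-style names.
func normalizeArch(arch string) string {
	switch a := strings.ToLower(arch); a {
	case "x86_64", "x86-64", "amd64":
		return "amd64"
	case "aarch64", "arm64":
		return "arm64"
	default:
		return a
	}
}

// archOf extracts the architecture from a platform string
// such as "linux/arm64" or "linux/arm64/v8".
func archOf(platform string) string {
	parts := strings.Split(platform, "/")
	if len(parts) >= 2 {
		return normalizeArch(parts[1])
	}
	return normalizeArch(parts[0])
}

// strategy says how a hypothetical single-VM environment would run
// a container of the given platform inside a VM of the given architecture.
func strategy(vmArch, platform string) string {
	if normalizeArch(vmArch) == archOf(platform) {
		return "native"
	}
	return "user-space emulation"
}

func main() {
	vmArch := "aarch64" // e.g., what uname -m reports inside the service VM
	for _, p := range []string{"linux/arm64", "linux/amd64"} {
		fmt.Printf("%s on %s VM: %s\n", p, vmArch, strategy(vmArch, p))
	}
}
```

A multi-VM design would replace the emulation branch with "route the container to a VM of the matching architecture", at the cost of full hardware emulation for the non-native VM.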

So, when I saw the following option in OrbStack UI, I was truly intrigued:

And I became even more intrigued when the requested VM booted in no time, and the performance from inside felt close to native. But there are no miracles 😊

Yes, the software inside thinks it's an AMD64 machine. Even uname says so. However, the actual CPU architecture is ARM64, and it's Rosetta user-space emulation all the way down - starting from systemd. I refused to believe it until the very end: only after I compiled two Go binaries, one for AMD64 and one for ARM64, and saw the latter run without Rosetta in its process tree did I finally accept the reality. A clever trick, but not what I was hoping for...
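If you want to repeat the experiment, a tiny probe program is enough. The sketch below is not the exact code I used, just a minimal assumed equivalent: it prints the architecture it was compiled for and its PID, then optionally stays alive so you can inspect its process tree from another terminal. The -wait flag and the report helper are names made up for this example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
	"time"
)

// report formats the architecture the binary was compiled for and its PID.
func report(arch string, pid int) string {
	return fmt.Sprintf("compiled for %s, running as PID %d", arch, pid)
}

func main() {
	// -wait keeps the process around long enough to look at its ancestors.
	wait := flag.Duration("wait", 0, "how long to stay alive after printing")
	flag.Parse()

	// runtime.GOARCH is fixed at build time, so it tells you what the binary
	// is, not what CPU is actually executing it.
	fmt.Println(report(runtime.GOARCH, os.Getpid()))
	time.Sleep(*wait)
}
```

Build it twice, once with GOOS=linux GOARCH=amd64 and once with GOOS=linux GOARCH=arm64, run both inside the VM with something like -wait 60s, and compare the process trees of the two PIDs using a tool such as ps or pstree.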

Cross-platform VMs

I've been on the lookout for a more "native" way to run AMD64 containers on Apple Silicon for quite a while. QEMU user-space emulation is great, but its success rate isn't 100% - not every image works fine under it. For instance, qemu-user doesn't implement inotify, and it has been a problem for github.com/slimtoolkit/slim (aka DockerSlim), which, in particular, relies on inotify to track filesystem events. Trying Rosetta as an alternative sounded promising, but slim build nginx from inside an OrbStack-powered VM didn't succeed either.

And that's when Lima saved the day again. Turns out, with Lima, you can start an AMD64 VM (via QEMU, of course - Lima can use the Virtualization Framework, but it supports only native VMs) on an Apple Silicon Mac by editing just one line in the template file. The trick also works on Linux - you can start an ARM64 VM on an AMD64 Linux host:

Of course, this setup will be much slower than the user-space emulation, but on my very basic M1 MacBook Air 2020, slim build nginx finished successfully in a Lima-powered AMD64 VM, which is a win, IMO. The bottom line, though - native execution is the only reliable and performant way to run containers, at least for now.

Well, that's pretty much it - hopefully, it was at least somewhat helpful :)

In other news...

My work on iximiuz Labs continues, and I'm happy to share the key new features that were added since the last update a month ago:

  • Port publishing - it's now possible to launch web apps like Prometheus UI or the Kubernetes Dashboard in a playground and easily access them in the browser using a sharable (but protected) URL.
  • Terminal sharing - you can ask a friend or colleague to join the playground for more fun.
  • Long-awaited in-browser IDE (VS Code) support - via Coder's magnificent code-server.

As always, I'll include a complete report, including some juicy technical details, in the monthly round-up next week.

Traditional reminder: You can support the platform's development and get access to premium content, unlimited playground time, more powerful VMs, and insights into my creative process via Patreon and Discord updates. Every contribution matters!
​
Cheers
Ivan

Ivan Velichko

Building labs.iximiuz.com - a place to help you learn Containers and Kubernetes the fun way 🚀
