I recently discovered that, apart from running services, scheduling timers, configuring network interfaces, resolving names and a lot more, you can also run containers with systemd using systemd-nspawn. This was completely new to me, so I decided to take a deeper look at the steps needed to get this up and running. First I looked into how to build images suitable for systemd-nspawn and then at the different ways to run and manage containers with the built-in tools. After reading this post you will hopefully have a rough understanding of how this works and be able to run simple workloads in containers yourself using only systemd.

Prerequisites

To use systemd-nspawn you can install it on Debian based distributions via:

apt install systemd-container

On Arch-based distributions it already ships as part of the systemd package. You also need to enable and start systemd-networkd.service and systemd-resolved.service on the host so networking and name resolution work inside the containers.
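On the host, both services can be enabled and started in one step. The following is a minimal sketch: the `run` helper is my own illustrative addition (not part of systemd) and only prints the command unless `DRY_RUN=0` is set, so it can be tried without root.

```shell
#!/bin/sh
# run: hypothetical helper that prints a command by default and
# only executes it when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Enable both services at boot and start them right away (needs root).
run systemctl enable --now systemd-networkd.service systemd-resolved.service
```

To actually apply the change, run the script as root with `DRY_RUN=0`.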

Building an image

There are several ways to build an image for use with systemd-nspawn, depending on the distribution you want to use. In this post I am going to use Debian, but you can of course use any distribution you like. You just have to make sure it contains a valid /etc/os-release file. It is also helpful, but not necessary, if the distribution uses systemd as its init system. systemd-nspawn will also run any other init system it finds inside the filesystem tree.

The default way to build images for Debian is to use debootstrap. Creating a minimal image based on the current stable release (Buster at the time of writing) can be done by executing:

debootstrap --include=systemd-container stable /var/lib/machines/Buster

This creates a filesystem tree inside the /var/lib/machines/Buster directory which you can use with systemd-nspawn. To make everything work completely, you have to perform some post installation steps.

# systemd-nspawn -D /var/lib/machines/Buster
Spawning container Buster on /var/lib/machines/Buster.
Press ^] three times within 1s to kill container.
root@Buster:~# systemctl enable systemd-networkd
root@Buster:~# systemctl enable systemd-resolved
root@Buster:~# ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
root@Buster:~# mkdir /etc/systemd/resolved.conf.d
root@Buster:~# echo "[Resolve]" > /etc/systemd/resolved.conf.d/dns.conf
root@Buster:~# echo "DNS=1.1.1.1 8.8.8.8" >> /etc/systemd/resolved.conf.d/dns.conf
root@Buster:~# echo "pts/0" >> /etc/securetty
root@Buster:~# exit
logout
Container Buster exited successfully.

Let’s go through this step by step. I first spawned a shell inside the newly created image. Then I enabled systemd-networkd and systemd-resolved to get networking and name resolution working properly inside the container. For this I also linked the stub-resolv.conf generated by systemd-resolved to /etc/resolv.conf and configured the DNS servers that systemd-resolved will use. Otherwise the running container cannot resolve anything. The DNS servers are configured in /etc/systemd/resolved.conf.d/dns.conf. As a last step, I added pts/0 to /etc/securetty to enable root logins.
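The file-based part of these steps can also be applied to the image tree from the host, without spawning a shell in it first. The following is a rough sketch under that assumption: `write_dns_config` is a hypothetical helper name, and the demo call writes into a scratch directory rather than a real image. Enabling the two services still has to happen separately.

```shell
#!/bin/sh
# write_dns_config ROOT: hypothetical helper that writes the resolved
# drop-in and the resolv.conf symlink into the filesystem tree at ROOT.
write_dns_config() {
    root="$1"
    mkdir -p "$root/etc/systemd/resolved.conf.d"
    # Global DNS servers for systemd-resolved inside the container.
    cat > "$root/etc/systemd/resolved.conf.d/dns.conf" <<'EOF'
[Resolve]
DNS=1.1.1.1 8.8.8.8
EOF
    # The link target only exists once systemd-resolved runs in the
    # container, so the symlink is dangling until the container is up.
    ln -sf /run/systemd/resolve/stub-resolv.conf "$root/etc/resolv.conf"
}

# Demo against a scratch directory so nothing on the host is touched.
demo_root=$(mktemp -d)
write_dns_config "$demo_root"
cat "$demo_root/etc/systemd/resolved.conf.d/dns.conf"
```

For a real image, call it as root with the image directory, for example `write_dns_config /var/lib/machines/Buster`.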

To automate this process, you can also use mkosi. It is a wrapper around debootstrap, pacstrap and zypper that creates minimal, legacy-free OS images. To install mkosi, clone the git repo somewhere, open a shell in it and run the setup script.

git clone https://github.com/systemd/mkosi.git
cd mkosi
sudo python3 setup.py install

For detailed information about how to use it, see the man page. Now create an empty directory somewhere and create the following files in it:

# tree
.
├── mkosi.default
└── mkosi.postinst
# cat mkosi.default

[Distribution]
Distribution=debian
Release=buster

[Output]
Format=directory
Bootable=no
Hostname=buster
Output=/var/lib/machines/buster

[Validation]
Password=root

[Packages]
Packages=
	iputils-ping
	systemd-container
	iproute2
# cat mkosi.postinst

#!/bin/sh
# make sure systemd-networkd and systemd-resolved are running
systemctl enable systemd-networkd
systemctl enable systemd-resolved
# make sure we symlink /run/systemd/resolve/stub-resolv.conf to /etc/resolv.conf
# otherwise curl will fail
ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
# Configure global DNS servers
mkdir /etc/systemd/resolved.conf.d
echo "[Resolve]" > /etc/systemd/resolved.conf.d/dns.conf
echo "DNS=1.1.1.1 8.8.8.8" >> /etc/systemd/resolved.conf.d/dns.conf
# set pts/0 in /etc/securetty to enable root login
echo "pts/0" >> /etc/securetty

mkosi reads the image settings from the mkosi.default file. According to the file above, it will create a directory at /var/lib/machines/buster containing a Debian Buster filesystem tree, not make it bootable inside a virtual machine, and set the host name and root password. It will also install some additional packages. There are a lot more options you could use, but I will keep it simple for this post. After the image has been created, mkosi runs the mkosi.postinst script inside the image, which performs all of the steps done by hand when using debootstrap. Make sure to set the executable flag on the file after creating it.
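Forgetting the executable flag is an easy mistake to make, so a small check before building can help. This is only a sketch: `prepare_postinst` is a hypothetical helper name, and it assumes the script is called mkosi.postinst and lives in the directory passed as the first argument.

```shell
#!/bin/sh
# prepare_postinst DIR: hypothetical helper that makes DIR/mkosi.postinst
# executable and warns if the script has no shebang line.
prepare_postinst() {
    script="$1/mkosi.postinst"
    if [ ! -f "$script" ]; then
        echo "missing $script" >&2
        return 1
    fi
    chmod +x "$script"
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "warning: $script does not start with a shebang line" >&2
    fi
    return 0
}

# Typical use from the directory holding mkosi.default (build needs root):
#   prepare_postinst . && sudo mkosi
```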

The nice thing about mkosi is that it lets you create OS images for a number of different distributions in an easy and automated way. Now that we have created an image, it is time to run it.

Running the image

systemd-nspawn can either be invoked via the command line or run as a system service in the background. In service mode, each container runs as its own service instance using the provided systemd-nspawn@ unit template. I will first look at how to invoke it via the command line to get a better understanding of how it works, and then use the provided unit template for a more automated approach.

There are actually three different ways to run an image with systemd-nspawn, and they all work slightly differently. The first is to boot the image using its init system, just like you would boot a VM. It is important to note that systemd-nspawn neither boots a kernel nor starts a VM. Boot mode gives you an OS container that runs multiple processes as well as its own init system. You can compare this mode of operation to LXC containers or BSD jails. To use it, pass the --boot or -b flag when invoking systemd-nspawn. This is the default mode of operation when using the systemd-nspawn@ unit template.

systemd-nspawn --boot -D /var/lib/machines/buster

The command above will boot the image and present you with a login prompt. If you followed the steps above to build the image, you can now log in as root with the password root and look around a bit. You will notice that the container shows the same interface names and IP addresses as your host because network separation was not enabled. Any network service started in this container, and any port it exposes, is directly available on the IP addresses of the host.

Instead of full-fledged OS containers, you can also start something closer to an application container, which you might know from Docker or rkt. You can either start an application directly as PID 1 by passing no extra flag at all, or pass --as-pid2 to run a stub init process as PID 1 that then starts the application. Note that not all applications are suited to run as PID 1, since PID 1 has to meet a few special requirements. For example, it needs to reap all processes reparented to it and implement sysvinit-compatible signal handling. Shells are generally able to satisfy these requirements, but for all other applications it is recommended to use the --as-pid2 switch.

To start a shell inside the created image running as PID 2, run the following command:

systemd-nspawn -a -D /var/lib/machines/buster /bin/bash

A big caveat of this mode of operation is that name resolution does not seem to work properly (at least I could not get it working). If this is an issue for the application you want to run, I would recommend using the boot mode. The man page has a nice comparison of the three modes:

Switch | Explanation
Neither --as-pid2 nor --boot specified | The passed parameters are interpreted as the command line, which is executed as PID 1 in the container.
--as-pid2 specified | The passed parameters are interpreted as the command line, which is executed as PID 2 in the container. A stub init process is run as PID 1.
--boot specified | An init program is automatically searched for and run as PID 1 in the container. The passed parameters are used as invocation parameters for this process.

UPDATE:

The issues with name resolution in some containers can be explained by the way systemd-nspawn handles the /etc/resolv.conf file. It is configured by the --resolv-conf command line flag:

If set to “auto” the file is left as it is if private networking is turned on (see --private-network). Otherwise, if systemd-resolved.service is connectible its static resolv.conf file is used, and if not the host’s /etc/resolv.conf file is used. In the latter cases the file is copied if the image is writable, and bind mounted otherwise. […] Defaults to “auto”.

To use the same DNS servers in the container as on the host, set it to either copy-host or bind-host.
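As a concrete illustration, the PID 2 shell from above could be started with the host's resolv.conf like this. The sketch only prints the command line; `nspawn_shell_cmd` is a hypothetical helper name, and it deliberately accepts only the two modes mentioned here.

```shell
#!/bin/sh
# nspawn_shell_cmd IMAGE [MODE]: hypothetical helper that prints a
# systemd-nspawn command line for a PID 2 shell with the given
# --resolv-conf mode (copy-host by default).
nspawn_shell_cmd() {
    image="$1"
    mode="${2:-copy-host}"
    case "$mode" in
        copy-host|bind-host) ;;
        *)
            echo "mode not covered by this sketch: $mode" >&2
            return 1
            ;;
    esac
    echo "systemd-nspawn --resolv-conf=$mode --as-pid2 -D $image /bin/bash"
}

# Print the command for the image built earlier in this post.
nspawn_shell_cmd /var/lib/machines/buster copy-host
```

Copy the printed command into a root shell to start the container.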

Networking

In general it can be a good idea to put the container in a private network so you don’t have to worry about which ports it exposes unless you explicitly forward them. To do this, systemd-nspawn offers a variety of options which differ in complexity. To simply put a container inside its own private /28 subnet, pass the --network-veth or -n option. This will create a virtual ethernet link between the container and the host. Inside the container, it will be available as host0, and on the host side it will be named after the container, prefixed with ve-. systemd-networkd comes with a default configuration that sets up the virtual interface both on the host and inside the container, provided it is enabled and running on both. It also takes care of setting up DHCP on the link as well as the necessary routing options. A container with private networking can be started like this:

systemd-nspawn -bD /var/lib/machines/Buster -n
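Once the container is up, a quick look at the host side of the link shows whether systemd-networkd assigned an address. A small sketch, assuming the `ip` tool from iproute2 is installed on the host; `show_host_link` is a hypothetical helper name, and the container name follows the image directory name (Buster here) unless you override it.

```shell
#!/bin/sh
# show_host_link NAME: hypothetical helper that prints the addresses on
# the host side of the veth link (ve-NAME) for a running container.
show_host_link() {
    iface="ve-$1"
    if ip addr show dev "$iface" 2>/dev/null; then
        return 0
    fi
    echo "interface $iface not found (is the container running with -n?)"
    return 1
}

# Check the container started above; ignore the failure if it is not up.
show_host_link Buster || true
```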

Note: If you are also using Docker on your system, you have to tweak some iptables rules so the container can communicate with the outside world. Docker changes the default behavior of iptables, so you have to allow incoming and outgoing traffic on the created virtual interface. Example iptables rules can be found in the section on managing containers below.

Managing containers

If you want to run containers via systemd-nspawn in a more automated and management friendly fashion which is similar to how you would run docker containers, you can make use of machinectl which also ships with systemd. It uses the systemd-nspawn@ unit template mentioned above to start containers with sensible default settings. Those are:

ExecStart=/usr/bin/systemd-nspawn --quiet --keep-unit --boot \
--link-journal=try-guest --network-veth -U \
--settings=override --machine=%i

I did not cover all of them so make sure to look them up in the man page :-).

To start a container, you can first have a look at all images that are available. machinectl searches images stored in /var/lib/machines/, /usr/local/lib/machines/, /usr/lib/machines/ and /var/lib/container/.

# machinectl list-images 
NAME                                TYPE      RO USAGE CREATED                      MODIFIED                    
buster                              directory no   n/a n/a                          n/a                         

1 images listed.

Then you can simply run machinectl start buster and it will start an instance of the unit template for that image.

# machinectl 
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES     
buster  container systemd-nspawn debian 10      192.168.68.28…

1 machines listed.

You can then use machinectl login or machinectl shell to log in to the running container and do things, or machinectl status to check the processes running inside your container. What is also pretty neat is that you can use journalctl -u systemd-nspawn@buster on your host to see all log output of the container.

As mentioned above, if you are also running Docker on your system, you have to create a few iptables rules so your container can talk to the outside world when you run it with private networking enabled. The easiest way to do this is to create an override file for the systemd unit template via systemctl edit systemd-nspawn@ and add the following content:

[Service]
ExecStartPre=-/usr/bin/iptables -A FORWARD -o ve-%i -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT ; \
-/usr/bin/iptables -A FORWARD -i ve-%i ! -o ve-%i -j ACCEPT ; \
-/usr/bin/iptables -A FORWARD -i ve-%i -o ve-%i -j ACCEPT
ExecStopPost=-/usr/bin/iptables -D FORWARD -o ve-%i -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT ; \
-/usr/bin/iptables -D FORWARD -i ve-%i ! -o ve-%i -j ACCEPT ; \
-/usr/bin/iptables -D FORWARD -i ve-%i -o ve-%i -j ACCEPT

It will invoke iptables before starting and after stopping the container to add and delete the necessary rules for the container.

Configuration per container

If you want to customize the options a container is started with via machinectl, you can create a .nspawn file named after the image, either next to the image or in /etc/systemd/nspawn. On startup it is parsed by systemd-nspawn and can override the default settings of the unit template. Have a look at the systemd.nspawn man page for the available options. To forward port 80 of the buster container to port 8080 on the host, you could create the following buster.nspawn file in /etc/systemd/nspawn. This particular file cannot be put next to the image, since some options are privileged and are only applied when set inside /etc/systemd/nspawn. Which options are privileged is also documented in the man page.

[Network]
Port=8080:80
VirtualEthernet=yes

After creating the config file and starting the container again, port 80 of the container will be forwarded to port 8080 on your host. It is important to note that systemd-nspawn will not forward the port on your loopback interface, so it won’t be available via 127.0.0.1:8080 or localhost:8080. This caused quite some confusion for me :-)
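If you need such a file for more than one container, generating it can be scripted. The following is only a sketch: `write_port_forward` is a hypothetical helper name, and the demo call writes into a scratch directory instead of /etc/systemd/nspawn so it can be tried without root.

```shell
#!/bin/sh
# write_port_forward NAME HOST_PORT CONTAINER_PORT [DIR]: hypothetical
# helper that writes DIR/NAME.nspawn with a single port forward.
# DIR defaults to /etc/systemd/nspawn, which requires root.
write_port_forward() {
    name="$1"
    host_port="$2"
    container_port="$3"
    dir="${4:-/etc/systemd/nspawn}"
    mkdir -p "$dir"
    cat > "$dir/$name.nspawn" <<EOF
[Network]
Port=$host_port:$container_port
VirtualEthernet=yes
EOF
}

# Demo: generate the file from this post in a scratch directory.
demo_dir=$(mktemp -d)
write_port_forward buster 8080 80 "$demo_dir"
cat "$demo_dir/buster.nspawn"
```

For real use, call it as root without the last argument and start the container again. When testing the forward, connect to a non-loopback address of the host, for the reason given above.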

That’s it for now. I hope I could give you a small and understandable introduction on how to run containers with the help of systemd-nspawn. I am currently trying to figure out how you could use systemd-nspawn with existing workload orchestrators like HashiCorp Nomad so stay tuned :-)

Jan