Fallback zeroconf with NetworkManager

Let’s say there is an IIoT gateway device with only one Ethernet interface, which you want to use for both configuration and production network access. The device runs a Fedora IoT Remix, so NetworkManager and firewalld are your main networking tools. You have already configured a connection profile with DHCP or, alternatively, a fixed IP. A question comes to mind: how can I tell NetworkManager to use a specific static IP solely for configuration purposes, one that works when only an engineer’s notebook is directly connected to the device? Two thoughts come up: with a direct connection there is no DHCP server and no gateway. That is a solid distinction from the production case, since the production connection will simply fail. What about a fallback that kicks in exactly in that case? So: “how to specify a fallback connection in NetworkManager?” You type this question into the search engine of your choice and are greeted by many Stack Overflow answers that all tell you NetworkManager supports exactly this. Just define multiple connection profiles on your hardware interface and give them different priorities. Great!

Having run this setup for a while in many different scenarios, I had to learn that there is more to it than that. As so often, the answer is rather simple but not necessarily obvious. Let’s look at this first approach using NetworkManager’s fallback profiles, the problems I ran into with it, and what I’m using nowadays.

A fallback connection profile

In NetworkManager it is possible to specify multiple connection profiles per hardware interface. Just specify their conditions, network configuration and priorities. The crux with this approach is that only one connection profile can be active for one hardware interface at a time.

The primary profile, with either DHCP or a fixed IP, sets autoconnect-priority=1 so it is tried first and autoconnect-retries=5 so that NetworkManager eventually stops trying this profile and allows a fallback to replace it.

[connection]
id=eth primary
uuid=<uuid>
type=ethernet
autoconnect=yes
autoconnect-priority=1
autoconnect-retries=5
interface-name=eth0
permissions=

[ethernet]
mac-address-blacklist=

[ipv4]
dns-search=
method=auto

[proxy]

The fallback profile then uses autoconnect-retries=0 to retry forever (it is the fallback, after all; note that -1 would only select NetworkManager’s global default of 4 attempts) and autoconnect-priority=0 so it runs after the primary connection. Its address comes from 169.254.0.0/16, the IPv4 link-local subnet that the Internet Engineering Task Force (IETF) specified in RFC 3927.

[connection]
id=eth primary fallback
uuid=<uuid>
type=ethernet
autoconnect=yes
autoconnect-priority=0
autoconnect-retries=0
interface-name=eth0
permissions=

[ethernet]
mac-address-blacklist=

[ipv4]
address1=169.254.1.1/16
dns-search=
method=manual

[proxy]
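
To see the chain NetworkManager will walk, the profiles can be listed by their autoconnect priority. This is a minimal sketch, assuming the two profile ids from the keyfiles above and profile names without a colon; sort_by_priority and list_profiles are helper names made up for this example.

```shell
#!/bin/sh
# Sort "NAME:PRIORITY" lines (nmcli terse output) by priority, highest first.
# Assumes profile names do not contain a colon.
sort_by_priority() {
    sort -t: -k2,2nr
}

# List all profiles with their autoconnect priority, highest first.
list_profiles() {
    nmcli -t -f NAME,AUTOCONNECT-PRIORITY connection show | sort_by_priority
}

# Only query NetworkManager where nmcli is actually installed.
if command -v nmcli >/dev/null 2>&1; then
    list_profiles
fi
```

On a device with both profiles, the primary profile should show up before the fallback.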

You can think of it as a chain of profiles that is tried in decreasing order of priority whenever the state of the hardware interface goes UP. This happens mainly when a link is established between your device and another, no matter if it’s a notebook, router or “just” a switch.

At first this proved to work quite well. If a technician connects their notebook, no gateway and no DHCP server will be available, which causes the primary connection to fail. The fallback kicks in and “rescues” device access. You just have to configure the notebook’s interface to match the network of the fallback profile.
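
On the notebook side this could look roughly like the following sketch. The interface name enp0s25 and the notebook address 169.254.1.2/16 are assumptions for illustration; any free address in 169.254.0.0/16 other than the device’s own should do. The script only prints the commands instead of running them, since adding an address needs root.

```shell
#!/bin/sh
# Hypothetical values for the technician's notebook; adjust to your setup.
NOTEBOOK_IF="${NOTEBOOK_IF:-enp0s25}"
NOTEBOOK_IP="${NOTEBOOK_IP:-169.254.1.2/16}"
DEVICE_IP="${DEVICE_IP:-169.254.1.1}"

# Succeeds if the address lies within 169.254.0.0/16 (RFC 3927).
is_link_local() {
    case "$1" in
        169.254.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the commands a technician would run; nothing is executed here.
notebook_commands() {
    if ! is_link_local "$NOTEBOOK_IP"; then
        echo "error: $NOTEBOOK_IP is not a link-local address" >&2
        return 1
    fi
    echo "sudo ip address add $NOTEBOOK_IP dev $NOTEBOOK_IF"
    echo "ping -c 3 $DEVICE_IP"
}

notebook_commands
```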

One does not simply rely on a fallback profile

However, this introduced some unexpected side effects. Falling back to a static profile when the prod configuration fails really means that whenever something goes (temporarily) wrong in your prod network, the fallback kicks in. And because a static configuration always “works” from NetworkManager’s perspective, there is sometimes no automated way to tell when to try the prod profile again. As it turns out, there are a lot of things that can (and will) go wrong in a prod network over time.

Just a few examples:

The main problem with this whole approach of a dedicated fallback profile lies in its semantics. I’m basically telling the device: if anything goes wrong, for whatever reason, fall back into a non-production state. A) There are a LOT of fault conditions that I can’t possibly account for in advance. B) Why the heck would I want something to go into a non-production state automatically when it’s deployed in production? That’s a rhetorical question.
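
For reference, whether a device has silently dropped into the fallback can be checked by hand. The following is a minimal sketch, assuming the interface eth0 and the profile ids used earlier; active_profile and is_fallback are hypothetical helper names. It only reports the state and prints the command that would retry the production profile.

```shell
#!/bin/sh
# Assumed names from the profiles in this post; override via environment.
IFACE="${IFACE:-eth0}"
PRIMARY_ID="${PRIMARY_ID:-eth primary}"
FALLBACK_ID="${FALLBACK_ID:-eth primary fallback}"

# Print the id of the profile currently active on the given interface.
active_profile() {
    nmcli -g GENERAL.CONNECTION device show "$1"
}

# Succeeds if the given profile id is the fallback profile.
is_fallback() {
    [ "$1" = "$FALLBACK_ID" ]
}

# Only query NetworkManager where nmcli is actually installed.
if command -v nmcli >/dev/null 2>&1; then
    current="$(active_profile "$IFACE")"
    if is_fallback "$current"; then
        echo "$IFACE is stuck on '$current'; to retry production run:"
        echo "  nmcli connection up \"$PRIMARY_ID\""
    fi
fi
```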

At first I thought this was a hard problem: there are no properties on the device, no button I could press, no alternative Ethernet port I could use, nothing to distinguish whether the device is supposed to run in production or configuration mode. After consulting some friends and this so-called internet people keep talking about, I found out how to actually solve this.

One connection to bind them all

With modern tooling it’s actually pretty simple. Almost too simple to write about, but now that I’ve lured you in this far, let’s get it right. The primary profile, the production connection, receives both configurations at once: a DHCP or static IP for production use and a special static IP for configuration. We’ll also write that IP down somewhere, in a manual or so. As it turns out, NetworkManager handles this well nowadays. It assigns the special static IP directly and waits for DHCP to assign the rest. The device is then reachable under both configurations. Let’s go through it with some commands and config file examples.

Beware: in the past (long ago) this was done through virtual network interfaces created with ip or even ifconfig, which looked like eth0:0. This was a hack around the limitations of long-forgotten Linux kernels. Nowadays the kernel and NetworkManager can assign multiple IP addresses to the same interface or connection profile. There is no need for “virtual network devices” any more, as is pointed out in many Stack Overflow answers.

On our eth primary connection we add an additional link-local IPv4 address.

# Give an additional IPv4 address to the interface
# This will be available although the network still tries to receive a DHCP configuration
nmcli connection modify "eth primary" +ipv4.addresses 169.254.1.1/16

We also want eth primary to always be active. This differs from the fallback approach, where we had to allow the production connection to fail so the fallback had a chance to be activated. Setting autoconnect-retries to 0 makes NetworkManager retry forever; -1 would only select the global default of 4 attempts.

nmcli connection modify "eth primary" connection.autoconnect yes
nmcli connection modify "eth primary" connection.autoconnect-priority 1
nmcli connection modify "eth primary" connection.autoconnect-retries 0

With DHCP available on the network we just ensure that the connection’s method is auto.

nmcli connection modify "eth primary" ipv4.method auto

If the network requires a static IP address for production, we ensure that the connection profile has both the prod address and our configuration address. The prod address comes first so that it is the primary one.

nmcli connection modify "eth primary" ipv4.addresses "192.168.1.42/24,169.254.1.1/16"
nmcli connection modify "eth primary" ipv4.gateway 192.168.1.1
nmcli connection modify "eth primary" ipv4.method manual
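
Note that nmcli connection modify only changes the stored profile. The running interface picks the changes up once the profile is re-activated, for example with nmcli connection up "eth primary" or nmcli device reapply eth0. Afterwards it is worth checking that the configuration address really is on the interface. A minimal sketch, assuming the interface eth0; has_address is a hypothetical helper that reads the one-line-per-address output of ip.

```shell
#!/bin/sh
# Assumed interface and configuration address from this post.
IFACE="${IFACE:-eth0}"
CONFIG_IP="${CONFIG_IP:-169.254.1.1/16}"

# Read `ip -o -4 address show` output on stdin and succeed if the given
# address/prefix is among the assigned addresses (4th column).
has_address() {
    awk '{ print $4 }' | grep -qxF "$1"
}

# Only inspect the interface where the ip tool is available.
if command -v ip >/dev/null 2>&1; then
    if ip -o -4 address show dev "$IFACE" | has_address "$CONFIG_IP"; then
        echo "$CONFIG_IP is assigned to $IFACE"
    else
        echo "$CONFIG_IP is not assigned to $IFACE; was the profile re-activated?"
    fi
fi
```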

As a configuration file, this looks as follows:

[connection]
id=eth primary
uuid=<uuid>
type=ethernet
autoconnect=yes
autoconnect-priority=1
autoconnect-retries=0
interface-name=eth0
permissions=

[ethernet]
mac-address-blacklist=

[ipv4]
address1=169.254.1.1/16
dns-search=
method=auto

[ipv6]
addr-gen-mode=stable-privacy
dns-search=
method=auto

[proxy]

Or, for a static IP, list both addresses under numbered keys, address1 and address2. Take note that a profile supports only one IPv4 gateway, which is set through the separate gateway key.

[ipv4]
address1=192.168.1.42/24
address2=169.254.1.1/16
gateway=192.168.1.1
dns-search=
method=manual
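
If you prefer dropping a keyfile onto the device over running nmcli commands, a sketch like the following could generate it. It writes the DHCP variant, trimmed to the keys relevant here, into a temporary directory; the file name eth-primary.nmconnection and the TARGET_DIR variable are assumptions for this example. On the device the file would go to /etc/NetworkManager/system-connections, owned by root.

```shell
#!/bin/sh
# Hypothetical target directory for this sketch. On the device the real
# location would be /etc/NetworkManager/system-connections (root only).
TARGET_DIR="${TARGET_DIR:-$(mktemp -d)}"
KEYFILE="$TARGET_DIR/eth-primary.nmconnection"

# Write a minimal keyfile for the combined DHCP + link-local profile.
write_keyfile() {
    # A fresh UUID; falls back to the kernel's generator if uuidgen is missing.
    uuid="$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)"
    cat > "$1" <<EOF
[connection]
id=eth primary
uuid=$uuid
type=ethernet
autoconnect=yes
autoconnect-priority=1
interface-name=eth0

[ipv4]
address1=169.254.1.1/16
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto
EOF
    # NetworkManager ignores keyfiles that are readable by other users.
    chmod 600 "$1"
}

write_keyfile "$KEYFILE"
echo "wrote $KEYFILE"
echo "next, as root on the device: nmcli connection load <file>"
```

After loading the file, nmcli connection up "eth primary" activates the profile.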

Any thoughts of your own?

Feel free to raise a discussion with me on Mastodon or drop me an email.

Licenses

The text of this post is licensed under the Attribution 4.0 International License (CC BY 4.0). You may Share or Adapt given the appropriate Credit.

Any source code in this post is licensed under the MIT license.