Fallback zeroconf with NetworkManager
Let’s say there is this IIoT gateway device with only one Ethernet interface that you want to use for both configuration and production network access. The device runs a Fedora IoT Remix, so NetworkManager and firewalld are your main networking tools. You have already configured a connection profile with DHCP or, alternatively, a fixed IP. A question that comes to mind: how can I tell NetworkManager to use a specific static IP solely for configuration purposes, one that works when only an engineer’s notebook is directly connected to the device? Two thoughts come up: with a direct connection there is no DHCP server and no gateway. That is a solid distinction from the production case, since the production connection will simply fail. What about a fallback that kicks in exactly in that case? So: “how to specify a fallback connection in NetworkManager?” You type this question into the search engine of your choice and are greeted by many Stack Overflow answers that all tell you NetworkManager supports exactly this. Just define multiple connection profiles on your hardware interface and give them different priorities. Great!
Having run this setup for a while now in many different scenarios, I had to learn that there is more to it than that. As so often, the answer is rather simple but not necessarily obvious. Let’s look at this first approach using NetworkManager’s fallback profiles, the problems I ran into with it, and what I’m using nowadays.
A fallback connection profile
In NetworkManager it is possible to specify multiple connection profiles per hardware interface. Just specify their conditions, network configuration and priorities. The crux with this approach is that only one connection profile can be active for one hardware interface at a time.
A primary profile with either DHCP or a fixed IP sets autoconnect-priority=1 so it is tried first and autoconnect-retries=5 so that NetworkManager eventually stops trying this profile and allows a fallback to replace it.
    [connection]
    id=eth primary
    uuid=<uuid>
    type=ethernet
    autoconnect=yes
    autoconnect-priority=1
    autoconnect-retries=5
    interface-name=eth0
    permissions=

    [ethernet]
    mac-address-blacklist=

    [ipv4]
    dns-search=
    method=auto

    [proxy]
The fallback profile then uses autoconnect-retries=-1 to never give up (after all, it is the fallback) and autoconnect-priority=0 to run after the primary connection. The Internet Engineering Task Force (IETF) specified an IPv4 subnet for exactly this purpose in RFC 3927: the link-local range 169.254.0.0/16.
    [connection]
    id=eth primary fallback
    uuid=<uuid>
    type=ethernet
    autoconnect=yes
    autoconnect-priority=0
    autoconnect-retries=-1
    interface-name=eth0
    permissions=

    [ethernet]
    mac-address-blacklist=

    [ipv4]
    addresses=169.254.1.1/16
    dns-search=
    method=manual

    [proxy]
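The same pair of profiles can also be created from the command line. The following is a minimal sketch, assuming the interface is called eth0 and reusing the profile names from the keyfiles above. The function name create_fallback_chain and the NMCLI variable (which allows a dry run with NMCLI=echo) are made up for this example.

```shell
# Sketch: create the primary/fallback profile pair with nmcli.
# Assumptions: profile names follow the keyfiles above; NMCLI is a
# hypothetical override so the commands can be printed instead of run.
create_fallback_chain() {
    nm="${NMCLI:-nmcli}"
    ifname="${1:-eth0}"

    # Primary: DHCP, tried first, gives up after 5 attempts.
    $nm connection add type ethernet ifname "$ifname" \
        con-name "eth primary" \
        connection.autoconnect yes \
        connection.autoconnect-priority 1 \
        connection.autoconnect-retries 5 \
        ipv4.method auto

    # Fallback: static link-local address, tried second, never gives up.
    $nm connection add type ethernet ifname "$ifname" \
        con-name "eth primary fallback" \
        connection.autoconnect yes \
        connection.autoconnect-priority 0 \
        connection.autoconnect-retries -1 \
        ipv4.addresses 169.254.1.1/16 \
        ipv4.method manual
}
```

Running `NMCLI=echo create_fallback_chain eth0` only prints the two commands. Without the override it needs root privileges and a running NetworkManager.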
You can think of it as a chain of profiles that is tried in decreasing order of priority whenever the state of the hardware interface goes UP. This happens mainly when a link is established between your device and another, no matter if it’s a notebook, router or “just” a switch.
At first this proved to work quite well. If a technician connects their notebook, no gateway and no DHCP server will be available, which causes the primary connection to fail. The fallback kicks in and “rescues” device access. You just have to configure the notebook’s interface to match the network of the fallback profile.
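On the notebook side, matching the fallback network could look like the sketch below. It assumes the notebook also runs NetworkManager. The profile name device-config, the address 169.254.1.2 and the NMCLI dry-run variable are hypothetical; any free address in 169.254.0.0/16 other than the device’s own would do.

```shell
# Sketch: profile on the technician's notebook for talking to 169.254.1.1.
# Assumptions: profile name, notebook address and the NMCLI override are
# made up for illustration.
setup_notebook_link() {
    nm="${NMCLI:-nmcli}"
    ifname="$1"

    # Static address in the same /16 as the device, no gateway needed.
    # autoconnect is off so the profile never takes over regular use.
    $nm connection add type ethernet ifname "$ifname" \
        con-name "device-config" \
        connection.autoconnect no \
        ipv4.addresses 169.254.1.2/16 \
        ipv4.method manual
}
```

After `setup_notebook_link <interface>`, the profile is activated with `nmcli connection up device-config`, and the device should answer on 169.254.1.1 once its fallback profile is active.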
One does not simply rely on a fallback profile
However, this introduced some unexpected side effects. Falling back to a static profile when the prod configuration fails really means that whenever something goes (temporarily) wrong in your prod network, the fallback kicks in. And because a static configuration always “works” from NetworkManager’s perspective, there is often no automated way to tell when to try the prod profile again. As it turns out, there are a lot of things that can (and will) go wrong in a prod network over time.
Just a few examples:
- Power goes down on site. Devices take different amounts of time to boot up again. If your device is fast, it might try DHCP before the router is up and running again. The prod profile fails and the fallback is activated.
- A technician boots the device without plugging in the network cable and only does so later, when the device has long since gone into fallback mode.
- Your device is connected to a switch, and the switch is connected to a router that provides internet or VPN access. A technician disconnects the router from the switch. At some point the DHCP lease might run out and DHCP will temporarily fail. The fallback kicks in before the technician reconnects the router.
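Whether a device has ended up in this state can at least be made visible. The sketch below assumes the profile names from the keyfiles above; the function names are made up. It only detects the situation and does nothing to resolve it.

```shell
# Sketch: report whether an interface currently runs the fallback profile.
# Assumptions: profile names match the keyfiles above; function names
# are hypothetical.
active_profile() {
    # -g (--get-values) prints only the value of the requested field.
    nmcli -g GENERAL.CONNECTION device show "$1"
}

is_in_fallback() {
    [ "$(active_profile "${1:-eth0}")" = "eth primary fallback" ]
}
```

For example: `is_in_fallback eth0 && echo "eth0 is in configuration mode"`.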
The main problem with this whole approach of a dedicated fallback profile is in its semantics. I’m basically telling the device: if anything goes wrong, for whatever reason, fall back into a non-production state. A) There are a LOT of fault conditions that I can’t possibly account for in advance. B) Why the heck would I want something to go into a non-production state automatically when it’s deployed in production? That’s a rhetorical question.
At first I thought this was a hard problem because there are no properties on the device, no button I could press, no alternative Ethernet port I could use, or anything else to distinguish whether the device is now supposed to run in production or configuration mode. After consulting some friends and this so-called internet people are talking about, I found out how to actually solve this.
One connection to bind them all
With modern tooling it’s actually pretty simple. Almost too simple to write about, but now that I’ve lured you in this far, let’s get this right. The primary profile, the production connection, will receive both configurations at once: a DHCP or static IP aimed at production use, and a special static IP for configuration. We’ll also write that IP down somewhere in a manual or so. Anyway, as it turns out, NetworkManager handles this nicely nowadays. It will assign the special static IP directly and wait for DHCP to assign the rest. The device will then be available under both configurations. Let’s go through it with some commands and config file examples.
Beware: in the past (long ago) this was done through virtual network interfaces with ifconfig and looked like eth0:0. This was a hack around the limitations of long-forgotten Linux kernels. Nowadays the kernel and NetworkManager are able to handle this by assigning multiple IP addresses to the same interface or connection profile. No need for “virtual network devices” any more, as is pointed out in many Stack Overflow answers.
To the eth primary connection we add an additional link-local IPv4 address.
    # Give an additional IPv4 address to the interface
    # This will be available although the network still tries to receive a DHCP configuration
    nmcli connection modify "eth primary" +ipv4.addresses 169.254.1.1/16
We also want eth primary to always be active. This is distinct from the fallback approach, where we have to allow the production connection to fail so the fallback has a chance to be activated.
    nmcli connection modify "eth primary" connection.autoconnect yes
    nmcli connection modify "eth primary" connection.autoconnect-priority 1
    nmcli connection modify "eth primary" connection.autoconnect-retries -1
With DHCP available on the network we just ensure that the connection’s method is auto.

    nmcli connection modify "eth primary" ipv4.method auto
If the network requires a static IP address for production we ensure that the connection profile has both the prod address as well as our configuration address.
    nmcli connection modify "eth primary" ipv4.addresses "169.254.1.1/16,192.168.1.42/24"
    nmcli connection modify "eth primary" ipv4.gateway 192.168.1.1
    nmcli connection modify "eth primary" ipv4.method manual
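Keep in mind that `nmcli connection modify` only changes the stored profile; the new settings reach the device once the connection is activated again, for example with `nmcli connection up` followed by the profile name. Whether the configuration address actually made it onto the interface can then be checked with a sketch like the following, assuming the interface is eth0. The function names are made up for this example.

```shell
# Sketch: check which IPv4 addresses an interface currently carries.
# Assumptions: function names are hypothetical; relies on iproute2's `ip`.
list_ipv4_addresses() {
    # With -o, each address is printed on one line and
    # field 4 holds "address/prefix".
    ip -o -4 addr show dev "$1" | awk '{print $4}'
}

has_address() {
    # Exact, literal match against one of the listed addresses.
    list_ipv4_addresses "$1" | grep -qxF "$2"
}
```

For example: `has_address eth0 169.254.1.1/16 && echo "config address is up"`.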
As a configuration file this looks like this:

    [connection]
    id=eth primary
    uuid=<uuid>
    type=ethernet
    autoconnect=yes
    autoconnect-priority=1
    autoconnect-retries=-1
    interface-name=eth0
    permissions=

    [ethernet]
    mac-address-blacklist=

    [ipv4]
    address1=169.254.1.1/16
    dns-search=
    method=auto

    [ipv6]
    addr-gen-mode=stable-privacy
    dns-search=
    method=auto

    [proxy]
Or for a static IP, give each address its own numbered key. Take note that only one gateway is supported; in the keyfile it is written after a comma on the first address.

    [ipv4]
    address1=192.168.1.42/24,192.168.1.1
    address2=169.254.1.1/16
    dns-search=
    method=manual
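If the keyfile is written by hand rather than through nmcli, it still has to be put in place and loaded. A minimal sketch, assuming the usual system-connections directory; the function name and the optional directory argument are made up for illustration, and it is meant to be run as root so the installed file ends up owned by root.

```shell
# Sketch: install a hand-written keyfile and make NetworkManager pick it up.
# Assumptions: function name and directory argument are hypothetical.
install_keyfile() {
    src="$1"
    dir="${2:-/etc/NetworkManager/system-connections}"

    # NetworkManager rejects keyfiles that other users can read,
    # so install the file with mode 600.
    install -m 600 "$src" "$dir/$(basename "$src")" || return 1

    # Re-read profiles from disk; this does not re-activate anything.
    nmcli connection reload
}
```

After `install_keyfile "eth primary.nmconnection"`, activating the profile with `nmcli connection up "eth primary"` applies it to the interface.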
- 2021-05-01: After a response from Sheogorath over at [shivering-isles.com](https://shivering-isles.com) in a TIL about the FRITZ!Box link-local fallback, I had to reconsider the address range used.
Any thoughts of your own?
Feel free to raise a discussion with me on Mastodon or drop me an email.