I'll shave that Yak as soon as I finish... ah, damn.
Buckle up. This post goes to quite some lengths to describe a journey of me wanting to write a blog post during my holidays, which I did not write. I wrote this instead, and, well, smashed a whole new chunk of raw infrastructure out of entropy. Let’s not get ahead of ourselves though; stay focused, yeah, like that.
It all began quite inconspicuously. Me chilling on the couch, eating chocolate and enjoying my favorite brew of hot coffee. Settling down like that, after months of intense development and admin tasks for my company, I began to revisit some old topics I had worked on in early 2022. As I was looking into disk encryption using TPM2 chips for my company’s gateways, I started wondering if I could use that for my notebook too, so I would not need to enter the passphrase on every boot. I read documentation and specs and eventually found out how to do it (safely). I wanted to write about it back then, but you know how it is and I just did not take the extra time. So now here I am with some free time at last. Obviously I’m not going to write about it now.
I did not anticipate a certain chain of issues that prevented me from pushing things to my blog. You see, my blog is a statically built website served by a pretty basic nginx, packaged together into an OCI (Open Container Initiative) image. This image is built through a CI/CD pipeline from a repository hosted at the gitlab instance of a friend of mine and pushed to quay.io, Red Hat’s container registry. While I could have built and packaged the blog locally and pushed it to quay.io manually, I did not do that. That may be the error which led me down the rabbit hole, where the Yak awaited.
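For the curious, the whole pipeline boils down to something like the sketch below. This is not my actual pipeline, just its shape: job name, image tags, the image path and the CI/CD variables (QUAY_USER, QUAY_PASSWORD) are made up for illustration; the real pipeline goes through my own build tooling, which comes up again further down.

```yaml
# Hypothetical, trimmed-down .gitlab-ci.yml for the blog: build the
# nginx-based OCI image of the static site and push it to quay.io.
stages:
  - build

variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""      # dind with TLS disabled, to keep the sketch short

build-and-push:
  stage: build
  image: docker:24.0.5
  services:
    - docker:24.0.5-dind      # container builds via dind - this is why the runners matter
  script:
    - docker build -t quay.io/<namespace>/blog:latest .
    - docker login -u "$QUAY_USER" -p "$QUAY_PASSWORD" quay.io
    - docker push quay.io/<namespace>/blog:latest
```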
I needed this to work. Building and pushing containers around manually was not an option. So what was the issue anyway? To begin with, there was an issue with the build runners registered at my friend’s gitlab instance. The CI/CD pipeline of my blog would just error out. As co-admin of that instance I tried to log in to assess and fix the situation. Unfortunately the auth backend (which I cannot access administratively) died that day too. My friend who runs the instance was unavailable at the time, traveling around the world. As I had been imagining a migration to my own gitlab instance for a while and had some spare time at hand, I thought to myself: now or never. It was done in a couple of hours, as I had already set up an instance for my company several weeks back. That’s just where the shaving began though.
As I finished setting up gitlab I realized I needed build runners to run the CI/CD pipelines. Registration and configuration of said runners took only an hour or so. I had already registered a couple at the other instance and just had to replicate them. Or so I thought at the time. As I was tinkering with those gitlab-runner configs I noticed a flaw which prevented dind (docker-in-docker) from working. I fixed that flaw and refactored the cloud-init config for the Hetzner docker+machine driver, which enables cross-architecture builds. That fact is awkwardly specific, right? Why mention it then? Because quite some time later I would come back to this to fix a bug I had just introduced in the refactor. Never mix refactors and fixes together. It will only produce headaches.
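To make that a bit more concrete, the relevant knobs live in the runner’s config.toml. The snippet below is a sketch under assumptions, not my actual config, and I won’t claim the missing privileged flag was the exact flaw — it is just the classic way to break dind. The Hetzner driver option names in particular vary between driver versions, so treat them as placeholders.

```toml
# Sketch of a gitlab-runner config.toml for autoscaled Hetzner build machines.
# All values are illustrative; check the Hetzner docker-machine driver's help
# output for the real option names.
concurrent = 4

[[runners]]
  name = "hetzner-autoscale"
  url = "https://gitlab.example.com/"
  token = "REDACTED"
  executor = "docker+machine"

  [runners.docker]
    image = "docker:24.0.5"
    privileged = true            # dind needs privileged containers

  [runners.machine]
    IdleCount = 0
    MachineDriver = "hetzner"    # community docker-machine driver for Hetzner Cloud
    MachineName = "ci-%s"
    MachineOptions = [
      "hetzner-api-token=REDACTED",
      "hetzner-server-type=cax11",          # an arm64 server type (assumption)
      "hetzner-image=ubuntu-22.04",
      # cloud-init user data handed to every new machine - this is the file I
      # refactored; the exact flag name depends on the driver version
      "hetzner-user-data-from-file=/etc/gitlab-runner/cloud-init.yml",
    ]
```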
With a gitlab instance and corresponding gitlab-runners up and running, it was just a matter of pushing the blog repository and waiting for the CI/CD pipeline to finish. Right? Well… not quite. Two issues. First, the blog builds container images. It uses custom build tooling called build-ah-engine for this. As the reference to this build tooling is local, the pipeline couldn’t find it on the new instance. So guess what? Of course I forked the tooling to my new gitlab instance, and of course its CI/CD pipeline had to fail me as well. To be fair, I could have used remote references, but I wanted to keep it that way and have the build tooling under my control anyway. I don’t want my build to fail because another gitlab instance is unreachable. Unfortunately again, the arm64 build of the tooling didn’t work for a then unknown reason. The build tooling is built for multiple architectures, so multiple architectures can be built with it. Crazy, right? Remember that refactoring of the cloud-init stuff earlier? That’s where the bug lived, and it required some hours of digging and more than one cup of finest coffee to find. Sadly I missed one more thing, which I’d find out about soon enough.
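In case you are wondering what such a local reference looks like: GitLab CI can include pipeline definitions from another project on the same instance, which is exactly the kind of thing that silently breaks when repositories move. The group path, file name and ref below are placeholders, not the real ones.

```yaml
# Hypothetical include in the blog's .gitlab-ci.yml: pull shared job
# definitions from the forked build tooling on the *same* gitlab instance.
include:
  - project: "mygroup/build-ah-engine"
    ref: "main"
    file: "/templates/build.yml"
```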
If you remember, there was a second issue with the blog’s deployment. It is not deployed directly to the machine but instead relies on ansible code in another repository. The blog repo pushes its container image and triggers an infrastructure deployment of static sites (like the blog itself) by using gitlab pipeline triggers. So obviously I had to move said ansible repository too. That went surprisingly well, without further issues.
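For completeness, a pipeline trigger is plain GitLab functionality: a token-protected API endpoint that starts a pipeline in another project. The job below is a sketch only; project id, token variable and instance URL are placeholders.

```yaml
# Hypothetical trigger job in the blog's .gitlab-ci.yml: once the new image is
# pushed, poke the infrastructure repository so its ansible code redeploys the
# static sites.
trigger-infra-deploy:
  stage: deploy
  image: curlimages/curl:latest
  script:
    - >
      curl --fail --request POST
      --form "token=$INFRA_TRIGGER_TOKEN"
      --form "ref=main"
      "https://gitlab.example.com/api/v4/projects/<project-id>/trigger/pipeline"
```

The same effect can also be achieved with GitLab’s trigger: keyword for multi-project pipelines; the token-based API call is simply what “pipeline triggers” refers to.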
A little recap: gitlab instance ready, gitlab-runner(s) available, build tooling available and built, infrastructure repository migrated and working, blog repository migrated. Just some refinement and fixes here and there would do the trick and voilà, here we are again with a working blog deployment. Except that once every few builds the pipelines would just crash right at the beginning, which really bugged me. Deployments would work if I manually retried the CI jobs. I was not going to let that slide. After more digging and even more coffee I found out that some cloud-init config caused delays on the Hetzner image chosen for the docker+machine driver, which made the first pipeline on every fresh docker-machine instance fail.
It appears you never shaved any yak!
Yeah, that’s true. So why did I pretend there was some wild story about yak shaving? Because the term has a specific meaning in software development. Now there’s a TIL for you. Be my guest!
So, to put an end to this post, let me summarize that I could have avoided much of the hassle by compromising on the approach and choosing simpler solutions. Then again, it was my free time, I had a lot of fun doing all this, and I gained experience from it. Plus I finally have my own shiny gitlab instance with all the batteries included, and I’m really happy about that. At (paid) work the trade-off was quite different: it took many more days, as other topics had higher priority, but the approach stuck much closer to the core features, with more care taken not to derail from the currently pressing issues.
Ah, and at some point in all of this I also added emoji support to this blog. 🦄
Any thoughts of your own?
Feel free to raise a discussion with me on Mastodon or drop me an email.