
The Calm After The Storm

Last night was the first night in a long time that I could actually sleep – and not check my phone every time I nervously woke up.

Quite honestly, the last one to two weeks have not been the best for my mental health. For context, the storage engine powering Magic Pages has been extremely volatile and caused lots of headaches. More on that here:

Taking The Plunge

The storage engine, Longhorn, had this nasty habit of hitting CPU limits several times a day. When that happened, a Longhorn server would temporarily become unavailable. Instead of handling this gracefully, it would cascade into bringing multiple Longhorn servers down, since they all relied on each other.

Recovery was a nightmare. Every recovery attempt was CPU-intensive, which would trigger more servers to go down. It was like trying to put out a fire with gasoline.

The only "solution" that worked was adding more and more servers. In the end, I had 46 servers running. Forty-six! In total, that was 1,472 GB of memory and 736 vCPU cores 😂

All of that, for running about 650 Ghost sites. The Ghost sites themselves need a negligible amount of CPU (all sites together could run on about 10 vCPU cores, that's how efficient Ghost is), and about 300 MB of memory per site on average.

Besides having to pay for this ridiculous memory and CPU overhead, I was spending more time fighting fires than actually helping my customers. And the worst part? These issues always seemed to happen at 2 AM, right when backups were running. Because apparently, backups caused Longhorn to spike CPU usage and bring down sites 🤷

So yes, I was losing sleep. And last week I made the decision: Longhorn had to go, and Ceph, the storage solution I had initially only planned to adopt as part of a full infrastructure overhaul, became my saviour.

I had heard about Ceph before, but always dismissed it as "too complex" or "overkill" for my needs. In fact, when I initially looked into Longhorn, everybody mentioned how complex Ceph was, and how Longhorn had its issues but seemed like the better fit for Kubernetes.

Well, I wanted to test this and set up a Ceph cluster using cephadm. Three Hetzner dedicated servers, each around 45 Euros per month, with 2x 512GB NVMe SSDs. One disk for the OS, one for storage.

What happened next surprised me.

I expected this to take days to get working properly. But the entire Ceph cluster was set up in 2 hours. Including security hardening, documentation, and automation for future node additions.

Two hours.
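
For the curious, the cephadm flow really is that short. Here's a rough sketch of what it boils down to; the hostnames, IPs, and filesystem name below are placeholders rather than my actual configuration, and the hardening steps are left out:

```bash
# Bootstrap the cluster on the first dedicated server
# (the IP is a placeholder for the node's address on the private network)
cephadm bootstrap --mon-ip 10.0.1.1

# After copying the cluster's SSH key to the other nodes, register them
ceph orch host add ceph-node-2 10.0.1.2
ceph orch host add ceph-node-3 10.0.1.3

# Turn every unused disk (the second NVMe in each server) into an OSD
ceph orch apply osd --all-available-devices

# Create the CephFS filesystem that Kubernetes will mount later
ceph fs volume create magicpages-fs

# Sanity check: should eventually report HEALTH_OK
ceph -s
```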

Adding it as a storage class to Kubernetes took another 2 hours, mostly because I had to figure out the networking between the Kubernetes cluster and the CephFS cluster. They're all in the same Hetzner data center, but I wanted communication to happen over a private vSwitch, not the public internet.
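
I won't reproduce my exact setup here, but if you wire CephFS into Kubernetes via the ceph-csi driver, the StorageClass ends up looking roughly like the sketch below. The cluster ID, filesystem name, and secret names are placeholders, and this assumes the ceph-csi CephFS provisioner is already installed in the cluster:

```bash
# Sketch of a CephFS StorageClass using the ceph-csi provisioner.
# clusterID (the Ceph fsid), fsName, and the secret names are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-fsid>
  fsName: magicpages-fs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Retain
allowVolumeExpansion: true
EOF
```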

Then I migrated my own blog from Longhorn to CephFS.

And wow. It was easy. Surprisingly easy. And somehow, it felt snappier. Which is weird, because Longhorn essentially ran on the same server as the Ghost site, while CephFS runs on different servers.

I took a couple of days to gradually move more and more sites onto Ceph. I wanted to be careful, test thoroughly, make sure everything was working properly.

Yesterday afternoon, I finished the migration. All sites were moved from Longhorn to CephFS. Running that helm uninstall command to nuke Longhorn felt good. All the therapy I needed (nah, probably not).
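
If you're ever in the same spot: double-check that nothing still mounts a Longhorn volume first. The teardown itself is then a single command; the release and namespace names below are the usual defaults and may differ depending on how Longhorn was installed:

```bash
# Check that no PersistentVolumes are still backed by Longhorn
kubectl get pv -o wide | grep longhorn

# Then remove the Helm release (release and namespace names may differ per install)
helm uninstall longhorn -n longhorn-system
```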

Since then, it's been the calmest 24 hours I've had in weeks. In fact, it was the first night I could sleep properly in a long time. No more 2 AM alerts. No more cascading failures. No more firefighting.

The infrastructure just...works.

The 3 dedicated Ceph servers cost me about 45 Euros each in the base configuration, plus an additional 20 Euros each for an extra 2TB NVMe SSD. So about 195 Euros per month in total for the Ceph infrastructure.

But I eliminated 33 servers from the Kubernetes cluster. Even by conservative estimates, that's a massive cost reduction of several hundred Euros.

This migration accomplished something I didn't expect: it bought me time and mental space.

Instead of constantly fighting infrastructure fires, I can get back to thinking strategically about the next steps in Magic Pages' evolution.

The calm has given me clarity. I can see the bigger picture again.

What's Next

This migration was just the first step in my infrastructure rebuilding journey. But it's a crucial one. With a stable, reliable storage foundation, I can now explore other improvements without the constant distraction of storage issues.

I have already talked about LXC and Incus in a previous post. Well, plot twist...

While the whole Longhorn debacle unfolded, I also talked to two people who have first-hand experience running larger LXC clusters in production.

Unfortunately, LXC does have issues at scale – and the fact that there is only a very small community around it doesn't exactly inspire confidence.

I am back at the drawing board, but I am also letting my mind wander a bit, exploring all kinds of avenues.

I feel like there's still going to be a lot of this exploration ahead. But at least I have the headspace for it again.

To wrap things up, I just want to say thank you to everyone who was affected by the Longhorn issues. You've all been patient, and I am confident that this new storage solution will bring back the reliability you're used to from Magic Pages.