
Taking The Plunge

You know that moment when theory meets reality? When all your research and planning suddenly gets a very real deadline? That's exactly what happened on June 24th, 2025.

If you've been following this series, you know I've been exploring alternatives to the current Kubernetes setup at Magic Pages. Not because it's broken, but because the growing pains were becoming unsustainable. I wrote about discovering LXC containers as a potential solution – offering the isolation of VMs with the efficiency of containers – and about Incus as a potential orchestration option.

But then came June 24th. A partial outage that lasted several hours and affected about 15% of Magic Pages sites. You can read the full post-mortem here, but the short version is this: Longhorn, the distributed storage solution I had been complaining about in my previous post, turned a very simple issue into a much, much bigger one.

The Reality Check

When I migrated from NFS to Longhorn just three months ago, it felt like the right move. Longhorn promised Kubernetes-native storage with built-in replication and high availability. Yes, people on all kinds of forums said it was unstable – but that usually concerned earlier versions. In fact, people were praising recent releases.

Well, over these three months, Longhorn has been great – but it has also made some very simple things unnecessarily complicated.

June 24th exposed the cracks that had been forming.

It left me asking myself one question: is this setup still viable?

I said here that I was accepting Longhorn's shortfalls – but that was when all they meant was higher running costs. Magic Pages is, thankfully, financially healthy enough to sustain that for a couple of months.

However, recovering from this partial outage showed that it's about more than just money. Longhorn has become a liability.

But what could I do? Go back to the NFS server I still had? I considered that for a minute, but then got a better idea: isn't the whole promise of Ceph that it's an independent storage solution? What if... I could migrate things there, and then reuse it on whichever new infrastructure I was about to settle on?

Two Birds, One Stone

I spent yesterday evening going through documentation, scouring Reddit, and trying things out. Here's what I found:

Ceph can run in a few different modes – most notably, object-based and file-based.

The object-based mode (RADOS) is what most people think of when they hear "Ceph" – it's the foundation that powers lots of cloud providers. But for my use case, the file-based mode (CephFS) is much more interesting.

CephFS is essentially a POSIX-compliant distributed file system. You mount it like any other file system, and it just works. No special APIs, no complex integration - just regular file operations that happen to be backed by a distributed, replicated storage cluster.
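To make that concrete: once the cluster is mounted somewhere (I'll assume /mnt/cephfs below – that path is my own choice, not anything prescribed), any program talks to it through plain file APIs. A minimal sketch in Python:

```python
import os

# Assumed mount point for the CephFS cluster – adjust to your setup.
MOUNT = "/mnt/cephfs"

# Plain POSIX file operations – no Ceph-specific API involved.
site_dir = os.path.join(MOUNT, "sites", "demo-site")
os.makedirs(site_dir, exist_ok=True)

config_path = os.path.join(site_dir, "config.json")
with open(config_path, "w") as f:
    f.write('{"theme": "casper"}')

with open(config_path) as f:
    print(f.read())  # reads back what was just written

# Standard metadata calls work too; the distributed,
# replicated storage underneath is completely invisible here.
print(os.stat(config_path).st_size)
```

That's the whole point: anything that can read and write files can use CephFS, without knowing it exists.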

Do you see it, too? Are you getting as excited as I got yesterday? 😄

CephFS can solve both my immediate Longhorn problems AND my future infrastructure flexibility.

Two birds, one stone.

No matter what infrastructure solution I end up with in the future, Ceph as the distributed storage solution is pretty much set at this point. So I would need to migrate there at some point anyway.

The fact that CephFS, in particular, is a normal file system just means it doesn't matter whether I migrate there now or later.

So, last night, I got a first test started. Instead of keeping all storage within the Kubernetes cluster like Longhorn does, I set up three dedicated storage servers in the same Hetzner data center (Falkenstein) as the Kubernetes worker nodes.

Each storage server has dedicated NVMe SSDs for maximum performance, and can ping the other storage servers and the Hetzner Cloud Kubernetes workers in 0.5-0.7ms.
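Those numbers come from plain pings, but you can sanity-check TCP-level latency between nodes with a few lines – the peer hostnames here are placeholders, and 6789 is Ceph's default monitor port:

```python
import socket
import time

# Placeholder addresses for the other storage nodes – not the real hosts.
PEERS = ["storage-02.internal", "storage-03.internal"]

for host in PEERS:
    start = time.perf_counter()
    # Time a TCP handshake against the Ceph monitor port.
    with socket.create_connection((host, 6789), timeout=2):
        pass
    print(f"{host}: {(time.perf_counter() - start) * 1000:.2f} ms")
```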

And whenever I am ready to leave Kubernetes behind, I can simply mount the same CephFS cluster onto the new infrastructure and map the directories. In fact, I could even do that while CephFS is still in use by Kubernetes.
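A quick way to convince myself of that, once a second machine has the mount: write a marker file from one side and read it from the other. A rough sketch, assuming /mnt/cephfs is the shared mount point on both hosts:

```python
import os
import socket
import time

MOUNT = "/mnt/cephfs"  # assumed shared mount point
marker = os.path.join(MOUNT, ".migration-check")

# Each host that mounts the cluster appends a line to the same file.
with open(marker, "a") as f:
    f.write(f"{socket.gethostname()} {time.time()}\n")
    f.flush()
    os.fsync(f.fileno())  # push the write out to the cluster

# Any other host with the mount should see all entries.
with open(marker) as f:
    print(f.read())
```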

Today, I've selected about ten internal test sites – my own blog, the Magic Pages marketing site, some theme demos, and a few other internal sites. These will run on CephFS for the next week while I monitor how they hold up.
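Part of what I'm watching is raw write latency on the mount. A rough sketch of the kind of probe I mean – again with /mnt/cephfs as an assumed path:

```python
import os
import statistics
import time

MOUNT = "/mnt/cephfs"  # assumed mount point
probe = os.path.join(MOUNT, ".latency-probe")

samples = []
for _ in range(50):
    start = time.perf_counter()
    with open(probe, "w") as f:
        f.write("ping")
        f.flush()
        os.fsync(f.fileno())  # force a full round trip to the cluster
    samples.append((time.perf_counter() - start) * 1000)

os.remove(probe)
print(f"fsync latency (ms): median={statistics.median(samples):.2f}, "
      f"max={max(samples):.2f}")
```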

If the test goes well, I'll start migrating the sites that were most affected by the June 24th outage. If it goes poorly, well... at least I'll have learned something and can pivot quickly.

To a certain extent, the June 24th outage was a wake-up call, but it also provided clarity. Sometimes you need reality to force your hand. I don't regret settling on Longhorn in March. It taught me crucial lessons and was my first real touchpoint with distributed storage. But it's time to face reality.