This is the ninth post in a series about rebuilding the Magic Pages infrastructure. Previous posts: Rebuilding the Magic Pages Infrastructure in Public, The Townhouse in the Middle, Requirements First, Taking the Plunge, Swim, Docker, Swim!, The Calm After the Storm, The Rehearsal Stage, Drawing the Lines.
So far, I've talked a lot about the underlying architecture of my new infrastructure. But there's one piece I haven't mentioned yet: how do I actually keep ~1,200 Ghost sites up to date?
Batches of Fifty
Ghost releases updates frequently. Security patches, performance improvements, new features. Every release means I need to roll out new container images to over a thousand sites.
As a Ghost hosting provider, I'm trusted to keep every site up to date. But updating all sites at once puts so much load on the cluster that it has taken down a server or two in the past. Whoopsie. Nothing major happened, because the sites simply rescheduled onto other servers, but...a Ghost update shouldn't cause this.
So, for the last two years or so, I've always done Ghost updates in batches. On Kubernetes, I wrote a script that chunks all sites into batches of 50, then downloads the new image, spins up a new pod, changes networking, and removes the old pod with the old Ghost image. Zero downtime.
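The real script juggled pods and networking directly, but conceptually it boiled down to something like this (a simplified, hypothetical sketch using Deployments and an assumed app=ghost label, not my actual pod shuffling):

# Hypothetical sketch, not the actual script: update Ghost deployments in batches of 50
kubectl get deployments -l app=ghost -o name | xargs -n 50 | while read -r batch; do
  for deploy in $batch; do
    # "ghost" as the container name is an assumption for this sketch
    kubectl set image "$deploy" ghost=magicpages/ghost:6
  done
  for deploy in $batch; do
    kubectl rollout status "$deploy" --timeout=10m
  done
done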
It worked. Reliably.
Now, with my migration to Docker Swarm, I wanted to do something similar - but wondered whether there was a better solution.
Enter Gantry
The Docker ecosystem has a few options for automatic service updates. Watchtower is the famous one, but it's designed for standalone containers, not Swarm services. There's Shepherd, which works with Swarm, but it felt a bit basic for what I needed.
Then I found Gantry.
Gantry describes itself as "inspired by but enhanced Shepherd," and yeah...that's pretty much it. It runs as a service on your Swarm managers, periodically checks if any service images have newer versions available, and updates them automatically. But the interesting part is everything it handles that my old batch script didn't.
How It Actually Works
Gantry runs as a Docker service on a manager node. In my setup, every five minutes it wakes up and does the following:
- Queries the Swarm for services matching my filters
- Inspects the configured registry to see if there's a newer image available for each service's tag
- Updates services that have new images, respecting the parallelism limits I set
- Rolls back automatically if a service fails its health checks after updating
- Cleans up old images across all nodes to reclaim disk space
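That first step is easy to sanity-check by hand. Listing the services that carry the label shows exactly what Gantry will pick up on its next pass:

docker service ls --filter label=gantry.enable=true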
What I particularly like is the registry inspection step. Gantry uses docker buildx imagetools inspect to fetch image manifests from the registry without pulling the entire image. If the digest matches what's already running, it skips the update entirely. No unnecessary downloads.
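You can reproduce that check by hand, too. The first command asks the registry for the manifest without pulling the image; the second shows what a running service is currently pinned to, digest included (service and image names are the same examples used elsewhere in this post):

# Ask the registry for the manifest - no image pull needed
docker buildx imagetools inspect magicpages/ghost:6

# Show the image (with resolved digest) the service is currently running
docker service inspect my-ghost-site \
  --format '{{ .Spec.TaskTemplate.ContainerSpec.Image }}'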
My Configuration
This is the Gantry configuration I use on Magic Pages's new Docker Swarm cluster:
GANTRY_UPDATE_NUM_WORKERS: 50
GANTRY_SLEEP_SECONDS: 300
GANTRY_SERVICES_FILTERS: "label=gantry.enable=true"
GANTRY_ROLLBACK_ON_FAILURE: true
GANTRY_UPDATE_TIMEOUT_SECONDS: 600
Let's break it down:
- 50 parallel updates: With around 1,200 sites, updating one at a time would take quite some time. But updating all of them at once would overwhelm the cluster. 50 is still the sweet spot for me. A full rollout of a new Ghost version completes in about an hour.
- Five-minute check interval: When I push a new image (yes, I build my own Ghost image) to my Docker registry, every site will start updating within five minutes. Fast enough to respond to any issues that pop up, slow enough to respect rate limits of the registry.
- Label-based filtering: By default, Gantry updates all services. I prefer an opt-in approach, so I set GANTRY_SERVICES_FILTERS to only update services with my custom gantry.enable=true label. This means infrastructure services or anything else without that label won't be touched.
- Automatic rollback: If a Ghost site fails to come up healthy after updating, Gantry rolls it back to the previous version. Hasn't happened yet, but you never know.
- Ten-minute timeout: Some Ghost sites have quite large databases, and when a Ghost update requires a migration, that can take a while. Ten minutes is more than enough to handle my biggest sites without Gantry giving up too early.
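Put together, the Gantry service itself ends up looking roughly like this. A sketch, not my exact deployment: the shizunge/gantry image name and the Docker socket mount follow Gantry's documentation as I understand it, and the constraint simply pins it to a manager node:

docker service create \
  --name gantry \
  --constraint node.role==manager \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  --env GANTRY_UPDATE_NUM_WORKERS=50 \
  --env GANTRY_SLEEP_SECONDS=300 \
  --env "GANTRY_SERVICES_FILTERS=label=gantry.enable=true" \
  --env GANTRY_ROLLBACK_ON_FAILURE=true \
  --env GANTRY_UPDATE_TIMEOUT_SECONDS=600 \
  shizunge/gantry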
The Label System
Here's what a Ghost site's service definition looks like:
docker service create \
--name my-ghost-site \
--label gantry.enable=true \
--update-order start-first \
--update-failure-action rollback \
magicpages/ghost:6
Every Ghost site gets the gantry.enable=true label when it's provisioned. Since I've configured Gantry to only update services with this label, anything without it is simply ignored.
I have one bigger customer that doesn't want automatic updates (except for security patches). For sites like that, I just don't add the label - or remove it if it was already there:
docker service update --label-rm gantry.enable random-ghost-site
Gantry will skip that service from then on. Simple.
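And if that customer ever wants back in, re-enabling updates is the same one-liner in reverse:

docker service update --label-add gantry.enable=true random-ghost-site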
What Gantry Does That My Script Didn't
After a couple of weeks of running Gantry in production, the differences from my old batch approach are pretty convincing, and they're the reason I'll keep using it:
My Kubernetes script would happily update a site, watch it crash, and move on to the next batch. Ouch. I'd find out about failures when my monitoring sent me a notification a minute or so later. Gantry watches health checks and rolls back automatically. And I mean...that's what health checks are for, right? (Yes, I should have added that to my script.)
Additionally, my old script would pull images and restart pods whether or not anything had actually changed. Gantry checks the registry digest first. If the image hasn't changed, it skips the service entirely. No unnecessary restarts, no wasted cluster resources.
My Ghost images are around 600MB each. My old approach left orphaned images everywhere. After a few releases, I'd need to manually clean these up. Gantry does that automatically after each update cycle.
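Before, cleanup meant running something like this on every node from time to time (a rough equivalent, not my exact command):

# Remove unused images older than three days on this node
docker image prune --all --force --filter "until=72h"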
For me, the best part though: My batch script was something I ran manually when a new Ghost version dropped. Gantry is always watching. I simply push a new image to the registry, and the whole cluster starts updating within five minutes. I don't have to remember to do anything. And yes, it did happen in the past that I pushed an image to the registry and forgot to run my manual script... 🙃
There's a theme running through this entire infrastructure rebuild: automation that respects failure.
Ceph replicates my data automatically, so a disk failure doesn't wake me up. Docker Swarm reschedules containers automatically, so a node going down doesn't cause outages. And now Gantry updates services automatically, so I don't have to remember to run a script every time Ghost releases a new version.
Each piece reduces the things I have to actively think about. And when something does go wrong, these systems are designed to fail gracefully: rollback the update, replicate to another disk, reschedule to another node.
Beautiful.