
A Scheduled Component Swap Halted Every Train in Germany

Around 22:30 on June 23, 2026, every train in Germany stopped at once.

ICE long-distance, S-Bahn commuter, regional, freight — nothing was spared across a 33,400 km network and 5,400 stations. No bad weather, no strike, nothing to do with terrorism. DB InfraGO chief Philipp Nagl later told the press the cause was “the scheduled swap of a technical component.” ABC/AP

If you have ever written one of these incident postmortems, the shape is instantly familiar: a change that looked perfectly fine goes in during a maintenance window, and a second later production lights up red. Anyone who has done on-call needs no further explanation. The only difference this time is that what went down wasn’t a handful of microservices — it was every train currently running on German rails.

Strip the incident down and the core lesson reads clearest in distributed-systems terms. Germany’s national rail collapse came from three antipatterns colliding at once. First, redundancy existed only on paper: primary and backup ran the same code and the same config, so the system could fail over cleanly from a single burned card, while a single bug or a single config error took both sides down together. Second, the system had only two gears, full run and full stop, so when the core dependency dropped there was no degradation path to any middle state. Third, a scheduled maintenance window, the one kind of event a system should be able to fully control, punched straight through the entire redundancy layer, because primary and backup had never actually been isolated. The same failure mode played out in the CrowdStrike incident of July 2024, which brought down global aviation, banking, and healthcare. Same disease, different body: a railway instead of an IT estate.

How does a single maintenance operation bring a national railway to a halt? First you have to understand what that link was, and why losing it forces trains to stop.

GSM-R is a railway-specific 2G mobile network rolled out across Europe since 2000. It does two jobs. The first is voice between driver and dispatcher. The second, and the more critical, is an uninterrupted data channel between the train and the Radio Block Centre, over which the movement authority (how far ahead the train may proceed, and at what speed) is refreshed in real time. E3S Conferences The on-board equipment must receive the latest authority frame inside a defined time window, or it has no permission to keep moving. Break the link, miss the window, and fail-safe logic fires the brakes immediately. At the same moment the driver loses the voice channel to the dispatcher, so no emergency instruction can be confirmed either.
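The timing rule described above behaves like a watchdog timer. The following is a minimal sketch in Python, assuming a hypothetical `AuthorityWatchdog` class and an illustrative 7-second window; neither the name nor the value comes from the GSM-R or ETCS specifications.

```python
from dataclasses import dataclass


@dataclass
class AuthorityWatchdog:
    """Tracks when the last movement-authority frame arrived."""

    timeout_s: float            # illustrative refresh window, not a real spec value
    last_frame_at: float = 0.0  # timestamp of the most recent authority frame

    def on_frame(self, now: float) -> None:
        # A fresh authority frame resets the window.
        self.last_frame_at = now

    def must_brake(self, now: float) -> bool:
        # Fail-safe: silence longer than the window means no permission to move.
        return (now - self.last_frame_at) > self.timeout_s


watchdog = AuthorityWatchdog(timeout_s=7.0)
watchdog.on_frame(now=100.0)
print(watchdog.must_brake(now=105.0))  # False: still inside the window
print(watchdog.must_brake(now=108.0))  # True: window missed, brakes fire
```

Note that the watchdog has exactly two outputs, keep moving or brake. Any intermediate behavior has to be designed in separately.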

In distributed-systems terms, GSM-R plays the role of the heartbeat service in front of your production database. It is not a nice-to-have; it is the precondition for being online at all. The only difference is that when your service dies it returns a 500; when this link dies, thousands of trains stop where they stand.

Putting the entire national rail safety floor on one communication system is not, in itself, a wrong decision. Most critical infrastructure lives with single points of dependency. What engineering owes in return is a layer of resilience proportional to the risk that single dependency carries. That is precisely the step DB missed.

Paper Redundancy Won’t Survive a Common-Mode Failure

GSM-R redundancy was written into the standard from day one. Dual equipment per site, MSC pool, group call register backups — all listed in ETSI TS 103 147. ETSI On paper, everything is there.

But the real question is: what kind of failure does it defend against? Dual equipment survives one burned card; an MSC pool absorbs one dead server. All of these defend against hardware single points — none of them touch software. Common-mode failure is a different animal: whatever software the primary runs, the backup runs; whatever config the primary loads, the backup loads — so one software defect or one config slip takes both down at once.

Figure: Paper redundancy shares the same layer between primary and backup, so one common-mode failure drops both sides. Genuinely isolated redundancy relies on heterogeneous stacks and geographic separation, so when one side falls the other takes over.

The Netherlands took the identical hit. On 31 May 2022 the national GSM-R network went down. Infrastructure operator ProRail put the cause in the sharpest possible terms: once the backup system kicked in, it was overloaded by the same error and shut itself down. IRJ Primary and backup ran the same code, read the same config, and the same bug knocked them down in sequence.
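The sequence ProRail describes can be reproduced in a toy model. The sketch below is an illustration only: `Node`, `failover`, and the build labels are hypothetical names, and the model says nothing about how the real core network is structured.

```python
from dataclasses import dataclass


class NodeDown(Exception):
    """Raised when a node cannot serve traffic."""


@dataclass
class Node:
    name: str
    build: str                # hypothetical software/config build label
    hardware_ok: bool = True

    def serve(self, defective_builds: set[str]) -> str:
        if not self.hardware_ok:
            raise NodeDown(f"{self.name}: hardware fault")
        if self.build in defective_builds:
            raise NodeDown(f"{self.name}: defect in {self.build}")
        return f"served by {self.name}"


def failover(nodes: list[Node], defective_builds: set[str]) -> str:
    """Try each node in order; report a total outage if none can serve."""
    for node in nodes:
        try:
            return node.serve(defective_builds)
        except NodeDown:
            continue
    return "total outage"


paper = [Node("primary", "core-v1"), Node("backup", "core-v1")]
isolated = [Node("primary", "core-v1"), Node("backup", "alt-stack")]
burned_card = [Node("primary", "core-v1", hardware_ok=False), Node("backup", "core-v1")]

# Hardware fault on the primary only: identical nodes cope fine.
print(failover(burned_card, set()))     # served by backup
# Software defect in core-v1: identical nodes fall one after the other.
print(failover(paper, {"core-v1"}))     # total outage
# Same defect, but the backup runs a different stack.
print(failover(isolated, {"core-v1"}))  # served by backup
```

The failover logic is identical in all three cases. What changes the outcome is only whether the two nodes share the layer that carries the defect.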

There is a subtler risk: a single vendor. Both Germany’s and the Netherlands’ GSM-R core networks were supplied by Nokia. IRJ The same core code runs across several national railways, so in principle one software defect could cross borders and hit multiple networks at once. A single vendor pushes the risk boundary past anything any one country’s redundancy design can reach — it becomes an architectural fragility at European scale.

CrowdStrike’s July 2024 incident is the same fruit from the same tree, in the IT world. A config update file, channel file 291, was pushed via auto-update to roughly 8.5 million Windows devices worldwide. A logic flaw in the Content Validator let through a batch of malformed updates; global aviation, banking, and healthcare ground to a halt, and insurers put the damage to the US Fortune 500 alone at close to $5.4 billion. TechTarget An update mechanism designed to be safe, missing effective staged rollout and isolation, amplified a local fault into systemic paralysis.

Nagl’s phrase, “the scheduled swap of a technical component,” is the single most valuable thing to interrogate in the whole affair. ABC/AP Of every factor that can trigger a failure, maintenance is the only one a system can fully control: you pick the time, you stage it, you do it on the backup first, you verify before you cut over, and you can roll back whenever you want. If a scheduled maintenance operation can take down an entire national redundancy layer, there is only one explanation: primary and backup were never actually isolated; they shared the exact layer this swap touched, whether software, config, or underlying data. Redundancy did not fail that night. It had never existed.

From 100% Straight to 0%

DB’s response in the event was to hold every train across the network. Taken on its own, the safety logic is defensible: with comms gone, emergency stop orders cannot be delivered, so holding trains is the default action under fail-safe rules. The flip side is that the system jumped from full operation to complete standstill without a single intermediate step. A properly designed piece of critical infrastructure, when its core dependency fails, should be able to fall back to a degraded-but-running state and hold there for a while, rather than being left with only two gears.

The engineering solution British rail delivered in 2014 proves this kind of graceful degradation is achievable. The UK’s Office of Rail and Road (ORR) put a challenge to the whole industry: re-examine how operations should continue when GSM-R fails. RSSB led the work, pulling in ORR, the drivers’ unions ASLEF and RMT, operators, and Network Rail. Over about half a year they built a quantitative risk model that weighed two risk paths side by side. On one side was the increased risk of train collisions caused by GSM-R failure; on the other, the secondary risks to passengers from stoppages and delays: platform crowding, capacity collapse, and passengers being pushed onto less safe modes of travel. RSSB

The conclusion: a train known to have a GSM-R fault before departure must not enter service, but a train already running whose GSM-R fails en route may continue up to 75 miles (about 120 km) before the fault is dealt with, with no immediate stop required. This principle later hardened into the industry standard RIS-3780-TOM. RSSB put it bluntly: taking a train out of service is not always the safer option. RSSB A stoppage itself introduces a whole new set of secondary risks into the rail system. Plenty of trains can complete their scheduled leg inside a 75-mile window without crossing any safety line, and so avoid a needless total paralysis.
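The rule as summarized above is a small decision table. The sketch below encodes only that summary, not the text of RIS-3780-TOM; the `Action` names and the `decide` function are hypothetical, and what happens past the 75-mile limit is left as a generic "fault must be dealt with" state because the summary does not specify it.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "run normally"
    DO_NOT_ENTER_SERVICE = "do not enter service"
    CONTINUE_DEGRADED = "continue in degraded mode"
    RESOLVE_FAULT = "fault must be dealt with"


MAX_DEGRADED_MILES = 75.0  # limit quoted in the text above


def decide(radio_ok: bool, in_service: bool, miles_since_failure: float = 0.0) -> Action:
    """Map radio state and train state to an operating decision."""
    if radio_ok:
        return Action.NORMAL
    if not in_service:
        # Fault known before departure: the train does not start.
        return Action.DO_NOT_ENTER_SERVICE
    if miles_since_failure <= MAX_DEGRADED_MILES:
        # Fault en route: keep running inside the allowed window.
        return Action.CONTINUE_DEGRADED
    return Action.RESOLVE_FAULT


print(decide(radio_ok=False, in_service=False).value)                         # do not enter service
print(decide(radio_ok=False, in_service=True, miles_since_failure=40).value)  # continue in degraded mode
print(decide(radio_ok=False, in_service=True, miles_since_failure=80).value)  # fault must be dealt with
```

The point of the table is the third branch: a state between "normal" and "stopped" that exists because someone quantified the risk of each.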

Whether Germany has an equivalent tiered operating procedure does not surface in any public reporting. Reading backward from the blanket hold-everything response, either there isn’t one, or there is one but it isn’t granular enough. This is exactly the moment graceful degradation pays off, on rails as in software: when the core dependency drops, does your system just throw a 500, or can it switch to cached mode, read-only mode, rate-limited mode, or degraded features?

1990s Code, Running a 2026 National Railway

GSM-R sits on 1990s 2G technology. The wider mobile ecosystem left 2G behind long ago: hardware goes end-of-life, vendor support shrinks, security patches dry up — and all of that pressure lands on the railway in sync. The successor, FRMCS, runs on 5G; large-scale deployment is expected to grind on toward 2035, with compliant products likely reaching market around 2029. The Guardian Caught in this decade-long gap, DB’s survival tactic is to scavenge the global secondhand market, one unit at a time, for old spares.

Once technical debt piles up to the scale of national infrastructure, this is what it looks like. Punctuality is the most direct mirror: DB long-distance punctuality slid from about 85% in the early 1990s to 60.1% in 2025, then down again to 59.4% in February 2026, the lowest in Europe. DB 2025 interim report Over the same window, Switzerland sat at 99%, France at 87%. The comparison is even less flattering than it looks, because the measurement bar differs: Germany’s punctuality threshold allows a delay of up to 6 minutes; Switzerland’s is 3 minutes. DB itself acknowledged the underlying problem in its 2024 interim report: the infrastructure is degrading faster than expected. DB 2024 interim report

The same debt ledger lists a few more lines: Cologne Hauptbahnhof’s signal box held back by software issues, whole fleets of relay interlockings kept in service up to 70 years (against a 40-year design life), and ETCS laid across only 683 km of the network.

Every team that has ever said “ship now, refactor later” eventually grows up to be DB, scouring global marketplaces for secondhand spares. Only the scale differs, and how fast the debt compounds over time.

Run Your Own System Against These Checks

The lessons pulled out of this incident translate directly into a list of design judgments you can check line by line. Each one is anchored to the event itself or a sibling case.

Is your redundancy actually isolated? If primary and backup run the same code and the same config, a common-mode failure punches through on first contact. Redundancy that can actually hold looks different: different sites, different physical paths, and, if you can manage it, different technology stacks. Only then is it truly isolated. The 2025 Newark airport ATC redundancy-button episode makes the point clearly enough: when the real failure arrived, you pressed the button and got a blank screen, because the redundancy and the primary ran on the same copper underneath. NBC News A former NTSB investigator got to the root of it: the redundancy was there, but the scale and location of the failure took the redundancy out along with everything else.

Can a single planned change punch through your redundancy? Maintenance is the only failure-injection method you fully control. If your system cannot absorb one of your own scheduled deploys, don’t talk about absorbing a real failure. After CrowdStrike, the industry’s corrective effort converged on exactly this point: treat updates as code, roll out in concentric rings, roll back at will, let customers control their adoption cadence. TechTarget Whether your deploy pipeline has all of these capabilities is not a nice-to-have; it is a hard requirement, and each one you are missing is a gap a bad change can slip through.
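Ring-based rollout with rollback can be sketched in a few lines. This is a simplified illustration rather than any vendor’s actual pipeline: `ring_rollout`, the host names, and the simulated defect on `eu-2` are all hypothetical.

```python
from typing import Callable


def ring_rollout(
    rings: list[list[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy ring by ring; if a ring is unhealthy, roll back every touched host."""
    touched: list[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            touched.append(host)
        if not all(healthy(host) for host in ring):
            for host in reversed(touched):
                rollback(host)
            return False  # later rings were never touched
    return True


# Simulated fleet: every host starts on v1.
rings = [["canary-1"], ["eu-1", "eu-2"], ["us-1", "us-2", "us-3"]]
version = {host: "v1" for ring in rings for host in ring}
deployed_to: list[str] = []


def deploy(host: str) -> None:
    version[host] = "v2"
    deployed_to.append(host)


def rollback(host: str) -> None:
    version[host] = "v1"


def healthy(host: str) -> bool:
    return host != "eu-2"  # simulated defect that only shows up on eu-2


ok = ring_rollout(rings, deploy, healthy, rollback)
print(ok)                     # False
print(deployed_to)            # ['canary-1', 'eu-1', 'eu-2']
print(set(version.values()))  # {'v1'}
```

The blast radius is capped at the rings deployed so far. The largest ring never sees the bad change, and everything that did is back on the previous version.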

Does your system have a middle state? The British 75-mile procedure is graceful degradation in its railway form. The moment your database won’t connect, your message queue backs up, or a third-party API times out, does your system throw an exception straight up, or can it switch to cache, rate-limiting, or read-only? A degradation path has to be written into the code at design time; it cannot be improvised during incident response.
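One common shape for such a middle state is a read path that falls back to bounded-stale cached data. The sketch below assumes a hypothetical `DegradedReader` wrapper and a simulated backend; the 300-second staleness limit is an arbitrary illustrative value.

```python
from typing import Callable


class DegradedReader:
    """Serve live data when possible, bounded-stale cached data when not."""

    def __init__(self, fetch: Callable[[str], str], max_stale_s: float) -> None:
        self._fetch = fetch
        self._max_stale_s = max_stale_s
        self._cache: dict[str, tuple[str, float]] = {}  # key -> (value, stored_at)

    def get(self, key: str, now: float) -> tuple[str, str]:
        try:
            value = self._fetch(key)
        except ConnectionError:
            cached = self._cache.get(key)
            if cached is not None and now - cached[1] <= self._max_stale_s:
                return cached[0], "stale"  # middle state: degraded but running
            raise                          # no usable fallback: fail loudly
        self._cache[key] = (value, now)
        return value, "live"


backend = {"up": True}  # simulated dependency state


def fetch(key: str) -> str:
    if not backend["up"]:
        raise ConnectionError("backend unreachable")
    return f"value-of-{key}"


reader = DegradedReader(fetch, max_stale_s=300.0)
print(reader.get("timetable", now=0.0))    # ('value-of-timetable', 'live')
backend["up"] = False
print(reader.get("timetable", now=120.0))  # ('value-of-timetable', 'stale')
```

The staleness bound plays the same role as the 75-mile limit: degraded operation is allowed, but only for a window that was agreed in advance.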

Has the redundancy ever been tested? Nobody had ever exercised that Newark button against a real failure; when it was finally needed, it produced a blank screen. Netflix’s chaos engineering is precisely the opposite: inject failures into production on purpose, and verify the redundancy actually holds. How long is the gap between your own failure-injection drills? Was the most recent one run in production, or in a test environment?

Is there a single vendor? Germany and the Netherlands both run Nokia. If all your regions sit on the same cloud vendor, the same database version, the same base image, one upstream bug is enough to flatten an entire redundancy layer at once. Multi-vendor or multi-stack does make the operations ledger heavier — but on a critical path, that weight is the premium you pay for resilience up front.
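A first-pass audit for this can be mechanical: list each replica’s stack and flag every layer that is identical everywhere. The sketch below assumes a hypothetical inventory format and made-up vendor and version labels.

```python
def shared_layers(replicas: dict[str, dict[str, str]]) -> dict[str, str]:
    """Return every layer whose value is identical across all replicas."""
    if not replicas:
        return {}
    stacks = list(replicas.values())
    first = stacks[0]
    return {
        layer: value
        for layer, value in first.items()
        if all(stack.get(layer) == value for stack in stacks[1:])
    }


# Hypothetical inventory: two regions that look redundant on an architecture diagram.
replicas = {
    "region-a": {"cloud": "vendor-x", "database": "db-16.2", "base_image": "img-2024"},
    "region-b": {"cloud": "vendor-x", "database": "db-16.2", "base_image": "img-2025"},
}

print(shared_layers(replicas))  # {'cloud': 'vendor-x', 'database': 'db-16.2'}
```

Every key in the result is a candidate common-mode failure: one upstream defect in that layer reaches every replica at once. Whether the cost of diversifying it is worth paying is a separate judgment.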

The Institutional Layer Is a Different Article

What is buried under this incident does not all belong on the engineering-judgment ledger. DB has run losses three years running, with net financial debt rolling up to €32.6 billion. Railway Gazette The new CEO, Evelyn Palla, had been in the seat only 8 months and was about to put a restructuring plan before the supervisory board when the incident landed the night before the meeting. DW The federal transport minister has already drawn a direct line between rail disruption and a threat to democracy. ABC/AP Thirty years of underinvestment, a state-owned enterprise whose budget rhythm is tied to the political cycle, a 2G system run so far past its age that it survives on global parts-scavenging: put those together, and no technical fault that lands on them can stay just a technical fault.

But the engineering lessons distilled from this incident stand on their own feet; they are not bound to rails. There are three: a system that is left with only full-on or full-off once its core dependency is cut; redundancy on paper that cannot survive a common-mode failure; and a single scheduled change that exposes, all at once, every weak point in your redundancy you never tested. Nobody designing a system that cannot be allowed to go down gets around these questions. At your next code review or architecture review, put one more question on the table: can this redundancy survive a common-mode failure?