One data center goes down. How much of your organization can keep going?
Just how reliable is your cloud, really?
A fire in a data center. Six days of chaos. First, the fire itself. Then, restoring power. And only then does the real work begin for customers: checking systems, restoring connections, restarting services, assessing damage, verifying data, and getting users back online.
The fire at NorthC in Almere highlights something many organizations would rather not dwell on for too long. Cloud, hosting, and colocation often feel abstract. All neatly arranged. Professionally managed. Contract in place. SLA attached. Done.
Until the power goes out.
Then it becomes clear just how physical the digital world still is. Behind every cloud environment are buildings, cables, distribution cabinets, emergency power, cooling, routers, racks, and people who have to make the right decisions under pressure. That’s not a criticism of NorthC. Disasters happen. Fire, water, power outages, cable breaks, human error, and digital attacks are part of the reality of IT.
So the question isn’t whether a data center can ever be affected. The question is what happens to your organization when that happens.
During the fire in Almere, multiple organizations came to a complete standstill. Not just a slight slowdown. Not just a few applications becoming unavailable. Simply: no IT, so no work. And when your digital systems go down, more than just your server farm goes down with them. Telephony. Access systems. Workstations. Customer contact. Internal processes. Sometimes even services provided to citizens, students, patients, or travelers.
That is the painful part.
By 2026, a fire in a single data center should not automatically mean that your entire operation collapses. Certainly not for your core systems. Redundancy, failover, separate connections, backups, and recovery procedures are no longer a luxury. They are part of business continuity.
And things could have turned out much worse in Almere. If customer equipment had been severely damaged, we wouldn’t be talking about days. We’d quickly be talking about weeks.
This fire is therefore not an incident to simply click away. It’s a test question.
How reliable is your cloud, really, if the location where it runs goes down tomorrow?
The most dangerous SPOF is rarely on your architecture diagram
Redundancy sounds reassuring. Two systems. Two connections. Two suppliers. Two locations. Done, you might think. Until practice shows that duplicating systems is different from designing them to be truly independent.
I once had a client who thought they were fully redundant with two storage arrays. The data was neatly distributed across both arrays. On paper, that looked fine. Until a technician from the vendor made a mistake on one of those arrays. Then it turned out that all the boot volumes were on that very same system. The data was distributed, but the ability to boot up was not. The result: no quick failover, but a time-consuming restore from tape.
Another customer had set up redundant connections between two locations. Different building entrances. Two telecom providers. Everything seemed neatly separated. Until an excavator hit the cables at the one spot where both providers turned out to be running through the same conduit.
That’s the problem with redundancy. The weak spot is often not in the component you’ve deliberately duplicated, but in the dependency that no one noticed. A shared power supply. A common route. A single management account. A single configuration error. A single procedure that exists only on paper.
Can you protect yourself against this? In 99.999% of cases, yes. But then you mustn’t view redundancy as just a checkbox. You need to have a detailed overview of the entire chain.
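What does an overview of the entire chain look like in practice? Even a simple inventory goes a long way. The sketch below is a minimal illustration in Python, with invented component names: it checks whether the two halves of a supposedly redundant pair secretly share a dependency, exactly the kind of shared conduit or shared admin account described above.

```python
# Minimal sketch: find hidden shared dependencies in "redundant" pairs.
# All component names are invented for illustration.
dependencies = {
    "wan-link-a": {"conduit-west", "provider-1", "patch-room-1"},
    "wan-link-b": {"conduit-west", "provider-2", "patch-room-2"},   # same conduit
    "storage-array-1": {"power-feed-a", "mgmt-account-storage"},
    "storage-array-2": {"power-feed-b", "mgmt-account-storage"},    # same admin account
}

# Pairs you believe make each other redundant.
redundant_pairs = [
    ("wan-link-a", "wan-link-b"),
    ("storage-array-1", "storage-array-2"),
]

# Anything shared by both halves of a pair is a hidden single point of failure.
for left, right in redundant_pairs:
    shared = dependencies[left] & dependencies[right]
    if shared:
        print(f"{left} / {right} share: {', '.join(sorted(shared))}")
```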
The speed of recovery determines the actual damage
RTO and RPO sound like terms for a technical appendix. They aren’t. They determine how much damage your organization is willing to accept if something goes wrong.
Simply put, Recovery Time Objective means: how long can you be offline at most? Recovery Point Objective means: how much data can you lose at most? Two questions every executive team should be able to answer. Not in theory, but per system, per process, and per disaster.
With a failed disk, RTO and RPO often seem to be zero. Especially if you use RAID. The disk fails, the environment keeps running, someone replaces the disk, and that’s it. But what if the backplane fails? Or the controller? Or if there’s a firmware error in all disks from the same production batch? Then you suddenly lose much more than just one disk. First, you wait for hardware repair. Then you still have to perform a restore. Your RTO is then not zero, but perhaps six hours. Your RPO depends on the age of your last usable backup.
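To make that concrete, here is a back-of-the-envelope calculation with made-up timestamps and durations. The effective RTO and RPO in such a scenario are simple arithmetic, and usually larger than the numbers people have in their heads.

```python
from datetime import datetime, timedelta

# Hypothetical scenario: a backplane failure that forces a hardware repair
# followed by a restore from the last usable backup.
last_usable_backup = datetime(2025, 11, 3, 1, 0)    # nightly backup at 01:00
moment_of_failure  = datetime(2025, 11, 3, 14, 30)  # hardware dies mid-afternoon

hardware_repair     = timedelta(hours=4)  # waiting for parts and replacement
restore_from_backup = timedelta(hours=2)  # copying data back and verifying it

actual_rto = hardware_repair + restore_from_backup   # downtime: 6 hours
actual_rpo = moment_of_failure - last_usable_backup  # data loss: 13.5 hours
print(f"Actual RTO: {actual_rto}, actual RPO: {actual_rpo}")

# Compare against what the business said it could accept.
objective_rto, objective_rpo = timedelta(hours=1), timedelta(hours=4)
print("RTO met" if actual_rto <= objective_rto else "RTO exceeded")
print("RPO met" if actual_rpo <= objective_rpo else "RPO exceeded")
```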
And then there’s that second storage array in the second data center. Nice idea. Until it turns out that the same production batch of hardware is running there, with the same flaw.
Software and configuration also determine how quickly you can switch over. Failover isn’t a magic button. The speed depends on your setup, dependencies, data replication, network paths, authentication, and management. If your failover depends on a resource from the data center that just went down, your fallback is just an expensive waiting room.
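One way to catch that before the emergency is a simple readiness check. The sketch below is illustrative only, with hypothetical site and service names: it flags every failover dependency that exists solely in the data center you would be failing away from.

```python
# Sketch of a failover readiness check. Site and service names are invented.
PRIMARY_SITE = "dc-almere"

# For every service the failover procedure needs, list where it actually runs.
failover_dependencies = {
    "dns": ["dc-almere", "dc-backup"],
    "identity-provider": ["dc-almere"],    # only in the primary: a blocker
    "replication-target": ["dc-backup"],
    "runbook-wiki": ["dc-almere"],         # the runbook lives behind the failed site
    "backup-catalog": ["dc-backup"],
}

for service, sites in failover_dependencies.items():
    survivors = [site for site in sites if site != PRIMARY_SITE]
    if not survivors:
        print(f"BLOCKER: {service} is only available in {PRIMARY_SITE}")
```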
Resilience therefore starts with good design and regular testing.
Resilience requires more than just following vendor advice
Do best practices take resilience into account?
Best practices are valuable. They prevent you from having to reinvent the wheel over and over again. They bundle experience, knowledge, and mistakes that others have already made for you. But a best practice is not a law. And certainly not a substitute for thinking.
When it comes to resilience, you need to stay sharp.
The question isn’t: what does the vendor recommend? The question is: does this advice fit your environment, your risks, and your recovery goals?
A vendor naturally views things from the perspective of their own technology. That doesn’t make their advice wrong. But it also doesn’t automatically make it the best advice for your situation. The best solution a vendor can provide isn’t always the best solution for your organization.
An Oracle RAC cluster can offer a much higher degree of resilience than a Microsoft SQL Server HA cluster. A PostgreSQL installation with region-based sharding, in turn, can offer capabilities that you won’t find in the same way with Oracle or Microsoft. The point isn’t which technology is “better.” The point is that context determines everything.
That’s why you need to evaluate best practices. Against your architecture. Against your dependencies. Against your RTO and RPO. Against your budget. Against your management organization. Against the damage if things go wrong.
How robust is your IT when the going gets tough?
Your IT won’t keep running on its own when something goes wrong. Keeping it running takes deliberate technical architecture: not just a pretty diagram in a document, but a practical design that accounts for fire, power outages, cable breaks, firmware errors, human error, and vendor dependencies.
Resilience starts with insight. Which systems are truly critical? What dependencies exist between them? Where are the single points of failure? How quickly must you be able to recover? And how much data loss can you tolerate without your operations suffering?
These aren’t questions to ask after an incident. By then, it’s too late.
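To give an impression of what that insight can look like, here is a sketch with invented system names and numbers. The structure matters more than the data: per system, how critical it is, what RTO and RPO it needs, what it depends on, and whether its recovery has ever actually been rehearsed.

```python
from dataclasses import dataclass, field

# Illustrative continuity inventory; all systems and figures are made up.
@dataclass
class SystemProfile:
    name: str
    critical: bool
    rto_hours: float                  # maximum tolerable downtime
    rpo_hours: float                  # maximum tolerable data loss
    depends_on: list = field(default_factory=list)
    recovery_tested: bool = False

systems = [
    SystemProfile("customer-portal", True, rto_hours=2, rpo_hours=0.25,
                  depends_on=["identity-provider", "core-db"]),
    SystemProfile("core-db", True, rto_hours=4, rpo_hours=1,
                  depends_on=["storage-array-1"], recovery_tested=True),
    SystemProfile("intranet", False, rto_hours=48, rpo_hours=24),
]

# Critical systems whose recovery has never been rehearsed are the first gap to close.
for s in systems:
    if s.critical and not s.recovery_tested:
        print(f"{s.name}: RTO {s.rto_hours}h / RPO {s.rpo_hours}h claimed, never tested")
```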
Want to know how robust your environment really is? Schedule a no-obligation consultation with me. Together, we’ll identify your vulnerabilities and determine which technical choices are needed to keep your operations running under pressure.