Have you ever actually stopped and reviewed your entire High Availability & Disaster Recovery (HA/DR) and backup setup? Not just whether jobs are running or features are enabled, but how the whole thing holds up when things go wrong. More importantly, have you ever put it to the test?
It’s worth asking how well you’ve covered specific failure scenarios. Outages don’t happen in neat categories; they show up in all sorts of ways:
- The SQL Server service crashing
- Someone running the wrong UPDATE or DROP
- Ransomware or malicious changes
- Corruption in your data or log files
- The OS or VM failing underneath you
- Planned or unplanned host reboots or outages
- A datacentre or availability zone going offline
- A full regional outage
- A VM being deleted
- A database being dropped
- Losing access to an entire country or region
Individually, none of these are particularly unusual, but your strategy needs to account for all of them, not just the ones you expect. This is where most setups fall short: they’re designed for a general idea of failure, not the reality of how failures actually happen. When these scenarios play out, the impact is rarely just technical. Downtime affects revenue, disrupts operations, and erodes customer trust, especially when recovery takes longer than expected.
This isn’t a one-size-fits-all checklist. It’s a guide to what good looks like when HA/DR and backups are thought through properly. Treat it as a starting point and adapt it to your needs and budget. It’s also important to understand how much downtime your business is willing to accept within that budget.
If you don’t have HA, your downtime is your restore time
If there’s no Availability Group, no cluster, no geo-replication, then your recovery plan is backups.
This means your RTO (Recovery Time Objective) is however long it takes to restore everything and get it working again. That might be fine, but it needs to be understood, not assumed. If the business expects recovery in minutes and the reality is hours, you’ve already got a gap. Misaligned RTO and RPO expectations create real risk, leaving the business assuming a level of resilience that the platform may not actually deliver.
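As a rough illustration of why this needs to be understood rather than assumed, the sketch below estimates a backup-only RTO. Every figure in it (backup sizes, restore throughput, overhead, the 30-minute expectation) is a made-up assumption, and the `RestoreEstimate` class is hypothetical. Real numbers should come from a timed restore, not from arithmetic.

```python
from dataclasses import dataclass


@dataclass
class RestoreEstimate:
    """Hypothetical inputs for a back-of-envelope restore time estimate."""
    full_backup_gb: float
    log_backups_gb: float
    restore_throughput_gb_per_min: float
    # Time spent around the restore itself: detecting the outage, deciding to
    # restore, finding the files, validating, repointing applications.
    overhead_minutes: float

    def rto_minutes(self) -> float:
        data_gb = self.full_backup_gb + self.log_backups_gb
        return data_gb / self.restore_throughput_gb_per_min + self.overhead_minutes


# Assumed example values, not measurements.
estimate = RestoreEstimate(
    full_backup_gb=500,
    log_backups_gb=20,
    restore_throughput_gb_per_min=4,
    overhead_minutes=45,
)
business_expectation_minutes = 30

actual = estimate.rto_minutes()
gap = actual - business_expectation_minutes
print(f"Estimated RTO: {actual:.0f} min, gap vs expectation: {gap:.0f} min")
# Estimated RTO: 175 min, gap vs expectation: 145 min
```

With these assumed inputs the restore path takes close to three hours against a 30-minute expectation, which is exactly the kind of gap worth surfacing before an outage does it for you.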
Recommendation: If the business expects fast recovery, implement an HA layer rather than relying on backups alone. Alternatively, reset the business’s expectations.
Use the right HA option, not just the available one
On-prem and IaaS give you choices, but they’re not equal.
Availability Groups are where most setups end up, and for good reason. They’re fast, flexible, and generally the best fit for high availability. Failover Cluster Instances still have their place, but they bring shared storage complexity. Log shipping works, but it leans heavily on manual intervention, which is the last thing you want during an outage.
Each option works — but they behave very differently under pressure.
Recommendation: Use Always On Availability Groups where possible and make failover testing part of normal operations.
In Azure, one layer isn’t enough
Azure makes HA easier, but also easier to misunderstand. Built-in high availability covers local failures, but it doesn’t protect you from everything.
Zone redundancy protects you from losing a datacentre. Geo-replication protects you from losing a region. They solve different problems, and you usually need both. A lot of environments stop at one and assume they’re covered.
That’s where the risk sits. The key is designing for more than one failure scenario.
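One way to make "more than one failure scenario" concrete is to write down which layer addresses which failure and look for the holes. The sketch below does that using only the claims made in this section. The layer and scenario names are illustrative labels rather than Azure settings, and the mapping is deliberately simplified.

```python
# Which failure each layer addresses, as described in this article.
# Names are illustrative labels, not real configuration options.
COVERAGE = {
    "built_in_ha": {"local_failure"},
    "zone_redundancy": {"datacentre_outage"},
    "geo_replication": {"region_outage"},
}


def uncovered(scenarios, enabled_layers):
    """Return the scenarios that none of the enabled layers addresses."""
    covered = set()
    for layer in enabled_layers:
        covered |= COVERAGE.get(layer, set())
    return sorted(set(scenarios) - covered)


scenarios = ["local_failure", "datacentre_outage", "region_outage"]

# Stopping at zone redundancy leaves the regional scenario exposed.
print(uncovered(scenarios, ["built_in_ha", "zone_redundancy"]))
# ['region_outage']

print(uncovered(scenarios, ["built_in_ha", "zone_redundancy", "geo_replication"]))
# []
```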
Recommendation: Combine zone redundancy with geo-replication or failover groups to cover both zonal and regional failures.
Backups are still doing the heavy lifting
Even with solid HA, backups are still your safety net. If data gets corrupted, deleted, or changed incorrectly, HA just keeps everything in sync — including the problem.
That’s where backups come in. A good baseline is simple and consistent — nothing fancy, just reliable. Regular full backups, frequent transaction log backups, and a structure that matches your recovery point requirements.
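To show what "matches your recovery point requirements" means in practice, here is a minimal sketch that compares a log backup schedule with an RPO. It assumes the worst-case data loss is the log backup interval plus the time a backup takes to complete and reach safe storage. That is a simplification, and the function names and example durations are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(log_backup_interval, backup_and_copy_duration=timedelta(0)):
    """Simplified assumption: everything since the last log backup that
    finished and was copied somewhere safe could be lost."""
    return log_backup_interval + backup_and_copy_duration


def meets_rpo(rpo, log_backup_interval, backup_and_copy_duration=timedelta(0)):
    """True if the assumed worst-case loss fits inside the agreed RPO."""
    return worst_case_data_loss(log_backup_interval, backup_and_copy_duration) <= rpo


rpo = timedelta(minutes=15)

# Log backups every 15 minutes that take 2 minutes to land: exposure is 17 minutes.
print(meets_rpo(rpo, timedelta(minutes=15), timedelta(minutes=2)))  # False

# Log backups every 10 minutes with the same 2 minutes: exposure is 12 minutes.
print(meets_rpo(rpo, timedelta(minutes=10), timedelta(minutes=2)))  # True
```

The point of the first case is that a 15-minute schedule does not automatically deliver a 15-minute RPO once the time to complete and copy the backup is counted.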
But like everything else, the schedule isn’t the important part. The important part is knowing you can actually restore what you’ve taken.
Recommendation: Maintain a consistent backup strategy with frequent log backups aligned to your RPO, and prove that you can restore from them (testing is covered later).
How you take backups matters more than people think
There’s a big difference between “backups exist” and “backups are well designed.”
On-prem and IaaS environments benefit from scripted approaches like Ola Hallengren’s solution. It’s flexible, predictable, and far easier to manage properly than maintenance plans, which tend to get left as they were first configured.
In Azure, most of the heavy lifting is done for you. Automated backups, geo-redundancy, point-in-time restore — it’s all there. But it still needs to be configured to match business expectations, especially when it comes to retention.
The key here is consistency. Use approaches that are predictable, repeatable, and easy to validate.
Recommendation: Use scripted backup solutions like Ola Hallengren on IaaS/on-prem, and rely on Azure’s automated backups in PaaS.
Store backups like you expect something to go wrong
Keeping everything local might make restores faster, but it doesn’t protect you if the environment itself fails. If the VM goes, or the region goes, you don’t want your backups disappearing with it.
That’s why offsite and geo-redundant storage matter.
The safest setups assume failure and design around it. If your backups only exist in the same place as your production system, they’re not really giving you full protection. They’re just another copy in the same failure domain.
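The "same failure domain" idea can be checked mechanically. The sketch below is a minimal illustration with a hypothetical `Location` type and made-up host and region values. It asks whether at least one backup copy would survive losing the production host, and whether one would survive losing the production region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical description of where something lives."""
    region: str
    host: str


def survivable_failures(production, backup_copies):
    """Report which losses leave at least one backup copy intact."""
    return {
        "host_loss": any(c.host != production.host for c in backup_copies),
        "region_loss": any(c.region != production.region for c in backup_copies),
    }


# Example values only.
production = Location(region="uksouth", host="sql-vm-01")
local_only = [Location("uksouth", "sql-vm-01")]
with_offsite = local_only + [Location("ukwest", "backup-storage-b")]

print(survivable_failures(production, local_only))
# {'host_loss': False, 'region_loss': False}

print(survivable_failures(production, with_offsite))
# {'host_loss': True, 'region_loss': True}
```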
Recommendation: Store backups in geo-redundant storage (preferably RA-GRS) to ensure recovery beyond the primary environment.
Retention should be deliberate, not inherited
Do you actually know how long your backups are kept for? Do you have a clear structure for weekly, monthly, and yearly retention, or is it just whatever was set up originally?
Retention isn’t just about business recovery needs — it’s also driven by compliance and cost. Keep too much and you’re wasting money. Keep too little and you risk not having what you need when it matters.
This is something that should be agreed properly and enforced automatically, not left to chance.
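As one possible shape for "agreed properly and enforced automatically", the sketch below encodes a tiered daily, weekly, monthly and yearly retention rule as a single function. The tier lengths, the choice of Sunday for the weekly backup and the first of the month for the monthly one, and the `RetentionPolicy` name are all assumptions for illustration. The real values should come from the business and from your regulatory requirements.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical retention tiers, expressed as maximum age in days."""
    daily_days: int = 14
    weekly_days: int = 56        # about 8 weeks
    monthly_days: int = 365      # about 12 months
    yearly_days: int = 365 * 7   # about 7 years


def should_keep(backup_date: date, today: date, policy: RetentionPolicy) -> bool:
    """Decide whether a full backup taken on backup_date is still retained."""
    age = (today - backup_date).days
    if age < 0:
        raise ValueError("backup date is in the future")
    if age <= policy.daily_days:
        return True   # recent backups are always kept
    if backup_date.weekday() == 6 and age <= policy.weekly_days:
        return True   # weekly tier: Sunday backups (assumed)
    if backup_date.day == 1 and age <= policy.monthly_days:
        return True   # monthly tier: first of the month (assumed)
    if backup_date.month == 1 and backup_date.day == 1 and age <= policy.yearly_days:
        return True   # yearly tier: 1 January (assumed)
    return False


policy = RetentionPolicy()
today = date(2024, 6, 30)
print(should_keep(date(2024, 6, 20), today, policy))  # True: within 14 days
print(should_keep(date(2024, 6, 4), today, policy))   # False: old, and in no tier
print(should_keep(date(2023, 1, 1), today, policy))   # True: yearly tier
```

Keeping the rule in one reviewable place makes it easier to check that what was agreed is what is actually being enforced.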
There’s nothing particularly exciting about retention, but getting it wrong creates problems that only show up later.
Recommendation: Define and enforce data retention periods based on business and regulatory requirements, ensuring data is only kept for as long as necessary, and automate policies for consistency.
Testing is what turns design into reality
This is probably the most common gap.
You’ve got HA in place. Backups are running. Everything looks healthy. But do they actually work or is it all just untested theory?
When did you last fail over and let things run on the secondary for real? When did you last restore a database end-to-end, not just kick off the command?
Failover should be something you’ve seen happen, not something you hope works. Restores should be something you’ve timed properly, including everything around them — not just the restore itself.
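Timing "everything around" a restore is easier if each phase of the drill is recorded separately. Here is a minimal sketch of a phase timer for that purpose. The `DrillTimer` class and the phase names are hypothetical, and the `pass` statements stand in for whatever your real drill steps are.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Records how long each named phase of a recovery drill takes."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so the timer itself can be tested
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def total_seconds(self):
        return sum(self.phases.values())


timer = DrillTimer()
with timer.phase("locate and copy backup files"):
    pass  # placeholder for the real step
with timer.phase("restore full and log backups"):
    pass  # placeholder for the real step
with timer.phase("integrity check and application smoke test"):
    pass  # placeholder for the real step

for name, seconds in timer.phases.items():
    print(f"{name}: {seconds:.1f}s")
print(f"total: {timer.total_seconds():.1f}s")
```

The total, not the restore phase alone, is the figure to compare against your RTO.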
Because when something breaks, that’s the worst possible time to find out how your system actually behaves.
Recommendation: Regularly test failover, restores, and DR scenarios in a way that reflects real-world failures.
Final thought
As a DBA, you’re relied on for more than just putting the right design in place. You’re responsible for making sure it actually works, that it’s been tested, and that you can execute it when it really matters.
When things go wrong, there’s no time to figure it out — you’re expected to deliver.
If you’re not confident that your setup meets your requirements, it’s worth stepping back and reviewing it properly.
Why not speak to Coeo and see where you stand?
Find out more about Coeo’s Restore Confidence Review service.