Disaster Recovery Planning Actually Starts with Planning to Fail

Monday, September 12, 2016

For disaster recovery done right, failure is not only an option – it’s mandatory.

Why? A scan of recent news headlines gives us the answer. In just the last month, we’ve seen a major airline, one of the earliest to adopt technology, brought to its knees by a power issue that cascaded through its systems and led to thousands of canceled flights. In our modern, digital and connected age, any downtime at all is more than an inconvenience; it means untold millions of dollars in lost revenue. In fact, for just the Fortune 1000, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion.

Despite the obvious costs of unplanned downtime and the damage done to brands, many companies end up caught with either an inadequate disaster recovery (DR) plan or none at all. Market researcher IDC estimates that as many as 50 percent of organizations have DR plans that fall short of the mark.

To establish the foundation of a thorough DR plan, you need to accept the following five tenets and assumptions:

Planning to Failover is Planning to Fail

In recent years, when large organizations have had a massive failure that led to downtime, we’ve seen that their DR plans were predicated on a switchover from primary infrastructure to backup systems. But in fact, a failover architecture is often a recipe for failure, due to the constantly changing demands placed on IT. For example, with Amazon Web Services, it’s not as though there is an entirely separate globe-spanning infrastructure ready to pick up if AWS Plan A fails. In today’s interconnected economy, a failover strategy doesn’t align with the movement to modernize infrastructure through digital transformation. Modern DR has no single point of failure; rather, you continually send traffic through multiple regions and skip switching from primary to hot standby altogether.
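
To make the idea concrete, here is a minimal Python sketch of that kind of routing, in which every healthy region takes a share of live traffic and there is no standby to switch to. The region names, the weights and the `pick_region` helper are hypothetical and purely illustrative; in practice this job is usually handled by DNS or a global load balancer rather than application code.

```python
import random

# Hypothetical regions and traffic shares; all of them serve live traffic.
REGION_WEIGHTS = {"us-east": 2, "us-west": 1, "eu-west": 1}


def pick_region(weights, healthy, rng=random):
    """Pick a region for one request from the regions currently healthy.

    There is no primary and no standby: an unhealthy region simply drops
    out of the pool and the remaining regions absorb its share.
    """
    candidates = {name: w for name, w in weights.items()
                  if name in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy region available")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


if __name__ == "__main__":
    healthy = {"us-east", "us-west", "eu-west"}
    print(pick_region(REGION_WEIGHTS, healthy))
    healthy.discard("us-east")  # a region fails; traffic keeps flowing
    print(pick_region(REGION_WEIGHTS, healthy))
```

Because every region already carries production traffic, losing one is a change in proportions rather than a switchover that has never been exercised.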

Approach DR as Standard Operating Procedure

Proper disaster recovery is not built on the premise of an exception case in your infrastructure, serving as a backup for the worst-case scenario. It shouldn’t be a master plan kept in your back pocket in case of emergency, but rather woven into your normal operations and your approach to daily tasks. Most organizations, to their peril, still approach DR as an afterthought, and it is typically managed by a specific department or team. Effective DR, however, isn’t owned by any one area or department within an organization. Instead, it should be part of a larger cultural mindset in which every person and every team, from the C-suite to the engineers, takes ownership and plays a vital role. Bottom line: treat IT disaster as the rule, not the exception, and build DR into your organization’s day-to-day routine.

Practice Makes Perfect

Backups aren’t backups until they’re tested. This may sound obvious, but any credible and thorough DR plan must include routine verification that it can actually be executed successfully. To best prepare for unplanned downtime, organizations should schedule regular drills in which backups are restored, new hosts are provisioned, and monitoring and alerting are checked for accuracy. This needn’t be an onerous process, and it can often be folded into other efforts. For example, if you have separate staging and production environments, you could restore the production database backup into staging on a regular basis. As you do this, take care to sanitize any sensitive data. The exercise confirms that your backups are actually usable and keeps your staging environment supplied with "fresh" data.
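
As a rough illustration of such a drill, the Python sketch below restores a backup into staging, scrubs a sensitive column and reports whether the restore held up. It uses SQLite from the standard library only so that it runs anywhere. The `users` table, its `email` column and the function names are assumptions made for the example, not part of any particular product.

```python
import sqlite3


def restore_into_staging(backup, staging):
    """Copy the entire backup database into the staging database."""
    backup.backup(staging)


def sanitize(staging):
    """Overwrite sensitive columns so no real customer data reaches staging."""
    staging.execute(
        "UPDATE users SET email = 'user' || id || '@example.invalid'"
    )
    staging.commit()


def run_restore_drill(backup, staging):
    """Restore, sanitize, and report whether the backup proved usable."""
    restore_into_staging(backup, staging)
    sanitize(staging)
    expected = backup.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    restored = staging.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    leaked = staging.execute(
        "SELECT COUNT(*) FROM users WHERE email NOT LIKE '%@example.invalid'"
    ).fetchone()[0]
    return {"row_counts_match": expected == restored, "sanitized": leaked == 0}
```

Run on a schedule, a report like this turns "we think the backups work" into something that is checked every time staging is refreshed. A real drill would compare more than row counts, but the shape is the same.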

Failure Is Not Only an Option; It's a Requirement

There’s a saying at NASA that’s been around since the ’60s: “Test Like You Fly.” In fact, over the years, the phrase has progressed from an undefined notion to an actual assessment and implementation process. It’s really no different in the modern enterprise. It’s imperative not only to acknowledge failure as a reality but actually to practice it, by instituting failure and recovery scenarios as part of everyday operations. Many organizations hold “Game Days,” where they purposely simulate (or even better, actually induce) a negative impact on part of their infrastructure to verify that monitoring, alerting and recovery automation work correctly. Load balancers, application backends and persistence tiers are all good candidates for inducing a targeted failure. It is also key to run these practiced failures on both customer-facing and non-customer-facing infrastructure. While some companies may have an aversion to potentially inconveniencing users, the negatives of prepared-for downtime are minimal compared with those of an unexpected outage for which your team isn’t ready. A cultural bonus is that embracing failure can support innovation; teams that don’t fear failure and are prepared for it are better able to take bigger risks.
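
The skeleton of a Game Day can be sketched in a few lines of Python. Everything here is a simulation under stated assumptions: `Component`, `monitor`, `auto_recover` and `run_game_day` are hypothetical stand-ins for your real infrastructure, monitoring and recovery automation, and the point is only to show the sequence of inducing a failure, checking that the alert fires, then checking that recovery works.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    healthy: bool = True


def monitor(components):
    """Return one alert message per unhealthy component."""
    return [f"ALERT: {c.name} is down" for c in components if not c.healthy]


def auto_recover(components):
    """Stand-in for recovery automation: bring failed components back."""
    for c in components:
        c.healthy = True


def run_game_day(components, target):
    """Induce a failure on one component and check alerting and recovery."""
    victim = next(c for c in components if c.name == target)
    victim.healthy = False  # induce the failure on purpose
    alert_fired = any(target in alert for alert in monitor(components))
    auto_recover(components)
    recovered = not monitor(components)  # no alerts left means we are healthy
    return {"target": target, "alert_fired": alert_fired, "recovered": recovered}
```

A result where `alert_fired` or `recovered` comes back false is exactly the finding you want from a planned exercise rather than from a real outage.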

Don't Have a Failure to Communicate

A good DR plan will have an effective and comprehensive communication plan woven throughout, one that covers internal stakeholders like IT Ops, upper management, and customer service and support, as well as customers. Having a “people process” in place is crucial to effective recovery plans. Automating the communication and escalation processes for when disaster strikes is also extremely valuable. Don’t wait until you have a big incident to set up a dashboard or create an email distribution list. It’s better to have these things and never need them than the other way around. Of course, you’re going to communicate internally with the relevant parties, get on the phone, hop on a chat client and keep a record, but don’t forget about your customers. In the immediate aftermath of a disaster, one of the first big decisions you’ll need to make is when to address the issue with customers. Outages are chaotic, and it can be difficult to settle on the best way to let your customers know, but the most important thing is to start right away. Let your customers know immediately that you are aware of the issue and at work on a solution, and maintain a concise and authoritative tone.
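
One way to picture automated escalation is as a policy written down in advance rather than decided mid-outage. The Python sketch below is illustrative only: the roles, the timings in `ESCALATION_POLICY` and both helper functions are assumptions, and a real setup would hand this to an incident management tool rather than hand-rolled code.

```python
# Hypothetical policy: (minutes without acknowledgement, who gets notified).
ESCALATION_POLICY = [
    (0, "on-call engineer"),
    (15, "team lead"),
    (30, "engineering manager"),
]


def contacts_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return everyone who should have been notified by this point."""
    return [who for delay, who in policy if minutes_unacknowledged >= delay]


def first_customer_update(service):
    """Draft the short initial notice: aware of the issue, working on it."""
    return (f"We are aware of an issue affecting {service} and are working "
            "on a fix. We will post another update as soon as we know more.")


if __name__ == "__main__":
    print(contacts_to_notify(20))
    print(first_customer_update("checkout"))
```

Keeping the first customer message as a prepared template means it can go out in the first minutes of an incident, when nobody has time to wordsmith.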

As recent headlines have shown, whether you’re a major airline or an online retailer, it’s not a matter of if, but of when, disaster will strike your infrastructure. Our world and economies are becoming ever more reliant on IT and, as a result, when IT is impacted, your revenue, brand, and customer relationships become casualties. As organizations continue to modernize their infrastructures and IT becomes more ingrained within enterprises, so, too, should disaster preparedness.

Read more: https://www.pagerduty.com/

This content is made possible by a guest author, or sponsor; it is not written by and does not necessarily reflect the views of App Developer Magazine's editorial staff.