9/12/2016 9:02:45 AM
Disaster Recovery Planning Actually Starts with Planning to Fail
Disaster Recovery,PagerDuty
App Developer Magazine



Monday, September 12, 2016

Eric Sigler


For disaster recovery done right, failure is not only an option – it’s mandatory.

Why? A scan of recent news headlines gives us the answer. In just the last month, we’ve seen one of the earliest major airlines to adopt technology brought to its knees by a power issue that cascaded through its systems, resulting in thousands of canceled flights. In our modern, digital and connected age, any downtime at all is more than an inconvenience – it means untold millions of dollars in lost revenue. In fact, for just the Fortune 1000, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion.

Despite the obvious costs of unplanned downtime and the damage done to brands, many companies end up caught with either an inadequate disaster recovery (DR) plan or none at all. Market researcher IDC estimates that as many as 50 percent of organizations have DR plans that fall short of the mark.

To establish the foundation of a thorough DR plan, you need to accept the following five tenets and assumptions:

Planning to Failover is Planning to Fail

In recent years, when large organizations have had a massive failure that led to downtime, we’ve seen that their DR plans were predicated on a switchover from primary infrastructure to backup systems. But in fact, a failover architecture is often a recipe for failure, due to the constantly changing demands placed on IT. For example, with Amazon Web Services, it’s not as though there is an entirely separate globe-spanning infrastructure ready to pick up the load if AWS Plan A fails. In today’s interconnected economy, a failover strategy doesn’t align with the movement to modernize infrastructure through digital transformation. Modern DR has no single point of failure; rather, you continually send traffic through multiple regions and skip switching from primary to hot standby altogether.
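As a rough illustration of sending traffic through multiple regions all the time, the sketch below rotates requests across every region that is currently healthy. The class name, method names and region names are hypothetical and chosen only for this example; a real deployment would rely on DNS or a global load balancer rather than an in-process object.

```python
import itertools


class MultiRegionRouter:
    """Spread requests across every healthy region (active-active)."""

    def __init__(self, regions):
        # Every region starts out healthy and takes live traffic.
        self.health = {name: True for name in regions}
        self._cycle = itertools.cycle(list(regions))

    def mark(self, region, healthy):
        """Record the latest health-check result for a region."""
        self.health[region] = healthy

    def pick(self):
        """Return the next healthy region in rotation."""
        # Visit each region at most once per call, skipping unhealthy ones.
        for _ in range(len(self.health)):
            region = next(self._cycle)
            if self.health[region]:
                return region
        raise RuntimeError("no healthy regions available")


# Hypothetical region names, used only for illustration.
router = MultiRegionRouter(["us-east", "us-west", "eu-west"])
router.mark("us-west", False)  # an unhealthy region simply drops out of rotation
```

Because every region already serves traffic, losing one means the others absorb its share; there is no separate standby path that only gets exercised during an emergency.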

Approach DR as Standard Operating Procedure

Proper disaster recovery is not built on the premise of an exception case in your infrastructure, serving as a backup for the worst-case scenario. It shouldn’t be a master plan kept in your back pocket in case of emergency, but rather woven into your normal operations and your approach to daily tasks. Most organizations – to their peril – still approach DR as an afterthought, and it is typically managed by a specific department or team. Effective DR, however, isn’t owned by any one area or department within an organization. Instead, it should be part of a larger cultural mindset in which everyone and every team, from the C-suite to the engineers, takes ownership and plays a vital role. Bottom line: treat IT disaster as the rule, not the exception, and build recovery into your organization’s day-to-day routine.


Practice Makes Perfect

Backups aren’t backups until they’re tested. This may sound obvious, but any credible and thorough DR plan must include routine verification that it can actually be executed successfully. To best prepare for unplanned downtime, organizations should schedule regular practice sessions in which backups are restored, new hosts are provisioned, and monitoring and alerting are checked for accuracy. This needn’t be an onerous process, and it can often be folded into other efforts. For example, if you have a staging and production set of environments, you could restore the production database backup into staging on a regular basis. As you do this, take care to sanitize any sensitive data. The exercise confirms that your backups are actually usable and gives your staging environment "fresh" data.
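The restore-into-staging routine described above might look something like the following sketch. It uses Python's built-in sqlite3 module purely so the example is self-contained; the `users` table, the `email` column and the function name are assumptions made for illustration, and a production database would need its own restore tooling.

```python
import sqlite3


def restore_and_verify(backup_conn, staging_conn, table="users", min_rows=1):
    """Restore a backup into staging, scrub emails, and check it is usable.

    `table` must be a trusted identifier; it is interpolated into SQL.
    """
    # Copy the entire backup database into the staging connection.
    backup_conn.backup(staging_conn)
    # Sanitize sensitive data before anyone uses the staging copy.
    staging_conn.execute(
        f"UPDATE {table} SET email = 'user' || id || '@example.invalid'"
    )
    staging_conn.commit()
    # A backup only counts if it passes an integrity check and holds data.
    integrity = staging_conn.execute("PRAGMA integrity_check").fetchone()[0]
    rows = staging_conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return integrity == "ok" and rows >= min_rows


# Hypothetical stand-ins for a production backup and a staging database.
backup = sqlite3.connect(":memory:")
backup.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
backup.execute("INSERT INTO users (email) VALUES ('alice@corp.example')")
backup.commit()

staging = sqlite3.connect(":memory:")
usable = restore_and_verify(backup, staging)
```

Running a check like this on a schedule turns "we have backups" into "we restored last night's backup and it worked."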

Failure Is Not Only an Option; It's a Requirement

There’s a saying at NASA that’s been around since the ’60s: “Test Like You Fly.” In fact, over the years, the phrase has progressed from an undefined notion to an actual assessment and implementation process. It’s really no different in the modern enterprise. It’s imperative not only to acknowledge failure as a reality but also to practice it, instituting failure and recovery scenarios as part of everyday operations. Many organizations hold “Game Days,” where they purposely simulate (or even better, actually induce) a negative impact on part of their infrastructure to verify that monitoring, alerting and recovery automation work correctly. Load balancers, application backends and persistence tiers are all good candidates for inducing a targeted failure. It is also key to run these practiced failures on both customer-facing and non-customer-facing infrastructure. While some companies may have an aversion to potentially inconveniencing users, the negatives of prepared-for downtime are minimal compared with an unexpected outage for which your team isn’t ready. A cultural bonus is that embracing failure can support innovation; teams that don’t fear failure and are prepared for it are better able to take bigger risks.
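A Game Day can be sketched in miniature as follows: take one backend down on purpose, then check that monitoring noticed and that the remaining backends can still serve traffic. Everything here (the `Backend` class, `run_health_checks`, `game_day` and the backend names) is a hypothetical stand-in for real infrastructure and monitoring, not any particular tool's API.

```python
class Backend:
    """Toy application backend that can be switched off for an exercise."""

    def __init__(self, name):
        self.name = name
        self.up = True


def run_health_checks(backends):
    """Stand-in for monitoring: return the names of backends that are down."""
    return [b.name for b in backends if not b.up]


def game_day(backends, target_name):
    """Induce a failure on one backend and report what was observed."""
    target = next(b for b in backends if b.name == target_name)
    target.up = False  # induce the failure on purpose
    try:
        alerts = run_health_checks(backends)
        detected = target_name in alerts          # did monitoring notice?
        still_serving = any(b.up for b in backends)  # can anyone take traffic?
    finally:
        target.up = True  # always restore the backend after the exercise
    return {"detected": detected, "still_serving": still_serving, "alerts": alerts}


# Hypothetical backend names, used only for illustration.
pool = [Backend("app-1"), Backend("app-2")]
report = game_day(pool, "app-1")
```

The useful output of an exercise like this is the report itself: a failure that went undetected, or one that left nothing serving traffic, is a gap found on your own schedule rather than during a real outage.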

Don't Have a Failure to Communicate

A good DR plan will have an effective and comprehensive communication plan woven throughout that includes internal stakeholders like IT Ops, upper management, customer service and support, as well as customers. Having a “people process” in place is crucial to effective recovery plans. Automating the communication and escalation processes when disaster strikes is also extremely valuable. Don’t wait until you have a big incident to set up a dashboard or create an email distribution list. It’s better to have these things and never need them than the other way around. Of course, you’re going to communicate internally with the relevant parties, get on the phone, hop on a chat client and keep a record, but don’t forget about your customers. In the immediate aftermath of a disaster, one of the first big decisions you’ll need to make is when you’re going to address the issue with customers. Outages are chaotic and it can be difficult to settle on the best way to let your customers know, but the most important thing is to start right away. Let your customers immediately know that you are aware of the issue and at work on a solution, maintaining a concise and authoritative tone.
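Automated escalation can be as simple as a table that maps how long an incident has gone unacknowledged to who should have been told by then. The sketch below is one hypothetical way to express that; the roles and time thresholds are invented for illustration and are not taken from any product or from the plan described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this step applies
    notify: str         # who gets notified at this step


# Hypothetical policy: roles and thresholds are illustrative assumptions.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "engineering manager"),
    EscalationStep(30, "customer support lead"),
]


def who_to_notify(minutes_unacknowledged, policy=POLICY):
    """Return everyone who should have been notified by this point."""
    return [
        step.notify
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]


notified = who_to_notify(20)
```

Writing the policy down as data before an incident means nobody has to decide mid-outage who gets paged next.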

As recent headlines have shown, whether you’re a major airline or an online retailer, it’s not a matter of if, but a question of when disaster will strike your infrastructure. Our world and economies are becoming ever more reliant on IT, and as a result, when IT is impacted, your revenue, brand, and customer relationships become casualties. As organizations continue to modernize their infrastructures and IT becomes more ingrained within enterprises, so, too, should disaster preparedness.




Read more: https://www.pagerduty.com/




This content is made possible by a guest author, or sponsor; it is not written by and does not necessarily reflect the views of App Developer Magazine's editorial staff.
