Connect dots, reveal blind spots
Determining which operations and other areas of the business are the most vital, both to the company and to its customers, requires considerable focus and commitment. The next step, mapping out the relevant processes, handoffs, and dependencies, can be equally challenging. Businesses struggle to do this well because the work is complex and can involve multiple departments and players. It also departs from the usual approach, in which business continuity is examined solely within siloed functions. What needs to be identified is the way the whole service is delivered, not its individual mechanisms. Once the weak areas have been identified, the strengthening of the processes involved can be more forensic and deliver greater resilience. This sort of planning is time well spent, because it inevitably highlights blind spots and actionable areas for improvement.
Though blind spots may lurk anywhere, they often build up where new tech systems have replaced old ones or where systems were patched together on the fly after a merger or acquisition. Poor institutional memory can create unnoticed vulnerabilities, too. For example, an exercise to understand how a manufacturer’s key products are made found that an unassuming, overlooked part of a legacy computer system was at the heart of the production process. The executives recognized that their focus on financial performance had blinded them to the need for more mundane tech upgrades that directly supported operations. Only by getting the team to focus on the entire manufacturing process did the oversight come to light. If the server running that legacy system had failed, 80% of production would have gone down for days.
Blind spots aren’t just tech-related. The international payments system at a large bank remained offline for three days because the one employee who had the password needed to access the back-end system was on a backcountry hiking trip with no internet coverage. The bank’s IT department had been unaware of the vulnerability and, in any event, had always assumed it could manage workarounds on the fly. The cost for the bank was a dented reputation and regulatory fines. The episode proved to be a wake-up call: even though nearly all the bank’s customers used international payments, no one at the bank had known that the service could fail in exactly this way.
Mistaken assumptions were also an issue at a large financial-services company that discovered flaws in its payments processes—the service that mattered most to customers across all divisions. Company leaders had been confident that manual workarounds would save the day. Each of the company’s six divisions had already codified the contingency processes that would be needed should any part of the system fail. Indeed, the company was legally obligated to stress test this function—and had.
But executives were chagrined to learn later (thankfully during a mapping exercise and not a crisis) that the workarounds the divisions had prepared didn’t account for the need to scale them companywide. And that was the blind spot: had the plans been invoked during an emergency, they would have been able to handle only 12 payments a day across all six divisions. Even that might have been optimistic, given that the plans were created in incompatible formats and used wildly different assumptions about how long it would take to recover operations, ranging from 48 hours to a week.
These realizations led to a larger conversation about what it would take to put in place a system that could absorb the impact of a payments failure—whether it was because of an IT problem or a cyberattack, or simply because the people with authority to sign off on payments were missing in action. Which customers could survive without cash, and for how long? Would it be better to help small businesses, which represented less of the company’s revenues but were more vulnerable to payment delays? Or was it better to help larger businesses?
Traditional business continuity planning would focus on recovering business as usual for the whole system. Creating resilience flips that assumption and asks instead: how do we recover some degree of operational capacity immediately and survive a potential catastrophe?