The evolution from lean and agile to antifragile

New cloud development styles have implications for business operations, too.
By Alan Morrison and Bo Parker

Mike Krieger and Kevin Systrom, two entrepreneurial software developers with modest prior knowledge of back-end systems, built and launched Instagram in 2010 on a single server from one of their homes. Then they stood back to see what would happen.

Figure 1: Rapid social network growth at Instagram

On the first day, 25,000 users signed up for the mobile photo-sharing service. Within two years, Instagram had 2 million users. It acquired 1 million users in just 12 hours after it began offering an Android version. By May 2013, nine months after Facebook acquired Instagram for $1 billion, the photo-sharing service had more than 100 million active users per month and continued to show substantial new growth quarter to quarter. (See Figure 1.)

As user sign-ups skyrocketed after the launch, the founders quickly decided to move the service to the Amazon cloud. The cloud, as will be described later, offered the scalability needed, but Instagram’s rapid growth also demanded they revisit how the entire system operated. They recognized that trying to achieve the absolute stability of a traditional enterprise system was unrealistic given the exponential growth. Instead, they designed the system to take advantage of the uncertainty created by the pace of growth even as the growth itself forced changes to core elements. With this approach, they solved a succession of scaling problems one at a time, as they arose, using open source databases and various data handling and other techniques identified with the help of open developer communities.

In April 2012, Krieger acknowledged that they had learned the real meaning of scaling: “Replacing all components of a car while driving it at 100 mph.”1

Like the Instagram founders, several other web entrepreneurs have taken advantage of cloud development, infrastructure, and operations capabilities that many established enterprises don’t use and only vaguely understand. These capabilities are consistent with an emerging systems management concept called antifragility, which has significant business implications for all kinds of enterprises, large or small, new or well established.

Testing as a service

Multi-application test strategies offer the advantage of testing at a layer of abstraction above the application software development life cycle level. These strategies can be used even if several software development life cycle systems are in use.

With typical managed service testing, companies pay for resources whether they use them or not. Cloud-based testing models provide companies greater flexibility to pay for resources, frameworks, and tools as they are needed. Enterprises that have a multi-app platform strategy can use a testing-as-a-service (TaaS) provider on a per-use basis. When the TaaS provider executes against the multi-app platform strategy from the cloud, resources can be engaged from any location, and testing artifacts and test data can be shared. In addition, the TaaS approach contains environment costs, because test environment provisioning and de-provisioning occur via the cloud.

 

This issue of the Technology Forecast describes the emerging IT development, infrastructure, and operations methodologies in the broader context of strategic and operational enterprise change initiatives. PwC forecasts that agile methods in businesses will force modifications in IT, leading to the adoption of change management techniques in IT similar to the antifragile approach that web-scale companies use to manage rapid and continuous change.

This first article discusses the antifragile approach and how it embraces continuous, disruptive change. It positions antifragility in the context of agile and lean methodologies, detailing how the principles underlying these methods remain in the antifragile approach. The article also positions antifragility as a catalyst for broader organizational change, and it traces the link between the pace of change that becomes possible using agile methods at a business design level and the potential disconnect if IT cannot keep pace.

The second article covers tools and techniques that broadly support the continuous delivery and deployment required of antifragile systems. Facebook, LinkedIn, Google, Yahoo, and others have needed to develop tools that support this approach. The good news is they generally have made these tools available to others as open source technology, and a network of support vendors is beginning to help companies adopt them.

The third article describes the CIO’s challenges in taking on the demand for a more agile business, including the cultural changes needed within IT. One key question is how to get comfortable with the seeming paradox of simultaneous constant change and stability in production systems. The answer lies in the adoption of DevOps methods.

Antifragility: What it is

The Instagram example might seem like just another typical Silicon Valley startup success story, but its broader significance is the level of web-scale responsiveness that a handful of people were able to quickly discover and demonstrate. For Instagram, scaling meant anticipating, observing, and fixing performance problems when they arose as quickly as possible. They ended up developing a giant feedback-response loop; rapid responsiveness made it viable. This level of responsiveness has only recently become possible due to cloud computing. Negotiating the learning curve quickly to achieve a level of consistent responsiveness is fundamental to antifragility, and it is a capability sorely needed in today’s enterprises, whether by cloud developer teams or business more generally.

The term “antifragility,” coined by Nassim Nicholas Taleb in his book Antifragile: Things That Gain from Disorder, is defined by the source of its strength.2 Like the human body, whose immune system gets stronger from exposure to germs and diseases, the antifragile system improves or responds positively when shocked.

While fragile systems are easily injured and suffer from volatility, antifragile systems grow stronger in response to volatility. So-called robust systems remain unchanged. (See Figure 2.)

Figure 2: The fragile versus the robust, agile, and antifragile 

The concept of antifragility becomes clearer when compared and contrasted with stability. As Taleb describes it, antifragile is beyond stable, beyond robust; stable and robust systems resist shocks but stay the same. In contrast, Taleb describes antifragile systems as those capable of absorbing shocks and being changed by them in positive ways. The key insight is that stable systems, because they don’t change, eventually experience shocks large enough to cause catastrophic failure. Antifragile systems break a little all the time but evolve as a result, becoming less prone to catastrophic failure.

Taleb, a former hedge fund manager and derivatives trader, argues that antifragile systems gain strength from volatility. They are the exact opposite of fragile systems. He spent years pondering the vulnerability to volatility and, in particular, infrequent catastrophes sometimes called “black swan” events. In The Black Swan, his earlier best-selling book, Taleb noted that many banks and trading companies used stable models that worked most of the time but were unable to predict or account for the rare event of house prices falling throughout the United States, leading to the recent financial crisis. When the magnitude of change stays within a normal range, robustness can be a state that seems resilient. During periods of unusual change, only the antifragile organizations prove to be resilient.

The concept of antifragility evolved from the socioeconomic context with which Taleb was most familiar, not from cloud computing. But it’s in the cloud that the concept may have its broadest practical application to date.

Rob England, author of Basic Service Management, has made the connection between the concept of antifragility and the design and operation of today’s IT systems. “If an organisation’s IT is unstable, we can move from a fragile condition to a spectrum of better states, ranging between antifragile and robust,”3 he writes. Fragile organizations seek to become less fragile or—ideally—antifragile. Organizations that become robust may still be fragile.

Life, Taleb says, is not as predictable or explainable as the rationalists would have us believe; instead, simple sets of rules help us to navigate through a complex and constantly changing landscape. He argues, “We have been fragilizing our economy, our health, education, almost everything—by suppressing randomness and volatility… If about everything top down fragilizes and blocks antifragility and growth, everything bottom up thrives under the right amount of stress and disorder.”4

Business uncertainty will demand agile, antifragile methods

The idea that businesses might be experiencing an unprecedented amount of “stress and disorder” should come as no surprise. CEOs and other senior executives consistently describe uncertain future business conditions as a key concern. Some of the biggest drivers of uncertainty include:

  • Narrower windows of opportunity: Companies offering products and services with distinctive value propositions are experiencing shorter windows of pricing power before competitors commoditize the market.
  • More convergence: Companies are expanding their addressable markets by moving into adjacent areas, often becoming intense competitors with enterprises they formerly bought from or sold to, forcing incumbents to bundle capabilities and expand their value footprints.
  • Slower growth in traditional markets: Developed economies have been in a deep slump for several years with low expectations for future growth. Developing economies offer the greatest potential for new revenue, but multinationals often cannot meet the lower price points or access distinctive channels to market in these economies.
  • Greater need for business model change and additional information services: These uncertainties are leading many companies to consider radical changes to their business models, such as bundling services with products and offering total solutions designed to help customers achieve their goals. (See Technology Forecast 2013, Issue 1.) Other companies are unbundling their vertically integrated offerings and introducing application programming interfaces (APIs) that encourage third parties to build new offerings based in part on these unbundled capabilities. (See Technology Forecast 2012, Issue 2.)

In each case, the people, processes, technologies, leadership, organizational designs, and strategies in these enterprises are experiencing tremendous stress. That’s because large companies have scaled up by developing highly capable but siloed functional domains that minimize cross-organization information and interactions for efficiency purposes. This approach is ideal in a stable world, but in today’s dynamic markets these flows can’t be redesigned and redeployed fast enough. Most companies respond by using traditional business process reengineering methods that explicitly or implicitly seek a new, stable operating environment. In many cases, by the time that new stable design gets deployed, the market has moved on.

Toward a fully digital business

The new IT platform

More and more, the quickest way to a broader customer base is through new digital touch points. Mobile is the fastest growing part of this shift, but it’s not the only part. Mobile payment transaction values will grow 44 percent worldwide in 2013, according to Gartner, rising to $236 billion from just under $201 billion in 2012.* E-commerce as a whole will reach $1.22 trillion in 2013, according to eMarketer, up more than 17 percent from $1.047 trillion in 2012.**

How do enterprises take better advantage of the rapidly growing online environment? By going to where the customer is and by being flexible enough to lead in an environment that constantly changes. Competitive advantage hinges on the ability to be responsive to customers, and that means providing user-centric interfaces, rich information services via application programming interfaces (APIs), consistently compelling value propositions, and a simple way to transact business on whatever kind of device a customer has.

Cloud development methods and the DevOps approach to frequent delivery and iteration are at the core of the digital customer touch-point challenge, because they’re a microcosm of the way a fully digital business operates. Companies that study how talented DevOps teams work can use the insights to inform the business side of their customer engagement models and their operating models generally. A fully digital business runs on what PwC calls the New IT Platform—a way to enable the business to immediately shift direction and be as responsive as new online customers demand.

The pivotal capabilities that enterprises need to tap into—whether sourcing expertise, co-creating with customers on a new product line, or motivating a purchase by a new customer with the help of gamification—are feasible at scale only when the business has made a full digital transformation. (See “Capitalizing on the strategic value of technology” for more information on PwC’s New IT Platform.)

* Gartner, “Gartner Says Worldwide Mobile Payment Transaction Value to Surpass $235 Billion in 2013,” news release, June 4, 2013, http://www.gartner.com/newsroom/id/2504915, accessed July 2, 2013.

** “B2C Ecommerce Climbs Worldwide, as Emerging Markets Drive Sales Higher,” eMarketer Daily Newsletter, June 27, 2013, http://www.emarketer.com/Article/B2C-Ecommerce-Climbs-Worldwide-Emerging-Markets-Drive-Sales-Higher/1010004, accessed July 2, 2013.

 

Management is becoming more and more aware that traditional approaches to business redesign can’t work in rapidly changing business environments. They recognize that employees who have the ability to work across their functional domains should be able to effectively respond. Often, existing practices for discovering and designing optimum solutions perpetuate the silos.

Company size and history can limit the ability to adopt DevOps

DevOps is a working style designed to encourage closer collaboration between developers and operations people: Dev+Ops=DevOps. (See the article, “Making DevOps and continuous delivery a reality,” on page 26 for more detailed information on DevOps methods.) Historically those groups have been working more or less at cross-purposes. DevOps collaboration seeks to reduce the friction between the two groups by addressing the root causes of the friction, making it possible to increase the volume and flow of production code, to improve the efficiency of developer teams, and to reduce the alienation of the ops people who must guard the stability of the system.

One of those root causes of friction is poor workflow design. DevOps encourages extensive automation and workflow redesign so developers can release small bits of code frequently (in a more or less continuous delivery cycle) and yet not disrupt the operational environment in doing so. The workflow includes buffers, compartmentalization, and extensive monitoring and testing—a very extensive and well-designed pipeline, but also a rich feedback loop. It’s a very test-driven environment. When the small bits of code get deployed, the individual changes to the user experience tend to be minor or limited to a small subaudience initially.
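Limiting a change to a small subaudience can be illustrated with a minimal Python sketch. It is not drawn from any company described in this article: the function names `in_rollout` and `audience_size`, the feature name, and the percentages are all hypothetical. The sketch hashes each user ID into one of 100 stable buckets and exposes the new code path only to users whose bucket falls below the current rollout percentage.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout audience.

    Hashing the feature name together with the user ID gives each user a
    stable bucket from 0 to 99, so the same user sees the same behavior
    on every request while the percentage is unchanged.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def audience_size(user_ids, feature: str, percent: int) -> int:
    """Count how many of the given users would receive the new feature."""
    return sum(in_rollout(uid, feature, percent) for uid in user_ids)


if __name__ == "__main__":
    # Hypothetical user IDs and feature name, for illustration only.
    users = [f"user-{n}" for n in range(10_000)]
    for percent in (1, 10, 50, 100):
        print(percent, audience_size(users, "new-photo-filter", percent))
```

Because a user's bucket never changes, raising the percentage only adds users to the audience; nobody who already has the feature loses it as the rollout widens.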

How antifragility fits in with other change management theories

Enterprise process optimization has a heritage that goes back at least to Gustavus Swift and his meat processing efficiency breakthroughs, which inspired Henry Ford’s automobile production assembly lines. In the postwar era, the key processing efficiency breakthrough was statistical process control, which led to Kaizen (a continuous improvement philosophy) in the 1980s in Japan, waterfall development, and the rise of IT service management in the 1990s. Since that time, IT development has adopted a version of the Lean Enterprise and sought effectiveness and efficiency through varieties of agile development. (See Figure A.)

 

The use of a test-driven, frequently evolving DevOps approach lays the groundwork for antifragility.

However, many companies don’t have the ability to adopt DevOps quickly, reap the benefits, and move toward an antifragile state. Compare the way enterprises adopt a DevOps approach with what Instagram (a cloud native, as in Figure 3) did from the start or what Ancestry.com (a pre-cloud enterprise) did within a few years. Instagram is a couple of years old, and Ancestry is a couple of decades old.

Figure 3: Enterprise DevOps teams must navigate the maze of legacy systems and processes 

By contrast, Nationwide Insurance has been around since 1926 and has presumably had computer systems for more than 50 years. In spite of that legacy, the company has adopted some agile development practices and is looking toward DevOps.

Vijay Gopal, vice president and enterprise chief architect at Nationwide Insurance, acknowledges that the company has “some antiquated systems that have not been developed with the degree of consistency or using the principles that we would like to have. We’re investing significantly to move toward that goal. But we also need to keep in mind that we operate in a regulated industry, and we can’t release code that could compromise customer information or some of the financial and sensitive data that we have. We would need to prove this agile development in some of the noncritical areas of our business first.”

An older and larger enterprise must deal with many, many issues when trying to make changes.

Moving toward antifragility: SpaceX

Some enterprises in established industries are taking steps toward an antifragile version of stability, even though they don’t use that term yet. Space Exploration Technologies (SpaceX), a private aerospace company founded in 2002, focuses on making commercial space travel viable by reducing launch vehicle costs. The following paragraphs describe some of the notable characteristics of the SpaceX approach.

A business strategy built on intentional disruption

SpaceX, led by PayPal and Tesla co-founder Elon Musk, expects and indeed tries to create market disruption. CIO Ken Venner says companies should expect and prepare for rapid industry change: “The guys who are successful today may not be paranoid enough to figure out that the world can get changed on them. There is a mechanism by which the world will get changed, and features will come out at a pace they can’t keep up with. That will be a problem.”

A rapid change management approach focused on small iterations, scrums, and sprints

SpaceX uses modest-sized teams or “scrums” of business analysts and developers who “sprint to the finish line” to deliver small collections of new features that can be added to the system in an automated and straightforward manner. They then repeat the process multiple times, responding to feedback that generates new insights.

Frequent IT collaboration with business units

“You’re interacting with the users more often and you’re getting stuff set up faster,” Venner notes. “So you’re all starting to quickly see, ‘If I make this change over here, what potential impact could it have somewhere else in the organization?’ You’re showing users things versus talking it through, and they’re quicker to figure out that it may or may not work.”

An operating environment designed for test drives

Plan to write the test before you code, and allocate more time to testing in general.
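A test-first workflow might look like the following minimal Python sketch, which uses the standard `unittest` module. It does not represent SpaceX code: the `parse_version` function and the release-tag format are hypothetical, chosen only to show the order of work. The test cases state the expected behavior first, and the function is then written to make them pass.

```python
import unittest


def parse_version(tag: str) -> tuple:
    """Turn a release tag such as "1.4.2" into a tuple of integers."""
    parts = tag.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a valid release tag: {tag!r}")
    return tuple(int(p) for p in parts)


class ParseVersionTest(unittest.TestCase):
    # In a test-first workflow these cases are written before the function
    # above exists; they fail at first and pass once it is implemented.

    def test_parses_three_numeric_parts(self):
        self.assertEqual(parse_version("1.4.2"), (1, 4, 2))

    def test_orders_releases_numerically(self):
        # Tuples of integers compare numerically, so 1.10.0 follows 1.9.0.
        self.assertLess(parse_version("1.9.0"), parse_version("1.10.0"))

    def test_rejects_malformed_tags(self):
        with self.assertRaises(ValueError):
            parse_version("1.4")
```

Saved to a file, the cases can be run with Python's built-in runner (`python -m unittest` followed by the file name). Each later change to `parse_version` then reruns against the same expectations.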

An ability to scale change efforts in parallel

Venner points out that cloud infrastructure makes it possible to manage several ongoing development streams simultaneously. “Virtualization lets you spin up a large number of instances of your application environment. Therefore, the ability to have multiple parallel scrums going at any one time is relatively cost-effective and relatively simple to do nowadays.”

Antifragility starts with responsiveness and the ability to make changes at will; cloud both enables and requires that responsiveness. The iterative, highly responsive, and collaborative development model used by companies such as Instagram is inspiring other users of the public cloud. “My development team is aware of how the build process takes place and how test-driven development is evolving at companies like Facebook, LinkedIn, and Google,” Venner says. “Not all of it is applicable when dealing with spaceflight, but our goal is to take best practices from multiple industries to create the best possible solutions for SpaceX. Right now we release at least once a week, sometimes midweek, and we’re trying to move toward a continuous build and integration environment.”

Encouraging antifragility: Chaos Monkey

Web-scale companies are not especially protective of their newest development, infrastructure, and operations philosophies or even their code. Netflix, LinkedIn, Facebook, Google, Twitter, and many others all share code and openly discuss how to update their sites and improve site performance.

For example, Netflix shared Chaos Monkey with the open source community. Netflix designed Chaos Monkey to shut down cloud services at random to test the fault tolerance and redundancy of the services’ customer-facing systems. The description (not to mention the name) of Chaos Monkey makes it seem like a catastrophically disruptive tool, but in a cloud-computing context, the tool is valuable. Like biological organisms, antifragile systems adapt and evolve in response to stress and changes to their environment. Chaos Monkey applies a helpful kind of stress in the cloud.
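The principle behind this kind of testing can be shown with a toy simulation in Python. This is not Netflix's Chaos Monkey code or its interface: the `Replica`, `Service`, and `unleash_chaos` names are hypothetical, and the "instances" are plain in-memory objects rather than cloud servers. The sketch shuts down one randomly chosen replica and then checks that the service still answers requests.

```python
import random


class Replica:
    """An in-memory stand-in for one running service instance."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first replica that responds."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("no healthy replica available")


def unleash_chaos(service: Service, rng: random.Random) -> Replica:
    """Shut down one randomly chosen live replica and return it."""
    live = [r for r in service.replicas if r.alive]
    victim = rng.choice(live)
    victim.alive = False
    return victim


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the run is repeatable
    service = Service(Replica(f"replica-{n}") for n in range(3))
    victim = unleash_chaos(service, rng)
    print("terminated:", victim.name)
    print(service.handle("GET /feed"))  # still answered by a survivor
```

With three replicas the service survives two random terminations and fails only on the third, which is the kind of weakness such a test is meant to surface before a real outage does.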

In 2010, Netflix executives pointed out that their adoption of Amazon Web Services (AWS) forced them to think differently about customer-facing services. The core of that new awareness was to eliminate dependencies on services delivered with varying degrees of latency. As John Ciancutti, formerly of Netflix and now director of engineering at Facebook, put it, “AWS is built around a model of sharing resources; hardware, network, storage, etc. Co-tenancy [that is, sharing services] can introduce variance in throughput at any level of the stack. You’ve got to either be willing to abandon any specific subtask, or manage your resources within AWS to avoid co-tenancy where you must. Your best bet is to build your systems to expect and accommodate failure at any level.”5
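One way to read the advice to abandon a subtask and accommodate failure is a call wrapped in a deadline with a fallback answer. The Python sketch below is an illustration under that assumption, not Netflix's implementation: `call_with_fallback`, the two recommendation functions, and the timings are hypothetical. It uses only the standard `concurrent.futures` module.

```python
import concurrent.futures
import time


def call_with_fallback(task, timeout_seconds: float, fallback):
    """Run task in a worker thread and return its result.

    If the task raises, or does not finish within timeout_seconds,
    return the fallback value instead of failing the caller.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback  # too slow: abandon the subtask
    except Exception:
        return fallback  # the subtask itself failed
    finally:
        # Do not wait for an abandoned task; the worker finishes on its own.
        pool.shutdown(wait=False)


def slow_recommendations():
    time.sleep(0.5)  # simulates a dependency with high latency
    return ["personalized-1", "personalized-2"]


def broken_recommendations():
    raise ConnectionError("dependency unavailable")


if __name__ == "__main__":
    popular = ["popular-1", "popular-2"]  # generic answer used as fallback
    print(call_with_fallback(slow_recommendations, 0.1, popular))
    print(call_with_fallback(broken_recommendations, 0.1, popular))
```

In both cases the caller still gets a usable answer within a bounded time, so variance in one dependency does not spread to the customer-facing request. The abandoned worker thread keeps running until its task ends, which a production design would also need to account for.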

The continuing software development dialogue with manufacturing

The theory and practice of organizational performance improvement may be going full circle—from every product built slightly differently to everything highly standardized and then back to high product variability. But this time, the variability is not accidental and loss producing; it is intentional and value creating, enabled by more and more granular flows of information.

Frederick Taylor’s “scientific management” transformed craft production into mass production by surfacing human knowledge and placing it into documentation, processes, and tools. Command and control management techniques imposed standards from the top down—epitomized by the notion that “You can have any color car you like as long as it’s black.” (See Figure 4.)

Figure 4: Before, manufacturing techniques informed software development; now, cloud development leads. 

W. Edwards Deming recognized that quality and value were diminished when information flowed only one way, and he taught the Japanese and then others about quality circles and plan-do-check-act (PDCA). He understood that feedback from shop floor employees was key—but in service to mass production of standard products. Mass personalization at the product level via highly programmable robotics now makes possible one-off car purchasing options—creating the appearance of a car built by hand “just for you,” as a craftsperson might have done in 1890.

Enterprise software has this same life cycle, but hasn’t always taken stock of Deming’s teachings. Enterprise applications have been very effective at standardizing information capture and distribution through enterprise resource planning (ERP), customer relationship management (CRM), and other packages. But standardization has meant a repeat of the top-down, one-way information flow Deming decried.

The web, with its direct connection to customers, raised the value of personalization and, more importantly, the need for feedback. Agile software development methods have answered that call. Now the more complete digitization of work—from back office to front office to supply chains to distribution channels and even customer consumption—faces the challenge of dynamic markets where change is constant. Leading companies are responding by running software in the cloud and adopting DevOps methods to support the continuous delivery of digital responses to market shifts.

Not moving toward antifragility yet

According to VersionOne’s latest annual survey, 48 percent of global enterprises have five or more active agile development teams.6 There isn’t much evidence yet that they’ve moved toward antifragility or even know how to do so. But the concept does make sense to some who see it on the horizon and “get” how it might become part of their current agile development initiatives.

For example, Gopal of Nationwide says, “It just scares me to think about somebody unleashing Chaos Monkey within Nationwide, but we are taking somewhat of an intermediate step. We want to have more rigorous discipline around high-availability testing, and not just when completing major projects.”

Given the nature of Nationwide’s industry, viewing technology change through a risk lens is appropriate, and the company sets goals based on a rolling risk assessment. Nationwide realizes that a new definition of stability is emerging, but hasn’t seen a feasible way to reposition itself in light of that realization yet. Gopal notes three main categories of technology risk:

  • Technical debt, such as a compromise made to meet delivery deadlines
  • Currency, such as deferring the cost of upgrades
  • Consumption management (or efficiency from the vantage point of supply levels)

The methods associated with antifragility could have an impact on all three categories.

Even before the executives interviewed for this issue of the Technology Forecast encountered the antifragile concept, agile development had been a primary source of inspiration for them. Agile methods have been used for more than a decade, so the connection between practical business goals and agile methods is quite clear at this point. For some, the goal is efficiency.

John Esser, director of engineering productivity and agile development at Ancestry.com, for example, says efficient infrastructure is critical for the 23-year-old company. “In our case, it would take as much as a couple of months to get new services provisioned and set up,” Esser says. “We didn’t have the flexibility to move resources around. Existing hardware was dedicated to current services.” Esser points out that although the development process was lengthy, “the big long pull was that of our IT infrastructure.”

At Ancestry, better configuration management via automation has been central to reducing the time it takes to provision a new service. “We definitely improved the development side, and we definitely saw increases and productivity gains there, but we could get only so far until we addressed the infrastructure issues,” Esser says.

Like Gopal, Esser tracks developments at major web companies. “I’ve been inspired by other companies, obviously Amazon and the whole idea of the infrastructure as a service and the cloud. Netflix is another one that I’ve looked toward when it comes to some of their processes. In fact, we’ve been exchanging ideas. In its day-to-day engineering, Netflix embodies a lot of what we’re trying to aspire to. Etsy is another good example of just pure scale. And Google is a great example for me, because the company has done such a great job of standardizing its infrastructure in a way it can leverage very quickly.”

Conclusion: Balancing quality, speed, collaboration, and resilience

These enterprise interviewees confirmed that they are on a journey of sorts—a quest to improve how they operate. None asserted that their development or other process improvement efforts are where they need to be yet. The antifragile concept is new to these executives, but it makes sense. Agile methods in businesses are forcing changes in IT, leading to the adoption of change management techniques in IT similar to the antifragile approach web-scale companies use to manage change.

As companies benefit from agile methods for software development and from watching what leading cloud development efforts can achieve, they are beginning to consider the possibility of using antifragility principles in support of broader organizational change. But these and other enterprises don’t view agile methods as a panacea, nor should they. During the last 80 years of building things, the emphasis has shifted from quality, to speed and efficiency, to resilience and stability, to scalability. Now, with web-scale companies beginning to work toward antifragility, the business world is returning to considerations of quality.


1 Mike Krieger, “Scaling Instagram,” Airbnb Tech Talk 2012, https://www.airbnb.com/techtalks, accessed May 29, 2013.

2 See Jez Humble, “On Antifragility in Systems and Organizational Architecture,” Continuous Delivery (blog), January 9, 2013, http://continuousdelivery.com/2013/01/on-antifragility-in-systems-and-organizational-architecture/, accessed June 28, 2013, for more information.

3 Rob England, “Kamu: a unified theory of IT management—reconciling DevOps and ITSM/ITIL” (blog), February 5, 2013, http://www.itskeptic.org/content/unified-theory, accessed May 21, 2013.

4 See Nassim Taleb’s books Antifragile: Things That Gain from Disorder (New York: Random House, 2012), 2ff, http://www.contentreserve.com/TitleInfo.asp?ID={BC3F6302-0727-4EBF-97D1-7DBE6A0570C5}&Format=410, and The Black Swan: The Impact of the Highly Improbable (New York: Random House, 2007), 43ff.

5 John Ciancutti, “5 Lessons We’ve Learned Using AWS,” The Netflix Tech Blog, December 16, 2010, http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html, accessed May 16, 2013.

6 VersionOne, “7th Annual State of Agile Development Survey,” February 26, 2013, http://www.versionone.com/pdf/7th-Annual-State-of-Agile-Development-Survey.pdf, accessed May 29, 2013.