Using DevOps and continuous delivery to move at the pace of the business

John Esser

John Esser is director of engineering productivity and agile development at Ancestry.com.

John Esser of Ancestry.com discusses his role in the company’s transformation efforts.

Interview conducted by Alan Morrison, Bo Parker, and Bud Mathaisel

PwC: How did you get started at Ancestry?

JE: I’ve been here at Ancestry about three years. I was originally brought in to move the company to agile development methods, and I spent the first year teaching them scrum and agile methods in general. Then as agile methods and that mindset took root in the company, we began expanding agile thinking into other areas of the business, particularly in operations.

As we leverage more agile methods, tools, and processes in our operations group, we’re getting to the point where I feel the business has a good foundation so we can actually practice what I call business agility. The company is getting poised to respond to different market conditions and to leverage our IT resources, services, and processes so we can react better.

We also have a good foundation for an innovation engine, and we have the ability to iterate rapidly on new product ideas. Like most companies, we’re trying to find new markets and new customers.

PwC: It sounds like you’ve explored the DevOps philosophy as well.

JE: I have a lot of experience with it.

PwC: What is Ancestry’s approach in that respect?

JE: We are looking at the public cloud, but the basic approach we’ve taken now is to develop a private cloud. When I came to the company, we had the traditional data center. Now we’re smack in the middle of a heavy effort to virtualize our data center and to provide more infrastructure services, just like a private cloud would.

PwC: Which specific requirements were most pressing in your case?

JE: The biggest requirement was to reduce the time it takes to get something set up and out to the marketplace. Although development time is required for those things, the long pole was our IT infrastructure. When working with physical machines, it would take as much as a couple of months to get new services provisioned and set up. So we had a long lead time, we didn’t have the flexibility to move resources around, and existing hardware was dedicated to current services.

PwC: Was storage also locked in?

JE: To some degree, although not as much because we had network-attached storage solutions that were a little more flexible. And our network was fairly flexible. The biggest problem was our hardware utilization. Whenever we needed to set up a new product, application, or service, we had to set up new machines, provision them, and all that goes along with it. That was our most difficult problem.

PwC: What were the key ingredients of getting the infrastructure to be agile?

JE: The number one challenge I dealt with initially was education and mindset within our operations group. Once, shortly after I was hired, I came in and said, “We want to deploy code at least once a day. We want to get to a place where we can deploy our code daily.” People basically fell out of their chairs. That just wasn’t heard of. To them, that idea was big red flags all over the place.

I wasn’t escorted out, but that idea of daily deployment was really frightening to them. What I meant was that we need to change our mindset—we can’t be a constraint on the business. I wasn’t even saying we have to roll every day, but that we need to be able to roll when the business wants to roll—and if that’s daily, then we need to do that.

Before, IT would say we could release only in particular windows—only with a certain frequency or every so often. If you needed to roll outside of these windows, then we’d have to go through a bunch of red tape to make that happen.

Challenge number two was to persuade operations that it’s possible to change by changing our processes, our tooling, and some of the technologies we use.

PwC: What happened after you got past those immediate challenges?

JE: Fairly quickly we were rolling code consistently on a two-week basis. Not every development group would release code every two weeks, but the operations group had a two-week release cadence. And their experience was that every two weeks was a nightmare. We would have problems. So when we told them we wanted to go to daily, they took their experience that they had every two weeks and thought, “I’m going to experience that every day.”

In their eyes, we had multiplied their pain by ten. And that’s why they were very resistant. But I pointed out that if you can increase the number of deployments and if every deployment were smaller, each deployment would go easier. The deployments they were doing every two weeks were huge—they had many moving pieces and interdependencies—and so invariably something was going to go wrong.
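The batch-size argument in this answer can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not Ancestry's own analysis: it assumes every individual change has the same, independent chance of causing a problem (the hypothetical `p_change`), so a deployment that bundles `n_changes` changes hits a problem with probability 1 - (1 - p)^n. The values 0.02, 50, and 5 are invented for illustration only.

```python
def batch_failure_probability(p_change: float, n_changes: int) -> float:
    """Chance that at least one of n independent changes causes a problem.

    Assumes every change fails independently with the same probability,
    which is a simplification made for this illustration.
    """
    if not 0.0 <= p_change <= 1.0:
        raise ValueError("p_change must be between 0 and 1")
    if n_changes < 0:
        raise ValueError("n_changes must not be negative")
    return 1.0 - (1.0 - p_change) ** n_changes


# Hypothetical numbers, not taken from the interview.
P_CHANGE = 0.02
big_roll = batch_failure_probability(P_CHANGE, 50)   # one large two-week roll
small_roll = batch_failure_probability(P_CHANGE, 5)  # one small daily roll

print(f"Large roll (50 changes): {big_roll:.2f}")
print(f"Small roll (5 changes):  {small_roll:.2f}")
```

Under these assumptions the same number of changes ships either way. What shrinks is the chance that any single roll goes wrong, along with the number of suspects to check when one does.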

PwC: How did you reduce the dependencies?

JE: We had our architects tell development, “You need to be able to release independently from anyone else and if you can’t do that, you must change what you’re doing.” And we put in more stringent controls to ensure backward and forward compatibility.

Sometimes we still have coordinated code rolls, of course, but in the last couple of years we’ve made a lot of progress. Different components in the architecture are much more independent. Now we can roll smaller pieces more independently, and if one of those things goes wrong, then we immediately know, “that was the thing that broke,” and you roll it back or fix forward.

PwC: Is there some kind of road map we could talk about that takes a step-by-step approach to becoming more agile and that gets into what the open source community is doing a lot of?

JE: Honestly, when I started, I didn’t have a road map. In hindsight, I have a road map. I can articulate very well the steps we went through, but at the time, the first step was to move into agile. At first, the effort was principally development and it also involved our product management group. Then the effort worked its way up into the executive ranks, where this idea of delivering value more frequently and readily started surfacing other issues. Those issues led us back to the infrastructure.

When you tell development teams that every two weeks they’re going to produce working software that should be deliverable to the customer and they say “Great,” but they can’t release it or get it to the customer, the lack of delivery points to other problems that must be resolved. Principally, those pointed to our operations area.

As I mentioned, we talked to the architects and we started pushing back and asked, “How can we architecturally decouple components?” Then the next phase was to introduce the concept of continuous delivery into the organization.

At this point we’ve standardized on that whole notion of continuous delivery. Each group is basically allowed to release value to the customer as they see fit.

PwC: Did you set some limits on the way they could do this continuous release?

JE: Yes, but it was more indirectly enforced through tooling and automation. It’s not a manual process. In the new method, everything is done in an automated fashion. As long as they conform to the automation and the tooling, then they’re fine. They really can’t release code by hand outside that tooling; they basically push a button and the code goes out.

Ultimately, we put a lot of responsibility and ownership on that team to own their service. With the continuous delivery model, the whole idea is that anytime you check in code, that code must progress through a series of quality gates and testing. It must be deployed into preproduction environments and tested in preproduction environments. Then, if it makes it through all those gates, it’s deemed ready and the team can opt to push the button and release it into production.

We hold the teams responsible. If they release new code into production and there’s a problem, we’ll do a postmortem and say, “The code you rolled resulted in a production outage of some severity. What happened, and how can we help you correct that problem?”
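The gated flow described in this answer can be sketched in a few lines. This is a minimal illustration, not Ancestry's tooling: the `Build` class, the gate names, and the `release` function are all hypothetical, and the placeholder checks simply return True where real gates would run tests or deployments. It shows the two properties mentioned above: a check-in must pass every gate in order, and passing only makes it ready, with the team deciding whether to push the button.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Build:
    """A checked-in change moving through the pipeline (hypothetical model)."""
    commit: str
    gates_passed: List[str] = field(default_factory=list)


# A gate is a name plus a check that returns True when the build may proceed.
Gate = Tuple[str, Callable[[Build], bool]]


def run_gates(build: Build, gates: List[Gate]) -> bool:
    """Run the gates in order and stop at the first one that fails."""
    for name, check in gates:
        if not check(build):
            return False
        build.gates_passed.append(name)
    return True


def release(build: Build, gates: List[Gate], button_pushed: bool) -> str:
    """Return 'blocked', 'ready', or 'released' for this build."""
    if not run_gates(build, gates):
        return "blocked"
    # Passing every gate only makes the build eligible; the team opts in.
    return "released" if button_pushed else "ready"


# Placeholder checks standing in for real tests and deployments.
EXAMPLE_GATES: List[Gate] = [
    ("unit tests", lambda b: True),
    ("deploy to preproduction", lambda b: True),
    ("preproduction tests", lambda b: True),
]

if __name__ == "__main__":
    print(release(Build("abc123"), EXAMPLE_GATES, button_pushed=False))
    print(release(Build("abc123"), EXAMPLE_GATES, button_pushed=True))
```

Keeping the release decision separate from the gate results mirrors the point that the team owns its service: automation decides what is eligible, and the team decides when it ships.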

PwC: Do you have any metrics on the number of code rolls you’ve done, and how many resulted in a problem?

JE: In the last two weeks (that’s 10 working days) we’ve done 103 code rolls. Of those 103, two resulted in a problem in production, but neither of those affected customers. In one case, the problem was detected and corrected within five minutes, and in the other case it took about an hour.

The absence of serious production problems points to the fact that we’re rolling smaller increments of functionality, and so it’s a lot easier to make sure the quality level is high.
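For readers who want the figures in this answer as rates, the arithmetic is short. The sketch below uses only the numbers quoted above (103 rolls, 2 with a production problem, 10 working days); the variable names are illustrative.

```python
# Figures quoted in the interview.
code_rolls = 103
problem_rolls = 2
working_days = 10

rolls_per_day = code_rolls / working_days
problem_rate = problem_rolls / code_rolls

print(f"Rolls per working day: {rolls_per_day:.1f}")                    # 10.3
print(f"Share of rolls with a production problem: {problem_rate:.1%}")  # 1.9%
```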

PwC: Do you have an example of how the frequent code rolls have had an impact?

JE: One of the principal case studies I have used was our 1940 census release in 2012. Things have matured even more since then. A lot of new features were released to the website to take advantage of the new census data in a different way. For example, we offered more advanced searching, including advanced image searching and image browsing. In that process, we released 16 new services in the infrastructure that supported that data release. During that campaign, we started to leverage this newer infrastructure and continuous delivery.

PwC: What sort of transition was it for your people, and to what degree did you find it necessary to bring in new talent who had experience with Chef or some of the other tools you’re using?

JE: My group is called engineering productivity, and that name came about because the focus of my group is to improve productivity and the ability of the business to deliver value to the customer more rapidly.

My group was created from the ground up. I hired folks who have particular skill sets. We didn’t really want to disrupt the core, so the traditional operations kept going in the same vein. I hired a new team that has more specialization in automation. I looked for a skill set that combined a sysadmin-type person with a developer. In our operations group, we had more sysadmin-type folks or even network-type folks. But we really didn’t have too many developer folks. So I went after that combo skill set. From there, we could build up the automation that was necessary to build up the tooling.

PwC: The creation of a group around engineering productivity with staffing and tooling sounds like a relatively new thing.

JE: I’ve never really heard of it in other places. I proposed it to gain buy-in around this idea that we wanted to make an investment in this area. Whatever the mission is, we want to make the business more productive. If we’re not doing that, then we need to figure out how we make that happen.