Razorfish’s Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data.
Interview conducted by Alan Morrison and Bo Parker
Mark Taylor is global solutions director and Ray Velez is CTO of Razorfish, an interactive marketing and technology consulting firm that is now a part of Publicis Groupe. In this interview, Taylor and Velez discuss how they use Amazon’s Elastic Compute Cloud (EC2) and Elastic MapReduce services, as well as Microsoft Azure Table services, for large-scale customer segmentation and other data mining functions.
PwC: What business problem were you trying to solve with the Amazon services?
MT: We needed to join together large volumes of disparate data sets that both we and a particular client can access. Historically, we could not join those data sets at the scale that we were able to achieve using the cloud.
In our traditional data environment, we were limited in the scope of real clickstream data that we could actually access for processing, and in the bandwidth we could leverage, because we procured a fixed size of data. We worked with a third party to manage and serve that data center.
This approach worked very well until we wanted to tie together and use SQL servers with online analytical processing cubes, all in a fixed infrastructure. With the cloud, we were able to throw billions of rows of data together to really start categorizing that information so that we could segment non-personally identifiable data from browsing sessions and from specific ways in which we think about segmenting the behavior of customers.
That capability gives us a much smarter way to apply rules to our clients’ merchandising approaches, so that we can achieve far more contextual focus for the use of the data. Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.
RV: It was slightly different from a traditional database approach. The traditional approach just isn’t going to work when dealing with the amounts of data that a tool like the Atlas ad server [a Razorfish ad engine that is now owned by Microsoft and offered through Microsoft Advertising] has to deal with.
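To make the contrast with a traditional database approach concrete, here is a minimal, single-process sketch of the map, shuffle, and reduce pattern applied to session segmentation. It is an illustration only, not Razorfish's actual pipeline: the record layout, the function names, and the "dominant category" segmentation rule are all assumptions, and a real Elastic MapReduce job would spread these phases across many machines.

```python
from collections import defaultdict

# Hypothetical, non-personally identifiable clickstream records:
# (session_id, page_category). The layout is an assumption for this sketch.
CLICKS = [
    ("s1", "shoes"), ("s1", "shoes"), ("s1", "bags"),
    ("s2", "bags"), ("s2", "bags"),
    ("s3", "shoes"),
]


def map_click(record):
    """Map phase: emit one ((session, category), 1) pair per click."""
    session_id, category = record
    yield (session_id, category), 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the click counts for one (session, category) key."""
    return key, sum(values)


def segment_sessions(clicks):
    """Assign each session to the category it clicked most often.

    Ties go to the alphabetically first category, because keys are
    visited in sorted order and only a strictly larger count wins.
    """
    mapped = (pair for record in clicks for pair in map_click(record))
    counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
    best = {}
    for (session_id, category), n in sorted(counts.items()):
        if session_id not in best or n > best[session_id][1]:
            best[session_id] = (category, n)
    return {session_id: category for session_id, (category, _) in best.items()}


SEGMENTS = segment_sessions(CLICKS)
print(SEGMENTS)
```

Because each phase only ever looks at one key at a time, the same logic can in principle be split across as many workers as the data volume requires, which is the property that a fixed infrastructure lacks.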
PwC: The scalability aspect of it seems clear. But is the nature of the data you’re collecting such that it may not be served well by a relational approach?
RV: It’s not the nature of the data itself, but what we end up needing to deal with when it comes to relational data. Relational data has lots of flexibility because of its normalized format, so you can slice and dice and look at the data in lots of different ways. But until you put it into a data warehouse format or a denormalized EMR [Elastic MapReduce] or Bigtable type of format, you really don’t get the performance that you need when dealing with larger data sets.
So it’s really that classic tradeoff; the data doesn’t necessarily lend itself perfectly to either approach. When you’re looking at performance and the amount of data, even a data warehouse can’t deal with the amount of data that we would get from a lot of our data sources.
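The tradeoff between a normalized and a denormalized layout can be sketched in a few lines. The tables, keys, and function names below are hypothetical and chosen only for illustration. The normalized form needs two lookups per row to answer the question, while the denormalized form pays for those lookups once, up front, and then answers with a single pass over flat rows.

```python
# Normalized layout: each fact lives in one place and is referenced by key.
# All table contents here are invented for the sketch.
SESSIONS = {
    "s1": {"campaign_id": "c1"},
    "s2": {"campaign_id": "c2"},
    "s3": {"campaign_id": "c1"},
}
CAMPAIGNS = {
    "c1": {"channel": "search"},
    "c2": {"channel": "display"},
}
IMPRESSIONS = [("s1", 3), ("s2", 5), ("s3", 2)]  # (session_id, views)


def views_by_channel_normalized(impressions, sessions, campaigns):
    """Flexible, but every row costs two lookups (the 'joins')."""
    totals = {}
    for session_id, views in impressions:
        campaign_id = sessions[session_id]["campaign_id"]
        channel = campaigns[campaign_id]["channel"]
        totals[channel] = totals.get(channel, 0) + views
    return totals


def denormalize(impressions, sessions, campaigns):
    """Resolve the joins once and store wide, self-contained rows."""
    rows = []
    for session_id, views in impressions:
        campaign_id = sessions[session_id]["campaign_id"]
        rows.append({
            "session_id": session_id,
            "campaign_id": campaign_id,
            "channel": campaigns[campaign_id]["channel"],
            "views": views,
        })
    return rows


def views_by_channel_denormalized(rows):
    """One scan over flat rows, with no lookups at query time."""
    totals = {}
    for row in rows:
        totals[row["channel"]] = totals.get(row["channel"], 0) + row["views"]
    return totals


FLAT_ROWS = denormalize(IMPRESSIONS, SESSIONS, CAMPAIGNS)
print(views_by_channel_normalized(IMPRESSIONS, SESSIONS, CAMPAIGNS))
print(views_by_channel_denormalized(FLAT_ROWS))
```

Both queries return the same totals. The flat rows repeat the channel value on every row and are harder to re-slice in ways nobody planned for, which is the flexibility given up in exchange for scan speed.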