Q&A: How advanced virtual data warehousing architectures can benefit financial institutions


Building and deploying a data warehouse involves integrating structured and unstructured data from numerous disparate sources. This aggregated data drives an enterprise’s reporting and data analysis efforts and is a core component of business intelligence. To benefit from the increased capacity and flexibility of cloud platforms, data warehousing is moving to the cloud. This move requires new skill sets covering new technologies and entirely new architectures for processing data. We asked two of our tech partners, Melissa Siekierski and Ali Khan, to discuss innovation in data warehousing in the cloud.

Melissa Siekierski

Principal, Financial Services, Digital and Technology Innovation, PwC US

Ali Khan

Financial Services Advisory Principal, PwC US

Melissa Siekierski: What are the different types of architectures for processing low-latency, high-volume, high-throughput data-warehousing workloads in the cloud?

Ali Khan: Traditional massively parallel processing (MPP) data warehouse solutions have long been the mainstay for processing large data volumes, typically in the characteristic Inmon/Kimball paradigm. Some architectures are accompanied by dedicated physical appliances, while others are purely software-based. Either way, these solutions usually involve dedicated data pipelining, substantial a priori data canonicalization, reliance on hard-to-scale capacity that limits scale-up and scale-down elasticity, expert knowledge in performance tuning and overly restrictive assumptions about the structure of ingress data. We live in an age of contextual insight delivery and intraday petabyte-scale data processing and indexing. Traditional data warehouses simply can’t keep up.

Melissa Siekierski: In this cloud environment, what role do you expect big data platforms to play?

Ali Khan: Let’s first make sure we define what we mean by big data platforms and their design implications. The big data label is broadly applied to platforms that are built on or run atop Hadoop, which in its most basic sense comprises HDFS (Hadoop Distributed File System) and the MapReduce processing framework. Hadoop was designed around principles geared toward batch-oriented workloads. While stream processing can be enabled with Hadoop, it’s an afterthought at best. To understand the role of big data platforms in today’s low-latency, high-throughput, variable-schema data processing world is to recognize that data locality is primary: computation is expected to move across an unreliable network closer to where the data lives. In addition, transactional commits are not expected to be fully ACID. (ACID stands for atomic, consistent, isolated and durable, the four qualities that ensure data integrity.) Scaling is usually a function of horizontally adding more compute and storage capacity in a single go, given the data-locality scheme previously mentioned. Moreover, given the fixed block size of HDFS, big data solutions fare poorly for small datasets. Finally, the engineering overhead of setting up and maintaining a big data platform can be significant, although some of this pain has been mitigated by managed services from cloud providers such as Amazon EMR, Azure HDInsight and so on.
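
To make that batch orientation concrete, below is a minimal, self-contained sketch of the map/shuffle/reduce model that Hadoop popularized, written in plain Python rather than run on a real cluster; the sample records and field layout are invented for illustration.

```python
from collections import defaultdict

# Hypothetical batch of trade records; on a real cluster these would be
# blocks of an HDFS file, with the map work shipped to the nodes that
# hold each block (data locality).
records = [
    "2023-01-01 trade AAPL 100",
    "2023-01-01 trade MSFT 250",
    "2023-01-02 trade AAPL 75",
]

# Map phase: each record is processed independently, emitting (key, value).
def map_phase(record):
    _, _, symbol, qty = record.split()
    yield symbol, int(qty)

# Shuffle phase: group emitted values by key (in Hadoop this is the
# network-heavy step between mappers and reducers).
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce phase: aggregate each key's values into a final result.
totals = {symbol: sum(quantities) for symbol, quantities in grouped.items()}
print(totals)  # {'AAPL': 175, 'MSFT': 250}
```

Because the entire dataset is read, shuffled and reduced as one batch, the model favors throughput over latency, which is why record-at-a-time stream processing had to be bolted on later.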

Now we can get to the role that we expect big data platforms to play, which pertains to the fundamental shared-nothing nature of big data solutions. All compute and storage units are homogeneous and dedicated, with no sharing. This design performs well for typical data warehouse star-schema operations because there’s no contention for data structures and hardware, which makes the software design quite elegant. However, most data-oriented workloads today are very heterogeneous, while the homogeneous hardware configurations inherent in big data platforms lead to, at best, average resource utilization. For example, if a proportion of your workloads would benefit from a GPU rather than a CPU configuration, a big data-oriented solution won’t afford you that flexibility. The other major concern is membership changes in a big data cluster. Node failure in the cloud is the norm, and hence cluster membership changes can occur quite frequently. In a big data platform, this can lead to massive resharding, which can in turn degrade performance significantly because the node processing the data is the same node doing the resharding.
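
For readers less familiar with the star-schema operations referenced above, here is a small sketch using Python’s built-in sqlite3 module; the table and column names are illustrative only, not drawn from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: small, descriptive, keyed by a surrogate id.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Checking"), (2, "Savings")])

# Fact table: large and narrow, holding measures plus foreign keys
# pointing at the dimensions (the "points" of the star).
conn.execute("CREATE TABLE fact_txn (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact_txn VALUES (?, ?)",
                 [(1, 100.0), (1, 40.0), (2, 250.0)])

# The characteristic star-schema query: join facts to a dimension, then
# aggregate. Shared-nothing engines excel here because each node can
# scan its own slice of the fact table without contention.
rows = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_txn f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.name
""").fetchall()
print(rows)  # e.g. [('Checking', 140.0), ('Savings', 250.0)]
```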

Melissa Siekierski: Does this mean that big data platforms have no utility in data processing in the cloud?

Ali Khan: Big data platforms certainly have a place in the data processing landscape, especially where massive scaling is needed (for example, on the order of tens of petabytes over commodity hardware for primarily homogeneous workloads). These platforms scale well for video, sound and free-form text processing over large batch-based datasets. Moreover, if you are not constrained by relational datasets, big data solutions work. Plus, SQL DML semantics and relational support have been added through Hadoop-based solutions such as Apache Hive.
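
As one illustration of those SQL DML semantics, the snippet below submits HiveQL through the community PyHive driver. The host, database and table names are placeholders, and a reachable HiveServer2 endpoint is assumed.

```python
# Requires the third-party PyHive package: pip install "pyhive[hive]"
from pyhive import hive

# Hypothetical HiveServer2 endpoint and database.
conn = hive.Connection(host="hive.example.internal", port=10000,
                       database="analytics")
cursor = conn.cursor()

# Standard SQL DML, even though the data ultimately lives in HDFS.
cursor.execute("""
    INSERT INTO TABLE trades_summary
    SELECT symbol, SUM(quantity)
    FROM trades
    GROUP BY symbol
""")

cursor.execute("SELECT * FROM trades_summary LIMIT 10")
for row in cursor.fetchall():
    print(row)
```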

Melissa Siekierski: Is there an alternative to traditional MPP-based data warehousing platforms and big data solutions for operating real-time, heterogeneous, large data warehouses in the cloud?

Ali Khan: New offerings are evolving that are completely cloud native, managed as a service and require little to no tinkering to optimize query workloads. These offerings are resilient to cluster membership changes, support independent scaling of storage and compute resources, provide ACID semantics, support flattening and traversal of semi-structured datasets, and can survive the complete loss of a data center (DC). These solutions have markedly different designs that are black-boxed for most customers. To develop a comparative understanding of these solutions relative to big data platforms, and to employ them according to their strengths, it is imperative to lift the bonnet on these “cloud native data warehouses” and understand their inner workings. In the next installments of this series, we will offer detailed views of the inner mechanics of these cloud-based data solutions.
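
To give a feel for the semi-structured flattening mentioned above (akin to the lateral-flatten operators some cloud-native warehouses expose), here is a self-contained Python sketch; the record shape and field names are invented for illustration.

```python
import json

# A semi-structured record as it might arrive from an upstream API feed.
doc = json.loads("""
{
  "account": "A-123",
  "positions": [
    {"symbol": "AAPL", "qty": 100},
    {"symbol": "MSFT", "qty": 250}
  ]
}
""")

# Flatten: produce one relational-looking row per element of the nested
# array, repeating the parent attributes so the result can be queried
# like an ordinary table.
rows = [
    {"account": doc["account"], "symbol": p["symbol"], "qty": p["qty"]}
    for p in doc["positions"]
]

for row in rows:
    print(row)
# {'account': 'A-123', 'symbol': 'AAPL', 'qty': 100}
# {'account': 'A-123', 'symbol': 'MSFT', 'qty': 250}
```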

Key takeaways

Cloud data warehouses have emerged as the go-to repositories for amassing huge amounts of data for running advanced analytics. To accommodate the ever-increasing volumes of data that come from internal and external sources, take the time to understand the current and future size and demand for the following:

  • Data-oriented workloads
  • Scaling requirements
  • The degree to which datasets are relational

These factors influence how table structures are designed, loaded and queried, and they will therefore shape your ultimate design choices for data warehousing, big data platforms, and the potential for leveraging cloud-native offerings.
