Jan 08 2019

Lots of Data, consumed fast and in many different ways

Over the years, the IT industry has had tectonic shifts in the way we do business. From mainframes we progressed to micro-computers to distributed computing environments in order to address the massive volumes, velocity, and dimensionality of data in the modern age.

One trend that was key in liberating the software and data from the underlying hardware was the advent of virtualization technologies. These technologies helped shape the modern computing landscape. Hyper-scale Public Clouds, built on these very same technologies, are no longer novelties, but rather, are ubiquitous components in the IT strategies of most Enterprises.

We see a trend of large scale adoption of public cloud technologies motivated by the flexibilities the logical and modular design patterns provide to customers. This is also true for what used to be a bastion of enterprise data centers — massively distributed computing platforms built on bare-metal infrastructure. This is the primary engine of Big Data today — built on the model of bringing one’s compute to the data and solving complex computational problems, with near-linear scalability built into the design of these platforms.

As data grows at a mind-bogglingly rapid pace, the need to process said data with increasing agility and flexibility drives the growth of distributed computing platforms that are contained within an enterprise Big Data stack. Towards that end, having a strategy with which to handle workloads with different types of Service Level Agreements (SLAs) would help in delivering these services effectively. That would call for defining tiers of service provided, and selecting the right architectural design patterns and components to deliver them. These should span across different architectures – Bare-metal as well as the various flavors of private cloud infrastructure.

Customers look to hyper-scale public clouds for several reasons, but various sources indicate that, at any reasonable scale, these moves seldom yield cost savings compared to running the same platforms in their own data centers.

Caught between CAPEX and OPEX

In most companies, IT organizations view the business through the lens of capital and operational expenditure. In the best of these companies, IT is considered a partner and enabler of the business; in others, it is seen as a net sink of revenue.

IT organizations are very sensitive to costs and are driven to deliver higher returns for progressively lower investments. The following two terms are critical in any IT strategy discussion.

  • Capital Expenditure, or CAPEX is the upfront cost required to provide the software and hardware infrastructure for various new projects.
  • Operational Expenditure, or OPEX is the ongoing cost required to maintain the software and hardware infrastructure once they have been built and are being used to serve the business needs of the organization.

Different organizations have different mandates on how they present the costs associated with these “buckets” of expenditure. The observed trend in most companies seems to be in favor of OPEX as opposed to CAPEX. This makes the Public clouds a very attractive option from a financial standpoint, as the customers rent the infrastructure and services provided by the Public Cloud vendors, thereby spending from the OPEX bucket.

However, several sources indicate that in the long term private clouds might actually be a more cost effective solution, provided the workload and SLAs associated with it are well understood, its peaks and troughs are well established, and a strategy to accommodate it is in place.

If the workload is consistent, meaning it runs regularly over extended periods of time (for example, once per week, every month), it would be a good candidate to run on-prem in a private cloud.

Various factors can prevent organizations from adopting public clouds en masse:

  • internal compliance requirements might result in data not being easily transferable to the public cloud
  • cost
  • data gravity

These factors make a compelling case to leverage private cloud technologies to solve customer problems in a similar way public clouds do.

Key Big Data challenges for Enterprise IT Organizations

Having set this as a backdrop to our topic at hand, let us explore some of the challenges we observe that traditional Enterprise IT organizations face in the domain of Big Data technologies.

The decision-makers for adoption of Big Data technologies are usually not the IT decision-makers, although the latter do play a vital role in the process. The primary decision-makers are the business owners of the various lines of business within an organization, who seek competitive advantage in the market through deeper insights into customer behavior, market trends, and so on. These insights are enabled by a combination of Big Data technologies applied to various sources and types of data, both structured and unstructured.

In order to become truly agile and gain deep insights by leveraging Big Data technologies, customers typically deploy large-scale distributed clusters of hundreds, rather than tens, of nodes.

This is a space where hyperscale Public Clouds have a distinct advantage over most Enterprise IT shops.

  • They have specific design patterns very clearly thought through that can be leveraged very easily by end-users (not even IT infrastructure engineers) to rapidly deploy clusters for various use cases – such as machine learning, Data Science, and ad-hoc data analytics using solutions that provide schema-on-read capabilities.
  • They have the financial wherewithal to build infrastructure to meet burgeoning customer demand.

On one hand, the de facto mode of deployment in Enterprise Datacenters is using the traditional Apache Hadoop model of bare-metal servers with locally attached storage in large, scalable clusters. On the other hand, in the public clouds, there are three distinct deployment patterns which allow for far greater agility and flexibility in comparison.

  1. Virtual instances with locally attached ephemeral block storage
  2. Virtual instances with remote persistent block storage
  3. Virtual instances with locally attached ephemeral storage with data persistence in object stores

Two of these deployment patterns are relatively more mature, and available for Enterprise datacenters in the context of private cloud infrastructure, namely —

  1. Virtual instances with locally attached ephemeral block storage
  2. Virtual instances with remote persistent block storage

In the private cloud, the two primary technology options are VMware and OpenStack. Part of my work at Cloudera has been in developing reference architectures (RA) that leverage these design patterns and present best practices for these specific platforms.

The purpose of this blog is to present strategies that can be considered in order to achieve Public cloud-like flexibility and agility within private data centers, as part of the Big Data journey.

Where does Fully Converged Architecture Stand with Big Data?

The status quo for virtualized infrastructure on-premises is what is called a “Fully Converged” architecture. This involves a compute-heavy hypervisor tier on which the virtual machines (VMs) are instantiated, backed by a Storage Area Network (SAN) or Network Attached Storage (NAS) tier. From that tier, logical volumes known as LUNs (Logical Unit Numbers) or shared file systems are presented to the hypervisor, which then divvies up the storage into virtual disk devices and presents them to the VMs for data storage.

This model is based on the idea that multiple applications will share the storage tier and that the IO workload of these applications is primarily random, small-block, and transactional in nature.

Hadoop workloads are primarily sequential in nature, and the technology presupposes that the objective of applications using it is to process as much data as quickly as possible. This implies that these workloads are heavily throughput-intensive. Compute paradigms, whether traditional MapReduce (MR2), Apache Spark, or MPP engines like Apache Impala, are designed to leverage the ability of the underlying infrastructure to process data at the scale of tens or hundreds of gigabytes per second at the cluster level.

Traditional, price-optimized SAN and NAS storage arrays are not designed to provide throughput at such scales. Moreover, in order to meet rapid scalability and high throughput demands, a fully converged solution can quickly become cost-prohibitive. Convergence based on distributed storage platforms, on the other hand, provides design advantages and can be adapted for remote storage.

Anatomy of a Hadoop Worker Node

The diagram above is a slide from my talk at Strata Hadoop World San Jose in early 2017.

What’s key to note about this is that IO throughput generated at the storage layer needs to be transported over the network. That implies that the storage IO and the network IO pipes need to match in capacity.
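This matching requirement can be sanity-checked with simple arithmetic. The sketch below assumes the figure used elsewhere in this post (roughly 100 MB/s of sequential throughput per SATA spindle); adjust for your own hardware.

```python
# Rule of thumb: the network pipe must be able to carry what the disks deliver.
# Assumes ~100 MB/s sequential throughput per SATA spindle, as in the text.

def required_nic_gbps(disks: int, disk_mbps: float = 100.0) -> float:
    """Network bandwidth (Gbit/s) needed to carry full sequential disk IO."""
    return disks * disk_mbps * 8 / 1000  # MB/s -> Gbit/s

print(required_nic_gbps(12))  # 9.6  -> a single 10 GbE NIC is already near its limit
print(required_nic_gbps(24))  # 19.2 -> calls for bonded 10 GbE or 25/40 GbE
```

Note that replication and shuffle traffic come on top of this, which is why dense nodes warrant multiple high-speed NICs.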

Let us keep this point in mind for the next topic at hand.

Deconstructing IO performance in a Fully Converged solution

Let us assume that we have a dedicated SAN Storage array capable of providing 10 GB/s of storage throughput, based on the number of backend spindles (and with the clever use of a sizeable L1 cache tier and an SSD-based L2 cache tier that can accelerate reads and writes). With massive IO workloads that are generated in Hadoop, odds are that L1 and even L2 caches are going to be saturated quite easily; the performance of the storage array would be predicated on the number of spindles (disks) it holds, the type of spindles, and the underlying RAID level.

In a bare-metal node, a single SATA spindle can drive more than 100 MB/s throughput. Therefore, a 12-disk system will be able to generate roughly 1 GB/s of throughput. In a 10 Worker-node cluster, we can get approximately 10 GB/s of throughput.

NOTE: HDFS can drive throughput on JBOD (Just a bunch of disks) best. In the traditional Directly Attached Storage (DAS) servers, each disk can be driven well north of 100MB/s in sequential Read or Write IO.

Given that 2-RU, 2-socket hosts typically serve as worker nodes, we should expect 20-28 physical cores per node and at least 128 GB of RAM (needed for in-memory and MPP workloads). Assuming 20 physical cores, we get IO throughput of approximately 50 MB/s per physical core (or 25 MB/s per HT core) in a 12-drive system. If the systems have 24 drives, we should get higher throughput per node (and the core count would likely be higher as well).

Taking that as a rule of thumb, in a fully converged environment with a backend storage array capable of 10 GB/s of IO throughput, we can saturate that backend with 205 physical cores (10,240 MB/s ÷ 50 MB/s per core). That is essentially the same as 10 worker nodes.

If a standard VM size of 8 vCPUs was employed, one could run a 25-VM cluster against such a storage backend.
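The arithmetic in the last few paragraphs can be captured in a small sketch. The figures are the ones used in this section, including the simplifying assumption that one vCPU maps to one physical core; adjust them for your own environment.

```python
# Back-of-the-envelope sizing: how many cores saturate a shared storage backend?

DISK_MBPS = 100        # sequential throughput per SATA spindle
DISKS_PER_NODE = 12
CORES_PER_NODE = 20    # physical cores in a 2-RU, 2-socket worker

node_mbps = DISK_MBPS * DISKS_PER_NODE   # ~1.2 GB/s per bare-metal node
PER_CORE_MBPS = 50                       # MB/s per physical core (rule of thumb above)

def cores_to_saturate(backend_gbps: float, mbps_per_core: float) -> int:
    """Physical cores needed to saturate a storage backend of the given size."""
    return round(backend_gbps * 1024 / mbps_per_core)

cores = cores_to_saturate(10, PER_CORE_MBPS)  # 10 GB/s array
print(cores)                    # 205 physical cores
print(cores // CORES_PER_NODE)  # ~10 bare-metal workers
print(cores // 8)               # ~25 VMs at 8 vCPUs (assuming vCPU = physical core)
```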

Realistically, with a monolithic storage array shared with other services, as is common in private cloud environments, the effective throughput achievable will be significantly lower than with a dedicated array serving a single type of workload. In other words, a 10 GB/s SAN array offers little headroom beyond the equivalent of a 10-worker DAS cluster. Increasing the number of CPU cores run against this backend, or increasing storage density without increasing backend throughput, will not yield significant results against increasingly stringent and demanding SLAs.

It is therefore very important to look at such a solution on both a Cost/Capacity as well as a Cost/Performance basis. While the Cost/Capacity metric might suggest a great solution, the Cost/Performance metric will most likely reveal a very different story.

Pragmatic Hadoop in the Private Cloud

We will now go over strategies that can be considered in order to maintain the performance promises of distributed storage platforms like Hadoop, while at the same time leveraging the agility and flexibility of private cloud infrastructure.

It might be worthwhile to standardize on, and build upon, a flavor of hardware that provides high CPU core count along with direct-attached storage: for example, a storage-dense server with 50-60 SATA drives, 60 physical cores (120 HT cores), 512 GB of RAM, and high-speed networking (multiple 40 or 50 Gbps NICs per server).

Such a hypervisor will be able to accommodate 4-6 VMs, each with 10-12 physical cores, 128 GB of RAM, and 10-12 locally attached drives. This platform is worth considering in order to attain the high consolidation ratios that private cloud solutions are known for without significantly compromising the performance promise of Hadoop. Using advanced rack-awareness features such as Hadoop Virtualization Extensions (HVE), one can effectively create a highly scalable, agile, and flexible private cloud tier for Hadoop clusters.
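To illustrate the rack-awareness idea, the sketch below mimics the kind of topology script Hadoop invokes (via the `net.topology.script.file.name` property) when HVE's node-group layer is in play. The inventory mapping and hostnames here are purely hypothetical; a real script would query your VM inventory system.

```python
#!/usr/bin/env python3
"""Illustrative Hadoop topology script for a virtualized cluster.

With Hadoop Virtualization Extensions (HVE), the topology gains a
node-group layer (/rack/nodegroup/host), so HDFS avoids placing two
replicas of a block on VMs that share the same hypervisor.
"""
import sys

# Hypothetical inventory: VM hostname -> (rack, hypervisor)
INVENTORY = {
    "vm-worker-01": ("rack1", "hv1"),
    "vm-worker-02": ("rack1", "hv1"),   # same hypervisor as vm-worker-01
    "vm-worker-03": ("rack1", "hv2"),
    "vm-worker-04": ("rack2", "hv3"),
}

def resolve(host: str) -> str:
    """Return the /rack/nodegroup path for a host (defaults if unknown)."""
    rack, hypervisor = INVENTORY.get(host, ("default-rack", "default-nodegroup"))
    return f"/{rack}/{hypervisor}"

if __name__ == "__main__":
    # Hadoop passes one or more hostnames/IPs as arguments and reads the
    # corresponding topology paths back from stdout.
    print(" ".join(resolve(h) for h in sys.argv[1:]))
```

With this in place, the hypervisor becomes the failure domain rather than the individual VM, which is what mitigates the single point of failure discussed above.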

There are, of course, hardware minimums to consider in order to provide sufficient physical separation for data availability.

All the rules espoused in our reference architectures would still be applicable.

Alternatively, if the workloads are not as IO-intensive, one could build a distributed storage tier and present remote storage to the VMs. This calls for a very performant storage backend; however, that backend can be linearly scalable, just like a Hadoop cluster. As compute and storage demands grow, the remote storage backend can be evaluated and scaled out proportionately.

The following table suggests approaches that help hedge capacity while providing rapid agility and flexibility with regard to demand.

Workload Type: High throughput
Design Pattern: Virtual farm with locally attached storage
Comments: Standardize on a specific model of hardware (60-70 physical cores, 512 GB RAM, 50-60 SATA drives). Build larger VMs per hypervisor. Use HVE to mitigate the single point of failure at the hypervisor level. Network bandwidth should match or exceed IO throughput at both the per-VM and hypervisor levels.

Workload Type: Medium/Low throughput
Design Pattern: Virtual farm with distributed remote block storage
Comments: Set up a remote distributed storage cluster that can be scaled out as demand increases. Build the compute on your organization's preferred standard hardware, be it blade servers or modular servers. VM instance sizes can vary depending on workload needs. Apply QoS at the storage and network layers to ensure deterministic performance characteristics across multiple clusters.

Service delivery, orchestration and automation

For the incumbent technologies in the private cloud space, vendor-provided orchestration and automation tools already exist, as do third-party tools such as Ansible, which has a huge community of contributors and supporters providing playbooks and roles.

Bare-metal clusters are the best bet for deterministic performance and linear scalability. However, a private cloud strategy that defines different strata of services — for example, Platinum, Gold, Silver based on performance characteristics and reliability — would go a long way in ensuring that the business users get the services they need with the agility and flexibility that is required for timely consumption of their business-critical data.

Consider this table as an example —

Workload Designation                    Tier        Platform
Mission-critical                        Platinum    Bare-metal cluster
Less critical, performance-sensitive    Gold        Virtualized with direct-attached storage
Non-performance-sensitive               Silver      Virtualized with remote block storage

This does not of course preclude public cloud in the strategy. Aside from the agility and flexibility inherent in the public cloud option, there might be other business reasons to opt for public cloud too. For instance —

  • Bringing the application geographically closer to the data and the customers. For most organizations, using hyper-scale public clouds, which have a significant presence in all key markets, makes more sense than building out datacenters and infrastructure across geographical regions around the world.
  • A striated security model, where data that needs the most protection is retained on-premises, while data that has been anonymized or does not fall under stringent security requirements is moved to the public cloud.
  • Randomly variable or short-duration workloads that run only sporadically.

The future is cloudy: a hybrid of on-prem and public cloud approaches. The main objective should be to use the right tools to solve the problem at hand. That will likely involve a combination of on-prem bare-metal clusters, private cloud solutions, and public cloud solutions.

This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion.
