Jan 08 2019
 

Lots of Data, consumed fast and in many different ways

Over the years, the IT industry has had tectonic shifts in the way we do business. From mainframes we progressed to micro-computers to distributed computing environments in order to address the massive volumes, velocity, and dimensionality of data in the modern age.

One trend that was key in liberating the software and data from the underlying hardware was the advent of virtualization technologies. These technologies helped shape the modern computing landscape. Hyper-scale Public Clouds, built on these very same technologies, are no longer novelties, but rather, are ubiquitous components in the IT strategies of most Enterprises.

We see a trend of large scale adoption of public cloud technologies motivated by the flexibilities the logical and modular design patterns provide to customers. This is also true for what used to be a bastion of enterprise data centers — massively distributed computing platforms built on bare-metal infrastructure. This is the primary engine of Big Data today — built on the model of bringing one’s compute to the data and solving complex computational problems, with near-linear scalability built into the design of these platforms.

As data grows at a mind-bogglingly rapid pace, the need to process said data with increasing agility and flexibility drives the growth of distributed computing platforms that are contained within an enterprise Big Data stack. Towards that end, having a strategy with which to handle workloads with different types of Service Level Agreements (SLAs) would help in delivering these services effectively. That would call for defining tiers of service provided, and selecting the right architectural design patterns and components to deliver them. These should span across different architectures – Bare-metal as well as the various flavors of private cloud infrastructure.

Customers look to hyper-scale public clouds for several reasons, but various sources indicate that these moves seldom yield cost savings at any reasonable scale when compared to running the platforms in their own data centers.

Caught between CAPEX and OPEX

In most IT-user companies, IT organizations look at business via the lenses of capital and operational expenditure. IT is considered a partner and enabler of business in the best of these companies and as a net-sink of revenue in others.

IT organizations are very sensitive to costs and are driven to deliver higher returns for progressively lower investments. The following two terms are critical in any IT strategy discussion.

  • Capital Expenditure, or CAPEX is the upfront cost required to provide the software and hardware infrastructure for various new projects.
  • Operational Expenditure, or OPEX is the ongoing cost required to maintain the software and hardware infrastructure once they have been built and are being used to serve the business needs of the organization.

Different organizations have different mandates on how they present the costs associated with these “buckets” of expenditure. The observed trend in most companies seems to be in favor of OPEX as opposed to CAPEX. This makes the Public clouds a very attractive option from a financial standpoint, as the customers rent the infrastructure and services provided by the Public Cloud vendors, thereby spending from the OPEX bucket.

However, several sources indicate that in the long term private clouds might actually be a more cost effective solution, provided the workload and SLAs associated with it are well understood, its peaks and troughs are well established, and a strategy to accommodate it is in place.

If the workload is a consistent entity, meaning it runs regularly over extended periods of time (for example, once per week, every month), it would be a good candidate to run on-prem in a private cloud.

Various factors can prevent organizations from adopting public clouds en masse:

  • internal compliance requirements might result in data not being easily transferable to the public cloud
  • cost
  • data gravity

These factors make a compelling case to leverage private cloud technologies to solve customer problems in a similar way public clouds do.

Key Big Data challenges for Enterprise IT Organizations

Having set this as a backdrop to our topic at hand, let us explore some of the challenges we observe that traditional Enterprise IT organizations face in the domain of Big Data technologies.

The decision-makers for adoption of Big Data technologies are usually not the IT decision-makers, although the latter do play a vital role in the process. The primary decision-makers are the business owners of the various lines of business within an organization, who seek to gain competitive advantages in the market through deeper insights into customer behaviors, market trends, and so on. All of these insights are enabled using a combination of Big Data technologies, and from various sources and types of data (unstructured and structured).

In order to become truly agile and gain deep insights by leveraging Big Data technologies, customers typically deploy large scale distributed clusters in the 100s as opposed to 10s of nodes.

This is a space where hyperscale Public Clouds have a distinct advantage over most Enterprise IT shops.

  • They have specific, clearly-thought-through design patterns that end users (not just IT infrastructure engineers) can leverage very easily to rapidly deploy clusters for various use cases, such as machine learning, data science, and ad-hoc data analytics using solutions that provide schema-on-read capabilities.
  • They have the financial wherewithal to build infrastructure to meet burgeoning customer demand.

On one hand, the de facto mode of deployment in Enterprise Datacenters is using the traditional Apache Hadoop model of bare-metal servers with locally attached storage in large, scalable clusters. On the other hand, in the public clouds, there are three distinct deployment patterns which allow for far greater agility and flexibility in comparison.

  1. Virtual instances with locally attached ephemeral block storage
  2. Virtual instances with remote persistent block storage
  3. Virtual instances with locally attached ephemeral storage with data persistence in object stores

Two of these deployment patterns are relatively more mature, and available for Enterprise datacenters in the context of private cloud infrastructure, namely —

  1. Virtual instances with locally attached ephemeral block storage
  2. Virtual instances with remote persistent block storage

In the private cloud, the two primary technology options are VMware and OpenStack. Part of my work at Cloudera has been in developing reference architectures (RA) that leverage these design patterns and present best practices for these specific platforms.

The purpose of this blog is to present strategies that can be considered in order to achieve Public cloud-like flexibility and agility within private data centers, as part of the Big Data journey.

Where does Fully Converged Architecture Stand with Big Data?

The status quo with virtualized infrastructure on-premises is the use of what is called a “Fully Converged” architecture. This involves a compute-heavy hypervisor tier on which the virtual machines (VMs) are instantiated, and a Storage Area Network (SAN) or Network Attached Storage (NAS) tier from which logical volumes known as LUNs (Logical Unit Numbers) or shared file systems are presented to the hypervisor. The hypervisor then divvies up the storage into virtual disk devices and presents them to the VMs for data storage.

This model is based on the idea that multiple applications will share the storage tier, and that the IO workload of these applications is primarily random, small-block, and transactional in nature.

Hadoop workloads are primarily sequential in nature, and the technology presupposes that the objective of the applications using it is to process as much data as possible, as quickly as possible. This implies that these workloads are heavily throughput-intensive. Map and Reduce paradigms, whether implemented with traditional MapReduce (MR2), Apache Spark, or MPP compute engines like Apache Impala, are designed to leverage the ability of the underlying infrastructure to process data at the scale of 10s or 100s of gigabytes per second at a cluster level.

Traditional and price-optimized SAN and NAS based storage arrays are not designed to provide throughput at such scales. Also, in order to meet rapid scalability and high throughput demands, a fully converged solution can quickly become cost-prohibitive. Convergence based on distributed storage platforms, on the other hand, provides design-related advantages and can be adapted for remote storage.

Anatomy of a Hadoop Worker Node


The diagram above is a slide from my talk at Strata Hadoop World San Jose in early 2017.

What’s key to note about this is that IO throughput generated at the storage layer needs to be transported over the network. That implies that the storage IO and the network IO pipes need to match in capacity.

Let us keep this point in mind for the next topic at hand.

Deconstructing IO performance in a Fully Converged solution

Let us assume that we have a dedicated SAN Storage array capable of providing 10 GB/s of storage throughput, based on the number of backend spindles (and with the clever use of a sizeable L1 cache tier and an SSD-based L2 cache tier that can accelerate reads and writes). With massive IO workloads that are generated in Hadoop, odds are that L1 and even L2 caches are going to be saturated quite easily; the performance of the storage array would be predicated on the number of spindles (disks) it holds, the type of spindles, and the underlying RAID level.

In a bare-metal node, a single SATA spindle can drive more than 100 MB/s throughput. Therefore, a 12-disk system will be able to generate roughly 1 GB/s of throughput. In a 10 Worker-node cluster, we can get approximately 10 GB/s of throughput.

NOTE: HDFS can drive throughput on JBOD (Just a bunch of disks) best. In the traditional Directly Attached Storage (DAS) servers, each disk can be driven well north of 100MB/s in sequential Read or Write IO.

Given that we typically see 2-RU, 2-socket hosts serve as worker nodes, we should expect 20-28 physical cores per node, and at least 128 GB of RAM (needed for in-memory and MPP workloads). Assuming 20 physical cores, we can get IO throughput of approximately 50 MB/s per physical core (or 25 MB/s per HT core) in a 12-drive system. If the systems have 24 drives, we should get higher throughput per node (and the core count would probably be higher as well).

Taking that as a rule of thumb, in a fully converged environment with a backend storage array capable of 10 GB/s of IO throughput, we can saturate that backend with about 205 physical cores (10240 MB/s / 50 MB/s per core). That is essentially the same as 10 worker nodes.

If a standard VM size of 8 vCPUs was employed, one could run a 25-VM cluster against such a storage backend.
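The arithmetic above can be captured in a short back-of-the-envelope sketch. The figures below (50 MB/s per physical core, 20 cores per worker, an 8-vCPU VM mapped to physical cores) are the planning assumptions used in this post, not measurements:

# Back-of-the-envelope sizing for a fully converged storage backend (planning figures only)
array_throughput_mb_s = 10 * 1024   # 10 GB/s capable storage array
per_core_mb_s = 50                  # ~50 MB/s of sequential IO per physical core (assumption)
cores_per_node = 20                 # typical 2-RU, 2-socket worker node
vcpus_per_vm = 8                    # standard VM size, assuming 1 vCPU maps to 1 physical core

saturating_cores = int(round(float(array_throughput_mb_s) / per_core_mb_s))   # ~205 cores
print("Physical cores needed to saturate the array: ~%d" % saturating_cores)
print("Equivalent bare-metal workers: ~%d" % (saturating_cores // cores_per_node))   # ~10
print("Equivalent 8-vCPU VMs: ~%d" % (saturating_cores // vcpus_per_vm))             # ~25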

Now realistically, with a monolithic storage array that is shared with other services as is normally seen in private cloud environments, the effective throughput achievable is going to be significantly lower than it would be in the case of a dedicated array allocated to a single type of workload. So given a 10-worker DAS cluster and a 10 GB/s SAN Array, there is very limited scalability possible. Increasing the number of CPU cores that are run against this backend or increasing the storage density without increasing backend throughput capabilities will not yield significant results in terms of meeting increasingly stringent and demanding SLAs.

It is therefore very important to look at such a solution on both a Cost/Capacity as well as a Cost/Performance basis. While the Cost/Capacity metric might suggest a great solution, the Cost/Performance metric will most likely reveal a very different story.

Pragmatic Hadoop in the Private Cloud

We will now go over strategies that can be considered in order to maintain the performance promises of distributed storage platforms like Hadoop, while at the same time leveraging the agility and flexibility of private cloud infrastructure.

It might be worthwhile to standardize on, and build upon, a flavor of hardware that provides a high CPU core count along with direct attached storage: perhaps a storage-dense server with 50-60 SATA drives, 60 physical cores (120 HT cores), 512 GB of RAM, and high-speed networking (multiple 40 or 50 Gbps NICs per server).

Such a hypervisor will be able to accommodate 4-6 VMs, each with 10-12 physical cores, 128 GB of RAM, and 10-12 locally attached drives. This is a platform worthy of consideration in order to attain the high consolidation ratios that private cloud solutions are famous for, without significantly compromising the performance promise of Hadoop. Using advanced rack-awareness features such as Hadoop Virtualization Extensions (HVE), one can effectively create a highly scalable, agile, and flexible private cloud tier for Hadoop clusters.
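A quick balance check helps confirm that the network pipes on such a node can carry what the disks produce, which is the matching discussed under the worker node anatomy above. The drive throughput, NIC configuration, and VM count below are illustrative assumptions:

# Balance check for the storage-dense hypervisor sketched above (illustrative numbers)
drives_per_hypervisor = 60
drive_mb_s = 100                      # ~100 MB/s sequential per SATA drive
nic_count, nic_gbps = 2, 40           # e.g. 2 x 40 Gbps NICs per hypervisor (assumption)
vms_per_hypervisor = 5

disk_mb_s = drives_per_hypervisor * drive_mb_s     # ~6000 MB/s of aggregate local disk bandwidth
net_mb_s = nic_count * nic_gbps * 1000 // 8        # ~10000 MB/s of aggregate network bandwidth

print("Disk throughput per hypervisor: ~%d MB/s" % disk_mb_s)
print("Network throughput per hypervisor: ~%d MB/s" % net_mb_s)
print("Disk throughput per VM: ~%d MB/s" % (disk_mb_s // vms_per_hypervisor))
# The network capacity should match or exceed the disk capacity at both the per-VM and
# the hypervisor level; here the 2 x 40 Gbps configuration leaves comfortable headroom.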

There are hardware minimums to be considered, of course, in order to provide sufficient physical separation for data availability.

All the rules espoused in our reference architectures would still be applicable.

Alternately, if the workloads are not as IO intensive, one could build a distributed storage tier and present remote storage to the VMs. This calls for a very performant storage backend; however, that backend can be linearly scalable, just like a Hadoop cluster. As the compute and storage demands grow, the remote storage backend can be evaluated and scaled out proportionately.

The following table suggests some approaches that help with capacity hedging, and with rapid agility and flexibility with regard to demand.

Workload Type: High Throughput
Design Pattern: Virtual farm with locally attached storage
Comments: Standardize on a specific model of hardware (60-70 physical cores, 512 GB RAM, 50-60 SATA drives). Build larger VMs per hypervisor. Use HVE to mitigate the SPOF at the hypervisor level. Network bandwidth should be sufficient to match or exceed IO throughput at the per-VM as well as the hypervisor level.

Workload Type: Medium/Low Throughput
Design Pattern: Virtual farm with distributed remote block storage
Comments: Set up a remote distributed storage cluster that can be scaled out as demand increases. Build the compute on the preferred hardware (standard to your organization), be it blade servers or modular servers. The VM instance sizes can be variable depending on workload needs. Ensure QoS is applied at the storage and network layers to ensure deterministic performance characteristics with multiple clusters.

Service delivery, orchestration and automation

For the incumbent technologies in the private cloud space, there are already vendor-provided orchestration and automation tools, as well as third-party tools such as Ansible that have a huge community of contributors and supporters, who provide recipes and such.

Bare-metal clusters are the best bet for deterministic performance and linear scalability. However, a private cloud strategy that defines different strata of services — for example, Platinum, Gold, Silver based on performance characteristics and reliability — would go a long way in ensuring that the business users get the services they need with the agility and flexibility that is required for timely consumption of their business-critical data.

Consider this table as an example —

Workload Designation: Mission-critical
Tier: Platinum
Platform: Bare-metal cluster

Workload Designation: Less critical, performance sensitive
Tier: Gold
Platform: Virtualized with direct attached storage

Workload Designation: Non-performance sensitive
Tier: Silver
Platform: Virtualized with remote block storage

This does not of course preclude public cloud in the strategy. Aside from the agility and flexibility inherent in the public cloud option, there might be other business reasons to opt for public cloud too. For instance —

  • Bringing the application closer to the data and customers, geographically. For most organizations, opting to use hyper-scale public clouds, which have a significant presence in all key markets of the world, makes more sense than building out datacenters and infrastructure in geographical regions across the world.
  • A striated security model, where the data that needs the most security is retained on-premises, while data that has been anonymized, or that does not fall under the purview of stringent security requirements, can be moved to the public cloud.
  • Randomly variable or shorter-duration workloads that run only sporadically.

The future is cloudy, a hybrid of both on-prem and public cloud approaches. The main objective should be to use the right tools to solve problems. That likely would involve using a combination of on-prem bare-metal clusters, private cloud based solutions, as well as public cloud based solutions.

This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion.

Aug 13 2018
 

The tool iperf (or iperf3) is popular as a synthetic network load generator that can be used to measure the network throughput between two nodes on a TCP/IP network. This is fine for point-to-point measurements. However, when it comes to distributed computing clusters involving multiple racks of nodes, it becomes challenging to find a tool that can easily simulate traffic that will test the network infrastructure sufficiently. One could of course set up Hadoop on the cluster (or use an existing installation) and run a TeraGen (MapReduce tool) job, which generates heavy network traffic. However, we are often required to test and validate the boundaries and limitations of a network, especially the traffic between nodes of the cluster (which crosses rack boundaries), before a software stack can be deployed on the infrastructure. It is useful to be able to use a simple and well-known tool (hence iperf3) so that it provides a common frame of reference, and the tests can easily be reproduced irrespective of the underlying infrastructure (e.g. bare-metal vs private cloud vs public cloud).

The objective of this document is to use the standard and freely available tool iperf3 to run parallel synthetic network loads that test the limits of a distributed cluster network. In other words, given a known set of iperf3 server instances, we randomly run iperf3 client sessions against them simultaneously from multiple hosts. Towards that end, the same hosts can be set up as servers as well as clients (with the caveat that we avoid running a test from any given node against itself).

The methodology introduced in this document can be used to generate synthetic network loads to test East-West network bandwidth. The diagram below shows the recommended (and standard) network topology for a distributed computing cluster (such as Hadoop).

The leaf-to-spine uplink is recommended to be sized such that each node in one rack can talk in a non-blocking fashion to each node in another rack. In other words, the E-W oversubscription ratio should be 1:1. If you have 1:1 oversubscription and each node has one 10 Gbps NIC, then you should get close to 10 Gbps in bandwidth with these tests when you are running network traffic in parallel across all nodes between two racks.

Practical considerations may result in higher oversubscription ratios, of up to 4:1. For really large clusters (more than 500 nodes), this might be even higher (7:1).
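As an illustration of how the oversubscription ratio falls out of the hardware, consider the small sketch below. The rack size, NIC speed, and uplink configuration are assumptions made for the example, not recommendations:

# E-W oversubscription for a single rack (illustrative numbers)
nodes_per_rack = 20
nic_gbps = 10                    # one 10 Gbps NIC per node
uplinks, uplink_gbps = 4, 40     # leaf-to-spine uplinks (assumption)

downlink_capacity = nodes_per_rack * nic_gbps    # 200 Gbps facing the servers
uplink_capacity = uplinks * uplink_gbps          # 160 Gbps toward the spine
ratio = downlink_capacity / float(uplink_capacity)

print("E-W oversubscription ratio: %.2f:1" % ratio)                               # 1.25:1 here; 1:1 is ideal
print("Expected cross-rack bandwidth per node: ~%.1f Gbps" % (nic_gbps / ratio))  # ~8 Gbps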

Some caveats need to be kept in mind. Hadoop has traditionally been a shared-nothing architecture, and with great emphasis on Data locality. What that means is that the worker nodes of a Hadoop cluster have locally attached drives (SATA is the norm). There are multiple copies of data stored (HDFS defaults to 3 replicas per HDFS block).

However, as network hardware has improved in terms of throughput, it has opened up avenues for flexibility vis-a-vis data locality (40 Gbps, 25 Gbps, 50 Gbps ethernet networks are fast becoming ubiquitous, with 100 Gbps uplink ports becoming more prevalent in enterprise data-centers). As we start seeing more architectures leveraging network based data storage and retrieval in these distributed computing environments, the need to gravitate towards 1:1 oversubscription will also increase correspondingly.

We need the following for this methodology to work —

  • iperf3 installed on all nodes
  • moreutils installed on all nodes
  • A clush setup (clustershell) (a laptop will do).

The clush setup involves creating an entry in the file /etc/clustershell/groups as follows —

 

mc_all:host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com host[302,304,305-333].my.company.com

mc_rack1:host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com 

mc_rack2: host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com

mc_rack3: host[302,304,305-333].my.company.com
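Before running anything against these groups, it can be useful to verify that the bracketed ranges expand to the hosts you expect. ClusterShell ships a Python API using the same nodeset syntax that clush understands; the group below is a shortened, illustrative version of the mc_rack1 entry above:

# Sanity-check nodeset expansion with ClusterShell's Python API
from ClusterShell.NodeSet import NodeSet

rack1 = NodeSet("host[102,104,106,108,110].my.company.com")   # truncated example group
print(len(rack1))        # 5
for node in rack1:
    print(node)          # host102.my.company.com, host104.my.company.com, ...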

Install iperf3 and moreutils as follows —

# clush -g mc_all 'sudo yum install -y iperf3 moreutils'

After this is done, launch iperf3 in Daemon mode on all the hosts as follows —

clush -g mc_all -l johndoe sudo iperf3 -sD -p 5001

clush -g mc_all -l johndoe sudo iperf3 -sD -p 5002

clush -g mc_all -l johndoe sudo iperf3 -sD -p 5003

Verify that the iperf3 Daemons are running —

$ clush -g mc_all -l johndoe 'ps -ef|grep iperf3|grep -v grep'

host102.my.company.com: root      10245 1 14 12:50 ? 00:06:42 iperf3 -sD -p 5001

host106.my.company.com: root       9495 1 12 12:50 ? 00:05:36 iperf3 -sD -p 5001

host104.my.company.com: root       9554 1 9 12:50 ? 00:04:20 iperf3 -sD -p 5001

<truncated for readability>

host315.my.company.com: root      33247 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host224.my.company.com: root      33136 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host323.my.company.com: root      33257 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host318.my.company.com: root      32868 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host236.my.company.com: root      33470 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host236.my.company.com: johndoe   33734 33492 22 13:34 ? 00:00:10 iperf3 -c host120.my.company.com -p 5001 -t 60

 

After all the nodes have been set up to run iperf3 server instances in daemon mode, create multiple nodelist files. In this case we are testing cross-rack bandwidth, so we set up one nodelist per rack, as follows —

$ ls -lrt 

total 32

-rw-r--r--  1 johndoe  staff 806 Aug  9 14:52 nodes.rack3

-rw-r--r--  1 johndoe  staff 520 Aug  9 14:52 nodes.rack2

-rw-r--r--  1 johndoe  staff 520 Aug  9 14:52 nodes.rack1

-rwxr-xr-x  1 johndoe  staff 221 Aug  9 14:52 random_net.sh

-rwxr-xr-x  1 johndoe  staff 221 Aug  9 14:52 randomnet_single.sh

Distribute these files to all nodes of the cluster. The assumption is that you already have clush set up (so you have the keys to the kingdom, so to speak — i.e. a user id on each node with sudo privileges to root, etc.).

$ clush -g mc_all -c nodes.rack* --dest=/home/johndoe

$ clush -g mc_all -c randomnet*.sh --dest=/home/johndoe

Each file contains a list of all nodes in the specific rack. Eg:

$ cat nodes.rack1 

host102.my.company.com

host104.my.company.com

host106.my.company.com

host108.my.company.com 

host110.my.company.com

The shell script random_net.sh is intended to randomly select a target node against which to run the iperf test. This is to be run on the client side, say from Rack2 (client) to Rack1 (server) —

#!/bin/bash

# Read nodelist into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat $1))'

# Create an array of ports on which iperf3 server daemon instances will be running
ports=(5001 5002 5003)
psize=${#ports[@]}
size=${#nodes[@]}

for i in $(seq $size)
do
    index=$(( $RANDOM % size ))
    pindex=$(( $RANDOM % psize ))
    target=${nodes[$index]}
    ptarget=${ports[$pindex]}
    iperf3 -c $target -p $ptarget | ts | tee -a ~/iperf3.log     # Run payload from client to server
    iperf3 -c $target -R -p $ptarget | ts | tee -a ~/iperf3.log  # Reverse the direction
done

In order to run against a single iperf3 server instance per node, run this script (randomnet_single.sh) instead —

#!/bin/bash

# Read nodelist into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat $1))'

size=${#nodes[@]}

for i in $(seq $size)
do
    index=$(( $RANDOM % size ))
    target=${nodes[$index]}
    iperf3 -c $target -p 5001 | ts | tee -a ~/${target}_iperf3.log     # Run payload from client to server
    iperf3 -c $target -R -p 5001 | ts | tee -a ~/${target}_iperf3.log  # Reverse the direction
done

Run the script(s) as follows —

$ clush -g mc_rack2 -l johndoe 'sh random_net.sh nodes.rack1'

Or

$ clush -g mc_rack2 -l johndoe 'sh randomnet_single.sh nodes.rack1'

This example shows the test running in parallel on all nodes of Rack2 by randomly selecting a node in Rack1 as the target. The output of the tests can be parsed to generate usable reports that will help ascertain if there is a bottleneck anywhere.
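As a starting point for such a report, a short parser can pull the per-run summary lines (the ones ending in "sender" or "receiver") out of the captured clush output and aggregate them per client host. This is a sketch; it assumes the clush output shown below has been redirected to a file whose name is passed as the first argument:

# Summarize per-host iperf3 results from captured clush output
import re
import sys
from collections import defaultdict

summary = defaultdict(list)
pattern = re.compile(r'^(?P<host>\S+?):.*\s(?P<bw>[\d.]+)\s+Gbits/sec\s.*(sender|receiver)\s*$')

with open(sys.argv[1]) as fh:      # e.g. the clush run's output, redirected to a file
    for line in fh:
        m = pattern.match(line)
        if m:
            summary[m.group('host')].append(float(m.group('bw')))

for host, rates in sorted(summary.items()):
    # Hosts averaging well below line rate (e.g. < 8 Gbits/sec on a 10 GbE NIC with 1:1
    # oversubscription) are worth a closer look.
    print('%-30s runs=%2d avg=%5.2f min=%5.2f max=%5.2f Gbits/sec'
          % (host, len(rates), sum(rates) / len(rates), min(rates), max(rates)))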

$ clush -g mc_rack2 -l johndoe 'sh random_net.sh nodes.rack1'

host218.my.company.com: Aug 09 14:31:17 Connecting to host host108.my.company.com, port 5001

host218.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.59 port 54962 connected to 192.168.1.14 port 5001

host218.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr Cwnd

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-1.00 sec 646 MBytes 5.42 Gbits/sec 80   595 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   1.00-2.00 sec 542 MBytes 4.55 Gbits/sec 50   561 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   2.00-3.00 sec 574 MBytes 4.81 Gbits/sec 29   577 KBytes 

host204.my.company.com: Aug 09 14:31:17 Connecting to host host140.my.company.com, port 5001

host204.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.52 port 47034 connected to 192.168.1.30 port 5001

host204.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr Cwnd

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-1.00 sec 870 MBytes 7.30 Gbits/sec 38   799 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   1.00-2.00 sec 626 MBytes 5.25 Gbits/sec 28   454 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   2.00-3.00 sec 516 MBytes 4.33 Gbits/sec 19   512 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   3.00-4.00 sec 590 MBytes 4.95 Gbits/sec 19   656 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   4.00-5.00 sec 581 MBytes 4.88 Gbits/sec 88   649 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   5.00-6.00 sec 570 MBytes 4.78 Gbits/sec 19   592 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   6.00-7.00 sec 561 MBytes 4.71 Gbits/sec 41   560 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   7.00-8.00 sec 589 MBytes 4.94 Gbits/sec 91   563 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   8.00-9.00 sec 539 MBytes 4.52 Gbits/sec 46   479 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   9.00-10.00 sec 570 MBytes 4.78 Gbits/sec 68   607 KBytes 

host204.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -

host204.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec 457             sender

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec              receiver

host204.my.company.com: Aug 09 14:31:17 

host204.my.company.com: Aug 09 14:31:17 iperf Done.

host218.my.company.com: Aug 09 14:31:17 [ 4]   3.00-4.00 sec 636 MBytes 5.34 Gbits/sec 12   484 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   4.00-5.00 sec 508 MBytes 4.26 Gbits/sec 65   433 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   5.00-6.00 sec 384 MBytes 3.22 Gbits/sec 62   566 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   6.00-7.00 sec 632 MBytes 5.30 Gbits/sec 69   519 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   7.00-8.00 sec 595 MBytes 4.99 Gbits/sec 30   650 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   8.00-9.00 sec 564 MBytes 4.73 Gbits/sec 45   478 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   9.00-10.00 sec 525 MBytes 4.40 Gbits/sec 26   444 KBytes 

host218.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -

host218.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.47 GBytes 4.70 Gbits/sec 468             sender

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.47 GBytes 4.70 Gbits/sec              receiver

 

NOTE: This blog documents my personal opinions only and does not reflect my employer’s positions on any subject written about here. 

Mar 18 2014
 

This is an old one. I thought I’d post this before it falls off main memory.

You will need two things for this. I’ll start with the obvious.

  1. A splunk server configured to act as a deployment server (which will result in apps, configurations, etc being deployed centrally)
  2. A centralized “bastion” host, that has root-level ssh-based trust established with your target hosts (ie splunk agents). Another option is to have unprivileged ssh key-based trusts set up, but with local sudo privileges (passwordless) on each of your target hosts.

Once you do that, you can run a modification of this script shown below —

I’d written it for an old shop of mine, where we managed Solaris on SPARC, Solaris on x86, and Red Hat Linux. Modify/write functions corresponding to your target OS of choice and it should do pretty much what you need rather easily.

Another thing to watch out for is to make sure the version of Splunk you are working with hasn’t changed the underlying mechanisms. I had written this for Splunk v4.x at least 3 years back (I can’t remember how long ago now – this is one script I didn’t put in version control, so there is no way for me to tell how old it is…and the timestamps changed because I copied the files from my old laptop, etc).

 

Mar 12 2014
 

Okay, this has been a long time coming, but thought I’d write this down before I forget all about it.

First a little bit of a rant. In the EMC world, we assign LUNs to hosts and identify them using a hex-id code called the LUN ID. This, in conjunction with the EMC Frame ID, helps us uniquely and “easily” identify the LUNs, especially when you consider that there are usually multiple EMC frames and hundreds or thousands of SAN-attached hosts in most modern enterprise data centers.

Okay, the challenge arises when using a disk multipathing solution other than EMC’s expensive PowerPath software, or the exorbitantly expensive Veritas Storage Foundation suite from Symantec.

Most self-respecting modern operating systems have disk multipathing software built in and freely available, and the focus of this blog is Solaris. Solaris has a mature and efficient multipathing solution called MPxIO, which cleverly creates a pseudo-device corresponding to the complex of the multiple (or single) paths via which a SAN LUN is visible to the operating system.

The MPxIO driver then uses the LUN’s global unique identifier (GUID) to identify the LUN, which is arguably the right way to do it uniquely. It is, however, a challenge for humans, since the GUID is a 32-character string (very nicely explained here: http://www.slideshare.net/JamesCMcPherson/what-is-a-guid).

When I first proposed to our usual user community – Oracle DBAs, that they would have to use disks named as shown below, I faced outraged howls of indignation. “How can we address these disks in ASM, etc?”

c3t60060160294122002A2ADDB296EADE11d0

To work around this, we considered creating an alternate namespace, say /dev/oracle/disk001 and so forth, retaining the major/minor numbers of the underlying devices. But that would get hairy real quick, especially on those databases where we had multiple terabytes of storage (hundreds of LUNs).

If you are working with CLARiiON arrays, then you will have to figure out some other way (than what I’ve shown in this blog post) to map your MPxIO/Solaris disk name to the valid hex LUN ID.

The code in this gist will cover what’s what. A friendly storage admin told me about this, wrt VMAX/Symmetrix LUNs. Apparently EMC generates the GUID (aka UUID) of LUNs differently based on whether it is a CLARiiON/VNX or a Symmetrix/VMAX.

In it, the fields that are extracted into the variables $xlunid1 and so forth, and converted to their hex values, are the pertinent pieces of information. Having this information reduces our need to install symcli on every host just to extract the LUN ID that way.

This then will give us the ability to map a MPxIO/Solaris Disk name to an EMC LUN ID.

The full script is here — https://github.com/implicateorder/sysadmin-tools/blob/master/getluninfo.pl

Jan 10 2014
 

This is in continuation of my previous post regarding Tactical vs Strategic thinking. I received some feedback about the previous post and thought I’d articulate the gist of the feedback and address the concern further.

The feedback – It seems like the previous post was too generic for some readers. The suggestion was to provide examples of what I consider “classic” tactical vs strategic thinking.

I can think of a few off the top, and please forgive me for using/re-using clichés or being hopelessly reductionist (but they are so ripe for the taking…).

I’ve seen several application “administrators” and DBAs refusing to automate startup and shutdown of their application via init scripts and/or SMF (in case of Solaris systems).

Really? An SAP basis application cannot be automatically started up? Or a Weblogic instance?

Why?

Because, *sigh and a long pause*…well because “bad things might happen!”

You know I’m not exaggerating about this. We all have faced such responses.

To get to the point, they were hard-pressed for valid answers. Sometimes we just accept their position, filed under the category “it’s their headache”.  That aside, a “strategic” thing to do, in the interest of increased efficiency and better delivery of services would be to automate the startup and shutdown of these apps.

An even more strategic thing to do would be to evaluate whether the applications would benefit from some form of clustering, and then automate the startup/shutdown of the app, building resource groups for various tiers of the application and establishing affinities between various groups.

For example, a 3-tier web UI based app would have an RDBMS backend, a business logic middle tier and a web front end. Build three resource/service groups using a standard clustering software and devise affinities between the three, such that the web ui is dependent on the business layer and the business layer dependent on the DB. Now, it’s a different case if all three tiers are scalable (aka active/active in the true sense).

An even more strategic approach would be to establish rules of thumb regarding use of cluster ware, or highly available virtual machines. Establishing service level tiers, that deliver different levels of availability (I’ve based it off the service level provided by vendors in the past).

This would call for standardization of the infrastructure architecture and as a corollary thereof, associated considerations would have to be addressed.

For eg:

  1. What do you need to provide a specific tier of service?
  2. Do you have the budget to meet such requirements (it’s significantly more expensive to provide a 99.99% planned uptime as opposed to 99%)?
  3. Do the business owners consider the app worthy of such a tier?
  4. If not, are you sure that the business owners actually understand the implications of this going down? (How much money are they likely to lose if this app was down – better if we can quantify it)

All these and a slew of other such questions will arise and will have to be responded to. A mature organization will already have worked a lot of these into their standard procedures. But it is good for engineers at all levels of maturity and experience to think of these things. And build an internal methodology, keep refining it, re-tooling it to adapt to changes in the IT landscape. What is valid for an on-premise implementation might not be applicable for something on a public cloud.

Actually, this topic gets even murkier when we consider public IaaS cloud offerings. Those of us who have predominantly worked in the enterprise often find it hard to believe that people actually use public IaaS clouds, because, realistically, an on-premise offering can provide better service levels and reliability than these measly virtual machines can. While the lure of a compute-on-demand, pay-as-you-go model is very attractive, it is perhaps not applicable for every scenario.

So then you have to adapt your methodology (if there is an acceptable policy to employ public IaaS clouds). More questions around deployments might now need to be prepended to our list above:

  1. How much compute will this application require?
  2. Will it need to be on a public cloud or a VM farm (private cloud) or on bare-metal hardware?

I used to think that almost everyone in this line of business thought about Systems Engineering on these lines. But I was very wrong. I have met and worked with some very smart engineers, who have some vague idea about these topics, or have thought through some aspects of these topics.

Many don’t take the effort to learn how to do a TCO calculation, or to show an RoI for the projects they are involved in (even to the extent of knowing it for their own sakes). As “bean-counter-esque” as these subjects might seem, they are very important (imho) for taking your repertoire as a mature Engineer to the next level.

And Cost of ownership is often times neglected by Engineers. I’ve had heated discussions with my colleagues at times regarding why One technology was chosen/recommended over another.

One that comes to mind was when I was asked to evaluate and recommend one of the two technologies – Veritas Cluster Server and Veritas Storage Foundation vs Sun Cluster/ZFS/UFS combination. The shop at that time was a heavy user of Solaris + SPARC + Veritas Storage Foundation/HA.

The project was very simple – we wanted to reduce the physical footprint of our SPARC infrastructure (then comprised of V490s, V890s, V440s, V240s, etc). So we chose to use a lightweight technology called Solaris Containers to achieve that. The apps were primarily DSS-type and batch oriented. We opted to use Sun’s T5440 servers to virtualize (there were no T4s or T5s in those days, so we took the lumps on the slow single-thread performance; we weren’t really upgrading the performance characteristics of the servers we were P2V’ing, but we did get a good consolidation ratio).

As a proof of concept, we baked off a pair of two-node clusters using Veritas Cluster Server 5.0 and Sun Cluster 3.2. Functionally, our needs were fairly simple. We needed to ascertain the following –

  • Can we use the cluster software to lash multiple physical nodes into a Zone/Container farm
  • How easy is it to set up, use and maintain
  • What would be our TCO and what would our return on investment be, choosing one technology over another

Both cluster stacks had agents to cluster and manage Solaris Containers and ZFS pools. At the end it boiled down to two things really –

  • The team was more familiar with VCS than Sun Cluster
  • VCS + Veritas Storage Foundation cost 2-3x more than Sun Cluster + ZFS combination

The difference in cost of ownership was overwhelming. On the other hand, while the team (with the exception of myself) wasn’t trained in Sun Cluster, we weren’t trying to do anything outrageous with the cluster software.

We would have a cluster group for each container we built, comprising the ZFS pools that housed the container’s OS and data volumes, and a zone agent that managed the container itself. The setup would then comprise the following steps:

  1. add storage to the cluster
  2. set up the ZFS pool for new container being instantiated
  3. install the container (a 10 minute task for a full root zone)
  4. create a cluster resource group for the container
  5. create a ZFS pool resource (called HAStoragePlus resource in Sun Cluster parlance)
  6. Create an HA-Zone resource (to manage the zone)
  7. bring the cluster resource group online/enable cluster-based monitoring

These literally require 9-10 commands on the command line. Lather, rinse, repeat. So, when I defended the design, I had all this information as well as the financial data to support why I recommended using Sun Cluster over Veritas. Initially a few colleagues were resistant to the suggestion, over concerns about supporting the solution. But it was a matter of spending 2-3x more up-front and continually for support vs spending $40K on training over one year.

Every organization is different, some have pressures of reducing operational costs, while others reducing capital costs. At the end, the decision makers need to make their decisions based on what’s right for their organization. But providing this degree of detail as to why a solution was recommended helps to eliminate hurdles with greater ease. Unsound arguments usually fall apart when faced with cold, hard data.

Jan 09 2014
 

Tactical vs Strategic thinking

Let me attempt to delineate the two, in context of IT and specifically Infrastructure (because without that, these are just buzz-words that are being tossed around).

Tactical – This calls for a quick-thinking, quick-fixing type of mentality — for example, when we need to fix things that are broken quickly, or avert an impending disaster just this one time – essentially bandaids, repairs, and so on.

Strategic – This calls for vision, of looking at the complete picture and devising multiple responses (each as a failsafe for another possibly) to address as many things as possible that might affect your infrastructure and operations.

If any of you are chess players, you will know what I mean. A good chess player will evaluate the board, think ahead 5-7 moves, factoring in the various moves/counter-moves possible and then take a step. That is strategic thinking. A tactical player will play the game a move at a time (or at best a couple of moves ahead). He might save his piece for that moment (i.e. avert immediate damage), but he will most likely set stage for defeat, if he is playing a strategic opponent.

Some of the readers might be thinking that this is all great for talking, but how does this translate to the role of a Systems Engineer/Administrator? Surely, managing IT infrastructure is not a game of chess?

To that, my response is – But it is!

We, by virtue of our trade have taken on a big responsibility. We provision, manage and service the infrastructure that our respective organizations depend on (and perhaps a huge part of the organization’s revenue stream depends on it). So, it is our job (as individuals as well as a collective within each organization) to be able to foresee potential pitfalls, constantly evaluate what we are doing well and what we aren’t (be it process related, or technology related) and devise plans to address any issues that might crop up.

In some organizations, the management team would like to consider the “strategic” activities to be their role. But the fact is, good managers will always take the counsel of their teams in order to make successful and strategic decisions. Autocracy is not very conducive to successful decision making. Now, while I’m no management guru, I have worked in this industry for close to two decades and my opinions are based on my observations (And I will be happy to stand and defend my informal thesis with any Management Guru that would like to discuss this with me).

Moreover, most individuals in the management space have outdated knowledge of technology. They more than likely aren’t experts in any of the fields that their team supports and even if they had been experts in the past, they likely rely on vendors and VARs to  fill their knowledge gap. Now, we all know how reliable VARs and vendors can be to provide unbiased knowledge that will help with sound decision making.

But if you ask me that is not a weakness, but a strength. Managers (with a few exceptions) who are too close to technology tend to get bogged down in the minutiae of things, because they usually only look at it from the outside (as supervisors and not technicians). A good manager will necessarily include his/her team (or at least a few key individuals from the team) in their decision making process.

  • A manager I’ve had great pleasure of working with would take his team out every year for a 2-3 day planning session at the beginning of the year.
  • He would provide guidance of what the organizational goals were for the fiscal year (and what the business strategies were)
  • and then the team would brainstorm on what we would need to do to address the various components of infrastructure that we managed.
  • We would return back to the office at the end of the session with maybe 120-150 items on the board, categorized as small projects, large projects or tactical initiatives, divvy them up between the team members based on skills and interests
  • We would then, together with our manager, align these projects and tactical initiatives with the organizational goals and try and show how they help us help the organization, or help make a business case for cost savings, or increased efficiencies, and so on.
  • These would not only help us design our annual goals and objectives into measurable items, but also help the manager showcase what his team was doing to his manager(s) (and so on, up the food chain).

Strategic thinking is an art, but it is also a philosophical inclination, if you ask me. If we always assess a situation by looking not just at the short-term effects but at what the cause is, and evaluate whether it is an aberration, a flaw in a process, or whatever else the reason might be, then over time we can internalize this type of thinking, so that we can quickly look into the future, see what the implications of an event may be, and thereby take corrective actions or devise solutions to address them.

While it might seem that one-off solutions, workarounds and break-fixes are tactical, it is possible to have a strategy around using tactical solutions too. When you orchestrate a series of rules around where tactical thinking is applicable, you are in essence setting the stage for an over-arching strategy towards some end.

This is precisely what good chess players do – they strategize, practice, practice, practice. Then they do some more. Until they can recognize patterns and identify “the shape of things to come” almost effortlessly.

We, as IT professionals need to be able to discern patterns that affect our professional lives. These are patterns that are process-oriented, social or technological.

We have tools to look at the technological aspect – collect massive amounts of logs, run it through a tool such as Splunk or something similar, that will let us visualize the patterns easily.

The other two are harder – the hardest being social. To be aware of things as they transpire around us, in general, to be aware.

If you are interested, check this book out – The Art of Learning, by a brilliant chess player – Josh Waitzkin (the book and subsequent movie made on his life titled “Searching for Bobby Fischer”). The book of course doesn’t have anything to do with IT or infrastructure (it is about learning, chess and Tai Chi). But the underlying principles of the book resonated deeply with me in my quest to try and articulate my particular and peculiar philosophy that drives both my personal as well as my professional life.

As intelligent beings, we have the innate ability to look at and recognize patterns. The great 16th century Japanese swordsman – Miyamoto Musashi wrote in his famous book titled “Go Rin No Sho – The Book of Five Rings”  about how Rhythms are pervasive in everything (and while his overt focus was on Swordsmanship, there was a generic message in there as well) and how a good swordsman needs to recognize these rhythms (the rhythm of the battle, the rhythm of his opponent, and so on).

So I leave you with a quote from the Go Rin No Sho –

There is rhythm in everything, however, the rhythm in strategy, in particular, cannot be mastered without a great deal of hard practice.

Among the rhythms readily noticeable in our lives are the exquisite rhythms in dancing and accomplished pipe or string playing. Timing and rhythm are also involved in the military arts, shooting bows and guns, and riding horses. In all skills and abilities there is timing.

There is also rhythm in the Void.

There is a rhythm in the whole life of the warrior, in his thriving and declining, in his harmony and discord. Similarly, there is a rhythm in the Way of the merchant, of becoming rich and of losing one’s fortune, in the rise and fall of capital. All things entail rising and falling rhythm. You must be able to discern this. In strategy there are various considerations. From the outset you must attune to your opponent, then you must learn to disconcert him. It is crucial to know the applicable rhythm and the inapplicable rhythm, and from among the large and small rhythms and the fast and slow rhythms find the relevant rhythm. You must see the rhythm of distance, and the rhythm of reversal. This is the main thing in strategy. It is especially important to understand the rhythms of reversal; otherwise your strategy will be unreliable.

Jan 06 2014
 

I’ve just recently revived my interest in CMS (Content Management Systems) – specifically Joomla!

Why? Because I created this “portal” (http://www.medhajournal.com) and though I seriously doubt I have the time to actively maintain it, I’ve spent way too many hours working on it, to just abandon it on the wayside.

Medha Journal has evolved since its inception in 2007 from a Joomla 1.x version to Joomla 1.5.x, with a large hiatus in the middle (from 2010 through the last week of December 2013), until I just upgraded it (last week) from its rickety 1.5 version to the latest stable 2.5.x version of Joomla.

For those who don’t know what Joomla is, here’s a summary –

It is a PHP based open source Content Management System that can be extended to do more or less anything, using a massive collective Extensions library (sort of like CPAN), that is community driven.

So, to get back to the topic at hand. Since I had not dabbled with Joomla (besides the user-land/power-user oriented work – primarily content editorial stuff) in a long time, I decided to take the plunge with some down time in hand (the holidays) and upgrade it from 1.5 to 2.5.

In the course of this migration, I also switched from the default Joomla content management to a tool called K2 (which, realistically, is better suited for social-media-oriented content portals such as Medha Journal).

One major issue I ran into was this –

The old version of the website used an extension called “jomcomment”, which was developed by some folks down in Malaysia or Indonesia and allowed for some better spam control. I used this in conjunction with the myblog extension, developed by the same group of developers, to give my users a seamless (or relatively seamless) experience as they added content and commented on each other’s works.

However, with the newer versions of Joomla that are currently active (2.5.x and 3.x), these extensions don’t work. Over the years, we had accumulated thousands of comments on the north of 2,000 articles collected on the Journal. So it was imperative to migrate these.

K2 (http://getk2.org) has a very nice commenting system with built-in Akismet spam filters, etc. So the new site would obviously use this. Migration was an interesting proposition, since the table structures (jomcomment to K2 comments) were not identical.

Old table looked like this:

 

Field Type Null Key Default Extra
id int(10) NO PRI NULL auto_increment
parentid int(10) NO 0
status int(10) NO 0
contentid int(10) NO MUL 0
ip varchar(15) NO
name varchar(200) YES NULL
title varchar(200) NO
comment text NO NULL
preview text NO NULL
date datetime NO 0000-00-00 00:00:00
published tinyint(1) NO MUL 0
ordering int(11) NO 0
email varchar(100) NO
website varchar(100) NO
updateme smallint(5) unsigned NO 0
custom1 varchar(200) NO
custom2 varchar(200) NO
custom3 varchar(200) NO
custom4 varchar(200) NO
custom5 varchar(200) NO
star tinyint(3) unsigned NO 0
user_id int(10) unsigned NO 0
option varchar(50) NO MUL com_content
voted smallint(6) NO 0
referer text NO NULL

The new table looks like this:

Field Type Null Key Default Extra
id int(11) NO PRI NULL auto_increment
itemID int(11) NO MUL NULL
userID int(11) NO MUL NULL
userName varchar(255) NO NULL
commentDate datetime NO NULL
commentText text NO NULL
commentEmail varchar(255) NO NULL
commentURL varchar(255) NO NULL
published int(11) NO MUL 0

This called for some manipulation of the data extracted from the old jomcomment tables, before they got inserted back into the K2 comments table in the new database (both were MySQL).

So, I imported the data out of the old table using phpMyAdmin, massaged the data, modified it to fit the new table structure and imported it back in.

The hardest part was ensuring that the date field of the extracted CSV imported into the new table correctly. For that, I had to manually adjust the date format. One way is as shown here.

Another way is (if your data fits in Excel), manually set the column format (corresponding to your date field) to something that matches the MySQL format (in this case, yyyy-mm-dd hh:mm).
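For completeness, here is a rough sketch of that date massage in Python, assuming the jomcomment rows were exported to CSV and the date column came out in a non-MySQL format. The file names, column index, and source date format are made-up placeholders; adjust them to match your actual export:

# Rewrite the date column of an exported CSV into MySQL's DATETIME format (yyyy-mm-dd hh:mm:ss)
import csv
from datetime import datetime

DATE_COL = 4                      # hypothetical position of the old "date" field in the export
SRC_FMT = '%m/%d/%Y %I:%M %p'     # e.g. "03/12/2009 7:45 PM"; adjust to whatever the export produced
DST_FMT = '%Y-%m-%d %H:%M:%S'     # what MySQL expects

with open('jomcomment_export.csv', newline='') as src, \
     open('k2_comments_import.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[DATE_COL] = datetime.strptime(row[DATE_COL], SRC_FMT).strftime(DST_FMT)
        writer.writerow(row)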

 

Jan 06 2014
 

In a past life, when I used to work for a wireless service provider, they used a vended application to evaluate how much data bandwidth customers were consuming, and that data was sent to the billing system to form the customers’ monthly bills.

The app was poorly written (imho) and woefully single-threaded, incapable of leveraging the oodles of compute resources that were provided in the form of twelve two-node VCS/RHEL based clusters. There were two data centers and there was one such “cluster of clusters” at each site.

The physical infrastructure was pretty standard and over-engineered to a certain degree (the infrastructure under consideration having been built in the 2007-2008 timeframe) – DL580 G5 HP servers with 128 GB of RAM each, a pair of gigabit ethernet NICs for cluster interconnects, another pair bonded together as public interfaces, and 4 x 4 Gb FC HBAs through Brocade DCX core switches (dual fabric) to an almost dedicated EMC CLARiiON CX4-960.

The application was basically a bunch of processes that watched traffic as it flowed through the network cores and calculated bandwidth usage based on end-user handset IP addresses (I’m watering it down to help keep the narrative fast and fluid).

Each location (separated by about 200+ miles) acted as the fault tolerant component for the other. So, the traffic was cross-fed across the two data centers over a pair of OC12 links.

The application was a series of processes/services that formed a queue across the various machines of each cluster (over TCP ports). The processes within a single physical system too communicated via IP addresses and TCP ports.

The problem we started observing a few months after deployment was that data would start queuing up, which would slow down and skew the calculations of data usage, the implications of which I don’t necessarily have to spell out (ridiculous).

The software  vendor’s standard response would be –

  1. It is not an application problem
  2. The various components of the application would write into FIFO queues on SAN-based filesystems. The vendor would constantly raise the bogey of the SAN storage being slow, of not enough IOPs being available, and/or of response times being poor. Their basis for this seemed to be an arbitrary metric of the CPU IO wait percentage rising over 5% (or perhaps even lower at times).
  3. After much deliberation poring over the NAR reports of the almost dedicated EMC CX4-960s (and working with EMC), we were able to ascertain that the storage arrays and the SAN were not in any way contributing to latency (that could result in poor performance of this app).
  4. The processes, being woefully single-threaded, would barely ever use more than 15% of the total CPUs available on the servers (each server having 16 cores at its disposal).
  5. Memory usage was nominal and well within acceptable limits
  6. The network throughput wasn’t anywhere near saturation

We decided to start profiling the application (despite great protestations from the vendor) during normal as well as problematic periods, at the layer where we were seeing the issues as well as at the layers immediately upstream and downstream.

What we observed was that:

  1. Under normal circumstances, the app would spend most of its time in read, write, send or recv syscalls.
  2. When the app was performing poorly, it would spend most of its time in the poll syscall. It became apparent that it was waiting on TCP sockets from app instances at the remote site (and the issue was bidirectional).

 

Once this bit of information was carefully vetted and provided to the vendor, they decided to give us their minimum throughput requirement: they needed a minimum of 30 Mbps. The assumption was that on an OC12 (bandwidth of 622 Mbps), 30 Mbps was quite achievable.

However, it turns out that latency plays a huge role in the actual throughput achievable on a WAN connection! (*sarcasm alert*)

The average RTT between the two sites was 30ms. Given that the servers were running RHEL 4.6 with the default (untuned) TCP send and receive buffer sizes of 64KB, the 30ms RTT capped the maximum throughput at about 17 Mbps.
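
The arithmetic behind that number is just the bandwidth-delay product: with a 64KB window, the sender can only keep 64KB in flight per round trip. A quick back-of-the-envelope sketch (illustrative numbers only):

# TCP throughput is bounded by window size divided by round-trip time.
window_bytes = 64 * 1024        # untuned RHEL 4.x default send/recv buffer
rtt_seconds = 0.030             # ~30ms RTT between the two data centers

max_throughput_mbps = (window_bytes * 8) / rtt_seconds / 1e6
print('Max throughput: %.1f Mbps' % max_throughput_mbps)   # ~17.5 Mbps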

It turns out that the methodologies we (*NIX admins) would typically use to generate some network traffic and measure "throughput" aren't necessarily equipped to handle wide area networks. For example, SSH has an artificial bottleneck in its code that throttles the TCP window size to a default of 64KB (in the version of OpenSSH we were using at the time), as hardcoded in the channels.h file. Initial tests were indeed baffling, since we could never cross 22 Mbps of throughput on the WAN. After a little research we realized that the default TCP window sizes (for passive FTP, scp, etc.) were not really tuned for high-RTT connections.

Thus began the process of tweaking the buffer sizes and generating synthetic loads using iperf.

After we established that the default TCP buffer sizes were inadequate, we calculated the buffer sizes required to provide at least 80 Mbps of throughput and implemented them across the environment. The queuing stopped immediately afterwards.
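
The sizing works the same way in reverse: the buffers have to hold at least one bandwidth-delay product at the target rate. A rough sketch of that calculation (the exact sysctl values we rolled out aren't reproduced here):

# Required buffer = target bandwidth x RTT (the bandwidth-delay product).
target_mbps = 80                # desired throughput across the WAN
rtt_seconds = 0.030             # ~30ms RTT

buffer_bytes = (target_mbps * 1e6 / 8) * rtt_seconds
print('Minimum buffer: %d KB' % (buffer_bytes / 1024))     # ~292 KB, so roughly 300 KB or larger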

Jan 022014
 

This is a continuation of my previous post on this topic.

First, a disclaimer –

  • I have focused primarily on CentOS. I will try to update this to accommodate more than one Linux distro (and may even consider writing for Solaris-based implementations in the future).

Here’s the skeleton of the cloudera manager setup plugin:

import posixpath

from starcluster import threadpool
from starcluster import clustersetup
from starcluster.logger import log

class ClouderaMgr(clustersetup.ClusterSetup):

        def __init__(self,cdh_mgr_agentdir='/etc/cloudera-scm-agent'):
                self.cdh_mgr_agentdir = cdh_mgr_agentdir
                self.cdh_mgr_agent_conf = '/etc/cloudera-scm-agent/config.ini'
                self.cdh_mgr_repo_conf = '/etc/yum.repos.d/cloudera-cdh4.repo'
                self.cdh_mgr_repo_url = 'http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/cloudera-manager.repo'
                self._pool = None

        @property
        def pool(self):
                if self._pool is None:
                        self._pool = threadpool.get_thread_pool(20, disable_threads=False)
                return self._pool

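        # Fetch the Cloudera Manager repo file on the node and append it to the local yum repo config.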
        def _install_cdhmgr_repo(self,node):
                node.ssh.execute('wget %s' % self.cdh_mgr_repo_url)
                node.ssh.execute('cat /root/cloudera-manager.repo >> %s' % self.cdh_mgr_repo_conf)

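        # Install the Cloudera Manager agent and daemon packages on the node via yum.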
        def _install_cdhmgr_agent(self,node):
                node.ssh.execute('yum install -y cloudera-manager-agent')
                node.ssh.execute('yum install -y cloudera-manager-daemons')

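        # On the master: disable services that would conflict with Cloudera Manager,
        # then install and start the Cloudera Manager server and its embedded database.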
        def _install_cdhmgr(self,master):
                master.ssh.execute('/sbin/service cloudera-scm-agent stop')
                master.ssh.execute('/sbin/chkconfig cloudera-scm-agent off')
                master.ssh.execute('/sbin/chkconfig hue off')
                master.ssh.execute('/sbin/chkconfig oozie off')
                master.ssh.execute('/sbin/chkconfig hadoop-httpfs off')
                master.ssh.execute('yum install -y cloudera-manager-server')
                master.ssh.execute('yum install -y cloudera-manager-server-db')
                master.ssh.execute('/sbin/service cloudera-scm-server-db start')
                master.ssh.execute('/sbin/service cloudera-scm-server start')

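        # Add the given cluster user to the hadoop group.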
        def _setup_hadoop_user(self,node,user):
                node.ssh.execute('gpasswd -a %s hadoop' %user)

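        # Point the agent at the master by rewriting server_host in the agent's config.ini.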
        def _install_agent_conf(self, node):
                node.ssh.execute('/bin/sed -e "s/server_host=localhost/server_host=master/g" %s > /tmp/config.ini; mv /tmp/config.ini %s' % (self.cdh_mgr_agent_conf, self.cdh_mgr_agent_conf))

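        # Open the Cloudera Manager (7180) and Hadoop web UI (50070, 50030) ports
        # in the cluster's EC2 security groups, if they aren't already authorized.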
        def _open_ports(self,master):
                ports = [7180,50070,50030]
                ec2 = master.ec2
                for group in master.cluster_groups:
                        for port in ports:
                                has_perm = ec2.has_permission(group, 'tcp', port, port, '0.0.0.0/0')
                                if not has_perm:
                                        ec2.conn.authorize_security_group(group_id=group.id,
                                                                        ip_protocol='tcp',
                                                                        from_port=port,
                                                                        to_port=port,
                                                                        cidr_ip='0.0.0.0/0')

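        # Called by StarCluster once the cluster is up: set up the repo, agent and
        # agent config on every node, then install the Cloudera Manager server on the
        # master and open the web UI ports.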
        def run(self,nodes, master, user, user_shell, volumes):
                for node in nodes:
                     self._install_cdhmgr_repo(node)
                     self._install_cdhmgr_agent(node)
                     self._install_agent_conf(node)
                self._install_cdhmgr(master)
                self._open_ports(master)

        def on_add_node(self, node, nodes, master, user, user_shell, volumes):
                for node in nodes:
                     self._install_cdhmgr_repo(node)
                     self._install_cdhmgr_agent(node)
                     self._install_agent_conf(node)

And this plugin can be referenced and executed in the cluster configuration as follows:

[plugin cdhmgr]
setup_class = cdh_mgr.ClouderaMgr

[cluster testcluster]
keyname = testcluster
cluster_size = 2
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-232b034a
master_instance_type = t1.micro
node_image_id =  ami-232b034a
node_instance_type = t1.micro
plugins = wgetter,rpminstaller,repoinstaller,pkginstaller,cdhmgr

Once this step is successfully executed by StarCluster, you will be able to access the Cloudera Manager web GUI on port 7180 of your master node's public IP and/or DNS name.

<...more to follow...>

Dec 132013
 

I ran into an incredible tool known as StarCluster, which is an open-source project from MIT (http://star.mit.edu/cluster/).

StarCluster is built around Sun Microsystems' Grid Engine software (Sun used its N1 Grid Engine to do deployments for HPC environments). The folks at MIT built on a fork of that (SGE, the Sun Grid Engine), and StarCluster was formed.

StarCluster has hooks into the AWS API and is used to dynamically and automatically provision multi-node clusters in Amazon's EC2 cloud. The base StarCluster software provisions a pre-selected virtual machine image (AMI) onto a pre-defined VM type (e.g. t1.micro, m1.small, etc.). The distributed StarCluster software has a Hadoop plugin (http://star.mit.edu/cluster/docs/0.93.3/plugins/hadoop.html) which installs the generic Apache Foundation Hadoop stack on the cluster of nodes (or single node) provisioned. It then uses a tool called dumbo (http://klbostee.github.io/dumbo/) to drive the Hadoop framework. This is great.

But I want to leverage Cloudera's Cloudera Manager-based Hadoop distribution (CDH) and roll it into the StarCluster framework. Therefore, I have to start developing the necessary plugin to implement Cloudera Hadoop on top of StarCluster.

I won't delve deeper into StarCluster's inner workings in this post, as I only started working on it a couple of days ago. I do intend to see how I can leverage this toolkit and add more functionality to it (like integrating Ganglia-based monitoring, etc.) in due time.

Why StarCluster and not a Vagrant, Jenkins, Puppet combination?

Tools like StarCluster are meant to do rapid and effective deployments of distributed computing environments on a gigantic scale (N1 Grid Engine was used to deploy hundreds of physical servers back in the days when Sun still shone on the IT industry). While Vagrant, Puppet, Jenkins, etc. can be put together to build such a framework, it is in essence easier (and less effort) to use StarCluster to operationalize a Hadoop cluster on EC2. After I spend some more time on its inner workings, I hope to be able to make it work with other virtualization technologies too – perhaps integrating with an on-premise Xen, VMware or KVM cluster.

Let's skim the surface:

How StarCluster works

Let’s start with how StarCluster works (and this pre-supposes a working knowledge of Amazon’s AWS/EC2 infrastructure).

It is in essence a toolkit that you install on your workstation/laptop and use to drive the configuration and deployment of your distributed clustering environment on EC2.

To set it up, follow the instructions on MIT's website. It was pretty straightforward on my MacBook Pro running OS X 10.9 (Mavericks). I already had Xcode installed on my Mac, so I cloned the StarCluster Git distribution and ran through the two commands that needed to be run, as listed here — http://star.mit.edu/cluster/docs/latest/installation.html

After that, I set up my preliminary configuration file (~/.starcluster/config) and we were ready to roll. I chose to test the StarCluster functionality without any plugins by setting up a single node.

It's important to choose the right AMI, because not all AMIs can be deployed on the VM type of your choice. For instance, I wanted to use the smallest instance type, t1.micro, for my testing, since I wouldn't really be using this for any actual work. I settled on a CentOS 5.4 AMI from RealStack.

After verifying that my single-node cluster was being set up correctly, I proceeded to develop a plugin to do the initial Cloudera Hadoop setup along with a Cloudera Manager installation on the master node. With that in place, we can roll out a cluster, connect to Cloudera Manager, and use the Cloudera Manager interface to configure the Hadoop cluster. If further automation is possible with Cloudera Manager, I will work on that as an addendum to this effort.

The Plugins

Following the instructions here, I started building out the framework for my plugin. I called the foundational plugin centos (since my selected OS is CentOS). The CDH-specific one will be called Cloudera.

The following directory structure is set up after StarCluster is installed –

In your home directory, a .starcluster directory, under which we have:

total 8
drwxr-xr-x   9 dwai  dlahiri   306 Dec 12 18:15 logs
drwxr-xr-x   4 dwai  dlahiri   136 Dec 13 10:01 plugins
-rw-r--r--   1 dwai  dlahiri  2409 Dec 13 10:01 config
drwxr-xr-x+ 65 dwai  dlahiri  2210 Dec 13 10:01 ..
drwxr-xr-x   5 dwai  dlahiri   170 Dec 13 10:01 .

Your new plugins will go into the plugins directory listed above (for now).

It is as simple as this (the plugin is, in essence, a Python script – we call it centos.py):

from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log


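# Each class below is a standalone StarCluster plugin; its constructor argument is
# supplied by the matching [plugin] section of ~/.starcluster/config (shown further down).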
class WgetPackages(ClusterSetup):
	def __init__(self,pkg_to_wget):
		self.pkg_to_wget = pkg_to_wget
		log.debug('pkg_to_wget = %s' % pkg_to_wget)
	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Wgetting %s on %s" % (self.pkg_to_wget, node.alias))
			node.ssh.execute('wget %s' % self.pkg_to_wget)

class RpmInstaller(ClusterSetup):
	def __init__(self,rpm_to_install):
		self.rpm_to_install = rpm_to_install
		log.debug('rpm_to_install = %s' % rpm_to_install)
	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Installing %s on %s" % (self.rpm_to_install, node.alias))
			node.ssh.execute('yum -y --nogpgcheck localinstall %s' %self.rpm_to_install)

class RepoConfigurator(ClusterSetup):
	def __init__(self,repo_to_install):
		self.repo_to_install  = repo_to_install
		log.debug('repo_to_install = %s' % repo_to_install)
	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Installing %s on %s" % (self.repo_to_install, node.alias))
			node.ssh.execute('rpm --import %s' % self.repo_to_install)

class PackageInstaller(ClusterSetup):
	def __init__(self,pkg_to_install):
		self.pkg_to_install  = pkg_to_install
		log.debug('pkg_to_install = %s' % pkg_to_install)
	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Installing %s on %s" % (self.pkg_to_install, node.alias))
			node.ssh.execute('yum -y install %s' % self.pkg_to_install)

Now we can reference the plugin in our configuration file:

[cluster testcluster]
keyname = testcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
node_image_id = ami-0db22764
node_instance_type = t1.micro
plugins = wgetter,rpminstaller,repoinstaller,pkginstaller,sge
permissions = zookeeper.1, zookeeper.2, accumulo.monitor, hdfs, jobtracker, tablet_server, master_server, accumulo_logger, accumulo_tracer, datanode_data, datanode_metadata, tasktrackers, namenode_http_monitor, datanode_http_monitor, accumulo, accumulo_http_monitor

[plugin wgetter]
setup_class = centos.WgetPackages
pkg_to_wget = "http://archive.cloudera.com/cdh4/one-click-install/redhat/5/x86_64/cloudera-cdh-4-0.x86_64.rpm"

[plugin rpminstaller]
setup_class = centos.RpmInstaller
rpm_to_install = cloudera-cdh-4-0.x86_64.rpm

[plugin repoinstaller]
setup_class = centos.RepoConfigurator
repo_to_install = "http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera"

[plugin pkginstaller]
setup_class = centos.PackageInstaller
pkg_to_install = hadoop-0.20-mapreduce-jobtracker

This is not complete by any means – I intend to post the complete plugin + framework in a series of subsequent posts.