Mar 18, 2014

This is an old one. I thought I’d post this before it falls off main memory.

You will need two things for this. I’ll start with the obvious.

  1. A Splunk server configured to act as a deployment server (so that apps, configurations, etc. can be deployed centrally)
  2. A centralized “bastion” host that has root-level, SSH key-based trust established with your target hosts (i.e., the Splunk agents). Another option is to set up unprivileged SSH key-based trusts, combined with passwordless local sudo privileges on each of your target hosts.
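For reference, the agent side of item 1 boils down to a deploymentclient.conf on each target pointing back at the deployment server (the hostname here is a placeholder; 8089 is Splunk's default management port):

```ini
[deployment-client]

[target-broker:deploymentServer]
targetUri = deployserver.example.com:8089
```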

Once you have that in place, you can run a modification of the script shown below —

I’d written it for an old shop of mine, where we managed Solaris on SPARC, Solaris on x86, and Red Hat Linux. Modify or write functions corresponding to your target OS of choice and it should do pretty much what you need rather easily.

Another thing to watch out for: make sure the version of Splunk you are working with hasn’t changed the underlying mechanisms. I wrote this for Splunk v4.x at least three years back (I can’t remember exactly how long ago now — this is one script I didn’t put in version control, so there's no way for me to tell how old it is, and the timestamps changed when I copied it over from my old laptop).
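The script itself isn't reproduced here, but its overall shape was roughly the following sketch. Everything below is illustrative — the hostnames, tarball names, and paths are placeholders, not the originals:

```python
import subprocess

# Rough sketch only: per-OS forwarder tarballs staged on each target host.
# Names and paths are placeholders, not the originals from the old script.
FORWARDER_TARBALLS = {
    'SunOS-sparc':  'splunkforwarder-sunos-sparc.tar.gz',
    'SunOS-x86':    'splunkforwarder-sunos-x86.tar.gz',
    'Linux-x86_64': 'splunkforwarder-linux-x86_64.tar.gz',
}

def install_cmds(os_arch, deploy_server='deployserver.example.com:8089'):
    """Build the remote command string for a given target OS/architecture:
    unpack the forwarder, start it once to accept the license, then point
    it at the deployment server."""
    tarball = FORWARDER_TARBALLS[os_arch]
    return ' && '.join([
        'cd /opt',
        'gzip -dc /var/tmp/%s | tar xf -' % tarball,
        '/opt/splunkforwarder/bin/splunk start --accept-license --answer-yes --no-prompt',
        '/opt/splunkforwarder/bin/splunk set deploy-poll %s' % deploy_server,
    ])

def push(host, os_arch):
    """Run the install remotely over the root-trusted ssh channel from the bastion."""
    subprocess.check_call(['ssh', host, install_cmds(os_arch)])
```

From the bastion host, you would then loop over your host inventory calling `push(host, os_arch)` for each target.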


Mar 12, 2014

Okay, this has been a long time coming, but I thought I’d write this down before I forget all about it.

First a little bit of a rant. In the EMC world, we assign LUNs to hosts and identify them using a hex-id code called the LUN ID. This, in conjunction with the EMC Frame ID, helps us uniquely and “easily” identify the LUNs, especially when you consider that there are usually multiple EMC frames and hundreds or thousands of SAN-attached hosts in most modern enterprise data centers.

The challenge arises when using a disk multipathing solution other than EMC’s expensive PowerPath software, or the exorbitantly expensive Veritas Storage Foundation suite from Symantec.

Most self-respecting modern operating systems have disk multipathing software built in and freely available; the focus of this post is Solaris. Solaris has mature, efficient multipathing software called MPxIO, which cleverly creates a pseudo-device corresponding to the complex of the multiple (or single) paths via which a SAN LUN is visible to the operating system.

The MPxIO driver then uses the LUN’s globally unique identifier (GUID) to identify the LUN uniquely, which is arguably the right approach. It is, however, a challenge for humans, since the GUID is a 32-character string (very nicely explained here).

When I first proposed to our usual user community (Oracle DBAs) that they would have to use disks named as shown below, I faced outraged howls of indignation: “How can we address these disks in ASM?”
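To give a sense of what they were objecting to, an MPxIO device name embeds the full 32-character GUID in the c#t&lt;GUID&gt;d# form, something like this (the GUID digits here are made up):

```
/dev/rdsk/c3t60000970000192601234533030334341d0s2
```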


To work around this, we considered creating an alternate namespace, say /dev/oracle/disk001 and so forth, retaining the major/minor numbers of the underlying devices. But that would get hairy real quick, especially on those databases where we had multiple terabytes of storage (hundreds of LUNs).
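The idea we toyed with would have looked something like this sketch (not tooling we actually ran; the device paths are illustrative):

```python
import os

def link_disks(devices, destdir='/dev/oracle'):
    """Create friendly aliases (disk001, disk002, ...) as symlinks to the
    MPxIO pseudo-devices. A symlink resolves to the same underlying device
    node, so the major/minor numbers are preserved."""
    os.makedirs(destdir, exist_ok=True)
    links = {}
    for i, dev in enumerate(sorted(devices), start=1):
        link = os.path.join(destdir, 'disk%03d' % i)
        if not os.path.islink(link):
            os.symlink(dev, link)
        links[link] = dev
    return links
```

The hairiness shows up when LUNs are added or removed: the numbering has to stay stable across hundreds of devices and reboots, which is exactly the bookkeeping we didn't want to own.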

If you are working with CLARiiON arrays, you will have to figure out some other way (than what I’ve shown in this blog post) to map your MPxIO/Solaris disk name to the valid hex LUN ID.

The code in this gist covers what’s what. A friendly storage admin told me about this with respect to VMAX/Symmetrix LUNs. Apparently EMC generates the GUID (a.k.a. UUID) of a LUN differently depending on whether it is a CLARiiON/VNX or a Symmetrix/VMAX.

The pertinent piece of information is in the fields extracted into the variables $xlunid1 and so forth, converted to their hex values. Having this information reduces the need to install symcli on every host just to extract the LUN ID that way.
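As I understand that storage admin's pointer, the tail of a Symmetrix/VMAX GUID encodes the Symm device ID as ASCII characters, two hex digits per character, which is what the $xlunid fields pull out. A minimal standalone sketch of just the decoding step (the GUID below is fabricated, and the exact byte offsets are an assumption from my notes):

```python
def decode_ascii_hex(s):
    """Decode a run of two-hex-digit pairs into their ASCII characters,
    e.g. '303341' -> '03A'."""
    return ''.join(chr(int(s[i:i + 2], 16)) for i in range(0, len(s), 2))

# A VMAX GUID is a 32-character hex string; the trailing pairs spell out
# the device number in ASCII ('30' -> '0', '43' -> 'C', '41' -> 'A', ...).
guid = '60000970000192601234533330334341'  # fabricated example
symm_dev = decode_ascii_hex(guid[-6:])     # last three ASCII-encoded characters
```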

This gives us the ability to map an MPxIO/Solaris disk name to an EMC LUN ID.

The full script is here —

Jan 02, 2014

This is a continuation of my previous post on this topic.

First, a disclaimer –

  • I have focused primarily on CentOS. I will try to update this to accommodate more than one Linux distro (and may even consider writing for Solaris-based implementations in the future).

Here’s the skeleton of the Cloudera Manager setup plugin:

import posixpath

from starcluster import threadpool
from starcluster import clustersetup
from starcluster.logger import log

class ClouderaMgr(clustersetup.ClusterSetup):

        def __init__(self, cdh_mgr_agentdir='/etc/cloudera-scm-agent'):
                self.cdh_mgr_agentdir = cdh_mgr_agentdir
                self.cdh_mgr_agent_conf = '/etc/cloudera-scm-agent/config.ini'
                self.cdh_mgr_repo_conf = '/etc/yum.repos.d/cloudera-cdh4.repo'
                self.cdh_mgr_repo_url = ''
                self._pool = None

        def pool(self):
                if self._pool is None:
                        self._pool = threadpool.get_thread_pool(20, disable_threads=False)
                return self._pool

        def _install_cdhmgr_repo(self, node):
                node.ssh.execute('wget %s' % self.cdh_mgr_repo_url)
                node.ssh.execute('cat /root/cloudera-manager.repo >> %s' % self.cdh_mgr_repo_conf)

        def _install_cdhmgr_agent(self, node):
                node.ssh.execute('yum install -y cloudera-manager-agent')
                node.ssh.execute('yum install -y cloudera-manager-daemons')

        def _install_cdhmgr(self, master):
                # Stop the agent and disable the services Cloudera Manager will manage itself
                master.ssh.execute('/sbin/service cloudera-scm-agent stop')
                master.ssh.execute('/sbin/chkconfig cloudera-scm-agent off')
                master.ssh.execute('/sbin/chkconfig hue off')
                master.ssh.execute('/sbin/chkconfig oozie off')
                master.ssh.execute('/sbin/chkconfig hadoop-httpfs off')
                master.ssh.execute('yum install -y cloudera-manager-server')
                master.ssh.execute('yum install -y cloudera-manager-server-db')
                master.ssh.execute('/sbin/service cloudera-scm-server-db start')
                master.ssh.execute('/sbin/service cloudera-scm-server start')

        def _setup_hadoop_user(self, node, user):
                node.ssh.execute('gpasswd -a %s hadoop' % user)

        def _install_agent_conf(self, node):
                # Point the agent at the master node instead of localhost
                node.ssh.execute(
                        '/bin/sed -e "s/server_host=localhost/server_host=master/g" %s > /tmp/config.ini; '
                        'mv /tmp/config.ini %s' % (self.cdh_mgr_agent_conf, self.cdh_mgr_agent_conf))

        def _open_ports(self, master):
                # 7180: Cloudera Manager UI; 50070: NameNode UI; 50030: JobTracker UI
                ports = [7180, 50070, 50030]
                ec2 = master.ec2
                for group in master.cluster_groups:
                        for port in ports:
                                has_perm = ec2.has_permission(group, 'tcp', port, port, '')
                                if not has_perm:
                                        # boto security-group call to open the port
                                        group.authorize('tcp', port, port, '')

        def run(self, nodes, master, user, user_shell, volumes):
                # Agent bits on every node, then the server pieces on the master
                for node in nodes:
                        self._install_cdhmgr_repo(node)
                        self._install_cdhmgr_agent(node)
                        self._install_agent_conf(node)
                        self._setup_hadoop_user(node, user)
                self._install_cdhmgr(master)
                self._open_ports(master)

        def on_add_node(self, node, nodes, master, user, user_shell, volumes):
                # A newly added node only needs the agent-side setup
                self._install_cdhmgr_repo(node)
                self._install_cdhmgr_agent(node)
                self._install_agent_conf(node)
                self._setup_hadoop_user(node, user)

And this plugin can be referenced and executed in the cluster configuration as follows:

[plugin cdhmgr]
setup_class = cdh_mgr.ClouderaMgr

[cluster testcluster]
keyname = testcluster
cluster_size = 2
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-232b034a
master_instance_type = t1.micro
node_image_id =  ami-232b034a
node_instance_type = t1.micro
plugins = wgetter,rpminstaller,repoinstaller,pkginstaller,cdhmgr

Once this step is successfully executed by StarCluster, you will be able to access the Cloudera Manager web GUI on port 7180 of your master node’s public IP and/or DNS name.
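Cloudera Manager can take a minute or two to come up after the plugin finishes, so a small helper like this (hypothetical, not part of the plugin) can poll the port before you open the GUI:

```python
import socket
import time

def wait_for_port(host, port=7180, timeout=300, interval=5):
    """Poll until a TCP port starts accepting connections; return True on
    success, False if the timeout expires first."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```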

<...more to follow...>

Dec 13, 2013

I ran into an incredible tool called StarCluster, an open-source project from MIT.

StarCluster is built using Sun Microsystems’ N1 Grid Engine software (Sun used it to do deployments for HPC environments). The folks at MIT developed on a fork of it, SGE (Sun Grid Engine), and StarCluster was formed.

StarCluster has hooks into the AWS API and is used to dynamically and automatically provision multi-node clusters in Amazon’s EC2 cloud. The base StarCluster software provisions a pre-selected virtual machine image (AMI) onto a pre-defined VM type (e.g. t1.micro, m1.small, etc.). The distributed StarCluster software has a Hadoop plugin which installs the generic Apache Foundation Hadoop stack on the cluster of nodes (or single node) provisioned. It then uses a tool called dumbo to drive the Hadoop framework. This is great.

But I want to leverage Cloudera’s CDH Hadoop distribution, managed via Cloudera Manager, and want to roll this into the StarCluster framework. Therefore I have started developing the necessary plugin to implement Cloudera Hadoop on top of StarCluster.

I won’t delve deeper into StarCluster’s inner-workings in this post, as I just started working on it a couple of days ago. I do intend to see how I can leverage this toolkit to add more functionality into it (like integrating Ganglia-based monitoring, etc) in due time.

Why StarCluster and not a Vagrant, Jenkins, Puppet combination?

Tools like StarCluster are meant to do rapid and effective deployments of distributed computing environments on a gigantic scale (N1 Grid Engine was used to deploy hundreds of physical servers back in the days when Sun still shone on the IT industry). While Vagrant, Puppet, Jenkins, etc. can be put together to build such a framework, it is in essence easier, and less effort, to use StarCluster to operationalize a Hadoop cluster on EC2. After I spend some more time on its inner workings, I hope to be able to use it with other virtualization technologies too, perhaps integrating with an on-premise Xen, VMware, or KVM cluster.

Let’s skim the surface:

How StarCluster works

Let’s start with how StarCluster works (and this pre-supposes a working knowledge of Amazon’s AWS/EC2 infrastructure).

It is in essence a toolkit that you install on your workstation/laptop and use to drive configuration and deployment of your distributed clustering environment on EC2.

To set it up, follow the instructions on MIT’s website. It was pretty straightforward on my MacBook Pro running OS X 10.9 (Mavericks). I already had Xcode installed on my Mac, so I cloned the StarCluster Git distribution and ran through the two commands that needed to be run, as listed here —

After that, I set up my preliminary configuration file (~/.starcluster/config) and we were ready to roll. I chose to test the StarCluster functionality without any plugins by setting up a single node.

It’s important to choose the right AMI, because not all AMIs can be deployed on the VM type of your choice. For instance, I wanted to use the smallest instance type, t1.micro, to do my testing, since I won’t really be using this for any actual work. I settled on a CentOS 5.4 AMI from RealStack.

After verifying that my single-node cluster was being set up correctly, I proceeded to develop a plugin to do the initial Cloudera Hadoop setup along with a Cloudera Manager installation on the master node. With that in place, we can roll out a cluster, connect to Cloudera Manager, and use its interface to configure the Hadoop cluster. If further automation is possible with Cloudera Manager, I will work on that as an addendum to this effort.

The Plugins

Following the instructions here, I started building out the framework for my plugin. I called the foundational plugin centos (since my selected OS is CentOS). The CDH-specific one will be called cloudera.

The following directory structure is set up after StarCluster is installed –

In your home directory, a .starcluster directory, under which we have:

total 8
drwxr-xr-x   9 dwai  dlahiri   306 Dec 12 18:15 logs
drwxr-xr-x   4 dwai  dlahiri   136 Dec 13 10:01 plugins
-rw-r--r--   1 dwai  dlahiri  2409 Dec 13 10:01 config
drwxr-xr-x+ 65 dwai  dlahiri  2210 Dec 13 10:01 ..
drwxr-xr-x   5 dwai  dlahiri   170 Dec 13 10:01 .

Your new plugins will go into the plugins directory listed above (for now).

It is as simple as this (the plugin is in essence a Python script; we call it centos.py):

from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log

class WgetPackages(ClusterSetup):
	def __init__(self, pkg_to_wget):
		self.pkg_to_wget = pkg_to_wget
		log.debug('pkg_to_wget = %s' % pkg_to_wget)

	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Wgetting %s on %s" % (self.pkg_to_wget, node.alias))
			node.ssh.execute('wget %s' % self.pkg_to_wget)

class RpmInstaller(ClusterSetup):
	def __init__(self, rpm_to_install):
		self.rpm_to_install = rpm_to_install
		log.debug('rpm_to_install = %s' % rpm_to_install)

	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Installing %s on %s" % (self.rpm_to_install, node.alias))
			node.ssh.execute('yum -y --nogpgcheck localinstall %s' % self.rpm_to_install)

class RepoConfigurator(ClusterSetup):
	def __init__(self, repo_to_install):
		self.repo_to_install = repo_to_install
		log.debug('repo_to_install = %s' % repo_to_install)

	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Importing %s on %s" % (self.repo_to_install, node.alias))
			node.ssh.execute('rpm --import %s' % self.repo_to_install)

class PackageInstaller(ClusterSetup):
	def __init__(self, pkg_to_install):
		self.pkg_to_install = pkg_to_install
		log.debug('pkg_to_install = %s' % pkg_to_install)

	def run(self, nodes, master, user, user_shell, volumes):
		for node in nodes:
			log.info("Installing %s on %s" % (self.pkg_to_install, node.alias))
			node.ssh.execute('yum -y install %s' % self.pkg_to_install)

Now we can reference the plugin in our configuration file:

[cluster testcluster]
keyname = testcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
node_image_id = ami-0db22764
node_instance_type = t1.micro
plugins = wgetter,rpminstaller,repoinstaller,pkginstaller,sge
permissions = zookeeper.1, zookeeper.2, accumulo.monitor, hdfs, jobtracker, tablet_server, master_server, accumulo_logger, accumulo_tracer, datanode_data, datanode_metadata, tasktrackers, namenode_http_monitor, datanode_http_monitor, accumulo, accumulo_http_monitor

[plugin wgetter]
setup_class = centos.WgetPackages
pkg_to_wget = ""

[plugin rpminstaller]
setup_class = centos.RpmInstaller
rpm_to_install = cloudera-cdh-4-0.x86_64.rpm

[plugin repoinstaller]
setup_class = centos.RepoConfigurator
repo_to_install = ""

[plugin pkginstaller]
setup_class = centos.PackageInstaller
pkg_to_install = hadoop-0.20-mapreduce-jobtracker

This is not complete by any means; I intend to post the complete plugin and framework in a series of subsequent posts.