I ran into an incredible tool called StarCluster, an open-source project from MIT (http://star.mit.edu/cluster/).
StarCluster is built on Sun Microsystems’ N1 Grid Engine software, which Sun used to deploy HPC environments. The folks at MIT developed against a fork of it, SGE (Sun Grid Engine), and StarCluster was born.
StarCluster hooks into the AWS API and is used to dynamically and automatically provision multi-node clusters in Amazon’s EC2 cloud. The base StarCluster software provisions a pre-selected virtual machine image (AMI) onto a pre-defined VM type (e.g. t1.micro, m1.small, etc.). The distributed StarCluster software has a hadoop plugin (http://star.mit.edu/cluster/docs/0.93.3/plugins/hadoop.html) which installs the generic Apache Foundation Hadoop stack on the provisioned cluster (or single node). It then uses a tool called dumbo (http://klbostee.github.io/dumbo/) to drive the Hadoop framework. This is great.
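For context, a dumbo job is plain Python. The word-count example from dumbo’s tutorial looks roughly like this (dumbo submits it to the Hadoop cluster as a streaming job):

def mapper(key, value):
    # Emit a count of 1 for every word in the input line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts for each word.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

It is then launched with something along the lines of dumbo start wordcount.py -hadoop <hadoop home> -input <input> -output <output>.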
But I want to leverage Cloudera’s Hadoop distribution (CDH), managed through Cloudera Manager, and roll it into the StarCluster framework. Therefore I have to start developing the plugin needed to implement Cloudera Hadoop on top of StarCluster.
I won’t delve deeper into StarCluster’s inner workings in this post, as I only started working with it a couple of days ago. In due time I intend to see how I can add more functionality to the toolkit (like integrating Ganglia-based monitoring).
Why StarCluster and not a Vagrant, Jenkins, Puppet combination?
Tools like StarCluster are meant to do rapid and effective deployments of distributed computing environments at gigantic scale (N1 Grid Engine was used to deploy hundreds of physical servers back in the days when Sun still shone on the IT industry). While Vagrant, Puppet, Jenkins, etc. can be put together to build such a framework, it is simply easier, and less effort, to use StarCluster to operationalize a Hadoop cluster on EC2. Once I have spent more time on its inner workings, I hope to use it with other virtualization technologies too, perhaps integrating with an on-premise Xen, VMware, or KVM cluster.
Let’s skim the surface:
How StarCluster works
Let’s start with how StarCluster works (this presupposes a working knowledge of Amazon’s AWS/EC2 infrastructure).
It is, in essence, a toolkit that you install on your workstation or laptop and use to drive the configuration and deployment of your distributed cluster environment on EC2.
To set it up, follow the instructions on MIT’s website. It was pretty straightforward on my MacBook Pro running OS X 10.9 (Mavericks). I already had Xcode installed on my Mac, so I cloned the StarCluster Git repository and ran the two install commands listed here: http://star.mit.edu/cluster/docs/latest/installation.html
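For reference, the from-source install boils down to roughly the following (this is from memory; treat the linked installation page as authoritative):

$ git clone git://github.com/jtriley/StarCluster.git
$ cd StarCluster
$ sudo python distribute_setup.py
$ sudo python setup.py install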
After that, I set up my preliminary configuration file (~/.starcluster/config) and we were ready to roll. I chose to test the StarCluster functionality without any plugins by setting up a single node.
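A minimal single-node configuration, with placeholder credentials, key paths, and AMI (the AMI choice is discussed next), looks something like this:

[global]
default_template = testcluster

[aws info]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret key>
aws_user_id = <your AWS account id>

[key testcluster]
key_location = ~/.ssh/testcluster.rsa

[cluster testcluster]
keyname = testcluster
cluster_size = 1
node_image_id = <your chosen AMI>
node_instance_type = t1.micro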
It’s important to choose the right AMI, because not all AMIs can be deployed on the instance type of your choice (t1.micro instances, for example, only boot EBS-backed AMIs). I wanted to use the smallest instance type, t1.micro, for my testing, since I won’t really be using this for any actual work. I settled on a CentOS 5.4 AMI from RealStack.
After verifying that my single-node cluster was being set up correctly, I proceeded to develop a plugin that does the initial Cloudera Hadoop setup and installs Cloudera Manager on the master node. With that in place, we can roll out a cluster, connect to Cloudera Manager, and use the Cloudera Manager interface to configure the Hadoop cluster. If further automation is possible with Cloudera Manager, I will work on that as an addendum to this effort.
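To give a rough idea of what that looks like in StarCluster terms, here is a minimal sketch of a master-only setup step. The class name, package name, and commands below are placeholders of my own for illustration, not the finished plugin:

from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log

class ClouderaManagerInstaller(ClusterSetup):
    """Hypothetical sketch: install and start Cloudera Manager on the master only."""
    def __init__(self, cm_pkg='cloudera-manager-server'):
        # The package name is an assumption for illustration purposes.
        self.cm_pkg = cm_pkg

    def run(self, nodes, master, user, user_shell, volumes):
        log.info("Installing %s on master %s" % (self.cm_pkg, master.alias))
        # Only the master node gets the Cloudera Manager server; the rest of
        # the cluster gets configured later from the Cloudera Manager web UI.
        master.ssh.execute('yum -y install %s' % self.cm_pkg)
        master.ssh.execute('service cloudera-scm-server start')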
The Plugins
Following the instructions here, I started building out the framework for my plugin. I called the foundational plugin centos (since my selected OS is CentOS); the CDH-specific one will be called cloudera.
After StarCluster is installed, a .starcluster directory is created in your home directory, under which we have:
total 8
drwxr-xr-x   9 dwai dlahiri   306 Dec 12 18:15 logs
drwxr-xr-x   4 dwai dlahiri   136 Dec 13 10:01 plugins
-rw-r--r--   1 dwai dlahiri  2409 Dec 13 10:01 config
drwxr-xr-x+ 65 dwai dlahiri  2210 Dec 13 10:01 ..
drwxr-xr-x   5 dwai dlahiri   170 Dec 13 10:01 .
Your new plugins will go into the plugins directory listed above (for now).
It is as simple as this (the plugin is, in essence, a Python script; we call it centos.py):
from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log

class WgetPackages(ClusterSetup):
    """Download a package (e.g. the CDH one-click-install RPM) on every node."""
    def __init__(self, pkg_to_wget):
        self.pkg_to_wget = pkg_to_wget
        log.debug('pkg_to_wget = %s' % pkg_to_wget)

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            log.info("Wgetting %s on %s" % (self.pkg_to_wget, node.alias))
            node.ssh.execute('wget %s' % self.pkg_to_wget)

class RpmInstaller(ClusterSetup):
    """Install a locally downloaded RPM on every node."""
    def __init__(self, rpm_to_install):
        self.rpm_to_install = rpm_to_install
        log.debug('rpm_to_install = %s' % rpm_to_install)

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            log.info("Installing %s on %s" % (self.rpm_to_install, node.alias))
            node.ssh.execute('yum -y --nogpgcheck localinstall %s' % self.rpm_to_install)

class RepoConfigurator(ClusterSetup):
    """Import the repository GPG key on every node."""
    def __init__(self, repo_to_install):
        self.repo_to_install = repo_to_install
        log.debug('repo_to_install = %s' % repo_to_install)

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            log.info("Installing %s on %s" % (self.repo_to_install, node.alias))
            node.ssh.execute('rpm --import %s' % self.repo_to_install)

class PackageInstaller(ClusterSetup):
    """Install a yum package on every node."""
    def __init__(self, pkg_to_install):
        self.pkg_to_install = pkg_to_install
        log.debug('pkg_to_install = %s' % pkg_to_install)

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            log.info("Installing %s on %s" % (self.pkg_to_install, node.alias))
            node.ssh.execute('yum -y install %s' % self.pkg_to_install)
Now we can reference the plugin in our configuration file:
[cluster testcluster]
keyname = testcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
node_image_id = ami-0db22764
node_instance_type = t1.micro
plugins = wgetter,rpminstaller,repoinstaller,pkginstaller,sge
permissions = zookeeper.1, zookeeper.2, accumulo.monitor, hdfs, jobtracker, tablet_server, master_server, accumulo_logger, accumulo_tracer, datanode_data, datanode_metadata, tasktrackers, namenode_http_monitor, datanode_http_monitor, accumulo, accumulo_http_monitor

[plugin wgetter]
setup_class = centos.WgetPackages
pkg_to_wget = "http://archive.cloudera.com/cdh4/one-click-install/redhat/5/x86_64/cloudera-cdh-4-0.x86_64.rpm"

[plugin rpminstaller]
setup_class = centos.RpmInstaller
rpm_to_install = cloudera-cdh-4-0.x86_64.rpm

[plugin repoinstaller]
setup_class = centos.RepoConfigurator
repo_to_install = "http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera"

[plugin pkginstaller]
setup_class = centos.PackageInstaller
pkg_to_install = hadoop-0.20-mapreduce-jobtracker
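With the plugins and the cluster template defined, launching the cluster and logging into the master comes down to the usual StarCluster commands:

$ starcluster start testcluster
$ starcluster sshmaster testcluster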
This is not complete by any means – I intend to post the complete plugin + framework in a series of subsequent posts.