Jan 102014

This is in continuation of my previous post regarding Tactical vs Strategic thinking. I received some feedback about the previous post and thought I’d articulate the gist of the feedback and address the concern further.

The feedback – It seems like the previous post was too generic for some readers. The suggestion was to provide examples of what I consider “classic” tactical vs strategic thinking.

I can think of a few off the top, and please forgive me for using/re-using clichés or being hopelessly reductionist (but they are so ripe for the taking…).

I’ve seen several application “administrators” and DBAs refusing to automate startup and shutdown of their application via init scripts and/or SMF (in case of Solaris systems).

Really? An SAP basis application cannot be automatically started up? Or a Weblogic instance?


Because, *sigh and a long pause*…well because “bad things might happen!”

You know I’m not exaggerating about this. We all have faced such responses.

To get to the point, they were hard-pressed for valid answers. Sometimes we just accept their position, filed under the category “it’s their headache”.  That aside, a “strategic” thing to do, in the interest of increased efficiency and better delivery of services would be to automate the startup and shutdown of these apps.

An even more strategic thing to do would be to evaluate whether the applications would benefit from some form of clustering, and then automate the startup/shutdown of the app, building resource groups for various tiers of the application and establishing affinities between various groups.

For example, a 3-tier web UI based app would have an RDBMS backend, a business logic middle tier and a web front end. Build three resource/service groups using a standard clustering software and devise affinities between the three, such that the web ui is dependent on the business layer and the business layer dependent on the DB. Now, it’s a different case if all three tiers are scalable (aka active/active in the true sense).

An even more strategic approach would be to establish rules of thumb regarding use of cluster ware, or highly available virtual machines. Establishing service level tiers, that deliver different levels of availability (I’ve based it off the service level provided by vendors in the past).

This would call for standardization of the infrastructure architecture and as a corollary thereof, associated considerations would have to be addressed.

For eg:

  1. What do you need to provide a specific tier of service?
  2. Do you have the budget to meet such requirements (it’s significantly more expensive to provide a 99.99% planned uptime as opposed to 99%)?
  3. Do the business owners consider the app worthy of such a tier?
  4. If not, are you sure that the business owners actually understand the implications of this going down? (How much money are they likely to lose if this app was down – better if we can quantify it)

All these and a slew of other such questions will arise and will have to be responded to. A mature organization will already have worked a lot of these into their standard procedures. But it is good for engineers at all levels of maturity and experience to think of these things. And build an internal methodology, keep refining it, re-tooling it to adapt to changes in the IT landscape. What is valid for an on-premise implementation might not be applicable for something on a public cloud.

Actually this topic gets even more murky when consider public IaaS cloud offerings. Those of us who have predominantly worked in the enterprise, often might find it hard to believe that people actually use public IaaS clouds, because, realistically an on-premise offering can provide better service levels and reliability than these measly virtual machines can provide. While the lure of a compute on demand, pay as you go model is very attractive, it is perhaps not applicable for every scenario.

So then you have adapt your methodology (if there is an acceptable policy to employ public IaaS clouds). More questions around deployments might now need to be prepended to our list above:

  1. How much compute will this application require?
  2. Will it need to be on a public cloud or a VM farm (private cloud) or on bare-metal hardware?

I used to think that almost everyone in this line of business thought about Systems Engineering on these lines. But I was very wrong. I have met and worked with some very smart engineers, who have some vague idea about these topics, or have thought through some aspects of these topics.

Many don’t take the effort to learn how to do a TCO calculation, or show an RoI for projects they are involved in (even to the extent of knowing for their own sakes). As “bean-counter-esque” these subjects might seem, they are very important (imho) towards taking your repertoire as a mature Engineer to the next level.

And Cost of ownership is often times neglected by Engineers. I’ve had heated discussions with my colleagues at times regarding why One technology was chosen/recommended over another.

One that comes to mind was when I was asked to evaluate and recommend one of the two technologies – Veritas Cluster Server and Veritas Storage Foundation vs Sun Cluster/ZFS/UFS combination. The shop at that time was a heavy user of Solaris + SPARC + Veritas Storage Foundation/HA.

The project was very simple – we wanted to reduce the physical footprint of our SPARC infrastructure (then comprised of V490s, V890s, V440s, V240s etc). So we chose to use a lightweight technology called Solaris Containers to achieve that. The apps were primarily DSS type, and batch oriented. We opted to use Sun’s  T5440 servers to virtualize (There were no T4s or T5s in those days, so we took the lumps on the slow single-thread performance in order not really upgrade the performance characteristics of the servers we were P2V’ing, but get a good consolidation ratio).

As a proof of concept, we baked off a pair of two-node cluster using Veritas Cluster Server 5.0 and Sun Cluster 3.2. Functionally, our needs were fairly simple. We needed to ascertain the following –

  • Can we use the cluster software to lash multiple physical nodes into a Zone/Container farm
  • How easy is it to set up, use and maintain
  • What would be our TCO and what would our return on investment be, choosing one technology over another

Both cluster stacks had agents to cluster and manage Solaris Containers and ZFS pools. At the end it boiled down to two things really –

  • The team was more familiar with VCS than Sun Cluster
  • VCS + Veritas Storage Foundation cost 2-3x more than Sun Cluster + ZFS combination

The cost of ownership was an overwhelmingly higher number. On the other hand, while the team (with the exception of myself) wasn’t trained in Sun Cluster, we  weren’t trying to do anything outrageous with the cluster software.

We would have a cluster group for each container we built, that comprised of the ZFS pools that housed the container’s OS + data volumes and a Zone agent that managed the container itself. The setup would then comprise of following steps:

  1. add storage to the cluster
  2. set up the ZFS pool for new container being instantiated
  3. install the container (a 10 minute task for a full root zone)
  4. create a cluster resource group for the container
  5. create a ZFS pool resource (called HAStoragePlus resource in Sun Cluster parlance)
  6. Crete a HA-Zone resource (to manage the zone)
  7. bring the cluster resource group online/enable cluster-based monitoring

These literally require 9-10 commands on the command line. Lather, rinse, repeat. So, when I defended the design, I had all this information as well as the financial data to support why I recommended using Sun Cluster over Veritas. Initially a few colleagues were resistant to the suggestion, over concerns about supporting the solution. But it was a matter of spending 2-3x more up-front and continually for support vs spending $40K on training over one year.

Every organization is different, some have pressures of reducing operational costs, while others reducing capital costs. At the end, the decision makers need to make their decisions based on what’s right for their organization. But providing this degree of detail as to why a solution was recommended helps to eliminate hurdles with greater ease. Unsound arguments usually fall apart when faced with cold, hard data.