Aug 13 2018
 

The tool iperf (or iperf3) is a popular synthetic network load generator that can be used to measure network throughput between two nodes on a TCP/IP network. This is fine for point-to-point measurements. However, when it comes to distributed computing clusters involving multiple racks of nodes, it becomes challenging to find a tool that can easily simulate traffic that will exercise the network infrastructure sufficiently. One could of course always set up Hadoop on the cluster (or, if Hadoop is already installed, run a TeraGen MapReduce job) to generate heavy network traffic. However, we are often required to test and validate the boundaries and limitations of a network, especially the traffic between nodes of the cluster that crosses rack boundaries, before a software stack can be deployed on the infrastructure. It is useful to rely on a simple and well-known tool (hence iperf3), so that it provides a common frame of reference and the tests can easily be reproduced irrespective of the underlying infrastructure (e.g., bare-metal vs private cloud vs public cloud).

The objective of this document is to use the standard and freely available tool iperf3 to run parallel synthetic network loads that test the limits of a distributed cluster network. In other words, given a known set of iperf3 server instances, we randomly run iperf3 client sessions against them simultaneously from multiple hosts. Towards that end, the same hosts can be set up as servers as well as clients (with the caveat that we avoid running a test from any given node against itself).

The methodology introduced in this document can be used to generate synthetic network loads to test East-West network bandwidth. The diagram below shows the recommended (and standard) network topology for a distributed computing cluster (such as Hadoop).

The leaf-to-spine uplink should be sized such that each node in one rack can talk to each node in another rack in a non-blocking fashion. In other words, the E-W oversubscription ratio should be 1:1. With 1:1 oversubscription and one 10 Gbps NIC per node, you should see close to 10 Gbps of bandwidth in these tests when running traffic in parallel across all nodes between two racks.

Practical considerations may result in higher oversubscription ratios of up to 4:1. For really large clusters (more than 500 nodes), this might be higher still (7:1).
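As a rough sanity check on uplink sizing: the required leaf-to-spine capacity scales with node count, NIC speed, and the oversubscription ratio. A hypothetical example (20 nodes per rack, one 10 Gbps NIC each; the numbers are illustrative, not taken from the cluster described here):

$ awk 'BEGIN { nodes=20; nic=10; printf "1:1 needs %d Gbps of uplink per rack; 4:1 needs %d Gbps\n", nodes*nic, nodes*nic/4 }'
1:1 needs 200 Gbps of uplink per rack; 4:1 needs 50 Gbps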

Some caveats need to be kept in mind. Hadoop has traditionally been a shared-nothing architecture, with great emphasis on data locality. What that means is that the worker nodes of a Hadoop cluster have locally attached drives (SATA is the norm), and multiple copies of the data are stored (HDFS defaults to 3 replicas per block).

However, as network hardware has improved in terms of throughput, it has opened up avenues for flexibility vis-a-vis data locality: 25, 40, and 50 Gbps Ethernet networks are fast becoming ubiquitous, with 100 Gbps uplink ports becoming more prevalent in enterprise data centers. As more architectures leverage network-based data storage and retrieval in these distributed computing environments, the need to gravitate towards 1:1 oversubscription will increase correspondingly.

We need the following for this methodology to work —

  • iperf3 installed on all nodes
  • moreutils installed on all nodes
  • A clush (ClusterShell) setup on a control host (a laptop will do; see the installation note below).
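If clush is not already installed on the control host, it is typically available from EPEL or PyPI; for example (assuming a RHEL-family control machine):

$ sudo yum install -y clustershell     # or: pip install ClusterShell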

The clush setup involves creating an entry in the file /etc/clustershell/groups as follows —

 

mc_all: host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com host[302,304,305-333].my.company.com

mc_rack1: host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com

mc_rack2: host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com

mc_rack3: host[302,304,305-333].my.company.com
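It helps to confirm that the group definitions expand as expected before launching anything; a quick check using the groups defined above:

$ nodeset -f @mc_rack1              # fold/expand the group definition
$ clush -g mc_rack1 -b 'hostname'   # run a trivial command and gather the output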

Install iperf3 and moreutils (which provides the ts utility used later for timestamping output) as follows —

# clush -g mc_all 'sudo yum install -y iperf3 moreutils'

After this is done, launch iperf3 in daemon mode on all the hosts, with three instances per host listening on separate ports —

$ clush -g mc_all -l johndoe 'sudo iperf3 -sD -p 5001'

$ clush -g mc_all -l johndoe 'sudo iperf3 -sD -p 5002'

$ clush -g mc_all -l johndoe 'sudo iperf3 -sD -p 5003'
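The three invocations above can also be driven from a loop on the control host; a minimal sketch, assuming the same group and user:

for port in 5001 5002 5003; do
    clush -g mc_all -l johndoe "sudo iperf3 -sD -p ${port}"
done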

Verify that the iperf3 Daemons are running —

$ clush -g mc_all -l johndoe 'ps -ef|grep iperf3|grep -v grep'

host102.my.company.com: root      10245 1 14 12:50 ? 00:06:42 iperf3 -sD -p 5001

host106.my.company.com: root       9495 1 12 12:50 ? 00:05:36 iperf3 -sD -p 5001

host104.my.company.com: root       9554 1 9 12:50 ? 00:04:20 iperf3 -sD -p 5001

<truncated for readability>

host315.my.company.com: root      33247 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host224.my.company.com: root      33136 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host323.my.company.com: root      33257 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host318.my.company.com: root      32868 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host236.my.company.com: root      33470 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host236.my.company.com: johndoe   33734 33492 22 13:34 ? 00:00:10 iperf3 -c host120.my.company.com -p 5001 -t 60
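Optionally, also confirm that all three ports are listening on every node. One way to do this (assuming ss is available; netstat -ltn works similarly on older distributions):

$ clush -g mc_all -l johndoe 'ss -ltn | grep -E ":500[123]"'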

 

After all the nodes have been set up to run iperf3 server instances in daemon mode, create multiple nodelist files. In this case we are testing cross-rack bandwidth, so we set up one nodelist file per rack as follows —

$ ls -lrt 

total 32

-rw-r--r--  1 johndoe  staff 806 Aug  9 14:52 nodes.rack3

-rw-r--r--  1 johndoe  staff 520 Aug  9 14:52 nodes.rack2

-rw-r--r--  1 johndoe  staff 520 Aug  9 14:52 nodes.rack1

-rwxr-xr-x  1 johndoe  staff 221 Aug  9 14:52 random_net.sh

-rwxr-xr-x  1 johndoe  staff 221 Aug  9 14:52 randomnet_single.sh

Distribute these files to all nodes of the cluster. The assumption is that you already have clush set up (so you have the keys to the kingdom, so to speak: a user ID on each node with sudo privileges as root, etc.).

$ clush -g mc_all -c nodes.rack* --dest=/home/johndoe

$ clush -g mc_all -c randomnet*.sh --dest=/home/johndoe

Each file contains a list of all the nodes in a specific rack, e.g.:

$ cat nodes.rack1 

host102.my.company.com

host104.my.company.com

host106.my.company.com

host108.my.company.com 

host110.my.company.com

The shell script random_net.sh randomly selects a target node against which to run the iperf3 test. It is run on the client side, say from Rack2 (clients) against Rack1 (servers) —

#!/bin/bash
# bash is required for arrays and $RANDOM

# Read the nodelist (one hostname per line) into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat "$1"))'

# Ports on which the iperf3 server daemon instances are running
ports=(5001 5002 5003)
psize=${#ports[@]}
size=${#nodes[@]}

for i in $(seq $size)
do
    index=$(( RANDOM % size ))       # pick a random target node
    pindex=$(( RANDOM % psize ))     # pick a random server port
    target=${nodes[$index]}
    ptarget=${ports[$pindex]}
    iperf3 -c "$target" -p $ptarget    | ts | tee -a ~/iperf3.log   # run payload from client to server
    iperf3 -c "$target" -R -p $ptarget | ts | tee -a ~/iperf3.log   # reverse the direction
done

In order to run against a single iperf3 server instance per node, use the following script (randomnet_single.sh) instead —

#!/bin/bash
# bash is required for arrays and $RANDOM

# Read the nodelist (one hostname per line) into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat "$1"))'

size=${#nodes[@]}

for i in $(seq $size)
do
    index=$(( RANDOM % size ))       # pick a random target node
    target=${nodes[$index]}
    iperf3 -c "$target" -p 5001    | ts | tee -a ~/${target}_iperf3.log   # run payload from client to server
    iperf3 -c "$target" -R -p 5001 | ts | tee -a ~/${target}_iperf3.log   # reverse the direction
done

Run the script(s) as follows —

$ clush -g mc_rack2 -l johndoe 'sh random_net.sh nodes.rack1'

Or

$ clush -g mc_rack2 -l johndoe 'sh randomnet_single.sh nodes.rack1'

This example shows the test running in parallel on all nodes of Rack2, with each node randomly selecting a node in Rack1 as the target. The output of the tests can be parsed to generate usable reports that help ascertain whether there is a bottleneck anywhere.

$ clush -g mc_rack2 -l johndoe 'sh random_net.sh nodes.rack1'

host218.my.company.com: Aug 09 14:31:17 Connecting to host host108.my.company.com, port 5001

host218.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.59 port 54962 connected to 192.168.1.14 port 5001

host218.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr Cwnd

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-1.00 sec 646 MBytes 5.42 Gbits/sec 80   595 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   1.00-2.00 sec 542 MBytes 4.55 Gbits/sec 50   561 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   2.00-3.00 sec 574 MBytes 4.81 Gbits/sec 29   577 KBytes 

host204.my.company.com: Aug 09 14:31:17 Connecting to host host140.my.company.com, port 5001

host204.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.52 port 47034 connected to 192.168.1.30 port 5001

host204.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr Cwnd

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-1.00 sec 870 MBytes 7.30 Gbits/sec 38   799 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   1.00-2.00 sec 626 MBytes 5.25 Gbits/sec 28   454 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   2.00-3.00 sec 516 MBytes 4.33 Gbits/sec 19   512 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   3.00-4.00 sec 590 MBytes 4.95 Gbits/sec 19   656 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   4.00-5.00 sec 581 MBytes 4.88 Gbits/sec 88   649 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   5.00-6.00 sec 570 MBytes 4.78 Gbits/sec 19   592 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   6.00-7.00 sec 561 MBytes 4.71 Gbits/sec 41   560 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   7.00-8.00 sec 589 MBytes 4.94 Gbits/sec 91   563 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   8.00-9.00 sec 539 MBytes 4.52 Gbits/sec 46   479 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   9.00-10.00 sec 570 MBytes 4.78 Gbits/sec 68   607 KBytes 

host204.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -

host204.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec 457             sender

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec              receiver

host204.my.company.com: Aug 09 14:31:17 

host204.my.company.com: Aug 09 14:31:17 iperf Done.

host218.my.company.com: Aug 09 14:31:17 [ 4]   3.00-4.00 sec 636 MBytes 5.34 Gbits/sec 12   484 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   4.00-5.00 sec 508 MBytes 4.26 Gbits/sec 65   433 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   5.00-6.00 sec 384 MBytes 3.22 Gbits/sec 62   566 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   6.00-7.00 sec 632 MBytes 5.30 Gbits/sec 69   519 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   7.00-8.00 sec 595 MBytes 4.99 Gbits/sec 30   650 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   8.00-9.00 sec 564 MBytes 4.73 Gbits/sec 45   478 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   9.00-10.00 sec 525 MBytes 4.40 Gbits/sec 26   444 KBytes 

host218.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -

host218.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.47 GBytes 4.70 Gbits/sec 468             sender

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.47 GBytes 4.70 Gbits/sec              receiver
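Once the runs complete, the end-of-run summary lines can be pulled out of the per-node logs for a quick report. A minimal sketch, assuming the log location used by random_net.sh above:

$ clush -g mc_rack2 -l johndoe "grep -E 'sender|receiver' ~/iperf3.log"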

 

NOTE: This blog documents my personal opinions only and does not reflect my employer’s positions on any subject written about here. 

Jan 06 2014
 

In a past life, when I worked for a wireless service provider, we used a vendor application to evaluate how much data bandwidth customers were consuming; that data was fed to the billing system to form the customers’ monthly bills.

The app was poorly written (IMHO) and woefully single-threaded, incapable of leveraging the oodles of compute resources provided in the form of twelve two-node VCS/RHEL-based clusters. There were two data centers, with one such “cluster of clusters” at each site.

The physical infrastructure was pretty standard and over-engineered to a certain degree (it was built in the 2007-2008 timeframe): HP DL580 G5 servers with 128 GB of RAM each, a pair of gigabit Ethernet NICs for cluster interconnects, another pair bonded together as public interfaces, and 4 x 4 Gb FC HBAs through Brocade DCX core switches (dual fabric) to an almost dedicated EMC CLARiiON CX4-960.

The application was basically a bunch of processes that watched traffic as it flowed through the network cores and calculated bandwidth usage based on end-user handset IP addresses (I’m watering it down to keep the narrative fast and fluid).

Each location (separated by 200+ miles) acted as the fault-tolerant component for the other, so traffic was cross-fed between the two data centers over a pair of OC12 links.

The application was a series of processes/services that formed a queue across the various machines of each cluster (over TCP ports). Even processes within a single physical system communicated via IP addresses and TCP ports.

The problem we started observing within a few months of deployment was that data would start queuing up, slowing down and skewing the usage calculations, the implications of which I don’t have to spell out (ridiculous).

The software vendor’s standard response was:

  1. It is not an application problem.
  2. The various components of the application wrote into FIFO queues on SAN-based filesystems, and the vendor would constantly raise the bogey of SAN storage being slow, there not being enough IOPS available, and/or response times being poor. Their basis for this claim seemed to be an arbitrary metric of CPU I/O wait percentage rising above 5% (or perhaps even lower at times).

Our own investigation told a different story:

  1. After much deliberation poring over the NAR reports of the almost dedicated EMC CX4-960s (and working with EMC), we were able to ascertain that the storage arrays and the SAN were not in any way contributing latency that could result in poor performance of this app.
  2. The processes, being woefully single-threaded, would barely ever use more than 15% of the total CPU available on the servers (each server having 16 cores at its disposal).
  3. Memory usage was nominal and well within acceptable limits.
  4. The network throughput wasn’t anywhere near saturation.

We decided to start profiling the application (despite great protestations from the vendor) during normal as well as problematic periods, at the layer where we were seeing the issues, as well as at the layers immediately upstream and downstream.

What we observed was that:

  1. Under normal circumstances, the app would spend most of its time in read, write, send, or recv syscalls.
  2. When the app was performing poorly, it would spend most of its time in the poll syscall. It became apparent that it was waiting on TCP sockets from app instances at the remote site (and the issue was bidirectional).

 

Once this bit of information was carefully vetted and provided to the vendor, they decided to give us their minimum throughput requirement: 30 Mbps. The assumption was that on an OC12 (622 Mbps of bandwidth), 30 Mbps was quite achievable.

However, it so turns out that latency plays a huge role in the actual throughput on a WAN connection! (*sarcasm alert*)

The average RTT between the two sites was 30 ms. Given that the servers were running RHEL 4.6 with default (untuned) TCP send and receive buffer sizes of 64 KB, the 30 ms RTT capped the maximum throughput of a single connection at about 17 Mbps.
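The arithmetic behind that number is simply the window size divided by the round-trip time; a quick back-of-the-envelope check, assuming a 64 KiB window:

$ awk 'BEGIN { printf "%.1f Mbps\n", (64 * 1024 * 8) / 0.030 / 1e6 }'
17.5 Mbps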

It turns out that the methodologies we (*NIX admins) would typically use to generate network traffic and measure “throughput” aren’t necessarily equipped to handle wide-area networks. For example, SSH has an artificial bottleneck in its code that caps its window at a default of 64 KB (in the version of OpenSSH we were using at the time), as hardcoded in the channels.h file. Initial tests were indeed baffling, since we could never exceed 22 Mbps of throughput on the WAN. After a little research we realized that the default TCP window/buffer sizes (for passive FTP, scp, etc.) were not really tuned for high-RTT connections.

Thus began the process of tweaking buffer sizes and generating synthetic loads using iperf.

After we established that the default TCP buffer sizes were inadequate, we calculated the buffer sizes required to provide at least 80 Mbps of throughput and implemented them across the environment. The queuing stopped immediately thereafter.
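For reference, the required buffer size is just the bandwidth-delay product: 80 Mbps at a 30 ms RTT works out to roughly 300 KB, far above the 64 KB defaults. The sketch below shows the kind of tuning involved; the sysctl names are standard Linux knobs, but the values shown are illustrative rather than the exact ones we deployed:

$ awk 'BEGIN { printf "%.0f KB\n", (80e6 * 0.030 / 8) / 1024 }'
293 KB
# raise the maximum socket buffer sizes and the TCP autotuning ceilings
$ sudo sysctl -w net.core.rmem_max=4194304
$ sudo sysctl -w net.core.wmem_max=4194304
$ sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 4194304'
$ sudo sysctl -w net.ipv4.tcp_wmem='4096 65536 4194304'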

Oct 10 2013
 
About the STREAM benchmark

http://blogs.utexas.edu/jdm4372/tag/stream-benchmark/

Here’s what the author has to say about the benchmark itself —

What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/
/* License:                                                              */
/*  1. You are free to use this program and/or to redistribute           */
/*     this program.                                                     */
/*  2. You are free to modify this program for your own use,             */
/*     including commercial use, subject to the publication              */
/*     restrictions in item 3.                                           */
/*  3. You are free to publish results obtained from running this        */
/*     program, or from works that you derive from this program,         */
/*     with the following limitations:                                   */
/*     3a. In order to be referred to as "STREAM benchmark results",     */
/*         published results must be in conformance to the STREAM        */
/*         Run Rules, (briefly reviewed below) published at              */
/*         http://www.cs.virginia.edu/stream/ref.html                    */
/*         and incorporated herein by reference.                         */
/*         As the copyright holder, John McCalpin retains the            */
/*         right to determine conformity with the Run Rules.             */
/*     3b. Results based on modified source code or on runs not in       */
/*         accordance with the STREAM Run Rules must be clearly          */
/*         labelled whenever they are published.  Examples of            */
/*         proper labelling include:                                     */
/*         "tuned STREAM benchmark results"                              */
/*         "based on a variant of the STREAM benchmark code"             */
/*         Other comparable, clear and reasonable labelling is           */
/*         acceptable.                                                   */
/*     3c. Submission of results to the STREAM benchmark web site        */
/*         is encouraged, but not required.                              */
/*  4. Use of this program or creation of derived works based on this    */
/*     program constitutes acceptance of these licensing restrictions.   */
/*  5. Absolutely no warranty is expressed or implied.                   */
/*-----------------------------------------------------------------------*/
Leveraging the Parallelization potential of the T4

In order to run this benchmark, the STREAM program was compiled with GCC as well as Solaris Studio 12 (the optimizing, native compiler for Solaris).
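(The gcc side was a plain 32-bit build; the exact flags are not recorded here, but it was along these lines:)

$ gcc -m32 -O2 stream.c -o stream32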

A standard compile with the gcc compiler resulted in this —

[jdoe@myserver:~/stream-gcc (52)]
$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1614701 microseconds.
   (= 1614701 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1119.4976       1.7513       1.7151       1.7878
Scale:       1094.2510       1.7722       1.7546       1.7939
Add:         1455.0495       1.9815       1.9793       1.9847
Triad:       1463.1247       1.9774       1.9684       1.9889
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Then we compiled the code using Solaris Studio and immediately saw improvements in memory throughput (without any optimization) —

Unoptimized compile gave --

$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1434242 microseconds.
   (= 1434242 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1322.0838       1.4544       1.4523       1.4573
Scale:       1365.2033       1.4066       1.4064       1.4070
Add:         1968.3168       1.4633       1.4632       1.4637
Triad:       1944.1898       1.4815       1.4813       1.4819
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

After optimization —

Various degrees of optimization resulted in slight variations in performance; the following gave the best results, around 3x the throughput of the unoptimized code —

cc -mt -m32 -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 278639 microseconds.
   (= 278639 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3137.3320       0.6123       0.6120       0.6128
Scale:       3142.1011       0.6119       0.6111       0.6125
Add:         4230.4671       0.6811       0.6808       0.6817
Triad:       4323.3051       0.6667       0.6662       0.6674
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Make it Parallel

Using the Solaris Studio compiler (with the -xautopar flag), it is possible to force a single-threaded app to multi-thread on the CMT platform —

devzone:$(build) # cc -m32 -mt -xautopar -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 133846 microseconds.
   (= 133846 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6126.3741       0.3178       0.3134       0.3267
Scale:       6318.8244       0.3057       0.3039       0.3135
Add:         8280.5469       0.3490       0.3478       0.3508
Triad:       8396.7949       0.3438       0.3430       0.3449
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

This defaults to only 2 threads running in parallel (even though the app thinks it is using a single thread of execution).

Now, by explicitly setting the following two variables in the parent shell, we were able to get 8 parallel threads of execution, taking memory throughput from roughly 3 GB/s with a single thread, to about 6 GB/s with 2 threads, to around 21 GB/s with 8 threads (i.e., utilizing a full core).

$ export PARALLEL=8
[jdoe@myserver:~ (9)]
$  export SUNW_MP_THR_IDLE=8
[jdoe@myserver:~ (10)]
$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43905 microseconds.
   (= 43905 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       21245.0500       0.0914       0.0904       0.0920
Scale:      21816.9850       0.0885       0.0880       0.0908
Add:        28052.9390       0.1032       0.1027       0.1056
Triad:      28368.5107       0.1022       0.1015       0.1065
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~ (11)]

Now running 16 parallel threads —

$ ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 219395 microseconds.
   (= 219395 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32325.0822       0.3009       0.2970       0.3427
Scale:      32666.0515       0.3126       0.2939       0.3858
Add:        40507.6894       0.3741       0.3555       0.4537
Triad:      40263.1710       0.3676       0.3576       0.4074
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~/benchmarks (24)]

While prstat sees —

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15865 jdoe   74  25 0.0 0.0 0.0 0.4 0.0 0.7  12  40  17   0 stream_64.ap/4
 15865 jdoe   73  26 0.0 0.0 0.0 0.5 0.0 0.6  14  40  17   0 stream_64.ap/11
 15865 jdoe   73  23 0.0 0.0 0.0 3.5 0.0 0.2  14  35  17   0 stream_64.ap/15
 15865 jdoe   73  23 0.0 0.0 0.0 2.9 0.0 1.0  12  40  17   0 stream_64.ap/8
 15865 jdoe   73  23 0.0 0.0 0.0 3.2 0.0 0.7  19  40  24   0 stream_64.ap/10
 15865 jdoe   73  23 0.0 0.0 0.0 3.9 0.0 0.2  14  40  17   0 stream_64.ap/13
 15865 jdoe   73  22 0.0 0.0 0.0 4.2 0.0 0.0  14  40  19   0 stream_64.ap/2
 15865 jdoe   73  22 0.0 0.0 0.0 3.1 0.0 1.1  15  31  19   0 stream_64.ap/6
 15865 jdoe   71  23 0.0 0.0 0.0 5.6 0.0 0.0  10  35 740   0 stream_64.ap/1
 15865 jdoe   71  23 0.0 0.0 0.0 6.0 0.0 0.0  15  35  17   0 stream_64.ap/14
 15865 jdoe   71  23 0.0 0.0 0.0 6.1 0.0 0.0  14  38  19   0 stream_64.ap/5
 15865 jdoe   71  23 0.0 0.0 0.0 6.2 0.0 0.0  15  35  17   0 stream_64.ap/7
 15865 jdoe   71  23 0.0 0.0 0.0 6.3 0.0 0.0  12  35  15   0 stream_64.ap/16
 15865 jdoe   71  22 0.0 0.0 0.0 6.5 0.0 0.0  15  35  22   0 stream_64.ap/9
 15865 jdoe   71  22 0.0 0.0 0.0 6.6 0.0 0.0  19  35  19   0 stream_64.ap/3
 15865 jdoe   71  22 0.0 0.0 0.0 6.8 0.0 0.0  14  37  17   0 stream_64.ap/12
 15182 jdoe  0.6 0.8 0.0 0.0 0.0 0.0  99 0.0   7   1  3K   0 prstat/1
 14998 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  34   0 bash/1
 14996 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  11   0 118   0 sshd/1
 15162 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   8   0 sshd/1
 15164 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 bash/1

  NLWP USERNAME  SWAP   RSS MEMORY      TIME  CPU
    21 jdoe    13G   13G    11%   0:00:48 1.5%

Total: 6 processes, 21 lwps, load averages: 0.23, 0.11, 0.16

The acceleration was astounding.

In time elapsed, with single thread —

[jdoe@myserver:~/benchmarks (24)]
$ export SUNW_MP_THR_IDLE=1
[jdoe@myserver:~/benchmarks (25)]
$ export PARALLEL=1
[jdoe@myserver:~/benchmarks (26)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1379961 microseconds.
   (= 1379961 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2956.0364       3.2745       3.2476       3.3159
Scale:       3025.7681       3.1895       3.1727       3.2110
Add:         4026.0036       3.5974       3.5767       3.6166
Triad:       4025.0673       3.5911       3.5776       3.6025
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real     4:57.114
user     4:47.825
sys         9.284
[jdoe@myserver:~/benchmarks (27)]
$

With 16 parallel threads —

[jdoe@myserver:~/benchmarks (27)]
$ export PARALLEL=16
[jdoe@myserver:~/benchmarks (28)]
$ export SUNW_MP_THR_IDLE=16
[jdoe@myserver:~/benchmarks (29)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 231461 microseconds.
   (= 231461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32235.5417       0.3057       0.2978       0.3653
Scale:      32646.3996       0.3104       0.2941       0.3647
Add:        40598.9607       0.3722       0.3547       0.4290
Triad:      40255.7375       0.3656       0.3577       0.4070
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real       29.316
user     7:13.691
sys        10.981
[jdoe@myserver:~/benchmarks (30)]
$

See how the “real” time went from 5 minutes to 30s.

The benchmark program
/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/
/* License:                                                              */
/*  1. You are free to use this program and/or to redistribute           */
/*     this program.                                                     */
/*  2. You are free to modify this program for your own use,             */
/*     including commercial use, subject to the publication              */
/*     restrictions in item 3.                                           */
/*  3. You are free to publish results obtained from running this        */
/*     program, or from works that you derive from this program,         */
/*     with the following limitations:                                   */
/*     3a. In order to be referred to as "STREAM benchmark results",     */
/*         published results must be in conformance to the STREAM        */
/*         Run Rules, (briefly reviewed below) published at              */
/*         http://www.cs.virginia.edu/stream/ref.html                    */
/*         and incorporated herein by reference.                         */
/*         As the copyright holder, John McCalpin retains the            */
/*         right to determine conformity with the Run Rules.             */
/*     3b. Results based on modified source code or on runs not in       */
/*         accordance with the STREAM Run Rules must be clearly          */
/*         labelled whenever they are published.  Examples of            */
/*         proper labelling include:                                     */
/*         "tuned STREAM benchmark results"                              */
/*         "based on a variant of the STREAM benchmark code"             */
/*         Other comparable, clear and reasonable labelling is           */
/*         acceptable.                                                   */
/*     3c. Submission of results to the STREAM benchmark web site        */
/*         is encouraged, but not required.                              */
/*  4. Use of this program or creation of derived works based on this    */
/*     program constitutes acceptance of these licensing restrictions.   */
/*  5. Absolutely no warranty is expressed or implied.                   */
/*-----------------------------------------------------------------------*/
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <stddef.h>
# include <sys/time.h>

/* INSTRUCTIONS:
 *
 *      1) Stream requires a good bit of memory to run.  Adjust the
 *          value of 'N' (below) to give a 'timing calibration' of
 *          at least 20 clock-ticks.  This will provide rate estimates
 *          that should be good to about 5% precision.
 */

#ifndef N
#   define N    120000000
#endif
#ifndef NTIMES
#   define NTIMES       20
#endif
#ifndef OFFSET
#   define OFFSET       0
#endif

/*
 *      3) Compile the code with full optimization.  Many compilers
 *         generate unreasonably bad code before the optimizer tightens
 *         things up.  If the results are unreasonably good, on the
 *         other hand, the optimizer might be too smart for me!
 *
 *         Try compiling with:
 *               cc -O stream_omp.c -o stream_omp
 *
 *         This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 *      4) Mail the results to mccalpin@cs.virginia.edu
 *         Be sure to include:
 *              a) computer hardware model number and software revision
 *              b) the compiler flags
 *              c) all of the output from the test case.
 * Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static double   a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

static double   avgtime[4] = {0}, maxtime[4] = {0},
                mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char     *label[4] = {"Copy:      ", "Scale:     ",
    "Add:       ", "Triad:     "};

static double   bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

extern double mysecond();
extern void checkSTREAMresults();
#ifdef TUNED
extern void tuned_STREAM_Copy();
extern void tuned_STREAM_Scale(double scalar);
extern void tuned_STREAM_Add();
extern void tuned_STREAM_Triad(double scalar);
#endif
#ifdef _OPENMP
extern int omp_get_num_threads();
#endif
int
main()
    {
    int                 quantum, checktick();
    int                 BytesPerWord;
    register int        j, k;
    double              scalar, t, times[4][NTIMES];

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    printf("STREAM version $Revision: 5.9 $\n");
    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
#ifdef NO_LONG_LONG
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
#else
    printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
#endif

    printf("Total memory required = %.1f MB.\n",
        (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

#ifdef _OPENMP
    printf(HLINE);
#pragma omp parallel
    {
#pragma omp master
        {
            k = omp_get_num_threads();
            printf ("Number of Threads requested = %i\n",k);
        }
    }
#endif

    printf(HLINE);
#pragma omp parallel
    {
    printf ("Printing one line per active thread....\n");
    }

    /* Get initial value for system clock. */
#pragma omp parallel for
    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

    printf(HLINE);

    if  ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;
    }

    t = mysecond();
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (mysecond() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t  );
    printf("   (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING -- The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);

    /*  --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        times[0][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Copy();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
#endif
        times[0][k] = mysecond() - times[0][k];

        times[1][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Scale(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
#endif
        times[1][k] = mysecond() - times[1][k];

        times[2][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Add();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
#endif
        times[2][k] = mysecond() - times[2][k];

        times[3][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Triad(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
#endif
        times[3][k] = mysecond() - times[3][k];
        }

    /*  --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
        {
        for (j=0; j<4; j++)
            {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }

    printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
    for (j=0; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               avgtime[j],
               mintime[j],
               maxtime[j]);
    }
    printf(HLINE);

    /* --- Check Results --- */
    checkSTREAMresults();
    printf(HLINE);

    return 0;
}

# define        M       20

int
checktick()
    {
    int         i, minDelta, Delta;
    double      t1, t2, timesfound[M];

/*  Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = mysecond();
        while( ((t2=mysecond()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

/*
 * Determine the minimum difference between these M values.
 * This result will be our estimate (in microseconds) for the
 * clock granularity.
 */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

   return(minDelta);
    }

/* A gettimeofday routine to give access to the wall
   clock timer on most UNIX-like systems.  */

#include <sys/time.h>

double mysecond()
{
        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

void checkSTREAMresults ()
{
        double aj,bj,cj,scalar;
        double asum,bsum,csum;
        double epsilon;
        int     j,k;

    /* reproduce initialization */
        aj = 1.0;
        bj = 2.0;
        cj = 0.0;
    /* a[] is modified during timing check */
        aj = 2.0E0 * aj;
    /* now execute timing loop */
        scalar = 3.0;
        for (k=0; k<NTIMES; k++)
        {
            cj = aj;
            bj = scalar*cj;
            cj = aj+bj;
            aj = bj+scalar*cj;
        }
        aj = aj * (double) (N);
        bj = bj * (double) (N);
        cj = cj * (double) (N);

        asum = 0.0;
        bsum = 0.0;
        csum = 0.0;
        for (j=0; j<N; j++) {
                asum += a[j];
                bsum += b[j];
                csum += c[j];
        }
#ifdef VERBOSE
        printf ("Results Comparison: \n");
        printf ("        Expected  : %f %f %f \n",aj,bj,cj);
        printf ("        Observed  : %f %f %f \n",asum,bsum,csum);
#endif

#ifndef abs
#define abs(a) ((a) >= 0 ? (a) : -(a))
#endif
        epsilon = 1.e-8;

        if (abs(aj-asum)/asum > epsilon) {
                printf ("Failed Validation on array a[]\n");
                printf ("        Expected  : %f \n",aj);
                printf ("        Observed  : %f \n",asum);
        }
        else if (abs(bj-bsum)/bsum > epsilon) {
                printf ("Failed Validation on array b[]\n");
                printf ("        Expected  : %f \n",bj);
                printf ("        Observed  : %f \n",bsum);
        }
        else if (abs(cj-csum)/csum > epsilon) {
                printf ("Failed Validation on array c[]\n");
                printf ("        Expected  : %f \n",cj);
                printf ("        Observed  : %f \n",csum);
        }
        else {
                printf ("Solution Validates\n");
        }
}

void tuned_STREAM_Copy()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
}