Jan 06 2014

In a past life, I worked for a wireless service provider that used a vendor-supplied application to evaluate how much data bandwidth customers were consuming; that usage data was fed to the billing system to form the customers' monthly bills.

The app was poorly written (IMHO) and woefully single-threaded, incapable of leveraging the oodles of compute resources provided in the form of twelve two-node VCS/RHEL clusters. There were two data centers, with one such "cluster of clusters" at each site.

The physical infrastructure was pretty standard and over-engineered to a certain degree (it was built in the 2007-2008 timeframe): HP DL580 G5 servers with 128GB of RAM each, a pair of gigabit Ethernet NICs for cluster interconnects, another pair bonded together as public interfaces, and 4 x 4Gb FC HBAs running through Brocade DCX core switches (dual fabric) to an almost dedicated EMC CLARiiON CX4-960.

The application was basically a bunch of processes that watched traffic as it flowed through the network cores and calculated bandwidth usage based on end-user handset IP addresses (I'm watering it down to keep the narrative fast and fluid).

Each location (separated by 200+ miles) acted as the fault-tolerant component for the other, so traffic was cross-fed between the two data centers over a pair of OC-12 links.

The application was a series of processes/services that formed a queue across the various machines of each cluster (over TCP ports). Even processes within a single physical system communicated via IP addresses and TCP ports.

The problem we started observing a few months after deployment was that data would start queuing up, slowing down and skewing the usage calculations; the implications of which I don't really need to spell out (ridiculous).

The software vendor's standard response (items 1 and 2 below), versus what our own investigation established (items 3 through 6):

  1. It is not an application problem
  2. The various components of the application write into FIFO queues on SAN-based filesystems. The vendor would constantly raise the bogey of the SAN storage being slow: not enough IOPS available and/or poor response times. Their basis for this seemed to be an arbitrary metric of the CPU I/O wait percentage rising above 5% (or perhaps even lower at times).
  3. After much deliberation, poring over the NAR reports of the almost dedicated EMC CX4-960s (and working with EMC), we were able to ascertain that neither the storage arrays nor the SAN were contributing any latency that could explain the app's poor performance.
  4. The processes, being woefully single-threaded, would barely ever use more than 15% of the total CPU available on the servers (each server having 16 cores at its disposal).
  5. Memory usage was nominal and well within acceptable limits.
  6. The network throughput wasn't anywhere near saturation.

We decided to start profiling the application (despite great protestations from the vendor) during normal as well as problematic periods, at the layer where we were seeing the issues as well as at the layers immediately upstream and downstream.

What we observed was that:

  1. Under normal circumstances, the app would spend most of its time in the read, write, send, or recv syscalls.
  2. When the app was performing poorly, it would spend most of its time in the poll syscall. It became apparent that it was waiting on TCP sockets from app instances at the remote site (and the issue was bidirectional); see the sketch below.
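
The profiling didn't require anything exotic; the approach was essentially syscall accounting, sketched here assuming the stock strace on RHEL (the PID placeholder is hypothetical):

$ strace -c -f -p <worker-pid>    # attach and summarize time spent per syscall (Ctrl-C to print the table)
$ strace -f -e trace=network,poll -p <worker-pid>    # or watch the socket/poll activity live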


Once this bit of information was carefully vetted and provided to the vendor, they finally disclosed their minimum throughput requirement: the app needed at least 30 Mbps between sites. The assumption was that on an OC-12 (622 Mbps of bandwidth), 30 Mbps was easily achievable.

However, it turns out that latency plays a huge role in the actual throughput of a WAN connection! (*sarcasm alert*)

The average RTT between the two sites was 30ms. Given that the servers were running RHEL 4.6 with the default (untuned) TCP send and receive buffer sizes of 64K, the 30ms RTT capped throughput at about 17 Mbps.
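
That figure is just the bandwidth-delay product at work: TCP can keep at most one window of unacknowledged data in flight per round trip, so

    max throughput = window / RTT
                   = (65536 bytes x 8 bits/byte) / 0.030 s
                   ≈ 17.5 Mbps

Conversely, sustaining a given rate requires a window of at least rate x RTT.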

It turns out that the methodologies we (*NIX admins) typically use to generate network traffic and measure "throughput" aren't necessarily equipped for wide area networks. For example, SSH has an artificial bottleneck in its code that throttles the channel window size to a default of 64K (in the version of OpenSSH we were using at the time), hardcoded in the channels.h file. Initial tests were indeed baffling, since we could never push past 22 Mbps on the WAN. After a little research we realized that the default TCP window sizes (for passive FTP, scp, etc.) were simply not tuned for high-RTT connections.

Thus began the process of tweaking the buffer sizes and generating synthetic loads using iperf.
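
iperf lets you set the socket buffer size (and hence the effective TCP window) per test, which made it easy to validate the window/RTT math; invocations were along these lines (buffer sizes and host names illustrative):

$ iperf -s -w 512K                            # receiver
$ iperf -c remote-site -w 512K -t 60 -i 10    # sender: 60s test, reports every 10s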

After we established that the default TCP buffer sizes were inadequate, we calculated the buffer sizes required to provide at least 80 Mbps of throughput and implemented them across the environment. The queuing stopped immediately after.
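
For the record, 80 Mbps at a 30ms RTT works out to a minimum window of 80,000,000 bits/s x 0.030 s / 8 = 300,000 bytes, so the buffers had to go well past the 64K defaults. The tuning took this general shape on RHEL (values illustrative, not our exact production settings):

# /etc/sysctl.conf
net.core.rmem_max = 1048576
net.core.wmem_max = 1048576
net.ipv4.tcp_rmem = 4096 262144 1048576    # min / default / max, in bytes
net.ipv4.tcp_wmem = 4096 262144 1048576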

Oct 10 2013

About the STREAM benchmark

http://blogs.utexas.edu/jdm4372/tag/stream-benchmark/

Here’s what the author has to say about the benchmark itself —

What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
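
Concretely, the four kernels it times are simple loops over large double-precision arrays (see the full source at the end of this post):

Copy:   c[j] = a[j]
Scale:  b[j] = scalar * c[j]
Add:    c[j] = a[j] + b[j]
Triad:  a[j] = b[j] + scalar * c[j]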

Leveraging the Parallelization potential of the T4

To run this benchmark, the STREAM program was compiled with both GCC and Solaris Studio 12 (the optimizing, native compiler for Solaris).

A standard compile with the gcc compiler resulted in this —
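
(The exact gcc invocation isn't shown in the transcript; assume a plain 32-bit build along these lines:)

gcc -m32 -O2 stream.c -o stream32    # flags assumed, not from the original transcript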

[jdoe@myserver:~/stream-gcc (52)]
$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1614701 microseconds.
   (= 1614701 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1119.4976       1.7513       1.7151       1.7878
Scale:       1094.2510       1.7722       1.7546       1.7939
Add:         1455.0495       1.9815       1.9793       1.9847
Triad:       1463.1247       1.9774       1.9684       1.9889
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Then we compiled the code using Solaris Studio and immediately saw improvements in memory throughput (without any optimization) —

Unoptimized compile gave --

$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1434242 microseconds.
   (= 1434242 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1322.0838       1.4544       1.4523       1.4573
Scale:       1365.2033       1.4066       1.4064       1.4070
Add:         1968.3168       1.4633       1.4632       1.4637
Triad:       1944.1898       1.4815       1.4813       1.4819
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

After optimization —

Various degrees of optimization resulted in slight variations in performance; the following gave the best results, around 3x the throughput of the unoptimized code:

cc -mt -m32 -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 278639 microseconds.
   (= 278639 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3137.3320       0.6123       0.6120       0.6128
Scale:       3142.1011       0.6119       0.6111       0.6125
Add:         4230.4671       0.6811       0.6808       0.6817
Triad:       4323.3051       0.6667       0.6662       0.6674
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Make it Parallel

Using the Solaris Studio compiler's auto-parallelization flag (-xautopar), it is possible to force a single-threaded app to multi-thread on the CMT platform —

devzone:$(build) # cc -m32 -mt -xautopar -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 133846 microseconds.
   (= 133846 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6126.3741       0.3178       0.3134       0.3267
Scale:       6318.8244       0.3057       0.3039       0.3135
Add:         8280.5469       0.3490       0.3478       0.3508
Triad:       8396.7949       0.3438       0.3430       0.3449
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

This defaults to only 2 threads running in parallel (although the app believes it is using a single thread of execution).

By explicitly setting the following two variables in the parent shell, we were able to get 8 parallel threads of execution, going from ~3 GB/s with a single thread, to 6 GB/s with 2 threads, to 21 GB/s with 8 threads (i.e., utilizing a full core).

$ export PARALLEL=8
[jdoe@myserver:~ (9)]
$  export SUNW_MP_THR_IDLE=8
[jdoe@myserver:~ (10)]
$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43905 microseconds.
   (= 43905 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       21245.0500       0.0914       0.0904       0.0920
Scale:      21816.9850       0.0885       0.0880       0.0908
Add:        28052.9390       0.1032       0.1027       0.1056
Triad:      28368.5107       0.1022       0.1015       0.1065
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~ (11)]

Now running 16 parallel threads (this time with the 64-bit binary and a larger, 600M-element array) —

$ ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 219395 microseconds.
   (= 219395 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32325.0822       0.3009       0.2970       0.3427
Scale:      32666.0515       0.3126       0.2939       0.3858
Add:        40507.6894       0.3741       0.3555       0.4537
Triad:      40263.1710       0.3676       0.3576       0.4074
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~/benchmarks (24)]

While the 16-thread run was going, prstat saw —

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15865 jdoe   74  25 0.0 0.0 0.0 0.4 0.0 0.7  12  40  17   0 stream_64.ap/4
 15865 jdoe   73  26 0.0 0.0 0.0 0.5 0.0 0.6  14  40  17   0 stream_64.ap/11
 15865 jdoe   73  23 0.0 0.0 0.0 3.5 0.0 0.2  14  35  17   0 stream_64.ap/15
 15865 jdoe   73  23 0.0 0.0 0.0 2.9 0.0 1.0  12  40  17   0 stream_64.ap/8
 15865 jdoe   73  23 0.0 0.0 0.0 3.2 0.0 0.7  19  40  24   0 stream_64.ap/10
 15865 jdoe   73  23 0.0 0.0 0.0 3.9 0.0 0.2  14  40  17   0 stream_64.ap/13
 15865 jdoe   73  22 0.0 0.0 0.0 4.2 0.0 0.0  14  40  19   0 stream_64.ap/2
 15865 jdoe   73  22 0.0 0.0 0.0 3.1 0.0 1.1  15  31  19   0 stream_64.ap/6
 15865 jdoe   71  23 0.0 0.0 0.0 5.6 0.0 0.0  10  35 740   0 stream_64.ap/1
 15865 jdoe   71  23 0.0 0.0 0.0 6.0 0.0 0.0  15  35  17   0 stream_64.ap/14
 15865 jdoe   71  23 0.0 0.0 0.0 6.1 0.0 0.0  14  38  19   0 stream_64.ap/5
 15865 jdoe   71  23 0.0 0.0 0.0 6.2 0.0 0.0  15  35  17   0 stream_64.ap/7
 15865 jdoe   71  23 0.0 0.0 0.0 6.3 0.0 0.0  12  35  15   0 stream_64.ap/16
 15865 jdoe   71  22 0.0 0.0 0.0 6.5 0.0 0.0  15  35  22   0 stream_64.ap/9
 15865 jdoe   71  22 0.0 0.0 0.0 6.6 0.0 0.0  19  35  19   0 stream_64.ap/3
 15865 jdoe   71  22 0.0 0.0 0.0 6.8 0.0 0.0  14  37  17   0 stream_64.ap/12
 15182 jdoe  0.6 0.8 0.0 0.0 0.0 0.0  99 0.0   7   1  3K   0 prstat/1
 14998 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  34   0 bash/1
 14996 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  11   0 118   0 sshd/1
 15162 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   8   0 sshd/1
 15164 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 bash/1

  NLWP USERNAME  SWAP   RSS MEMORY      TIME  CPU
    21 jdoe    13G   13G    11%   0:00:48 1.5%

Total: 6 processes, 21 lwps, load averages: 0.23, 0.11, 0.16

The acceleration was astounding.

In elapsed-time terms, with a single thread —

[jdoe@myserver:~/benchmarks (24)]
$ export SUNW_MP_THR_IDLE=1
[jdoe@myserver:~/benchmarks (25)]
$ export PARALLEL=1
[jdoe@myserver:~/benchmarks (26)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1379961 microseconds.
   (= 1379961 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2956.0364       3.2745       3.2476       3.3159
Scale:       3025.7681       3.1895       3.1727       3.2110
Add:         4026.0036       3.5974       3.5767       3.6166
Triad:       4025.0673       3.5911       3.5776       3.6025
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real     4:57.114
user     4:47.825
sys         9.284
[jdoe@myserver:~/benchmarks (27)]
$

With 16 parallel threads —

[jdoe@myserver:~/benchmarks (27)]
$ export PARALLEL=16
[jdoe@myserver:~/benchmarks (28)]
$ export SUNW_MP_THR_IDLE=16
[jdoe@myserver:~/benchmarks (29)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 231461 microseconds.
   (= 231461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32235.5417       0.3057       0.2978       0.3653
Scale:      32646.3996       0.3104       0.2941       0.3647
Add:        40598.9607       0.3722       0.3547       0.4290
Triad:      40255.7375       0.3656       0.3577       0.4070
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real       29.316
user     7:13.691
sys        10.981
[jdoe@myserver:~/benchmarks (30)]
$

See how the "real" time went from nearly 5 minutes to about 29 seconds, a roughly 10x speedup, while the cumulative "user" time rose past 7 minutes: the same work, now spread across 16 hardware threads.

The benchmark program

/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/
/* License:                                                              */
/*  1. You are free to use this program and/or to redistribute           */
/*     this program.                                                     */
/*  2. You are free to modify this program for your own use,             */
/*     including commercial use, subject to the publication              */
/*     restrictions in item 3.                                           */
/*  3. You are free to publish results obtained from running this        */
/*     program, or from works that you derive from this program,         */
/*     with the following limitations:                                   */
/*     3a. In order to be referred to as "STREAM benchmark results",     */
/*         published results must be in conformance to the STREAM        */
/*         Run Rules, (briefly reviewed below) published at              */
/*         http://www.cs.virginia.edu/stream/ref.html                    */
/*         and incorporated herein by reference.                         */
/*         As the copyright holder, John McCalpin retains the            */
/*         right to determine conformity with the Run Rules.             */
/*     3b. Results based on modified source code or on runs not in       */
/*         accordance with the STREAM Run Rules must be clearly          */
/*         labelled whenever they are published.  Examples of            */
/*         proper labelling include:                                     */
/*         "tuned STREAM benchmark results"                              */
/*         "based on a variant of the STREAM benchmark code"             */
/*         Other comparable, clear and reasonable labelling is           */
/*         acceptable.                                                   */
/*     3c. Submission of results to the STREAM benchmark web site        */
/*         is encouraged, but not required.                              */
/*  4. Use of this program or creation of derived works based on this    */
/*     program constitutes acceptance of these licensing restrictions.   */
/*  5. Absolutely no warranty is expressed or implied.                   */
/*-----------------------------------------------------------------------*/
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <stddef.h>
# include <sys/time.h>

/* INSTRUCTIONS:
 *
 *      1) Stream requires a good bit of memory to run.  Adjust the
 *          value of 'N' (below) to give a 'timing calibration' of
 *          at least 20 clock-ticks.  This will provide rate estimates
 *          that should be good to about 5% precision.
 */

#ifndef N
#   define N    120000000
#endif
#ifndef NTIMES
#   define NTIMES       20
#endif
#ifndef OFFSET
#   define OFFSET       0
#endif

/*
 *      3) Compile the code with full optimization.  Many compilers
 *         generate unreasonably bad code before the optimizer tightens
 *         things up.  If the results are unreasonably good, on the
 *         other hand, the optimizer might be too smart for me!
 *
 *         Try compiling with:
 *               cc -O stream_omp.c -o stream_omp
 *
 *         This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 *      4) Mail the results to mccalpin@cs.virginia.edu
 *         Be sure to include:
 *              a) computer hardware model number and software revision
 *              b) the compiler flags
 *              c) all of the output from the test case.
 * Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static double   a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

static double   avgtime[4] = {0}, maxtime[4] = {0},
                mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char     *label[4] = {"Copy:      ", "Scale:     ",
    "Add:       ", "Triad:     "};

static double   bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

extern double mysecond();
extern void checkSTREAMresults();
#ifdef TUNED
extern void tuned_STREAM_Copy();
extern void tuned_STREAM_Scale(double scalar);
extern void tuned_STREAM_Add();
extern void tuned_STREAM_Triad(double scalar);
#endif
#ifdef _OPENMP
extern int omp_get_num_threads();
#endif
int
main()
    {
    int                 quantum, checktick();
    int                 BytesPerWord;
    register int        j, k;
    double              scalar, t, times[4][NTIMES];

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    printf("STREAM version $Revision: 5.9 $\n");
    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
#ifdef NO_LONG_LONG
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
#else
    printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
#endif

    printf("Total memory required = %.1f MB.\n",
        (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

#ifdef _OPENMP
    printf(HLINE);
#pragma omp parallel
    {
#pragma omp master
        {
            k = omp_get_num_threads();
            printf ("Number of Threads requested = %i\n",k);
        }
    }
#endif

    printf(HLINE);
#pragma omp parallel
    {
    printf ("Printing one line per active thread....\n");
    }

    /* Get initial value for system clock. */
#pragma omp parallel for
    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

    printf(HLINE);

    if  ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;
    }

    t = mysecond();
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (mysecond() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t  );
    printf("   (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING -- The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);

    /*  --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        times[0][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Copy();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
#endif
        times[0][k] = mysecond() - times[0][k];

        times[1][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Scale(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
#endif
        times[1][k] = mysecond() - times[1][k];

        times[2][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Add();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
#endif
        times[2][k] = mysecond() - times[2][k];

        times[3][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Triad(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
#endif
        times[3][k] = mysecond() - times[3][k];
        }

    /*  --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
        {
        for (j=0; j<4; j++)
            {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }

    printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
    for (j=0; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               avgtime[j],
               mintime[j],
               maxtime[j]);
    }
    printf(HLINE);

    /* --- Check Results --- */
    checkSTREAMresults();
    printf(HLINE);

    return 0;
}

# define        M       20

int
checktick()
    {
    int         i, minDelta, Delta;
    double      t1, t2, timesfound[M];

/*  Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = mysecond();
        while( ((t2=mysecond()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

/*
 * Determine the minimum difference between these M values.
 * This result will be our estimate (in microseconds) for the
 * clock granularity.
 */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

   return(minDelta);
    }

/* A gettimeofday routine to give access to the wall
   clock timer on most UNIX-like systems.  */

#include <sys/time.h>

double mysecond()
{
        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

void checkSTREAMresults ()
{
        double aj,bj,cj,scalar;
        double asum,bsum,csum;
        double epsilon;
        int     j,k;

    /* reproduce initialization */
        aj = 1.0;
        bj = 2.0;
        cj = 0.0;
    /* a[] is modified during timing check */
        aj = 2.0E0 * aj;
    /* now execute timing loop */
        scalar = 3.0;
        for (k=0; k<NTIMES; k++)
        {
            cj = aj;
            bj = scalar*cj;
            cj = aj+bj;
            aj = bj+scalar*cj;
        }
        aj = aj * (double) (N);
        bj = bj * (double) (N);
        cj = cj * (double) (N);

        asum = 0.0;
        bsum = 0.0;
        csum = 0.0;
        for (j=0; j<N; j++) {
                asum += a[j];
                bsum += b[j];
                csum += c[j];
        }
#ifdef VERBOSE
        printf ("Results Comparison: \n");
        printf ("        Expected  : %f %f %f \n",aj,bj,cj);
        printf ("        Observed  : %f %f %f \n",asum,bsum,csum);
#endif

#ifndef abs
#define abs(a) ((a) >= 0 ? (a) : -(a))
#endif
        epsilon = 1.e-8;

        if (abs(aj-asum)/asum > epsilon) {
                printf ("Failed Validation on array a[]\n");
                printf ("        Expected  : %f \n",aj);
                printf ("        Observed  : %f \n",asum);
        }
        else if (abs(bj-bsum)/bsum > epsilon) {
                printf ("Failed Validation on array b[]\n");
                printf ("        Expected  : %f \n",bj);
                printf ("        Observed  : %f \n",bsum);
        }
        else if (abs(cj-csum)/csum > epsilon) {
                printf ("Failed Validation on array c[]\n");
                printf ("        Expected  : %f \n",cj);
                printf ("        Observed  : %f \n",csum);
        }
        else {
                printf ("Solution Validates\n");
        }
}

void tuned_STREAM_Copy()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
}