Oct 102013
 
About the STREAM benchmark

http://blogs.utexas.edu/jdm4372/tag/stream-benchmark/

Here’s what the author has to say about the benchmark itself —

What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (inĀ MB/s) and the corresponding computation rate for simple vector kernels.

/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/
/* License:                                                              */
/*  1. You are free to use this program and/or to redistribute           */
/*     this program.                                                     */
/*  2. You are free to modify this program for your own use,             */
/*     including commercial use, subject to the publication              */
/*     restrictions in item 3.                                           */
/*  3. You are free to publish results obtained from running this        */
/*     program, or from works that you derive from this program,         */
/*     with the following limitations:                                   */
/*     3a. In order to be referred to as "STREAM benchmark results",     */
/*         published results must be in conformance to the STREAM        */
/*         Run Rules, (briefly reviewed below) published at              */
/*         http://www.cs.virginia.edu/stream/ref.html                    */
/*         and incorporated herein by reference.                         */
/*         As the copyright holder, John McCalpin retains the            */
/*         right to determine conformity with the Run Rules.             */
/*     3b. Results based on modified source code or on runs not in       */
/*         accordance with the STREAM Run Rules must be clearly          */
/*         labelled whenever they are published.  Examples of            */
/*         proper labelling include:                                     */
/*         "tuned STREAM benchmark results"                              */
/*         "based on a variant of the STREAM benchmark code"             */
/*         Other comparable, clear and reasonable labelling is           */
/*         acceptable.                                                   */
/*     3c. Submission of results to the STREAM benchmark web site        */
/*         is encouraged, but not required.                              */
/*  4. Use of this program or creation of derived works based on this    */
/*     program constitutes acceptance of these licensing restrictions.   */
/*  5. Absolutely no warranty is expressed or implied.                   */
/*-----------------------------------------------------------------------*/
Leveraging the Parallelization potential of the T4

In order to run this benchmark, the stream benchmark program was compiled with GCC as well as SolarisStudio 12 (the optimized, native compiler for Solaris).

A standard compile with the gcc compiler resulted in this —

[jdoe@myserver:~/stream-gcc (52)]
$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1614701 microseconds.
   (= 1614701 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1119.4976       1.7513       1.7151       1.7878
Scale:       1094.2510       1.7722       1.7546       1.7939
Add:         1455.0495       1.9815       1.9793       1.9847
Triad:       1463.1247       1.9774       1.9684       1.9889
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Then we compiled the code using Solaris studio and immediately saw improvements in Memory throughput (without any optimization) —

Unoptimized compile gave --

$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1434242 microseconds.
   (= 1434242 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1322.0838       1.4544       1.4523       1.4573
Scale:       1365.2033       1.4066       1.4064       1.4070
Add:         1968.3168       1.4633       1.4632       1.4637
Triad:       1944.1898       1.4815       1.4813       1.4819
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

After optimization —

Various degrees of optimization resulted in slight variations of performance (the following gave best results which was around 3x of unoptimized code)

cc -mt -m32 -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 278639 microseconds.
   (= 278639 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3137.3320       0.6123       0.6120       0.6128
Scale:       3142.1011       0.6119       0.6111       0.6125
Add:         4230.4671       0.6811       0.6808       0.6817
Triad:       4323.3051       0.6667       0.6662       0.6674
-------------------------------------------------------------
Solution Validates
Make it Parallel

Using the sunstudio compiler, it is possible to force a single-threaded app to multi-thread on the CMT platform —

devzone:$(build) # cc -m32 -mt -xautopar -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 133846 microseconds.
   (= 133846 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6126.3741       0.3178       0.3134       0.3267
Scale:       6318.8244       0.3057       0.3039       0.3135
Add:         8280.5469       0.3490       0.3478       0.3508
Triad:       8396.7949       0.3438       0.3430       0.3449
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

This defaults to only 2 threads running in parallel (albeit the app thinks it is using a single thread of execution)

Now explicitly setting the following two variables in the parent shell, we were able to get 8 parallel threads of execution, effectively getting around 3x higher memory throughput (going from ~ 3GB/s with single thread to 6GB/s with 2 threads to 21GB/s with 8 threads — ie utilizing a full core)

$ export PARALLEL=8
[jdoe@myserver:~ (9)]
$  export SUNW_MP_THR_IDLE=8
[jdoe@myserver:~ (10)]
$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43905 microseconds.
   (= 43905 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       21245.0500       0.0914       0.0904       0.0920
Scale:      21816.9850       0.0885       0.0880       0.0908
Add:        28052.9390       0.1032       0.1027       0.1056
Triad:      28368.5107       0.1022       0.1015       0.1065
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~ (11)]

Now running 16 parallel threads —

$ ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 219395 microseconds.
   (= 219395 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32325.0822       0.3009       0.2970       0.3427
Scale:      32666.0515       0.3126       0.2939       0.3858
Add:        40507.6894       0.3741       0.3555       0.4537
Triad:      40263.1710       0.3676       0.3576       0.4074
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~/benchmarks (24)]

While prstat sees —

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15865 jdoe   74  25 0.0 0.0 0.0 0.4 0.0 0.7  12  40  17   0 stream_64.ap/4
 15865 jdoe   73  26 0.0 0.0 0.0 0.5 0.0 0.6  14  40  17   0 stream_64.ap/11
 15865 jdoe   73  23 0.0 0.0 0.0 3.5 0.0 0.2  14  35  17   0 stream_64.ap/15
 15865 jdoe   73  23 0.0 0.0 0.0 2.9 0.0 1.0  12  40  17   0 stream_64.ap/8
 15865 jdoe   73  23 0.0 0.0 0.0 3.2 0.0 0.7  19  40  24   0 stream_64.ap/10
 15865 jdoe   73  23 0.0 0.0 0.0 3.9 0.0 0.2  14  40  17   0 stream_64.ap/13
 15865 jdoe   73  22 0.0 0.0 0.0 4.2 0.0 0.0  14  40  19   0 stream_64.ap/2
 15865 jdoe   73  22 0.0 0.0 0.0 3.1 0.0 1.1  15  31  19   0 stream_64.ap/6
 15865 jdoe   71  23 0.0 0.0 0.0 5.6 0.0 0.0  10  35 740   0 stream_64.ap/1
 15865 jdoe   71  23 0.0 0.0 0.0 6.0 0.0 0.0  15  35  17   0 stream_64.ap/14
 15865 jdoe   71  23 0.0 0.0 0.0 6.1 0.0 0.0  14  38  19   0 stream_64.ap/5
 15865 jdoe   71  23 0.0 0.0 0.0 6.2 0.0 0.0  15  35  17   0 stream_64.ap/7
 15865 jdoe   71  23 0.0 0.0 0.0 6.3 0.0 0.0  12  35  15   0 stream_64.ap/16
 15865 jdoe   71  22 0.0 0.0 0.0 6.5 0.0 0.0  15  35  22   0 stream_64.ap/9
 15865 jdoe   71  22 0.0 0.0 0.0 6.6 0.0 0.0  19  35  19   0 stream_64.ap/3
 15865 jdoe   71  22 0.0 0.0 0.0 6.8 0.0 0.0  14  37  17   0 stream_64.ap/12
 15182 jdoe  0.6 0.8 0.0 0.0 0.0 0.0  99 0.0   7   1  3K   0 prstat/1
 14998 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  34   0 bash/1
 14996 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  11   0 118   0 sshd/1
 15162 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   8   0 sshd/1
 15164 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 bash/1

  NLWP USERNAME  SWAP   RSS MEMORY      TIME  CPU
    21 jdoe    13G   13G    11%   0:00:48 1.5%

Total: 6 processes, 21 lwps, load averages: 0.23, 0.11, 0.16

The acceleration was astounding.

In time elapsed, with single thread —

[jdoe@myserver:~/benchmarks (24)]
$ export SUNW_MP_THR_IDLE=1
[jdoe@myserver:~/benchmarks (25)]
$ export PARALLEL=1
[jdoe@myserver:~/benchmarks (26)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1379961 microseconds.
   (= 1379961 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2956.0364       3.2745       3.2476       3.3159
Scale:       3025.7681       3.1895       3.1727       3.2110
Add:         4026.0036       3.5974       3.5767       3.6166
Triad:       4025.0673       3.5911       3.5776       3.6025
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real     4:57.114
user     4:47.825
sys         9.284
[jdoe@myserver:~/benchmarks (27)]
$

With 16 parallel threads —

[jdoe@myserver:~/benchmarks (27)]
$ export PARALLEL=16
[jdoe@myserver:~/benchmarks (28)]
$ export SUNW_MP_THR_IDLE=16
[jdoe@myserver:~/benchmarks (29)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 231461 microseconds.
   (= 231461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32235.5417       0.3057       0.2978       0.3653
Scale:      32646.3996       0.3104       0.2941       0.3647
Add:        40598.9607       0.3722       0.3547       0.4290
Triad:      40255.7375       0.3656       0.3577       0.4070
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real       29.316
user     7:13.691
sys        10.981
[jdoe@myserver:~/benchmarks (30)]
$

See how the “real” time went from 5 minutes to 30s.

The benchmark program
/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/
/* License:                                                              */
/*  1. You are free to use this program and/or to redistribute           */
/*     this program.                                                     */
/*  2. You are free to modify this program for your own use,             */
/*     including commercial use, subject to the publication              */
/*     restrictions in item 3.                                           */
/*  3. You are free to publish results obtained from running this        */
/*     program, or from works that you derive from this program,         */
/*     with the following limitations:                                   */
/*     3a. In order to be referred to as "STREAM benchmark results",     */
/*         published results must be in conformance to the STREAM        */
/*         Run Rules, (briefly reviewed below) published at              */
/*         http://www.cs.virginia.edu/stream/ref.html                    */
/*         and incorporated herein by reference.                         */
/*         As the copyright holder, John McCalpin retains the            */
/*         right to determine conformity with the Run Rules.             */
/*     3b. Results based on modified source code or on runs not in       */
/*         accordance with the STREAM Run Rules must be clearly          */
/*         labelled whenever they are published.  Examples of            */
/*         proper labelling include:                                     */
/*         "tuned STREAM benchmark results"                              */
/*         "based on a variant of the STREAM benchmark code"             */
/*         Other comparable, clear and reasonable labelling is           */
/*         acceptable.                                                   */
/*     3c. Submission of results to the STREAM benchmark web site        */
/*         is encouraged, but not required.                              */
/*  4. Use of this program or creation of derived works based on this    */
/*     program constitutes acceptance of these licensing restrictions.   */
/*  5. Absolutely no warranty is expressed or implied.                   */
/*-----------------------------------------------------------------------*/
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <stddef.h>
# include <sys/time.h>

/* INSTRUCTIONS:
 *
 *      1) Stream requires a good bit of memory to run.  Adjust the
 *          value of 'N' (below) to give a 'timing calibration' of
 *          at least 20 clock-ticks.  This will provide rate estimates
 *          that should be good to about 5% precision.
 */

#ifndef N
#   define N    120000000
#endif
#ifndef NTIMES
#   define NTIMES       20
#endif
#ifndef OFFSET
#   define OFFSET       0
#endif

/*
 *      3) Compile the code with full optimization.  Many compilers
 *         generate unreasonably bad code before the optimizer tightens
 *         things up.  If the results are unreasonably good, on the
 *         other hand, the optimizer might be too smart for me!
 *
 *         Try compiling with:
 *               cc -O stream_omp.c -o stream_omp
 *
 *         This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 *      4) Mail the results to mccalpin@cs.virginia.edu
 *         Be sure to include:
 *              a) computer hardware model number and software revision
 *              b) the compiler flags
 *              c) all of the output from the test case.
 * Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static double   a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

static double   avgtime[4] = {0}, maxtime[4] = {0},
                mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char     *label[4] = {"Copy:      ", "Scale:     ",
    "Add:       ", "Triad:     "};

static double   bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

extern double mysecond();
extern void checkSTREAMresults();
#ifdef TUNED
extern void tuned_STREAM_Copy();
extern void tuned_STREAM_Scale(double scalar);
extern void tuned_STREAM_Add();
extern void tuned_STREAM_Triad(double scalar);
#endif
#ifdef _OPENMP
extern int omp_get_num_threads();
#endif
int
main()
    {
    int                 quantum, checktick();
    int                 BytesPerWord;
    register int        j, k;
    double              scalar, t, times[4][NTIMES];

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    printf("STREAM version $Revision: 5.9 $\n");
    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
#ifdef NO_LONG_LONG
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
#else
    printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
#endif

    printf("Total memory required = %.1f MB.\n",
        (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

#ifdef _OPENMP
    printf(HLINE);
#pragma omp parallel
    {
#pragma omp master
        {
            k = omp_get_num_threads();
            printf ("Number of Threads requested = %i\n",k);
        }
    }
#endif

    printf(HLINE);
#pragma omp parallel
    {
    printf ("Printing one line per active thread....\n");
    }

    /* Get initial value for system clock. */
#pragma omp parallel for
    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

    printf(HLINE);

    if  ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;
    }

    t = mysecond();
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (mysecond() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t  );
    printf("   (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING -- The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);

    /*  --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        times[0][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Copy();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
#endif
        times[0][k] = mysecond() - times[0][k];

        times[1][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Scale(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
#endif
        times[1][k] = mysecond() - times[1][k];

        times[2][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Add();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
#endif
        times[2][k] = mysecond() - times[2][k];

        times[3][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Triad(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
#endif
        times[3][k] = mysecond() - times[3][k];
        }

    /*  --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
        {
        for (j=0; j<4; j++)
            {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }

    printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
    for (j=0; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               avgtime[j],
               mintime[j],
               maxtime[j]);
    }
    printf(HLINE);

    /* --- Check Results --- */
    checkSTREAMresults();
    printf(HLINE);

    return 0;
}

# define        M       20

int
checktick()
    {
    int         i, minDelta, Delta;
    double      t1, t2, timesfound[M];

/*  Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = mysecond();
        while( ((t2=mysecond()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

/*
 * Determine the minimum difference between these M values.
 * This result will be our estimate (in microseconds) for the
 * clock granularity.
 */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

   return(minDelta);
    }

/* A gettimeofday routine to give access to the wall
   clock timer on most UNIX-like systems.  */

#include <sys/time.h>

double mysecond()
{
        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

void checkSTREAMresults ()
{
        double aj,bj,cj,scalar;
        double asum,bsum,csum;
        double epsilon;
        int     j,k;

    /* reproduce initialization */
        aj = 1.0;
        bj = 2.0;
        cj = 0.0;
    /* a[] is modified during timing check */
        aj = 2.0E0 * aj;
    /* now execute timing loop */
        scalar = 3.0;
        for (k=0; k<NTIMES; k++)
        {
            cj = aj;
            bj = scalar*cj;
            cj = aj+bj;
            aj = bj+scalar*cj;
        }
        aj = aj * (double) (N);
        bj = bj * (double) (N);
        cj = cj * (double) (N);

        asum = 0.0;
        bsum = 0.0;
        csum = 0.0;
        for (j=0; j<N; j++) {
                asum += a[j];
                bsum += b[j];
                csum += c[j];
        }
#ifdef VERBOSE
        printf ("Results Comparison: \n");
        printf ("        Expected  : %f %f %f \n",aj,bj,cj);
        printf ("        Observed  : %f %f %f \n",asum,bsum,csum);
#endif

#ifndef abs
#define abs(a) ((a) >= 0 ? (a) : -(a))
#endif
        epsilon = 1.e-8;

        if (abs(aj-asum)/asum > epsilon) {
                printf ("Failed Validation on array a[]\n");
                printf ("        Expected  : %f \n",aj);
                printf ("        Observed  : %f \n",asum);
        }
        else if (abs(bj-bsum)/bsum > epsilon) {
                printf ("Failed Validation on array b[]\n");
                printf ("        Expected  : %f \n",bj);
                printf ("        Observed  : %f \n",bsum);
        }
        else if (abs(cj-csum)/csum > epsilon) {
                printf ("Failed Validation on array c[]\n");
                printf ("        Expected  : %f \n",cj);
                printf ("        Observed  : %f \n",csum);
        }
        else {
                printf ("Solution Validates\n");
        }
}

void tuned_STREAM_Copy()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
}