About the STREAM benchmark
http://blogs.utexas.edu/jdm4372/tag/stream-benchmark/
Here’s what the author has to say about the benchmark itself:
What is STREAM?
The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
/*-----------------------------------------------------------------------*/
/* Program: Stream */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin */
/* Programmers: John D. McCalpin */
/*              Joe R. Zagar */
/* */
/* This program measures memory transfer rates in MB/s for simple */
/* computational kernels coded in C. */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin */
/*-----------------------------------------------------------------------*/
/* License: */
/*  1. You are free to use this program and/or to redistribute */
/*     this program. */
/*  2. You are free to modify this program for your own use, */
/*     including commercial use, subject to the publication */
/*     restrictions in item 3. */
/*  3. You are free to publish results obtained from running this */
/*     program, or from works that you derive from this program, */
/*     with the following limitations: */
/*     3a. In order to be referred to as "STREAM benchmark results", */
/*         published results must be in conformance to the STREAM */
/*         Run Rules, (briefly reviewed below) published at */
/*         http://www.cs.virginia.edu/stream/ref.html */
/*         and incorporated herein by reference. */
/*         As the copyright holder, John McCalpin retains the */
/*         right to determine conformity with the Run Rules. */
/*     3b. Results based on modified source code or on runs not in */
/*         accordance with the STREAM Run Rules must be clearly */
/*         labelled whenever they are published. Examples of */
/*         proper labelling include: */
/*         "tuned STREAM benchmark results" */
/*         "based on a variant of the STREAM benchmark code" */
/*         Other comparable, clear and reasonable labelling is */
/*         acceptable. */
/*     3c. Submission of results to the STREAM benchmark web site */
/*         is encouraged, but not required. */
/*  4. Use of this program or creation of derived works based on this */
/*     program constitutes acceptance of these licensing restrictions. */
/*  5. Absolutely no warranty is expressed or implied. */
/*-----------------------------------------------------------------------*/
Leveraging the Parallelization potential of the T4
To run this benchmark, the STREAM program was compiled with both GCC and Solaris Studio 12 (the native, optimizing compiler for Solaris).
A standard compile with GCC produced the following:
[jdoe@myserver:~/stream-gcc (52)] $ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1614701 microseconds.
   (= 1614701 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1119.4976       1.7513       1.7151       1.7878
Scale:       1094.2510       1.7722       1.7546       1.7939
Add:         1455.0495       1.9815       1.9793       1.9847
Triad:       1463.1247       1.9774       1.9684       1.9889
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
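The Rate column can be reproduced from the other numbers in this output. Here is a minimal sketch, assuming (as the benchmark source does) that Copy and Scale touch two arrays, Add and Triad touch three, and the rate is bytes moved divided by the minimum time. The names stream_rate_mbs and stream_memory_mb are hypothetical helpers for illustration and are not part of STREAM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: STREAM-style rate in MB/s, where 1 MB = 1e6 bytes.
 * arrays_touched is 2 for Copy/Scale and 3 for Add/Triad. */
double stream_rate_mbs(int arrays_touched, double n_elements, double min_seconds)
{
    assert(arrays_touched > 0 && min_seconds > 0.0);
    double bytes = (double)arrays_touched * (double)sizeof(double) * n_elements;
    return 1.0e-6 * bytes / min_seconds;
}

/* Hypothetical helper: footprint of the three arrays in MB of 1048576 bytes,
 * the unit used by the "Total memory required" line. */
double stream_memory_mb(double n_elements)
{
    return 3.0 * (double)sizeof(double) * n_elements / 1048576.0;
}
```

For the GCC run, stream_rate_mbs(2, 120000000.0, 1.7151) comes out at about 1119.5 MB/s, which matches the Copy line up to the rounding of the printed time, and stream_memory_mb(120000000.0) comes out at about 2746.6 MB.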
We then compiled the code with Solaris Studio and immediately saw higher memory throughput, even without any optimization flags:
Unoptimized compile gave:
$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1434242 microseconds.
   (= 1434242 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1322.0838       1.4544       1.4523       1.4573
Scale:       1365.2033       1.4066       1.4064       1.4070
Add:         1968.3168       1.4633       1.4632       1.4637
Triad:       1944.1898       1.4815       1.4813       1.4819
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
After optimization:
Different optimization levels produced only slight variations in performance. The following command gave the best results, roughly 2.1x to 2.4x the throughput of the unoptimized build:
cc -mt -m32 -xarch=native -xO4 stream.c -o stream_omp32
$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 278639 microseconds.
   (= 278639 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3137.3320       0.6123       0.6120       0.6128
Scale:       3142.1011       0.6119       0.6111       0.6125
Add:         4230.4671       0.6811       0.6808       0.6817
Triad:       4323.3051       0.6667       0.6662       0.6674
-------------------------------------------------------------
Solution Validates
Make it Parallel
With the Solaris Studio compiler, the -xautopar flag asks the compiler to automatically parallelize the loops of a single-threaded program so that it runs multi-threaded on the CMT platform:
devzone:$(build) # cc -m32 -mt -xautopar -xarch=native -xO4 stream.c -o stream_omp32
$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 133846 microseconds.
   (= 133846 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6126.3741       0.3178       0.3134       0.3267
Scale:       6318.8244       0.3057       0.3039       0.3135
Add:         8280.5469       0.3490       0.3478       0.3508
Triad:       8396.7949       0.3438       0.3430       0.3449
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
By default this runs only 2 threads in parallel (even though the program itself is written as a single thread of execution).
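Conceptually, parallelizing one of these loops means splitting its iteration range into contiguous chunks, one per thread. The sketch below shows one common way to compute such chunks. It illustrates static loop partitioning in general and does not describe what the Solaris Studio runtime actually does; chunk_bounds and copy_chunk are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of static loop partitioning:
 * thread `tid` of `nthreads` handles iterations [*lo, *hi).
 * The first (n % nthreads) threads each get one extra iteration. */
void chunk_bounds(size_t n, size_t nthreads, size_t tid, size_t *lo, size_t *hi)
{
    assert(nthreads > 0 && tid < nthreads);
    size_t base  = n / nthreads;
    size_t extra = n % nthreads;
    *lo = tid * base + (tid < extra ? tid : extra);
    *hi = *lo + base + (tid < extra ? 1 : 0);
}

/* The Copy kernel restricted to one thread's chunk. */
void copy_chunk(double *c, const double *a, size_t lo, size_t hi)
{
    for (size_t j = lo; j < hi; j++)
        c[j] = a[j];
}
```

With 120000000 elements and 2 threads, thread 0 would get iterations [0, 60000000) and thread 1 would get [60000000, 120000000).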
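Explicitly setting the following two variables in the parent shell gave us 8 parallel threads of execution and roughly 3.5x the 2-thread memory throughput (Copy went from about 3 GB/s with a single thread, to 6 GB/s with 2 threads, to 21 GB/s with 8 threads, i.e., utilizing a full core). Note that PARALLEL is the variable that sets the thread count; SUNW_MP_THR_IDLE is documented as controlling whether idle threads spin or sleep, so it is unlikely to affect the number of threads.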
$ export PARALLEL=8
[jdoe@myserver:~ (9)] $ export SUNW_MP_THR_IDLE=8
[jdoe@myserver:~ (10)] $ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43905 microseconds.
   (= 43905 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       21245.0500       0.0914       0.0904       0.0920
Scale:      21816.9850       0.0885       0.0880       0.0908
Add:        28052.9390       0.1032       0.1027       0.1056
Triad:      28368.5107       0.1022       0.1015       0.1065
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~ (11)]
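One way to read these numbers is as parallel efficiency: the throughput with n threads divided by n times the single-thread throughput. Here is a minimal sketch using the Copy rates reported above (3137.3320, 6126.3741 and 21245.0500 MB/s for 1, 2 and 8 threads); speedup and parallel_efficiency are hypothetical helper names.

```c
#include <assert.h>

/* Hypothetical helper: how many times faster the n-thread run is. */
double speedup(double rate_n, double rate_1)
{
    assert(rate_1 > 0.0);
    return rate_n / rate_1;
}

/* Hypothetical helper: fraction of ideal linear scaling achieved
 * (1.0 would mean each added thread contributes a full thread's rate). */
double parallel_efficiency(double rate_n, double rate_1, int nthreads)
{
    assert(nthreads > 0);
    return speedup(rate_n, rate_1) / (double)nthreads;
}
```

With those Copy rates, the efficiency works out to about 0.98 for 2 threads and about 0.85 for 8 threads, so throughput scales close to linearly across the threads of one core.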
Now running 16 parallel threads, this time with a 64-bit build (stream_64.ap) and larger arrays (600,000,000 elements each):
$ ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 219395 microseconds.
   (= 219395 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32325.0822       0.3009       0.2970       0.3427
Scale:      32666.0515       0.3126       0.2939       0.3858
Add:        40507.6894       0.3741       0.3555       0.4537
Triad:      40263.1710       0.3676       0.3576       0.4074
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~/benchmarks (24)]
Meanwhile, prstat shows 16 LWPs of stream_64.ap, each spending roughly 71-74% of its time in user mode:
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15865 jdoe      74  25 0.0 0.0 0.0 0.4 0.0 0.7  12  40  17   0 stream_64.ap/4
 15865 jdoe      73  26 0.0 0.0 0.0 0.5 0.0 0.6  14  40  17   0 stream_64.ap/11
 15865 jdoe      73  23 0.0 0.0 0.0 3.5 0.0 0.2  14  35  17   0 stream_64.ap/15
 15865 jdoe      73  23 0.0 0.0 0.0 2.9 0.0 1.0  12  40  17   0 stream_64.ap/8
 15865 jdoe      73  23 0.0 0.0 0.0 3.2 0.0 0.7  19  40  24   0 stream_64.ap/10
 15865 jdoe      73  23 0.0 0.0 0.0 3.9 0.0 0.2  14  40  17   0 stream_64.ap/13
 15865 jdoe      73  22 0.0 0.0 0.0 4.2 0.0 0.0  14  40  19   0 stream_64.ap/2
 15865 jdoe      73  22 0.0 0.0 0.0 3.1 0.0 1.1  15  31  19   0 stream_64.ap/6
 15865 jdoe      71  23 0.0 0.0 0.0 5.6 0.0 0.0  10  35 740   0 stream_64.ap/1
 15865 jdoe      71  23 0.0 0.0 0.0 6.0 0.0 0.0  15  35  17   0 stream_64.ap/14
 15865 jdoe      71  23 0.0 0.0 0.0 6.1 0.0 0.0  14  38  19   0 stream_64.ap/5
 15865 jdoe      71  23 0.0 0.0 0.0 6.2 0.0 0.0  15  35  17   0 stream_64.ap/7
 15865 jdoe      71  23 0.0 0.0 0.0 6.3 0.0 0.0  12  35  15   0 stream_64.ap/16
 15865 jdoe      71  22 0.0 0.0 0.0 6.5 0.0 0.0  15  35  22   0 stream_64.ap/9
 15865 jdoe      71  22 0.0 0.0 0.0 6.6 0.0 0.0  19  35  19   0 stream_64.ap/3
 15865 jdoe      71  22 0.0 0.0 0.0 6.8 0.0 0.0  14  37  17   0 stream_64.ap/12
 15182 jdoe     0.6 0.8 0.0 0.0 0.0 0.0  99 0.0   7   1  3K   0 prstat/1
 14998 jdoe     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  34   0 bash/1
 14996 jdoe     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  11   0 118   0 sshd/1
 15162 jdoe     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   8   0 sshd/1
 15164 jdoe     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 bash/1
  NLWP USERNAME  SWAP   RSS MEMORY      TIME  CPU
    21 jdoe       13G   13G    11%   0:00:48 1.5%
Total: 6 processes, 21 lwps, load averages: 0.23, 0.11, 0.16
The acceleration was astounding.
Elapsed time with a single thread:
[jdoe@myserver:~/benchmarks (24)] $ export SUNW_MP_THR_IDLE=1
[jdoe@myserver:~/benchmarks (25)] $ export PARALLEL=1
[jdoe@myserver:~/benchmarks (26)] $ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1379961 microseconds.
   (= 1379961 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2956.0364       3.2745       3.2476       3.3159
Scale:       3025.7681       3.1895       3.1727       3.2110
Add:         4026.0036       3.5974       3.5767       3.6166
Triad:       4025.0673       3.5911       3.5776       3.6025
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real     4:57.114
user     4:47.825
sys         9.284
[jdoe@myserver:~/benchmarks (27)] $
With 16 parallel threads:
[jdoe@myserver:~/benchmarks (27)] $ export PARALLEL=16
[jdoe@myserver:~/benchmarks (28)] $ export SUNW_MP_THR_IDLE=16
[jdoe@myserver:~/benchmarks (29)] $ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 231461 microseconds.
   (= 231461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32235.5417       0.3057       0.2978       0.3653
Scale:      32646.3996       0.3104       0.2941       0.3647
Add:        40598.9607       0.3722       0.3547       0.4290
Triad:      40255.7375       0.3656       0.3577       0.4070
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real       29.316
user     7:13.691
sys        10.981
[jdoe@myserver:~/benchmarks (30)] $
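Note how the “real” (wall-clock) time dropped from about 5 minutes (4:57) to about 29 seconds, roughly a 10x reduction, while the total user CPU time rose from 4:47 to 7:13.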
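To compare the two ptime results numerically, the "real" values first need converting to seconds. The sketch below does that for the two formats that appear in the output here ("4:57.114" and "29.316"); ptime_to_seconds is a hypothetical helper, and the assumption that no other formats need handling is mine.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: convert "m:ss.sss" or "ss.sss" to seconds.
 * Returns -1.0 if the text matches neither form. */
double ptime_to_seconds(const char *text)
{
    assert(text != NULL);
    double minutes = 0.0, seconds = 0.0;
    char trailing;

    /* The extra %c only matches if unexpected characters follow,
     * so an exact match returns the count of numeric fields. */
    if (sscanf(text, "%lf:%lf%c", &minutes, &seconds, &trailing) == 2)
        return minutes * 60.0 + seconds;
    if (sscanf(text, "%lf%c", &seconds, &trailing) == 1)
        return seconds;
    return -1.0;
}
```

Dividing ptime_to_seconds("4:57.114") by ptime_to_seconds("29.316") gives a wall-clock speedup of about 10.1 for the 16-thread run.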
The benchmark program
/*-----------------------------------------------------------------------*/
/* Program: Stream */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin */
/* Programmers: John D. McCalpin */
/*              Joe R. Zagar */
/* */
/* This program measures memory transfer rates in MB/s for simple */
/* computational kernels coded in C. */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin */
/*-----------------------------------------------------------------------*/
/* License: */
/*  1. You are free to use this program and/or to redistribute */
/*     this program. */
/*  2. You are free to modify this program for your own use, */
/*     including commercial use, subject to the publication */
/*     restrictions in item 3. */
/*  3. You are free to publish results obtained from running this */
/*     program, or from works that you derive from this program, */
/*     with the following limitations: */
/*     3a. In order to be referred to as "STREAM benchmark results", */
/*         published results must be in conformance to the STREAM */
/*         Run Rules, (briefly reviewed below) published at */
/*         http://www.cs.virginia.edu/stream/ref.html */
/*         and incorporated herein by reference. */
/*         As the copyright holder, John McCalpin retains the */
/*         right to determine conformity with the Run Rules. */
/*     3b. Results based on modified source code or on runs not in */
/*         accordance with the STREAM Run Rules must be clearly */
/*         labelled whenever they are published. Examples of */
/*         proper labelling include: */
/*         "tuned STREAM benchmark results" */
/*         "based on a variant of the STREAM benchmark code" */
/*         Other comparable, clear and reasonable labelling is */
/*         acceptable. */
/*     3c. Submission of results to the STREAM benchmark web site */
/*         is encouraged, but not required. */
/*  4. Use of this program or creation of derived works based on this */
/*     program constitutes acceptance of these licensing restrictions. */
/*  5. Absolutely no warranty is expressed or implied. */
/*-----------------------------------------------------------------------*/
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <stddef.h>
# include <sys/time.h>

/* INSTRUCTIONS:
 *
 *  1) Stream requires a good bit of memory to run. Adjust the
 *     value of 'N' (below) to give a 'timing calibration' of
 *     at least 20 clock-ticks. This will provide rate estimates
 *     that should be good to about 5% precision.
 */

#ifndef N
#   define N 120000000
#endif
#ifndef NTIMES
#   define NTIMES 20
#endif
#ifndef OFFSET
#   define OFFSET 0
#endif

/*
 *  3) Compile the code with full optimization. Many compilers
 *     generate unreasonably bad code before the optimizer tightens
 *     things up. If the results are unreasonably good, on the
 *     other hand, the optimizer might be too smart for me!
 *
 *     Try compiling with:
 *         cc -O stream_omp.c -o stream_omp
 *
 *     This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 *  4) Mail the results to mccalpin@cs.virginia.edu
 *     Be sure to include:
 *         a) computer hardware model number and software revision
 *         b) the compiler flags
 *         c) all of the output from the test case.
 *     Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static double a[N+OFFSET],
              b[N+OFFSET],
              c[N+OFFSET];

static double avgtime[4] = {0}, maxtime[4] = {0},
              mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char *label[4] = {"Copy:      ", "Scale:     ",
                         "Add:       ", "Triad:     "};

static double bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

extern double mysecond();
extern void checkSTREAMresults();
#ifdef TUNED
extern void tuned_STREAM_Copy();
extern void tuned_STREAM_Scale(double scalar);
extern void tuned_STREAM_Add();
extern void tuned_STREAM_Triad(double scalar);
#endif
#ifdef _OPENMP
extern int omp_get_num_threads();
#endif

int
main()
    {
    int quantum, checktick();
    int BytesPerWord;
    register int j, k;
    double scalar, t, times[4][NTIMES];

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    printf("STREAM version $Revision: 5.9 $\n");
    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
#ifdef NO_LONG_LONG
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
#else
    printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
#endif

    printf("Total memory required = %.1f MB.\n",
        (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

#ifdef _OPENMP
    printf(HLINE);
#pragma omp parallel
    {
#pragma omp master
        {
        k = omp_get_num_threads();
        printf ("Number of Threads requested = %i\n",k);
        }
    }
#endif

    printf(HLINE);
#pragma omp parallel
    {
    printf ("Printing one line per active thread....\n");
    }

    /* Get initial value for system clock. */
#pragma omp parallel for
    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

    printf(HLINE);

    if ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;
    }

    t = mysecond();
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (mysecond() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t );
    printf("   (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING -- The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);

    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        times[0][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Copy();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
#endif
        times[0][k] = mysecond() - times[0][k];

        times[1][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Scale(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
#endif
        times[1][k] = mysecond() - times[1][k];

        times[2][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Add();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
#endif
        times[2][k] = mysecond() - times[2][k];

        times[3][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Triad(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
#endif
        times[3][k] = mysecond() - times[3][k];
        }

    /* --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
        {
        for (j=0; j<4; j++)
            {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }

    printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
    for (j=0; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               avgtime[j],
               mintime[j],
               maxtime[j]);
    }
    printf(HLINE);

    /* --- Check Results --- */
    checkSTREAMresults();
    printf(HLINE);

    return 0;
}

# define M 20

int
checktick()
    {
    int i, minDelta, Delta;
    double t1, t2, timesfound[M];

    /* Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = mysecond();
        while( ((t2=mysecond()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

    /*
     * Determine the minimum difference between these M values.
     * This result will be our estimate (in microseconds) for the
     * clock granularity.
     */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

    return(minDelta);
    }

/* A gettimeofday routine to give access to the wall
   clock timer on most UNIX-like systems. */

#include <sys/time.h>

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

void checkSTREAMresults ()
{
    double aj,bj,cj,scalar;
    double asum,bsum,csum;
    double epsilon;
    int j,k;

    /* reproduce initialization */
    aj = 1.0;
    bj = 2.0;
    cj = 0.0;
    /* a[] is modified during timing check */
    aj = 2.0E0 * aj;
    /* now execute timing loop */
    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        cj = aj;
        bj = scalar*cj;
        cj = aj+bj;
        aj = bj+scalar*cj;
        }
    aj = aj * (double) (N);
    bj = bj * (double) (N);
    cj = cj * (double) (N);

    asum = 0.0;
    bsum = 0.0;
    csum = 0.0;
    for (j=0; j<N; j++) {
        asum += a[j];
        bsum += b[j];
        csum += c[j];
    }
#ifdef VERBOSE
    printf ("Results Comparison: \n");
    printf (" Expected : %f %f %f \n",aj,bj,cj);
    printf (" Observed : %f %f %f \n",asum,bsum,csum);
#endif

#ifndef abs
#define abs(a) ((a) >= 0 ? (a) : -(a))
#endif
    epsilon = 1.e-8;

    if (abs(aj-asum)/asum > epsilon) {
        printf ("Failed Validation on array a[]\n");
        printf (" Expected : %f \n",aj);
        printf (" Observed : %f \n",asum);
    }
    else if (abs(bj-bsum)/bsum > epsilon) {
        printf ("Failed Validation on array b[]\n");
        printf (" Expected : %f \n",bj);
        printf (" Observed : %f \n",bsum);
    }
    else if (abs(cj-csum)/csum > epsilon) {
        printf ("Failed Validation on array c[]\n");
        printf (" Expected : %f \n",cj);
        printf (" Observed : %f \n",csum);
    }
    else {
        printf ("Solution Validates\n");
    }
}

void tuned_STREAM_Copy()
{
    int j;
#pragma omp parallel for
    for (j=0; j<N; j++)
        c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar)
{
    int j;
#pragma omp parallel for
    for (j=0; j<N; j++)
        b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
    int j;
#pragma omp parallel for
    for (j=0; j<N; j++)
        c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(double scalar)
{
    int j;
#pragma omp parallel for
    for (j=0; j<N; j++)
        a[j] = b[j]+scalar*c[j];
}
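The mysecond routine in the listing relies on gettimeofday, which reports wall-clock time with microsecond resolution. On systems that provide the POSIX clock_gettime call, a monotonic clock is one possible alternative. The sketch below is not part of STREAM, mysecond_monotonic is a hypothetical name, and whether it would change the measured results on this system was not tested.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical alternative timer: seconds from a monotonic clock,
 * which is not affected by adjustments to the system time. */
double mysecond_monotonic(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}
```

Only differences between two calls are meaningful, since the starting point of a monotonic clock is unspecified. Using it with the listing would mean replacing the calls to mysecond, which per the license terms would make the results a variant of the benchmark rather than standard STREAM results.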