Mar 12 2014

Okay, this has been a long time coming, but I thought I’d write this down before I forget all about it.

First, a little bit of a rant. In the EMC world, we assign LUNs to hosts and identify them using a hex code called the LUN ID. This, in conjunction with the EMC Frame ID, helps us uniquely and “easily” identify the LUNs, especially when you consider that there are usually multiple EMC frames and hundreds or thousands of SAN-attached hosts in most modern enterprise data centers.

The challenge comes when using a disk multipathing solution other than EMC’s expensive PowerPath software, or the exorbitantly expensive Veritas Storage Foundation suite from Symantec.

Most self-respecting modern operating systems have disk multipathing software built in and freely available, and the focus of this post is Solaris. Solaris has mature and efficient multipathing software called MPxIO, which cleverly creates a pseudo-device corresponding to the complex of the multiple (or single) paths via which a SAN LUN is visible to the operating system.

The MPxIO driver then uses the LUN’s global unique identifier (GUID) to identify the LUN, which is arguably the right thing to do. The challenge, however, is that the GUID is a 32-character string (very nicely explained here: http://www.slideshare.net/JamesCMcPherson/what-is-a-guid).

When I first proposed to our usual user community (Oracle DBAs) that they would have to use disks named as shown below, I faced outraged howls of indignation: “How can we address these disks in ASM, etc.?”

c3t60060160294122002A2ADDB296EADE11d0

(The 32 hex characters between the leading “t” and the trailing “d0” are the GUID.) To work around this, we considered creating an alternate namespace, say /dev/oracle/disk001 and so forth, retaining the major/minor numbers of the underlying devices. But that would get hairy real quick, especially on those databases where we had multiple terabytes of storage (hundreds of LUNs).

If you are working with CLARiiON arrays, you will have to figure out some other way (than what I’ve shown in this blog post) to map your MPxIO/Solaris disk name to the valid hex LUN ID.

The code in this gist covers what’s what. A friendly storage admin told me about this with respect to VMAX/Symmetrix LUNs: apparently EMC generates the GUID (aka UUID) of a LUN differently depending on whether it lives on a CLARiiON/VNX or a Symmetrix/VMAX.

In it, the pertinent piece of information is the set of fields extracted into the variables $xlunid1 and so forth and converted to their hex values. Having this information removes the need to install symcli on every host just to extract the LUN ID that way.

This gives us the ability to map an MPxIO/Solaris disk name to an EMC LUN ID.

The full script is here — https://github.com/implicateorder/sysadmin-tools/blob/master/getluninfo.pl

Oct 10 2013

About the STREAM benchmark

http://blogs.utexas.edu/jdm4372/tag/stream-benchmark/

Here’s what the author has to say about the benchmark itself —

What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

Leveraging the Parallelization potential of the T4

To run this benchmark, the STREAM program was compiled with GCC as well as with Solaris Studio 12 (the optimizing, native compiler for Solaris).

A standard compile with the gcc compiler resulted in this —

[jdoe@myserver:~/stream-gcc (52)]
$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1614701 microseconds.
   (= 1614701 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1119.4976       1.7513       1.7151       1.7878
Scale:       1094.2510       1.7722       1.7546       1.7939
Add:         1455.0495       1.9815       1.9793       1.9847
Triad:       1463.1247       1.9774       1.9684       1.9889
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
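As a sanity check on these numbers: per the source (see the full listing at the end of this post), the Copy kernel moves 2 arrays x 8 bytes x 120,000,000 elements, about 1.92 GB per iteration, and STREAM reports 1.0E-06 * bytes / best-time. So 1.92e9 bytes / 1.7151 s works out to ~1119.5 MB/s, matching the Copy rate above (STREAM counts 1 MB as 10^6 bytes).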

Then we compiled the code using Solaris Studio and immediately saw improvements in memory throughput (without any optimization) —

Unoptimized compile gave --

$ ./stream32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1434242 microseconds.
   (= 1434242 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1322.0838       1.4544       1.4523       1.4573
Scale:       1365.2033       1.4066       1.4064       1.4070
Add:         1968.3168       1.4633       1.4632       1.4637
Triad:       1944.1898       1.4815       1.4813       1.4819
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

After optimization —

Various degrees of optimization resulted in slight variations in performance (the following flags gave the best results, around 3x the throughput of the unoptimized code):

cc -mt -m32 -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 278639 microseconds.
   (= 278639 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3137.3320       0.6123       0.6120       0.6128
Scale:       3142.1011       0.6119       0.6111       0.6125
Add:         4230.4671       0.6811       0.6808       0.6817
Triad:       4323.3051       0.6667       0.6662       0.6674
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Make it Parallel

Using the Solaris Studio compiler (and its -xautopar flag), it is possible to force a single-threaded app to multi-thread on the CMT platform —

devzone:$(build) # cc -m32 -mt -xautopar -xarch=native -xO4 stream.c -o stream_omp32

$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 133846 microseconds.
   (= 133846 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6126.3741       0.3178       0.3134       0.3267
Scale:       6318.8244       0.3057       0.3039       0.3135
Add:         8280.5469       0.3490       0.3478       0.3508
Triad:       8396.7949       0.3438       0.3430       0.3449
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

This defaults to only 2 threads running in parallel (although the app thinks it is using a single thread of execution).

Now, by explicitly setting the following two variables in the parent shell, we were able to get 8 parallel threads of execution, going from ~3 GB/s with a single thread to ~6 GB/s with 2 threads to ~21 GB/s with 8 threads (i.e. utilizing a full core):

$ export PARALLEL=8
[jdoe@myserver:~ (9)]
$  export SUNW_MP_THR_IDLE=8
[jdoe@myserver:~ (10)]
$ ./stream_omp32
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 120000000, Offset = 0
Total memory required = 2746.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43905 microseconds.
   (= 43905 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       21245.0500       0.0914       0.0904       0.0920
Scale:      21816.9850       0.0885       0.0880       0.0908
Add:        28052.9390       0.1032       0.1027       0.1056
Triad:      28368.5107       0.1022       0.1015       0.1065
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~ (11)]

Now running 16 parallel threads (this time with a 64-bit build, stream_64.ap, and a larger array) —

$ ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 219395 microseconds.
   (= 219395 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32325.0822       0.3009       0.2970       0.3427
Scale:      32666.0515       0.3126       0.2939       0.3858
Add:        40507.6894       0.3741       0.3555       0.4537
Triad:      40263.1710       0.3676       0.3576       0.4074
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[jdoe@myserver:~/benchmarks (24)]

While prstat sees —

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15865 jdoe   74  25 0.0 0.0 0.0 0.4 0.0 0.7  12  40  17   0 stream_64.ap/4
 15865 jdoe   73  26 0.0 0.0 0.0 0.5 0.0 0.6  14  40  17   0 stream_64.ap/11
 15865 jdoe   73  23 0.0 0.0 0.0 3.5 0.0 0.2  14  35  17   0 stream_64.ap/15
 15865 jdoe   73  23 0.0 0.0 0.0 2.9 0.0 1.0  12  40  17   0 stream_64.ap/8
 15865 jdoe   73  23 0.0 0.0 0.0 3.2 0.0 0.7  19  40  24   0 stream_64.ap/10
 15865 jdoe   73  23 0.0 0.0 0.0 3.9 0.0 0.2  14  40  17   0 stream_64.ap/13
 15865 jdoe   73  22 0.0 0.0 0.0 4.2 0.0 0.0  14  40  19   0 stream_64.ap/2
 15865 jdoe   73  22 0.0 0.0 0.0 3.1 0.0 1.1  15  31  19   0 stream_64.ap/6
 15865 jdoe   71  23 0.0 0.0 0.0 5.6 0.0 0.0  10  35 740   0 stream_64.ap/1
 15865 jdoe   71  23 0.0 0.0 0.0 6.0 0.0 0.0  15  35  17   0 stream_64.ap/14
 15865 jdoe   71  23 0.0 0.0 0.0 6.1 0.0 0.0  14  38  19   0 stream_64.ap/5
 15865 jdoe   71  23 0.0 0.0 0.0 6.2 0.0 0.0  15  35  17   0 stream_64.ap/7
 15865 jdoe   71  23 0.0 0.0 0.0 6.3 0.0 0.0  12  35  15   0 stream_64.ap/16
 15865 jdoe   71  22 0.0 0.0 0.0 6.5 0.0 0.0  15  35  22   0 stream_64.ap/9
 15865 jdoe   71  22 0.0 0.0 0.0 6.6 0.0 0.0  19  35  19   0 stream_64.ap/3
 15865 jdoe   71  22 0.0 0.0 0.0 6.8 0.0 0.0  14  37  17   0 stream_64.ap/12
 15182 jdoe  0.6 0.8 0.0 0.0 0.0 0.0  99 0.0   7   1  3K   0 prstat/1
 14998 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  34   0 bash/1
 14996 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  11   0 118   0 sshd/1
 15162 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   8   0 sshd/1
 15164 jdoe  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 bash/1

  NLWP USERNAME  SWAP   RSS MEMORY      TIME  CPU
    21 jdoe    13G   13G    11%   0:00:48 1.5%

Total: 6 processes, 21 lwps, load averages: 0.23, 0.11, 0.16

The acceleration was astounding.

In elapsed time, with a single thread —

[jdoe@myserver:~/benchmarks (24)]
$ export SUNW_MP_THR_IDLE=1
[jdoe@myserver:~/benchmarks (25)]
$ export PARALLEL=1
[jdoe@myserver:~/benchmarks (26)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1379961 microseconds.
   (= 1379961 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2956.0364       3.2745       3.2476       3.3159
Scale:       3025.7681       3.1895       3.1727       3.2110
Add:         4026.0036       3.5974       3.5767       3.6166
Triad:       4025.0673       3.5911       3.5776       3.6025
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real     4:57.114
user     4:47.825
sys         9.284
[jdoe@myserver:~/benchmarks (27)]
$

With 16 parallel threads —

[jdoe@myserver:~/benchmarks (27)]
$ export PARALLEL=16
[jdoe@myserver:~/benchmarks (28)]
$ export SUNW_MP_THR_IDLE=16
[jdoe@myserver:~/benchmarks (29)]
$ ptime ./stream_64.ap
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 600000000, Offset = 10
Total memory required = 13732.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 231461 microseconds.
   (= 231461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       32235.5417       0.3057       0.2978       0.3653
Scale:      32646.3996       0.3104       0.2941       0.3647
Add:        40598.9607       0.3722       0.3547       0.4290
Triad:      40255.7375       0.3656       0.3577       0.4070
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

real       29.316
user     7:13.691
sys        10.981
[jdoe@myserver:~/benchmarks (30)]
$

See how the “real” time went from almost 5 minutes to about 30 seconds, roughly a 10x speedup. (The “user” time actually went up, since it now sums CPU time across 16 threads.)

The benchmark program
/*-----------------------------------------------------------------------*/
/* Program: Stream                                                       */
/* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
/* Original code developed by John D. McCalpin                           */
/* Programmers: John D. McCalpin                                         */
/*              Joe R. Zagar                                             */
/*                                                                       */
/* This program measures memory transfer rates in MB/s for simple        */
/* computational kernels coded in C.                                     */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2005: John D. McCalpin                                 */
/*-----------------------------------------------------------------------*/
/* License:                                                              */
/*  1. You are free to use this program and/or to redistribute           */
/*     this program.                                                     */
/*  2. You are free to modify this program for your own use,             */
/*     including commercial use, subject to the publication              */
/*     restrictions in item 3.                                           */
/*  3. You are free to publish results obtained from running this        */
/*     program, or from works that you derive from this program,         */
/*     with the following limitations:                                   */
/*     3a. In order to be referred to as "STREAM benchmark results",     */
/*         published results must be in conformance to the STREAM        */
/*         Run Rules, (briefly reviewed below) published at              */
/*         http://www.cs.virginia.edu/stream/ref.html                    */
/*         and incorporated herein by reference.                         */
/*         As the copyright holder, John McCalpin retains the            */
/*         right to determine conformity with the Run Rules.             */
/*     3b. Results based on modified source code or on runs not in       */
/*         accordance with the STREAM Run Rules must be clearly          */
/*         labelled whenever they are published.  Examples of            */
/*         proper labelling include:                                     */
/*         "tuned STREAM benchmark results"                              */
/*         "based on a variant of the STREAM benchmark code"             */
/*         Other comparable, clear and reasonable labelling is           */
/*         acceptable.                                                   */
/*     3c. Submission of results to the STREAM benchmark web site        */
/*         is encouraged, but not required.                              */
/*  4. Use of this program or creation of derived works based on this    */
/*     program constitutes acceptance of these licensing restrictions.   */
/*  5. Absolutely no warranty is expressed or implied.                   */
/*-----------------------------------------------------------------------*/
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <stddef.h>
# include <sys/time.h>

/* INSTRUCTIONS:
 *
 *      1) Stream requires a good bit of memory to run.  Adjust the
 *          value of 'N' (below) to give a 'timing calibration' of
 *          at least 20 clock-ticks.  This will provide rate estimates
 *          that should be good to about 5% precision.
 */

#ifndef N
#   define N    120000000
#endif
#ifndef NTIMES
#   define NTIMES       20
#endif
#ifndef OFFSET
#   define OFFSET       0
#endif

/*
 *      3) Compile the code with full optimization.  Many compilers
 *         generate unreasonably bad code before the optimizer tightens
 *         things up.  If the results are unreasonably good, on the
 *         other hand, the optimizer might be too smart for me!
 *
 *         Try compiling with:
 *               cc -O stream_omp.c -o stream_omp
 *
 *         This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 *      4) Mail the results to mccalpin@cs.virginia.edu
 *         Be sure to include:
 *              a) computer hardware model number and software revision
 *              b) the compiler flags
 *              c) all of the output from the test case.
 * Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static double   a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

static double   avgtime[4] = {0}, maxtime[4] = {0},
                mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char     *label[4] = {"Copy:      ", "Scale:     ",
    "Add:       ", "Triad:     "};

static double   bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

extern double mysecond();
extern void checkSTREAMresults();
#ifdef TUNED
extern void tuned_STREAM_Copy();
extern void tuned_STREAM_Scale(double scalar);
extern void tuned_STREAM_Add();
extern void tuned_STREAM_Triad(double scalar);
#endif
#ifdef _OPENMP
extern int omp_get_num_threads();
#endif
int
main()
    {
    int                 quantum, checktick();
    int                 BytesPerWord;
    register int        j, k;
    double              scalar, t, times[4][NTIMES];

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    printf("STREAM version $Revision: 5.9 $\n");
    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
#ifdef NO_LONG_LONG
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
#else
    printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
#endif

    printf("Total memory required = %.1f MB.\n",
        (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

#ifdef _OPENMP
    printf(HLINE);
#pragma omp parallel
    {
#pragma omp master
        {
            k = omp_get_num_threads();
            printf ("Number of Threads requested = %i\n",k);
        }
    }
#endif

    printf(HLINE);
#pragma omp parallel
    {
    printf ("Printing one line per active thread....\n");
    }

    /* Get initial value for system clock. */
#pragma omp parallel for
    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

    printf(HLINE);

    if  ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;
    }

    t = mysecond();
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (mysecond() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t  );
    printf("   (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING -- The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);

    /*  --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        times[0][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Copy();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
#endif
        times[0][k] = mysecond() - times[0][k];

        times[1][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Scale(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
#endif
        times[1][k] = mysecond() - times[1][k];

        times[2][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Add();
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
#endif
        times[2][k] = mysecond() - times[2][k];

        times[3][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Triad(scalar);
#else
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
#endif
        times[3][k] = mysecond() - times[3][k];
        }

    /*  --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
        {
        for (j=0; j<4; j++)
            {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }

    printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
    for (j=0; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               avgtime[j],
               mintime[j],
               maxtime[j]);
    }
    printf(HLINE);

    /* --- Check Results --- */
    checkSTREAMresults();
    printf(HLINE);

    return 0;
}

# define        M       20

int
checktick()
    {
    int         i, minDelta, Delta;
    double      t1, t2, timesfound[M];

/*  Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = mysecond();
        while( ((t2=mysecond()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

/*
 * Determine the minimum difference between these M values.
 * This result will be our estimate (in microseconds) for the
 * clock granularity.
 */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

   return(minDelta);
    }

/* A gettimeofday routine to give access to the wall
   clock timer on most UNIX-like systems.  */

#include <sys/time.h>

double mysecond()
{
        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

void checkSTREAMresults ()
{
        double aj,bj,cj,scalar;
        double asum,bsum,csum;
        double epsilon;
        int     j,k;

    /* reproduce initialization */
        aj = 1.0;
        bj = 2.0;
        cj = 0.0;
    /* a[] is modified during timing check */
        aj = 2.0E0 * aj;
    /* now execute timing loop */
        scalar = 3.0;
        for (k=0; k<NTIMES; k++)
        {
            cj = aj;
            bj = scalar*cj;
            cj = aj+bj;
            aj = bj+scalar*cj;
        }
        aj = aj * (double) (N);
        bj = bj * (double) (N);
        cj = cj * (double) (N);

        asum = 0.0;
        bsum = 0.0;
        csum = 0.0;
        for (j=0; j<N; j++) {
                asum += a[j];
                bsum += b[j];
                csum += c[j];
        }
#ifdef VERBOSE
        printf ("Results Comparison: \n");
        printf ("        Expected  : %f %f %f \n",aj,bj,cj);
        printf ("        Observed  : %f %f %f \n",asum,bsum,csum);
#endif

#ifndef abs
#define abs(a) ((a) >= 0 ? (a) : -(a))
#endif
        epsilon = 1.e-8;

        if (abs(aj-asum)/asum > epsilon) {
                printf ("Failed Validation on array a[]\n");
                printf ("        Expected  : %f \n",aj);
                printf ("        Observed  : %f \n",asum);
        }
        else if (abs(bj-bsum)/bsum > epsilon) {
                printf ("Failed Validation on array b[]\n");
                printf ("        Expected  : %f \n",bj);
                printf ("        Observed  : %f \n",bsum);
        }
        else if (abs(cj-csum)/csum > epsilon) {
                printf ("Failed Validation on array c[]\n");
                printf ("        Expected  : %f \n",cj);
                printf ("        Observed  : %f \n",csum);
        }
        else {
                printf ("Solution Validates\n");
        }
}

void tuned_STREAM_Copy()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(double scalar)
{
        int j;
#pragma omp parallel for
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
}

Jul 05 2013

It’s been a while since I’ve posted anything. So, here goes:

Identifying a kernel memory leak on myhost-ldom (after a P2V migration and an OS upgrade from Solaris 9 to Solaris 10u10).
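
kmem_track.d is an adaptation of code from Brendan Gregg’s DTrace book (noted at the end of this post). A minimal sketch of the idea, under my own assumptions since the actual script may differ, is to count allocations and frees per kmem cache and print the running totals on an interval:

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* count allocations per kmem cache */
fbt::kmem_cache_alloc:entry
{
        @allocs[stringof(args[0]->cache_name)] = count();
}

/* count frees per kmem cache */
fbt::kmem_cache_free:entry
{
        @frees[stringof(args[0]->cache_name)] = count();
}

/* print cumulative totals every 60s; a leaking cache shows allocs
   pulling steadily ahead of frees across successive intervals */
profile:::tick-60s
{
        printf("Tracing...If you see more allocs than frees, there is a potential issue...\n");
        printf("Check against the cache name that is suspect\n\n");
        printf("%-32s %-8s %s\n", "CACHE NAME", "ALLOCS", "FREES");
        printa("%-32s %@d    %@d\n", @allocs, @frees);
}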

kmem_track.d sees this:

Tracing...If you see more allocs than frees, there is a potential issue...
Check against the cache name that is suspect

CACHE NAME                       ALLOCS   FREES
kmem_bufctl_audit_cache          0        47775
kmem_alloc_256                   0        87805
streams_dblk_1040                26072    0
kmem_alloc_40                    63752    0
vn_cache                         64883    64484
kmem_alloc_1152                  85338    82191
rctl_val_cache                   98712    99425
anonmap_cache                    105272   106039
kmem_alloc_96                    109072   108992
sfmmu8_cache                     109171   112886
kmem_alloc_32                    134058   135008
zio_cache                        146456   146456
streams_dblk_80                  162260   162274
kmem_alloc_160                   167740   167783
kmem_alloc_80                    187855   188146
sfmmu1_cache                     190247   194525
segvn_cache                      217514   218797
seg_cache                        232548   233831
kmem_alloc_8                     283391   283672
kmem_alloc_64                    286856   286354
streams_mblk                     313263   313281
anon_cache                       330058   336717
Tracing...If you see more allocs than frees, there is a potential issue...
Check against the cache name that is suspect

CACHE NAME                       ALLOCS   FREES
kmem_bufctl_audit_cache          0        47778
kmem_alloc_256                   0        87807
streams_dblk_1040                26216    0
kmem_alloc_40                    63777    0
vn_cache                         64887    64488
kmem_alloc_1152                  85383    82236
rctl_val_cache                   98787    99500
anonmap_cache                    105331   106098
kmem_alloc_96                    109075   108995
sfmmu8_cache                     109226   112967
kmem_alloc_32                    134132   135082
zio_cache                        146468   146468
streams_dblk_80                  162472   162486
kmem_alloc_160                   167875   167918
kmem_alloc_80                    187950   188241
sfmmu1_cache                     190362   194689
segvn_cache                      217628   218911
seg_cache                        232666   233949
kmem_alloc_8                     283452   283733
kmem_alloc_64                    286923   286421
streams_mblk                     313688   313706
anon_cache                       330176   336835
Tracing...If you see more allocs than frees, there is a potential issue...
Check against the cache name that is suspect

Two caches (streams_dblk_1040 and kmem_alloc_40) are growing rapidly, and that growth is associated with growth in kernel memory, validated by the following output (::memstat from mdb -k, sampled 15 minutes apart):

-----------------------------------
05-22-13-15-15
-----------------------------------
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      99310               775   10%
ZFS File Data               31104               243    3%
Anon                        36342               283    4%
Exec and libs                4090                31    0%
Page cache                  20560               160    2%
Free (cachelist)             3905                30    0%
Free (freelist)            833012              6507   81%

Total                     1028323              8033
Physical                  1008210              7876

-----------------------------------
05-22-13-15-30
-----------------------------------
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     150859              1178   15%
ZFS File Data               65092               508    6%
Anon                        68299               533    7%
Exec and libs                9371                73    1%
Page cache                  22618               176    2%
Free (cachelist)            11104                86    1%
Free (freelist)            700980              5476   68%

Total                     1028323              8033
Physical                  1005627              7856
-----------------------------------

Kernel memory utilization has grown by roughly 400MB in the space of 15 minutes.

So, running dtrace against the streams_dblk_1040 cache shows us this:

$ sudo dtrace -n 'fbt::kmem_cache_alloc:entry /args[0]->cache_name == "streams_dblk_1040"/ \
 { @[uid,pid,ppid,curpsinfo->pr_psargs,execname,stack()] = count(); trunc(@,10);}'
dtrace: description 'fbt::kmem_cache_alloc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME

        0     2122        1  /usr/local/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a  snmpd
              genunix`allocb+0x94
              ip`snmpcom_req+0x2bc
              ip`ip_wput_nondata+0x7b0
              unix`putnext+0x218
              unix`putnext+0x218
              ip`snmpcom_req+0x37c
              ip`ip_snmpmod_wput+0xe4
              unix`putnext+0x218
              ip`snmpcom_req+0x37c
              ip`ip_snmpmod_wput+0xe4
              unix`putnext+0x218
              genunix`strput+0x1d8
              genunix`strputmsg+0x2d4
              genunix`msgio32+0x354
              genunix`putmsg32+0x98
              unix`syscall_trap32+0xcc
               11
        0     2122        1  /usr/local/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a  snmpd
              genunix`allocb+0x94
              ip`snmp_append_data2+0x70
              ip`tcp_snmp_get+0x5d4
              ip`snmpcom_req+0x350
              ip`ip_snmpmod_wput+0xe4
              unix`putnext+0x218
              ip`snmpcom_req+0x37c
              ip`ip_snmpmod_wput+0xe4
              unix`putnext+0x218
              genunix`strput+0x1d8
              genunix`strputmsg+0x2d4
              genunix`msgio32+0x354
              genunix`putmsg32+0x98
              unix`syscall_trap32+0xcc
               22
        0     2122        1  /usr/local/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a  snmpd
              genunix`allocb+0x94
              ip`snmp_append_data2+0x70
              ip`ip_snmp_get2_v4+0x310
              ip`ire_walk_ill_tables+0x30c
              ip`ire_walk_ipvers+0x64
              ip`ip_snmp_get_mib2_ip_route_media+0x74
              ip`ip_snmp_get+0x298
              ip`snmpcom_req+0x350
              ip`ip_wput_nondata+0x7b0
              unix`putnext+0x218
              unix`putnext+0x218
              ip`snmpcom_req+0x37c
              ip`ip_snmpmod_wput+0xe4
              unix`putnext+0x218
              ip`snmpcom_req+0x37c
              ip`ip_snmpmod_wput+0xe4
              unix`putnext+0x218
              genunix`strput+0x1d8
              genunix`strputmsg+0x2d4
              genunix`msgio32+0x354
               77

This is snmpd. I had already noted that snmpd had issues (it complained during boot about configuration file entries).

Stopping net-snmpd resulted in the kernel memory being released back into the freelist.

The other cache, kmem_alloc_40, shows this:

$ sudo dtrace -n 'fbt::kmem_cache_alloc:entry /args[0]->cache_name == "kmem_alloc_40"/ \
{ @[uid,pid,ppid,curpsinfo->pr_psargs,execname,stack()] = count(); trunc(@,10);}'
dtrace: description 'fbt::kmem_cache_alloc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
        0      586        1  /opt/quest/sbin/vasd -p /var/opt/quest/vas/vasd/.vasd.pid  vasd
              genunix`kmem_zalloc+0x28
              zfs`dsl_dir_tempreserve_impl+0x1e0
              zfs`dsl_dir_tempreserve_space+0x128
              zfs`dmu_tx_try_assign+0x220
              zfs`dmu_tx_assign+0xc
              zfs`zfs_write+0x4b8
              genunix`fop_write+0x20
              genunix`write+0x268
              unix`syscall_trap32+0xcc
               39
        0      586        1  /opt/quest/sbin/vasd -p /var/opt/quest/vas/vasd/.vasd.pid  vasd
              genunix`kmem_zalloc+0x28
              zfs`dsl_dir_tempreserve_space+0x68
              zfs`dmu_tx_try_assign+0x220
              zfs`dmu_tx_assign+0xc
              zfs`zfs_write+0x4b8
              genunix`fop_write+0x20
              genunix`write+0x268
              unix`syscall_trap32+0xcc
               39
        0      586        1  /opt/quest/sbin/vasd -p /var/opt/quest/vas/vasd/.vasd.pid  vasd
              genunix`kmem_zalloc+0x28
              zfs`dsl_dir_tempreserve_space+0xd4
              zfs`dmu_tx_try_assign+0x220
              zfs`dmu_tx_assign+0xc
              zfs`zfs_write+0x4b8
              genunix`fop_write+0x20
              genunix`write+0x268
              unix`syscall_trap32+0xcc
               39
        0      586        1  /opt/quest/sbin/vasd -p /var/opt/quest/vas/vasd/.vasd.pid  vasd
              genunix`kmem_zalloc+0x28
              zfs`zfs_get_data+0x80
              zfs`zil_lwb_commit+0x188
              zfs`zil_commit_writer+0xbc
              zfs`zil_commit+0x90
              zfs`zfs_fsync+0xfc
              genunix`fop_fsync+0x14
              genunix`fdsync+0x20
              unix`syscall_trap32+0xcc
               40

Looks like Vintela (vasd) is writing to ZFS…

After shutting down snmpd, the kernel memory stopped growing like crazy…

$ echo "::memstat"|sudo mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     151839              1186   15%
ZFS File Data               65746               513    6%
Anon                        58443               456    6%
Exec and libs                6927                54    1%
Page cache                  22475               175    2%
Free (cachelist)            13836               108    1%
Free (freelist)            709057              5539   69%

Total                     1028323              8033
Physical                  1005627              7856
[$:/var/adm (56)]
$ echo "::memstat"|sudo mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     135583              1059   13%
ZFS File Data               65876               514    6%
Anon                        55253               431    5%
Exec and libs                7202                56    1%
Page cache                  22475               175    2%
Free (cachelist)            13589               106    1%
Free (freelist)            728345              5690   71%

Total                     1028323              8033
Physical                  1005627              7856
[$:/var/adm (57)]
$ echo "::memstat"|sudo mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     135839              1061   13%
ZFS File Data               65941               515    6%
Anon                        55161               430    5%
Exec and libs                6909                53    1%
Page cache                  22477               175    2%
Free (cachelist)            13890               108    1%
Free (freelist)            728106              5688   71%

Total                     1028323              8033
Physical                  1005627              7856

It’s now hovering around 1060MB, down from ~1190MB.

05-24-2013 Revisit

Tracing...If you see more allocs than frees, there is a potential issue...
Check against the cache name that is suspect

CACHE NAME                       ALLOCS   FREES
kmem_alloc_16                    0        204696
kmem_alloc_128                   111705   111703
vn_cache                         182607   0
kmem_alloc_256                   186757   186900
file_cache                       217169   217300
kmem_alloc_40                    242674   242645
zio_cache                        279855   279855
kmem_alloc_1152                  295560   295553
rctl_val_cache                   354759   355428
sfmmu8_cache                     358797   360868
kmem_alloc_96                    370087   370046
anonmap_cache                    380465   380542
kmem_alloc_160                   474039   474385
kmem_alloc_32                    476749   476818
kmem_alloc_8                     588250   587960
kmem_alloc_80                    674765   675200
sfmmu1_cache                     683800   684676
segvn_cache                      785621   785685
kmem_alloc_64                    789370   788307
seg_cache                        839840   839904
anon_cache                       1163105  1161663

Only one stands out — vn_cache, with 182,607 allocs and not a single free. Therefore, run the following dtrace command:

$ sudo dtrace -qn 'fbt::kmem_cache_alloc:entry /args[0]->cache_name == "vn_cache"/  \
{ @[uid,pid,ppid,curpsinfo->pr_psargs,execname,stack()] = count(); trunc(@,10);} \
profile:::tick-15sec { printa(@); }'

This shows stack-trace aggregations for the top 10 processes entering kmem_cache_alloc for vn_cache, printed at 15-second intervals. Turns out it is HP OpenView:

        0     2104        1  /opt/perf/bin/perfd                                 perfd
              genunix`vn_alloc+0xc
              procfs`prgetnode+0x38
              procfs`pr_lookup_procdir+0x94
              procfs`prlookup+0x198
              genunix`fop_lookup+0x28
              genunix`lookuppnvp+0x30c
              genunix`lookuppnat+0x120
              genunix`lookupnameat+0x5c
              genunix`vn_openat+0x168
              genunix`copen+0x260
              unix`syscall_trap32+0xcc
             1087
        0     2104        1  /opt/perf/bin/perfd                                 perfd
              genunix`vn_alloc+0xc
              procfs`prgetnode+0x38
              procfs`prgetnode+0x110
              procfs`pr_lookup_procdir+0x94
              procfs`prlookup+0x198
              genunix`fop_lookup+0x28
              genunix`lookuppnvp+0x30c
              genunix`lookuppnat+0x120
              genunix`lookupnameat+0x5c
              genunix`vn_openat+0x168
              genunix`copen+0x260
              unix`syscall_trap32+0xcc
             1087
        0     2046        1  /opt/perf/bin/scopeux                               scopeux
              genunix`vn_alloc+0xc
              procfs`prgetnode+0x38
              procfs`pr_lookup_procdir+0x94
              procfs`prlookup+0x198
              genunix`fop_lookup+0x28
              genunix`lookuppnvp+0x30c
              genunix`lookuppnat+0x120
              genunix`lookupnameat+0x5c
              genunix`vn_openat+0x168
              genunix`copen+0x260
              unix`syscall_trap32+0xcc
             2168
        0     2046        1  /opt/perf/bin/scopeux                               scopeux
              genunix`vn_alloc+0xc
              procfs`prgetnode+0x38
              procfs`prgetnode+0x110
              procfs`pr_lookup_procdir+0x94
              procfs`prlookup+0x198
              genunix`fop_lookup+0x28
              genunix`lookuppnvp+0x30c
              genunix`lookuppnat+0x120
              genunix`lookupnameat+0x5c
              genunix`vn_openat+0x168
              genunix`copen+0x260
              unix`syscall_trap32+0xcc

Which lines up with the output of the heap/stack growth (hpstckgrow.d) script, sketched below.
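
hpstckgrow.d is likewise an adaptation from the DTrace book. A minimal sketch of the approach, again under my own assumptions rather than the original script’s contents: count brk(2) calls as a proxy for heap growth, and count entries into the kernel’s grow() routine (which services stack-extension page faults) as a proxy for stack growth.

#!/usr/sbin/dtrace -s
#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing...Ctrl-C to exit\n");
}

/* heap growth: a process extends its heap via the brk() syscall */
syscall::brk:entry
{
        @heap[execname, pid, stringof(curpsinfo->pr_psargs)] = count();
}

/* stack growth: grow() handles automatic stack-segment extension
   (assumption: probe and behavior per OpenSolaris os/grow.c) */
fbt::grow:entry
{
        @stack[execname, pid, stringof(curpsinfo->pr_psargs)] = count();
}

/* print and reset both aggregations every 60 seconds */
profile:::tick-60s
{
        printf("Tracking processes that are growing their heap size...\n");
        printf("aggregation printed at 60s intervals\n");
        printf("%-8s %-8s %-40s %s\n", "EXEC", "PID", "COMMAND", "COUNT");
        printa("%-8s %-8d %-40s %@d\n", @heap);
        printf("\nTracking processes that are growing their stack size...\n");
        printa("%-8s %-8d %-40s %@d\n", @stack);
        trunc(@heap);
        trunc(@stack);
}

And the output —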

Tracing...Ctrl-C to exit
Tracking processes that are growing their heap size...
aggregation printed at 60s intervals
EXEC     PID      COMMAND                                  COUNT
kstat    23772    /usr/perl5/bin/perl /usr/bin/kstat -n dnlcstats 412
kstat    26262    /usr/perl5/bin/perl /usr/bin/kstat -n dnlcstats 412
kstat    28069    /usr/perl5/bin/perl /usr/bin/kstat -n dnlcstats 412
kstat    647      /usr/perl5/bin/perl /usr/bin/kstat -p unix:::boot_time 1028
kstat    6674     /usr/perl5/bin/perl /usr/bin/kstat -p unix:::boot_time 1028
kstat    24090    /usr/perl5/bin/perl /usr/bin/kstat -p unix:::boot_time 1028
kstat    1410     /usr/perl5/bin/perl /usr/bin/kstat -p unix:::boot_time 1030
facter   395      /opt/puppet/bin/ruby /opt/puppet/bin/facter --puppet --yaml 2106
facter   6561     /opt/puppet/bin/ruby /opt/puppet/bin/facter --puppet --yaml 2106
facter   23793    /opt/puppet/bin/ruby /opt/puppet/bin/facter --puppet --yaml 2107

Tracking processes that are growing their stack size...
aggregation printed at 60s intervals
EXEC     PID      COMMAND                                  COUNT
zstatd   27933    /opt/sun/xvm/lib/zstat/zstatd            1
zstatd   28306    /opt/sun/xvm/lib/zstat/zstatd            1
zstatd   28763    /opt/sun/xvm/lib/zstat/zstatd            1
zstatd   28918    /opt/sun/xvm/lib/zstat/zstatd            1
zstatd   29227    /opt/sun/xvm/lib/zstat/zstatd            1
zstatd   29491    /opt/sun/xvm/lib/zstat/zstatd            1
zstatd   29664    /opt/sun/xvm/lib/zstat/zstatd            1
java     952      /usr/jdk/instances/jdk1.6.0/bin/java -Xmx150m -Xss192k -XX:MinHeapFreeRatio=10  5
scopeux  2046     /opt/perf/bin/scopeux                    126
perfd    2104     /opt/perf/bin/perfd                      1260

Uninstalling and reinstalling the net-snmpd and HP OpenView packages ultimately stopped the memory leaks.
Links:
Both the scripts listed above are modifications/adaptations of code from Brendan Gregg’s excellent DTrace book —