Aug 13, 2018
 

The tool iperf (or iperf3) is a popular synthetic load generator used to measure network throughput between two nodes on a TCP/IP network. This is fine for point-to-point measurements. However, when it comes to distributed computing clusters spanning multiple racks of nodes, it becomes challenging to find a tool that can easily simulate enough traffic to exercise the network infrastructure sufficiently. One could, of course, set up Hadoop on the cluster (or use an existing installation) and run a TeraGen (a MapReduce tool) to generate heavy network traffic. Often, however, we need to test and validate the boundaries and limitations of a network, especially the traffic between nodes of the cluster that crosses rack boundaries, before any software stack is deployed on the infrastructure. Using a simple and well-known tool (hence iperf3) provides a common frame of reference, and the tests can easily be reproduced irrespective of the underlying infrastructure (e.g. bare metal vs. private cloud vs. public cloud).

The objective of this document is to use the standard and freely available tool iperf3 to run parallel synthetic network loads that test the limits of a distributed cluster network. In other words, given a known set of iperf3 server instances, we run iperf3 client sessions against randomly selected servers, simultaneously, from multiple hosts. To that end, the same hosts can be set up as both servers and clients (with the caveat that no node should run a test against itself).

The methodology introduced in this document can be used to generate synthetic network loads to test East-West (E-W) network bandwidth. The diagram below shows the recommended (and standard) network topology for a distributed computing cluster (such as Hadoop).

The leaf-to-spine uplink should be sized so that each node in one rack can talk to each node in another rack in a non-blocking fashion. In other words, the E-W oversubscription ratio should be 1:1. With 1:1 oversubscription and one 10 Gbps NIC per node, these tests should deliver close to 10 Gbps per node even when traffic is running in parallel across all nodes between two racks.

Practical considerations may result in higher oversubscription ratios of up to 4:1. For really large clusters (more than 500 nodes), this might be higher still (e.g. 7:1).
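As a concrete illustration, the E-W oversubscription ratio is simply the total node-facing (downlink) bandwidth of a rack divided by its total uplink bandwidth. A quick sketch with hypothetical numbers:

```shell
#!/bin/bash
# Hypothetical rack: 20 nodes, one 10 Gbps NIC each, 2 x 25 Gbps uplinks
nodes_per_rack=20
nic_gbps=10
uplink_gbps=$((2 * 25))

downlink_gbps=$((nodes_per_rack * nic_gbps))    # 200 Gbps node-facing
echo "E-W oversubscription: $((downlink_gbps / uplink_gbps)):1"   # prints 4:1
```

To reach 1:1 with the same 20 nodes, the rack would need 200 Gbps of uplink capacity (e.g. 2 x 100 Gbps ports).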

Some caveats need to be kept in mind. Hadoop has traditionally been a shared-nothing architecture, with great emphasis on data locality: the worker nodes of a Hadoop cluster have locally attached drives (SATA is the norm), and multiple copies of the data are stored (HDFS defaults to 3 replicas per HDFS block).

However, as network hardware has improved in throughput, it has opened up avenues for flexibility vis-à-vis data locality (25, 40, and 50 Gbps Ethernet networks are fast becoming ubiquitous, with 100 Gbps uplink ports increasingly prevalent in enterprise data centers). As more architectures leverage network-based data storage and retrieval in these distributed computing environments, the need to gravitate towards 1:1 oversubscription will increase correspondingly.

We need the following for this methodology to work —

  • iperf3 installed on all nodes
  • moreutils installed on all nodes
  • A clush (ClusterShell) setup on a control host (a laptop will do)

The clush setup involves creating an entry in the file /etc/clustershell/groups as follows —

 

mc_all: host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com host[302,304,305-0333].my.company.com

mc_rack1: host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com

mc_rack2: host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com

mc_rack3: host[302,304,305-0333].my.company.com
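ClusterShell also ships the nodeset utility, which can be used to sanity-check the group definitions before running anything (these commands assume the groups file shown above is in place):

```shell
# Count the nodes in each group, then expand one group to eyeball the hostnames
nodeset -c @mc_all
nodeset -c @mc_rack1
nodeset -e @mc_rack1
```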

Install iperf3 and moreutils as follows —

# clush -g mc_all 'sudo yum install -y iperf3 moreutils'

After this is done, launch iperf3 in daemon mode on all the hosts, on three ports, as follows —

clush -g mc_all -l johndoe 'sudo iperf3 -sD -p 5001'

clush -g mc_all -l johndoe 'sudo iperf3 -sD -p 5002'

clush -g mc_all -l johndoe 'sudo iperf3 -sD -p 5003'
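The three invocations above can also be collapsed into a single clush call that loops over the ports on each node (same effect, one round trip):

```shell
# Start one iperf3 server daemon per port on every node in a single pass
clush -g mc_all -l johndoe 'for p in 5001 5002 5003; do sudo iperf3 -sD -p $p; done'
```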

Verify that the iperf3 daemons are running —

$ clush -g mc_all -l johndoe 'ps -ef|grep iperf3|grep -v grep'

host102.my.company.com: root      10245 1 14 12:50 ? 00:06:42 iperf3 -sD -p 5001

host106.my.company.com: root       9495 1 12 12:50 ? 00:05:36 iperf3 -sD -p 5001

host104.my.company.com: root       9554 1 9 12:50 ? 00:04:20 iperf3 -sD -p 5001

<truncated for readability>

host315.my.company.com: root      33247 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host224.my.company.com: root      33136 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host323.my.company.com: root      33257 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host318.my.company.com: root      32868 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host236.my.company.com: root      33470 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001

host236.my.company.com: johndoe   33734 33492 22 13:34 ? 00:00:10 iperf3 -c host120.my.company.com -p 5001 -t 60

 

After all the nodes have been set up to run iperf3 server instances in daemon mode, create multiple nodelist files (in this case, we are testing cross-rack bandwidth, so we set up one nodelist per rack) as follows —

$ ls -lrt 

total 32

-rw-r--r--  1 johndoe  staff 806 Aug  9 14:52 nodes.rack3

-rw-r--r--  1 johndoe  staff 520 Aug  9 14:52 nodes.rack2

-rw-r--r--  1 johndoe  staff 520 Aug  9 14:52 nodes.rack1

-rwxr-xr-x  1 johndoe  staff 221 Aug  9 14:52 random_net.sh

-rwxr-xr-x  1 johndoe  staff 221 Aug  9 14:52 randomnet_single.sh

Distribute these files to all nodes of the cluster. The assumption is that you already have clush set up, and so hold the keys to the kingdom, so to speak: a user ID on each node with sudo privileges as root.

$ clush -g mc_all -c nodes.rack* --dest=/home/johndoe

$ clush -g mc_all -c randomnet*.sh --dest=/home/johndoe

Each file contains the list of all nodes in a specific rack, e.g.:

$ cat nodes.rack1 

host102.my.company.com

host104.my.company.com

host106.my.company.com

host108.my.company.com 

host110.my.company.com
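If the clush groups are defined as shown earlier, the nodelist files need not be typed by hand; the nodeset utility can generate them from the same group definitions (a sketch, assuming the group names used above):

```shell
# Expand each rack group into a newline-separated nodelist file
for rack in 1 2 3; do
    nodeset -e @mc_rack${rack} | tr ' ' '\n' > nodes.rack${rack}
done
```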

The shell script random_net.sh randomly selects target nodes against which to run the iperf3 tests. It is run on the client side, say from Rack 2 (clients) against Rack 1 (servers) —

#!/bin/bash

# Read the nodelist (one hostname per line) into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat $1))'

# Array of ports on which iperf3 server daemon instances will be running
ports=(5001 5002 5003)
psize=${#ports[@]}
size=${#nodes[@]}

for i in $(seq $size)
do
    index=$(( RANDOM % size ))
    pindex=$(( RANDOM % psize ))
    target=${nodes[$index]}
    ptarget=${ports[$pindex]}
    # Never run a test from this node against itself
    [ "$target" = "$(hostname -f)" ] && continue
    iperf3 -c $target -p $ptarget |ts |tee -a ~/iperf3.log     # Run payload from client to server
    iperf3 -c $target -R -p $ptarget |ts |tee -a ~/iperf3.log  # Reverse the direction
done

To run against a single iperf3 server instance per node, use this script instead —

#!/bin/bash

# Read the nodelist (one hostname per line) into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat $1))'

size=${#nodes[@]}

for i in $(seq $size)
do
    index=$(( RANDOM % size ))
    target=${nodes[$index]}
    # Never run a test from this node against itself
    [ "$target" = "$(hostname -f)" ] && continue
    iperf3 -c $target -p 5001 |ts |tee -a ~/${target}_iperf3.log     # Run payload from client to server
    iperf3 -c $target -R -p 5001 |ts |tee -a ~/${target}_iperf3.log  # Reverse the direction
done

Run the script(s) as follows —

$ clush -g mc_rack2 -l johndoe 'bash random_net.sh nodes.rack1'

Or

$ clush -g mc_rack2 -l johndoe 'bash randomnet_single.sh nodes.rack1'

This example runs the test in parallel on all nodes of Rack 2, each randomly selecting nodes in Rack 1 as targets. The output of the tests can be parsed to generate usable reports that help ascertain whether there is a bottleneck anywhere.

$ clush -g mc_rack2 -l johndoe 'bash random_net.sh nodes.rack1'

host218.my.company.com: Aug 09 14:31:17 Connecting to host host108.my.company.com, port 5001

host218.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.59 port 54962 connected to 192.168.1.14 port 5001

host218.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr Cwnd

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-1.00 sec 646 MBytes 5.42 Gbits/sec 80   595 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   1.00-2.00 sec 542 MBytes 4.55 Gbits/sec 50   561 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   2.00-3.00 sec 574 MBytes 4.81 Gbits/sec 29   577 KBytes 

host204.my.company.com: Aug 09 14:31:17 Connecting to host host140.my.company.com, port 5001

host204.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.52 port 47034 connected to 192.168.1.30 port 5001

host204.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr Cwnd

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-1.00 sec 870 MBytes 7.30 Gbits/sec 38   799 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   1.00-2.00 sec 626 MBytes 5.25 Gbits/sec 28   454 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   2.00-3.00 sec 516 MBytes 4.33 Gbits/sec 19   512 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   3.00-4.00 sec 590 MBytes 4.95 Gbits/sec 19   656 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   4.00-5.00 sec 581 MBytes 4.88 Gbits/sec 88   649 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   5.00-6.00 sec 570 MBytes 4.78 Gbits/sec 19   592 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   6.00-7.00 sec 561 MBytes 4.71 Gbits/sec 41   560 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   7.00-8.00 sec 589 MBytes 4.94 Gbits/sec 91   563 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   8.00-9.00 sec 539 MBytes 4.52 Gbits/sec 46   479 KBytes 

host204.my.company.com: Aug 09 14:31:17 [ 4]   9.00-10.00 sec 570 MBytes 4.78 Gbits/sec 68   607 KBytes 

host204.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -

host204.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec 457             sender

host204.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec              receiver

host204.my.company.com: Aug 09 14:31:17 

host204.my.company.com: Aug 09 14:31:17 iperf Done.

host218.my.company.com: Aug 09 14:31:17 [ 4]   3.00-4.00 sec 636 MBytes 5.34 Gbits/sec 12   484 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   4.00-5.00 sec 508 MBytes 4.26 Gbits/sec 65   433 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   5.00-6.00 sec 384 MBytes 3.22 Gbits/sec 62   566 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   6.00-7.00 sec 632 MBytes 5.30 Gbits/sec 69   519 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   7.00-8.00 sec 595 MBytes 4.99 Gbits/sec 30   650 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   8.00-9.00 sec 564 MBytes 4.73 Gbits/sec 45   478 KBytes 

host218.my.company.com: Aug 09 14:31:17 [ 4]   9.00-10.00 sec 525 MBytes 4.40 Gbits/sec 26   444 KBytes 

host218.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -

host218.my.company.com: Aug 09 14:31:17 [ ID] Interval Transfer     Bandwidth Retr

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.47 GBytes 4.70 Gbits/sec 468             sender

host218.my.company.com: Aug 09 14:31:17 [ 4]   0.00-10.00 sec 5.47 GBytes 4.70 Gbits/sec              receiver
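As an example of post-processing, the per-host "sender" summary lines can be pulled out of the captured clush output and averaged with awk. This is a sketch: the field positions match the iperf3 output format shown above, and iperf3_all.log is an assumed name for a file holding the captured output.

```shell
# Average the sender-side bandwidth per client host from the captured output.
# Summary lines look like:
#   host204...: ... [ 4] 0.00-10.00 sec 5.87 GBytes 5.04 Gbits/sec 457  sender
awk '/sender/ {
        host = $1; sub(/:$/, "", host)   # strip the trailing colon clush adds
        sum[host] += $11; n[host]++      # $11 is the Gbits/sec figure
     }
     END {
        for (h in sum)
            printf "%s avg %.2f Gbits/sec over %d runs\n", h, sum[h]/n[h], n[h]
     }' iperf3_all.log
```

Hosts whose averages fall well below the expected per-node bandwidth are the first places to look for an oversubscribed uplink or a misbehaving NIC.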

 

NOTE: This blog documents my personal opinions only and does not reflect my employer’s positions on any subject written about here.