The tool iperf (or iperf3) is a popular synthetic network load generator that can be used to measure network throughput between two nodes on a TCP/IP network. This works well for point-to-point measurements. However, when it comes to distributed computing clusters spanning multiple racks of nodes, it becomes challenging to find a tool that can easily simulate enough traffic to exercise the network infrastructure sufficiently. One could, of course, set up Hadoop on the cluster (or use an existing installation) and run a TeraGen (a MapReduce tool) to generate heavy network traffic. However, we are often required to test and validate the boundaries and limitations of a network, especially the traffic between nodes of the cluster that crosses rack boundaries, before any software stack is deployed on the infrastructure. It is useful to rely on a simple and well-known tool (hence iperf3) so that it provides a common frame of reference and the tests can easily be reproduced irrespective of the underlying infrastructure (e.g. bare metal vs. private cloud vs. public cloud).
The objective of this document is to use the standard and freely available tool iperf3 to run parallel synthetic network loads that test the limits of a distributed cluster network. In other words, given a known set of iperf3 server instances, we randomly run iperf3 client sessions against them simultaneously from multiple hosts. To that end, the same hosts can be set up as both servers and clients (with the caveat that we avoid running a test from any given node against itself).
The methodology introduced in this document can be used to generate synthetic network loads to test East-West network bandwidth. The diagram below shows the recommended (and standard) leaf-spine network topology for a distributed computing cluster (such as Hadoop).

The Leaf to Spine uplink is recommended to be sized such that each node in one rack can talk to each node in another rack in a non-blocking fashion. In other words, the East-West oversubscription ratio should be 1:1. With 1:1 oversubscription and one 10 Gbps NIC per node, these tests should report close to 10 Gbps per node even when traffic is running in parallel across all nodes between two racks.
Practical considerations (typically cost) may result in higher oversubscription ratios of up to 4:1. For really large clusters (more than 500 nodes), this might be higher still (7:1).
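As a hypothetical illustration (the node and uplink counts here are assumed, not taken from the cluster used later in this document): a rack of 20 worker nodes, each with one 10 Gbps NIC, can source up to 20 x 10 = 200 Gbps of East-West traffic. A 1:1 ratio therefore calls for 200 Gbps of leaf-to-spine uplink capacity (for example 2 x 100 Gbps ports), whereas 2 x 25 Gbps uplinks (50 Gbps total) would give an oversubscription ratio of 200:50, i.e. 4:1.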
Some caveats need to be kept in mind. Hadoop has traditionally been a shared-nothing architecture, with great emphasis on data locality. In practice that means the worker nodes of a Hadoop cluster have locally attached drives (SATA is the norm), and multiple copies of the data are stored (HDFS defaults to 3 replicas per HDFS block).
However, as network hardware has improved in throughput, it has opened up avenues for flexibility vis-a-vis data locality (25, 40, and 50 Gbps Ethernet networks are fast becoming ubiquitous, with 100 Gbps uplink ports increasingly prevalent in enterprise data centers). As more architectures leverage network-based data storage and retrieval in these distributed computing environments, the need to gravitate towards 1:1 oversubscription will increase correspondingly.
We need the following for this methodology to work —
- iperf3 installed on all nodes
- moreutils installed on all nodes
- A clush (ClusterShell) setup on a control machine (a laptop will do).
The clush setup involves creating the following entries in the file /etc/clustershell/groups —
mc_all: host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com host[302,304,305-0333].my.company.com
mc_rack1: host[102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140].my.company.com
mc_rack2: host[202,204,206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,240].my.company.com
mc_rack3: host[302,304,305-0333].my.company.com
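Once the groups file is in place, it is worth a quick sanity check that the groups resolve and that the control machine can reach every node. A minimal sketch, assuming the ClusterShell nodeset utility is installed alongside clush and passwordless SSH is already working:
$ nodeset -e @mc_rack1          # expand a group to verify it resolves to the expected hostnames
$ clush -g mc_all -b 'hostname' # run a trivial command everywhere; -b gathers identical output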
Install iperf3 and moreutils as follows —
# clush -g mc_all 'sudo yum install -y iperf3 moreutils'
After this is done, launch iperf3 in Daemon mode on all the hosts as follows —
clush -g mc_all -l johndoe sudo iperf3 -sD -p 5001
clush -g mc_all -l johndoe sudo iperf3 -sD -p 5002
clush -g mc_all -l johndoe sudo iperf3 -sD -p 5003
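If you prefer, the same three daemons can be started with a single loop from the control machine. A minimal sketch, assuming a bash shell and the same johndoe account and ports as above:
for p in 5001 5002 5003; do
  clush -g mc_all -l johndoe "sudo iperf3 -sD -p $p"   # one iperf3 server daemon per port on every node
done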
Verify that the iperf3 Daemons are running —
$ clush -g mc_all -l johndoe 'ps -ef|grep iperf3|grep -v grep'
host102.my.company.com: root 10245 1 14 12:50 ? 00:06:42 iperf3 -sD -p 5001
host106.my.company.com: root 9495 1 12 12:50 ? 00:05:36 iperf3 -sD -p 5001
host104.my.company.com: root 9554 1 9 12:50 ? 00:04:20 iperf3 -sD -p 5001
<truncated for readability>
host315.my.company.com: root 33247 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001
host224.my.company.com: root 33136 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001
host323.my.company.com: root 33257 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001
host318.my.company.com: root 32868 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001
host236.my.company.com: root 33470 1 0 12:57 ? 00:00:00 iperf3 -sD -p 5001
host236.my.company.com: johndoe 33734 33492 22 13:34 ? 00:00:10 iperf3 -c host120.my.company.com -p 5001 -t 60
After all the nodes have been set up to run iperf3 server instances in Daemon mode, create multiple nodelist files. In this case we are testing cross-rack bandwidth, so set up one nodelist per rack as follows —
$ ls -lrt
total 32
-rw-r--r-- 1 johndoe staff 806 Aug 9 14:52 nodes.rack3
-rw-r--r-- 1 johndoe staff 520 Aug 9 14:52 nodes.rack2
-rw-r--r-- 1 johndoe staff 520 Aug 9 14:52 nodes.rack1
-rwxr-xr-x 1 johndoe staff 221 Aug 9 14:52 random_net.sh
-rwxr-xr-x 1 johndoe staff 221 Aug 9 14:52 randomnet_single.sh
Distribute these files to all nodes of the cluster. The assumption is that you already have clush set up (so you have the keys to the kingdom, so to speak: a user ID on each node with sudo privileges as root, etc.).
$ clush -g mc_all -c nodes.rack* --dest=/home/johndoe
$ clush -g mc_all -c random*.sh --dest=/home/johndoe
Each file contains a list of all nodes in the specific rack, e.g.:
$ cat nodes.rack1
host102.my.company.com
host104.my.company.com
host106.my.company.com
host108.my.company.com
host110.my.company.com
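These per-rack files can be typed by hand, but since the racks are already defined as clush groups, they can also be generated from those groups. A minimal sketch, assuming the ClusterShell nodeset utility is available on the control machine:
nodeset -e @mc_rack1 | tr ' ' '\n' > nodes.rack1   # expand the group, one hostname per line
nodeset -e @mc_rack2 | tr ' ' '\n' > nodes.rack2
nodeset -e @mc_rack3 | tr ' ' '\n' > nodes.rack3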
The shell script random_net.sh randomly selects a target node against which to run the iperf3 test. It is run on the client side, for example from Rack2 (clients) against Rack1 (servers) —
#!/bin/bash
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat $1))' # Read nodelist into an array
# Create an array of ports on which iperf3 server daemon instances run
ports=(5001 5002 5003)
psize=${#ports[@]}
size=${#nodes[@]}
for i in $(seq $size)
do
index=$(( $RANDOM % size ))
pindex=$(( $RANDOM % psize ))
target=${nodes[$index]}
ptarget=${ports[$pindex]}
iperf3 -c $target -p $ptarget |ts |tee -a ~/iperf3.log # Run payload from Client to server
iperf3 -c $target -R -p $ptarget |ts |tee -a ~/iperf3.log # Reverse the direction
done
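The ps output shown earlier includes a client that was started with -t 60. By default iperf3 runs each test for 10 seconds; if you want longer runs, add the -t option to the client invocations in the script, for example:
iperf3 -c $target -p $ptarget -t 60 |ts |tee -a ~/iperf3.log # 60-second run instead of the 10-second default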
To run against a single iperf3 server instance per node, use this script (randomnet_single.sh) instead —
#!/bin/bash
# Read nodelist into an array
IFS=$'\r\n' GLOBIGNORE='*' command eval 'nodes=($(cat $1))'
size=${#nodes[@]}
for i in $(seq $size)
do
index=$(( $RANDOM % size ))
target=${nodes[$index]}
iperf3 -c $target -p 5001 |ts |tee -a ~/${target}_iperf3.log    # Run payload from Client to server
iperf3 -c $target -R -p 5001 |ts |tee -a ~/${target}_iperf3.log # Reverse the direction
done
Run the script(s) as follows —
$ clush -g mc_rack2 -l johndoe 'sh random_net.sh nodes.rack1'
Or
$ clush -g mc_rack2 -l johndoe 'sh randomnet_single.sh nodes.rack1'
This example shows the test running in parallel on all nodes of Rack2, each randomly selecting a node in Rack1 as the target. The output of the tests can be parsed to generate usable reports that help ascertain whether there is a bottleneck anywhere (see the parsing sketch after the sample output below).
$ clush -g mc_rack2 -l johndoe 'sh random_net.sh nodes.rack1'
host218.my.company.com: Aug 09 14:31:17 Connecting to host host108.my.company.com, port 5001
host218.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.59 port 54962 connected to 192.168.1.14 port 5001
host218.my.company.com: Aug 09 14:31:17 [ ID] Interval        Transfer     Bandwidth       Retr  Cwnd
host218.my.company.com: Aug 09 14:31:17 [ 4]  0.00-1.00  sec  646 MBytes  5.42 Gbits/sec   80   595 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  1.00-2.00  sec  542 MBytes  4.55 Gbits/sec   50   561 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  2.00-3.00  sec  574 MBytes  4.81 Gbits/sec   29   577 KBytes
host204.my.company.com: Aug 09 14:31:17 Connecting to host host140.my.company.com, port 5001
host204.my.company.com: Aug 09 14:31:17 [ 4] local 192.168.1.52 port 47034 connected to 192.168.1.30 port 5001
host204.my.company.com: Aug 09 14:31:17 [ ID] Interval        Transfer     Bandwidth       Retr  Cwnd
host204.my.company.com: Aug 09 14:31:17 [ 4]  0.00-1.00  sec  870 MBytes  7.30 Gbits/sec   38   799 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  1.00-2.00  sec  626 MBytes  5.25 Gbits/sec   28   454 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  2.00-3.00  sec  516 MBytes  4.33 Gbits/sec   19   512 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  3.00-4.00  sec  590 MBytes  4.95 Gbits/sec   19   656 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  4.00-5.00  sec  581 MBytes  4.88 Gbits/sec   88   649 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  5.00-6.00  sec  570 MBytes  4.78 Gbits/sec   19   592 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  6.00-7.00  sec  561 MBytes  4.71 Gbits/sec   41   560 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  7.00-8.00  sec  589 MBytes  4.94 Gbits/sec   91   563 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  8.00-9.00  sec  539 MBytes  4.52 Gbits/sec   46   479 KBytes
host204.my.company.com: Aug 09 14:31:17 [ 4]  9.00-10.00 sec  570 MBytes  4.78 Gbits/sec   68   607 KBytes
host204.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -
host204.my.company.com: Aug 09 14:31:17 [ ID] Interval        Transfer     Bandwidth       Retr
host204.my.company.com: Aug 09 14:31:17 [ 4]  0.00-10.00 sec  5.87 GBytes  5.04 Gbits/sec  457   sender
host204.my.company.com: Aug 09 14:31:17 [ 4]  0.00-10.00 sec  5.87 GBytes  5.04 Gbits/sec        receiver
host204.my.company.com: Aug 09 14:31:17
host204.my.company.com: Aug 09 14:31:17 iperf Done.
host218.my.company.com: Aug 09 14:31:17 [ 4]  3.00-4.00  sec  636 MBytes  5.34 Gbits/sec   12   484 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  4.00-5.00  sec  508 MBytes  4.26 Gbits/sec   65   433 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  5.00-6.00  sec  384 MBytes  3.22 Gbits/sec   62   566 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  6.00-7.00  sec  632 MBytes  5.30 Gbits/sec   69   519 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  7.00-8.00  sec  595 MBytes  4.99 Gbits/sec   30   650 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  8.00-9.00  sec  564 MBytes  4.73 Gbits/sec   45   478 KBytes
host218.my.company.com: Aug 09 14:31:17 [ 4]  9.00-10.00 sec  525 MBytes  4.40 Gbits/sec   26   444 KBytes
host218.my.company.com: Aug 09 14:31:17 - - - - - - - - - - - - - - - - - - - - - - - - -
host218.my.company.com: Aug 09 14:31:17 [ ID] Interval        Transfer     Bandwidth       Retr
host218.my.company.com: Aug 09 14:31:17 [ 4]  0.00-10.00 sec  5.47 GBytes  4.70 Gbits/sec  468   sender
host218.my.company.com: Aug 09 14:31:17 [ 4]  0.00-10.00 sec  5.47 GBytes  4.70 Gbits/sec        receiver
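As a starting point for such a report, the per-run summary lines can be pulled out of the log that the script appends to on each client node. A minimal sketch (it assumes the log layout shown above, i.e. a ts timestamp followed by standard iperf3 output, written to ~/iperf3.log by random_net.sh):
grep ' sender' ~/iperf3.log | awk '{print $(NF-3), $(NF-2)}'   # average bandwidth reported by the sender side of each run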
NOTE: This blog documents my personal opinions only and does not reflect my employer’s positions on any subject written about here.