Here are a few steps to help analyse your congestion problem:
What is IB congestion?
- IB congestion is a situation where nodes fail to send data or their send rate decreases
- In most cases when an IB network is experiencing congestion, there will be no packet drops, just slowness
- Usually IB congestion is caused by a slow receiving node.
It can also be caused by the network itself, when the network is blocking by design or due to a fault
How to identify congestion situation:
- Network is slow. The packet rate of some or all of the nodes decreases dramatically
- No packet drops in the fabric. If the network drops packets it is probably not real congestion, just a physical problem that should be locally identified and fixed
Suspect #1: Physical Layer Issues
Physical layer issues can cause degraded performance of the fabric. In order to eliminate any impact on the fabric by physical layer issues, fabric cleanup is required.
Information on fabric status and ports’ counters can be collected using the ibdiagnet tool (from the UFM server where we have the ibdiagnet2 version installed):
ibdiagnet -r -pc -P all=1 --pm_pause_time 600 -o <output_dir>
- It is recommended to specify the output directory so that files do not get overwritten
- Output files can be used in other sections of this technical guide
In the ibdiagnet2.log file, look for ports reporting one or more of the following physical layer issues:
- link_down_counter – ignore scheduled server reboots
-E- lid=0x0143 dev=51000 xxxxxxxx/U1/P36
Performance Monitor counter : Value
link_down_counter : 3 (threshold=0)
- Links degraded speed and width – links with reduced capability will be reported in the “Speed / Width checks” section
Speed / Width checks
-I- Link Speed Check (Compare to supported link speed)
-E- Links Speed Check finished with errors
-E- Link: S0002c902004213d3/N0002c902004213d0(Infiniscale-IV Mellanox Technologies)/P24<-->switch-1137be:IS5030/U1/P32 - Unexpected actual link speed 2.5
-I- Link Width Check (Expected value given = 4x)
-E- Links Width Check finished with errors
-E- Link: S0002c902004213d3/N0002c902004213d0(Infiniscale-IV Mellanox Technologies)/P24<-->switch-1137be:IS5030/U1/P32 - Unexpected width, actual link width is 1x
- link_error_recovery_counter
-E- lid=0x0009 dev=51000 xxx/U1/P32
Performance Monitor counter : Value
link_error_recovery_counter : 255 (overflow)
- max_retransmission_rate – check for increments during the test run. Look for anything greater than the threshold of 500 (the threshold shown in the example below is set by the ibdiagnet test flag “-P all=1”)
-E- Ports counters Difference Check (during run) finished with errors
-E- Sf4521403004d20a0/r xxx/P6 - "max_retransmission_rate" increased during the run (difference value=1,difference allowed threshold=1)
- symbol_error_counter – relevant only for non FDR/FDR10 links
-E- lid=0x016e dev=23131 S0008f1040040c018/N0008f10500650e4e/P30
Performance Monitor counter : Value
symbol_error_counter : 65535 (overflow)
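If the log is large, a small script can pull out the counter errors for you. This is only a minimal sketch: it assumes the log layout matches the excerpts above (an "-E- lid=..." line, then the counter name and value a couple of lines later), and the names find_counter_errors and WATCHED are made up for this illustration. It does not cover the max_retransmission_rate difference check, which is reported in a different format.

```python
import re

# Assumed layout, taken from the excerpts above:
#   -E- lid=<lid> dev=<dev> <port name>
#   Performance Monitor counter : Value
#   <counter_name> : <value> (<note>)
HEADER = re.compile(r"^-E-\s+lid=(\S+)\s+dev=(\S+)\s+(.+)$")
COUNTER = re.compile(r"^(\w+)\s*:\s*(\d+)\s*\((.*)\)\s*$")

# Counters named in this guide as physical layer indicators.
WATCHED = {
    "link_down_counter",
    "link_error_recovery_counter",
    "symbol_error_counter",
}


def find_counter_errors(log_text):
    """Return (port, counter, value, note) tuples for watched counters."""
    hits = []
    port = None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # Remember which port the following counter lines belong to.
            port = header.group(3)
            continue
        counter = COUNTER.match(line)
        if counter and port is not None and counter.group(1) in WATCHED:
            hits.append(
                (port, counter.group(1), int(counter.group(2)), counter.group(3))
            )
    return hits
```

You would feed it the text of ibdiagnet2.log from your output directory and review each reported port by hand.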
- UFM Port Counters CSV diagnostic
Configure UFM to collect PortCounters CSV files in the gv.cfg configuration file:
[CSV]
max_files= 5
write_interval= 30
ext_ports_only= no
Output files will be saved in this location on the UFM server: /opt/ufm/files/csv/.
- Extract the latest file and open it with Excel
- Format it as a table
- Relevant columns for physical layer issues:
- E: Width – look for any port without 4x width
- T: SymErr – SymbolError. Relevant for non FDR/FDR10 links
- U: LinkRecovers
- V: LinkDowned
- AY: Speed – look for any degraded rate
- AZ: Status – look for anything not OK
Device name and port can be found in columns P and B respectively.
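If you prefer not to filter in Excel, the same checks can be sketched in Python. This assumes the CSV header row uses the column names listed above (Width, SymErr, LinkRecovers, LinkDowned, Status) and that a healthy port reports a Status of "OK"; please verify both against your own file. The Speed column is left out because the expected rate depends on your fabric. The function names are only for illustration.

```python
import csv

# Column names are assumptions based on the labels listed above; check them
# against the header row of your own PortCounters CSV file.
ERROR_COLUMNS = ("SymErr", "LinkRecovers", "LinkDowned")


def physical_issues(row, expected_width="4x"):
    """Return a list of reasons why this port row looks suspicious."""
    reasons = []
    width = row.get("Width", "").strip()
    if width != expected_width:
        reasons.append("width=" + width)
    for col in ERROR_COLUMNS:
        value = row.get(col, "").strip()
        # Only flag counters that parse as a positive integer.
        if value.isdigit() and int(value) > 0:
            reasons.append(f"{col}={value}")
    status = row.get("Status", "").strip()
    if status.upper() != "OK":
        reasons.append("status=" + status)
    return reasons


def scan_csv(path):
    """Yield (row, reasons) for every row with at least one issue."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            reasons = physical_issues(row)
            if reasons:
                yield row, reasons
```

Calling scan_csv on the latest file under /opt/ufm/files/csv/ would then give you the rows worth a closer look.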
Suspect #2: Unresponsive node/s issue
Look for nodes that do not respond to fabric MADs. Nodes can get into this state if there is an issue with the OS, driver or card firmware. Once identified, unresponsive nodes should not participate in any job in the fabric.
If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, etc.
- Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes
- If there are unresponsive nodes in the fabric, you will get one “Connection timed out” line per unresponsive node at the start of the command output, with the specific direct path to the node
Example:
root # ibnetdiscover
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out
#
# Topology file: generated on Mon Mar 2 17:19:19 2016
#
# Initiated from node f4521403008b9a30 port f4521403008b9a31
- Identify the unresponsive node/s:
- From the same node where the direct path command was invoked, run:
smpquery nd -D <direct_path_without_last_number>
Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"
- The unresponsive device is connected to the device returned in the previous step, on the port given by the last number in the direct path
Example: for direct path "0,1,18", the unresponsive device will be connected to port 18
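The two steps above can be sketched in Python for fabrics with many timeouts. This is a rough illustration that assumes the error lines look exactly like the ibnetdiscover example above; the function names are hypothetical, and the script only builds the smpquery command string, it does not run it.

```python
import re

# Matches the direct route in lines such as:
#   ... umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out
DR_PATH = re.compile(
    r"DR path slid \d+; dlid \d+; ([\d,]+) Attr .*Connection timed out"
)


def unresponsive_paths(output):
    """Collect the direct paths of all timed-out queries, in order."""
    matches = (DR_PATH.search(line) for line in output.splitlines())
    return [m.group(1) for m in matches if m]


def split_direct_path(path):
    """Split '0,1,18' into the parent path '0,1' and the exit port 18."""
    hops = path.split(",")
    if len(hops) < 2:
        raise ValueError("direct path has no parent hop: " + path)
    return ",".join(hops[:-1]), int(hops[-1])


def smpquery_command(path):
    """Return the command to identify the parent device, plus the port."""
    parent, port = split_direct_path(path)
    return f"smpquery nd -D {parent}", port
```

For the path "0,1,18" this gives the command "smpquery nd -D 0,1" and port 18, matching the manual example.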
Suspect #3: Slow Receivers
- Nodes that push back on data because they can’t process it fast enough
- A slow node will not give the switch credits to send traffic. The backpressure then spreads to other connected switches, whose buffer space is taken up by the delayed traffic
Congested links:
- A congested link is one that sends or receives a high amount of data (high XmitPacket/RcvPacket) and also shows high rates of XmitWait
- We can get a clear indication of congestion if: XmitWait / XmitPackets > 10
(the ratio between XmitWait and XmitPackets is greater than 10)
Possible causes for slow receiver:
- Server resources
- CPU speed – it is recommended to work with CPU in max performance mode
- Memory – a bad DIMM or memory section can decrease the server performance. This can only be detected with low-level memory testing utilities
- PCI connection – degraded Gen (speed) and/or width
More information can be found in the Performance Tuning Guide document.
- Detecting slow receivers using PortCounters CSV file
To use this method, the reset counters policy should be reset_every_poll (only data counters will be reset).
- Extract the two latest CSV files (by naming convention)
- Open the 2 files in Excel and format as tables
- Copy the XmitWait column from the older file to the new file right next to the XmitWait column in the newer file
- Insert a new column (NEW_XmitWait) and calculate the delta between the 2 XmitWait values (we want the number of ticks counted between the 2 files)
- In column D (NodeType) select only Switch
- In column AR (PeerPlatform) select only Computer
- Insert a new column, Congestion Ratio, with the formula: NEW_XmitWait/XmitPkts
- Sort Congestion Ratio column from largest to smallest
- Start from the top and investigate any transmitting port reporting a ratio greater than 10
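The Excel steps above could also be scripted. The sketch below is only an illustration: it assumes the CSV headers are named XmitWait, XmitPkts, NodeType and PeerPlatform as in the steps, that the counter values are plain integers, and that you pass in whichever columns identify a port (key_cols is a hypothetical parameter, since the header names of the device and port columns are not given here). Rows would come from csv.DictReader on the two files.

```python
def congestion_ratios(old_rows, new_rows, key_cols, threshold=10.0):
    """Rank switch ports facing hosts by delta(XmitWait) / XmitPkts.

    old_rows / new_rows: lists of dicts from the older and newer CSV file.
    key_cols: column names that together identify a port (assumption).
    """
    def key(row):
        return tuple(row[col] for col in key_cols)

    # XmitWait is cumulative, so remember the older value per port.
    old_wait = {key(row): int(row["XmitWait"]) for row in old_rows}

    ranked = []
    for row in new_rows:
        # Keep only switch ports whose peer is a host.
        if row["NodeType"] != "Switch" or row["PeerPlatform"] != "Computer":
            continue
        port = key(row)
        packets = int(row["XmitPkts"])
        if port not in old_wait or packets == 0:
            continue
        delta_wait = int(row["XmitWait"]) - old_wait[port]
        ratio = delta_wait / packets
        if ratio > threshold:
            ranked.append((port, ratio))

    # Largest ratio first, as in the sorted Excel table.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

The ports at the top of the returned list are the ones transmitting towards the most likely slow receivers.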
- Detecting slow receivers using ibdiagnet2
With this method, manual mapping between GUIDs and hostname is required.
This can be done using the Excel vlookup function and any parsed hostname <-> GUIDs list.
- Copy the “PM_INFO” data from the ibdiagnet2.db_csv file to an Excel sheet and format it as a table
[Example screenshot: PM_INFO table with all other columns hidden]
- Calculate the Congestion index = XmitWait / XmitPkt
Either 32-bit or 64-bit counters can be used. 64-bit counters require an additional conversion from hex to decimal
- Complete data & Analyze results
Congestion index: Normalized XmitWait [ticks] = ∆XmitWait / ∆XmitPackets
- Average number of ticks a packet waits at the head of the queue
Ports with Congestion index >= 10 should be treated as congested
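As a rough sketch of the calculation above, including the hex-to-decimal step for 64-bit counters: the function names are invented for this example, and it assumes hex values carry a 0x prefix while decimal values do not. You would pass in the XmitWait and XmitPackets values taken from two PM_INFO samples of the same port.

```python
def to_int(value):
    """Convert a counter string to int; '0x...' is treated as hex."""
    text = value.strip().lower()
    if text.startswith("0x"):
        return int(text, 16)
    return int(text, 10)


def congestion_index(wait_before, wait_after, pkts_before, pkts_after):
    """Return delta(XmitWait) / delta(XmitPackets), or None if no traffic."""
    delta_wait = to_int(wait_after) - to_int(wait_before)
    delta_pkts = to_int(pkts_after) - to_int(pkts_before)
    if delta_pkts <= 0:
        # No packets sent between the samples (or the counter was reset).
        return None
    return delta_wait / delta_pkts


def is_congested(index, threshold=10.0):
    """Apply the >= 10 rule from this guide."""
    return index is not None and index >= threshold
```

A port whose XmitWait grew by 10000 ticks while it sent 1000 packets has an index of 10 and would be treated as congested.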
Suspect #4: Network issues
Routing issues can be investigated by Mellanox support using the following information:
- ibdiagnet output files
- Opensm log
- Opensm configuration files (/opt/ufm/files/conf/opensm/)
- ibnetdiscover
- partitions.conf
- /opt/ufm/files/log/opensm-sa.dump
- Root GUIDs file
Using MSTK:
Missing links or devices can cause degradation in performance.
You can use the /opt/ufm/support/MSTK5.5/Linux/Host-Tools/ib-topology-viewer.sh script on the UFM server to back up a reference topology summary and compare it with any newly collected topology summary.
[root@xxx Host-Tools]# ./ib-topology-viewer.sh
ib-topology-viewer.sh Version 5.5
MF0;xxx:SX6036/U1(0x0002c903004693c1) 1 HCA ports and 2 switch ports.
SwitchIB Mellanox Technologies(0x7cfe9003009ea930) 2 HCA ports and 3 switch ports.
SwitchIB Mellanox Technologies(0x7cfe900300bf8530) 1 HCA ports and 1 switch ports.
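One way to do the comparison, assuming you have saved the script's output to text files yourself (the script is not known to provide this, so treat it as a manual step), is a plain line diff. The sketch below uses Python's difflib; topology_changes is a hypothetical helper name.

```python
import difflib


def topology_changes(reference_text, current_text):
    """Return the summary lines that were removed ('-') or added ('+')."""
    diff = list(
        difflib.unified_diff(
            reference_text.splitlines(),
            current_text.splitlines(),
            fromfile="reference",
            tofile="current",
            lineterm="",
            n=0,  # no context lines, only the changes
        )
    )
    # The first two lines are the '---' / '+++' file headers; skip them
    # and drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

A switch line that shows fewer HCA or switch ports than in the reference summary points to a missing link or device.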
Using ibnetdiscover:
- Cache ibnetdiscover data – this will be the reference data:
ibnetdiscover --cache <file>
- Compare any new ibnetdiscover to the cached data:
ibnetdiscover --diff <cache_file>
The output will contain the changes between the cached data and the new ibnetdiscover output.
BR
Marc