This is one of the reasons I have given up on InfiniBand for our storage: protocol support gets dropped and adapter support gets dropped as soon as newer versions appear. Just not worth it anymore.
Re: Which ESXi driver to use for SRP/iSER over IB (not Eth!)?
Will we have the honour of someone from Mellanox responding to this one? They seem to be monitoring these forums pretty closely, so no response so far looks a bit strange...
Re: VMware ESXi 6.0 virtual ib_ipoib interfaces
I have a similar problem: I can't get IP over IB working.
MT27500 ConnectX-3 (PSID MT_1100120019) on ESXi 6 (latest updates, applied 2 days ago). In the SuperMicro BIOS I set “Above 4G Decoding” to “Disabled” on the PCIe Configuration page, since otherwise the card was not recognized on older SuperMicro hardware.
First I installed the 3.15.5.5 driver and also updated the firmware to 5150.
Then I had a vmnic2 reporting up to 40000, but the link was down.
Then I uninstalled it and installed OFED 2.4.0.0.
Then I had a vmnic2 reporting up to 1000GB, link still down.
I also tried installing both drivers together, with the same issue: link down, and the card is recognized as Ethernet rather than IPoIB.
I already opened a support ticket with VMware, but the technician had no solution.
It seems the last known driver to work is 1.8.2.4 (see here: vSphere 6.0), but that driver is not supported on 6.x.
By the way: when you now select OFED 2.4.0.0 you are redirected to 2.3.3.1 for download (http://www.mellanox.com/page/products_dyn?product_family=36&mtag=vmware_drivers ).
I had no problem with the same cards in similar (older) hardware and ESX 5.5.
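In case it helps anyone comparing notes, these are the standard esxcli commands I use on the host to check what actually got installed and how the card registered (vmnic2 in my case; adjust the name for your setup):
esxcli software vib list | grep -i mlx        (which Mellanox/OFED VIBs are present)
esxcli network nic list                       (does the vmnic show up, and with what link state and speed)
esxcli network nic get -n vmnic2              (which driver claimed the port, plus its driver info)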
Re: Windows, TCP/IP - lowest possible latency?
How low an RTT could I expect to see with Mellanox hardware?
Re: Which ESXi driver to use for SRP/iSER over IB (not Eth!)?
The Mellanox SRP initiator in vSphere OFED 1.8.2.4 gives me good performance.
But it has an issue with my ZFS SRP target that breaks the VM auto-start function on the ESXi host.
SRP is a native RDMA SCSI protocol: a lightweight storage protocol, and a very old one.
The Mellanox iSER initiator in vSphere OFED 1.8.3 also gives me good performance.
The IPoIB iSER initiator works properly with my ZFS iSER target, and VM auto-start on the ESXi host works perfectly.
But both of these protocols on ESXi 6.x are a one-way trip to PSOD world.
IPoIB iSCSI has historically been a poor performer.
It gave me only 450~650MB/s throughput with a 2-port ConnectX-2 HCA against a ZFS IPoIB iSCSI target.
And the very high processor usage will send you off to tour the Andromeda Galaxy above your roof!
I decided to move to IPoIB iSCSI with vSphere OFED 2.4.0 and try the SR-IOV function in an ESXi 6.x environment, because vSphere OFED 2.4.0 supports SR-IOV as well as both IB and ETH modes (VPI).
vSphere OFED 2.4.0 showed me a big improvement in IPoIB iSCSI performance with two MHQH19B-XTR QDR HCAs.
Peak performance was over 2GB/s...
But still nothing like an RDMA protocol...
01. Mellanox vSphere OFED 2.4.0 iometer test in IPoIB iSCSI with two MHQH19B-XTR HCAs and an OmniOS ZFS iSER target
Also SR-IOV support was impressive.
Of course, ConnectX-2 is an EOL product, so I built a custom-configuration firmware from the 2.9.1314 binary on the IBM site to enable SR-IOV.
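For reference, burning the image itself is nothing special once it is built; a rough sketch (the mst device name and file name below are just placeholders for my setup):
mst start
flint -d /dev/mst/mt26428_pci_cr0 -i fw-MHQH19B-XTR-2.9.1314-custom.bin burn
flint -d /dev/mst/mt26428_pci_cr0 q           (verify the firmware version and PSID afterwards)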
Here are the results.
02. SR-IOV VF list in MHQH19B-XTR ConnectX-2 HCA
02. vSphere 6.0 Update 2 Build 4192238 PCI device lists
03. Windows Guest RDMA communication test
Mellanox shows a good product concept and plenty of functions in their brochures.
But it's almost impossible to get good manuals and best-practice guides like you can from Dell and others.
The test results above came out of many, many failures.
Unstable driver!
Unstable firmware!
Undocumented driver options!
For what?
RDMA is a good concept and RDMA NICs show excellent performance!
Mellanox says their products support almost every major OS environment.
But there have always been many bugs and limits in almost every OS environment.
The latest products, ConnectX-4 and ConnectX-5, are likewise of little use in vSphere and Hyper-V environments.
vSphere 5.x is almost at end of life, and historically Mellanox has shipped beta-level drivers and the laziest support for the vSphere environment.
The vSphere OFED 1.8.3 beta IPoIB iSER initiator has a critical bug that was shown in public some years ago.
I think the vSphere OFED 1.8.3 beta IPoIB iSER initiator is a poor-quality, tricky modification of the SRP initiator released as an experiment!
Everybody can see how critical that issue was from the resolved issues in the vSphere Ethernet iSER driver 1.9.10.5 release notes.
I'm waiting for VMworld in Aug. 2016, and expect a vRDMA NIC driver in the new VMware Tools and stable IPoIB iSER initiator support.
I'll decide in the near future whether to move or not...
Question about stacking
We just got two SX1012 switches for use with a SCALE cluster.
We received a stacking cable with the pair, which I have plugged into port 10 on both switches.
When I asked Scale, they said that plugging in the cable is all that I would have to do to tie the switches together?
This didn't seem to make sense to me, and in any case, doing that did not let me reach both switches when only one of them had the mgmt interface plugged in.
So I looked around and decided I needed to set up MLAG, so I found some directions and did that.
Now both switches are set up that way, but I am still missing some piece.
If I do a 'show mlag' on switch one, it says it is active-full and that it is the master, but operationally down.
If I do a 'show mlag' on switch two, it says it is active-full and that it is the standby, but operationally down; however, it also shows an MLAG members summary and lists itself, which switch one does not.
So I am thinking something on switch one is not enabled, but if I go through the configs line by line, they appear to have the same settings for all ports, port-channels, etc.
The plan is to do the following.
Connect both switches to each other via the stacking cable; they then connect to 3 Scale nodes, with one connection from each switch going to the Backplane and LAN ports on the Scale nodes.
I then want to use ports 11 and 12 on the Mellanox switches to port-channel over to my 2 Cisco switches, so that each Mellanox has 10 Gb to each of my Ciscos, LACP'd for 20 Gb from the network core to the Mellanox switches.
What am I missing to make these connections? I want to ensure that if either of my Mellanox switches or either of my Ciscos fails, users will still be able to reach the VMs on the Scale cluster.
Any help would be appreciated; if you need to see the Mellanox configs, please let me know.
thanx,
'State: Initializing' but works
Hi all,
I am linking 2 identical nodes with 2 IB cables (the 2 ports are on different subnets). I found that port 2 is stuck in the 'Initializing' state (until yesterday, both ports were Active).
The subnet manager is running. I swapped the cables, but the port 2 status did not change.
Even stranger, both ports are (probably) working normally, because the two nodes can communicate through both pairs of ports.
Does anyone know what is happening?
-------------------------------
CA 'mlx4_0'
CA type: MT25408
Number of ports: 2
Firmware version: 2.9.1000
Hardware version: a0
Node GUID: 0x0008f1040399e550
System image GUID: 0x0008f1040399e553
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251086a
Port GUID: 0x0008f1040399e551
Link layer: InfiniBand
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0008f1040399e552
Link layer: InfiniBand
Re: 'State: Initializing' but works
Hi,
Do you have a subnet manager running on the 2nd subnet via port 2?
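Your ibstat output shows "SM lid: 0" on port 2, which normally means no SM has been elected on that subnet, so the port never moves past Initializing. If that is the case, here is a rough sketch of how to check and, if needed, bring one up on that port (assuming OpenSM from your OFED install; the GUID is port 2's GUID from your output):
sminfo -C mlx4_0 -P 2                         (asks for the master SM on the port-2 subnet; errors out if there is none)
opensm -g 0x0008f1040399e552 -B               (binds OpenSM to port 2 by its port GUID and runs it in the background)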
Re: Which ESXi driver to use for SRP/iSER over IB (not Eth!)?
This is pretty much exactly the question I am currently trying to answer for myself.
We have a few ESXi hosts currently on v5.5 with dual port VPI adapters.
We want to run SRP/iSER on one port back to our SAN and we want to run IPoIB on the other port for vMotion communications.
We don't want to go to Ethernet based adapters and/or have to buy managed switches to achieve this.
Ultimately we don't want to implement this and then find it won't run on v6.x either, so forward compatibility is important as well.
Re: MCX354A-FCBT & Switch Speed Issue
Sorry for the lack of response on this - I have been so busy lately this project has got delayed due to more important commitments.
At this stage I think the FDR vs. FDR10 answer is the most likely solution...
How would I go about confirming the switch model from the software utilities? I can't find the above model numbers anywhere on the switch (it's second hand).
Also, I was under the impression that the SX60xx switch family was all based on SwitchX-2 hardware, which can do 56 Gb/s per port.
Is there a firmware upgrade we can use to increase the speed from 40Gb/s to 56Gb/s?
Re: Question about stacking
Hi Craig,
The SX1012 switches don't support stacking; they do support MLAG. Can you try reviewing the article below (if you haven't reviewed it already)?
HowTo Configure MLAG on Mellanox Switches
Also, after doing that, please paste the configurations of both switches here and indicate what each port is connected to.
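For orientation, the IPL part of that article boils down to roughly the following on each switch. This is only a sketch: the VLAN, the 10.10.10.x addresses and the mlag-vip address are placeholders you would adapt to your network, and ethernet 1/10 is simply where your existing cable sits; please follow the article for the exact, complete procedure.
lacp
protocol mlag
vlan 4094
interface port-channel 1
interface ethernet 1/10 channel-group 1 mode active
interface vlan 4094
interface vlan 4094 ip address 10.10.10.1 255.255.255.252     (use 10.10.10.2 on the second switch)
interface port-channel 1 ipl 1
interface vlan 4094 ipl 1 peer-address 10.10.10.2             (the other switch's IPL address)
mlag-vip MY-MLAG-DOMAIN ip 192.168.1.100 /24 force            (a spare IP on the mgmt network, same domain name on both switches)
no mlag shutdown
The ports facing the Scale nodes and the Cisco uplinks then go into mlag-port-channels (interface mlag-port-channel N, interface ethernet 1/X mlag-channel-group N mode active) rather than ordinary port-channels.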
Re: Question about stacking
Well, I have it configured to talk over the port the stacking cable is plugged into, but if you are saying that it can't use a stacking cable, that is probably my issue. I will investigate further.
Re: MCX354A-FCBT & Switch Speed Issue
Hi,
MSX6015T-1SFS
MSX6015T-1BRS
These two models can only do FDR10. It's a hardware limitation.
Normally there is a sticker on the switch with a part number and serial number.
Try this:
If you have a Mellanox OFED driver installed on your servers, the MFT tools are probably installed.
1) From one of the nodes in the fabric (directly connected to this switch), invoke: mst start, then mst ib add, then mst status.
If "mst start" is not recognized:
Download and install MFT (Mellanox Firmware Tool):
http://www.mellanox.com/content/pages.php?pg=management_tools&menu_section=34
Refer
to the User Manual for the installation instructions:
http://www.mellanox.com/pdf/MFT/MFT_user_manual.pdf
2) Invoke the following command:
flint -d <remote_switch_mst_device> q (the "remote switch mst device" string is listed in the mst status output under "Inband Devices").
Example: [root@zahn11 ~]# flint -d /dev/mst/SW_MT48438_IS5030_lid-0x0008 q
You'll see a PSID identifier:
Example: PSID: MT_0D00110012
This PSID will tell us the system capabilities and we can trace it to a part #. You could probably just Google it and find a quick hit on a Mellanox page, or reply back to this post and I can check it for you.
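Putting the steps together, the whole sequence from a directly connected node looks roughly like this (the switch device name is just the example from above; yours will differ):
mst start
mst ib add
mst status                                                   (note the switch entry under "Inband Devices")
flint -d /dev/mst/SW_MT48438_IS5030_lid-0x0008 q             (look for the PSID line in the output)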
I expect you'll determine the switch is in fact FDR10 only, limiting its max speed to 40G. It's a bit better than QDR (also 40G) because it has an enhanced proprietary encoding scheme.
mlnx_tune does not detect the BIOS I/O non-posted prefetch settings?
I have four AIC SB122A-PH 1U all NVMe storage servers.
- Two of them have Intel E5-2620v3 2.4GHz CPUs, 8 x 16GiB 1833MHz DDR4 DIMMs, 2 x DC S3510 SATA SSDs, 8 x DC P3700 NVMe SSDs
- Two of them have Intel E5-2643v3 3.4GHz CPUs, 8 x 16GiB 2133MHz DDR4 DIMMs, 2 x DC S3510 SATA SSDs, 8 x DC P3700 NVMe SSDs
- All four run CentOS 7.2
[root@fs00 ~]# uname -a
Linux fs00 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
- All four are installed with MLNX_OFED_LINUX-3.3-1.0.4.0-3.10.0-327.22.2.el7.x86_64
- All four have a Mellanox EDR HCA
[root@fs00 tmp]# ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.16.1006
Hardware version: 0
Node GUID: 0x7cfe90030029288e
System image GUID: 0x7cfe90030029288e
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x7cfe90030029288e
Link layer: InfiniBand
As a preparation for tuning, I ran mlnx_tune -r to get some ideas first. So, I got
>>> PCI capabilities might not be fully utilized with Hasweel CPU. Make sure I/O non-posted prefetch is disabled in BIOS.
Fine. I updated the BIOS; the settings now look like the following. It should be obvious that the non-posted prefetch settings are all Disabled, and I made sure all PCIe ports have the same settings.
But upon re-running mlnx_tune -r, I still saw the same warning. So either the BIOS software or Mellanox's mlnx_tune is incorrect; it's impossible that both are right. Any hints as to where to dig further to determine the actual source of the warning?
[root@fs00 ~]# mlnx_tune -r
2016-08-12 15:21:22,928 INFO Collecting node information
2016-08-12 15:21:22,929 INFO Collecting OS information
2016-08-12 15:21:22,931 INFO Collecting CPU information
2016-08-12 15:21:22,987 INFO Collecting IRQ balancer information
2016-08-12 15:21:23,002 INFO Collecting firewall information
2016-08-12 15:21:23,787 INFO Collecting IP forwarding information
2016-08-12 15:21:23,791 INFO Collecting hyper threading information
2016-08-12 15:21:23,791 INFO Collecting IOMMU information
2016-08-12 15:21:23,793 INFO Collecting driver information
2016-08-12 15:21:27,185 INFO Collecting Mellanox devices information
Mellanox Technologies - System Report
Operation System Status
CENTOS
3.10.0-327.28.2.el7.x86_64
CPU Status
Intel Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz Haswell
OK: Frequency 2600.156MHz
Hyper Threading Status
ACTIVE
IRQ Balancer Status
ACTIVE
Driver Status
OK: MLNX_OFED_LINUX-3.3-1.0.4.0 (OFED-3.3-1.0.4)
ConnectX-4 Device Status on PCI 84:00.0
FW version 12.16.1006
OK: PCI Width x16
>>> PCI capabilities might not be fully utilized with Hasweel CPU. Make sure I/O non-posted prefetch is disabled in BIOS.
OK: PCI Speed 8GT/s
PCI Max Payload Size 256
PCI Max Read Request 512
Local CPUs list [6, 7, 8, 9, 10, 11, 18, 19, 20, 21, 22, 23]
ib0 (Port 1) Status
Link Type ib
OK: Link status Up
Speed EDR
MTU 2044
2016-08-12 15:21:27,459 INFO System info file: /tmp/mlnx_tune_160812_152122.log
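A sanity check I also ran, for what it's worth: the non-posted prefetch knob itself is not visible from the OS, but lspci at least confirms the link width/speed and max read request that mlnx_tune reports for the adapter at 84:00.0:
lspci -s 84:00.0 -vvv | grep -E 'LnkSta:|MaxPayload|MaxReadReq'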
How to setup IPoIB correctly?
We have a small network for testing. Its layout is shown below. The orange-colored IP addresses are the current IPoIB setup; they are configured via /etc/sysconfig/network-scripts/ifcfg-ib0 files using IP addresses in an existing subnet, 192.168.11.0/24. The switch is a Mellanox SN2410. As indicated in the picture, the gateway IP address for this subnet is 192.168.11.3, assigned to a 10GbE interface of the bastion host.
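For reference, each of those files looks roughly like the following (the address is just an example; every node gets its own from that subnet, and the CONNECTED_MODE/MTU lines are something I am experimenting with, since datagram mode with the default 2044 MTU is commonly cited as a throughput limiter for IPoIB):
DEVICE=ib0
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.11.21
PREFIX=24
CONNECTED_MODE=yes
MTU=65520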
At present, the SB7700 runs SM.
With this IPoIB setup,
- I have run into problems using e.g. ibv_rc_pingpong - please see the test session below:
- We can't get more than 10Gbps with iperf3 or iperf over these IPoIB IPv4 addresses, despite the fact that we have an SB7700 IB switch!
A. Server:
local address: LID 0x0000, QPN 0x0000e0, PSN 0x79172e, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP
I noticed the GID shows as ::, which does not look right to me.
B. Client:
local address: LID 0x0000, QPN 0x0000e0, PSN 0x4ba0b3, GID ::
client read: Unknown error 524
Couldn't read remote address
As above, the GID is shown as ::.
ping to each IP address works however.
ibping also works:
A' Server:
CA 'mlx4_0'
CA type: MT4099
Number of ports: 1
Firmware version: 2.33.5040
Hardware version: 1
Node GUID: 0x7cfe900300b98f30
System image GUID: 0x7cfe900300b98f33
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe900300b98f31
Link layer: InfiniBand
[root@sc2u0n0 ~]# ibping -S -C mlx4_0 -P 1
B' Client:
--- sc2u0n0.(none) (Lid 5) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 573 ms
rtt min/avg/max = 0.017/0.057/0.099 ms
All nodes run:
OFED: MLNX_OFED_LINUX-3.3-1.0.4.0-rhel7.2-x86_64
OS: CentOS 7.2: Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
As such, I suspect that we can't use 192.168.11.0/24 for IPoIB. But if we use another subnet, the question is where and how do I set up the gateway?
Setting up IPoIB seems to be simple; the Mellanox Academy even has a short video, "IPoIB Performance Measurement". Towards the end, the speaker shows the use of ifconfig to set up temporary IPv4 addresses on the two IB-equipped servers. But IMHO that's not enough! According to my understanding, the Ethernet-like broadcast domain that IPoIB uses is an "emulated" one, so somewhere there must be a gateway. I believe this should be done in the IB switch, but I have Googled a lot without finding any info on how to do that part. I read my SB7700 IB switch manuals carefully; no clues there either.
I am still getting up to speed with InfiniBand, although I am already comfortable configuring switches and HCAs. If there is something I missed, please let me know.
Thanks!
Re: dma_map_single not able to receive capsules on ARM with mellanox card
Hello,
I advise you to contact Mellanox Support (support@mellanox.com), describe your project and configuration, and further elaborate on what you're trying to do with Mellanox NICs on ARM platforms.
Can multiple versions of mlnx-ofed exist in the same IB fabric?
Hi, we have compute nodes installed with RHEL Server 6.3 and MLNX_OFED_LINUX-2.0-2.0.5 on an InfiniBand network. We intend to upgrade to RHEL Server 6.7 and MLNX_OFED 3.x.
In a rolling upgrade scenario, can we have both versions of OFED co-existing on the same fabric?
thanks, Greg
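(For what it's worth, during the rolling window we would simply track which stack each node is on with a quick query, e.g. ofed_info -s, which prints the installed MLNX_OFED version string on that node.)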
Re: Windows, TCP/IP - lowest possible latency?
Should I expect any difference between ConnectX 3 Pro and ConnectX 4 EN in this case?
I have 2 colocation sites located 6-7km apart. I can rent a 10Gb wave between both sites, which I want to connect to my switches (an SX1036 at each site). Will I be able to use RDMA from one site to the other? If not, what is the max distance that I can cro
The title pretty much says it all;
We are starting our cloud-hosting company at 2 sites; in the future it will be 3 sites in total. For now, however, we start at 2 sites.
I have an SX1036 with dual PSU that I am using as my main switch at both sites. At each site I have 1x 10GbE transit/internet, and 2x (redundant) 10GbE wave interconnects between the sites.
My question: how do I connect both switches so I can live-migrate VMs (using RDMA) from one DC to another? Is there a maximum distance that I can cross? How far is that distance?
How far can I go with 1310nm fiber using SX1036 / ConnectX-3 EN / VPI PRO without losing my RDMA capabilities?
Is the latency important? I am still choosing which area my colocation will be in, and I am wondering whether datacenters with a short distance between each other (let's say 250 microseconds, or 0.25ms) have any real benefits in terms of RDMA and general application load balancing. If there is no real benefit in having sub-1ms latency between sites instead of, say, 10ms, I will consider going with a longer distance between my sites for geo-redundancy.
Can you guys advise me on the benefits of having multiple datacenters that are very close together, other than redundancy? Would you consider this a bad idea? I'm guessing that DCs that are further away from each other do have some benefits in terms of latency to clients/customers, which is a big plus of course. The thing is, though, I am using SSD, NVMe and RAM only as storage, and my internal network is very low latency. My gut tells me I should also try to keep the latency between my DCs minimal, but I can't really think of any major benefit that would justify not having a wider area of customers to whom I can offer sub-10ms services. Could you guys share your opinion on this subject?
Anyone managing multiple sites that can give me some advice?
Re: VMware ESXi 6.0 virtual ib_ipoib interfaces
Hi David,
1) Driver version 2.4.0 is an Ethernet-only driver, so if you have a VPI card, please make sure the port protocol is configured as 2, which is Ethernet.
#/opt/mellanox/bin/mlxconfig -d /dev/mt4099_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2
2) Enable SR-IOV with the MFT tool (Mellanox FW tool) as follows:
#/opt/mellanox/bin/mlxconfig -d /dev/mt4099_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=16
3) Confirm BIOS setting supports SRIOV and/or virtualization.
4) Set the driver module parameters for the VFs:
#esxcli system module parameters set -m mlx4_core -p 'num_vfs=<VFs over Port1, VFs over port2, 0> port_type_array=2'
This is all documented in the driver UM.
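To verify the settings took effect, you can query them back with the same tools, for example:
#/opt/mellanox/bin/mlxconfig -d /dev/mt4099_pciconf0 query
#esxcli system module parameters list -m mlx4_core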
Thank you,
Sophie.