Channel: Mellanox Interconnect Community: Message List

Re: rx-out-of-buffer


So Arvind, I found at least one "lying" counter: imissed is not implemented in the mlx5 driver.


Re: rx-out-of-buffer


So I could finally answer this specific question, with help from support:

 

The "imissed" counter, i.e. the number of packets that could not be delivered to a queue, is not implemented in the DPDK Mellanox driver. So the only way to know whether a specific queue dropped packets is to track it with eth_queue_count, for which I added support in the DPDK driver; it is coming in the next release.

rx_out_of_buffer is actually what imissed should have been, aggregated over all queues: the number of packets dropped because your CPU does not consume them fast enough.

 

In our case, rx_out_of_buffer did not explain all the drops.

 

So we observed that rx_packets_phy was higher than rx_good_packets. Actually, if you look at ethtool -S (which contains more counters than the DPDK xstats), you will also see rx_discards_phy.

If there is no intrinsic error in the packets (checksums, etc.), you'll have rx_packets_phy = rx_good_packets + rx_discards_phy + rx_out_of_buffer.
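For reference, outside of DPDK you can watch exactly these counters with plain ethtool. A minimal sketch, assuming the mlx5 port is named ens1f0 (substitute your own interface):

# Dump only the counters involved in the relation above; the gap between
# rx_packets_phy and rx_good_packets should be explained by the other two.
ethtool -S ens1f0 | grep -E 'rx_packets_phy|rx_good_packets|rx_discards_phy|rx_out_of_buffer'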

 

So rx_discards_phy is actually (as stated in the doc mentioned above) the number of packets dropped by the NIC, not because there were not enough buffers in the queue, but because there is some congestion in the NIC or the bus.

We're now investigating why that happens, but this question is resolved.

 

Tom

VXLAN offload/RSS not working under OVS switch scenario


Hi folks,

We are using two Mellanox ConnectX-5 EN (2-port 100 GbE) cards to validate performance under an OpenStack VXLAN tenant scenario using OVS bridges. The Mellanox cards are NOT in SR-IOV mode and are currently connected back-to-back to create the VXLAN tenant network, with core IP addresses applied directly to the ports. It is our understanding from reading Mellanox literature that in this scenario the Mellanox card should be offloading the VXLAN UDP encap/decap more or less automatically (as per the OVS 'vxlan' interface configuration); however, at the moment this does not seem to be occurring. Traffic is being generated across multiple VNIDs, yet all ingress traffic from the vxlan/tenant network is getting processed by a single-core softirq process. We are not able to verify whether this is because something is not set up properly with regard to the Mellanox card/drivers or because we do not have RSS tuned as needed. Further details below; any assistance with configuration/debug suggestions or updated documentation for this scenario on ConnectX-5 would be appreciated!

 

Further details:

We have been reading the following document, although it is targeted at ConnectX-3: HowTo Configure VXLAN for ConnectX-3 Pro (Linux OVS). This document shows several items to be configured (DMFS & VXLAN port number), but it seems both of these should already be enabled under ConnectX-5 and the 'mlx5_core' driver. Also, the recommended debug/log steps in this document are no longer supported under ConnectX-5. In our setup we are using the apt package install for OVS. A few generic checks we have been looking at are sketched below.
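This is only a sketch; ens1f0, br-tun, vxlan0 and the remote_ip/dst_port values are placeholders for our own names and addresses:

# Is the VXLAN TX offload actually exposed by the kernel on this port?
ethtool -k ens1f0 | grep -E 'tx-udp_tnl-segmentation|tx-udp_tnl-csum-segmentation'

# What do the RSS hash key and indirection table look like (they should spread flows over queues)?
ethtool -x ens1f0

# Example OVS VXLAN port definition on the tunnel bridge
ovs-vsctl add-port br-tun vxlan0 -- set interface vxlan0 \
    type=vxlan options:remote_ip=192.0.2.2 options:key=flow options:dst_port=4789

Note that receive-side spreading of VXLAN traffic generally depends on the outer UDP source port varying per inner flow, so it is also worth checking that the sending side is actually generating source-port entropy.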

 

 

--------------------------------------------------------------------

Mellanox card:  Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Server version:   Ubuntu 18.04  (kernel 4.15.0-36-generic)

OVS version:   ovs-vsctl (Open vSwitch) 2.9.0

 

 

root@us-ral-1a-e1:~# mlxfwmanager --query

Querying Mellanox devices firmware ...

 

 

Device #1:

----------

  Device Type:      ConnectX5

  Part Number:      MCX516A-CCA_Ax

  Description:      ConnectX-5 EN network interface card; 100GbE dual-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6

  PSID:             MT_0000000012

  PCI Device Name:  /dev/mst/mt4119_pciconf0

  Base GUID:        98039b03004dd018

  Base MAC:         98039b4dd018

  Versions:         Current        Available

     FW             16.23.1020     N/A

     PXE            3.5.0504       N/A

     UEFI           14.16.0017     N/A

 

  Status:           No matching image found

root@us-ral-1a-e1:~#

In-Cast or Micro Burst on SN2000 Series Switch


Hi ...,

 

I would like to know whether a 100Gbps NIC is able to achieve full 100Gbps speeds, without RoCE / RDMA / VMA.

 

Consider a compute-farm-like scenario.

Say, there is a NAS with a 100Gbps NIC, connected to an SN2100 switch.

The 100Gbps switch is in turn connected to 4 x 48-port 1Gbps switches, using a 40G link to each switch.

i.e. 192 x 1Gbps client computers

 

|---------|                      |---------|  ---40G--->  | 48 Port 1G Switch |
| NAS     |  ---100G NIC --->    | SN2100  |  ---40G--->  | 48 Port 1G Switch |
| Storage |                      | Switch  |  ---40G--->  | 48 Port 1G Switch |
|---------|                      |---------|  ---40G--->  | 48 Port 1G Switch |

 

Now, if all the 192 x 1Gbps clients were to read files from the storage at the same time, will the NAS NIC be able to serve at 100Gbps (assuming that there are no bottlenecks in the storage system itself)?

 

Regards,

 

Indivar Nair

Status of RDMA over Resilient Ethernet


Dear All,

 

At SIGCOMM 2018, Mellanox announced support for RoCEv2 over Resilient Ethernet, with HCA packet drop detection (and out-of-order handling).

There is also this FAQ, but no coding procedure is provided:

Introduction to Resilient RoCE - FAQ

 

How is it possible to use this feature with a ConnectX-5, using IBV_SEND over a UD (Unreliable Datagram) queue pair?

 

Thanks for your attention.

SX6025 and QSFP-LR4-40G


Hi,

 

We have multiple SX6025 switches and want to connect them between two different rooms.

We bought 4 QSFP-LR4-40G modules LC/LC 1310.

If we connect them to the switches, we don't get a link up.

Is this possible, or are we using the wrong switches/modules?

Do we have to configure OpenSM to accept 40Gbit interfaces?
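For reference, these are the kinds of checks we can run from a host on the fabric; this is only a sketch and mlx4_0 is a placeholder HCA name:

# Physical state should already show LinkUp once the cable/module trains,
# even before the subnet manager touches the port
ibstat mlx4_0

# Shows the state, width and speed negotiated on every switch port in the fabric
iblinkinfo

As far as I know, OpenSM does not need any special configuration to accept 40Gb ports; the physical link has to come up on its own first, so no link at all would point at the modules/fibre rather than the SM.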

Best regards,

Volker

ESXi host on NEO


Hi all!

 

The managed hosts supported by Mellanox NEO are Linux and Windows. Is there any kind of roadmap to add ESXi hosts?

 

Regards!

Unable to set Mellanox ConnectX-3 to Ethernet (Failed to query device current configuration)


I have three Mellanox ConnectX-3 cards that I'm trying to set up with Proxmox (Proxmox Installer does not see Mellanox ConnectX-3 card at all? | Proxmox Support Forum).

I need to change them from Infiniband mode to Ethernet mode.

I was able to install the Mellanox Management Tools, and they can see my card:

root@gcc-proxmox:~/mft-4.10.0-104-x86_64-deb# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
root@gcc-proxmox:~/mft-4.10.0-104-x86_64-deb# mst status
MST modules:
------------
    MST PCI module loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4099_pciconf0 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:41:00.0 addr.reg=88 data.reg=92
                           Chip revision is: 01
/dev/mst/mt4099_pci_cr0  - PCI direct access.
                           domain:bus:dev.fn=0000:41:00.0 bar=0xd4f00000 size=0x100000
                           Chip revision is: 01

However, when I tried to query the current config, it complained about the firmware version being too old.

root@gcc-proxmox:~/mft-4.10.0-104-x86_64-deb# mlxconfig -d /dev/mst/mt4099_pciconf0 q
-E- Failed to open device: /dev/mst/mt4099_pciconf0. Unsupported FW (version 2.31.5000 or above required for CX3/PRO)

So I updated the firmware:

root@gcc-proxmox:~# flint -d /dev/mst/mt4099_pci_cr0 -i fw-ConnectX3-rel-2_42_5000-MCX311A-XCA_Ax-FlexBoot-3.4.752.bin burn
    Current FW version on flash:  2.10.4290
    New FW version:               2.42.5000


Burn process will not be failsafe. No checks will be performed.
ALL flash, including the Invariant Sector will be overwritten.
If this process fails, computer may remain in an inoperable state.


 Do you want to continue ? (y/n) [n] : y
Burning FS2 FW image without signatures - OK
Restoring signature                     - OK

But now, when I try to read the config, I get a new error:

root@gcc-proxmox:~# mlxconfig -d /dev/mst/mt4099_pciconf0 q


Device #1:
----------


Device type:    ConnectX3
Device:         /dev/mst/mt4099_pciconf0


Configurations:                              Next Boot
-E- Failed to query device current configuration

Any ideas what's going on, or how to get these cards working in permanent Ethernet mode?
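For reference, this is roughly what I expect the mode change to look like once mlxconfig can actually read the device; a sketch only, using the same device path as above (LINK_TYPE values on ConnectX-3: 1 = InfiniBand, 2 = Ethernet, 3 = VPI):

# Set port 1 (and port 2 on dual-port cards) to Ethernet; applied on the next reboot
mlxconfig -d /dev/mst/mt4099_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2

# Verify what will be applied at next boot
mlxconfig -d /dev/mst/mt4099_pciconf0 q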


Re: Concurrent INFINIBAND multicast writers


Hi Wayne,

 

Also, this is expected behaviour: if the number of writers increases and becomes greater than the number of readers, then congestion is expected.

Re: Concurrent INFINIBAND multicast writers


Thank you for the input. I will pull the data for the previous comment.

Re: Hang occurred when mounting with glusterfs using driver 4.4.2 for MT27500 Family [ConnectX-3] on CentOS 7.1


Hello Malfe,

Thank you for posting your question on the Mellanox Community.

Based on the information provided, we are not able to debug the issue you are experiencing. Currently we have no issues reported with Glusterfs and MLNX_OFED 4.4.

We recommend opening a Mellanox Support case through support@mellanox.com to investigate this issue further.

Thanks and regards,
~Mellanox Technical Support

Re: ESXi host on NEO


Hi Diego

 

I have checked internally, and the input I have is that Mellanox does not support ESXi on NEO, nor do we have it on our roadmap to support it in the future.

Re: In-Cast or Micro Burst on SN2000 Series Switch


With no RoCE / RDMA / VMA, and as per the configuration you've presented: theoretically, the NAS NIC of the storage node is able to serve 100Gb/s; in practice, the switch is likely to cause traffic congestion. In more detail, when at the same time a total of 192Gb/s of traffic is funnelled through 160Gb/s of switch links (4x 40Gb) and hits an adapter with "only" 100Gb/s capability, traffic will go into a congestion state and throughput will drop drastically as the switch runs into buffer overflow.

A workaround would be to enable flow control (pause frames) on the adapter interface and the switch ports.
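For reference, the host side of that workaround is just the standard ethtool pause-frame knobs; a sketch, with ens1f0 as a placeholder for the NAS-facing mlx5 interface (the peer switch ports must be configured to match):

# Show the current pause-frame settings on the adapter
ethtool -a ens1f0

# Enable global pause in both directions on the adapter
ethtool -A ens1f0 rx on tx on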

Re: sending order of 'segmented' UDP packets


Hi Sofia,

 

Could you be more specific about what is not working? In the code you mentioned "without 0x05 order it ok", at the beginning of the comment "order seems to be random", and later, "sending order gets ok".

By the way, how do you capture the packets to see the order?

Re: Unable to set Mellanox ConnectX-3 to Ethernet (Failed to query device current configuration)


I have the same issue: mlxconfig cannot read the current device configuration (I want to set SR-IOV; my cards are already in Ethernet mode). I also noticed that the BIOS screen cannot save the configuration for SR-IOV, claiming "access denied". Any help would be much appreciated.
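For context, this is the kind of change that fails for me; a sketch of what I am ultimately trying to apply (the device path and VF count are placeholders):

# Enable SR-IOV in firmware and expose 8 virtual functions; applied on the next reboot
mlxconfig -d /dev/mst/mt4099_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8

# Re-query to confirm the next-boot values
mlxconfig -d /dev/mst/mt4099_pciconf0 q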

/Louis


Re: SX6025 and QSFP-LR4-40G


Any ideas? Do we have to configure link speed in opensm.conf or partitions.conf?

 

opensm.log didn't recognize that we plugged in a new interface, and we got no link signal.

 

Regards,

Volker

Inconsistent hardware timestamping? ConnectX-5 EN & tcpdump


Hi all,

 

We recently purchased a MCX516A-CCAT from the Mellanox webstore, but encountered the following issue when trying to do a simple latency measurement, using hardware timestamping.
Using the following command to retrieve system timestamps:

ip netns exec ns_m0 tcpdump --time-stamp-type=host --time-stamp-precision=nano

Which gives the following results (for example):

 

master/server:
19:36:03.883442258 IP15
19:36:03.883524725 IP15
19:36:03.883678497 IP15
19:36:03.883703809 IP15
19:36:03.883924377 IP15
19:36:03.883939231 IP15
19:36:03.883971437 IP15
19:36:03.883985143 IP15
19:36:03.884010765 IP15
19:36:03.884021139 IP15
19:36:03.884051422 IP15
19:36:03.884062029 IP15
19:36:03.884083780 IP15
19:36:03.884091661 IP15
19:36:03.884127283 IP15
19:36:03.884135654 IP15
19:36:03.884159177 IP15
19:36:03.884167900 IP15
19:36:03.884187810 IP15
19:36:03.884197308 IP15

slave/client:
19:36:03.883379688 IP15
19:36:03.883590507 IP15
19:36:03.883659403 IP15
19:36:03.883716669 IP15
19:36:03.883914510 IP15
19:36:03.883947770 IP15
19:36:03.883961851 IP15
19:36:03.883994953 IP15
19:36:03.884005137 IP15
19:36:03.884030823 IP15
19:36:03.884046094 IP15
19:36:03.884068390 IP15
19:36:03.884078674 IP15
19:36:03.884100314 IP15
19:36:03.884119333 IP15
19:36:03.884141135 IP15
19:36:03.884152060 IP15
19:36:03.884173955 IP15
19:36:03.884182438 IP15
19:36:03.884203057 IP15

This is expected: the timestamps are in chronological order. About the traffic: small, equal-sized packets are bounced back and forth. The client initiates traffic generation, so for the client the odd-numbered timestamps are outgoing, and vice versa for the server.

But now, when using hardware timestamping, we get the following (for example):

ip netns exec ns_m0 tcpdump --time-stamp-type=adapter_unsynced --time-stamp-precision=nano

 

master/server:
14:44:04.710315788 IP15
14:44:04.758545873 IP15
14:44:04.710567282 IP15
14:44:04.758799830 IP15
14:44:04.710849394 IP15
14:44:04.759069396 IP15
14:44:04.711042879 IP15
14:44:04.759236686 IP15
14:44:04.711141554 IP15
14:44:04.759281897 IP15
14:44:04.711184281 IP15
14:44:04.759324535 IP15
14:44:04.711224345 IP15
14:44:04.759364437 IP15
14:44:04.711266610 IP15
14:44:04.759406555 IP15
14:44:04.711310310 IP15
14:44:04.759449711 IP15
14:44:04.711349465 IP15
14:44:04.759488431 IP15

slave/client:
14:44:04.758411898 IP15
14:44:04.710425435 IP15
14:44:04.758680982 IP15
14:44:04.710662581 IP15
14:44:04.758963612 IP15
14:44:04.710928565 IP15
14:44:04.759157087 IP15
14:44:04.711098779 IP15
14:44:04.759261251 IP15
14:44:04.711140994 IP15
14:44:04.759302503 IP15
14:44:04.711182978 IP15
14:44:04.759344893 IP15
14:44:04.711223669 IP15
14:44:04.759384802 IP15
14:44:04.711267547 IP15
14:44:04.759428520 IP15
14:44:04.711308661 IP15
14:44:04.759469128 IP15
14:44:04.711351810 IP15

Now we can see that the timestamps are not chronological (see the nanosecond portions), which is unexpected and makes the latency measurement impossible (as far as I can see). I expected both ports to be on their own clocks, but this does not appear to be the case (there seems to be a clock for RX and a clock for TX, instead of a clock per port). Is there a solution to this? Must I use socket ancillary data in a custom C application to receive the correct timestamps? I'll put information on the setup below. Please let me know if more information is needed. Note: applications like linuxptp do seem to work fine with hardware timestamping, and report a path delay in the sub-microsecond range.

The setup:

(setup diagram attachment: meas_E2E_exp (1).png)

CentOS Linux 7.

Kernel 3.10.0-862.14.4.el7.x86_64 (default kernel for CentOS 7.5 installation).

Mellanox OFED, latest firmware & drivers.

Using network namespaces ns_m0 with ens6f0 and ns_m1 with ens6f1 to prevent kernel loopback.
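For completeness, these are generic checks that might help narrow it down; a sketch only, with ens6f0 taken from the setup above, hwstamp_ctl/phc_ctl coming from the linuxptp package, and the /dev/ptp index taken from the ethtool output:

# Show which PTP hardware clock and which HW timestamping modes the port exposes
ethtool -T ens6f0

# Enable hardware timestamping on the interface (-r 1 = timestamp all RX, -t 1 = timestamp TX)
hwstamp_ctl -i ens6f0 -r 1 -t 1

# Read the free-running PHC associated with the port
phc_ctl /dev/ptp0 get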

Re: VXLAN offload/RSS not working under OVS switch scenario


Hello Robert,

Thank you for posting your question on the Mellanox Community.

Based on the information provided, we noticed that you also opened a Mellanox Support case which is already assigned to one of our engineers.

We will continue to assist you further through the Mellanox Support case.

Thanks and regards,
~Mellanox Technical Support

SN2700 mgmt0 offline / arp cache ?


          Hi,

 

We have two brand new SN2700 100G switches. They shipped with Onyx, and since the beginning one of them has been acting weird on the management interface. Even after upgrading to the latest version 8190, the switch comes back from a reboot not pingable, BUT if I start pinging anything from the serial connection, the gateway or the second device, the network connectivity is restored.

 

I run ping 10.0.100.100 -t and get no answer from the switch. If I then go onto the serial console and ping 10.0.254.254 or 10.0.100.101, the switch becomes reachable from my admin host again. The network guys checked the other switch that mgmt0 is connected to, and suspect ARP problems. Anything else I can do to make it stable?

The second switch worked fine out of the box; I'm not sure what's wrong with this device.
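In case it helps anyone testing the ARP theory, these are the checks I can run from the admin host side; a sketch, with eth0 as a placeholder for the admin host's uplink and 10.0.100.100 being the mgmt0 address from above:

# Is there a stale or FAILED neighbour entry for mgmt0 on the admin host?
ip neigh show 10.0.100.100

# Does the switch answer explicit ARP requests at all while it is "offline"?
arping -I eth0 -c 4 10.0.100.100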

 

Best regards

 

Tim

Re: sending order of 'segmented' UDP packets


Hi,

Packets are captured with Wireshark, and so far I have never had any issues with packet ordering when using it.

When splitting a valid UDP header and payload into linked mbufs (see example code), the order of the received packets (at least as captured with Wireshark) is not as expected; all packets are received though, none are lost. I expect the first packet in the TX array to be the first one received, which is the case when I use the same mbuf for header and payload and don't split them.

The strange thing is that when I set

ip_hdr->version_ihl = 0x40; // (and not 0x45)
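// note: 0x45 = IP version 4 with IHL 5 (20-byte header); 0x40 leaves IHL = 0, i.e. an invalid header length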

the packets are received in the correct order (same order as TX array).

 

Thanks and best regards
Sofia
