I/O connectivity, a new (5th) generation is showing off

NEIO Systems, Ltd.
Nov 8, 2021

For a long time CPUs were held back by bottlenecks in I/O. Those bottlenecks were tackled in various places and gradually improved. We first saw CPU caches hiding the latency of memory access. Caches with various levels (L1, …, L3) were introduced, and main memory itself got a lot faster as well: we enjoy DDR5 here in late 2021.

We also saw improvements in storage capacity. I remember the days when we were still juggling 1.44 MB floppy disks. My personal data rate was probably a few hundred kilobytes per second when I bicycled to university with 10 disks in my bag. But hey, that was way faster than the 33 kbit/s modem we had for our dial-up connection.

Floppy disks were followed by USB sticks. Those with a capacity of 8 GB or more are still handy because you can keep an OS image on them. Takes quite a while to get them formatted, doesn’t it?

But what we see in networking today is that standard hardware pushes more than 1 GB per second through your network. Did I mention _per second_? Even on my e-bike I won’t be able to compete with that.

Design Space for Network I/O

When communication takes place, the commercially available solution is to use the I/O bus. But the design space for network I/O is much more versatile. For best performance the communication path sits as close to the CPU as possible, minimizing hops, as every transition adds latency (yet another reason why Programmed I/O (PIO) is used in favor of DMA in ultra-low-latency systems, e.g. in trading and HPC). Over the years, solutions were presented that explored several of these options, including one that used the DIMM slot for communication! [5]
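
To make the latency argument concrete, here is a small, entirely hypothetical C sketch of the two transmit paths: with PIO the CPU writes the payload straight into a memory-mapped device window, while with DMA it only posts a descriptor and the NIC later fetches the data from host memory on its own (extra hops, hence extra latency). The register offsets, descriptor layout and doorbell mechanism below are invented for illustration; real NICs differ.

```c
#include <stdint.h>

/* Hypothetical transmit descriptor as a NIC might fetch it via DMA. */
struct tx_desc {
    uint64_t addr;   /* DMA address of the payload in host memory */
    uint32_t len;
    uint32_t flags;  /* e.g. bit 0 = "ready" */
};

/* PIO send: the CPU itself stores the payload into the device's BAR window
 * and rings a doorbell. No descriptor fetch, no DMA read from host memory,
 * which is why this path wins for tiny, latency-critical messages. */
static void send_pio(volatile uint8_t *bar0, const uint8_t *msg, uint32_t len)
{
    volatile uint8_t *tx_window = bar0 + 0x1000;      /* invented offset   */
    for (uint32_t i = 0; i < len; i++)
        tx_window[i] = msg[i];                        /* CPU stores (PIO)  */
    *(volatile uint32_t *)(bar0 + 0x0ff0) = len;      /* invented doorbell */
}

/* DMA send: the CPU only fills in a descriptor; the NIC later reads the
 * descriptor and then the payload from host memory by itself. Great for
 * bulk throughput, but every extra transition adds latency. */
static void send_dma(volatile uint8_t *bar0, struct tx_desc *ring,
                     uint32_t slot, uint64_t payload_dma_addr, uint32_t len)
{
    ring[slot].addr  = payload_dma_addr;
    ring[slot].len   = len;
    ring[slot].flags = 1u;                            /* mark "ready"      */
    *(volatile uint32_t *)(bar0 + 0x0ff4) = slot;     /* invented doorbell */
}
```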

[Figure: the design space for network I/O. Courtesy Brüning, Giloi [4]]

For years we saw a trend towards commercial off-the-shelf (COTS) components for supercomputers, but recently special-purpose, vendor-locked networks like the Tofu [6] interconnect are needed to reach maximum scale for exaflop systems.

This also means that for traditional networking with commodity protocols like Ethernet, the I/O bus is the path of choice, and we rely on the major vendors to keep increasing its performance: ISA/VESA/PCI/PCI-X/PCI Express… come to mind.

Network I/O Evolution

10 and 100 Mbit/s devices were the standard for a long time. But it was the trend towards commercial off-the-shelf (COTS) hardware for HPC that drove the need for better network environments. Improvement was needed in two areas:

  1. lowering latency (typically measured by timing a round-trip operation and dividing the elapsed time by two; see the sketch after this list)
  2. increasing bandwidth (typically the amount of data exchanged per unit of time)
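
Latency is usually measured with a ping-pong microbenchmark. Below is a minimal sketch of the idea in C; a local socketpair and a forked echo process stand in for a remote peer, the iteration count is arbitrary, and error handling is omitted for brevity. Bandwidth is measured the other way around: push a large volume of data and divide the byte count by the elapsed time.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Time many small ping-pongs and report half the average round trip. */
static double half_rtt_usec(int fd, int iterations)
{
    char buf[8] = {0};
    double start = now_sec();
    for (int i = 0; i < iterations; i++) {
        write(fd, buf, sizeof buf);   /* ping */
        read(fd, buf, sizeof buf);    /* pong */
    }
    return (now_sec() - start) / iterations / 2.0 * 1e6;
}

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   /* stand-in for a network link */

    if (fork() == 0) {                         /* child: echo everything back */
        char buf[8];
        ssize_t n;
        while ((n = read(sv[1], buf, sizeof buf)) > 0)
            write(sv[1], buf, n);
        _exit(0);
    }

    /* Over a real NIC the two ends live on different hosts; the local pair
     * only demonstrates the measurement method itself. */
    printf("half round-trip: %.2f us\n", half_rtt_usec(sv[0], 100000));
    close(sv[0]);
    return 0;
}
```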

There were a handful of vendors (Quadrics [QsNet], Dolphinics [Scalable Coherent Interface], Myricom [Myrinet] and hundreds of startups [Atoll, Dimmnet, …]) competing to meet this demand [3]. During the late nineties Myricom introduced Myrinet 2000 [2], which would eventually be used in more than 50% of the systems listed in the Top500, a respected list of the fastest supercomputers in the world.

In terms of bandwidth, Myrinet 2000 allowed for 2000 Mbit/s per port in each direction, hence 2+2 Gbit/s (or 500 MByte/s) going through a PCI-X (1.0) [1] bus (which maxed out at 533 MB/s in its 64-bit, 66 MHz implementation). At Myricom we kept a list of motherboards that could actually achieve that performance, as not every chipset vendor was able to deliver this throughput, which kept the NIC from reaching its full potential.
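
As a quick sanity check on those numbers, here is the back-of-the-envelope arithmetic as a tiny C program (using only the figures quoted above):

```c
#include <stdio.h>

int main(void)
{
    /* PCI-X 1.0, 64-bit @ 66 MHz: 8 bytes per transfer, ~66.6 million transfers/s. */
    double pcix_peak_mb_s = 8 * 66.6;             /* ~533 MB/s bus peak         */

    /* Myrinet 2000: 2 Gbit/s per direction, full duplex, i.e. 2+2 Gbit/s.       */
    double myri_mb_s = (2000.0 + 2000.0) / 8.0;   /* ~500 MB/s through the host  */

    printf("PCI-X 64/66 peak : %.0f MB/s\n", pcix_peak_mb_s);
    printf("Myrinet 2000 NIC : %.0f MB/s\n", myri_mb_s);
    printf("headroom         : %.0f MB/s\n", pcix_peak_mb_s - myri_mb_s);
    return 0;
}
```

With only about 30 MB/s of headroom, any inefficiency in a chipset's PCI-X implementation immediately showed up as a NIC stuck below line rate, which is exactly why that motherboard list existed.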

That’s that for PCI-X. It was superseded by PCI Express. (Yes, there was a PCI-X 2.0 specification and a handful of implementations, but it never gained any traction.)

Myrinet 2000 was a good fit at that time. But HPC supercomputers needed even more bandwidth, and the community established InfiniBand as a standard (initially offering Single Data Rate (SDR): 10 Gbit/s signaling, i.e. 8 Gbit/s data rate after 8b/10b encoding). On Myricom’s roadmap was a new NIC targeting 10 Gbit/s data rates and allowing both the Myrinet and 10GigE protocols to run on the same device (using 10-Gigabit Ethernet PHYs as the common physical layer). InfiniBand’s roadmap included not only DDR but also QDR, EDR and many more *DRs to come. And with many more DRs available today, there is a need for much more bandwidth from network I/O (leaving GPUs aside on this topic).

Press fast forward… PCIe Gen5

PCI-E / PCIe comes with the concept of lanes, each lane offering a certain number of so-called gigatransfers per second (GT/s). The GT/s figure includes an en-/decoding overhead: starting with Gen3 a 128b/130b line-coding scheme is used, whose overhead (~1.5%) can mostly be neglected, but earlier generations used an 8b/10b encoding scheme that came with a 20% penalty on data rates.

So far, the effective per-lane bandwidth has roughly doubled with each PCIe generation. As an example, let’s compute the available bandwidth of an x8 PCIe Gen3 slot: 8 GT/s * 8 lanes * 128/130 ≈ 63 Gbit/s. What we see here is that this won’t suit a NIC targeting 100 Gbit/s data rates (e.g. 100GbE). You would either need to switch to the next PCIe generation (Gen4: an x8 slot at 16 GT/s gives ~126 Gbit/s) or add more lanes, e.g. use an x16 PCIe slot in combination with an x16 NIC (16 lanes * 8 GT/s).
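
Here is the same formula as a tiny calculator, a sketch assuming the per-lane transfer rates and line codes mentioned above (8b/10b up to Gen2, 128b/130b from Gen3 on):

```c
#include <stdio.h>

/* Effective slot bandwidth in Gbit/s: GT/s per lane * lanes * line-code efficiency. */
static double slot_gbit_s(int gen, double gt_per_lane, int lanes)
{
    double efficiency = (gen >= 3) ? 128.0 / 130.0   /* ~1.5% overhead */
                                   :   8.0 /  10.0;  /* 20% overhead   */
    return gt_per_lane * lanes * efficiency;
}

int main(void)
{
    printf("Gen3 x8  : %6.1f Gbit/s\n", slot_gbit_s(3,  8.0,  8));  /* ~63  */
    printf("Gen3 x16 : %6.1f Gbit/s\n", slot_gbit_s(3,  8.0, 16));  /* ~126 */
    printf("Gen4 x8  : %6.1f Gbit/s\n", slot_gbit_s(4, 16.0,  8));  /* ~126 */
    printf("Gen5 x16 : %6.1f Gbit/s\n", slot_gbit_s(5, 32.0, 16));  /* ~504 */
    return 0;
}
```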

Throwing down the gauntlet to PCIe Gen5: 400GbE

While PCI Express Gen3 got good traction over the last 8 years, the industry basically skipped PCIe Gen4 (16 GT/s), and only recently has PCIe Gen5 (32 GT/s) become an option, with the latest Intel CPUs (Alder Lake) and AMD’s upcoming Raphael parts. At this point we can look out for a system that is well suited for future networking trends, like 400GbE.

Let’s see if PCIe Gen5 can support 400GbE in a single-port configuration: 16 lanes * 32 GT/s * 128/130 ≈ 504 Gbit/s.

All good. PCIe Gen5 can certainly deliver 400 Gbit/s, and you can enable 100GbE on the 2nd port, too :D
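
A quick headroom check for that dual-port scenario (a sketch using only the raw line-rate numbers; PCIe protocol overhead such as TLP headers and flow-control traffic takes another slice of what is left):

```c
#include <stdio.h>

int main(void)
{
    double gen5_x16 = 32.0 * 16 * 128.0 / 130.0;  /* ~504 Gbit/s effective      */
    double load     = 400.0 + 100.0;              /* 400GbE + 100GbE, line rate */

    printf("slot     : %.1f Gbit/s\n", gen5_x16);
    printf("load     : %.1f Gbit/s\n", load);
    printf("headroom : %.1f Gbit/s\n", gen5_x16 - load);
    return 0;
}
```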

Oh, one more thing: if you are about to RX 50 GBytes (400 Gbit/s) of traffic per second, is your memory ready to consume it, both capacity-wise and performance-wise (DDR5 scales to 8.4 Gbps per pin)? tbc…
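
A rough sketch of that question, assuming the 8.4 Gbps per-pin figure quoted above on a standard 64-bit DDR5 channel, and assuming the received data is written to memory by the NIC and then read back once by the consuming application (techniques that land incoming data directly in the CPU cache change this picture):

```c
#include <stdio.h>

int main(void)
{
    double ddr5_channel_gb_s = 8.4 * 64 / 8;   /* one DDR5-8400 channel: ~67 GB/s */
    double rx_gb_s           = 400.0 / 8;      /* 400 Gbit/s ingest = 50 GB/s     */
    double touched_gb_s      = 2 * rx_gb_s;    /* NIC writes it, the app reads it */

    printf("DDR5-8400 channel    : %.1f GB/s\n", ddr5_channel_gb_s);
    printf("memory traffic at RX : %.1f GB/s\n", touched_gb_s);
    return 0;
}
```

By that simple model a single channel is already oversubscribed, so multiple memory channels and careful data placement are needed before the NIC, rather than memory, becomes the limit again.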

More on PCIe Gen5 …

What we should also mention is that PCIe Gen5 allows other protocols to run over its physical layer. After quite a few consortiums and specifications like Gen-Z and CCIX, Compute Express Link (CXL) finally gained enough momentum to take off. This hints at PCIe Gen5 being driven in many directions, including cache coherency [7]. Exciting times ahead.

References

[1] PCI-X
https://en.wikipedia.org/wiki/PCI-X

[2] Myrinet
https://en.wikipedia.org/wiki/Myrinet

[3] BOF on High Speed Interconnects at SC99
http://webserver.ziti.uni-heidelberg.de/atoll/bofsc99.html

[4] Ulrich Brüning, Wolfgang K. Giloi, “Future Building Blocks for Parallel Architectures”, keynote talk at the 2004 International Conference on Parallel Processing (ICPP’04), Montreal, Canada, 2004
http://ieeexplore.ieee.org/document/1327943/

[5] DIMMnet-2, Evaluation of Network Interface Controller on DIMMnet-2 Prototype Board
https://ieeexplore.ieee.org/document/1579028

[6] The Tofu Interconnect D - Fujitsu
https://www.fujitsu.com/hk/imagesgig5/08514929.pdf

[7] CXL
https://www.computeexpresslink.org/post/introduction-to-compute-express-link-cxl-the-cpu-to-device-interconnect-breakthrough


NEIO Systems, Ltd.

http://fastsockets.com || low latency, networking experts, 10GbE++, FPGA trading, Linux and Windows internals gurus