libfsock: A versatile API with a focus on ultra low latency and efficient data handling
Technologies paving the way for low latency
When the use of COTS (commercial off-the-shelf) components became popular for HPC (High Performance Computing), a race to zero, or at least best-in-class, latency began, so that parallel applications could scale out as far as possible. For decades, special purpose machines ("parallel computers", i.e. supercomputers) led the list of the fastest computers in the world (the Top500). They were unique in many ways and came with their own operating systems and programming languages.
We first saw a common API become a de-facto standard when a project called PVM (Parallel Virtual Machine) provided an interface that let you run your source code on different machines; just rebuild it. Even better, it allowed a collection of resources to be unified into one giant computer. Prior to this, you had to maintain several versions of your code, each tailored to your own exotic system. Those machines offered both very fast compute nodes and very fast interconnects.
The popularity of COTS was driven by Intel steadily increasing compute power, so clusters used the Pentium II/III and later as their per-node CPUs. Comparing their technical specifications with today's mainstream processors, CPU technology has improved a lot. Heck, we are talking about MHz-range frequencies and fabrication processes of 250nm (Katmai) / 130nm (Tualatin). It was the single-core era, and the P6 architecture ended with a whopping 1.4GHz (Pentium III-S) :) to be followed by the new NetBurst concept. While NetBurst's roadmap promised more than 10GHz, it was a non-starter: its first CPU, the 180nm Willamette at 1.5GHz, performed worse than the latest Pentium III edition, and the line eventually maxed out at 3.8GHz. Still, that compute performance was better than that of traditional supercomputers.
On the other hand, the communication performance of Transputers, Thinking Machines' CM-5, Cray's T3D and Intel's Paragon et al. (take a look at the Top500 list from November 1996 for some time travel) was challenged when the research around U-Net, which introduced kernel bypass, was made available to the commercial market and became mainstream. Using PCI-X, followed by PCIe, networks from Quadrics (QsNet), Dolphinics (Scalable Coherent Interface, SCI) and Myricom (Myrinet) competed for market share. At its peak, Myrinet from Myricom held a share of more than 50% of the Top500 list. In 2020, PCIe Gen4 can bring around 400Gbit/s, roughly 50 GBytes every single second, into your machine. I/O performance has increased dramatically while CPU performance has stagnated.
Today’s Standard Communication APIs
What MPI (Message Passing Interface) is for HPC, the socket API is for general purpose communication between computers over Ethernet. To this day Ethernet can be unreliable, and it is paired with higher-level protocols such as UDP or, for reliable communication, TCP. Traditionally, however, all packets traverse the operating system, and applications issue a system call for every send and receive. In an era of massive ingress at line rate (even at 10Gbit/s), the OS has a hard time keeping up with the incoming data and introduces a high system load just delivering it to the application. Several efforts were made to lift the TCP/IP stack into user space so that applications only needed to be run with a prefix: the dynamic linker would then pick up the user-level stack whenever socket functions are called [4,5]. That was a nice approach, as it allowed shaving off microseconds, and it was needed when legacy applications could not be modified. In the meantime the Linux network kernel has gained enhancements of its own, including a poll mode option. Today, this interposition approach is no longer a viable solution when your application is racing to catch nanoseconds to win a trade (finance/trading being the major use case for user-level TCP stacks). Instead of dealing with accelerated vs. non-accelerated sockets and the endless, tremendous effort of keeping your solution ABI compatible, why not expose an API that can be plugged in as a communication layer? Adopting an API with a focus on performance does not involve much additional effort, as minor modifications are made for a successful launch anyway. And with an API, the benefits are multiple:
Stability, stability, stability, as this matters most. ABI compatibility is an endless effort when all features must be matched.
Best possible performance.
Easy access to hardware features like RX and TX hardware timestamps (the socket API was extended to query those timestamps in a non-obtrusive way)
Additional access to and integration of special NIC features, e.g. splitting header and payload on RX, or pre-staging packets on the NIC for TX so that the latency penalty of crossing the PCIe bus can be overcome
The FastSockets API || LIBFSOCK
When using FastSockets, a developer gets access to a lean and mean API focused on providing a drop-free solution on ingress, capturing every incoming packet, while giving the fastest possible transmit as well. Another benefit is easy access to timestamps, which can be quite cumbersome with the socket API; in particular, on the TX side, looping a packet back through the stack is likely to introduce additional jitter.
With FastSockets, any incoming packet carries a hardware timestamp attached via the MAC/PHY, and the timestamp of an outgoing packet can be queried as well.
We’ll discuss how we use these timestamps when tuning for tick-to-trade, the number that matters most in the trading world: it describes the time elapsed between receiving a packet from a feed (UDP/multicast) and the TCP execution, i.e. when the order packet has left the NIC. To get the most accurate measurement we rely on hardware timestamping.
Let’s first start with an overview of the API. Similar to the socket API, libfsock uses the concept of endpoints. Each endpoint can be used for communication; connections are made between endpoints, and since the protocols are UDP/TCP, compatibility with a traditional socket endpoint is given. In libfsock an endpoint is further specified by a protocol and a port. Through a handshake with the OS, this port is secured and cannot conflict with the traditional TCP/IP stack. For RX operation an exclusive ring buffer is allocated and registered with the Network Interface Card via a filter. The NIC then dispatches packets matching the filter, and a read from the ring buffer is guaranteed to see no packets the endpoint is not interested in. Usually a single thread handles an incoming data stream, and with this design there is no contention when receiving those packets. The benefit is better RX processing than with a combined TCP(UDP)/IP stack, which must be able to dispatch arbitrary traffic into pre-allocated software buffers; there, locks lead to contention, and ingress delivery to the application can get delayed. Thus, instead of dispatching in software, we rely purely on the hardware for this, for some vendors at no extra cost.
For easy adoption, the semantics of libfsock are very similar to those of the socket API. With its focus on ultra low latency, it does not provide an asynchronous model such as the one known from Winsock2, which has a different focus, namely message throughput. Some of that concept originates from TCP offload stacks like Chimney, where receive buffers needed to be registered with the device to support high speed networks beyond 1Gbit/s. As a consequence, message retrieval in libfsock is CPU intensive: it typically polls on memory to detect incoming messages dispatched via the DMA engines of the hardware. The API allows for opening an endpoint, binding to a port, subscribing to ingress data and/or establishing connections, thereby creating libfsock channels. The associated endpoint can be seen as an entity holding several channels, similar to the epoll approach.
Without going into further detail, here are the libfsock API calls needed for a simple UDP receiver (unicast/multicast):
/* create an endpoint for a given IP and proto */
int fsock_open (struct ip_addr *ip, fsock_open_flags_t flags, fsock_endpoint_t *fsock_ep);

/* create a channel for a given port: unicast, potentially mcast traffic */
int fsock_bind (fsock_endpoint_t fsock_ep, fsock_bind_flags_t flags, int port, fsock_channel_t *fsock_ch);

/* set channel options, e.g. use the FSOCK_JOIN_MCAST flag to join a multicast group */
int fsock_setopt (fsock_channel_t fsock_ch, fsock_setopt_flags_t flags, void *opt, int optlen);

/* receive unicast/multicast */
int fsock_recvfrom (fsock_channel_t fsock_ch, fsock_recv_flags_t flags, char *buf, int buflen, fsock_recv_info_t *info);
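Put together, a receiver loop based on the declarations above might look as follows. This is a sketch only: flag values of 0 ("defaults"), the layout of struct ip_addr, the multicast group option and the helper handle_packet are our assumptions for illustration, not documented API.

```c
/* Sketch: subscribe to a multicast feed and poll for packets.
 * Flag value 0 and the mcast group option layout are assumptions. */
fsock_endpoint_t  ep;
fsock_channel_t   ch;
fsock_recv_info_t info;
char              buf[2048];

fsock_open(&ip, 0, &ep);                  /* ip: local interface address  */
fsock_bind(ep, FSOCK_JOIN_MCAST, 30001, &ch);
fsock_setopt(ch, 0, &mcast_group, sizeof(mcast_group)); /* group to join */

for (;;) {
    int n = fsock_recvfrom(ch, 0, buf, sizeof(buf), &info);
    if (n > 0)
        handle_packet(buf, n, &info);     /* info carries the RX HW stamp */
}
```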
Even on the same hardware we see a performance gain comparing libfsock to an approach that accelerates UDP/TCP via the socket interface by overriding the dynamic symbols; we believe that approach is not the best solution when chasing single-digit nanoseconds. Remember, in trading there is no second best. The winner takes it all.
In general, we can associate different behaviors with endpoints, channels, and even recv (send) operations via flags. In the example above, a channel is configured to also subscribe to multicast traffic.
The flags are quite powerful: we can adjust the holding endpoint but also steer the functional behavior of the recv call itself. This allows us to take best advantage of the underlying hardware. With recv, for example, we can fetch data into a buffer, but we can also just obtain pointers to a block, e.g. for video processing.
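As an illustration of such a flag-steered variant, a zero-copy read might look like this. Note that FSOCK_RECV_ZEROCOPY is a hypothetical name of ours for illustration; the actual flag and the way the pointer is reported may differ.

```c
/* Hypothetical zero-copy receive: instead of copying into a buffer, the
 * call leaves the payload in the RX ring buffer (flag name is ours). */
fsock_recv_info_t info;
int n = fsock_recvfrom(ch, FSOCK_RECV_ZEROCOPY, NULL, 0, &info);
/* the payload would then be accessed in place, via a pointer
 * reported through the recv info structure */
```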
LIBFSOCK Media Extensions for GigE Vision
Introducing libfsock as a GigE Vision SDK: what is nice about this versatile API is its set of so-called "recv" flags. In line with the libfsock design we introduced special receive flags that allow for the retrieval of special protocols, such as GigE Vision.
We’ll give details on those media extensions in a full article [see: https://latency-matters.medium.com/gige-vision-sdk-libfsock-draft-ee444323b08d]. In short, they allow zero-copy access to the image data by keeping the images in the ring buffer. If the data needs to be stored for much longer, a simple libfsock API call takes care of this.
Other use cases for low latency
A PoC also showed value for game servers, where a better response time was achieved; e.g. the dispatching (serialization delay) of UDP packets happens within nanoseconds.
We’ll present a series of performance charts showing the benefits of using libfsock. The depicted UDP serialization delay was measured at a Device Under Test (DuT), where the time between arriving packets is shown. This reveals how fast the TX portion of the sender actually is, as handing data over for sending does not indicate when it actually left the device. We see that libfsock brings immense performance enhancements, making it suitable e.g. for market data distribution.
Given the timestamp resolution of the NICs, the DuT setup even allows benchmarking similar models against each other. We see the newer model being more efficient on TX as follows:
Tick to Trade Benchmark
As mentioned above, tick-to-trade is a useful real-world benchmark in trading: given a market message (feed), how long does it take to generate a message to the exchange (order)? We present the following results, measured on a 5GHz server with 1 byte payloads. Slightly higher values are seen when more data needs to be copied.
tick-to-trade[UDP 1 TCP 1] 744ns mean=787ns, median=784ns, 99%=816ns
From those values we can derive that 744ns elapsed between the data arriving at the MAC/PHY (UDP) and the order leaving the NIC (MAC/PHY). This includes all RX processing in hardware and software (up to the application) and the entire send path (all TX software and hardware processing).
[1] Thorsten von Eicken et al., U-Net: A User-Level Network Interface for Parallel and Distributed Computing, http://www.cs.cornell.edu/Info/Projects/Spinglass/public_pdfs/U-Net%20A%20User.pdf
[2] Dongarra, Strohmaier et al., The Top500 List, http://www.top500.org
[3] Zhang et al., I’m Not Dead Yet! The Role of the Operating System in a Kernel-Bypass Era, https://www.microsoft.com/en-us/research/uploads/prod/2019/04/demikernel-hotos19.pdf
[4] Onload, http://www.openonload.org
[5] Emmerich et al., A Study of Network Stack Latency for Game Servers, https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/Network-Latency-Netgames-2014.pdf