When a latency figure is quoted, be careful about your assumptions on how it was measured. Usually it means half the round-trip time of a ping-pong test, which allows both timestamps to be taken from the same clock. The initiating host takes a timestamp t_start using its system clock and sends a message to another host, which receives it and sends it back. After receiving the reply, the initiating host takes a second timestamp t_end using the same system clock. The reported latency is (t_end - t_start) / 2.
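The ping-pong measurement can be sketched in a few lines of C. This is a minimal illustration, not a benchmark tool: the function names are made up for this example, the socket is assumed to be already connected to a peer that echoes every message, and CLOCK_MONOTONIC stands in for "the system clock".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Nanoseconds elapsed between two timestamps taken from the same clock. */
int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* One ping-pong on a connected socket whose peer echoes the message.
 * Returns half the round-trip time in ns, or -1 on error. */
int64_t pingpong_half_rtt_ns(int fd, void *buf, size_t len)
{
    struct timespec t_start, t_end;

    assert(buf != NULL && len > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &t_start) != 0)
        return -1;
    if (send(fd, buf, len, 0) != (ssize_t)len)
        return -1;
    if (recv(fd, buf, len, MSG_WAITALL) != (ssize_t)len)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t_end) != 0)
        return -1;
    return elapsed_ns(t_start, t_end) / 2;
}
```

In practice the exchange would be repeated many times per payload size and the samples summarized, since a single round trip is dominated by noise.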
This can be done for messaging layers such as MPI, but also for UDP or TCP, always for a given payload size. In trading / finance, how fast a UDP packet can be sent or a TCP packet can be received matters less. Instead, the focus is on reducing tick-to-trade latency (tick for the market data feed, trade for the order).
Hence, what matters most is the time between the arrival of a UDP packet (from a market data feed) and the issuing of a TCP order (to the exchange). How can you benchmark this? Software timestamps would not give the best insight into the system, e.g. taking the host system time, even with nanosecond resolution, when the application has received the packet in user space (how long was it delayed by the various layers?) and when it has issued the TCP send. In the worst case the TCP packet has only been handed to the sending entity, which might be the OS or a user-level TCP stack, and the call returns immediately while the send is still being processed internally. The result would be misleading, as it would not show how long it took for the packet to finally hit the wire.
Instead, the best approach is to use hardware timestamps on the NIC. The MAC adds a timestamp to ingress data (RX) and to egress data (TX); for the latter this is the time the packet was actually put on the wire. The difference between the two includes all RX/TX processing (hardware and software) and even the traversal of the packets across the I/O buses. For the best trading solution we want to minimize this value.
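How hardware timestamps are requested depends on the NIC and its software stack. As one generic illustration (not the vendor API used later in this article), Linux exposes them through the SO_TIMESTAMPING socket option. The sketch below only shows how the option is set; the two function names are made up for this example. It assumes a Linux host, and the NIC driver typically also has to have hardware timestamping switched on (for example via the SIOCSHWTSTAMP ioctl) before any timestamps are delivered.

```c
#include <assert.h>
#include <linux/net_tstamp.h>
#include <sys/socket.h>

/* Flags asking for raw NIC-clock timestamps on both RX and TX. */
int hw_timestamp_flags(void)
{
    return SOF_TIMESTAMPING_RX_HARDWARE
         | SOF_TIMESTAMPING_TX_HARDWARE
         | SOF_TIMESTAMPING_RAW_HARDWARE;
}

/* Request hardware timestamps on an existing socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_hw_timestamps(int fd)
{
    int flags = hw_timestamp_flags();

    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}
```

With this option set, RX timestamps arrive as SCM_TIMESTAMPING control messages on recvmsg(), and TX timestamps are read back from the socket error queue.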
Alternatively, if the client has no access to hardware timestamps on the system itself, a DuT (Device under Test) benchmark can be carried out. Here the measurement is taken from outside: a TX hardware timestamp is taken when the UDP packet leaves the NIC of the measuring device, and an RX hardware timestamp is taken when the TCP packet from the DuT is received.
Both approaches are valid for comparing and benchmarking solutions. Given that some NICs have a timestamp resolution of 4.2ns or better (the ExaNIC X10 HPT has a resolution of 250 picoseconds), hardware timestamps cannot be beaten.
Areas for Tick To Trade Acceleration
We can differentiate between a purely x86-driven approach and a much more complex approach, which we simply call offloaded tick to trade. With the x86 design, all tick-to-trade processing happens on the host. An application (passively) receives UDP market data and (actively) sends a TCP packet. One possible (software) optimization is to have all data ready for sending the TCP packet when needed. We might even pre-stage some data, such as the ETH / IP headers, on the NIC!
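The idea of having all data ready before the tick arrives can be sketched as an order template that is built once and only patched on the critical path. Everything about the message layout here is hypothetical (a 16-byte order with a 32-bit instrument id, price and quantity at fixed offsets); real exchange protocols differ, and the names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ORDER_LEN 16 /* hypothetical fixed order size in bytes */
#define PRICE_OFF 4  /* hypothetical offset of the 32-bit price */
#define QTY_OFF   8  /* hypothetical offset of the 32-bit quantity */

struct order_template {
    uint8_t bytes[ORDER_LEN];
};

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v >> 24);
    dst[1] = (uint8_t)(v >> 16);
    dst[2] = (uint8_t)(v >> 8);
    dst[3] = (uint8_t)v;
}

/* Done once, off the critical path: fill in everything that is static. */
void order_template_init(struct order_template *t, uint32_t instrument_id)
{
    assert(t != NULL);
    memset(t->bytes, 0, sizeof(t->bytes));
    put_be32(t->bytes, instrument_id);
}

/* Done on the critical path: only the variable fields are written. */
const uint8_t *order_template_fill(struct order_template *t,
                                   uint32_t price, uint32_t qty)
{
    put_be32(t->bytes + PRICE_OFF, price);
    put_be32(t->bytes + QTY_OFF, qty);
    return t->bytes;
}
```

The buffer returned by order_template_fill() would then be handed straight to the send call, so the work between tick and trade shrinks to two small stores.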
With the best techniques, the x86 design gives us a tick to trade of about 744ns.
This procedure involves crossing the PCIe bus twice, once for the receive and once for the send. Even with PCIe Gen3 this comes with a real performance penalty, on average about 200ns, which is a large portion of the number above. Getting your endpoints closer to the exchange means not using PCIe altogether and running your logic on the network card itself, if not on the switch!
SmartNICs have become popular, and one valid approach is to build a NIC from an FPGA and merge in the trading logic. The NIC then consumes the UDP packets, which is quite easy as UDP carries no connection state, and also serves as a TCP endpoint. This approach cannot be beaten in terms of performance, but it requires a good implementation and larger FPGAs to hold all the logic. Some references point out that less than 30ns can be achieved, as measured by a STAC benchmark, although with no logic actually processing those packets. Useful for real-world setups? So-so. What about a mix between the x86 and the purely FPGA-driven approach?
One hybrid approach is to run the book building on the FPGA, hand off the final data to the host and have the host issue the TCP send. Another is to perform all UDP processing on the host but pre-stage TCP packets on the NIC, so that a packet only needs to be triggered. The latter approach is not very practical, as some trading information typically needs to be updated before finalizing the send, which also requires recomputing the TCP checksum.
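To illustrate why a late field update forces checksum work, here is a sketch of the standard Internet checksum (RFC 1071) together with the incremental update from RFC 1624, which adjusts an existing checksum when a single 16-bit word changes instead of summing the whole packet again. This is a generic technique shown for illustration; whether a given NIC or stack does it this way is not stated here, and the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold the carries of a 32-bit accumulator into 16 bits. */
static uint32_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return sum;
}

/* RFC 1071: one's-complement sum over big-endian 16-bit words. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1) /* odd trailing byte is padded with zero */
        sum += (uint32_t)data[len - 1] << 8;
    return (uint16_t)~fold16(sum);
}

/* RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m') for one changed 16-bit word. */
uint16_t checksum_update16(uint16_t old_csum, uint16_t old_word,
                           uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_csum;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold16(sum);
}
```

The incremental form touches only three 16-bit values, regardless of packet size, which is why it is attractive when a pre-built packet needs a last-moment change.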
Will newer PCIe generations lower the cost of traversing the I/O bus? PCIe Gen5 will allow other protocols like CXL to run over the same physical link. This could bring the purely x86-driven approach back into the game.
With very fast servers, the x86 approach and an optimized TCP send, we are happy to share the following results obtained with the FastSockets API libfsock. All error processing has been removed from the code.
The minimum tick-to-trade time we see with a purely host-driven solution, for a 1-byte UDP message and a 1-byte TCP message, is 744ns.
/* rxinfo holding additional meta information such as rx (HW) timestamp and src ip, etc */
rc = fsock_recvfrom(fsock_ep, FSOCK_RECV_DEFAULT, buf, sizeof(buf), &rxinfo);
/* txinfo holding additional meta information such as tx (HW) timestamp */
rc = fsock_ext_send(tcp_channel, FSOCK_SEND_DEFAULT, buf, tcp_size, &txinfo);
/* compute tick to trade */
ttimes[i] = txinfo.timestamp - rxinfo.timestamp;
tick-to-trade[UDP 1 TCP 1] 744ns mean=787ns, median=784ns, 99%=816ns
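A summary line like the one above can be derived from the collected samples with a small helper. The sketch below is an assumption about how such statistics might be computed, not the code that produced the published numbers: it assumes the samples are 64-bit nanosecond values, uses the nearest-rank definition for the 99th percentile, and the names lat_stats and summarize are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct lat_stats {
    uint64_t min;
    uint64_t median;
    uint64_t p99;
    double mean;
};

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Summarize n latency samples (ns). Sorts the array in place. */
struct lat_stats summarize(uint64_t *samples, size_t n)
{
    struct lat_stats s;
    double sum = 0.0;
    size_t rank99;

    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];

    rank99 = (99 * n + 99) / 100; /* nearest rank: ceil(0.99 * n) */
    s.min = samples[0];
    s.median = (n % 2) ? samples[n / 2]
                       : (samples[n / 2 - 1] + samples[n / 2]) / 2;
    s.p99 = samples[rank99 - 1];
    s.mean = sum / (double)n;
    return s;
}
```

Applied to an array such as ttimes from the listing, summarize(ttimes, n) would return the minimum, mean, median and 99th percentile in one pass after sorting.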