InfiniBand is a widely used high-speed interconnect technology in HPC and AI-driven data centers. Known for its ultra-low latency, high bandwidth, and scalability, InfiniBand is a backbone for applications like weather modeling, genomics, AI training and inference, and financial analysis. It is also used to build some of the world’s most powerful supercomputers and high-performance data centers. These environments demand exceptional computational power, and they need equally fast and reliable communication to support data-heavy workloads.
Forward Error Correction (FEC) helps ensure reliable, high-speed data transfers in InfiniBand networks. By detecting and correcting transmission errors without retransmitting data, FEC improves network stability and efficiency and reduces data errors caused by noise, signal interference, and link quality issues. This makes it an indispensable feature in environments where high speed and data integrity are paramount.
How FEC Enhances InfiniBand Network Performance
1. Ensuring Data Integrity in High-Bandwidth Transfers
InfiniBand networks are designed for high-throughput environments like supercomputing clusters and large-scale data centers, where maintaining data integrity at low latency and high bandwidths is critical. FEC addresses this by automatically correcting transmission errors, reducing the risk of packet loss caused by high bit error rates (BER) as speeds increase to 100 Gbps, 200 Gbps, or higher.
2. Supporting Long-Distance Communication
When data travels over long-distance optical fiber links in InfiniBand networks, it’s more prone to signal attenuation, interference, and noise, which can lead to transmission errors. FEC is particularly valuable in this scenario as it can automatically detect and correct the most common bit errors, ensuring data integrity across extended distances. This is especially important in cross-data-center interconnects, where long optical links are used to connect geographically dispersed facilities.
In such cases, FEC helps maintain the robustness of communication links, ensuring uninterrupted data flow between distant nodes. This is essential for large-scale AI clusters and HPC infrastructures, which rely on fast and reliable data exchanges to optimize performance and productivity in geographically distributed environments.
3. Minimizing Latency in Time-Sensitive & HFT Applications
Although FEC introduces a small amount of processing overhead, it remains a low-latency error correction technique, especially when compared to the alternative of retransmitting corrupted packets. The delay introduced by FEC is minimal, making it ideal for latency-sensitive applications such as high-frequency trading (HFT) and real-time computing. In these time-critical environments, FEC delivers the necessary reliability and error correction without significantly affecting the low-latency requirements of the network.
This ability to maintain both network reliability and low latency is crucial for AI clusters and real-time analytics, where even minor delays can impact performance. By ensuring that data packets are transmitted without errors or the need for retransmission, FEC optimizes network efficiency for latency-sensitive tasks, keeping critical operations running smoothly.
4. Improving Network Efficiency by Reducing Retransmissions
One of the key benefits of FEC is that it automatically corrects transmission errors on the fly, eliminating the need to resend entire data packets. This reduces the retransmission demands that are typically caused by erroneous packets, which in turn boosts overall network efficiency.
In high-traffic environments, especially where network load is heavy, FEC plays a crucial role in reducing congestion caused by packet retransmissions. This enables the network to operate more smoothly under demanding conditions, maintaining high performance even as data volumes grow.
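To make the retransmission argument concrete, the sketch below estimates how often a packet would be corrupted, and therefore need to be resent, if no FEC were applied. It assumes independent random bit errors and a 4096-byte packet; both are illustrative assumptions rather than figures from any InfiniBand specification.

```python
import math


def packet_error_probability(ber: float, packet_bytes: int) -> float:
    """Probability that a packet contains at least one bit error.

    Assumes independent bit errors (an idealization; real links also
    see bursts). Computes 1 - (1 - ber)^bits with log1p/expm1 so the
    result stays accurate for very small BER values.
    """
    bits = packet_bytes * 8
    return -math.expm1(bits * math.log1p(-ber))


if __name__ == "__main__":
    # 4096 bytes is an assumed packet size chosen for illustration.
    for ber in (1e-8, 1e-10, 1e-12):
        p = packet_error_probability(ber, packet_bytes=4096)
        print(f"BER {ber:.0e}: fraction of corrupted packets = {p:.3e}")
```

Under these assumptions, a raw BER of 1E-8 corrupts roughly one in every 3,000 packets, each of which would otherwise trigger a retransmission. Correcting those errors at the receiver is what removes that extra traffic.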
5. Ensuring Data Integrity Through Packet Verification and Repair
By incorporating redundant information into the transmitted data, FEC allows the receiver to verify the integrity of what it receives. If errors are detected, the receiver can use the redundant data to repair the corrupted sections without waiting for the sender to retransmit the entire packet. This mechanism greatly improves transmission reliability, especially on links where the raw BER rises because data is sent at very high signaling rates, as in AI clusters or supercomputing networks.
This automatic packet verification and repair ensures that even in challenging network conditions, data integrity is preserved, minimizing disruptions to performance. For environments that require continuous high-frequency data transmission, such as real-time analytics or AI model training, FEC is an indispensable tool for maintaining high data quality and system reliability.
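The verify-and-repair idea can be illustrated with a much smaller code than the ones InfiniBand uses. The sketch below is a toy Hamming(7,4) code, chosen only because it fits in a few lines: it adds 3 parity bits to 4 data bits and lets the receiver locate and flip back any single corrupted bit. It is not the FEC used on InfiniBand links, and the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Repair at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # A non-zero syndrome is the 1-based position of the corrupted bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    sent = hamming74_encode(data)
    received = list(sent)
    received[4] ^= 1  # simulate a single bit flipped on the link
    print("recovered without retransmission:", hamming74_decode(received) == data)
```

The receiver never contacts the sender: the parity bits alone are enough to find and undo the error, which is the same principle that larger block codes apply to many symbols at once.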
Two Common FEC Schemes in InfiniBand Networks
InfiniBand networks employ different FEC encoding schemes depending on the link speed and specific application requirements. Two commonly used FEC schemes are (544, 514) FEC and (272, 257) FEC, both designed to ensure reliable, error-free data transmission by adding redundancy, allowing errors to be detected and corrected.
Understanding the (n, k) FEC Format
The notation (n, k) is commonly used to describe FEC schemes:
- n: The total number of symbols in each encoded block (codeword), including both the original data symbols and the redundant (error-correcting) symbols.
- k: The number of original data symbols before encoding.
This format describes how k symbols of information are encoded into n symbols, with the difference n - k being the redundant symbols used for error correction. In the Reed-Solomon codes used for these schemes, a symbol is typically a 10-bit group rather than a single bit.
Example:
- (544, 514):
  - Total symbols (n) = 544
  - Information symbols (k) = 514
  - The difference (544 - 514 = 30) represents 30 redundant symbols used for error correction.
- (272, 257):
  - Total symbols (n) = 272
  - Information symbols (k) = 257
  - The difference (272 - 257 = 15) represents 15 redundant symbols for error correction.
At first glance, both FEC schemes appear similar in terms of efficiency, calculated as the ratio of information bits to total bits:
(544, 514) FEC efficiency = 514 / 544 ≈ 0.945
(272, 257) FEC efficiency = 257 / 272 ≈ 0.945
Thus, the encoding efficiency of both schemes is approximately 94.5%. However, despite similar efficiency, these schemes differ significantly in performance, particularly in terms of data block size and error correction capability.
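The arithmetic above can be reproduced with a short script. The sketch below assumes both schemes are Reed-Solomon codes, for which up to floor((n - k) / 2) corrupted symbols per block can be corrected; the class and property names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FecCode:
    """An (n, k) block code described by its total and data symbol counts."""

    n: int  # total symbols per codeword
    k: int  # information symbols per codeword

    @property
    def parity(self) -> int:
        """Redundant symbols added per codeword."""
        return self.n - self.k

    @property
    def rate(self) -> float:
        """Encoding efficiency: information symbols / total symbols."""
        return self.k / self.n

    @property
    def correctable(self) -> int:
        """Correctable symbol errors per codeword, assuming a Reed-Solomon code."""
        return self.parity // 2


if __name__ == "__main__":
    for code in (FecCode(544, 514), FecCode(272, 257)):
        print(
            f"({code.n}, {code.k}): parity={code.parity}, "
            f"rate={code.rate:.3f}, correctable={code.correctable}"
        )
```

Both codes report a rate of 0.945, but the larger one has 30 redundant symbols and can correct up to 15 symbol errors per block, while the smaller one has 15 redundant symbols and can correct up to 7. The efficiency is the same; the protection per block is not.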
Key Differences Between the Two FEC Schemes
Data Block Size
(544, 514) works on larger data blocks, so each codeword carries more data and more redundancy than a (272, 257) codeword. Because the receiver must collect a full block before it can finish decoding, the larger block adds processing complexity and latency, a cost that is usually acceptable on high-bandwidth links that need stronger protection.
(272, 257), with blocks half the size, is better for scenarios where low latency is a priority, such as in short-distance communication.
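A rough way to see the latency effect of block size is to compute how long it takes for one full codeword to arrive at the receiver. The sketch below assumes 10-bit symbols and a single 400 Gb/s stream; it ignores lane distribution, codeword interleaving, and decoder pipeline delay, so the result is only a lower-bound illustration, not a measured FEC latency.

```python
SYMBOL_BITS = 10  # assumption: 10-bit Reed-Solomon symbols


def codeword_accumulation_ns(n_symbols: int, line_rate_gbps: float,
                             symbol_bits: int = SYMBOL_BITS) -> float:
    """Time in nanoseconds to receive one whole codeword.

    bits / (Gbit/s) yields nanoseconds directly, since 1 Gbit/s
    equals 1 bit per nanosecond.
    """
    codeword_bits = n_symbols * symbol_bits
    return codeword_bits / line_rate_gbps


if __name__ == "__main__":
    for n in (544, 272):
        t = codeword_accumulation_ns(n, line_rate_gbps=400.0)
        print(f"n={n}: about {t:.1f} ns to accumulate one codeword")
```

Under these assumptions the 544-symbol block takes about 13.6 ns to arrive and the 272-symbol block about 6.8 ns, which shows why halving the block size is attractive when every nanosecond counts.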
Error Correction Capability
(544, 514), with 30 redundant symbols (544 - 514), offers stronger error correction, making it more effective on noisy or long-distance links.
(272, 257), with 15 redundant symbols (272 - 257), provides adequate correction in low-noise environments. It strikes a balance between performance and error correction where signal quality is stable, typically in local or short-range networks.
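The difference in correction strength can be quantified with a simple model. The sketch below estimates the probability that a codeword contains more symbol errors than the code can correct, assuming Reed-Solomon codes with 10-bit symbols, correction limits of 15 and 7 symbols respectively, and independent random bit errors. Real links also see burst errors, so these numbers are illustrative only.

```python
import math


def codeword_failure_probability(n: int, t: int, ber: float,
                                 symbol_bits: int = 10) -> float:
    """Probability that more than t of n symbols are corrupted.

    Assumes independent bit errors, so symbol errors follow a binomial
    distribution. The tail is summed directly (not as 1 - P(ok)) to
    avoid losing very small probabilities to floating-point rounding.
    """
    # A symbol is wrong if any of its bits is wrong.
    p_sym = -math.expm1(symbol_bits * math.log1p(-ber))
    return sum(
        math.comb(n, i) * p_sym**i * (1.0 - p_sym) ** (n - i)
        for i in range(t + 1, n + 1)
    )


if __name__ == "__main__":
    ber = 1e-4  # assumed raw BER, deliberately high for illustration
    print("(544, 514), t=15:", codeword_failure_probability(544, 15, ber))
    print("(272, 257), t=7: ", codeword_failure_probability(272, 7, ber))
```

At the same raw BER, the (544, 514) code leaves uncorrectable blocks several orders of magnitude less often than the (272, 257) code in this model, which is the trade-off the smaller code accepts in exchange for lower latency.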
In summary, (544, 514) FEC, with its higher redundancy and stronger error correction capability, is well-suited for scenarios where data integrity is critical and the transmission channel quality is poor, such as in satellite communications or long-distance optical fiber links. (272, 257) FEC is more appropriate for environments that require low latency and where the transmission conditions are relatively stable, such as in local area networks (LANs) or short-distance fiber connections.
Real-World FEC Testing with 400G OSFP Transceivers
When designing lossless, low-latency communication systems in InfiniBand networks, the performance of optical transceivers plays a crucial role. One of the key indicators of a transceiver's quality is its BER—both pre-FEC and post-FEC. Maintaining a low BER is essential for ensuring high network reliability, particularly in HPC and AI-driven environments, where data integrity and speed are paramount.
While many transceivers claim to offer superior FEC performance, real-world results can vary significantly. NADDOD Lab's testing of 400G transceivers under the (272, 257) FEC encoding scheme revealed that not all modules perform optimally in practical applications. This makes selecting the right transceiver even more critical when aiming to maintain a lossless, high-performance network.
In a real-world test, two OSFP 400G SR4 optical transceivers were compared for BER performance using the (272, 257) FEC scheme: Transceiver A (Broadcom VCSEL and DSP) from NADDOD, and Transceiver B from another brand. Both transceivers were tested on a ConnectX-7 VPI network card (MCX75310AAS-NEAT) with firmware version 28.41.1000.
The results were telling:
- Transceiver A (NADDOD): Pre-FEC BER of 1E-10; post-FEC BER effectively zero during the testing period.
- Transceiver B (other brand): Pre-FEC BER of 1E-8; post-FEC BER of 2E-11, an improvement but still with residual errors.
As the comparison shows, despite using the same (272, 257) FEC encoding scheme, the performance of the two transceivers differed significantly. Transceiver A was able to achieve zero errors post-FEC, while Transceiver B still exhibited residual errors after correction. This demonstrates that real-world performance can vary widely even when the same FEC is applied.
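To put these BER figures in perspective, they can be converted into expected bit errors per second. The sketch below assumes a nominal line rate of 400 Gb/s; the actual payload rate of a 400G link differs slightly, so treat the output as an order-of-magnitude estimate.

```python
def bit_errors_per_second(ber: float, line_rate_gbps: float = 400.0) -> float:
    """Expected number of erroneous bits per second at a given BER."""
    return ber * line_rate_gbps * 1e9


if __name__ == "__main__":
    # BER values taken from the test results reported above.
    results = {
        "Transceiver A pre-FEC": 1e-10,
        "Transceiver B pre-FEC": 1e-8,
        "Transceiver B post-FEC": 2e-11,
    }
    for label, ber in results.items():
        print(f"{label}: about {bit_errors_per_second(ber):,.0f} bit errors/s")
```

At this nominal rate, Transceiver A's pre-FEC BER corresponds to about 40 raw bit errors per second and Transceiver B's to about 4,000. Transceiver B's post-FEC BER of 2E-11 still implies about 8 uncorrected bit errors every second, which is why a residual post-FEC BER matters in a network intended to be lossless.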
When selecting optical transceivers for InfiniBand networks aimed at lossless, high-speed communication, it is therefore crucial to choose modules whose raw signal quality is good enough for (272, 257) FEC to correct every error. Verifying both pre-FEC and post-FEC BER is key to building a reliable, lossless network in high-bandwidth, low-latency scenarios.
Choosing the Right Transceiver for High-Speed, Lossless Networks
Selecting the right optical transceiver is crucial for maintaining a lossless, high-speed InfiniBand network, whether the link operates under RS-FEC (544, 514) or LL-FEC (272, 257). NADDOD’s transceivers are optimized for these FEC schemes, ensuring reliable and stable performance even in demanding environments like large-scale AI clusters and high-frequency trading.
In addition to high-performance transceivers, NADDOD offers a full suite of optical interconnect solutions, including active optical cables (AOCs) and direct attach copper cables (DACs). These solutions are designed to ensure stable and reliable operations for AI-driven applications, high-frequency trading, and large AI clusters, where speed, reliability, and low latency are mission-critical.
By leveraging NADDOD’s advanced optical technologies, your network can meet the increasing demands of high-performance computing and AI workloads, while maintaining the highest levels of data integrity and efficiency.
![NADDOD](https://resource.naddod.com/images/blog/2024-07-25/blog-goods-010017.webp)
![NADDOD](https://resource.naddod.com/images/products/2024-01-08/1-009111.webp)