How to reduce the failure rate of high-speed optical modules in data centers

Date: 2020-03-17 02:36:25 From: C-light

5G, big data, artificial intelligence, and other technologies place higher demands on data processing and network bandwidth, and data centers must keep increasing bandwidth to meet them. The need is urgent, especially for Internet data centers. The most direct way to increase bandwidth is to raise the speed of a single port, from 40G to 100G, from 100G to 200G, or even higher, which raises the bandwidth of the entire data center. Some experts have predicted that most 400GbE deployments will begin in 2019, with 400GbE switches introduced as spine or core switches in very large-scale data centers and as spine or backbone switches in private and public cloud data centers. Bear in mind that 100G became popular only about three years ago, and the transition to 400G is already under way: network bandwidth is growing faster and faster.
 
On one side is the strong demand for high-speed modules in data centers; on the other is their high failure rate. Compared with 1G and 10G modules, 40G, 100G, and even 200G modules appear to fail noticeably more often. Of course, these high-speed modules are technically far more complex than low-speed ones. A 40G optical module, for example, is essentially four 10G channels bound together and working at the same time. If any one channel has a problem, the entire 40G link becomes unusable, so the failure rate is bound to be higher than that of 10G. The module also has to coordinate four optical paths, which further raises the probability of error. This is even more true of 100G: some modules bind ten 10G channels, and others use new optical technology, both of which increase the chance of errors. At still higher rates the technology is less mature. 400G, for example, is still essentially a laboratory technology that is due to reach the market in 2019, so a small peak in failures is inevitable. As the technology continues to improve, these modules should gradually become as stable as low-speed ones. Imagine using a 1G GBIC optical module 20 years ago: it probably felt much like using 200G does now. It is inevitable that new products fail more often in the short term.
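The multi-channel argument can be made concrete with a little arithmetic. The sketch below assumes that lanes fail independently and with the same probability, which is a simplification, and the 1% per-lane figure in the usage note is purely illustrative, not a measured value. The function name is hypothetical.

```python
def module_failure_probability(lane_failure_probability: float, lanes: int) -> float:
    """Probability that a module fails when ANY one of its lanes fails.

    Assumes lanes fail independently with the same probability
    (a simplifying assumption, not a property of real modules).
    """
    if not 0.0 <= lane_failure_probability <= 1.0:
        raise ValueError("lane_failure_probability must be between 0 and 1")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    # The module works only if every lane works.
    all_lanes_ok = (1.0 - lane_failure_probability) ** lanes
    return 1.0 - all_lanes_ok
```

With an illustrative per-lane failure probability of 0.01, `module_failure_probability(0.01, 4)` gives roughly 0.039 for a four-lane module and `module_failure_probability(0.01, 10)` roughly 0.096 for a ten-lane one, compared with 0.01 for a single lane.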
 
Fortunately, an optical module failure has little impact on the business, because data center links have redundant backups. If the optical module on one link fails, traffic can move to other links. If the symptom is CRC error packets, the network management system can detect it immediately so that the module is replaced early. Optical module failures therefore rarely have a major impact on services. In rare cases, an optical module may cause the device port to fail and the entire device to hang. This is mostly the result of unreasonable implementation on the device side and seldom happens, because optical modules and devices are loosely coupled: although they are connected together, neither depends on the internals of the other. So although high-speed optical modules fail fairly often, the impact on the business is limited and people generally do not pay much attention to it. When a fault is found, the module is simply replaced. The warranty period of high-speed optical modules is also long, and failed modules are generally replaced free of charge, so the loss is small.
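As a rough illustration of how a network management system might flag CRC errors early, the sketch below compares two snapshots of per-port counters and reports ports whose new CRC errors exceed a ratio of newly received packets. The `PortSample` type, the function name, and the default threshold are all assumptions made for this example. How the counters are collected (SNMP, CLI, or an API) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PortSample:
    """One snapshot of a port's counters (hypothetical structure)."""
    port: str
    rx_packets: int
    crc_errors: int


def ports_needing_attention(
    previous: Iterable[PortSample],
    current: Iterable[PortSample],
    max_error_ratio: float = 1e-6,  # illustrative threshold, not a standard
) -> List[str]:
    """Return ports whose CRC error ratio between two snapshots is too high."""
    previous_by_port = {sample.port: sample for sample in previous}
    flagged = []
    for sample in current:
        before = previous_by_port.get(sample.port)
        if before is None:
            continue  # no baseline for this port yet
        new_packets = sample.rx_packets - before.rx_packets
        new_errors = sample.crc_errors - before.crc_errors
        if new_packets < 0 or new_errors < 0:
            continue  # counters were reset; wait for the next snapshot
        if new_errors == 0:
            continue
        # Guard against division by zero when only errored frames arrived.
        if new_errors / max(new_packets, 1) > max_error_ratio:
            flagged.append(sample.port)
    return flagged
```

A port flagged this way is a candidate for early module replacement while traffic is still being carried by the redundant links.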
 
Optical module faults mostly show up as ports failing to come up, optical modules not being recognized, and CRC error packets on ports. These failures can originate on the device side, in the optical module itself, or in the quality of the link. For error packets and ports that fail to come up in particular, it is difficult to locate the fault from software alone. Some faults are compatibility problems: neither side is defective, but the two have not been debugged and adapted to each other, so they cannot work together. This is not uncommon, which is why many network device vendors publish a list of adapted optical modules and require customers to use modules from that list to ensure stability and availability. If you encounter a failure, the best method is still the swap test: change the link fiber, the module, and the port in turn, and use this series of tests to determine whether the problem lies in the optical module, the link, or the device port. Faults with consistent symptoms are relatively easy to locate this way, and the difficult ones are those whose symptoms are not consistent. For example, a port shows CRC error packets, the optical module is pulled out and replaced with a new one, and the fault disappears. The original module is then reinserted, and the fault does not recur. It is then hard to say whether the optical module was faulty at all. This situation comes up often in actual use and makes diagnosis difficult.
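The swap test described above can be sketched as a small decision routine. This is only one way to record and interpret the results, and the `SwapResult` type and `locate_fault` function are hypothetical names. A component is blamed only when the fault clears with a known-good replacement and returns when the original is put back. The ambiguous case from the text, where the fault does not return, is reported as undetermined.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SwapResult:
    """Outcome of swapping one component (hypothetical structure)."""
    component: str        # e.g. "fiber", "module", or "port"
    fault_cleared: bool   # fault disappeared with a known-good replacement
    fault_returned: bool  # fault came back when the original was restored


def locate_fault(results: Iterable[SwapResult]) -> str:
    """Name the single component the swap tests point to, if any."""
    confirmed = [
        result.component
        for result in results
        if result.fault_cleared and result.fault_returned
    ]
    if len(confirmed) == 1:
        return confirmed[0]
    # Zero or several candidates: the tests do not isolate one component.
    return "undetermined"
```

For instance, if replacing the module clears the CRC errors but reinserting the original does not bring them back, `locate_fault` returns `"undetermined"`, which matches the hard-to-judge case in the text.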
 
How can the failure rate of optical modules be reduced? First, control the source. Do not rush higher-bandwidth optical modules to market, and do the experimental work properly. High-speed modules need matching equipment, and those technologies also have to mature, so new high-speed modules should be introduced to the market smoothly rather than in a simple pursuit of speed. Network equipment now supports multi-port bundling, so where 400G is not yet workable, bundling four 100G ports can also meet the demand.

Second, control the introduction of high-speed optical modules. Network equipment vendors and data center customers should be cautious when adopting them, test them more rigorously, and firmly screen out products of defective quality. Competition in the high-speed optical module market is fierce. Every supplier wants to seize the opportunity in new high-speed modules, but quality and price are uneven, so vendors and customers need to step up their evaluation. Higher-speed modules also make verification more complex.

Third, an optical module is a highly integrated device, and its exposed fiber channels and internal components are relatively fragile. Handle modules gently and wear clean gloves to keep dust out, which also reduces failures in use. Fit dust caps on unused optical modules and store them in bags.

Fourth, avoid operating at the limits as far as possible. Examples are running a 100G optical module near its rate limit for long periods, or running a module rated for 200 meters over a full 200-meter link. Operating at these boundaries increases wear on the optical module. It is just like a person. In an air-conditioned room at 24 to 26 degrees, people work efficiently. Outdoors at 35 degrees they cannot concentrate for long and efficiency is very low, and at 40 degrees or more they are close to heatstroke and can hardly work at all. Providing a comfortable environment for the optical module can effectively extend its service life.
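The fourth point, keeping modules away from their limits, can be expressed as a simple margin check. The sketch below is only illustrative: the function names, the 20% default margin, and the idea of applying the same fractional margin to both link length and temperature are assumptions for this example. Real limits should come from the module's datasheet.

```python
from typing import List


def has_operating_margin(actual: float, rated_limit: float, margin: float = 0.2) -> bool:
    """True if `actual` stays at least `margin` (as a fraction) below the rated limit."""
    if rated_limit <= 0:
        raise ValueError("rated_limit must be positive")
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must be in the range [0, 1)")
    return actual <= rated_limit * (1.0 - margin)


def check_module(
    link_length_m: float,
    rated_reach_m: float,
    case_temp_c: float,
    max_case_temp_c: float,
    margin: float = 0.2,  # illustrative default, not a vendor recommendation
) -> List[str]:
    """List the parameters that are running too close to their rated limits."""
    warnings = []
    if not has_operating_margin(link_length_m, rated_reach_m, margin):
        warnings.append("link length")
    # Crude: treats the Celsius limit as a simple fraction; assumes a positive limit.
    if not has_operating_margin(case_temp_c, max_case_temp_c, margin):
        warnings.append("temperature")
    return warnings
```

With these assumed values, a module rated for 200 meters on a 200-meter link is reported with a `"link length"` warning, while the same module on a 100-meter link at a moderate temperature produces no warnings.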