DMIT LAX Network Failure Analysis
At approximately 19:35:00 Pacific Time, DMIT deployed a change within the LAX metro network to introduce IPv6 over MPLS and IS-IS for the access switches.
1. DMIT uses loopback addresses for IGP routing on all devices.
2. However, in the IPv6 route reflector (RR) configuration, we did not standardize the next-hop for IPv6 routes received from access switches: the next-hop was not changed to the peer address and remained the final interface address.
3. Because of how iBGP behaves, the next-hop is not automatically rewritten to the peer address.
4. To prevent certain customers from using reserved IPv4/IPv6 addresses as point-to-point (PtP) interface addresses, DMIT's internal network does not propagate specific port addresses. (This is why we have to change the next-hop for all iBGP routes.)
5. As a result of the points above, the edge router could not resolve the actual next-hop to the final interface.
6. When a DMIT border router fails to find a specific next-hop, it falls back to a transit table.
7. In the transit table, the route in the FIB was programmed to point to the customer table.
Together, these factors caused IPv6 traffic originating from customers to loop continuously through multiple VRFs on a single router until the TTL (hop limit) of 128 expired.
This ultimately exhausted backplane bandwidth, which caused the RR sessions to disconnect. Once the RR disconnected, customer routing was interrupted, the looping traffic disappeared, and the network recovered briefly before the loop recurred.
This configuration fault caused 3 minutes of downtime and a cumulative 13 minutes of degraded service. DMIT sincerely apologizes for this incident.
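The failure chain described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not DMIT's actual router configuration or forwarding logic: the table names, the addresses (taken from the 2001:db8::/32 documentation prefix), and the rule that each pass between tables consumes one hop are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

HOP_LIMIT = 128  # initial TTL / hop limit cited in the report

# Hypothetical loopbacks carried by the IGP; PtP interface addresses are NOT in it.
IGP_LOOPBACKS = {"2001:db8::1", "2001:db8::2"}


@dataclass
class Route:
    next_hop: str  # next-hop address as received over iBGP


def resolvable(next_hop, igp_routes):
    """A next-hop is usable only if the IGP carries a route to it."""
    return next_hop in igp_routes


def forward(route, igp_routes, hop_limit=HOP_LIMIT):
    """Return (outcome, passes between tables) for a single packet."""
    table = "customer"
    passes = 0
    while hop_limit > 0:
        if table == "customer":
            if resolvable(route.next_hop, igp_routes):
                return "delivered", passes
            table = "transit"   # assumed fallback when the next-hop is unresolved
        else:
            table = "customer"  # transit FIB entry points back at the customer table
        hop_limit -= 1          # assumption: every table-to-table pass costs one hop
        passes += 1
    return "hop limit expired", passes


# Next-hop left as the final interface address (the faulty RR behaviour).
broken = Route(next_hop="2001:db8:ffff::2")
# Next-hop rewritten to the peer's loopback address (the intended behaviour).
fixed = Route(next_hop="2001:db8::2")

print(forward(broken, IGP_LOOPBACKS))  # ('hop limit expired', 128)
print(forward(fixed, IGP_LOOPBACKS))   # ('delivered', 0)
```

In this model a single packet with an unresolvable next-hop is handled 128 times inside the router before it is dropped, which gives a rough sense of how looping traffic could multiply the load on the backplane. Rewriting the next-hop to an address the IGP knows removes the loop entirely.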