Thursday, June 30, 2011

Data Center ISP Load Sharing Part 1 - The challenge

Redundant ISP connections are standard in most data centers, yet there is surprisingly little up-to-date information about data center Internet architecture. In fact, traffic flow on Internet connections, while increasingly business critical, is probably among the least well understood areas of enterprise network architecture. Consequently, the potential for cost savings and performance improvement is substantial.

This is the first part of a series that explores this interesting and important topic. It starts by examining a classic solution and its associated challenges.

For a large data center, it is often believed that obtaining the entire Internet routing table from each ISP provides the most comprehensive and complete routing information to direct traffic towards the Internet. However, here are a few potential issues to consider:

Heavy resource utilization
The design may propagate the entire Internet routing table through the internal BGP environment within the data center, so that gateways can determine which ISP to exit through for a given Internet destination. Most Internet routers are designed to handle a large number of routes. However, gateway devices, particularly older-generation switches, may be running near their CPU and memory limits. You might want to check CPU utilization on all your Internet-facing devices - high CPU utilization often indicates that some routes are being software switched, and performance degrades as a result.

This diagram from CIDR shows continuous growth of BGP prefixes over the years. Expensive upgrades are often required just to keep up with the growth, in order to maintain the performance of hardware forwarding.
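
To get a rough feel for how much headroom a device has left, a back-of-the-envelope estimate can help. The sketch below assumes linear growth, and every number in it is a hypothetical placeholder rather than a measurement; substitute the route count, hardware forwarding table capacity and growth rate observed on your own devices.

```python
def years_until_full(current_routes, fib_capacity, growth_per_year):
    """Estimate the years left before the routing table outgrows the hardware table.

    Assumes linear growth, which is a simplification of real prefix growth.
    """
    if growth_per_year <= 0:
        raise ValueError("growth_per_year must be positive")
    if current_routes >= fib_capacity:
        return 0.0  # already over capacity; some routes may be software switched
    return (fib_capacity - current_routes) / growth_per_year


# Hypothetical figures for illustration only.
headroom = years_until_full(
    current_routes=360_000,
    fib_capacity=512_000,
    growth_per_year=40_000,
)
print(f"Estimated headroom: {headroom:.1f} years")
```

A short estimate like this does not replace monitoring, but it makes the upgrade timeline explicit when budgeting.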

Imbalance of ISP bandwidth
The second issue varies with the particular ISPs you choose. When you selected your ISPs, did you consider how the selection might affect your design? It is very common to have a tier one ISP and a tier two ISP in one data center. Given the hierarchical nature of the Internet, the routes received from a tier one ISP generally have a shorter AS path than the same routes from a tier two ISP. Because BGP ultimately selects one route per prefix, the tier one ISP may be heavily favored. As a result, you will likely see most of your outbound traffic going to one of the ISPs. Having one circuit running over capacity while another is under-utilized makes it hard to justify the expense of additional bandwidth upgrades.
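
The effect can be illustrated with a toy model. The sketch below looks only at AS path length; real BGP best-path selection evaluates other attributes (such as local preference) before AS path length and several tie-breakers after it. The prefixes, AS numbers and ISP names are hypothetical, taken from the ranges reserved for documentation.

```python
from collections import Counter


def best_exit(paths):
    """Return the ISP offering the shortest AS path (ties go to the first listed)."""
    return min(paths, key=lambda isp: len(paths[isp]))


def exit_share(routes):
    """Count how many prefixes would exit through each ISP."""
    return Counter(best_exit(paths) for paths in routes.values())


# Hypothetical table: prefix -> AS path as received from each ISP.
# AS 64496 plays the tier one ISP; AS 64497 is a tier two ISP that buys
# transit from it, so most of its paths are one hop longer.
routes = {
    "192.0.2.0/24":      {"tier1": [64496, 64500], "tier2": [64497, 64496, 64500]},
    "198.51.100.0/24":   {"tier1": [64496, 64501], "tier2": [64497, 64496, 64501]},
    "203.0.113.0/24":    {"tier1": [64496, 64497, 64502], "tier2": [64497, 64502]},
    "198.51.100.128/25": {"tier1": [64496, 64503], "tier2": [64497, 64496, 64503]},
}

for isp, count in exit_share(routes).items():
    print(f"{isp}: {count} of {len(routes)} prefixes")
```

In this made-up table only the tier two ISP's own customer prefix exits through tier two; everything else prefers the tier one circuit.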

Asymmetrical routing
Most data centers advertise routes out to the Internet equally, to achieve resiliency and inbound load balancing. Extreme imbalance of outbound use, combined with balanced inbound traffic, results in a significant amount of asymmetrical routing. While asymmetrical routing is common on the Internet, an excessive amount may lead to operational headaches such as difficulty in troubleshooting. You have probably been on a phone call where one party can hear much better than the other. As voice, video and other delay-sensitive traffic continues to increase in volume and significance, controlling quality and user experience requires minimizing asymmetrical routing whenever possible.

In conclusion, how well the traditional approach of outbound load sharing based on the entire Internet routing table works depends largely on the particular ISPs, and more often than not the traffic is not well balanced. To achieve a better outcome by design, part 2 will start with a simple alternative.