Thursday, December 29, 2011

Beware of unpredictable OSPF to OSPF Mutual Redistribution

Just as it may be necessary to run multiple OSPF processes, it may also be necessary to redistribute routes between them. What happens when two OSPF processes redistribute routes to each other? The results can be quite surprising.

As shown in the illustration, R1 runs two OSPF processes and learns the same network from both. From OSPF 1 (left), R1 learns it as an inter-area route. From OSPF 2 (right), R1 learns it as an external route. Which direction would R1 prefer?

A simple test demonstrates how unpredictable the results can be. After shutting down the interface towards OSPF 1 and turning it back on, R1 prefers the E1 route via OSPF 2. Subsequently, after resetting the interface towards OSPF 2, R1 prefers the inter-area route via OSPF 1.

OSPF’s preference of intra-area over inter-area over external applies to routes learned via the same process only; it does not apply to routes learned from multiple OSPF processes.

So what determines preference between routing processes? It is administrative distance (AD). Because all OSPF processes have the same default AD, the results are unpredictable and can be “swung” by resetting interfaces.

When Administrative Distances are equal, the process that first installs the route in the routing table wins, regardless of metric and type.
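
The behavior observed in the test can be reproduced with a toy model. The sketch below is not how IOS is implemented; it only assumes the two rules stated above (lower AD wins, and with equal AD the route already installed stays). The prefix, class names, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

DEFAULT_OSPF_AD = 110  # default AD shared by every OSPF process on Cisco IOS


@dataclass(frozen=True)
class Route:
    prefix: str
    process: str       # e.g. "ospf 1"
    route_type: str    # "inter", "E1", ... (ignored between processes)
    ad: int = DEFAULT_OSPF_AD


class Rib:
    """Toy routing table: lower AD wins, equal AD keeps the incumbent."""

    def __init__(self) -> None:
        self.table: Dict[str, Route] = {}

    def offer(self, route: Route) -> None:
        current = self.table.get(route.prefix)
        if (current is None
                or route.ad < current.ad
                or current.process == route.process):
            self.table[route.prefix] = route

    def withdraw(self, prefix: str, process: str) -> None:
        current = self.table.get(prefix)
        if current is not None and current.process == process:
            del self.table[prefix]


def winner_after_flap(flapped: str, ad_ospf2: int = DEFAULT_OSPF_AD) -> str:
    """Return the process that owns the route after one process flaps."""
    prefix = "10.1.1.0/24"  # hypothetical overlapping network
    r1 = Route(prefix, "ospf 1", "inter")
    r2 = Route(prefix, "ospf 2", "E1", ad=ad_ospf2)
    rib = Rib()
    rib.offer(r1)
    rib.offer(r2)
    # The flapped process loses its route; the other process installs its own.
    rib.withdraw(prefix, flapped)
    survivor = r2 if flapped == "ospf 1" else r1
    rib.offer(survivor)
    # The flapped process comes back and offers its route again.
    rib.offer(r1 if flapped == "ospf 1" else r2)
    return rib.table[prefix].process


print(winner_after_flap("ospf 1"))                # equal AD
print(winner_after_flap("ospf 2"))                # equal AD
print(winner_after_flap("ospf 1", ad_ospf2=120))  # OSPF 2 external AD raised
```

With equal AD the winner depends on which interface was reset last; once the AD of OSPF 2 is raised, OSPF 1 wins regardless of the order of events.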

How can this be made deterministic? The key is obviously AD. But to apply AD properly as a solution, the desired behavior must be clearly defined. First, identify potentially overlapping networks, that is, networks that can be advertised by both processes. Next, decide which process should be preferred for each of those networks.

If all overlapping networks should prefer one process, for example, OSPF 1 inter-area should be preferred over OSPF 2 E1, then the AD of OSPF 2 external routes can be raised above the default of 110:
router ospf 2
distance ospf external 120

The diagram illustrates the result of the fix: routes from OSPF 1 are now preferred because the AD is deterministic.

If the desired behavior is specific to certain networks, then AD must be adjusted selectively using a filter list. AD may also need to be adjusted on both OSPF processes to arrive at the desired preference for each network.

The end results should always be deterministic and predictable, verified using tests in normal and failure scenarios.

As a side note, before the fix for Cisco bug ID CSCdw10987 (integrated in Cisco IOS Software Releases 12.2(07.04)S, 12.2(07.04)T, and later), the last process to run a shortest path first (SPF) calculation would win, and the two processes would overwrite each other's routes in the routing table. Now, if a route is installed via one process, it is not overwritten by another OSPF process with the same administrative distance (AD), unless the route is first deleted from the routing table by the process that initially installed it.

Sunday, October 16, 2011

Understand Nexus 7000 Module Forwarding Capacity

Nexus 7000’s line card forwarding (FIB) capacity may be a limiting factor. This is particularly true in a virtualized data center, where multiple VDCs and VRFs are used. This note explains the importance of understanding and managing module capacity.

The Nexus 7000 has a forwarding engine on each line card. The larger the routing table, the more FIB entries are required for hardware forwarding. When designing with multiple VDCs and VRFs, the additional routing processes and routing tables consume more resources on the module.

The output below shows a module using roughly 64K of its 65K allocated IPv4 unicast entries:

#show hardware capacity forwarding
Module 1 usage:
Route Type              Used      %Used      Free      %Free      Total    
                     (Log/Phys)            (Log/Phys)           (Log/Phys)   
IPv4 Unicast:       64161/64161     97     1375/1375       2    65536/65536 
L2VPN Peer:             0/0          0        0/0          0        0/0     
MPLS:                   0/0          0        0/0          0        0/0     
IPv4 Multicast:         4/8          0    14332/28664     99    14336/28672 
L2VPN IPv4 Mcast:       0/0          0        0/0          0        0/0     
IPv6 Unicast:         195/390        1    14141/28282     98    14336/28672 
L2VPN IPv6 Mcast:       0/0          0        0/0          0        0/0     
IPv6 Multicast:         5/20         0     2043/8172      99     2048/8192  
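
To keep an eye on this over time, the output can be parsed and checked against a threshold. The following is a minimal sketch, assuming the column layout shown above; the 90% alert threshold and the function names are hypothetical choices, not Cisco recommendations.

```python
import re
from typing import Dict, List, Tuple

# Two rows copied from the output above.
SAMPLE = """\
IPv4 Unicast:       64161/64161     97     1375/1375       2    65536/65536
IPv6 Unicast:         195/390        1    14141/28282     98    14336/28672
"""

# name: used(log/phys) %used free(log/phys) %free total(log/phys)
ROW = re.compile(
    r"^(?P<name>[^:]+):\s+(?P<used>\d+)/\d+\s+\d+\s+\d+/\d+\s+\d+\s+"
    r"(?P<total>\d+)/\d+\s*$"
)


def parse_capacity(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each route type to (logical entries used, logical entries total)."""
    usage = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            usage[match.group("name")] = (
                int(match.group("used")),
                int(match.group("total")),
            )
    return usage


def over_threshold(usage: Dict[str, Tuple[int, int]],
                   threshold_pct: float = 90.0) -> List[str]:
    """Return route types whose usage meets or exceeds the threshold."""
    return [
        name
        for name, (used, total) in usage.items()
        if total and 100.0 * used / total >= threshold_pct
    ]


usage = parse_capacity(SAMPLE)
print(over_threshold(usage))
```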

Saturday, October 15, 2011

Building a private cloud using MPLS and selective tag switching

To support virtualization requirements, it may be necessary to create a private cloud within the enterprise. MPLS is a proven technology for extending virtualization across WAN networks. Most existing network equipment can be enabled for MPLS, allowing an enterprise to gain significant new capabilities without the cost of new equipment and facilities.

To transition an existing production network into one that is MPLS enabled, a detailed design is required to select and place key infrastructure elements such as P, PE, and CE routers. Given that existing production traffic may already be routed by these devices, it is often desirable to leave that untouched.

Selective tag switching allows certain VPN traffic to be label switched by the MPLS network, while the other traffic continues to be routed.

 IP address planning is a critical prerequisite to enable selective tag switching. It is highly desirable to place MPLS infrastructure and the new private cloud extensions in a new and distinctive IP address block.

On the P router, private cloud traffic can be identified by the PE router loopback addresses, matched here by a standard access list as an example:
access-list 88 permit

By default, labels are advertised for all prefixes, so label switching applies to all traffic. It is therefore important to reverse that behavior first, and then turn on label advertisement only for the target prefixes identified by the access list:
no mpls ldp advertise-labels
mpls ldp advertise-labels for 88
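
The effect of the access list on label advertisement can be checked offline before touching a production P router. This is a minimal sketch of standard ACL wildcard matching; the PE loopback block (10.255.0.0 with wildcard 0.0.0.255) and the sample prefixes are hypothetical, since the actual addresses are not given here.

```python
import ipaddress
from typing import Dict, Iterable


def acl_permits(address: str, acl_network: str, wildcard: str) -> bool:
    """Standard ACL match: compare only the bits the wildcard does not ignore."""
    addr = int(ipaddress.IPv4Address(address))
    net = int(ipaddress.IPv4Address(acl_network))
    care_bits = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr & care_bits) == (net & care_bits)


def labels_advertised(prefixes: Iterable[str],
                      acl_network: str,
                      wildcard: str) -> Dict[str, bool]:
    """For each prefix, report whether a label would be advertised for it."""
    return {
        prefix: acl_permits(prefix.split("/")[0], acl_network, wildcard)
        for prefix in prefixes
    }


# Hypothetical addressing: PE loopbacks live in a dedicated new block,
# existing production networks live elsewhere.
result = labels_advertised(
    ["10.255.0.1/32", "10.255.0.2/32", "172.16.10.0/24"],
    acl_network="10.255.0.0",
    wildcard="0.0.0.255",
)
for prefix, tagged in result.items():
    print(prefix, "label advertised" if tagged else "untagged")
```

This also shows why a distinct address block matters: one wildcard entry cleanly separates the prefixes that should be label switched from everything else.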

As a result, only MPLS VPN traffic is tag switched, while existing traffic shows as “Untagged”:
P-router#sh mpls forward
Local  Outgoing    Prefix            Bytes tag  Outgoing   Next Hop
tag    tag or VC   or Tunnel Id      switched   interface
16     Untagged   0          Fa0/0
17     Untagged  0          Fa0/0
18     Untagged  0          Fa0/0
19     Untagged  0          Fa1/0
20     Untagged   0          Fa1/0
21     Pop tag   0          Fa0/0
22     Pop tag   0          Fa1/0
23     Pop tag     544932     Et3/0
24     21     399694     Fa0/0
25     21     416019     Fa1/0
26     Pop tag   0          Et3/0
27     Pop tag   0          Et3/0
28     22   0          Fa0/0
29     Pop tag   0          Fa0/0

“traceroute” is also an excellent tool for showing whether a packet is routed or tag switched at each hop through the network.

Saturday, September 24, 2011

A potential problem with Juniper’s implementation of OSPF Router ID

First of all, this non-compliant behavior is observed only on some Juniper devices, not all.

The potential effect of the observed behavior is such that certain OSPF routes fail to propagate as expected.

Here is the scenario: the SRX originally used a loopback for its Router ID. When that loopback was deleted, it changed its Router ID to a different address. This is expected behavior so far.

srx-node0> show interfaces lo0.1
error: interface lo0.1 not found

srx-node0> show ospf overview instance VR   
Instance: VR
  Router ID:

However, a closer look at the SRX OSPF database reveals that it still holds LSAs with the old Router ID as their Advertising Router:
srx-node0> show ospf database summary instance VR

    OSPF database, Area
 Type       ID          Adv Rtr           Seq      Age  Opt  Cksum  Len
Summary      0x80000008  1423  0x22 0xc857  28
Summary *      0x80000002   792  0x22 0x8794  28

    OSPF database, Area
 Type       ID          Adv Rtr           Seq      Age  Opt  Cksum  Len
Summary      0x8000001c  2280  0x22 0xfbfc  28
Summary      0x8000001c  2137  0x22 0xd321  28
Summary      0x8000001c  1994  0x22 0xbf35  28
Summary      0x8000001d  3423  0x22 0xfdfc  28
Summary      0x8000001d  1851  0x22 0xaf2e  28

According to RFC2328:
If a router's OSPF Router ID is changed, the router's OSPF software should be restarted before the new Router ID takes effect.  In this case the router should flush its self-originated LSAs from the routing domain before restarting

Note there are two desired behaviors when the Router ID changes: 1) OSPF restarts; 2) self-originated LSAs are flushed. When neither happens on JUNOS, the old Router ID remains the “Advertising Router” in those LSAs, even though that address is no longer valid.

Why is that a problem? Because these LSAs will be flooded to neighbors (assuming the router here is an ABR). The neighbor will have noticed the change of Router ID, and will therefore check the validity of the Advertising Router.

Again, the definition of Advertising Router according to RFC2328:
This field indicates the Router ID of the router advertising the summary-LSA or AS-external-LSA that led to this path.
Since the neighbor sees the Advertising Router ID (the old Router ID) no longer matches the new Router ID, it will discard the LSA.

When troubleshooting OSPF routing involving Juniper devices, check OSPF databases for invalid entries.
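
Such a check is easy to script once the database has been collected. The sketch below assumes the LSAs have already been parsed into records and that the retired Router IDs (for example, the address of a deleted loopback) are known; all addresses are hypothetical documentation addresses and the names are illustrative.

```python
from typing import Iterable, List, NamedTuple, Set


class Lsa(NamedTuple):
    lsa_type: str     # e.g. "Summary"
    ls_id: str        # link-state ID
    adv_router: str   # Advertising Router field
    age: int          # seconds


def find_stale_lsas(lsas: Iterable[Lsa],
                    retired_router_ids: Set[str]) -> List[Lsa]:
    """Return LSAs still advertised under a Router ID that no longer exists."""
    return [lsa for lsa in lsas if lsa.adv_router in retired_router_ids]


# Hypothetical database contents: 192.0.2.1 was the deleted loopback,
# 192.0.2.2 is the Router ID now in use.
database = [
    Lsa("Summary", "198.51.100.0", "192.0.2.1", 1423),
    Lsa("Summary", "198.51.100.0", "192.0.2.2", 792),
    Lsa("Summary", "203.0.113.0", "192.0.2.1", 2280),
]

for lsa in find_stale_lsas(database, {"192.0.2.1"}):
    print(f"stale {lsa.lsa_type} LSA {lsa.ls_id} from {lsa.adv_router}, age {lsa.age}s")
```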

To prevent such pitfalls, always set Router ID in OSPF. And more importantly, set Router ID using loopbacks, and make sure they are not accidentally deleted. 

Tuesday, September 13, 2011

ERSPAN with Nexus 1000v in a Virtualized Data Center

Encapsulated remote SPAN, or ERSPAN, can be used to monitor traffic remotely. In a Nexus 1000v environment, it is not feasible to attach a probe directly to the virtual switch. It is therefore particularly valuable to monitor host traffic using ERSPAN, which routes the monitored traffic through the IP network to a designated network analyzer.

A functioning ERSPAN system consists of these components working together:
  • Nexus 1000v with a specific port profile and SPAN session
  • Host configured to support a monitoring interface
  • Destination switch to forward monitoring traffic to the probe

A sample reference model is provided here, using a probe attached to a Nexus 7000 as a common example.

Nexus 1000v
First, choose a routed VLAN (2000) to carry ERSPAN traffic. Choose a subnet size that will accommodate growth in the number of hosts (each host uses an IP address). VLAN 2000 is used throughout this example.

Create a port profile for this VLAN on the Nexus 1000v. Note that this VLAN must be a system VLAN.
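
Sizing the subnet is simple arithmetic. The following sketch is one way to do it, assuming three addresses are reserved for the gateways (an HSRP virtual IP plus two physical switch addresses); that reservation, the function name, and the 198.51.100.0 documentation prefix are assumptions for illustration only.

```python
import ipaddress


def smallest_prefix_for(hosts: int, reserved: int = 3) -> int:
    """Longest IPv4 prefix length that still fits hosts plus reserved addresses."""
    needed = hosts + reserved + 2  # plus network and broadcast addresses
    for prefix_len in range(30, 0, -1):
        if 2 ** (32 - prefix_len) >= needed:
            return prefix_len
    raise ValueError("too many hosts for a single IPv4 subnet")


# 64 matches the "vmware max-ports 64" limit used in the port profile.
for host_count in (20, 64):
    prefix_len = smallest_prefix_for(host_count)
    network = ipaddress.ip_network(f"198.51.100.0/{prefix_len}")
    print(host_count, "hosts ->", network, "with", network.num_addresses, "addresses")
```

Leaving headroom at this stage is cheaper than renumbering VMkernel interfaces on every host later.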

port-profile type vethernet ERSPAN_2000
  capability l3control
  vmware port-group
  vmware max-ports 64
  switchport mode access
  switchport access vlan 2000
  no shutdown
  system vlan 2000
  state enabled

Next, create a test ERSPAN session, for example, to monitor the VM on Veth88 and send the monitored traffic to the destination IP address. See the Nexus 7000 section for the destination configuration.

monitor session 1 type erspan-source
  source interface Vethernet88 both
  destination ip
  erspan-id 51
  ip ttl 64
  ip prec 0
  ip dscp 0
  mtu 1500
  header-type 2
  no shut

Add a VMKNIC for each host
This must be done from vCenter, for each host. An IP address in VLAN 2000 is required for each host.
Refer to the VMware configuration guide for details.

Nexus 7000
The destination probe is connected to the Nexus 7000. We want monitored traffic originating from the Nexus 1000v to be forwarded to the probe.

The destination IP address specified by the ERSPAN session (on the N1kv) has a static ARP entry in VLAN 3000. There is also a corresponding static MAC address entry pointing to the port to which the probe is connected. As a result, ERSPAN traffic destined for that address is forwarded to the probe.

interface Vlan2000
  ip address
  hsrp 2000

interface Vlan3000
  ip address
  ip arp 00AA.BBCC.DD66

interface Ethernet2/2
  switchport access vlan 3000
  no shutdown

mac address-table static 00AA.BBCC.DD66 vlan 3000 interface Ethernet2/2

Monday, July 4, 2011

Data Center ISP Load Sharing Part 4 – Tuning

Does the full Internet routing table work best for load sharing with multi-homed ISP connections? We have shown that this is often not the case.

Part 1 of the posting shows the challenges of dual ISP design: the traditional approach of outbound load sharing based on the entire Internet routing table depends largely on the particular ISPs.

Part 2 of the posting shows the advantage of simple default route based Internet load sharing design.

Part 3 of the posting introduces a design that combines the simplicity of default based load sharing to dual ISP, and flexibility of selectively filtering subsets of Internet routes for optimal path selection.

In this final part we look at why and how the results should be fine-tuned.

At the initial design stage, you can estimate the number of routes filtered in from each ISP, using the BGP regular expressions you designed. The route count estimate provides the basis for your filtering design. For example, you may count approximately 50000 routes allowed in by a filter specifying only networks adjacent to a tier one ISP. You may also count approximately 40000 routes allowed in by another filter specifying networks adjacent to a tier two ISP as well as those one hop away.

After implementation, you will notice that the actual number of specific networks allowed in is less than the combined total of 50000 plus 40000. The actual number (for example, 80000) is lower because of duplicates. In other words, you learn the same 10000 routes from both ISPs because those networks are adjacent to both. This is common and to be expected.

You might have expected the duplicate routes to be split more or less evenly across the two ISPs, which is often not the case. The effect of duplicate routes on load sharing therefore requires careful observation. ISPs may operate in different tiers of the Internet hierarchy, so the routes they advertise have shorter or longer AS paths. AS path length is a primary criterion in BGP path selection, so you will likely see almost all of the duplicate routes favoring one ISP. This may affect load sharing and require further adjustment of the filters.

A second example is the influence of ISP metrics. Some ISPs advertise routes with a non-zero metric (MED), while others advertise all routes with a metric of zero. A route with zero metric may be preferred when all higher-priority criteria are equal.

The diagram shows that the original design may have expected a load sharing ratio of 5/4. However, the actual ratio between ISP1 and ISP2 turns out to be 5/3, because all duplicate routes favor ISP1. Depending on your specific requirements, fine-tuning of the filters may be necessary.
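
The arithmetic behind these ratios can be sketched as follows, using the example counts from this post (50000 and 40000 routes with 10000 duplicates). The function and parameter names are illustrative, and route counts are only a rough proxy for traffic volume.

```python
from typing import Tuple


def route_split(isp1_routes: int,
                isp2_routes: int,
                duplicates: int,
                dup_share_isp1: float) -> Tuple[int, int, int]:
    """Return (unique routes, best paths via ISP1, best paths via ISP2).

    dup_share_isp1 is the fraction of duplicate routes whose best path
    ends up pointing at ISP1 (1.0 means ISP1 wins every duplicate).
    """
    unique_total = isp1_routes + isp2_routes - duplicates
    dup_via_isp1 = round(duplicates * dup_share_isp1)
    isp1_best = (isp1_routes - duplicates) + dup_via_isp1
    isp2_best = (isp2_routes - duplicates) + (duplicates - dup_via_isp1)
    return unique_total, isp1_best, isp2_best


# All duplicates favor ISP1 (shorter AS path): 50000 vs 30000, i.e. 5/3.
print(route_split(50000, 40000, 10000, dup_share_isp1=1.0))

# Duplicates split evenly: 45000 vs 35000.
print(route_split(50000, 40000, 10000, dup_share_isp1=0.5))
```

The 5/4 expectation comes from comparing the raw per-ISP counts; the 5/3 outcome appears once the 10000 shared routes all select ISP1.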

The design of a redundant Internet architecture is unique to every organization’s requirements, its national or global data center architecture, the ISPs selected, and the nature of its Internet traffic. The scenarios described here will hopefully serve as simple templates to adjust for your particular data center.

Sunday, July 3, 2011

Data Center ISP Load Sharing Part 3 – Route Selection

Part 2 of the posting shows the advantage of a simple default route based Internet load sharing design. This part optimizes the design further.

Using the entire Internet routing table for outbound load sharing proves to be resource intensive, and ineffective for load balancing. Default route only provides simplicity and better load balancing. To further optimize, a subset of Internet routes, when selected according to the unique environment, can complement the default route design very well.

Route Selection
Route selection refers to filtering the Internet routing table and allowing only a subset of it into the data center. The desired effect is to take the shorter path to content that is directly attached to a specific ISP, while the rest of the traffic is shared equally between both ISPs.

The effectiveness of the design is largely based on route selection techniques applicable to the specific data center environment. In the example shown below, a BGP regular expression is used to select a subset of Internet destinations adjacent to each ISP.

ISP1 is a tier one, and therefore has more directly attached networks. A BGP regular expression is used to select those directly attached networks, with the objective that traffic destined for those networks will exit on this ISP for the optimal path.

ISP2 is a tier two, with fewer directly attached networks. A BGP regular expression is used to select those directly attached networks as well as those one additional hop away, with the objective that a roughly equal number of specific target networks will prefer ISP2 as the exit point, thus achieving load sharing across both ISPs.

On the respective Internet router connected to each ISP, AS path filtering is applied in a route map, which is then applied as the inbound BGP route filter. As a result, the default route and a subset of Internet routes are received from each ISP, so that outbound traffic takes the more direct path to those destinations.

ip as-path access-list 1 permit ^3549_[0-9]*$

route-map ISP1in permit 10 
 match ip address prefix-list default
route-map ISP1in permit 20 
 match as-path 1 

router bgp
 neighbor … route-map ISP1in in
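
Before applying a filter, it helps to test the regular expression against sample AS paths offline. The sketch below translates the Cisco `_` delimiter into a Python equivalent and is only an approximation of the IOS regex engine. The ISP1 pattern is the one shown above; the ISP2 pattern and AS 64500 are hypothetical, standing in for "directly attached plus one additional hop".

```python
import re
from typing import Iterable, List

# In Cisco AS path regex, "_" matches a delimiter: start or end of string,
# space, comma, braces, or parentheses.
CISCO_UNDERSCORE = r"(?:^|$|[ ,{}()])"


def cisco_as_path_to_python(pattern: str) -> str:
    """Rewrite a Cisco AS path regex so Python's re module can evaluate it."""
    return pattern.replace("_", CISCO_UNDERSCORE)


def permitted(as_paths: Iterable[str], cisco_pattern: str) -> List[str]:
    """Return the AS paths that the as-path access-list entry would permit."""
    compiled = re.compile(cisco_as_path_to_python(cisco_pattern))
    return [path for path in as_paths if compiled.search(path)]


ISP1_FILTER = "^3549_[0-9]*$"           # from the configuration above
ISP2_FILTER = "^64500_[0-9]*_[0-9]*$"   # hypothetical: one extra hop allowed

isp1_paths = ["3549", "3549 701", "3549 701 1239", "701 3549"]
isp2_paths = ["64500", "64500 701", "64500 701 1239", "64500 701 1239 3356"]

print(permitted(isp1_paths, ISP1_FILTER))
print(permitted(isp2_paths, ISP2_FILTER))
```

Counting the permitted paths in a saved copy of each ISP's table gives the route count estimate used in the planning stage.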

Verification and Tuning
At the planning stage, counting the number of routes matched by each BGP regular expression filter helps arrive at the initial route selection design. By filtering in a similar number of specific routes from each ISP, the desired load sharing can usually be achieved.

However, equivalent number of routes does not always result in equivalent amount of traffic. Over time, actual load on the respective ISP connections will provide more accurate information about traffic in the particular data center. ISP specific characteristics may also factor in. Part 4 will show why fine-tuning may be necessary.

Saturday, July 2, 2011

Data Center ISP Load Sharing Part 2 – Default Method

Part 1 of the posting shows the challenges of dual ISP design: the traditional approach of outbound load sharing based on the entire Internet routing table depends largely on the particular ISPs. When routes received from the ISPs have different characteristics, such as AS path and metric, the result is often undesirable. To achieve a better outcome by design, in part 2 we start with a simple alternative.

Replacing the entire Internet routing table with just the default route is an extremely simple method that offers a number of advantages.

Load balancing
Instead of the entire Internet routing table, only the default route is received and installed in the routing table. As a result, the IGP can load balance across two equal cost default routes. For outbound traffic, this simple design achieves near 50/50 load balancing, as well as resiliency.
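
Why two equal cost default routes give a near even split can be illustrated with per-flow hashing. This is a minimal sketch: real routers use their own hardware hash functions, so SHA-256 here is only a stand-in, and the flows and next hop names are synthetic.

```python
import hashlib
from collections import Counter
from typing import Iterable, Sequence, Tuple


def pick_exit(src_ip: str, dst_ip: str, next_hops: Sequence[str]) -> str:
    """Choose a next hop per flow so packets of one flow stay on one path."""
    digest = hashlib.sha256(f"{src_ip}->{dst_ip}".encode()).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


def share(flows: Iterable[Tuple[str, str]], next_hops: Sequence[str]) -> Counter:
    """Count how many flows land on each next hop."""
    return Counter(pick_exit(src, dst, next_hops) for src, dst in flows)


# 10000 synthetic flows from internal hosts to Internet destinations.
flows = [
    (f"10.0.{i // 250}.{i % 250 + 1}", f"203.0.113.{i % 254 + 1}")
    for i in range(10000)
]

print(share(flows, ["ISP1", "ISP2"]))
```

The split is even in flow count, not necessarily in bytes; a few heavy flows can still skew link utilization.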

Simplicity, Stability and Resource Efficiency
The design is extremely simple to implement and support. Resource usage on devices can be greatly reduced, from holding tens or hundreds of thousands of Internet routes to just the default route. Route flapping and disruptive convergence caused by instability in any part of the Internet are virtually eliminated.

The simplicity advantage is well suited for a large number of enterprise data centers.

The design essentially “splits” the Internet in half, using two equal cost default routes to the dual ISPs. The exit point may therefore not be optimal, especially for networks directly attached to one ISP, which may be reached over the longer path through the other.

For the vast majority of applications, the selection of exit ISP is not noticeable. However, lower latency access to large amounts of media content may be highly desirable when a direct path is available. An optimized solution is presented in part 3.

Thursday, June 30, 2011

Data Center ISP Load Sharing Part 1 - The challenge

Redundant ISP connections are the standard in most data centers. There is surprisingly little up-to-date information about data center Internet architecture out there. In fact, traffic flow on Internet connections, while increasingly business critical, is probably among the least well understood areas of enterprise network architecture. Consequently, the potential for cost saving and performance improvement is substantial.

This is the first part of a series that explores this interesting and important topic. It starts by examining a classic solution and its associated challenges.

For a large data center, it is often believed that obtaining the entire Internet routing table from each ISP provides the most comprehensive and complete routing information to direct traffic towards the Internet. However, here are a few potential issues to consider:

Heavy resource utilization
The design may propagate the entire Internet routing table into the internal BGP environment within the data center, so that gateways can determine which ISP to exit to for a given Internet destination. Most Internet routers are designed to handle a large number of routes. However, gateway devices, particularly older generation switches, may be running near their CPU and memory limits. You might want to check CPU utilization on all your Internet facing devices: high CPU utilization often indicates that some routes are software switched, and performance will degrade as a result.

This diagram from CIDR shows continuous growth of BGP prefixes over the years. Expensive upgrades are often required just to keep up with the growth, in order to maintain the performance of hardware forwarding.

Imbalance of ISP bandwidth
The second issue varies with the particular ISPs you choose. When you selected your ISPs, did you consider how the selection may affect your design? It is very common to have a tier one ISP and a tier two ISP in one data center. Given the hierarchical nature of the Internet, the routes received from a tier one ISP generally have a shorter AS path than the same routes from a tier two ISP. With BGP ultimately selecting one route, the tier one ISP may be heavily favored. As a result, you will likely see your outbound traffic mostly going to one of the ISPs. Having one circuit running over capacity while the other is under-utilized makes it hard to justify the expense of additional bandwidth upgrades.

Asymmetrical routing
Most data centers advertise routes out to the Internet equally, to achieve resiliency and inbound load balancing. Extreme imbalance of outbound use combined with balanced inbound traffic results in a significant amount of asymmetrical routing. While asymmetrical routing is common on the Internet, an excessive amount may lead to operational headaches such as difficulty in troubleshooting. You have probably been on a phone call where one party can hear much better than the other. As voice, video and other delay sensitive traffic continue to increase in volume and significance, controlling quality and user experience requires minimizing asymmetrical routing whenever possible.

In conclusion, the traditional approach of outbound load sharing based on entire internet routing table will largely depend on the particular ISPs. And more often than not, the traffic is not well balanced. To achieve better outcome by design, in part 2 we will start with a simple alternative.

Sunday, March 6, 2011

Cisco Flow Control with NetApp NAS

Update Nov 2012: Thanks for the comment, Paul. I have added an update with another look at flow control.

When NAS is used with virtualization, performance and throughput are potential concerns. Enabling jumbo frames has proven to be the most effective method. Flow control, on the other hand, is not something all parties easily agree on.

NetApp has a one paragraph “best practice” on the subject. The recommendation is to set flow control to “receive on” on the switch port. In other words, allow the switch to receive “pause” from NAS.

As it appears, NetApp can send a lot of “pause”. This is easily shown on the port channel or physical interfaces connected to NAS:

Nexus5k# show interface e2/7
Ethernet2/7 is up
  30 seconds input rate 116486568 bits/sec, 2079 packets/sec
  30 seconds output rate 33970464 bits/sec, 792 packets/sec
  Load-Interval #2: 5 minute (300 seconds)
    input rate 57.99 Mbps, 1.58 Kpps; output rate 32.18 Mbps, 693 pps
    20604669084 unicast packets  176831374 multicast packets  0 broadcast packet
    20781500458 input packets  63305126825983 bytes
    8255225288 jumbo packets  0 storm suppression packets
    0 runts  0 giants  0 CRC  0 no buffer
    0 input error  0 short frame  0 overrun   0 underrun  0 ignored
    0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop
    0 input with dribble  0 input discard
    176409426 Rx pause
    12091810302 unicast packets  52908470 multicast packets  2650659 broadcast p
    12147369431 output packets  48540927928599 bytes
    7298956759 jumbo packets
    0 output errors  0 collision  0 deferred  0 late collision
    0 lost carrier  0 no carrier  0 babble
    0 Tx pause
  0 interface resets

So what does it mean? Flow control is only meaningful if the receiving party acts on it. The expectation here is therefore for the switch to “slow down” its transmission, since it is hearing the NAS say “slow down, I can’t keep up”.

According to Cisco, once Pause is enabled and received on the switch egress port, the switch applies back pressure to the ingress port, and packets are eventually buffered on the ingress port. If pause is enabled on the ingress port, it can in turn send pause to the upstream switch. In the ideal scenario, the pause eventually reaches the source, which is the ESXi host, thus slowing down the transmission at its origin.

However, taking a second look at the NetApp diagram here, the recommendation is to configure the end points, ESX servers & NetApp arrays with flow control set to ‘send on’ and ‘receive off’. In other words, ESX is not expected to receive flow control, which leaves only the network to absorb Pause.

Now let’s revisit the flow control scenario. The NAS sends Pause and the switch receives it, but there is no use applying back pressure all the way to the host, since the host won’t receive it. The best a switch can do is to buffer the traffic, or apply back pressure upstream and have the upstream switch buffer it somewhere.

How well that works really depends on the switch and line card models, each of which has different capabilities and buffer sizes. In many cases, it is highly questionable whether flow control propagates far enough back to have any positive effect. In any case, you want to check:
  • Switch interface to NAS, to see the amount of Pause received
  • All interfaces where NAS traffic flows, to see if there are drops
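
The first check can be scripted against saved `show interface` output. This is a minimal sketch that assumes the counter wording shown in the output above ("Rx pause", "Tx pause", "input packets"); the function names are illustrative, and the sample values are taken from that output.

```python
import re
from typing import Dict, Optional

SAMPLE = """\
    20781500458 input packets  63305126825983 bytes
    176409426 Rx pause
    12147369431 output packets  48540927928599 bytes
    0 Tx pause
"""


def read_counter(output: str, label: str) -> Optional[int]:
    """Return the number that precedes the given counter label, if present."""
    match = re.search(rf"(\d+)\s+{re.escape(label)}", output)
    return int(match.group(1)) if match else None


def pause_summary(output: str) -> Dict[str, Optional[float]]:
    """Collect pause counters and the ratio of Rx pause to input packets."""
    rx_pause = read_counter(output, "Rx pause")
    tx_pause = read_counter(output, "Tx pause")
    input_packets = read_counter(output, "input packets")
    ratio = None
    if rx_pause is not None and input_packets:
        ratio = rx_pause / input_packets
    return {"rx_pause": rx_pause, "tx_pause": tx_pause, "rx_pause_ratio": ratio}


print(pause_summary(SAMPLE))
```

Comparing the counters between two samples taken a few minutes apart is more telling than the absolute totals, which accumulate since the last counter reset.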
More clarification on this topic is probably required from vendors. Device behavior will likely evolve with technology advancements. For now, it’s best to enable flow control (receive on) on the switch side, but monitor network behavior closely.

Saturday, February 19, 2011

Where to place VSM and vCenter

With Nexus 1000v, the VSM and vCenter can run as VMs under a VEM, but that doesn’t mean they always should.

The VSM is the “supervisor” for the VEMs (virtual line cards). It also communicates with vCenter, which is the central management and provisioning center for VMware virtual switching.

As network designers, we need to work with the host team to determine the VSM’s form factor:
  • As a VM running under VEM (taking a veth port)
  • As a VM running under a vSwitch
  • As a separate physical machine
  • As an appliance (Nexus 1010 VSA)
As you can see, options range from complete integration in the virtualized environment, to complete separation, at increasing cost. Arguably, in a large and complex virtualization environment, the advantage of having separate control points will become more apparent. Here we briefly touch on two practical considerations.
Failure Scenarios
When everything works, there is really no disadvantage to having the VSM and vCenter plugged into a VEM. In theory, the VSM can communicate even before the VEMs have booted up, through the control and packet VLANs, which should be system VLANs. However, troubleshooting can become a lot more complex when something is not right. Examples include a misconfiguration on vCenter leading to communication failure, a software bug on the Nexus 1000v leading to partial VLAN failures, or a faulty line card dropping packets.

The point is, if there is a failure, we want to know quickly whether it is in the control plane or the data plane. We often rely on the control plane to analyze what is going on in the data plane. Mixing the VSM with the VEM increases the risk of control plane and data plane failures happening at the same time, making root cause isolation more difficult. However unlikely we may think they are, failure scenarios can happen. When they do, having access to the VSM and vCenter is essential to troubleshooting and problem isolation. We know the VEM does not rely on the availability of the VSM to pass packets; however, having the VSM under a VEM essentially places it under the same DVS that it manages, and therefore makes it subject to, for example, DVS port corruption errors. When a VEM fails, imagine losing access to the VSM and vCenter as well because they are running under it.
Administrative Boundary
The VSM and vCenter, due to their critical nature, need to be protected. To prevent administrators from mistakenly changing vCenter and the VSM while making changes to other VMs, as much administrative boundary should be established as the infrastructure supports.
Having the VSM and vCenter in a separate control cluster with dedicated hosts creates a clear administrative boundary. Using a VMware virtual switch (vDS) instead of a VEM for vCenter and the VSM further decouples the dependency. The vDS should be clearly named so that its special purpose is understood by all administrators, minimizing the chance of mistakes.
The diagram shows a sample placement of the VSM and vCenter as VMs on a control cluster, separate from the application VMs they manage.