Saturday, December 11, 2010

OSPF EIGRP BGP dual point mutual redistribution - Part 3

In parts 1 and 2 of this series, we focused on OSPF and EIGRP mutual redistribution. We used administrative distance to control preference and tags to prevent loops.

In a typical enterprise WAN environment, carrier MPLS can also be used to carry traffic between sites and data centers. In a resilient architecture, there are multiple paths to the same destination. The business requirement may be such that certain traffic should take one type of link as its primary path, while still having a backup path in case of failure.

In the illustration, MPLS provides the WAN backup path for direct facilities (OSPF-EIGRP). BGP is used as the dynamic routing protocol through the MPLS cloud.

Recall from part 2 that tag 25 marks routes originated in OSPF and prevents them from being fed back from EIGRP into OSPF. So why is there a third issue once BGP is involved? Because the same route is also advertised out of OSPF into the MPLS cloud via BGP. The data center running EIGRP then learns the same route from the MPLS cloud as a BGP route, in this case untagged. At the EIGRP-to-OSPF redistribution point, the tag filter does not stop feedback from a route learned via BGP. As long as the east coast has a feasible successor for it (a path whose reported distance is lower than the current best FD), the route is advertised back to the west coast as an EIGRP external route with an administrative distance of 100 (lowered in part 1), beating OSPF's 110 and preventing the desired redistribution.

Here is an example from a network with the feedback issue. Note the entry tagged “1979” (the route learned back from the MPLS cloud) is retained as a feasible successor because its reported distance (7168) is lower than the current FD (30464). The end result: a network that originated on the west coast and was advertised out to MPLS comes back from the east coast over EIGRP, preventing the desired redistribution from OSPF into EIGRP.

East-RTR1#sh ip eigrp top 172.31.44.0 255.255.254.0
EIGRP-IPv4 (AS 100): Topology default(0) entry for 172.31.44.0/23
State is Passive, Query origin flag is 1, 2 Successor(s), FD is 30464
Routing Descriptor Blocks:
10.48.137.101 (GigabitEthernet0/0/1), from 10.48.137.101, Send flag is 0x0
Composite metric is (30464/30208), Route is External
Vector metric:
Minimum bandwidth is 625000 Kbit
Total delay is 1030 microseconds
Reliability is 255/255
Load is 31/255
Minimum MTU is 1500
Hop count is 3
External data:
Originating router is 168.147.152.166
AS number of route is 1
External protocol is OSPF, external metric is 20
Administrator tag is 250 (0x000000FA)
10.48.138.101 (GigabitEthernet0/0/2), from 10.48.138.101, Send flag is 0x0
Composite metric is (30464/30208), Route is External
Vector metric:
Minimum bandwidth is 625000 Kbit
Total delay is 1030 microseconds
Reliability is 255/255
Load is 2/255
Minimum MTU is 1500
Hop count is 3
External data:
Originating router is 168.147.152.166
AS number of route is 1
External protocol is OSPF, external metric is 20
Administrator tag is 250 (0x000000FA)
10.250.32.205 (GigabitEthernet0/2/0), from 10.250.32.205, Send flag is 0x0
Composite metric is (32768/7168), Route is External
Vector metric:
Minimum bandwidth is 625000 Kbit
Total delay is 1120 microseconds
Reliability is 255/255
Load is 2/255
Minimum MTU is 1500
Hop count is 3
External data:
Originating router is 10.250.248.1
AS number of route is 64601
External protocol is BGP, external metric is 0
Administrator tag is 1979 (0x0000369B)


The fix is to also set tag 25 on routes coming from the west coast data center (identified by the originating AS number) at the BGP-to-EIGRP redistribution point. These tagged routes can then be blocked from feeding back over the EIGRP link with a route map applied as an inbound EIGRP distribute-list.
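
As an illustration only, here is a minimal sketch of both pieces. The EIGRP AS (100) and the originating AS (64601) are taken from the output above; the local BGP AS (65000), seed metric, and route-map names are placeholders.

! East coast: when redistributing BGP into EIGRP, tag routes originated
! by the west coast (matched here by originating AS; the exact match
! criteria depend on how the provider handles AS paths)
ip as-path access-list 10 permit _64601$
!
route-map BGP-to-EIGRP permit 10
 match as-path 10
 set tag 25
route-map BGP-to-EIGRP permit 20
!
router eigrp 100
 redistribute bgp 65000 metric 100000 100 255 1 1500 route-map BGP-to-EIGRP
!
! West coast redistribution routers: the part 2 distribute-list that matches
! the OSPF tags now also drops these routes arriving over the EIGRP link
router eigrp 100
 distribute-list route-map Deny_routes_from_OSPF in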

Sunday, November 14, 2010

Optimize Google and Netflix with ISP

This is the first part of a series exploring how to optimize enterprise Internet applications.

I estimate that I use Google at least a dozen times on an average work day. I don’t think I would be far off to say that it is the top Internet application in almost every enterprise. While writing this article, I searched a couple of keywords and arrived at a few web sites to find the specific information I was looking for. I did not have any books or references with me. More and more, business users expect information on demand. It is often faster to get information from the Internet than to find it on your hard drive, and sometimes more accurate than asking your buddies.

Why does this matter? Google is just an example here. For an Internet application that triggers millions of transactions every day in your enterprise, it is worthwhile to look “into the cloud” and understand how your enterprise is interconnected with the rest of the world.

Google’s primary AS is 15169. It originates more than a hundred IPv4 prefixes for common apps such as www.google.com. So how are you connected to AS15169? The average AS path length to Google is almost 3. But if you share one of the ISPs that Google peers with directly, your path will be shorter.

The following lists the top IPv4 peers for AS 15169:
ASN Name
AS3356 Level 3 Communications
AS3549 Global Crossing
AS1299 TeliaNet Global Network
AS7018 AT&T Services, Inc.
AS174 Cogent Communications
AS6762 Telecom Italia Sparkle
AS3320 Deutsche Telekom AG
AS286 KPN Internet Backbone
AS6939 Hurricane Electric, Inc.
AS3257 Tinet SpA


As an example, check the Internet routing table to see how you reach an IP address for Google.com. The second, shorter path (through a tier-one ISP) is preferred over the first because of its shorter AS path.

Internet_router#show ip bgp 173.194.33.104
BGP routing table entry for 173.194.33.0/24, version 2010015
Paths: (2 available, best #2, table default)
Not advertised to any peer
13789 2828 15169

3356 15169


The significance goes beyond the length of the path or how many ISPs you transit. Larger ISPs are becoming content providers in order to further optimize performance. Just a few days ago, Level 3 signed a deal with Netflix to “support storage for the entire Netflix library of content”. Level 3 has become the primary content network for Netflix, while Akamai and Limelight remain partners. In this case, if your users are in the entertainment business, you will be at a significant advantage by peering directly with Level 3.

There are more considerations: inbound vs. outbound, streaming, CDN, geography, international connectivity, resilience, and cost. More to follow.

Saturday, October 23, 2010

BGP timer and Cisco/Juniper negotiation

The Juniper SRX integrates firewall features with full routing capabilities. In a network architecture with increasing demand for security segmentation and stateful firewall inspection within the network, multi-vendor routing interaction may become necessary.

Risks and complexity exist in a mixed-vendor environment, even when both sides support standard protocols. Here is one such example.

To establish a BGP neighbor relationship, both sides need to agree on timer values. RFC 4271 defines the characteristics of the timers but does not mandate specific keepalive and hold-time values.

BGP uses keepalive messages to monitor established connections. If no keepalive or update is received for a period exceeding the hold time, BGP considers the neighbor connection down. By convention, the hold time is three times the keepalive interval.

By default, vendors set timers differently:
Vendor               Keepalive (seconds)  Hold time (seconds)
Cisco Nexus (4.2)    60                   180
Juniper SRX (10.0)   30                   90

The timers need to agree between the two peers, either through matching configuration or through negotiation. Note that with Juniper SRX, BGP on the local routing device uses the smaller of the local hold-time value and the peer’s hold-time value received in the OPEN message as the hold time for the BGP connection between the two peers. Therefore, by setting the timers on the Cisco Nexus to smaller values (keepalive 10, hold 30), those values are adopted by the SRX.
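
As a minimal sketch, the shorter timers could be set on the Nexus side roughly as follows; the AS number, VRF, and neighbor address match the outputs below, but the exact command placement (global, per VRF, or per neighbor) varies by NX-OS release:

router bgp 65000
  vrf test
    neighbor 10.88.14.4 remote-as 65000
      ! keepalive 10 seconds, hold time 30 seconds
      timers 10 30
      address-family ipv4 unicast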

To reduce convergence time, the test sets the Nexus keepalive to 10 seconds. The outputs below show that after successful negotiation, the SRX (with a local hold time of 90) uses the Nexus's keepalive of 10 seconds and an active hold time of 30 seconds.

On Juniper SRX:
Peer: 10.88.15.14+179 AS 65000 Local: 10.88.14.4+54104 AS 65000
Type: Internal State: Established Flags:
Last State: OpenConfirm Last Event: RecvKeepAlive
Last Error: None
Export: [ bgp_outbound_policy ]
Options:
Local Address: 10.88.14.4 Holdtime: 90 Preference: 170
Number of flaps: 0
Peer ID: 10.88.120.130 Local ID: 10.88.14.4 Active Holdtime: 30
Keepalive Interval: 10 Peer index: 0


On Cisco Nexus:
Nexus7k# sh ip bgp nei vrf test
BGP neighbor is 10.88.14.4, remote AS 65000, ibgp link, Peer index 1
BGP version 4, remote router ID 10.88.14.4
BGP state = Established, up for 00:53:15
Last read 00:00:05, hold time = 30, keepalive interval is 10 seconds
Last written 0.956168, keepalive timer expiry due 00:00:09
Received 20433 messages, 1 notifications, 0 bytes in queue
Sent 18392 messages, 0 notifications, 0 bytes in queue
Connections established 2, dropped 1
Last reset by peer 02:05:25, due to peer deconfigured
Last reset by us never, due to process restart

Neighbor capabilities:
Dynamic capability: advertised (mp, refresh, gr)
Dynamic capability (old): advertised
Route refresh capability (new): advertised received
Route refresh capability (old): advertised received
4-Byte AS capability: advertised received
Address family IPv4 Unicast: advertised received
Graceful Restart capability: advertised received

Friday, October 8, 2010

Enable BGP load sharing with Multipath-relax

If your network is like most large enterprises’, chances are you have Internet connections from dual ISPs to your data center.

It used to be a popular tactic to have a primary provider and a backup provider with less bandwidth that you paid less for. In recent years, usage-based circuits have given way to flat-rate, often higher-bandwidth circuits from both ISPs. After all, users demand more bandwidth, even in failure scenarios.

Why pay for expensive Internet access and let it sit idle most of the time? Load sharing across redundant ISP connections has therefore become highly desirable for data centers.

By default, BGP installs only one route into the routing table. To accomplish load sharing, BGP multipath needs to be enabled. To further complicate matters, BGP requires candidate paths to have all of the following attributes equal in order to become equal-cost paths:
• Weight
• Local preference
• AS-PATH length
• Origin
• MED
• AS-PATH (identical, not just equal length)

Since the neighboring AS (or the entire AS-PATH) must be identical, BGP can only load-balance toward a single AS, i.e. a single ISP. Because different ISPs have different AS numbers and AS-PATHs, BGP will pick only one best route.

So how does Cisco support load balancing across two ISPs? There is a hidden command: bgp bestpath as-path multipath-relax

I don’t know why it is a hidden command, since it appears to be a very useful, supported feature. But do use it with caution, as the number of routing entries will increase, potentially overwhelming the limited memory on your Internet router.
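
As a minimal sketch (the local AS, ISP AS numbers, and neighbor addresses are placeholders), the relevant configuration looks roughly like this:

router bgp 65000
 ! relax the best-path check so paths of equal AS-path length
 ! from different neighboring ASes qualify for multipath
 bgp bestpath as-path multipath-relax
 ! install up to two BGP paths in the routing table
 maximum-paths 2
 neighbor 192.0.2.1 remote-as 64501
 neighbor 198.51.100.1 remote-as 64502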

At this year’s Networkers, an alternative method was also shown: essentially splitting the Internet routing table between the two providers, by even and odd prefixes, for outbound traffic.

Sunday, September 5, 2010

Not all ports are created equal on Nexus5000

With the Nexus 5000, it is very common to select the first few available ports for basic infrastructure such as peer links and uplinks. But that may not be best practice for resiliency, because ports share ASICs on the switch.

In short, there are 5 ASICs; ports e1/1-4 share the first ASIC, for example. If the first 4 physical ports on the Nexus 5000 are used for all peer links and uplinks to the Nexus 7000, then the loss of ASIC 0 will bring down all of those links at the same time. Even with VPC minimizing the loss, it is still not a desirable scenario.

Use “show hardware internal gatos all-ports” to see the port-to-ASIC mapping. The output below shows "gat" 0 as the ASIC number for e1/1-4; the next four ports map to the next ASIC, and so on.

Nexus-5010-sw1# show hardware internal gatos all-ports
process_cli_req:
Calling Hander:
Gatos Port Info:
name |log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag
-------+---+---+---+----+---+---+-------+---+---+----+---+--------+----
xgb1/4 |3 |0 |0 |b7 |en |up |0:0:0:f|0 |55 |0 |2 |1a003000|pass
xgb1/3 |2 |0 |1 |b7 |en |up |0:1:1:f|1 |54 |0 |0 |1a002000|pass
xgb1/1 |0 |0 |2 |b7 |en |up |1:2:2:f|2 |56 |0 |4 |1a000000|pass
xgb1/2 |1 |0 |3 |b7 |en |up |1:3:3:f|3 |57 |0 |6 |1a001000|pass
xgb1/8 |7 |1 |0 |b7 |en |up |0:0:0:f|0 |50 |1 |2 |1a007000|pass
xgb1/7 |6 |1 |1 |b7 |en |up |0:1:1:f|1 |51 |1 |0 |1a006000|pass
xgb1/5 |4 |1 |2 |b7 |en |up |1:2:2:f|2 |53 |1 |4 |1a004000|pass
xgb1/6 |5 |1 |3 |b7 |en |up |1:3:3:f|3 |52 |1 |6 |1a005000|pass
xgb1/19|18 |2 |0 |b7 |en |up |0:0:0:f|0 |45 |4 |0 |1a012000|pass
xgb1/11|10 |2 |1 |b7 |dis|dn |0:1:1:f|1 |44 |2 |0 |1a00a000|pass
xgb1/20|19 |2 |2 |b7 |en |up |1:2:2:f|2 |49 |4 |2 |1a013000|pass
xgb1/12|11 |2 |3 |b7 |dis|dn |1:3:3:f|3 |48 |2 |2 |1a00b000|pass
xgb1/18|17 |3 |0 |b7 |en |up |0:0:0:f|0 |41 |4 |6 |1a011000|pass
xgb1/10|9 |3 |1 |b7 |en |up |0:1:1:f|1 |40 |2 |6 |1a009000|pass
xgb1/17|16 |3 |2 |b7 |en |up |1:2:2:f|2 |46 |4 |4 |1a010000|pass
xgb1/9 |8 |3 |3 |b7 |en |up |1:3:3:f|3 |47 |2 |4 |1a008000|pass
xgb1/15|14 |4 |1 |b7 |en |up |0:1:1:f|1 |36 |3 |0 |1a00e000|pass
xgb1/16|15 |4 |3 |b7 |en |up |1:3:3:f|3 |42 |3 |2 |1a00f000|pass
sup0 |20 |5 |0 |b7 |en |dn |0:0:0:0|0 |33 |0 |0 |15020000|pass
sup1 |21 |5 |1 |b7 |en |dn |0:0:0:1|0 |33 |0 |0 |15010000|pass
xgb1/14|13 |5 |1 |b7 |en |up |0:1:1:f|1 |32 |3 |6 |1a00d000|pass
xgb1/13|12 |5 |3 |b7 |en |up |1:3:3:f|3 |38 |3 |4 |1a00c000|pass
mfc2/5 |4 |6 |0 |b7 |dis|dn |0:0:0:0|0 |31 |6 |0 |01084000|pass
mfc2/6 |5 |6 |1 |b7 |dis|dn |0:0:0:1|0 |31 |6 |1 |01085000|pass
mfc2/7 |6 |6 |2 |b7 |dis|dn |0:1:0:2|1 |30 |6 |2 |01086000|pass
mfc2/8 |7 |6 |3 |b7 |dis|dn |0:1:0:3|1 |30 |6 |3 |01087000|pass
mfc2/4 |3 |6 |4 |b7 |dis|dn |1:2:2:0|2 |35 |5 |6 |01083000|pass
mfc2/3 |2 |6 |5 |b7 |dis|dn |1:2:2:1|2 |35 |5 |4 |01082000|pass
mfc2/2 |1 |6 |6 |b7 |dis|dn |1:3:2:2|3 |34 |5 |2 |01081000|pass
mfc2/1 |0 |6 |7 |b7 |dis|dn |1:3:2:3|3 |34 |5 |0 |01080000|pass

If you use dual physical ports for the peer link, definitely use two ports that sit on different ASICs.
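
For example, here is a sketch of a peer link built from e1/1 (ASIC 0) and e1/5 (ASIC 1) per the table above; the channel number is a placeholder:

! e1/1 and e1/5 sit on different ASICs (gat 0 and gat 1)
interface ethernet 1/1, ethernet 1/5
  channel-group 10 mode active
interface port-channel 10
  switchport mode trunk
  vpc peer-link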

Why “gatos”? Here is a little humor uncovered by Colin McNamara you might enjoy.

Thursday, August 19, 2010

Nexus7000 OSPF failure due to MFDM crash – still searching for root cause

This occurred a while ago and we are still waiting for word. Unfortunately the condition has since cleared (it is a production network), so I am just wondering if anybody else out there has a possible clue about the root cause.

The most noticeable symptom was an OSPF adjacency problem with no error messages. OSPF was able to establish at least partial adjacency with some neighbors, but not with others.

Thinking the OSPF process had gone bad, we restarted it, to no avail. The OSPF process ran fine with no errors, but the adjacency trouble remained. A further review of the logs revealed that a process known as MFDM (Multicast FIB Distribution) had crashed, attempted to restart three times, and never recovered. OSPF adjacency depends on multicast forwarding (hellos are sent to 224.0.0.5), so a broken multicast FIB fits the symptom.

2010 Aug 2 17:29:57 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" (PID 16377) hasn't caught signal 11 (core will be saved).
2010 Aug 2 17:30:03 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" (PID 16471) hasn't caught signal 11 (core will be saved).
2010 Aug 2 17:30:03 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" (PID 16524) hasn't caught signal 11 (core will be saved).

We could not recover MFDM individually and ended up reloading the VDC to recover it. Magically, OSPF started working again!

It is still an open ticket; no root cause has been identified. Part of the difficulty is that we seem to have lost some of the files, which makes it hard for the vendor to trace. Just wondering if there is any similar experience out there.

And could a hardware or ASIC-related failure be causing the MFDM crash? Or is it more likely a software bug?

Thanks for your thoughts and comments.

Saturday, August 7, 2010

Nexus7000/5000/1000v – A closer look at VPC and port channel load balancing

Port channels are a great redundancy and load-sharing feature in data centers. Cisco Nexus takes it one step further with Virtual Port Channel (VPC). There is plenty of good documentation about VPC, including the downloadable design guide.
Sometimes you need to look under the covers and see exactly how each physical port is utilized by a port channel. This note highlights a few useful commands.

Strictly speaking, a port channel does load sharing, not load balancing, so there should be no expectation of a 50/50 split. How well load sharing works largely depends on the traffic and the hashing method selected. For example, the Nexus 1000v supports 17 hashing algorithms to load-share traffic across the physical interfaces in a port channel, including source-based and flow-based hashing. The default is source-mac.
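
For instance, a flow-based hash could be selected globally on the VSM with a single command; this is just a sketch, and the exact keyword list varies by platform and release:

! global command on the Nexus 1000v VSM; keyword availability varies by release
port-channel load-balance ethernet source-dest-ip-port

Verify the change with "show port-channel load-balance".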

Another important concept: hashing is unidirectional, determined by the sending party, so there is no guarantee that load sharing will be symmetrical. For illustration, the Nexus 7k and 5k are connected in what is known as “back-to-back” VPCs (port channel 75). The Nexus 7k has multiple physical connections southbound on the same logical channel. It determines which physical port to send traffic on based on its local hashing, as illustrated by the green arrow.


Here is the first command, which shows the hashing algorithm in use:
show port-channel load-balance
N-7010-1# sh port-c load
Port Channel Load-Balancing Configuration:

System: source-dest-ip-vlan
Port Channel Load-Balancing Addresses Used Per-Protocol:
Non-IP: source-dest-mac
IP: source-dest-ip-vlan

The second command shows how well load sharing is working on your port channels. Note the statistics are cumulative and are reset by clearing the corresponding interface counters.
show port-channel traffic

N-7010-1# show port-c traffic
ChanId Port Rx-Ucst Tx-Ucst Rx-Mcst Tx-Mcst Rx-Bcst Tx-Bcst
------ --------- ------- ------- ------- ------- ------- -------
75 Eth1/5 53.16% 42.58% 49.91% 53.70% 44.03% 44.86%
75 Eth1/6 46.83% 57.41% 50.08% 46.29% 55.96% 55.13%

As in the diagram, the orange arrow indicates Nexus 5k northbound load sharing based on its hashing algorithm. The blue arrow indicates Nexus 1kv northbound load sharing based on its hashing.

Since Netflow is not yet supported on the Nexus 5k, how do we tell which physical interface a certain flow will take? Here is the third command:

show port-channel load-balance forwarding-path interface ...

The following example shows how different IP address pairs yield different outgoing ports.
N-5010-1# show port-channel load-balance forwarding-path interface port-channel 75 vlan 25 src-ip 10.17.19.15 dst-ip 10.17.21.15
Missing params will be substituted by 0's.
Load-balance Algorithm on switch: source-dest-ip
crc8_hash: 90 Outgoing port id: Ethernet1/3
Param(s) used to calculate load-balance:
dst-ip: 10.174.21.15
src-ip: 10.174.19.15
dst-mac: 0000.0000.0000
src-mac: 0000.0000.0000

N-5010-sw1# show port-channel load-balance forwarding-path interface port-channel 75 vlan 25 src-ip 10.17.19.10 dst-ip 10.17.34.15
Missing params will be substituted by 0's.
Load-balance Algorithm on switch: source-dest-ip
crc8_hash: 179 Outgoing port id: Ethernet1/4
Param(s) used to calculate load-balance:
dst-ip: 10.174.34.15
src-ip: 10.174.19.10
dst-mac: 0000.0000.0000
src-mac: 0000.0000.0000

If you are running a test and notice that the traffic is not well balanced, you now have a method to check the physical port allocation for your particular endpoints. You also have the option of experimenting with different hashing algorithms to suit your needs.

Sunday, July 25, 2010

Nexus1000v – when to use “system mtu”

I can’t be the first one confused about jumbo frames, mtu, and system mtu on the Nexus 1000v. After reading some excellent posts, all signs indicated that “system mtu” was designed to solve the “chicken and egg” problem of running the VSM on IP storage.

Like "system vlan", “system mtu” applies to system uplink profile only. So if VSM is not even running under VEM (it runs on vSwitch), there is no need to set “system mtu”, right?

Well, not quite. It turns out “system mtu” is still needed to preserve the connection to the VEM. Assuming jumbo frames are used (for storage, as an example), a reboot of ESX reverts the physical NIC to the default MTU (1500), which results in a mismatched MTU between the physical NIC and the virtual NIC, and loss of connectivity. “system mtu” preserves the setting on the physical NIC and thus prevents the VEM from disappearing.

To further clarify, here is an example of configuring a jumbo MTU of 9000 on the Nexus 1000v:
1. “system jumbomtu 9000” (global)
2. “system mtu 9000” (uplink port profile)

That is all. Note that once set, “system mtu” overrides “mtu”, so there is no need to set the interface mtu explicitly.
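
Putting the two pieces together, here is a minimal sketch of an uplink port profile; the profile name and VLAN numbers are placeholders:

! profile name and VLANs are placeholders
system jumbomtu 9000
!
port-profile type ethernet system-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 10-20
  system mtu 9000
  no shutdown
  system vlan 10
  state enabled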

A couple of potentially confusing things:
-The show commands on the Nexus 1000v are not exactly accurate for MTU; a fix is supposed to be coming
-There is an error in the Cisco Nexus command reference, which states “The value that is configured for the system mtu command must be less than the value configured in the system jumbomtu command”. It should be “less than or equal to”; there is no reason to set system mtu to 8998 unless the hardware dictates it.

I hope that clears up some of the confusion. If you notice any behavior inconsistent with this understanding, please kindly let me know.

Wednesday, July 21, 2010

Nexus 1000v bug - Three ways to check VSM HA health

If you are using the Nexus 1000v, more than likely you have set up VSM high availability. To ensure system stability and gain the true benefit of HA, check whether VSM HA is truly synchronized; don’t stop just because you see two supervisor modules. Otherwise you may be caught at the worst time, during a failure, only to find out there is no HA and that you have to deal with a bug, or even configuration loss.

Specifically, check after the initial setup and after system operations such as a VSM reload. There is a bug, CSCtg46327, fixed in 4.0(4)SV1(3a), that prevents the active and standby VSMs from synchronizing; the standby keeps coming up and going down in its attempts to do so.

You can check for this bug and other potential VSM HA problems using these methods:

1. “show module”: module 2 should show a Virtual Supervisor Module in “ha-standby” state. In output affected by the bug, it shows “powered-up” instead:

vsm# sh module
Mod Ports Module-Type Model Status
--- ----- -------------------------------- ------------------ ------------
1 0 Virtual Supervisor Module Nexus1000V active *
2 0 Supervisor/Fabric-1 powered-up
5 248 Virtual Ethernet Module NA ok
6 248 Virtual Ethernet Module NA ok
7 248 Virtual Ethernet Module NA ok
...

2. “show svs neighbors”: the standby VSM MAC is listed as type “VEM”, which is incorrect; it should be type VSM:
vsm# sh svs nei
Active Domain ID: 91
AIPC Interface MAC: 0050-56b4-52bb
Inband Interface MAC: 0050-56b4-3fc1


Src MAC Type Domain-id Node-id Last learnt (Sec. ago)
------------------------------------------------------------------------
0050-56b4-5eb5 VEM 0 ffffffff 0.00
0002-3d43-8504 VEM 901 0502 160591.40
0002-3d43-8505 VEM 901 0602 160591.30
0002-3d43-8506 VEM 901 0702 160230.70

3. “show system redundancy status”: the operational redundancy mode shows “None”, when it should be HA:

vsm# sh sys red stat
Redundancy role
---------------
administrative: primary
operational: primary


Redundancy mode
---------------
administrative: HA
operational: None


This supervisor (sup-1)
-----------------------
Redundancy state: Active
Supervisor state: Active
Internal state: Active with warm standby


Other supervisor (sup-2)
------------------------
Redundancy state: Standby
Supervisor state: HA standby
Internal state: HA standby

Tuesday, July 20, 2010

Nexus1000v bug – widespread VM intermittent connectivity and multicast failure

If you are experiencing widespread, constantly changing, totally unpredictable VM connectivity issues, you have likely hit this bug. You will notice that ping works intermittently, and all of your regular troubleshooting will yield no particular results, because the behavior is not consistent.

I have benefited tremendously from many posts out there; I hope this one will save some of you many hours of frustration.

First, verify your health (I mean that of your Nexus 1000v, although at that point your own health is probably not very good either). Issue this command on the VSM: ’module vem # execute vemcmd show dr’. The DR (Designated Receiver) must be associated with the uplink, not with any of the vmnics.

The following shows the “good” output. You have hit the bug if the DR points to any of the vmnics for any VLAN you have.

BD 2899, vdc 1, vlan 2899, 3 ports, DR 304, multi_uplinks TRUE
Portlist:
20 vmnic4
22 vmnic6
304 DR

The root cause is a Nexus 1000v programming error that misplaces the DR, which is used for multicast and broadcast traffic. As the problem comes and goes, lost ARPs cause widespread, inconsistent behavior throughout the network. You may notice that ping works from the Nexus 7000 but not from the Nexus 5000, or works from one device but not from another; you get the idea.

The temporary recovery procedure is to reset the physical ports associated with the VEM with “shut” and “no shut”. Check with ’module vem # execute vemcmd show dr’ again to confirm the symptom is corrected.

The bug is classified as “unpublished”, with its details hidden. The fix is forecast to be in the upcoming patch release, which should arrive soon.

One last thing: you may not want to rush the “shut” and “no shut”. Give it at least 10 seconds in between to avoid another bug. More on this later.

Tuesday, July 13, 2010

Nexus 1000v bug - avoid LACP problems by using mode active

If you are using Nexus 5000 VPC as the method for host connection to the Nexus 1000v, you may want to watch out for a current issue. Last week at Networkers, the best practice was described as using mode active on the Nexus 5000 and mode passive on the Nexus 1000v, letting the Nexus 5000 initiate the LACP port channel.

However, there is a reported error condition, triggered by something like a VSM reload, which effectively puts certain ports into "suspended" mode, as shown below:

vsm-1(config)# sh port-channel summary
...
5 Po5(SU) Eth LACP Eth5/5(s) Eth5/7(P)

Another symptom is "show CDP" does not match between 1000v and 5000, with only one side sending hellos.

Using a Cisco internal command shows more detailed event information:
vsm-1(config)# sh port-c internal event-history errors
1) Event:E_DEBUG, length:162, at 710664 usecs after Fri Jul 9 20:38:39 2010
[102] pcm_proc_response(373): Ethernet5/5 (0x1a040400): Setting response status to 0x402b000c (port not compatible) for MTS_OPC_ETHPM_PORT_BRINGUP (61442) response

2) Event:E_DEBUG, length:84, at 710191 usecs after Fri Jul 9 20:38:39 2010
[102] pcm_eth_port_ac_drop_all_txns(798): Interface Ethernet5/5 suspended by protocol


It will probably be fixed in a future Nexus 1000v release. The current workaround? Set the LACP mode to "active" on the Nexus 1000v as well:

port-profile type ethernet systemuplink_profile
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan all
  channel-group auto mode active
  no shutdown
  system vlan ...
  state enabled

Thursday, June 10, 2010

Understand Bridge Assurance

Bridge assurance comes up often in Nexus troubleshooting, so it is important to understand its design and effects. What is BA? It is a Cisco STP enhancement designed to prevent loops by making sure that a neighboring switch does not malfunction and begin forwarding frames when it shouldn't. Configured incorrectly, BA will likely cause some headaches.

BA monitors the receipt of BPDUs on point-to-point links. When BPDUs stop being received, the port is put into a blocking state (actually a port-inconsistent state, which stops forwarding). This is typically seen with "show spanning-tree ..."

Now it makes sense to highlight the important characteristics of BA:
• It is enabled globally by default, but disabled by default on interfaces (the default port type is "normal")
• It is active only on STP “network” type interfaces
• For it to work, both ends of the link must run BA; otherwise, the BA side will block
• BA works only on point-to-point connections between Cisco switches

Here are two examples of BA troubleshooting:

1. If STP "network" type is used with a host VPC, the host side does not support bridge assurance, and we know Nexus 1000v does not even send BPDU, then turning on BA on Nexus 5000 will make the port go into “inconsistency” and blocking.

2. With Nexus 7000 and 5000 back-to-back VPC connections, it is important to set both sides to type “network”, thus enabling BA consistently (see the sketch below). Otherwise, the port will also go into a blocking state due to the inconsistency.
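
As a rough sketch (interface numbers are placeholders), the port type is set per interface or port channel, and BA is active only when both ends are "network":

! On both the Nexus 7000 and 5000 sides of the back-to-back VPC
interface port-channel 75
  spanning-tree port type network
!
! On host-facing ports that never send BPDUs, do not use type network
interface port-channel 10
  spanning-tree port type edge trunk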

See a more complete picture of spanning tree design in a typical Nexus virtualized data center here.

Wednesday, June 9, 2010

OSPF EIGRP BGP dual point mutual redistribution - Part 2

In part 1 of this series, we solved the feedback on the OSPF side, but the issue in the opposite direction still exists. OSPF routes are redistributed into EIGRP on R1; R2 learns them from R1 via EIGRP and prefers that path (since we lowered the EIGRP external AD to fix issue 1). Because R2 now prefers the fed-back route, it will not redistribute these OSPF routes into EIGRP.

First, between R1 and R2, stop EIGRP from learning back the routes that were redistributed from OSPF (matched by tag):
router eigrp 1
  distance eigrp 90 100
  distribute-list route-map Deny_routes_from_OSPF in

route-map Deny_routes_from_OSPF deny 10
  match tag [ospf tags]
route-map Deny_routes_from_OSPF permit 20

As a best practice, also use the tag to stop feedback at the redistribution points themselves, by blocking OSPF-originated (tagged) routes from being redistributed from EIGRP back into OSPF:
route-map redist_EIGRP-to-OSPF deny 10
 match tag [ospf tags]
route-map redist_EIGRP-to-OSPF permit 20
...
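
For completeness, here is a sketch of where the tag would be set in the first place, on the OSPF-to-EIGRP redistribution; the tag value (25, as used in part 3), OSPF process number, and seed metric are placeholders:

route-map redist_OSPF-to-EIGRP permit 10
 ! mark these as OSPF-originated so they can be filtered later
 set tag 25
!
router eigrp 1
 redistribute ospf 1 metric 100000 100 255 1 1500 route-map redist_OSPF-to-EIGRP
!
router ospf 1
 redistribute eigrp 1 subnets route-map redist_EIGRP-to-OSPF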



What about routes learned from MPLS? Those are typically remote offices, and we want traffic to them to use the local MPLS exit point rather than the path redistributed from the other side.

Consider an MPLS remote network, 10.103.2.0/24. The EIGRP side learns it from BGP as an EIGRP external route; the OSPF side also learns it from BGP. Could the redistributing router prefer the EIGRP path (since we set a lower AD)? The router has two paths (left and right) out to the WAN to reach the remote site, and we want it to use the “local” MPLS exit point, which in this case is through OSPF.

This is where the EIGRP metric becomes important. Note that EIGRP must have at least a default metric for redistribution to happen. The redistribution router is learning the same route from both OSPF and the BGP side. At the point of redistribution, the metric assigned to routes redistributed from OSPF into EIGRP should be lower than the EIGRP metric of the path arriving the other way (typically achieved by setting a lower bandwidth on the WAN links). As a result, the redistribution router prefers the locally redistributed routes from OSPF.

Redistribution-rtr-1#sh ip eigrp top 10.103.2.0 255.255.255.0
EIGRP-IPv4 (AS 100): Topology default(0) entry for 10.103.2.0/24
State is Passive, Query origin flag is 1, 1 Successor(s), FD is 4352
Routing Descriptor Blocks:
168.147.152.161, from Redistributed, Send flag is 0x0
Composite metric is (4352/0), Route is External
Vector metric:
Minimum bandwidth is 625000 Kbit
Total delay is 10 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1500
Hop count is 0
External data:
Originating router is 10.250.248.7 (this system)
AS number of route is 1
External protocol is OSPF, external metric is 7
Administrator tag is 200 (0x000000C8)

But MPLS does cause an additional feedback issue, which we will cover in part 3.

Tuesday, June 8, 2010

enabling jumbo MTU on Nexus

In our data center, enabling jumbo MTU has resulted in significant throughput improvements. It should have been straightforward, but unfortunately it was not with Nexus. This note is meant to fill a gap where existing documentation may be lacking.

In a typical Nexus 7000/5000/1000v architecture, use these steps to enable jumbo MTU end to end, and then verify.

1. system jumbomtu
It defines the maximum MTU size for the switch and must be configured on ALL devices.
Note the maximum value depends on the hardware; for example, it is 9000 on Nexus 1000v and 9216 elsewhere.

2. On Nexus 5000, it must be configured with a system QoS policy
Note it cannot be configured on the interface! This is quite different from any other device I am aware of. To make it even more confusing, the command varies between releases.

For example, with 4.0.1:
policy-map jumbo
  class class-default
    mtu 9216
system qos
  service-policy jumbo

With 4.1 and 4.2 it becomes:
policy-map type network-qos jumbo
 class type network-qos class-default
  mtu 9216
system qos
 service-policy type network-qos jumbo

3. On Nexus 1000v, configure on the port channel
You will see some documentation showing the configuration on the physical interface, and you will be disappointed when you actually try it: no, it won’t take the command. You must configure it on the port channel! And then, magically, it will appear on the physical interface.

More detailed explanation on Nexus1000v is here.

4. On Nexus 7000, configure on the interface
Don’t forget “system jumbomtu” on ALL devices (see the sketch below).
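
A rough sketch of steps 3 and 4; the interface numbers are placeholders:

! Nexus 1000v: set the MTU on the uplink port channel
interface port-channel1
  mtu 9000
!
! Nexus 7000: set the MTU directly on the interface (or port channel)
interface Ethernet1/1
  mtu 9216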

Verification
So how do you verify? Assuming you have jumbo traffic, you will see the difference before and after with “show interface … counters detailed”:

Nexus5000# sh int e1/1 count detail
Ethernet1/1
Rx Packets: 95396987709
Rx Unicast Packets: 95396962128
Rx Multicast Packets: 21481
Rx Broadcast Packets: 4100
Rx Jumbo Packets: 89336029550
Rx Bytes: 136394308168545
Rx Packets from 0 to 64 bytes: 6028626455
Rx Packets from 65 to 127 bytes: 611554
Rx Packets from 128 to 255 bytes: 221439
Rx Packets from 256 to 511 bytes: 188679
Rx Packets from 512 to 1023 bytes: 579970
Rx Packets from 1024 to 1518 bytes: 30730062
Rx Trunk Packets: 95396966228
Tx Packets: 2613558
Tx Unicast Packets: 257688
Tx Multicast Packets: 2355702
Tx Broadcast Packets: 168
Tx Bytes: 202281759
Tx Packets from 0 to 64 bytes: 128120
Tx Packets from 65 to 127 bytes: 2472242
Tx Packets from 128 to 255 bytes: 8732
Tx Packets from 256 to 511 bytes: 4391
Tx Packets from 512 to 1023 bytes: 63
Tx Packets from 1024 to 1518 bytes: 10
Tx Trunk Packets: 2307778

Known Issues

Last but not least, be aware of some known issues (depending on the version):

CSCsl21529, Symptom: An incorrect MTU value is displayed in the show interface command output. The Cisco Nexus 5000 Series switch only supports class-based MTU. Per-interface level MTU configuration is not supported. The switch supports jumbo frames by default. However, the show interface command output currently displays an incorrect MTU value of 1500 bytes.
Also, the VEM may not retain the jumbo MTU setting after a reboot. Just something to watch and monitor.

Monday, June 7, 2010

OSPF EIGRP BGP dual point mutual redistribution - Part 1

It is still a common design to use an IGP in the enterprise core. For a large enterprise, the requirement to integrate OSPF with EIGRP may be the result of mergers and acquisitions.

Although it may look like a CCIE bootcamp lab, mutual redistribution can be even more challenging in a production environment, due to these factors:
-As will be shown, dual-router mutual redistribution can be very tricky
-"Backdoor" paths such as multiple MPLS WAN networks add more complexity
-Multiple data centers, and the need to route traffic differently in different failure scenarios

As I worked through the issues in a very large enterprise environment, we saw multiple issues set off a chain reaction, making the symptoms very difficult to diagnose. Sometimes fixing one issue introduces new ones. Without an absolutely crisp grasp of the design and a systematic approach, the chance of confusion is extremely high.

This is the first of a three-part series; I thought it would be worthwhile to share some basics and provide a logical breakdown of the interacting issues into separate, manageable pieces.


The much-simplified diagram shows dual mutual redistribution points (blue arrows). So what is the issue? With dual-router mutual redistribution, feedback can occur in both directions, resulting in inconsistency between the two redistribution points, sub-optimal routing, and even potential routing loops.
Issue 1: EIGRP->OSPF (feedback from OSPF)

This only applies to EIGRP external routes, which have a higher AD (170) than OSPF (110). These routes are redistributed into OSPF via R1. R2 learns them from R1 via OSPF and prefers that path. Therefore R2 will not redistribute these routes from EIGRP into OSPF, resulting in only one path being used. Any EIGRP external route that does not already exist in OSPF will show up on one of the routers as preferring OSPF due to the lower administrative distance, breaking EIGRP->OSPF redistribution on that router. This means all traffic will be directed to one side.

This issue is resolved by setting the EIGRP external AD lower than OSPF’s (distance eigrp 90 100).

Note there is a side effect: the redistribution router will always prefer the EIGRP path (due to the lower EIGRP AD). But within OSPF, routes redistributed from EIGRP are given a higher metric (set with the redistribution route map), so there is no risk of disturbing preference within OSPF.
Setting the OSPF external AD (distance ospf external 200) is a similar approach; a sketch of both options follows. Both solutions introduce issue 2, which we will cover in part 2.
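
A minimal sketch of the two options, applied on both redistribution routers; the process and AS numbers are placeholders:

! Option 1: lower the EIGRP external AD (170 by default) below OSPF's 110
router eigrp 100
 distance eigrp 90 100
!
! Option 2: raise the OSPF external AD above EIGRP external's 170
router ospf 1
 distance ospf external 200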