Tuesday, July 20, 2010

Nexus1000v bug – widespread VM intermittent connectivity and multicast failure

If you are experiencing widespread, constantly changing, totally unpredictable VM connectivity issues, you have likely hit this bug. Ping works intermittently, and your regular troubleshooting yields nothing conclusive, because the behavior is not consistent.

I have benefited tremendously from many posts out there, and I hope this one will save some of you many hours of frustration.

First, verify your health (I mean that of your Nexus1000v, although at this point your own health is probably not very good either). On the VSM, issue the command ’module vem # execute vemcmd show dr’, where # is the VEM module number. The DR (Designated Receiver) must be associated with the uplinks; it must not be associated with any of the vmnics.

The following shows the “good” output. You have hit the bug if the DR points to any of the vmnics for any VLAN you have.

BD 2899, vdc 1, vlan 2899, 3 ports, DR 304, multi_uplinks TRUE
Portlist:
20 vmnic4
22 vmnic6
304 DR
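
If you have many VEMs and VLANs to check, the comparison can be scripted. The following is only a rough sketch: it assumes the output always has the shape of the sample above, and it assumes that in the "bad" case the DR number in the BD header matches the number of a vmnic entry in the Portlist. The post does not show the bad output, so treat that rule, and the function name find_misplaced_dr, as my own assumptions.

```python
import re

# Sample copied from the "good" output shown in the post.
SAMPLE = """\
BD 2899, vdc 1, vlan 2899, 3 ports, DR 304, multi_uplinks TRUE
Portlist:
20 vmnic4
22 vmnic6
304 DR
"""

# Header line of one bridge domain: captures the VLAN ID and the DR number.
HEADER = re.compile(r"^BD\s+\d+,.*?\bvlan\s+(\d+),.*?\bDR\s+(\d+)")
# Portlist entry: a port number followed by a port name.
PORT = re.compile(r"^\s*(\d+)\s+(\S+)\s*$")


def find_misplaced_dr(text):
    """Return (vlan, dr, port_name) for each VLAN whose DR number
    matches a vmnic entry (assumed signature of the bug)."""
    problems = []
    vlan = dr = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            vlan, dr = int(header.group(1)), int(header.group(2))
            continue
        port = PORT.match(line)
        if port and vlan is not None:
            number, name = int(port.group(1)), port.group(2)
            if number == dr and name.startswith("vmnic"):
                problems.append((vlan, dr, name))
    return problems


if __name__ == "__main__":
    # The good sample should report no misplaced DR.
    print(find_misplaced_dr(SAMPLE))
```

Paste the captured output of the command into the function for each module; an empty list means nothing matched the assumed bad pattern, not that the VEM is guaranteed healthy.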

The root cause is a Nexus 1000v programming error that misplaces the DR, which is used for multicast and broadcast traffic. As the problem comes and goes, lost ARP traffic causes widespread, inconsistent behavior throughout the network. You may notice that ping works from a Nexus 7000 but not from a Nexus 5000, or works from one device but not from another; you get the idea.

The temporary recovery procedure is to reset the physical ports associated with the VEM with “shut” and “no shut”. Run ’module vem # execute vemcmd show dr’ again to confirm that the symptoms are corrected.

The bug is listed as “unpublished”, with its details hidden. The fix is forecast for the upcoming patch release, which should be out soon.

One last thing: do not rush the “shut” and “no shut”. Wait at least 10 seconds in between, to avoid another bug. More on this later.
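
To make the pause hard to forget, the port bounce can be written down as an ordered list of steps. This is a sketch only: the interface name Ethernet3/2 is made up, the helper names bounce_steps and run are mine, and the default send=print merely prints each CLI line. Swap in whatever session you actually use to reach the switch.

```python
import time

# The post advises at least 10 seconds between "shut" and "no shut".
WAIT_SECONDS = 10


def bounce_steps(interfaces, wait_seconds=WAIT_SECONDS):
    """Build ("send", cli_line) and ("wait", seconds) steps that
    shut each interface, pause, then bring it back up."""
    steps = [("send", "configure terminal")]
    for ifname in interfaces:
        steps.append(("send", f"interface {ifname}"))
        steps.append(("send", "shutdown"))
        steps.append(("wait", wait_seconds))
        steps.append(("send", "no shutdown"))
    steps.append(("send", "end"))
    return steps


def run(steps, send=print, sleep=time.sleep):
    """Execute the steps; send and sleep are injectable placeholders."""
    for kind, value in steps:
        if kind == "send":
            send(value)
        else:
            sleep(value)


if __name__ == "__main__":
    # Hypothetical interface name; list the steps without executing them.
    for step in bounce_steps(["Ethernet3/2"]):
        print(step)
```

Keeping the wait as an explicit step means the sequence can be reviewed before anything is sent to the switch.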

3 comments:

  1. Sean,

    If you're referring to CSCte96034 "MAC move packet must be sent to broadcast address", this bug has been fixed in our latest release (4.0.4.SV1.3).

    Info on this bug can be looked up in the bugtoolkit -> http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCte96034

    If this isn't the bug you're referring to, let me know the ID and I'll see what further detail I can provide.

    Robert

  2. God bless you... I had been looking for days and had yet to find a post that in any way helped confirm the issues I was working out.

  3. This bug has been documented as CSCtg72137. Robert, I don't think it's the same as CSCte96034. I also think it should have been classified as "severe", because it affected just about any N1kv VLAN.
    http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCtg72137

    The bug has been fixed in 4.0(4)SV1(3a). Believe me, it was one of the hardest to troubleshoot. Unless you really enjoy your evenings with Nexus1000v, please upgrade without delay!
