Thursday, August 19, 2010

Nexus7000 OSPF failure due to MFDM crash – still searching for root cause

This occurred a while ago, and we are still waiting for word from the vendor. Unfortunately the condition has since been cleared, because this is a production network and service had to be restored. I am just wondering if anybody else out there has a possible clue about the root cause.

The most noticeable symptom was an OSPF adjacency problem with no error messages. OSPF was able to establish at least a partial adjacency with some neighbors, but not with others.

Thinking the OSPF process had gone bad, we restarted it, but to no avail. The OSPF process ran fine with no errors, but the adjacency trouble remained. Further review of the logs revealed that a process known as MFDM (Multicast FIB Distribution?) had crashed; it attempted to restart three times but did not recover. OSPF sends its hellos to multicast addresses, so the adjacencies presumably depend on the Multicast FIB (MFIB).

2010 Aug 2 17:29:57 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" (PID 16377) hasn't caught signal 11 (core will be saved).
2010 Aug 2 17:30:03 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" (PID 16471) hasn't caught signal 11 (core will be saved).
2010 Aug 2 17:30:03 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" (PID 16524) hasn't caught signal 11 (core will be saved).
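
For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch that counts SYSMGR crash messages per service. It assumes the log lines look exactly like the three above; the function name `summarize_crashes` and the regular expression are my own illustration, not a Cisco tool. Signal 11 is SIGSEGV (a segmentation fault), and the changing PID on each line is consistent with the service being restarted and crashing again.

```python
import re
from collections import defaultdict

# Matches crash lines in the format shown above (assumed, based on these
# three samples only; other NX-OS releases may word the message differently).
CRASH_RE = re.compile(
    r'%SYSMGR-2-SERVICE_CRASHED: Service "(?P<service>[^"]+)" '
    r"\(PID (?P<pid>\d+)\) hasn't caught signal (?P<signal>\d+)"
)


def summarize_crashes(lines):
    """Return {service: [(pid, signal), ...]} for every crash line found."""
    crashes = defaultdict(list)
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            crashes[match["service"]].append(
                (int(match["pid"]), int(match["signal"]))
            )
    return dict(crashes)


# The three log lines from this post, used as sample input.
SAMPLE_LOG = [
    '2010 Aug 2 17:29:57 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" '
    "(PID 16377) hasn't caught signal 11 (core will be saved).",
    '2010 Aug 2 17:30:03 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" '
    "(PID 16471) hasn't caught signal 11 (core will be saved).",
    '2010 Aug 2 17:30:03 Nexus-7010 %SYSMGR-2-SERVICE_CRASHED: Service "mfdm" '
    "(PID 16524) hasn't caught signal 11 (core will be saved).",
]

if __name__ == "__main__":
    for service, events in summarize_crashes(SAMPLE_LOG).items():
        pids = ", ".join(str(pid) for pid, _ in events)
        signals = sorted({sig for _, sig in events})
        print(f"{service}: {len(events)} crashes, PIDs {pids}, signals {signals}")
```

Run against a saved copy of the switch log (one line per list entry), a service that shows several crashes with different PIDs within a few seconds is a candidate for the same restart loop described here.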

We could not recover MFDM on its own and ended up reloading the VDC to recover it. Magically, OSPF started working again!

The ticket is still open, and no root cause has been identified. Part of the difficulty is that we seem to have lost some of the files, which makes it hard for the vendor to track the problem down. Just wondering if there is any similar experience out there?

And could a hardware or ASIC-related failure be causing the MFDM crash? Or is it more likely a software bug?

Thanks for your thoughts and comments.


  1. Just experienced same failure... Was there a resolution?

  2. John, other than us, I have only seen one other user experience an MFDM crash. In our case, it never happened again and was not reproducible. I tend to believe that an unusual sequence of events (which is the case in our lab due to heavy testing) led to the crash, and that it is not likely to happen in production.

    I also hope the upcoming upgrade to 5.x may address this better. Something to keep an eye on.