Saturday, September 24, 2011

A potential problem with Juniper’s implementation of OSPF Router ID


First of all, this non-compliant behavior is observed only on some Juniper devices, not all.

The potential effect is that certain OSPF routes fail to propagate as expected.

Here is the scenario: the SRX originally used loopback address 10.0.0.11 as its Router ID. When that loopback was deleted, the Router ID changed to a different address, 10.0.17.4. So far, this is expected behavior.

srx-node0> show interfaces lo0.1
error: interface lo0.1 not found

srx-node0> show ospf overview instance VR   
Instance: VR
  Router ID: 10.0.17.4

However, a closer look at the SRX's OSPF database reveals that it still holds LSAs with 10.0.0.11 (the old Router ID) as their Advertising Router.

srx-node0> show ospf database summary instance VR

    OSPF database, Area 0.0.0.0
 Type       ID          Adv Rtr           Seq      Age  Opt  Cksum  Len
Summary  10.0.17.0    10.0.0.11      0x80000008  1423  0x22 0xc857  28
Summary *10.0.17.0    10.0.17.4      0x80000002   792  0x22 0x8794  28

    OSPF database, Area 0.0.0.69
 Type       ID          Adv Rtr           Seq      Age  Opt  Cksum  Len
Summary  10.0.16.3    10.0.0.11      0x8000001c  2280  0x22 0xfbfc  28
Summary  10.0.16.7    10.0.0.11      0x8000001c  2137  0x22 0xd321  28
Summary  10.0.16.8    10.0.0.11      0x8000001c  1994  0x22 0xbf35  28
Summary  10.0.17.16   10.0.0.11      0x8000001d  3423  0x22 0xfdfc  28
Summary  10.0.17.32   10.0.0.11      0x8000001d  1851  0x22 0xaf2e  28


According to RFC 2328:
If a router's OSPF Router ID is changed, the router's OSPF software should be restarted before the new Router ID takes effect.  In this case the router should flush its self-originated LSAs from the routing domain before restarting.

Note that two things should happen when the Router ID changes: 1) OSPF restarts; 2) the router flushes its self-originated LSAs. Because that did not happen on this JUNOS device, the old Router ID, an address that is no longer valid, is still the Advertising Router in those LSAs.
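For context, RFC 2328 flushes an LSA by setting its LS age to MaxAge (3600 seconds) and reflooding it. An LSA that is never flushed simply lingers until its age reaches MaxAge on its own. The rough sketch below takes the ages of the stale 10.0.0.11 entries from the output above and estimates how long each would remain in the database. It assumes nothing refreshes those LSAs in the meantime; the function and variable names are made up for illustration.

```python
MAX_AGE = 3600  # RFC 2328 architectural constant MaxAge, in seconds


def seconds_until_maxage(age):
    """Seconds left before an unrefreshed LSA reaches MaxAge (never negative)."""
    return max(0, MAX_AGE - age)


# LSA ID -> current age, copied from the stale 10.0.0.11 rows shown earlier.
stale_ages = {
    "10.0.17.0": 1423,
    "10.0.16.3": 2280,
    "10.0.16.7": 2137,
    "10.0.16.8": 1994,
    "10.0.17.16": 3423,
    "10.0.17.32": 1851,
}

# Assumption: no router refreshes these LSAs, so they only get older.
remaining = {lsa_id: seconds_until_maxage(age) for lsa_id, age in stale_ages.items()}

for lsa_id, secs in sorted(remaining.items(), key=lambda item: item[1]):
    print(f"{lsa_id:<12} ages out in about {secs} s")
```

In other words, without a flush the stale entries can sit in the database for up to an hour after the Router ID change.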

Why is the stale Advertising Router a problem? Because these LSAs will be flooded to neighbors (assuming the router here is an ABR). A neighbor will have noticed the change of Router ID, and it will therefore check the validity of the Advertising Router.

Again, here is the definition of Advertising Router according to RFC 2328:
This field indicates the Router ID of the router advertising the summary-LSA or AS-external-LSA that led to this path.
 
Since the neighbor sees that the Advertising Router (the old Router ID) no longer matches the new Router ID, it will discard the LSA.

When troubleshooting OSPF routing involving Juniper devices, check the OSPF database for invalid entries, that is, LSAs whose Advertising Router is no longer a valid Router ID.
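On a large database this check is tedious to do by eye. Below is a minimal sketch of how it could be scripted: it parses text in the same column layout as the output shown earlier and reports every LSA whose Advertising Router is not in a list of Router IDs you consider valid. The function name, the regular expression, and the idea of supplying the valid Router IDs by hand are all assumptions for illustration, not a Juniper tool. In a real network the valid set would have to include every router in the domain, not just the local one.

```python
import re

# Matches one LSA row, e.g.
# "Summary *10.0.17.0    10.0.17.4      0x80000002   792  0x22 0x8794  28"
# Header and "OSPF database, Area ..." lines do not match and are skipped.
LSA_ROW = re.compile(
    r"^\s*(?P<type>\S+)\s+\*?(?P<lsa_id>\d+\.\d+\.\d+\.\d+)\s+"
    r"(?P<adv_rtr>\d+\.\d+\.\d+\.\d+)\s+0x[0-9a-fA-F]+\s+(?P<age>\d+)\s"
)


def find_stale_lsas(cli_output, valid_router_ids):
    """Return (lsa_id, adv_rtr, age) for rows whose Adv Rtr is not a valid Router ID."""
    stale = []
    for line in cli_output.splitlines():
        match = LSA_ROW.match(line)
        if match and match.group("adv_rtr") not in valid_router_ids:
            stale.append(
                (match.group("lsa_id"), match.group("adv_rtr"), int(match.group("age")))
            )
    return stale


# Sample input: the database output from this post.
SAMPLE = """
    OSPF database, Area 0.0.0.0
 Type       ID          Adv Rtr           Seq      Age  Opt  Cksum  Len
Summary  10.0.17.0    10.0.0.11      0x80000008  1423  0x22 0xc857  28
Summary *10.0.17.0    10.0.17.4      0x80000002   792  0x22 0x8794  28

    OSPF database, Area 0.0.0.69
 Type       ID          Adv Rtr           Seq      Age  Opt  Cksum  Len
Summary  10.0.16.3    10.0.0.11      0x8000001c  2280  0x22 0xfbfc  28
Summary  10.0.16.7    10.0.0.11      0x8000001c  2137  0x22 0xd321  28
Summary  10.0.16.8    10.0.0.11      0x8000001c  1994  0x22 0xbf35  28
Summary  10.0.17.16   10.0.0.11      0x8000001d  3423  0x22 0xfdfc  28
Summary  10.0.17.32   10.0.0.11      0x8000001d  1851  0x22 0xaf2e  28
"""

# Assumption: 10.0.17.4 is the only Router ID that should appear here.
for lsa_id, adv_rtr, age in find_stale_lsas(SAMPLE, {"10.0.17.4"}):
    print(f"stale LSA {lsa_id} advertised by {adv_rtr}, age {age}")
```

Run against the sample, this flags every row advertised by 10.0.0.11 and leaves the 10.0.17.4 entry alone.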

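To avoid this pitfall, always configure the Router ID explicitly (in JUNOS, with the router-id statement under routing-options) rather than letting the router pick one. More importantly, base the Router ID on a loopback address, and make sure that loopback is not accidentally deleted.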

2 comments:

  1. I experienced something like this in our network.

    I know that it was OSPF that caused our whole ISP network to go down. We never found the root cause. Our OSPF database was HUGE! It completely overloaded our core devices.

    This made me think of that. Any idea if this issue could happen if you make changes on another vendor's router? (a week later)

    Thanks for sharing this info.

  2. Standards-based inter-vendor behavior usually works. But when an odd event occurs (in this case, a change of Router ID), the "hooks" each vendor builds in to handle such an event may not all be consistent.

    I don't know of any easier way to detect this, especially when you are dealing with a large network. Maybe focus on a few problem routes and examine the OSPF database. Figuring out one is the same as figuring them all out.
