Sunday, March 31, 2013

Networking - Redefined

The networking industry as we know and love it is shaken. New concepts emerge almost daily, buzzwords abound, and various “visions” in their nascent forms mix reality with fantasy. Beneath the surface chaos, fundamental changes are taking shape.

Like many network professionals, I feel the need to navigate through the tide of confusion and grasp the essence of the change. Initially, as I absorbed a lot of information, I was easily confused and swayed one way or another. Let’s face it, most materials out there are vendor affiliated, and therefore inherently partial and biased. But over time, a clearer picture has emerged. However simple, it has given me consistency and continuity in my thought process. I hope it will help you establish your own framework as well, and chart your own course forward.

Virtual Networking – the beginning of change
Let’s start with why: why the change, and why now. To me, change is not about doing what networking already does in a different way. Fundamentally, networking enables communication and supports compute, which enables applications. Compute has gone through its own revolution, which is virtualization. Compute virtualization brought networking into the hypervisor environment, thus creating an overlap between two previously separate domains. This rudimentary form of virtual networking can be seen in the current generation of virtual switches.



In the current generation architecture, virtual switches mainly serve to provide virtual ports for VMs, while most of the features and security functions remain with the physical network. The advancement in compute virtualization has put more demands on networking: more segmentation to support multi-tenancy and security, and more agility to support provisioning in minutes rather than days. These are requirements that hardware based networking and security simply cannot keep up with. The catalyst of change is virtualization.

The Rise of Software Defined Networking
In the legacy model, virtualization is still closely coupled with network hardware. For traditional networking to be more agile, it needs to be “programmable”. Earlier, the OpenFlow architecture was proposed to be just that, but in reality hardware replacement was a non-starter. The Nicira model took a different approach. Rather than pushing programmability onto hardware, it decouples virtualization from traditional networking. A new form of virtual networking emerges around the hypervisor, mostly in the form of software. With emerging technology such as VXLAN, the virtual “edge” effectively becomes the new access layer, where much of the complexity such as segmentation, as well as future services, will reside. Traditional networking can be greatly simplified.

Thus the overlap between compute and networking has grown into a new layer. It has also become clear that this new layer on the edge is optimally positioned to deliver services such as load balancing, firewall and NAT.


There are two powerful differentiators that distinguish SDN:
  • Decoupling of virtual from physical. With new technology such as VXLAN, SDN provides an overlay network model which is mostly independent of the physical network. Decoupling makes it possible to instantiate VXLAN and deliver much of the cloud services without changing configurations on physical switches.
  • Central decision making. The controller has full knowledge of the virtualized networks. Its cloud level view is ideal for managing resources centrally.

Networking, Redefined
Just as networking exists to serve applications, SDN emerges to support data centers optimized for the cloud. In parallel with advancements in virtual switches (DVS, 1000v, Open vSwitch), a new class of cloud management systems (vCloud, OpenStack) is emerging. In order for SDN to be successful, it must be an integral part of the Software Defined Data Center, supporting service/platform/application packaging, rapid provisioning, and automated service deployment.

Virtual networking is the new playground. The lines between network and virtualization vendors have blurred, as well as those between network and compute domains. There is a new domain emerging. I call it puzzle solving at the data center level, putting all the pieces together, compute, network, storage, security, making them fit seamlessly.


Networking’s growth area is in virtualization, in software. In this emerging field, networking no longer runs on dedicated hardware and ASICs. At the host level, it shares processing with compute. At the data center level, the distributed architecture becomes more centralized, with the controller becoming the new “supervisor”.

To remain competitive in a hybrid cloud environment, organizations need to move forward to take advantage of the power and features of Software Defined Networking. IT architects need to unify network, virtualization and software at the cloud level. I’ll brainstorm some concrete steps a network engineer can take in an upcoming post.

Sunday, November 25, 2012

Another look at Flow Control in the cloud

I posted about Cisco Flow Control with NetApp NAS during the first round of cloud implementation almost two years ago. Paul commented about updates in the NetApp documentation, so now is a good time for an update, with a fresh look at the general use of flow control.

Let’s start with NetApp’s “Ethernet Storage Best Practices”, which recommends:
  1. Do not enable flow control throughout the network
  2. Set storage to “send” flow control, set switch to “receive”
We can all agree on the first point. 802.3x Ethernet Flow Control has not been widely adopted in practice, due to implementation complexity and hardware dependency. Higher layer mechanisms, such as TCP windowing, are more predictable and effective for end-to-end flow control.

So does the second point still make sense? Let’s revisit the basic concept of feedback in 802.3x: upon receiving a PAUSE frame, the sender responds by stopping transmission of any new packets until the receiver is ready to accept them again. NetApp here assumes that there is more buffer available on the switch side, so the receiver (storage) signals to the sender (switch) to hold packets in buffer until it is ready to receive more. In today’s high speed 10G based networks, the opposite is often true: the storage side is likely to have more buffer space than the switch side.

Cisco’s Nexus 5k is a typical access switch for connecting to storage. On the Nexus 5500 platform, the packet buffer per port is 640KB, an increase from 480KB on the previous generation Nexus 5000. At 1G/10G speeds, such a buffer size only allows the link to be paused for a short time, typically measured in microseconds. The question is, at what point does the storage receiver need to send PAUSE? Doing so unnecessarily, and pushing the bottleneck to the switch side, is unpredictable at best and may do more harm than good.
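
To put a rough number on "a short time", here is a minimal Python sketch. It assumes the only thing that matters is how long line-rate traffic takes to fill an empty per-port buffer, and that 1KB means 1024 bytes. The function name is made up for illustration, and real switches share and carve buffers in platform-specific ways, so treat the result as an upper bound at best.

```python
def max_pause_seconds(buffer_bytes: int, link_bps: float) -> float:
    """Time for traffic arriving at line rate to fill an empty buffer."""
    return buffer_bytes * 8 / link_bps


# 640KB per port is the Nexus 5500 figure quoted above (1KB = 1024 bytes assumed).
NEXUS_5500_BUFFER = 640 * 1024

for name, bps in (("1G", 1e9), ("10G", 10e9)):
    micros = max_pause_seconds(NEXUS_5500_BUFFER, bps) * 1e6
    print(f"{name}: buffer absorbs about {micros:.0f} microseconds of line-rate traffic")
```

Under these assumptions the buffer holds roughly half a millisecond of traffic at 10G, and about ten times that at 1G, before packets are dropped anyway.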

Another NetApp document, “NetApp storage best practices for VMware vSphere”, makes it clearer by stating “NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack”.

A new development with flow control is PFC. Approved in June 2011, Priority-based Flow Control (PFC) became a standard as 802.1Qbb, which extends the 802.3x PAUSE mechanism on point-to-point links by supporting multiple priorities. Designed to support “lossless” Ethernet (a key component of DCB), its implementation is highly vendor and hardware specific. For example, Cisco’s Nexus 5k has a unique buffer allocation implementation to support PFC. For now it is a special purpose technology, with its success largely dependent on that of FCoE.

I will summarize the revised recommendation as:
  • Turn off flow control unless explicitly requested by endpoint vendors
  • For FCoE and specialized lossless requirements, implement 802.1Qbb Priority-based Flow Control (PFC) only when necessary and when supported by hardware

Friday, October 19, 2012

Network based TCP MSS adjustment


Maximum Segment Size (MSS) is set by the end points during the initial TCP handshake. In special circumstances, a router can step in to alter the MSS.

Let’s look at such a scenario, in which two hosts communicate through an SSL tunnel. The end points see a path MTU of 1500 bytes and set the MSS accordingly (1460 bytes, once the 40 bytes of IP and TCP headers are subtracted). However, SSL adds extra overhead. Therefore, when a full-size 1500 byte packet arrives at the tunnel end point, it becomes a little larger. Furthermore, SSL often sets DF (Do not Fragment). Since the packet is now larger than 1500 bytes, with DF set, the router drops it. This results in communication failure between the hosts (while ping and traceroute appear to be working). An extended ping with varying packet sizes will verify this exact behavior.

How to get around this issue? Increase the MTU? Reduce the MSS set by the host and application? There is an easier method available in IOS 12.2(4)T and higher. Configured under the interface, the “ip tcp adjust-mss” command lets the router intervene and “adjust” the TCP MSS.

With the TCP adjustment option, the router examines each TCP SYN coming through the interface, and adjusts the MSS if necessary to ensure that it is no higher than the configured value. In other words, the router can lower the MSS to account for the extra tunnel overhead. All this happens transparently to applications. The end result is that the TCP session is set up with a slightly lower MSS than the application originally intended. Now packets with DF set will remain within the 1500 byte MTU even with tunnel overhead, and thus are transmitted across instead of being dropped.
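
The arithmetic behind the adjustment can be sketched in a few lines of Python. This is only an illustration of the idea, not how IOS implements it: the function names are made up, the headers are assumed to be plain IPv4 and TCP with no options, and the 60 byte tunnel overhead is a hypothetical figure (the real number depends on the tunnel and cipher in use).

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_ceiling(path_mtu: int, tunnel_overhead: int) -> int:
    """Largest MSS whose packets still fit the path MTU after encapsulation."""
    return path_mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER


def adjust_mss(syn_mss: int, configured_mss: int) -> int:
    """Lower the MSS seen in a SYN to the configured value; never raise it."""
    return min(syn_mss, configured_mss)


TUNNEL_OVERHEAD = 60  # hypothetical overhead in bytes, for illustration only

ceiling = mss_ceiling(1500, TUNNEL_OVERHEAD)
print(ceiling)                     # value one might configure on the interface
print(adjust_mss(1460, ceiling))   # a host advertising 1460 is clamped down
print(adjust_mss(1200, ceiling))   # a host already below the ceiling is untouched
```

The key design point is the `min()`: the router only ever lowers the MSS, so hosts that already advertise a smaller value are left alone.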

Friday, September 14, 2012

traceroute through MPLS


traceroute is often used as an effective analysis and troubleshooting tool. It is easily interpreted in a hop by hop routed network. Tracing packets through an MPLS network, however, requires a more in-depth understanding of the interworking between routing and tag switching.

The best place to start is the MPLS PE router. On the PE router, each customer’s VPN is represented by a vrf, in this case vrf “bigco”. Examining the routing table for the customer’s remote destination network (172.18.0.0), notice that its “next hop” is the remote PE (BGP RR address). It may seem counter-intuitive that a customer VPN route has a next hop in the global routing table (effectively leaping from one routing table to another), but this is precisely where MPLS does its magic.

A_PE1#sho ip route vrf bigco 172.18.0.0
Routing entry for 172.18.0.0/16
  Last update from 10.8.0.1 5d18h ago
  Routing Descriptor Blocks:
  * 10.8.0.1 (Default-IP-Routing-Table), from 172.18.127.141, 5d18h ago

Note in the above display, customer VPN has a routing next hop 10.8.0.1 which exists only in the global routing table.  “Under the hood”, when customer VPN traffic arrives at PE, it is tag switched (not routed) through the MPLS network.

  • Customer VPN destinations are learned from BGP peers (in this case 10.8.0.1 is BGP RR)
  • Note 10.8.0.1 is not in vrf “bigco”, rather it is global “Default-IP-Routing-Table”
  • How can a VPN route’s next hop be global? On the PE it is necessary: it is a special internal hook that makes the linkage between routing and tag switching. All VPN route next hops are PE peers at layer 3


Here is a command that clearly illustrates the linkage between PE next hop and tag switching of VPN routes: “show bgp vpnv4 unicast vrf … tag”. The “tag” option is hidden. Here it shows that the next hop for VPN traffic is a remote PE.

A_PE1#sh bgp vpnv4 uni vrf bigco tag
   Network          Next Hop      In tag/Out tag
   172.18.0.0      10.8.0.1      notag/15

In order to reach the remote PE, PE looks up its tag switching table. In this case, tag switching identifies 10.8.0.1 with a local tag of 78, and out tag of 34. Tag switching continues through the MPLS network, until it reaches the remote PE.
A_PE1#sh mpls forward
78     34          10.8.0.1/32     0             Gi0/1      10.8.0.162

The topology represents the simplest form of an MPLS network, which consists of P and PE routers. The sample VPN has a customer destination of 172.18.0.1. When a packet for that destination arrives at A_PE1, the routing table indicates its “next hop” as the remote B_PE1. To reach the remote next hop, the packet is tag switched through the MPLS network. The core routers (P) have no concept of VPN destinations; they are simply tag switching between PE destinations.

Traceroute, when interpreted correctly, provides a nice end to end view. Here it shows tag switching from the PE on. Note the inside tag (15) identifies the VPN destination and does not change during transport. The outside tags (48 and 43 in the output below) are tag switched through the MPLS network (P and PE). Once the packet gets to the remote PE, the inside tag (15) is popped and regular routing applies to the next hop (CE).
A_PE1#traceroute vrf bigco 172.18.0.1
  1 10.8.0.130 [MPLS: Labels 48/15 Exp 0] 128 msec 184 msec 216 msec
  2 10.9.32.226 [MPLS: Labels 43/15 Exp 0] 196 msec 232 msec 152 msec
  3 10.8.33.18 [MPLS: Label 15 Exp 0] 152 msec 88 msec 168 msec
  4 10.8.33.17 656 msec 704 msec 644 msec
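
For longer traces it can help to pull the label stacks out programmatically. The following Python sketch parses the traceroute lines shown above; the regular expression is an assumption based only on this output format (“Label” or “Labels”, followed by slash-separated values) and may need adjusting for other IOS versions.

```python
import re

# Matches "[MPLS: Labels 48/15 Exp 0]" as well as "[MPLS: Label 15 Exp 0]".
MPLS_RE = re.compile(r"\[MPLS: Labels? ([\d/]+) Exp \d+\]")


def label_stack(hop_line: str) -> list:
    """Return the label stack of one traceroute hop, outer label first."""
    match = MPLS_RE.search(hop_line)
    if match is None:
        return []  # no MPLS annotation: the hop was reached by plain IP routing
    return [int(label) for label in match.group(1).split("/")]


TRACE = """\
  1 10.8.0.130 [MPLS: Labels 48/15 Exp 0] 128 msec 184 msec 216 msec
  2 10.9.32.226 [MPLS: Labels 43/15 Exp 0] 196 msec 232 msec 152 msec
  3 10.8.33.18 [MPLS: Label 15 Exp 0] 152 msec 88 msec 168 msec
  4 10.8.33.17 656 msec 704 msec 644 msec
"""

stacks = [label_stack(line) for line in TRACE.splitlines()]
for hop, stack in enumerate(stacks, start=1):
    print(hop, stack if stack else "unlabeled (plain IP routing)")
```

Reading the parsed stacks top to bottom shows the pattern described in the text: the last element (the VPN label) stays constant while the first element changes hop by hop, until only the VPN label remains and finally none at all.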

Monday, September 3, 2012

Sorting out System MAC addresses with VPC and VSS – Part 2


Following Part 1, which starts with VPC on the Nexus platform, this post compares VSS on Catalyst side by side.

A simple and interesting topology can be used to illustrate. In this case, Nexus and Catalyst use different multichassis technology (VPC and VSS respectively), forming back to back virtual port channel. The effective logical topology becomes greatly simplified (shown on the right side), with benefits including utilization of full bisectional bandwidth, stable all forwarding STP, high resiliency, and ease of adding/removing physical members etc.


VSS Domain ID is very similar to VPC Domain ID. It is a unique identifier in the topology, which represents the logical virtual switch formed by two physical chassis. Only one VSS pair is associated with a particular domain.

Consequently, VSS Domain ID (1-255) is used in protocol negotiations, and therefore must be unique in the network. To illustrate, a pair of 6500s forms a VSS. Since VSS is a fully consolidated logical device, it operates as one device in the network. Therefore, a common system MAC is necessary to represent the VSS system, for uses such as STP and LACP. The system MAC must be unique and not tied to the physical devices.

As shown below, a VSS system MAC is derived from the combination of a predefined address (0200.0000.00xx), as well as VSS Domain ID. Since in this case Domain ID is 100, which is 64 in hex, it becomes the last octet.
6500-VSS# sh lacp sys-id
32768,0200.0000.0064
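
The derivation can be written out as a small Python sketch. It assumes only what is observed above: a fixed 0200.0000.00 prefix with the domain ID as the last octet in hex. The function name is made up for illustration, and this is a reading of the observed output rather than a documented Cisco algorithm.

```python
def vss_system_mac(domain_id: int) -> str:
    """Build the VSS system MAC as observed above: fixed prefix + domain ID in hex."""
    if not 1 <= domain_id <= 255:
        raise ValueError("VSS domain ID must be in the range 1-255")
    return f"0200.0000.00{domain_id:02x}"


print(vss_system_mac(100))  # 0200.0000.0064
```

Because the domain ID fits in a single octet, two VSS pairs collide on system MAC exactly when they share a domain ID, which is why the ID has to be unique in the network.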

The use of “0200.0000.00xx” may seem curious, since it is not assigned to any manufacturer. In this case, it is only used as a system identifier, and its uniqueness is assured by the uniqueness of the domain ID, therefore it is perfectly acceptable. But if another vendor were to adopt a similar scheme, potential problems could arise.

Another subtlety is the use of VSS and VPC domain IDs. Because VPC and VSS derive their system MACs from different MAC pools, the domain IDs can overlap in a common topology. This is another reason for Cisco to conserve assigned MAC addresses, so that future platforms and technologies can be developed.

Looking under the hood at the MAC level can be surprising. On the topic of conserving MACs, both Catalyst and Nexus use the same MAC for all SVI interfaces (“show interface vlan”). In other words, the MAC addresses on all VLAN interfaces are the same, even though the IP addresses are different.

In order to support the above, the switch maintains its CAM and MAC address table per VLAN. As shown in the display, MAC address 0026.8888.7ac2 is used for all SVI interfaces. The switch automatically creates a static MAC entry in each VLAN which points to the supervisor (MSFC), where per VLAN resolution occurs.

Nexus7k-1# sh mac address
G     -    0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 304     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 306     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 562     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 564     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 820     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 565     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 566     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 590     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 592     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 594     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 340     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 596     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 342     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
G 344     0026.8888.7ac2    static       -       F    F  sup-eth1(R)
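
The reason one MAC can appear many times without conflict is that the table is keyed by VLAN and MAC together. A toy Python model of that idea follows; it is a sketch only, the function names are made up, and the host MAC and port “Eth1/1” are hypothetical values added for contrast.

```python
# Toy MAC address table keyed by (vlan, mac), so the same MAC can exist per VLAN.
mac_table = {}


def learn(vlan: int, mac: str, port: str) -> None:
    """Install an entry for this MAC within one VLAN only."""
    mac_table[(vlan, mac)] = port


def lookup(vlan: int, mac: str):
    """Return the egress port for (vlan, mac), or None if unknown in that VLAN."""
    return mac_table.get((vlan, mac))


SVI_MAC = "0026.8888.7ac2"  # shared SVI MAC from the display above

for vlan in (304, 306, 562):
    learn(vlan, SVI_MAC, "sup-eth1(R)")   # static entries pointing to the supervisor
learn(304, "0000.1111.2222", "Eth1/1")    # hypothetical host, for contrast

print(lookup(304, SVI_MAC))  # entry exists in VLAN 304
print(lookup(999, SVI_MAC))  # no entry in a VLAN without an SVI
```

With a table keyed on MAC alone, the second SVI entry would overwrite the first; keying on the pair is what makes the shared SVI MAC workable.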

Hopefully, a look at system MAC has provided a glimpse into the inner workings of two important data center technologies.

Friday, August 31, 2012

Sorting out System MAC addresses with VPC and VSS – Part 1


A number of multichassis aggregation technologies are deployed in the data center today, for example, Cisco’s Multichassis EtherChannel (MEC) on Catalyst 6500 VSS, and Virtual Port Channel (vPC) on Nexus platforms. Inter-chassis aggregation greatly increases link utilization, while simplifying design by eliminating topology dependence on spanning tree protocol. STP becomes passive as most links are forwarding, and most failure scenarios no longer require STP re-convergence, thus minimizing disruptions. Furthermore, a more elegant data center design can be achieved, with lower operational complexity and higher return on investment.

A system MAC address exists on each individual device and is often used for device level negotiation, for example, in the bridge ID field of an STP BPDU, or as part of the LACP LAGID.

When multiple chassis operate in unison, software simulates the behavior of a common logical system, with the use of common virtual identifiers. Differentiating and sorting out the use of virtual system identifier and various MAC addresses is helpful for understanding, designing and deploying such systems.

It can be illustrated with a simple topology such as the one shown in the diagram, in which a pair of Nexus (in VPC domain 100) is connected to another pair (in VPC domain 101) on back to back VPCs.


The following display shows that each physical device has a system MAC address, and each pair of peer devices also has a common VPC system MAC address.

N7k pair: VPC domain 100
Nexus7k-1
Nexus7k-2
Nexus7k-1# sh vpc role
vPC Role status
----------------------------------------------------
vPC role                        : primary
Dual Active Detection Status    : 0
vPC system-mac                  : 00:23:04:ee:be:64
vPC system-priority             : 32667
vPC local system-mac            : 00:26:98:08:7a:c2
vPC local role-priority         : 100
Nexus7k-2# sh vpc role
vPC Role status
----------------------------------------------------
vPC role                        : secondary
Dual Active Detection Status    : 0
vPC system-mac                  : 00:23:04:ee:be:64
vPC system-priority             : 32667
vPC local system-mac            : 00:26:98:08:7c:c2
vPC local role-priority         : 110
Nexus7k-1# sh lacp system-identifier
32768,0-26-98-8-7a-c2
Nexus7k-2# sh lacp system-identifier
32768,0-26-98-8-7c-c2


N5k pair: VPC domain 101
Nexus5k-1
Nexus5k-2
Nexus5k-1# sh vpc role
vPC Role status
----------------------------------------------------
vPC role                        : primary
Dual Active Detection Status    : 0
vPC system-mac                  : 00:23:04:ee:be:65
vPC system-priority             : 32667
vPC local system-mac            : 00:05:9b:75:08:3c
vPC local role-priority         : 100
Nexus5k-2# sh vpc role
vPC Role status
----------------------------------------------------
vPC role                        : secondary
Dual Active Detection Status    : 0
vPC system-mac                  : 00:23:04:ee:be:65
vPC system-priority             : 32667
vPC local system-mac            : 00:05:9b:76:08:fc
vPC local role-priority         : 110
Nexus5k-1# sh lacp system-identifier
32768,0-5-9b-75-8-3c
Nexus5k-2# sh lacp system-identifier
32768,0-5-9b-76-8-fc

Note the “common” system MAC address is generated from a pre-defined set of MACs, with its last octet derived from the VPC domain ID. For the pair of 7k shown, domain 100 is 64 in hex. For the pair of 5k shown, domain 101 is 65 in hex. If both pairs happen to be using the same leading octets, then the last octet will ensure the uniqueness of VPC MAC. This is an example of why domain ID needs to be unique to each pair in the topology.

Visually, they look like this (shown for pair of 7k):

Note Local system MAC is used for communication among the pair:
Nexus7k-1# sh lacp port-channel
port-channel1
  System Mac=0-26-98-8-7a-c2
  Local System Identifier=0x8000,0-26-98-8-7a-c2
  Admin key=0x9
  Operational key=0x9
  Partner System Identifier=0x8000,0-26-98-8-7c-c2
  Operational key=0x9
  Max delay=0
  Aggregate or individual=1
  Member Port List=1,17
port-channel5
  System Mac=0-26-98-8-7a-c2
  Local System Identifier=0x8000,0-26-98-8-7a-c2
  Admin key=0x31
  Operational key=0x31
  Partner System Identifier=0x8000,0-26-98-8-7c-c2
  Operational key=0x31
  Max delay=0
  Aggregate or individual=1
  Member Port List=10,18

VPC system MAC is used for communication over VPC, for example, in this case LACP negotiation between pair of 7k and pair of 5k.

LACP uses its local system MAC for a local port channel (normal behavior); but in order for VPC to work, it must use “common” system MAC for VPC negotiation. The following command can be misleading as it only shows the local system MAC.
Nexus7k-1# sh lacp system-identifier
32768,0-26-98-8-7a-c2

It is more evident when looking from the remote side, that common MAC is used as LACP identifier on VPC, instead of local system MAC.

Nexus7k-1# sh lacp port-channel
port-channel101
  System Mac=0-26-98-8-7a-c2
  Local System Identifier=0x8000,0-26-98-8-7a-c2
  Admin key=0x8065
  Operational key=0x8065
  Partner System Identifier=0x7f9b,0-23-4-ee-be-65
  Operational key=0x8065
  Max delay=0
  Aggregate or individual=1
  Member Port List=17

Nexus5k-1# sh lacp nei int po101
Flags:  S - Device is sending Slow LACPDUs F - Device is sending Fast LACPDUs
        A - Device is in Active mode       P - Device is in Passive mode
port-channel101 neighbors
Partner's information
            Partner                Partner                     Partner
Port        System ID              Port Number     Age         Flags
Eth1/3      32667,0-23-4-ee-be-64  0x311           10761537    SA

            LACP Partner           Partner                     Partner
            Port Priority          Oper Key                    Port State
            32768                  0x8065                      0x3d

Note that the common system MAC only identifies the VPC system formed by the pair, not an individual port channel or interface. For tracing and troubleshooting, it is not to be confused with the MAC addresses on port channels and physical interfaces.
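
When comparing these displays it is easy to miss that the two sides print the same identifier differently (hex versus decimal priority, dashes with leading zeros dropped). The Python sketch below normalizes both forms. The function name is made up, and the last line relies on the observation from this post that the last octet of the vPC system MAC equals the domain ID, which is an assumption drawn from these outputs rather than a documented rule.

```python
def parse_lacp_system_id(text: str):
    """Split 'priority,mac' as printed by the LACP show commands above.

    Returns (priority as int, MAC as colon-separated, zero-padded hex).
    """
    priority_text, mac_text = text.split(",")
    priority = int(priority_text, 0)  # base 0 accepts "0x7f9b" and "32667" alike
    octets = [int(part, 16) for part in mac_text.split("-")]
    mac = ":".join(f"{octet:02x}" for octet in octets)
    return priority, mac


# Partner ID as seen from Nexus7k-1, and as seen from Nexus5k-1.
print(parse_lacp_system_id("0x7f9b,0-23-4-ee-be-65"))
print(parse_lacp_system_id("32667,0-23-4-ee-be-64"))

# Assumption from this post: the last octet of the vPC system MAC is the domain ID.
priority, mac = parse_lacp_system_id("0x7f9b,0-23-4-ee-be-65")
print(int(mac.split(":")[-1], 16))
```

Normalized this way, the partner priority 0x7f9b is the same 32667 shown as vPC system-priority, and both partner MACs are the vPC system MACs of the opposite pair rather than any local system MAC.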

Also note VPC, just like other port channels, uses the first physical member’s MAC address as port channel MAC address. But the same is not true for port channel between the switch pair.