Sunday, March 6, 2011

Cisco Flow Control with NetApp NAS

Update Nov 2012: Thanks for the comment, Paul. I have added an update with another look at flow control.

When NAS is used with virtualization, performance and throughput are potential concerns. Enabling jumbo frames has proven to be the most effective remedy. Flow control, on the other hand, is not something all parties easily agree on.
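For reference, how jumbo frames get enabled is platform-dependent, and the MTU has to match end to end across the switches, the ESX vSwitch/VMkernel ports, and the NAS interfaces. A rough sketch on the switch side (the interface and MTU value are illustrative):

switch(config)# interface ethernet 2/7
switch(config-if)# mtu 9216
! Note: the Nexus 5000 does not take a per-interface MTU;
! jumbo frames there are enabled through a network-qos policy instead.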

NetApp has a one-paragraph “best practice” on the subject. The recommendation is to set flow control to “receive on” on the switch port. In other words, allow the switch to receive “pause” frames from the NAS.
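On NX-OS, that configuration looks something like this (the interface number is illustrative; the show command verifies the flow control state and pause counters):

Nexus5k# configure terminal
Nexus5k(config)# interface ethernet 2/7
Nexus5k(config-if)# flowcontrol receive on
Nexus5k(config-if)# end
Nexus5k# show interface ethernet 2/7 flowcontrol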

As it turns out, NetApp can send a lot of “pause” frames. This is easily seen on the port channel or physical interfaces connected to the NAS:

Nexus5k# show interface e2/7
Ethernet2/7 is up
  30 seconds input rate 116486568 bits/sec, 2079 packets/sec
  30 seconds output rate 33970464 bits/sec, 792 packets/sec
  Load-Interval #2: 5 minute (300 seconds)
    input rate 57.99 Mbps, 1.58 Kpps; output rate 32.18 Mbps, 693 pps
  RX
    20604669084 unicast packets  176831374 multicast packets  0 broadcast packets
    20781500458 input packets  63305126825983 bytes
    8255225288 jumbo packets  0 storm suppression packets
    0 runts  0 giants  0 CRC  0 no buffer
    0 input error  0 short frame  0 overrun   0 underrun  0 ignored
    0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop
    0 input with dribble  0 input discard
    176409426 Rx pause
  TX
    12091810302 unicast packets  52908470 multicast packets  2650659 broadcast packets
    12147369431 output packets  48540927928599 bytes
    7298956759 jumbo packets
    0 output errors  0 collision  0 deferred  0 late collision
    0 lost carrier  0 no carrier  0 babble
    0 Tx pause
  0 interface resets

So what does it mean? Flow control is only meaningful if the receiving party acts on it. The expectation here is for the switch to slow down its transmission, since it is hearing the NAS say “slow down, I can’t keep up.”

According to Cisco, once pause is enabled and a pause frame is received on a switch egress port, the switch applies back pressure to the ingress port, where packets are eventually buffered. If pause is also enabled on the ingress port, the switch can in turn send pause frames to the upstream switch. In the ideal scenario, the pause eventually reaches the source of the transmission, the ESXi host, which then slows down.

However, taking a second look at the NetApp diagram here, the recommendation is to configure the endpoints, the ESX servers and the NetApp arrays, with flow control set to “send on” and “receive off.” In other words, ESX is not expected to receive flow control, which leaves only the network to absorb the pause frames.
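As a sketch, those endpoint settings would look something like the following; exact syntax varies by ESX and Data ONTAP version, and the NIC and interface names (vmnic0, e0a) are illustrative:

# On the ESX host, via ethtool: send pause frames, ignore received ones
ethtool -A vmnic0 autoneg off rx off tx on

# On the NetApp array (Data ONTAP 7-mode), per interface:
ifconfig e0a flowcontrol send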



Now let’s revisit the flow control scenario: the NAS sends pause frames and the switch receives them, but there is no point applying back pressure all the way to the host, since the host won’t act on it. The best a switch can do is buffer the traffic, or apply back pressure upstream and have an upstream switch buffer it somewhere.

How well that works really depends on the switch and linecard models; each has different capabilities and buffer sizes. In many cases, it is highly questionable how far back flow control propagates and whether it has any positive effect. In any case, you want to check:
  • The switch interface to the NAS, to see the number of pause frames received
  • All interfaces where NAS traffic flows, to see if there are drops
More clarification on this topic is probably required from the vendors, and device behavior will likely evolve as the technology advances. For now, it’s best to enable flow control on the switch side, but monitor network behavior closely.
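A few NX-OS commands help with that monitoring (the interface number is illustrative):

! Flow control state and pause counters for all ports:
Nexus5k# show interface flowcontrol
! Per-port detail; watch the "Rx pause" and input/output discard counters:
Nexus5k# show interface ethernet 2/7
! Queue drops on the interface, where the platform supports it:
Nexus5k# show queuing interface ethernet 2/7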

3 comments:

  1. Hi Sean,

    I have been investigating the same NetApp "best practices" and I think there may have been some recent changes. I found the following two references where NetApp suggests that on modern 10GbE infrastructure, flow control should be avoided.

    What are your thoughts on using flow control on 10GbE infrastructure? Should it change our approach?

    Paul


    TR-3802 – Ethernet Storage Best Practices

    Page 22

    CONGESTION MANAGEMENT WITH FLOW CONTROL
    Flow control mechanisms exist at many different OSI Layers including the TCP window, XON/XOFF, and FECN/BECN for Frame Relay. In an Ethernet context, L2 flow control was unable to be implemented until the introduction of full duplex links, because a half duplex link is unable to send and receive traffic simultaneously. 802.3X allows a device on a point-to-point connection experiencing congestion to send a PAUSE frame to temporarily pause the flow of data. A reserved and defined multicast MAC address of 01-80-C2-00-00-01 is used to send the PAUSE frames, which also includes the length of pause requested.

    In simple networks, this method can work well. However, with the introduction of larger and larger networks along with more advanced network equipment and software, technologies such as TCP windowing, increased switch buffering, and end-to-end QoS negate the need for simple flow control throughout the network.


    TR-3749 NetApp Storage Best Practices for VMware vSphere

    Page 25

    FLOW CONTROL

    Flow control is a low-level process for managing the rate of data transmission between two nodes to prevent a fast sender from overrunning a slow receiver. Flow control can be configured on ESX/ESXi servers, FAS storage arrays, and network switches. For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack. For older equipment, typically GbE with smaller buffers and weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp arrays with the flow control set to "send."

  2. Nice write-up, Sean! I actually stumbled upon this while doing some research on one of our tickets, where I was seeing incrementing Rx pause frames on the vPC interface of one of our NetApp filers.

  3. @runningvm. I am under the same impression that we should now not enable flow control on the NetApp. I gather that from the docs you noted, and from NetApp support.
