On the Premature Death of Spanning Tree and the Indiscriminate Killing of Canaries
I have a bee in my bonnet. After my last post full of love and bromance, this one is full of hate and vitriol – and I don’t apologise! We have all seen many presentations on each vendor’s latest and greatest “fabric” technologies over the past 18 months. It doesn’t matter which vendor, whether the presenter is sales or tech, or even enterprise or service provider focused – at some point almost every one declares that their solution is “the end of spanning tree”. It gets worse when they actively advise that you do not run spanning tree in your environment.
And I don’t buy it 🙁
The Premature Death of Spanning Tree
Spanning Tree: noun – A pox on the house of networking
Everybody loves to hate on Spanning Tree. Haters gonna hate. While we’ve all been bitten by something horrible happening related to spanning tree, I have seen many more things go wrong because people *didn’t* configure Spanning Tree properly.
Vendors knew how painful it was and went to great lengths to ensure that we didn’t need to do anything so that it would “just work”. Which is great… right up until the point that you run into a limitation on the STP PixieDust Mode. Often this comes in VLAN-dense environments when you max out the total number of spanning tree instances that your devices will handle. Oh that’s easy – let’s roll out Multiple Spanning Tree!
I can hear the gasps from many people now. If people hated Spanning Tree, then they have a full lynch mob ready with pitchforks and stakes at the mere mention of poor MSTP. And they hate it for a reason. MSTP makes you think about how spanning tree works again and all the PixieDust goes away. And networks become hard again.
I will let you in on a little secret:
I actually really like MSTP and I implement it every chance I can get.
Yes, it’s a little harder and requires a little more forethought, but I would rather do that at design time than have to overhaul my design later to meet some new need well after my network has reached “critical mass”. I have spent many hours rebuilding spanning tree designs because I needed more than 128 instances. Sadly, in more than one case I needed to work out the best way to deal with a group of Catalyst 3750s in PVST+ with 1000 VLANs configured (and 900 VLANs with STP disabled).
And things get messy, and things get hard. So let’s find a new solution.
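To put some rough numbers on that, here is a minimal Python sketch of the arithmetic. Per-VLAN spanning tree needs one instance per VLAN, so anything beyond the platform limit goes unprotected, whereas MSTP maps many VLANs onto a handful of instances. The function names are hypothetical, the 128 limit is the figure from the story above, and the round-robin VLAN split is purely illustrative: a real design maps VLANs to instances based on where you want each group to forward.

```python
def pvst_uncovered_vlans(vlan_count, max_instances):
    """Per-VLAN STP uses one instance per VLAN; return how many VLANs get none."""
    return max(0, vlan_count - max_instances)


def mst_plan(vlans, instance_count):
    """Spread VLAN IDs across a fixed number of MST instances (illustrative round-robin)."""
    plan = {instance: [] for instance in range(1, instance_count + 1)}
    for index, vlan in enumerate(sorted(vlans)):
        plan[1 + index % instance_count].append(vlan)
    return plan


vlans = range(1, 1001)  # 1000 VLANs, as in the example above

# With a 128-instance ceiling, most VLANs cannot run their own spanning tree.
print(pvst_uncovered_vlans(len(vlans), 128))  # 872

# With MSTP, two instances are enough to cover every VLAN.
plan = mst_plan(vlans, 2)
print({instance: len(members) for instance, members in plan.items()})  # {1: 500, 2: 500}
```

The point is not the mapping itself but the shape of the problem: the instance count stops growing with the VLAN count.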
The Magical Healing Powers of Woven Unicorn Hair Fabrics
Somewhere over the rainbow, far beyond the Dark Forest of Broccoli Despair, many magical elves have worked hard to deliver us the perfect solution to the problems I listed above. Vendors have taken this creation and moulded it into their own “Fabric Solutions”. Some created skinny jeans, others an uncomfortable sweater vest. Sadly, most of the time they have just presented us with a sensible pair of slacks that the sales people try to sell as a three piece suit.
A sensible pair of slacks (Unicorn Hair or otherwise) is perfectly apt when used as intended, but if you drape them over your shoulders and call them a shirt then you’re wrong (or a hipster, in which case your cardigan is probably over the top of your shirt-slacks).
And so it is with data centre fabrics. I agree that most of these solutions will allow us to disable spanning tree on our core/fabric facing interfaces. We will get many of the benefits of multi-path layer 2, and sometimes the efficiencies gained by avoiding the flooding of L2 addressing information. Turning off spanning tree into the fabric core makes sense. I’m happy with that.
So what about all those edge interfaces?
Do we live in a world where end users never plug two ports together?
A client PC never bridges interfaces?
How about “Oh my VoIP phone has two network ports let me just…. BOOM!”
Maybe you have no requirement to integrate with other networking infrastructure, but end stations can still do bad things, and that’s usually when you least want them to.
The Indiscriminate Killing of Canaries
So how do we go about detecting these loops? Well, over the past couple of decades we’ve presented ourselves with a whole cage full of canaries that can alert us to loops or other similar problems in the network. These are our early warning signals that “Something bad just happened…” and better yet “… so let me just fix that for you!”. And sadly, many of these have been built around the functionality that Spanning Tree provides.
Let’s take the BPDU Guard feature as an example. BPDU Guard is set on an access port or another port where you do not expect to see Spanning Tree packets (BPDUs). If a BPDU is detected, the switch will usually log a message and take the port out of service (err-disabled, in Cisco terms). In the scenarios listed previously the offending port is now taken out of action and the loop is removed. If we have disabled Spanning Tree on all ports then the BPDU will never be sent or received and our little bridging loop will happily continue. Well, at least until your switch is a bubbling blob on the bottom of your rack.
Another feature available on most switches is the BPDU Filter. With BPDU Filter enabled on a port the switch will pass all traffic on the port but silently drop the BPDU messages. Now I agree that there are certain times when this feature is useful, such as when interconnecting with a third party that you “know” can never form a loop with you and you do not want to either learn an STP root from them or go into blocking due to election issues.
Sadly, our good friends at VMware love to advocate that we implement BPDU Filter on the ports facing our VMware hosts. Unfortunately I have been bitten by loops coming from inside a VMware environment due to a Microsoft guest bridging two vNICs in separate VLANs. A BPDU came in from the physical NIC on VLAN A, out to the vNIC in that VLAN, and back out through the vNIC in VLAN B. Thankfully when this happened, my canary (BPDU Guard) signalled that there was a problem then promptly died in its cage and disabled the port to the VMware host. Yes, this would have some undesirable effects on all the other guests on that host, but we were alerted to the problem and needed to fix it. In the scenario with BPDU Filter these alerts would have been filtered out and the loop would continue unnoticed.
So what other methods do we have to detect possible bridging loops that do not require Spanning Tree to be operational? I have the following list as a start to some ideas, and I am looking for others that you might know of too:
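The three outcomes described above (STP off, BPDU Filter, BPDU Guard) can be laid side by side in a small Python sketch. This is a toy model, not how any switch is implemented: `EdgePortPolicy` and `loop_outcome` are hypothetical names, the filter is modelled as the simple per-port "drop every BPDU" behaviour described in this post, and the assumption that the filter wins when both filter and guard are set is spelled out in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgePortPolicy:
    """Hypothetical per-port settings; not a real vendor API."""
    stp_enabled: bool = True
    bpdu_guard: bool = False
    bpdu_filter: bool = False


def loop_outcome(policy: EdgePortPolicy) -> str:
    """What happens when a downstream loop hands one of our BPDUs back to this port."""
    if not policy.stp_enabled:
        # No spanning tree means no BPDUs are sent, so there is nothing to detect.
        return "undetected: no BPDUs are sent, so none can come back"
    if policy.bpdu_filter:
        # Assumption in this sketch: the filter drops the BPDU before the guard sees it.
        return "undetected: the returning BPDU is silently dropped"
    if policy.bpdu_guard:
        return "port disabled: BPDU Guard trips and logs the event"
    return "left to spanning tree: the BPDU is processed as normal"


for label, policy in [
    ("STP off", EdgePortPolicy(stp_enabled=False)),
    ("BPDU Filter", EdgePortPolicy(bpdu_filter=True)),
    ("BPDU Guard", EdgePortPolicy(bpdu_guard=True)),
]:
    print(f"{label}: {loop_outcome(policy)}")
```

Only one of the three rows ends with somebody being told about the problem, which is the whole argument for keeping the canary.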
- Broadcast Storms
  - Possible Mitigation: Storm Control
- Multiple MAC Addresses on a Port
  - Possible Mitigation: Max MAC Address restrictions
- MAC Address Flapping
  - Possible Mitigation: MAC Flap Dampening
- High CPU Usage (in some cases)
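To make one of those canaries concrete, here is a rough Python sketch of MAC flap detection: if the same MAC address keeps moving between ports within a short window, something is probably looping. The class name, the threshold of three moves and the ten second window are all assumptions picked for illustration, not any vendor's defaults, and the port names are made up.

```python
from collections import defaultdict, deque


class MacFlapDetector:
    """Flag a MAC address that moves between ports too often within a time window."""

    def __init__(self, max_moves=3, window_seconds=10.0):
        self.max_moves = max_moves            # illustrative threshold
        self.window_seconds = window_seconds  # illustrative window
        self._last_port = {}
        self._moves = defaultdict(deque)

    def learn(self, mac, port, now):
        """Record a MAC learn event; return True if the MAC is now considered flapping."""
        previous = self._last_port.get(mac)
        self._last_port[mac] = port
        if previous is None or previous == port:
            return False  # first sighting, or no move
        moves = self._moves[mac]
        moves.append(now)
        # Forget moves that have aged out of the window.
        while moves and now - moves[0] > self.window_seconds:
            moves.popleft()
        return len(moves) >= self.max_moves


detector = MacFlapDetector()
events = [(0.0, "Gi1/0/1"), (1.0, "Gi1/0/2"), (2.0, "Gi1/0/1"), (3.0, "Gi1/0/2")]
results = [detector.learn("00:11:22:33:44:55", port, now) for now, port in events]
print(results)  # [False, False, False, True]
```

What you do once the detector fires (log, dampen, or shut a port) is the policy decision, and it is the same trade-off as with BPDU Guard.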
Mop and Bucket
Yes, I’ve written this post at 2am, but it’s been something that I have been thinking about for the past 8 or 9 months.
I can see that Spanning Tree doesn’t have an indefinite future, but calling it dead today is premature. If you are looking at fabric technologies, or worse still you don’t have a newfangled fabric but hate spanning tree so badly that you have just turned it off, then ask yourself how you will detect loops in the edge networks and how you will mitigate them.
Take your canaries with you and let them do their job; don’t strangle them at the top of the mine shaft.
If you do, you might just find that the Emperor’s new clothes are just a sensible pair of slacks!