Engineering Excellence: The VeriSign DDoS debacle and the Zero-Bug mantra

Discover how a DDoS debacle at VeriSign taught me the hard truths about engineering practices. It’s not just about avoiding bugs, it’s about fostering an environment where quality isn’t an afterthought.

Early on in my career, I got a painful lesson on the importance of sound engineering processes. I was working as a contractor for VeriSign, part of a team shipping a browser extension to fake international domain names. The team had a skip-level manager who knew little about computers, a tech lead who had convinced himself his opinion was the only one that mattered, and two manual testers who clicked buttons they were told to click but understood little about the product we were developing. 

As a result, a bug got through that wound up causing a DDoS attack against VeriSign. I’d been asked to write a function that indicated whether the client machine was online (this was the early 2000s, when modems were not yet obsolete). It did so by sending a ping to verisign.com and returning true if the ping succeeded. My tech lead then wrote a bug that essentially put my function inside an infinite loop. The extension was deployed to 85k clients, and the rest, as they say, is history.
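To make the failure mode concrete, here is a minimal sketch of one guard that would have blunted it: caching the probe result so that even a caller stuck in a tight loop cannot flood the target. This is not the original code. The `OnlineChecker` class, the 60-second interval, and the injected `probe` and `clock` callables are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class OnlineChecker:
    """Caches a connectivity probe so repeated calls cannot flood the target."""

    def __init__(
        self,
        probe: Callable[[], bool],            # hypothetical: e.g. one ping to a host
        min_interval_s: float = 60.0,         # assumed rate limit, not from the article
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_checked: Optional[float] = None
        self._last_result = False
        self.probes_sent = 0

    def is_online(self) -> bool:
        now = self._clock()
        stale = (
            self._last_checked is None
            or now - self._last_checked >= self._min_interval_s
        )
        if stale:
            # Only touch the network when the cached answer has expired.
            self._last_result = self._probe()
            self._last_checked = now
            self.probes_sent += 1
        return self._last_result


def demo_tight_loop(iterations: int = 10_000) -> int:
    """Simulate a buggy caller polling in a tight loop; return probes actually sent."""
    fake_now = [0.0]
    checker = OnlineChecker(
        probe=lambda: True, min_interval_s=60.0, clock=lambda: fake_now[0]
    )
    for _ in range(iterations):
        checker.is_online()
        fake_now[0] += 0.001  # pretend each iteration takes about 1 ms
    return checker.probes_sent
```

The point of the design is that the rate limit lives inside the function being called, so it holds no matter how the caller misuses it. A guard like this does not replace testing or review, but it turns an infinite loop from an outage into a wasted CPU core.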

To return to the main theme of this series, an Engineering Manager has three main priorities: 

1. Happy Team

2. Customer Love

3. Don’t Ship Stupid

My VeriSign experience was an example of Shipping Stupid. Aside from my own error and that of my tech lead, there were simply no processes in place to exhaustively test for and catch mistakes.

In this third article in the series, I’d like to explore Shipping Stupid.

The Andon Cord

I will assume that, by now, the notion of the Toyota manufacturing mechanism known as the Andon Cord has seeped into public consciousness, so I won’t belabor the point by explaining it. The critical element of this mechanism I want to note is the psychological safety it requires. It’s hard to be the whistleblower, hard to feel like you’re putting your job, your career, your very future in jeopardy by stopping that assembly line.

As I mentioned previously, the culture on that team at VeriSign was such that disagreeing with the tech lead was dangerous, especially in a public setting. He wasn’t the forgiving kind. So, over time, I learned to keep my head down and do what I was told. Hence, when I was told to write a function to check for online status and could not find a reliable Windows API for doing so, I made the decision to send that fateful ping. I should’ve gone back to the tech lead and asked for guidance, but I was scared. I didn’t trust that my question wouldn’t backfire in some way I couldn’t predict.

The worst thing you, as a leader, can do to your organization is to deprioritize psychological safety. This creates code monkeys, people who want to put in their 9-5, get their paycheck, and go home, hoping they will still have a job to come back to the next day. Such an environment is poorly suited to rockstar employees, or even to average employees willing to take a risk on an idea or check something they weren’t explicitly asked to check.

The swishing bug tail

Fast-forward a few years. Back in the early 2000s, when we were still in the dying days of Waterfall, it wasn’t uncommon for a product to accumulate hundreds of bugs waiting to be resolved as the development cycle ground toward shipping.

This doesn’t tend to happen with SaaS, or, at least, it happens in a slightly different way. Because SaaS prioritizes rapid shipping (as per Google’s SRE, just-in-time shipping is the gold standard), product bug tails don’t really get a chance to accumulate… at least not the critical bugs… at least not ones you wouldn’t want to ship…

Okay, I can see you shaking your head. Fine, fine. Product bugs accumulate just as much on SaaS platforms. They may not accumulate in the same way they do for client apps, but that’s more due to the fact that most testing is now automated and designed to catch regressions rather than find new unexpected bugs. In other words, product bugs don’t accumulate because they aren’t found. Additionally, while product bugs may not accumulate as much, tech debt certainly does, and it shows up in the form of greater overall bug density, more time needing to be spent on maintenance, and slower overall product velocity. 

Most SaaS teams I’ve talked to have ticket counts in the hundreds, and never enough time to resolve them all. They are barely treading water taking care of the high-severity ones.

There is actually a fairly easy way to fix this issue if the team is willing to be ruthless. That is, simply, to take a zero-bug-tolerance approach, which is:

1. Close any bugs older than 3 months outright.

2. Close any lower-priority bugs older than 1 month.

3. Triage anything left and close anything the team cannot commit to fixing right now.

4. Drop all other work and fix all the bugs that remain.

5. From then on, any bug that comes in must be either fixed immediately or closed.
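The first four steps amount to a simple sorting rule, which can be sketched in a few lines. This is an illustration rather than a tool I have used: the `Bug` fields, the `triage` function, and the reading of "3 months" and "1 month" as 90 and 30 days are assumptions, and priority is reduced to a single yes/no flag.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass
class Bug:
    id: int
    opened: date
    high_priority: bool      # assumption: priority reduced to one flag
    team_commits_now: bool   # outcome of the step 3 triage conversation


def triage(bugs: List[Bug], today: date) -> Tuple[List[Bug], List[Bug]]:
    """Split a backlog into (bugs to close, bugs to fix now)."""
    to_close: List[Bug] = []
    to_fix: List[Bug] = []
    for bug in bugs:
        age = today - bug.opened
        if age > timedelta(days=90):
            # Step 1: anything older than ~3 months is closed outright.
            to_close.append(bug)
        elif not bug.high_priority and age > timedelta(days=30):
            # Step 2: lower-priority bugs older than ~1 month are closed.
            to_close.append(bug)
        elif not bug.team_commits_now:
            # Step 3: close what the team cannot commit to fixing right now.
            to_close.append(bug)
        else:
            # Step 4: everything left gets fixed before other work resumes.
            to_fix.append(bug)
    return to_close, to_fix
```

Step 5 is the same rule applied to each incoming bug with an age of zero: either the team commits to it now, or it is closed.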

In practice, however, I have yet to encounter a team willing to do this. There is – ha ha – a degree of psychological safety in having these bugs in your backlog. It seems to send a message of “see? I’m taking your bug seriously! I’ll get to it as soon as I can.”  

This feeling is sadly misleading. The originator of the bug doesn’t care that their bug is in your backlog. They just want it fixed. What good is it when a bug sits in the backlog for years, starved of resources because more important issues keep coming up?

By closing the bug you’re giving the requestor a choice of either dropping their complaint (and often, if a bug really has lingered for months, they’ve long since found a way to work around it), or yelling loudly that, no, they really do need the bug fixed. That’s when you perform that triage in step 5: are you going to drop everything to fix this bug now? If the originator isn’t the only one encountering the issue, or if they account for a high percentage of your product’s revenue, you really should. If not, you would politely tell the originator that, while you value their feedback, you do not see any way you will be able to prioritize their issue. Then you move on to other issues that you have deemed more important. 

The same goes for tech debt. 

Adapting to change

A big part of not Shipping Stupid is having a process in place to prevent it. That’s not just about your test strategy. Quality starts from the ground up; in other words, it runs from idea to implementation. Agile has a huge toolkit for creating quality. Here are just a few tools to consider:

· Skills Matrix

· Definition of Ready

· Definition of Done

· Retrospectives

So, when we talk about process, we’re talking about it holistically, on the team (and even organizational) level, where components of this process interlock to create an environment suitable for quality.

But then something changes. Your senior leadership decides to pull a 180 on your product strategy. Your PM comes in with an amazing new idea that absolutely must go out tomorrow. Your company is bought out or experiences layoffs, and your team suddenly goes from 12 to 4.

Frequently, the first thing that gets dropped in this “emergency” situation is the quality of your deliverables. I’ve seen it happen in multiple teams: pressure from without creates stress and pressure within, and that translates into long hours, fatigue, and Shipping Stupid.

How do you prevent this? By having a process set up and enforced that accounts for the unexpected. In my previous articles, I’ve talked about such mechanisms, so I’ll just mention them here briefly. 

Your number-one asset for dealing with the unexpected is a prioritized stack of work. When new work comes in, the stack will help you both prioritize it and adjust expectations for what’s now deprioritized.

Your second tool in this arsenal is Kanban. Again, I won’t belabor this point as I’ve written quite a bit about it. The important thing here is that these two work together to make your team resilient to the unexpected, helping it avoid Shipping Stupid.

What about you?

How about your team? What mechanisms do you use to ensure quality? How are they working out for you? Where are the holes? Join the discussion!