Navigating the Storm: Mastering Crisis Management in Tech Leadership

In the heat of a tech crisis, effective leadership is crucial. Discover how to shift from collaborative to directive management, guiding your team like a conductor leading an orchestra. Learn from real-world scenarios to master crisis management. It's time to lead with confidence and clarity.

Photo by Jason Leung / Unsplash

Well, it happened. You just got thrown an unrealistic deadline. That new app release your team just published increased your crash rate by 600%. The feature update your star developer's been working on for a month and finally pushed to prod broke a dependent service, and they want it fixed yesterday.

If you have more than a year or two of management under your belt, you’ve seen this sort of crisis come and go. Hopefully, your team got through it unscathed: in other words, without too much upper-management scrutiny, without burning bridges with other orgs within your company, and without team attrition.

Chances are, though, it wasn’t quite that easy. I’ve seen my share of multi-week deathmarches. Earlier in my career, I had a dev work 32 hours straight to unscrew the pooch on a high-priority prototype we were building for the HoloLens. I’m not proud of letting him do that, and he quit shortly after. Lesson learned.

As an EM for Skype for Business iOS and Mac clients, I’ve had app releases with unexpected bugs, requiring fast priority shifts to address. No amount of testing can catch everything, and sooner or later a bad bug slips through.

In my previous articles, I’ve talked a lot about the value of servant-leadership, of driving the team by consensus, of letting team members decide what to work on. As EMs, it’s our job to promote ownership as a central value on our teams and to empower our team members to take initiative.

Today I want to talk about the one scenario where a more directive style of management is necessary.

Crisis.

Bringing out the Lead-Stick

Photo by Etienne Girardet / Unsplash

Back at Microsoft, we had a term for when a dev lead (the precursor to an Engineering Manager) needed to get hands-on: bringing out the Lead-Stick. This was the mid-2000s, and the term hadn’t yet acquired the derogatory connotation it would in the years to come, once the servant-leader approach took a firmer hold in the company culture. (Oh, I could tell you stories.)

Over the years, I’ve come to think of an EM as operating in two modes:

  • The “normal” mode makes use of all those techniques I’ve discussed previously: collaborative, empowering, soft-touch. No micromanagement, full trust.
  • The “emergency” mode flips all that on its head. In an emergency, the EM becomes a micromanaging director, and over the years I’ve seen again and again that the team appreciates this side of the role, because an emergency is stressful, and having someone organize the team toward a common goal reduces that stress.

When a crisis hits, the team is like an orchestra. They each have their talents. They each have ideas on how to address the crisis. And unless they have a conductor, they’re each going to play their own favorite piece. The result will be chaos. Team members will step on each other’s toes. The “heroes” on the team will work insane hours to try to pull the situation out of crisis mode. No one will know exactly how close the team is to resolving the crisis, which will invite increased scrutiny from upper management, meaning even more stress.

This is the time when the Engineering Manager’s tactical acumen is most needed. Let’s talk in depth about how to approach crisis management, step by step.

First, Assumptions

Before we talk about the approach, let’s agree on a couple of assumptions without which it will be far more difficult to implement:

  1. The EM has earned trust with their team. In other words, the team is ready to follow direction when given, not out of fear of retribution if they don’t, but because they genuinely respect the EM’s opinion.
  2. The EM understands the technical details on the ground at least at a 10k-foot level, i.e. they have a good grasp of the system’s architecture and data flows, if not an intimate understanding of the codebase.
  3. The EM understands the strengths and weaknesses of each member of their team.
  4. The EM has a good relationship with upper management, meaning the senior leaders will take the EM at their word rather than forcing additional oversight onto the team.

These points must already be in place by the time the crisis hits for your crisis management to be at its most effective. If they aren’t, your job will be far more complicated, though I’ve also found that sometimes (for example, when a crisis hits while you’re still new to the team) it’s possible to build them while also managing the crisis.

Step Zero: Communication

Photo by Jason Rosewell / Unsplash

The very first thing you should do when encountering a crisis is to communicate out. This is typically an email to the affected groups and senior management. In that email, you want to briefly summarize the nature of the crisis, the steps you’re taking toward resolution (at this stage, the “step” is initial analysis), and a very rough timeline for your approach. At this point it’s understandable that you’ll have very little information to share beyond the nature of the crisis, but you should be able to set expectations for when your leadership should expect to hear from you next.
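To make that concrete, here’s a bare-bones sketch of what that first email might look like. The scenario, numbers, and times are made up, borrowed from the crash-rate example at the top of this article; adapt the details to your situation.

    Subject: Crash-rate regression in today’s release (initial notice)
    What happened: Crash rate is up sharply since this morning’s release; we’re still confirming the scope.
    Impact: Users on the new version; exact percentage TBD.
    What we’re doing: Initial analysis is underway to identify the offending change and possible mitigations.
    Next update: By 3pm today, or sooner if the picture changes materially.

That next update is usually going to come after you complete…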

Step One: Analysis

Photo by Kaleidico / Unsplash

Obviously, the first “real” step in a crisis is to understand what’s wrong, the business repercussions of the issue, and a couple of different approaches to fixing it.

It’s important not to neglect the “repercussions” point. Understanding the business cost of the issue will help you understand its urgency, independent of the FUD (fear, uncertainty, doubt) that will swirl around it. I’ve frequently seen stakeholders cry bloody murder over an issue that, while serious, wasn’t truly blocking. This doesn’t mean those types of issues aren’t important or shouldn’t be fixed as soon as possible, but there’s a world of difference between a business-threatening problem and a highly annoying one. That difference is the difference between burning out your team to fix the issue ASAP and doing high-priority work during business hours only. More on that in a sec.

If no resolution approach is immediately evident, I meet with the team and discuss until we brainstorm a couple of options. The easiest option to come up with is going to be the “sunny-day” one, i.e. the way to mitigate risk assuming everything goes according to plan.

I suggest not stopping there. To come up with the other options, there are a few possible ways of thinking: 

  1. What happens if the issue isn’t fixed? Can the problem be worked from a business angle? (in other words, is the crisis really a crisis?) It’s always good to keep this line of questioning in your back pocket just in case no other options work out.
  2. Is there a cheap/quick way of partially fixing the issue? Generally, a crisis along the lines we’re talking involves a certain percentage of your customer base. Is there a way of narrowing down the number of customers affected to a point where the crisis is tolerable? (giving your team more time to implement the full fix?)
    1. You can take a similar approach for the “deliver feature X by date Y” crisis. Is there a way to scope down feature X to make the date easier to meet? Are there requirements of X that aren’t P0s?
  3. What happens if your sunny-day fix fails? Is there another approach you can take that might not be as efficient but is more likely to work? Is there a cheap “hack” fix you can put in that’ll give your team more time to deal with the issue properly? I generally aim to have at least 2-3 backup plans in reserve, just in case the first one fails.

At the end of this analysis, you should have a decision tree that takes you through resolving the crisis and includes multiple contingencies. 
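As an illustration, a decision tree for the crash-rate scenario above might look something like this. The specific plans are hypothetical; yours will depend on your product and architecture.

    Plan A: Identify the offending change, revert it, and ship a hotfix release.
      If the revert isn’t clean or the hotfix slips: Plan B, disable the broken code path behind a server-side flag.
        If there’s no flag to flip: Plan C, roll users back to the previous app version.
          If rollback isn’t feasible: work the problem from the business side (comms, support workarounds, and a scoped-down proper fix on a saner timeline).

The exact shape doesn’t matter; what matters is that every branch ends somewhere you can live with.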

Depending on the urgency and scale of your crisis, this step should take you anywhere from 30 minutes to a day, most of which will involve discussions with your team, the stakeholders, and PMs to come up with these alternative plans.

Once finished, your next step should be to communicate what you’ve learned so far. By now you should have a better idea of the timelines to resolution, and the approaches you’re planning to take. Send a follow-up email to your first one, describing your decision tree and a rough timeline to resolution, and mentioning when your next status update will be.

Step Two: Work Assignment

Photo by Jo Szczepanska / Unsplash

Once an approach has been selected, it’s time to put on your orchestra-conductor hat. The approach needs to be broken down into the smallest pieces of work possible. This is useful because it allows you to easily parallelize the work across several engineers, speeding up overall progress. This is similar to thin-slicing, a practice I’ve found incredibly useful for achieving timeline predictability over larger-scale pieces of work.

Whereas during normal operations, I encourage team members to pick up whatever tasks they feel most aligned to (working down the priority stack, of course), in emergency mode I tend to hand tasks out based on areas of expertise. This is disempowering and should not be done often, as we’ve discussed before.
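To make this concrete, here’s roughly how I might slice and assign the work for the hypothetical hotfix above. The tasks and pairings are illustrative, not prescriptive:

  • Bisect to the offending commit: the developer who owns that area of the code.
  • Build and validate the revert or targeted fix: your most release-savvy engineer.
  • Confirm the fix against the top crash signatures in telemetry: whoever knows your crash-reporting pipeline best.
  • Prepare the expedited store submission and staged rollout: the person who has shipped the most releases.
  • Draft the customer-facing and internal comms with your PM: usually you, the EM.

Each slice should be small enough that one person can own it end-to-end and report a clear done or not-done.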

As tasks get completed, I continue to assign the remaining tasks in this manner until the issue is resolved.

This is a good time to send out another status update to the folks involved.

Step Three: Execution

Photo by Brooke Cagle / Unsplash

Here comes the grind. This step is a repeat of steps one and two. As issues arise, evaluate them against your decision tree. If something unexpected happens, reevaluate your decision tree and send an update with that reevaluation.

Depending on the urgency of the crisis, you should aim for a status update anywhere from hourly to daily.

By default, I would highly recommend not deathmarching to resolution unless the issue truly is critical. We talked earlier about the importance of understanding the true cost of your emergency from the get-go, and here is where that understanding pays off. Do not burn out the team unless you absolutely have to, and that “absolutely” should apply only in situations where the issue is disrupting your company’s business. (Blocked customers, blocked dependency organizations, and so on.)

Why is this so important? Because stakeholders tend to inflate the importance of their issues. It’s human nature: we have a problem, and we want it fixed ASAP. But if you treat every problem as business-threatening, pretty soon you’ll have burned out your team entirely, and then not only can you not address new crises, you also can’t make forward progress on your regular work. Just don’t do it. Let your engineers have the work-life balance they deserve.

Conclusion

Crises are specific. Advice is generic, and cheap to boot. When you’re stuck in a crisis, it’s easy to lose sight of the bigger picture and operate in a reactive, scattershot way. Your team won’t thank you for it. A crisis demands structure even more than the usual grind does. I hope the advice above has given you some ideas for what that structure might look like, but the important thing to remember is that the exact structure matters less than having one at all. So come up with a crisis-management structure that works for you, and, when that crisis happens, follow it.