Effective Troubleshooting (Dev Edition)

G. Wade Johnson

How do you troubleshoot?

Give the audience a little time to make suggestions.

Symptoms, not the Problem

Troubleshooting does not usually start with a problem. My percentages are really just an estimate, but they seem somewhat reasonable. Usually, the person reporting the problem can only report on what they see, which are symptoms. They have no real way to get insight into what is actually wrong.

Start with Questions

A large number of the problems I have worked on have turned out to be completely different than what was reported or what I thought. Most of these are much easier to solve if you don't assume that the obvious has been ruled out.

The first question eliminates a surprising number of user-reported problems. We had a great example of the second question last week when I got a report of strange errors reported by a partner. The next two go together.

If the system was working before and suddenly stopped, something must have changed. It may be a change to the code, configuration, environment, load, or something else that is not obvious. Identifying any changes may lead you to the key to finding the problem. The important thing to remember about the last question is "Correlation does not equal causation". A change is often related to a new problem, but it can be just a coincidence.

My Favorite First Step

The Telescan delayed server war story.

Gather Information

In the first stage, it is critically important to observe and query, and avoid making assumptions or interpreting the data. Unfortunately, this is pretty hard to do in general. Many troubleshooting failures are caused by missing this step, with large amounts of time spent looking in the wrong direction.

One important thing about computer problems: the computer is too stupid to lie. Humans can tell you things that aren't true (often not intentionally). You are one of the humans that can lie to you. Wherever possible, get the data from the computer rather than assuming you know or believing what someone tells you.

Tactics versus Strategy

Most writing on troubleshooting focuses on techniques and tactics. However, without an effective strategy, it is difficult to apply any technique effectively.

That's why this talk will focus more on strategy than tactics.

Strategy: Hunch/Guess and Test

To a junior developer, the developer with a lot of experience seems to just pull understanding and solutions out of the air. The junior developer might try to use guesswork to emulate the more senior developer.

If it ever works, even once, this technique becomes really hard to abandon. It impresses everyone, which gives incentive to try it again.

Part of what makes this approach seductive is the Dunning-Kruger effect. You don't know enough to evaluate how little you know.

This also happens with senior troubleshooters who end up working in a different system or environment. They are used to being able to intuit the cause of a problem by just looking at the system. But in a different system, that intuition can be completely wrong.

More senior troubleshooters also fall for the Dunning-Kruger effect. Their confidence is based on knowledge of a different system, which causes them to underestimate what they don't know.

Strategy: Feynman's Algorithm

Many developers would love to be able to pull this off. It fits nicely with our normal belief that we are mostly rational and intelligent.

Strategy: Divide and Conquer

This strategy is pretty obvious to anyone who understands binary search. Developers should recognize the powers of two and O(log(n)) progression here.

It's also only this effective if each question divides the search space in half.
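A quick way to see the arithmetic is to count the questions. This is only a sketch: the function names and the idea of a fixed "keep fraction" for lopsided questions are assumptions made for illustration, not something from the slides.

```python
def questions_needed(candidates: int) -> int:
    """Yes/no questions needed when every question halves the search space."""
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, computed without floats.
    return (candidates - 1).bit_length()


def worst_case_questions(candidates: int, keep_fraction: float) -> int:
    """Worst case when the larger side of each split keeps `keep_fraction`
    of the remaining candidates (0.5 is a perfect halving)."""
    count = 0
    remaining = candidates
    while remaining > 1:
        # The unlucky answer leaves the bigger side, but any useful question
        # rules out at least one candidate.
        remaining = min(remaining - 1, int(remaining * keep_fraction + 0.999999))
        count += 1
    return count


print(questions_needed(1024))            # perfect halving
print(worst_case_questions(1024, 0.5))   # same thing, simulated
print(worst_case_questions(1024, 0.9))   # each question rules out only 10%
```

With a perfect halving, 1024 candidates take 10 questions. When each question only rules out a tenth of what remains, the worst case takes many more, which is why the choice of question matters as much as the strategy.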

Divide and Conquer - Analysis - Pro

As strategies go, divide and conquer is really a best practice for troubleshooting. The core idea is to narrow down where in the system the problem is occurring, then investigate there to find the cause of the problem.

Choosing good questions/hypotheses is key. If done well, you should not care which way the test goes. Either result cuts the problem in half.
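As a sketch of what "either result cuts the problem in half" can look like, here is a bisection over an ordered list of changes. The change names and the `is_good` check are hypothetical stand-ins. The sketch also assumes that everything before the culprit passes and everything from the culprit onward fails, which real systems do not always guarantee.

```python
from typing import Callable, List, Sequence


def first_bad(items: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first item that fails the check.

    Assumes the items are ordered, the last item is known bad, and every
    item before the culprit is good.
    """
    lo, hi = 0, len(items) - 1  # invariant: the first bad item is in items[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(items[mid]):
            lo = mid + 1        # culprit is after mid
        else:
            hi = mid            # culprit is mid or earlier
    return items[lo]


# Hypothetical data: eight changes, where "change-6" introduced the problem.
changes = [f"change-{n}" for n in range(1, 9)]
checks: List[str] = []


def is_good(change: str) -> bool:
    checks.append(change)  # record each test we had to run
    return changes.index(change) < changes.index("change-6")


culprit = first_bad(changes, is_good)
print(culprit, "found after", len(checks), "checks")
```

Here eight candidate changes are narrowed to one in three checks, and it does not matter whether any individual check passes or fails.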

Divide and Conquer - Analysis - Con

There are a small number of circumstances where it won't be helpful (mostly when you can't replicate an issue), but in the overwhelming majority of cases, it works well.

Strategy: Scientific Method

  1. Observe the symptoms
  2. Form a new hypothesis
  3. Create a test for the hypothesis
  4. Run the test
  5. Observe the behavior
  6. Does the behavior support the hypothesis?
    • Yes: Return to step 3
    • No: Return to step 2
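The loop above can be sketched in code. This is a rough illustration, not a real diagnostic tool: the observations, hypothesis names, and tests are all invented, and a real test would query the system rather than read a dictionary.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Step 1: observed symptoms (hypothetical values for illustration only).
observed: Dict[str, float] = {
    "http_status": 502,
    "upstream_reachable": 1,
    "disk_free_pct": 3,
}

Test = Callable[[], bool]  # returns True when the behavior supports the hypothesis


def investigate(hypotheses: List[Tuple[str, List[Test]]]) -> Optional[str]:
    """Return the first hypothesis that survives all of its tests, if any."""
    for name, tests in hypotheses:            # step 2: form a new hypothesis
        # Steps 3-6: run each test in turn. all() stops at the first test
        # that fails to support the hypothesis, and we move to the next one.
        if all(test() for test in tests):
            return name
    return None


hypotheses: List[Tuple[str, List[Test]]] = [
    ("upstream is down", [lambda: observed["upstream_reachable"] == 0]),
    ("disk nearly full", [
        lambda: observed["disk_free_pct"] < 5,
        lambda: observed["http_status"] >= 500,
    ]),
]

print(investigate(hypotheses))
```

A hypothesis that survives is only better supported, not proven. Each test should be written so that it could fail.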

The scientific method has been a very effective approach in the pursuit of knowledge. One of the important parts of the scientific method is its ability to reduce the effects of bias and error. While not guaranteed to find an answer, it has proven to be one of our most effective tools of inquiry.

The most important test of the hypothesis is to attempt to disprove the hypothesis. If you try hard to disprove your hypothesis and fail, this generates more confidence in the hypothesis.

Each test that the hypothesis survives gives you more confidence in it. Once you have a solid hypothesis describing the problem, you can begin developing a solution.

Complements Divide and Conquer strategy. Use to pick points for partitioning the problem.

Strategy: Scientific Method, Analysis - Pro

As strategies go, the scientific method is a best practice for troubleshooting.

The scientific method is not comfortable for many people. They want to go quicker, skip steps, and get to the conclusion intuitively. Of all of the methods we humans have used to solve problems, this has been the most successful. It can be led astray, but it is more consistent than any other approach we've found.

Strategy: Scientific Method, Analysis - Con

Almost everyone wants to jump ahead. It's hard to come up with a hypothesis that is not so specific as to be practically useless.

Strategy: Scientific Method, Hypothesis

The hardest part of the scientific method is forming a testable hypothesis. In looking at a software problem, this usually involves trying to come up with an idea for a failure that would generate the symptom we see. A good start is to identify which system/library/component is likely to cause the problem.

Multiplier: Experience with System/Domain

This isn't actually a strategy; it multiplies the effectiveness of whichever strategy you use. Developing experience requires a lot of time. You normally only see this used effectively by the most experienced members of your team. Unfortunately, it is also almost impossible to teach. It requires knowledge of the system or domain. The troubleshooter needs not only knowledge of the system, but also experience with the context in which the system executes.

Other thoughts

If you can't reproduce a problem, it's very hard to fix it. Wishful or magical thinking never helps in the long term with troubleshooting.

"Think horses, not zebras." Check/eliminate the obvious possibilities first. If something changed recently, if it's a new partner, if traffic has gone up, these are probable causes of new symptoms.

Save complicated hypotheses and problems for after you have eliminated the simple possibilities. It's overwhelmingly more likely that we got a bad request than that something has fundamentally changed with the way Docker runs on one of our VMs.

Sometimes, but rarely, a random hunch works out, especially as you get more and more familiar with a system. Try it out, but limit your time spent. I normally try not to spend more than 5-10 minutes testing out a hunch. It gets bad when you suddenly realize that you've spent a few hours running down hunches and have made minimal real progress toward finding the problem. Go back to our best practice strategies.

Barriers to Troubleshooting

Watch out for ignoring data that does not fit your current hypothesis, or looking specifically for data that matches what you think the problem is. Don't guess, observe.

The way you understand the system will affect how you approach troubleshooting it. The unconscious assumptions about how you think the system works will constrain how you approach solving it. If your assumptions are wrong, it will be hard to find a solution.

It's easy to get distracted by information that does not apply to the problem. If you run into something that does not directly relate, make note of it and continue with your strategy. You can come back to it if it becomes relevant. Not being distracted is more important.

Questions?
