Effective Troubleshooting (Dev Edition)

G. Wade Johnson

How do you troubleshoot?

Experience
Educated guesswork/Hunches
Trial and error
I have my methods.

Give the audience a little time to make suggestions.

Symptoms, not the Problem

You are almost never given a problem, just symptoms.
You are not usually given all of the right symptoms.
Probably 80-95% of troubleshooting is finding the problem.
Finding a solution is usually much easier.
Applying the solution is usually anti-climactic.

Troubleshooting does not usually start with a problem. My percentages are really just an estimate, but they seem somewhat reasonable. Usually, the person reporting the problem can only report on what they see, which are symptoms. They have no real way to get insight into what is actually wrong.

Start with Questions

Is it on/running?
Did it ever work?
- separates bug reports from feature requests
What did you expect it to do?
What did it actually do?
What changed?
Really, what changed?

A large number of the problems I have worked on have turned out to be completely different than what was reported or what I thought. Most of these are much easier to solve if you don't assume that the obvious has been ruled out.

The first question eliminates a surprising number of user-reported problems. We had a great example of the second question last week, when I got a report of strange errors from a partner. The next two go together.

If the system was working before and suddenly stopped, something must have changed. It may be a change to the code, configuration, environment, load, or something else that is not obvious. Identifying any changes may lead you to the key to finding the problem. The important thing to remember about the last question is "Correlation does not equal causation". A change is often related to a new problem, but it can be just a coincidence.

My Favorite First Step

Is there something that could cause our symptoms
... that could not possibly be happening
... but is easy to verify?
Test it.

The Telescan delayed server war story.

Gather Information

Observe what is actually happening
Don't guess: measure or count
Ask the computer, don't simulate

In the first stage, it is critically important to observe and query, and avoid making assumptions or interpreting the data. Unfortunately, this is pretty hard to do in general. Many troubleshooting failures are caused by missing this step, with large amounts of time spent looking in the wrong direction.

One important thing about computer problems: the computer is too stupid to lie. Humans can tell you things that aren't true (often not intentionally). You are one of the humans that can lie to you. Wherever possible, get the data from the computer rather than assuming you know or believing what someone tells you.
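
As a minimal sketch of "measure or count", the snippet below tallies error lines per component rather than trusting anyone's impression of which part is failing. The log layout ("LEVEL component: message") and the component names are assumptions made up for illustration, not taken from any real system.

```python
from collections import Counter

# Hypothetical log excerpt; the "LEVEL component: message" layout is an assumption.
SAMPLE_LOG = """\
ERROR payments: timeout talking to gateway
INFO web: request served
ERROR payments: timeout talking to gateway
ERROR auth: token expired
WARN web: slow response
ERROR payments: card declined
"""

def count_errors_by_component(log_text):
    """Count ERROR lines per component instead of guessing which one is noisy."""
    counts = Counter()
    for line in log_text.splitlines():
        level, _, rest = line.partition(" ")
        if level != "ERROR":
            continue  # only errors are of interest here
        component, _, _ = rest.partition(":")
        counts[component] += 1
    return counts

if __name__ == "__main__":
    # Print components from most to fewest errors.
    for component, n in count_errors_by_component(SAMPLE_LOG).most_common():
        print(component, n)
```

The point is not the parsing but the habit: a count taken from the computer replaces "I think it's mostly auth failures" with a number you can check.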

Tactics versus Strategy

Tactics
- Everyone has their own favorite tactics, tools, techniques, etc.
- Some tactics work better than others for a certain circumstance
- Some people are more comfortable with some tactics than others
Strategy
- A good strategy helps guide your tactics to their best potential
- A bad strategy can negate even the best tactics

Most of the time that you see anything written on troubleshooting, there is more focus on techniques and tactics. However, without an effective strategy, it is difficult to apply any technique effectively.

That's why this talk will focus more on strategy than tactics.

Strategy: Hunch/Guess and Test

Normal novice approach
Superficially looks like experience at work, at least to the novice
Pro: Seems easy
Pro: If it works, it looks impressive
Con: Only effective by accident
Con: Can waste a lot of time with little movement
Bad Strategy

To a junior developer, the developer with a lot of experience seems to just pull understanding and solutions out of the air. The junior developer might try to use guesswork to emulate the more senior developer.

If it ever works, even once, this technique becomes really hard to abandon. It impresses everyone, which gives incentive to try it again.

Part of what makes this approach seductive is the Dunning-Kruger effect. You don't know enough to evaluate how little you know.

This also happens with senior troubleshooters who end up working in a different system or environment. In this case, they are used to being able to intuit the cause of a problem by just looking at the system. But in a different system, that intuition can be completely wrong.

More senior troubleshooters also fall for the Dunning-Kruger effect. Their confidence is based on knowledge of a different system, which causes them to underestimate what they don't know.

Strategy: Feynman's Algorithm

Three step process
1. Write down problem
2. Think real hard
3. Write down solution
Many people would really like to be able to do this
Con: Only works well if you are Richard Feynman
Bad Strategy

Many developers would love to be able to pull this off. It fits nicely with our usual belief that we are mostly rational and intelligent.

Strategy: Divide and Conquer

Identify where in a system the problem resides
Binary search converges faster than you might think
- 1 test removes half of the search space
- 10 tests will narrow to 0.1% of search space
- 20 tests will get to 1 in a million
git bisect can automate a version of this

This strategy is pretty obvious to anyone who understands binary search. Developers should recognize the powers of two and the O(log(n)) progression here.

It's also only this effective if each question divides the search space in half.
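
The numbers on the slide can be checked with a small sketch. It assumes an ordered list of versions where the first is known good, the last is known bad, and everything after the first bad version is also bad (a single-cause regression). The function name `find_first_bad` and the numbered builds are hypothetical.

```python
def find_first_bad(versions, is_bad):
    """Return (index of the first bad version, number of tests run).

    Assumes versions[0] is known good, versions[-1] is known bad, and
    every version after the first bad one is also bad.
    """
    good, bad = 0, len(versions) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(versions[mid]):
            bad = mid    # problem was introduced at or before mid
        else:
            good = mid   # problem was introduced after mid
    return bad, tests

if __name__ == "__main__":
    versions = list(range(1025))  # 1025 hypothetical builds, numbered 0..1024
    first_bad, tests = find_first_bad(versions, lambda v: v >= 700)
    print(first_bad, tests)  # prints "700 10"
```

Ten tests cover 1024 candidates (about 0.1% each), and twenty cover just over a million. `git bisect` applies the same idea to commit history, with you (or a script) answering the good/bad question at each step.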

Divide and Conquer - Analysis - Pro

The best test gives the same amount of information regardless of result
- In front-end or back-end
- Before step A or after step A
- Not "is it this one thing?"
Guaranteed to find a single-cause problem if you can come up with repeatable tests
Limited number of dead-ends, each test should progress toward the solution
Good Strategy

As strategies go, divide and conquer is really a best practice for troubleshooting. The core idea is to narrow down where in the system the problem is occurring, then investigate there to find the cause of the problem.

Choosing good questions/hypotheses is key. If done well, you should not care which way the test goes. Either result cuts the problem in half.
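
One rough way to put numbers on "the same amount of information regardless of result" is to treat each test as a yes/no question and compute its expected information in bits. Using Shannon entropy for this is an illustrative assumption, not something from the slides, and the probabilities below are made up.

```python
import math

def bits_from_test(p_yes):
    """Expected information (in bits) from a yes/no test.

    p_yes is the estimated probability that the test answers "yes".
    """
    if p_yes in (0.0, 1.0):
        return 0.0  # the outcome is certain, so the test tells us nothing
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))

if __name__ == "__main__":
    # "Front-end or back-end?" with an even chance either way.
    print(round(bits_from_test(0.5), 3))
    # "Is it this one thing?" when it is 1 suspect out of 1000.
    print(round(bits_from_test(0.001), 3))
```

The even split yields a full bit whichever way it goes. The "is it this one thing?" test yields roughly a hundredth of a bit on average, because it almost always answers "no" and rules out only one suspect.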

Divide and Conquer - Analysis - Con

Slow and methodical, doesn't seem as cool
Not as effective with problems that have multiple causes
- Still works, but your tests need to be targeted more carefully
Only works if you can test if the problem has happened

There are a small number of circumstances where it won't be helpful (mostly when you can't replicate an issue), but in the overwhelming majority of cases, it works well.

Strategy: Scientific Method

1. Observe the symptoms
2. Form a new hypothesis
3. Create a test for the hypothesis
4. Run the test
5. Observe the behavior
6. Does the behavior support the hypothesis?
- Yes: Return to step 3
- No: Return to step 2

The scientific method has been a very effective approach in the pursuit of knowledge. One of the important parts of the scientific method is its ability to reduce the effects of bias and error. While not guaranteed to find an answer, it has proven to be one of our most effective tools of inquiry.

The most important test of the hypothesis is to attempt to disprove the hypothesis. If you try hard to disprove your hypothesis and fail, this generates more confidence in the hypothesis.

After several tests have failed to disprove the hypothesis, you can have real confidence in it. Once you have a solid hypothesis describing the problem, you can begin developing a solution.

Complements Divide and Conquer strategy. Use to pick points for partitioning the problem.
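
The loop on this slide can be sketched in a few lines. This is only an illustration: the observations, hypothesis names, and the `investigate` helper are all hypothetical, and real tests would query the running system rather than a dictionary.

```python
# Hypothetical observations gathered from the system; keys and values are
# made up for illustration.
OBSERVED = {"api_status": 200, "rows_in_db": 0, "rows_on_screen": 0}

# Each hypothesis is a name plus a list of tests. A test returns True when
# the observed behavior supports the hypothesis, False when it disproves it.
HYPOTHESES = [
    ("front-end rendering bug", [
        lambda: OBSERVED["rows_in_db"] > OBSERVED["rows_on_screen"],
    ]),
    ("bad or missing data", [
        lambda: OBSERVED["rows_in_db"] == 0,
        lambda: OBSERVED["api_status"] == 200,
    ]),
]

def investigate(hypotheses):
    """Return the first hypothesis that survives every attempt to disprove it."""
    for name, tests in hypotheses:
        # all() stops at the first failed test: the hypothesis is disproved,
        # so move on and form (here: pick) a new one.
        if tests and all(test() for test in tests):
            return name
    return None  # nothing survived; go back and observe some more

if __name__ == "__main__":
    print(investigate(HYPOTHESES))  # prints "bad or missing data"
```

A hypothesis that survives is not proven, only not yet disproved. In practice you keep adding tests until you trust it enough to start on a solution.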

Strategy: Scientific Method, Analysis - Pro

Works even if you are not familiar with domain
Experience or Divide and Conquer complements this strategy
Methodically works toward a solution
Can help reduce the effect of confirmation bias
Good Strategy

As strategies go, the scientific method is a best practice for troubleshooting.

The scientific method is not comfortable for many people. They want to go quicker; skip steps; get to the conclusion intuitively. Of all of the methods we humans have used to solve problems, this has been the most successful. It can be led astray, but it is more consistent than any other approach we've found.

Strategy: Scientific Method, Analysis - Con

Most people find this to be hard
Can be hard to keep your biases out of the "form a hypothesis" stage
Focus on why, not where

Almost everyone wants to jump ahead. It's hard to come up with a hypothesis that is not so specific as to be practically useless.

Strategy: Scientific Method, Hypothesis

Is it a UI/front-end effect or a back-end issue?
Is it in system A or system B?
Could it be bad data?
Could the user have typed something wrong?
Other ideas?

The hardest part of the scientific method is forming a testable hypothesis. In looking at a software problem, this usually involves trying to come up with an idea for a failure that would generate the symptom we see. A good start is to identify which system/library/component is likely to cause the problem.

Multiplier: Experience with System/Domain

Many senior developers can use this
Understand failure modes and context surrounding them
Mostly just requires a good memory (or knowledge base articles)
Pro: Very effective in a known area of expertise
Con: Not as effective in a new domain
Levels up actual strategies

This isn't actually a strategy; it is a multiplier for the strategies above. Developing experience requires a lot of time, so you normally only see it used effectively by the most experienced members of your team. Unfortunately, it is also almost impossible to teach. The troubleshooter needs to be knowledgeable not only about the system or domain, but also about the context in which the system executes.

Other thoughts

Reproducibility
Critical thinking
Most probable problems/solutions first
- Think horses, not zebras.
KISS principle
Follow a hunch if you like, but time-box it

If you can't reproduce a problem, it's very hard to fix it. Wishful or magical thinking never helps in the long term with troubleshooting.

"Think horses, not zebras." Check/eliminate the obvious possibilities first. If something changed recently, if it's a new partner, if traffic has gone up, these are probably causes of new symptoms.

Save complicated hypotheses and problems for after you have eliminated the simple possibilities. It's overwhelmingly more likely that we got a bad request than that something has fundamentally changed with the way Docker runs on one of our VMs.

Sometimes, but rarely, a random hunch works out, especially as you get more and more familiar with a system. Try it out, but limit your time spent. I normally try not to spend more than 5-10 minutes testing out a hunch. It gets bad when you suddenly realize that you've spent a few hours running down hunches and have made minimal real progress toward finding the problem. Go back to our best practice strategies.

Barriers to Troubleshooting

Confirmation bias
Mental model/implicit assumptions
Unnecessary constraints (on your thinking)
Rushing to a solution too soon
Irrelevant information

Watch out for ignoring data that does not fit your current hypothesis, or looking specifically for data that matches what you think the problem is. Don't guess, observe.

The way you understand the system will affect how you approach troubleshooting it. The unconscious assumptions about how you think the system works will constrain how you approach solving it. If your assumptions are wrong, it will be hard to find a solution.

It's easy to get distracted by information that does not apply to the problem. If you run into something that does not directly relate, make note of it and continue with your strategy. You can come back to it if it becomes relevant. Not being distracted is more important.

Questions?

CoverMyMeds

February 1, 2017