I have my methods.
Give the audience a little time to make suggestions.
Troubleshooting does not usually start with the problem itself; it starts with a report of symptoms. My percentages are really just an estimate, but they seem reasonable. Usually, the person reporting the problem can only report what they see, and what they see are symptoms. They have no real way to gain insight into what is actually wrong.
A large number of the problems I have worked on have turned out to be completely different from what was reported or what I first thought. Most of these are much easier to solve if you don't assume that the obvious has already been ruled out.
The first question eliminates a surprising number of user-reported problems. We had a great example of the second question last week, when a partner reported some strange errors. The next two questions go together.
If the system was working before and suddenly stopped, something must have changed. It may be a change to the code, configuration, environment, load, or something else that is not obvious. Identifying those changes may be the key to finding the problem. The important thing to remember about the last question is that correlation does not equal causation. A change is often related to a new problem, but it can also be pure coincidence.
The Telescan delayed server war story.
In the first stage, it is critically important to observe and query, and avoid making assumptions or interpreting the data. Unfortunately, this is pretty hard to do in general. Many troubleshooting failures are caused by missing this step, with large amounts of time spent looking in the wrong direction.
One important thing about computer problems: the computer is too stupid to lie. Humans can tell you things that aren't true (often not intentionally), and you are one of the humans who can lie to you. Wherever possible, get the data from the computer rather than assuming you know or believing what someone tells you.
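For example, instead of accepting "the service keeps throwing errors" at face value, pull the actual numbers out of the machine. A minimal sketch in Python, assuming a plain-text log with lines that contain an "ERROR" marker (the path and format are hypothetical; adjust them to your system):

```python
from collections import Counter
from pathlib import Path

# Hypothetical log location and format; substitute your own.
LOG_PATH = Path("/var/log/myapp/app.log")

def tally_errors(log_path: Path) -> Counter:
    """Count error lines by message so the data, not a guess, drives the next step."""
    counts = Counter()
    with log_path.open(encoding="utf-8", errors="replace") as log:
        for line in log:
            if "ERROR" in line:
                # Use everything after the ERROR marker as the message key.
                counts[line.split("ERROR", 1)[1].strip()] += 1
    return counts

if __name__ == "__main__":
    for message, count in tally_errors(LOG_PATH).most_common(10):
        print(f"{count:6d}  {message}")
```

Even a crude count like this tells you whether the reported symptom matches what the computer actually recorded.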
Most of what you see written about troubleshooting focuses on techniques and tactics. However, without an effective strategy, it is difficult to apply any technique effectively.
That's why this talk will focus more on strategy than tactics.
To a junior troubleshooter, a troubleshooter with a lot of experience seems to pull understanding and solutions out of thin air. The junior troubleshooter might try to use guesswork to emulate the more senior troubleshooter.
If it ever works, even once, this technique becomes really hard to abandon. It impresses everyone, which gives you an incentive to try it again.
Part of what makes this approach seductive is the Dunning-Kruger effect. You don't know enough to evaluate how little you know.
This also happens with senior troubleshooters who end up working on a different system or in a different environment. They are used to being able to intuit the cause of a problem just by looking at the system, but in an unfamiliar system that intuition can be completely wrong.
More senior troubleshooters also fall for the Dunning-Kruger effect. Their confidence is based on knowledge of a different system, which causes them to underestimate what they don't know.
Many troubleshooters would love to be able to pull this off. Our usual belief that we are mostly rational and intelligent fits nicely with this as a strategy.
This strategy is pretty obvious to anyone who understands binary search.
It's also only this effective if each question divides the search space in half.
As strategies go, divide and conquer is really a best practice for troubleshooting. The core idea is to narrow down where in the system the problem is occurring, then investigate there to find the cause of the problem.
Choosing good questions/hypotheses is key. If done well, you should not care which way the test goes. Either result cuts the search space in half.
There are a small number of circumstances where it won't be helpful (mostly when you can't replicate an issue), but in the overwhelming majority of cases, it works well.
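When the search space has a natural order (commits, releases, records in a batch), the partitioning step can literally be a binary search. A minimal sketch in Python; first_failing and is_good are hypothetical names, and is_good is whatever check answers "does the problem appear with this candidate?" in your system:

```python
def first_failing(candidates, is_good):
    """Binary search for the first candidate where is_good() flips from True to False.

    Assumes the candidates are ordered (commits, releases, config revisions, ...),
    that the last one fails, and that once the failure appears it stays.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(candidates[mid]):
            lo = mid + 1   # problem was introduced after mid
        else:
            hi = mid       # mid already fails; look earlier
    return candidates[lo]

# Hypothetical usage: each call deploys (or checks out) a release and runs a smoke test.
# culprit = first_failing(release_tags, deploy_and_smoke_test)
```

Each check cuts the remaining candidates in half, no matter which way it comes out.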
The scientific method has been a very effective approach in the pursuit of knowledge. One of the important parts of the scientific method is its ability to reduce the effects of bias and error. While not guaranteed to find an answer, it has proven to be one of our most effective tools of inquiry.
The most important test of a hypothesis is an attempt to disprove it. If you try hard to disprove your hypothesis and fail, that builds confidence in it.
After several tests strengthen the hypothesis, you develop more confidence in it. Once you have a solid hypothesis describing the problem, you can begin developing a solution.
Complements the Divide and Conquer strategy: use it to pick the points for partitioning the problem.
As strategies go, the scientific method is a best practice for troubleshooting.
The scientific method is not comfortable for many people. They want to move faster, skip steps, and get to the conclusion intuitively. But of all the methods we humans have used to solve problems, this has been the most successful. It can be led astray, but it is more consistent than any other approach we've found.
Almost everyone wants to jump ahead. It's hard to come up with a hypothesis that is not so specific as to be practically useless; a hypothesis that is really just a guess at the exact cause eliminates almost nothing when it is disproven.
The hardest part of the scientific method is forming a testable hypothesis. In looking at a software problem, this usually involves trying to come up with an idea for a failure that would generate the symptom we see. A good start is to identify which system/library/component is likely to cause the problem.
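One way to keep the hypothesis honest is to write it down as something a test can falsify. A minimal sketch in Python; the parse_name stub and the non-ASCII failure mode are invented for illustration, so substitute the real component and the real hypothesis:

```python
import unittest

def parse_name(raw: str) -> str:
    """Stand-in for the real component under suspicion; replace with the actual call."""
    return raw.encode("ascii").decode("ascii")  # invented failure mode: rejects non-ASCII

class NonAsciiHypothesis(unittest.TestCase):
    """Hypothesis: the failure only occurs when the name contains non-ASCII characters.

    Each test tries to *disprove* part of that claim. The hypothesis gains confidence
    only if the ASCII case succeeds and the non-ASCII case fails exactly as predicted.
    """

    def test_ascii_name_succeeds(self):
        # If this fails, the problem is not limited to non-ASCII input.
        self.assertEqual(parse_name("Alice"), "Alice")

    def test_non_ascii_name_reproduces_the_failure(self):
        # If no error is raised here, the hypothesis is wrong and we need a new one.
        with self.assertRaises(UnicodeEncodeError):
            parse_name("Łukasz")

if __name__ == "__main__":
    unittest.main()
```

If the non-ASCII case refuses to fail, the hypothesis is disproven and you go back to observing.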
This isn't actually a strategy. Developing experience requires a lot of time, and you normally only see this approach work in the hands of the most experienced members of your team. Unfortunately, it is also almost impossible to teach. It requires knowledge of the system or domain: not only does the troubleshooter need to be knowledgeable about the system itself, they also need experience with the context in which the system executes.
If you can't reproduce a problem, it's very hard to fix it. Wishful or magical thinking never helps in the long term with troubleshooting.
"Think horses, not zebras." Check/eliminate the obvious possibilities first. If something changed recently, if it's a new partner, if traffic has gone up, these are probably causes of new symptoms.
Save complicated hypotheses and problems for after you have eliminated the simple possibilities. It's overwhelmingly more likely that we got a bad request than that something has fundamentally changed with the way Docker runs on one of our VMs.
Sometimes, but rarely, a random hunch works out, especially as you get more and more familiar with a system. Try it out, but limit the time you spend. I normally try not to spend more than 5-10 minutes testing out a hunch. It gets bad when you suddenly realize that you've spent a few hours running down hunches and have made minimal real progress toward finding the problem. Go back to our best-practice strategies.
Watch out for ignoring data that does not fit your current hypothesis, or looking only for data that matches what you think the problem is. Don't guess; observe.
The way you understand the system will affect how you approach troubleshooting it. Your unconscious assumptions about how the system works will constrain how you approach solving the problem. If those assumptions are wrong, it will be hard to find a solution.
It's easy to get distracted by information that does not apply to the problem. If you run into something that does not directly relate, make note of it and continue with your strategy. You can come back to it if it becomes relevant. Not being distracted is more important.
An important piece of the troubleshooting process is describing the problem. If the problem is not described well, the developer will either spend too much time trying to reproduce it or never fix it at all. A well-written problem/bug report can make a huge difference. The act of putting a good report together can also make it easier to spot the problem behind the symptoms.
Sometimes the problem is that the user is trying to do something that shows a misunderstanding of the system. Stating the task up front can help lead the user to the right goal.
Sometimes the problem starts before the user thought it did. Knowing what the user did before the action started is important. (Setting up the wrong kind of request, incognito mode, etc.)
Be explicit about every step taken. Leaving one out can make the problem impossible to duplicate.
Expectations are important. Sometimes the user's expectation does not match the intended result, even if everything is working fine. That may point to a problem with how results are reported, but it does not necessarily mean the system is broken.
It's best to have a really explicit description of what happened. Was there an error message? What was it? Did the intended action happen and just wasn't reported immediately?
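Put together, the notes above amount to a short report template. A rough sketch; the field names are just suggestions, not a required format:

```
Task / goal:      What was the user trying to accomplish?
Starting state:   What happened before the first step (account, browser mode, type of request, ...)?
Steps taken:      Every step, in order, with nothing left out.
Expected result:  What the user thought should happen.
Actual result:    Exactly what happened, including any error message, verbatim.
```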