Troubleshooting: The Lost Skill?
January 28, 2014
This blog entry comes from an email I received this morning asking me to “check in files”, as my failure to do so was causing a work stoppage. After a very short examination, I found the following:
- I had no files checked out (absolutely none)
- The problem was a permissions problem, not a check out problem. The person who could not check in their files was being stopped not by my failure to act, but by someone else’s incorrect granting of permissions
- I had no permissions to solve the problem (i.e. grant the correct permissions)
Further investigation of the problem would have revealed it was a permissions issue. In this case, the only consequence is another lost day of productivity, and a wonderful opportunity to learn. In some cases, the consequences are more dire. Consider, for example, Jack Welch, the former CEO of GE.
Jack made an assumption and ended up destroying a manufacturing plant. In one telling of Jack Welch’s story, the dialog goes something like this:
Jack: Aren’t you going to fire me now?
Manager: Fire you? I just spent 2 million dollars training you.
Considering Jack Welch is now one of the most successful executives of all time, it is good his manager was able to troubleshoot the aftermath to a problem Jack had worked through on assumption. The point is plain: When we don’t troubleshoot a problem, we go on assumptions. In the email I received this morning, there was an assumption I had files checked out. Rather than test the assumption, work stopped.
Tony Robbins tells a story in his Personal Power program about a suit of armor. As he is walking on stage, every time he moves close to a suit of armor there is feedback. The audience eventually starts screaming at him “it’s the armor”. But he continues to stand near the armor and the feedback eventually goes away. He then moves away and the feedback comes back. Turns out there was a fire, and the emergency messages on a very close frequency were interfering with the microphone.
Personally, I think the above story is a myth, as I know the FCC is very careful about doling out bands and it is unlikely a microphone has the same band as emergency services. But this is also an assumption, and proper troubleshooting would have me examining the issue.
The path of least resistance
On Facebook … nay, on the Internet as a whole, a large majority of items are written out of assumptions or biases, and not an examination of the facts. For most people, whether you agree with Obamacare or not is not an exercise in examining the facts completely and then drawing conclusions. Instead, a quick sniff test is done to determine if you feel something smells, and then action is taken.
Let’s take an example. In 2006, the news media reported on the Duke Lacrosse team raping an African American stripper. The case seemed open and shut, as the evidence piled up. Duke University suspended the team and when the lacrosse coach refused to resign, Duke’s president cancelled the rest of the season. The case seemed so open and shut, Nancy Grace (CNN) was suggesting the team should be castrated.
When the assumptions were removed, a completely different story was told. Not only was the evidence thin, much of it was manufactured. The District Attorney, Mike Nifong, was disbarred and thrown in jail for contempt of court.
We can also look at the George Zimmerman case, where the initial wave of “evidence” painted another “open and shut” case. But the “open and shut” case, based on assumptions, began to crumble when it was discovered that NBC had edited the 911 tape to paint Zimmerman as a racist, and that the video and picture evidence had been carefully chosen to paint a picture of a man who had no wounds and was the obvious aggressor.
The point here is not to rehash these cases, but to point out that assumptions can lead to incorrect conclusions. Some of these assumptions may lead to dire consequences, while most just result in a less than optimal solution.
Now to the title of the section: The path of least resistance.
When we look at the natural world, things take the path of least resistance. Water tends to travel downhill, eroding the softest soil. Plants find the best path to the sunlight, even if it makes them crooked. Buffalo would rather roar at each other to establish dominance than fight, as the fighting takes precious energy. And humans look for the least amount of effort to produce a result.
Let’s pop back to Obamacare, or the PPACA (Patient Protection and Affordable Care Act), as it illustrates this point. Pretty much everyone I encounter has an opinion on the subject. In fact, you probably have an opinion. But is the opinion based on assumption? You might be inclined to say no, but have you actually read the bill? If not, then you are working on distillations of the bill, most likely filtered through the sites you like to visit on a regular basis. And, more than likely, you have chosen these sites as they tend to fit your own biases.
I am not deriding you for this choice. I only want you to realize this choice is based more on assumptions than troubleshooting. Troubleshooting takes some effort. In most cases, not as much as reading a 900+ page bill (boring) or many more thousands of pages of HHS rules (even more boring). But, by not doing this, your opinion is likely based on incomplete, and perhaps improper, facts.
Answering Questions
I see questions all the time. Inside our organization, I see questions for the Microsoft Center of Excellence (or MSCOE). I have also spent years answering online questions in forums. The general path is:
- Person encounters problem
- Person assumes solution
- Person asks, on the MSCOE list, for help with the assumed solution – In general, the question is a “How do I wash a walrus” type of question rather than one with proper background on the actual business problem and any steps (including code) taken to attempt to solve it
- Respondent answers how to solve the problem, based on their own assumptions, rather than using troubleshooting skills and asking questions to ensure they understand the problem
- Assumed: Person implements solution – While the solution may be inferior, this is also “path of least resistance” and safe. If the solution fails, they have the “expert” to blame for the problem (job security?). If it succeeds, they appear to have the proper troubleshooting skills. And very little effort expended.
What is interesting is how many times I have found the answer to be wrong when the actual business problem is examined. Here are some observations.
- The original poster, not taking time to troubleshoot, makes an assumption on the solution (path of least resistance)
- Respondent, taking path of least resistance, answers the question as posted, with links to people solving that problem
- If the original poster had used troubleshooting skills, rather than assumptions, he would have thrown out other possibilities, included all relevant information to help others help him troubleshoot, and would have expressed the actual business problem
- If the respondent had used troubleshooting skills, rather than assumptions (primarily the assumption the poster had used troubleshooting skills), he would have asked questions before giving answers.
To illustrate this, I once saw a post similar to the following on a Microsoft forum (meaning this is paraphrased from memory).
Can anybody help me. We have a site that has been working for years in IIS 4. We recently upgraded to Windows Server 2008 and the site is no longer serving up asp.net files located at C:\MyFiles. I just hate Microsoft right now, as I am sure it is something they changed in windows. I need to get this site up today, and f-ing Microsoft wants to charge for technical support.
The first answers dealt with how to solve the problem by turning off the feature in IIS that stops the web server from serving files outside of the web directory structure. While this was a solution, troubleshooting the problem would have shown it was a bad solution.
Imagine the user had written this instead.
We recently upgraded to Windows Server 2008 and the site is no longer serving up asp.net files located at C:\Windows\System32.
Turning off the feature in IIS would still have solved the problem, but there is now an open path directly to the system for hackers. And, if this is the way the person implements the solution, there are likely other problems in the code base that will allow the exploit.
The proper troubleshooting would have been to first determine why ASP.NET files were being served from C:\MyFiles instead of IIS directories. Long story, but the reason had to do with an assumption that developing on a developer box generally, perhaps always, led to sites that did not work in production. So every developer was working on a production server directly. The C:\MyFiles folder was created from an improper assumption about security: that it was more secure to have developers working from a share than an IIS directory. This led to kludges to make the files work, which failed once the site was moved to a server with a version of IIS that stopped file and folder traversing. This was done as a security provision, as hackers had learned to put in a URL like:
http://mysite.com/../../../Windows/System32/cmd.exe%20;
Or similar. I don’t have the actual syntax, but it was close to the above and it worked. So IIS stopped you from using files outside of IIS folders. Problem solved.
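To make the traversal check concrete, here is a minimal Python sketch of the idea. This is not how IIS implements it. The web root /srv/www/mysite and the function name resolve_request are my own assumptions for illustration, and a real server also has to deal with URL encoding, which this sketch ignores.

```python
import posixpath
from typing import Optional

WEB_ROOT = "/srv/www/mysite"  # hypothetical web root, not from the original post


def resolve_request(url_path: str, web_root: str = WEB_ROOT) -> Optional[str]:
    """Return the normalized file path for a request, or None if it escapes the web root."""
    # Join the requested path onto the root, then collapse any ".." segments.
    candidate = posixpath.normpath(posixpath.join(web_root, url_path.lstrip("/")))
    # Serve the file only if it is still inside the web root after normalization.
    if candidate == web_root or candidate.startswith(web_root + "/"):
        return candidate
    return None


print(resolve_request("/index.aspx"))                         # /srv/www/mysite/index.aspx
print(resolve_request("/../../../Windows/System32/cmd.exe"))  # None
```

The check happens after the path is normalized, so a request cannot hide its real destination behind a pile of ".." segments.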
Now, there are multiple “solutions” to the poster’s problem:
- Turn off the IIS feature and allow traversing of directories. This makes the site work again, but also leaves a security hole.
- Go into IIS and add the C:\MyFiles folder as a virtual directory. This is a better short term solution than the one above. I say short term, as there is some administrative overhead to this solution that is not needed in this particular case.
- Educate the organization on the proper way to set up development. This is not the path of least resistance, but a necessary step to get the organization on the right path. This will more than likely involve solving the original problem that created the string of kludges that ended with a post blaming Microsoft for bringing a site down.
Troubleshooting The Original Problem
I am going to use the original “Check in your files” problem to illustrate troubleshooting. The formula is general enough you can tailor it to your use, but I am using specifics.
First, create a hypothesis.
I cannot check in the files, so I hypothesize Greg has them checked out.
Next, try to disprove the hypothesis. This is done by attempting to find checked out files. In this case, the hypothesis would have easily been destroyed by examining the files and finding that none were checked out.
So the next step would be to set up another hypothesis. But let’s assume we found this file as “checked out”. The next step is to look at the person who has the file checked out to ensure the problem is “Greg has the file checked out” and not “someone has the file checked out”.
Since the name Greg Beamer is not here, even if the file were checked out, he cannot solve the problem.
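The disproof step can be sketched in a few lines of Python. The file names, the FileRecord type, and the data are hypothetical stand-ins for whatever your source control or SharePoint library reports. The point is only that the hypothesis gets checked against data before anyone is emailed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileRecord:
    name: str
    checked_out_by: Optional[str]  # None means the file is not checked out


def files_checked_out_by(records: List[FileRecord], person: str) -> List[str]:
    """Files the named person has checked out."""
    return [r.name for r in records if r.checked_out_by == person]


def files_checked_out_by_others(records: List[FileRecord], person: str) -> List[str]:
    """Files checked out by anyone except the named person."""
    return [r.name for r in records if r.checked_out_by not in (None, person)]


# Hypothetical library contents, made up for illustration.
records = [
    FileRecord("home.aspx", None),
    FileRecord("about.aspx", None),
    FileRecord("logo.png", "Someone Else"),
]

# Hypothesis: "Greg has the files checked out."
hypothesis_holds = bool(files_checked_out_by(records, "Greg Beamer"))
print(hypothesis_holds)                                     # False
print(files_checked_out_by_others(records, "Greg Beamer"))  # ['logo.png']
```

The second function covers the “someone has the file checked out” case, which is a different hypothesis with a different person to email.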
Next, even if you have a possible solution, make sure you eliminate other potential issues. In this case, let’s assume only some of the files were checked out when examined, but the user was still having problems uploading. What else can cause this issue?
Here is what I did.
- Assume I do have things checked out first, as it is a possible reason for the problem. When that failed, look at the user’s permissions on the files in question. I found this:
- Hypothesis: User does not have proper permissions. Attempted solution: Grant permissions
- Found out permissions were inherited, so it was not a good idea to grant at the page level. Moving up to the site level required opening the site in SharePoint Online, where I found the same permissions.
- Now, my inclination is to grant permissions myself, but I noticed something else.
which further led to this (looking at Site Collection Images):
The permissions here are completely different. The user is not restricted, so he can access these.
I did try to give permissions to solve the issue:
But ended up with incomplete results:
I further got rid of the special permissions on some folders, as they were not needed. They were more than likely added to give the developer rights to those folders. I still have the above, however, which means someone more skilled needs to solve the problem.
The point here is numerous issues were found, none of which were the original hypothesis, which was reached via an assumption. The assumption was
I cannot check in, therefore I assume someone has the files checked out. Since Greg is the only other person I know working on the files, I assume he has them checked out.
Both assumptions were incorrect. But that is not the main point. The main point is, even if they were correct, are there any other issues? As illustrated, there were numerous issues that needed to be solved.
Summary
Troubleshooting is a scientific endeavor. Like any experiment, you have to state the problem first. If you don’t understand a problem, you can’t solve it.
You then have to form a hypothesis. If it fails, you have to do over, perhaps even redefining the problem. You do this until you find a hypothesis that works.
After you solve the problem, you should look at other causes. Why? Because a) you may not have the best solution, or b) you may still have other issues. This is a step that is missed more often than not, especially by junior IT staff.
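Here is one way to picture that loop in Python. It is a rough sketch under my own assumptions: the troubleshoot function, the facts dictionary, and the hypothesis wording are invented for illustration and only loosely mirror the check in problem from this post.

```python
from typing import Callable, List, Tuple

# A hypothesis is a description plus a check that returns True if it holds.
Hypothesis = Tuple[str, Callable[[], bool]]


def troubleshoot(hypotheses: List[Hypothesis]) -> List[str]:
    """Test every hypothesis and return the descriptions of all that hold."""
    confirmed = []
    for description, check in hypotheses:
        if check():
            confirmed.append(description)
        # Deliberately no early exit: one confirmed cause does not rule out others.
    return confirmed


# Hypothetical findings, made up for illustration.
facts = {
    "files_checked_out_by_greg": False,
    "user_can_write": False,
    "folders_with_special_permissions": 2,
}

hypotheses = [
    ("Greg has the files checked out", lambda: facts["files_checked_out_by_greg"]),
    ("User lacks write permission", lambda: not facts["user_can_write"]),
    ("Some folders have unneeded special permissions",
     lambda: facts["folders_with_special_permissions"] > 0),
]

issues = troubleshoot(hypotheses)
print(issues)
```

With these made up facts, the first hypothesis is thrown out and the other two both come back, which is the whole argument for not stopping at the first thing that looks like an answer.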
Let me end with a story on the importance of troubleshooting:
Almost everyone I know that has the CCIE certification took two tries to get it. If you don’t know what CCIE is, it is the Cisco Certified Internetwork Expert certification. It is considered one of the most coveted certifications and one of the hardest to attain. The reason is you have to troubleshoot rather than simply solve the problem.
The certification is in two parts. A written exam, which most people pass the first time, and a practical exercise, which most fail. The practical exercise takes place over a day and has two parts:
- You have to set up a network in a lab according to specifications given at the beginning of the day.
- After lunch, you come back to find something not working and have to troubleshoot the problem
Many of the people I know that failed the first time solved the problem and got the network working. So why did they fail? They went on assumptions based on problems they had solved in the past rather than working through a checklist of troubleshooting steps. Talking to one of my CCIE friends, he explained it this way (paraphrased, of course):
When you simply solve the problem, you may get things working, but you may also end up taking shortcuts that cause other problems down the line. Sometimes these further problems are more expensive than the original problem, especially if they deal with security.
Sound similar to a problem described in this blog entry? Just turn off IIS directory traversing and the site works. Both the poster and the hacker say thanks.
Please note there are times when the short solution is required, even if it is not the best. There are time constraints that dictate a less than optimal approach. Realize, however, this is technical debt, which will eventually have to be paid. When you do not take the time to troubleshoot, and run on assumption, build in time to go back through the problem later, when the crisis is over. That, of course, is a topic for another day.
Peace and Grace,
Greg
Twitter: @gbworld