How to Debug Systems (HtDS) — An Introduction


The ideas and recipes focus mostly on the system aspect, so they can also help with non-software processes, and people from different backgrounds and roles (e.g. programmers, testers, designers, managers, users) should benefit from them. This point is crucial because even a programmer may not be able to apply every step in a recipe, for many reasons (e.g. the targeted program is an embedded system in a consumer device).

The word bug is a misnomer, and issue or ticket may be more appropriate, but each person has their own concept of each word anyway, and I personally feel more comfortable using bug. The word system is used instead of program because it is appropriately broader in both the software and the non-software sense.

Each step in each recipe is a topic of its own and deserves its own post, so I will explain them only briefly to demonstrate the general approach. Also, each step is generalized to cover most bugs and projects, and a different way of doing things that fits your case is always better than the generic one.

Core Ideas

Problems, Solutions, and Probing Space

Layers of Abstraction & Switching Gears

Chain of Assumptions

Core Steps

Getting the System

Producing the Bug

In most cases, a bug is better produced in the large than in the small. By in the large I mean as high-level and as close to the user as possible; by in the small I mean as low-level and as close to the code as possible. One reason is that producing the bug in the large covers the whole interaction, and it is generally harder to accurately produce a bug in the small.

Assume the simplest scenarios first. Some major advantages to this approach are the following:

  1. Most bugs are simple and assuming the opposite can waste a lot of time.
  2. You will find the simplest scenario for most bugs.
  3. You can eliminate big parts of the system most of the time very quickly.
  4. You will find very tricky simple bugs that are easily producible in the simplest scenarios.
  5. You will start understanding and working with systems as a combination of smaller parts that are easier to reason about and diagnose.
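The simplest-scenarios-first idea can be sketched in a few lines. This is only an illustration under stated assumptions: the three-stage system (`parse`, `transform`, `render`), the seeded bug, and the scenario list are all hypothetical and invented for the example. The point is that each scenario goes through the whole system from the user's input, and the scenarios are tried from simplest to most complex.

```python
from typing import List, Optional, Tuple

# Hypothetical system under test: a pipeline of three stages.
# All names here are invented for this sketch.
def parse(text: str) -> List[str]:
    return text.split(",")

def transform(items: List[str]) -> List[int]:
    # Seeded bug for the demo: an empty field makes int() raise ValueError.
    return [int(item) for item in items]

def render(numbers: List[int]) -> str:
    return " + ".join(str(n) for n in numbers)

def run_system(text: str) -> str:
    # "In the large": the input enters where the user would enter it.
    return render(transform(parse(text)))

def first_failing_scenario(scenarios: List[Tuple[str, str]]) -> Optional[str]:
    """Try scenarios in the given order and return the name of the first that raises."""
    for name, user_input in scenarios:
        try:
            run_system(user_input)
        except Exception:
            return name
    return None

# Ordered from simplest to most complex.
SCENARIOS = [
    ("single value", "1"),
    ("two values", "1,2"),
    ("trailing comma", "1,2,"),
    ("large mixed input", "1,2,,3,x,4"),
]

if __name__ == "__main__":
    print(first_failing_scenario(SCENARIOS))
```

Starting from the large mixed input would also fail, but it would not say whether the empty field or the letter is to blame. The simpler scenario isolates one cause and rules out the passing cases at the same time.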

Not Producing a Bug

Understanding the Bug

  1. To know the scope of the bug and what parts are affected.
  2. To determine whether the bug is malign or benign.
  3. To better communicate with other members of the team.
  4. To know which type of fix is best (e.g: quarantine, mitigation, treatment).
  5. To know whether it is a disease or a symptom of a root cause.
  6. To know how subjective it is and whether it would be a feature if you were in a different mood.
  7. To know how stealthy it is (e.g. a ninja/silent bug).
  8. To determine how contagious it is to other parts of the system.
  9. To know how much DPM (Damage Per Minute) it is causing.
  10. To know the cost (engineering, lost revenue) as a function of time of both a possible fix and a non-fix.
  11. To know whether it is easier to fix it or to make it worse.
  12. To compare other similar systems and use them as a baseline if possible.
  13. To know how solvable it is and whether an advanced technique is necessary for optimal results (e.g. changing the bug/problem to solve).
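The DPM and cost-over-time points lend themselves to a small worked calculation. The sketch below assumes a deliberately naive linear model (constant damage per minute, a fixed engineering cost, and a fixed time to ship the fix); real costs are rarely this tidy. The function names and all the numbers are hypothetical.

```python
from typing import Optional

def cost_of_not_fixing(damage_per_minute: float, minutes: float) -> float:
    """Accumulated damage if the bug is left alone for `minutes`."""
    return damage_per_minute * minutes

def cost_of_fixing(engineering_cost: float,
                   damage_per_minute: float,
                   minutes_to_ship: float) -> float:
    """Engineering cost plus the damage taken while the fix is being shipped."""
    return engineering_cost + damage_per_minute * minutes_to_ship

def break_even_minutes(engineering_cost: float,
                       damage_per_minute: float,
                       minutes_to_ship: float) -> Optional[float]:
    """Time after which not fixing has cost as much as fixing.

    Returns None when the bug causes no damage, because under this
    model the fix then never pays for itself.
    """
    if damage_per_minute <= 0:
        return None
    return minutes_to_ship + engineering_cost / damage_per_minute

if __name__ == "__main__":
    # Hypothetical numbers: 600 units of engineering, 2 units of damage
    # per minute, 120 minutes to ship the fix.
    print(break_even_minutes(600, 2, 120))
```

If the system is expected to stay in use well past the break-even point, a fix looks worthwhile under this model; if not, the zero-solution becomes more attractive.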

Closing the Bug

  1. The zero-solution: do nothing. This should always be considered first. You can replace it with something else that has the same effect (e.g. go play Minecraft).
  2. Wait for the bug to disappear. This may be appropriate for transient bugs that have low impact, don’t happen often, and are understood well enough. It can also be appropriate if the bug is hard to understand or produce, or sits in a part of the system with too many landmines.
  3. Replace it with a different implementation (e.g: internal, external service). This may be appropriate for non-core parts of the system.
  4. Amputate it: cut out the smallest part of the system that contains the bug. This may be appropriate if it’s part of a system that turned out to be more expensive to get right than expected and it has a low ROI.
  5. Wrap the affected part in a way that handles the bug. This may be appropriate if the system behavior is not well understood and this part of the system is not expected to change often, because wrapping can have big negative effects in the long run for parts that change a lot.
  6. Replace it with one or more bugs. One example is when the bug turns out to be completely different from the one initially reported.
  7. Make the error visible/discoverable. This may be appropriate if the major impact is in the error being invisible and the cost of a fix is too high.
  8. Fix the core where the incorrect result is first returned. This may be appropriate if the affected part is relatively loosely coupled to the rest of the system.
  9. Fix the neighbors. May be appropriate if the core incorrect behavior is important to the neighbors or simplifies the overall system.
  10. Make the bug and incorrect behavior close to it impossible. This may be appropriate if you have a good mental model of the affected part of the system or the desired behavior is too different from the current implementation.
  11. Solve a conceptually different, much easier problem. May be appropriate a looot more often than you might think.
  12. Provide a valid excuse. May be appropriate if logically the desired behavior is impossible to achieve or only non-technical solutions can solve the problem.
  13. Just type the following password as a reply in any standard issue-tracker (Slack could work too): “It works on my machine.” May be appropriate if you are a jerk.
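Two of the options above, wrapping the affected part and making the error visible, are easy to show together. This is a minimal sketch, assuming a hypothetical function `legacy_average` stands in for the affected part and that a default value is an acceptable result for the bad input. None of these names come from a real code base.

```python
import logging

logger = logging.getLogger("htds.sketch")  # hypothetical logger name

def legacy_average(values):
    # Stand-in for the affected part: raises ZeroDivisionError on an empty list.
    return sum(values) / len(values)

def safe_average(values, default=0.0):
    """Wrap the affected part: handle the known bad input and make it visible."""
    try:
        return legacy_average(values)
    except ZeroDivisionError:
        # The warning is what keeps the bug from staying silent.
        logger.warning("legacy_average got no values; returning %s", default)
        return default

if __name__ == "__main__":
    print(safe_average([2, 4]))
    print(safe_average([]))
```

The wrapper leaves `legacy_average` untouched, which is why it fits parts that are poorly understood but rarely change. If that part changed often, every change would need to be checked against the wrapper's assumptions as well.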

Ensuring the Bug Remains Closed

Not quite, there are other non-technical reasons that can make this false. For example, the same bug can resurface many times dressed differently, and the eyes of users or management can get used to bugs until the system magically becomes bug-free.

On the serious side, you should strive for the following metrics:

  1. 0 confirmed bugs with agreed upon incorrect behavior.
  2. 0 recurring bugs.

This is possible because not all bugs are easily discoverable or have agreed upon incorrect behavior. It may also be cost-prohibitive to achieve. Still, striving for it makes the system more correct, because writing it off can gradually make the system buggier over time. These metrics can also highlight problems. For example, if you build an overly complex system or work with requirements that need an infinite amount of time and money, you will find these metrics plain ridiculous.

Some ways to ensure bugs remain closed are the following (mostly test-automation related):

  1. Deterministic, fast, cost-effective way to test correctness: test-automation for any practical application.
  2. Avoiding expensive automated tests for parts that rarely change.
  3. Avoiding automated tests at a level that changes with the code when another level is more future-proof and is sufficient for most intents and purposes.
  4. Avoiding individual numerous expensive tests when a smaller number of tests is sufficient for most intents and purposes.
  5. Writing tests early because the cases can be forgotten later and the damage may already be done then.
  6. Avoiding slow tests. This can require changing a feature to make most of its functionality testable without expensive operations.
  7. Testing as much of the result and its side effects as possible (e.g. the user data added to the database on registration), because they are the actual desired behavior and not testing them can lead to broken assumptions.
  8. Extracting complex logic, making it as pure as possible, and testing it exhaustively separately.
  9. Minimizing expensive tests (e.g: UI testing) and designing the rest of the system in a way that is easier to test.
  10. Using a driver encapsulating changing UI elements for UI tests.
  11. Focus more on testing all the correct behaviors because the incorrect behaviors are infinite.
  12. Avoid excessive abstractions to make the system more testable.
  13. Be aware of the state-of-the-art for automating any manual tasks you do and handle them holistically with a short-term/long-term balance.
  14. Be careful of cyclic bugs. This can happen when the problem is intractable and you find yourself closing a bug and then returning to it later. It can be especially hard to recognize these bugs because this can happen over months or years.
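A few of the points above (extracting pure logic, testing side effects, and keeping tests fast and deterministic) can be sketched together. The example assumes a hypothetical registration feature with invented rules (a non-empty username and a minimum age of 13) and uses an in-memory SQLite database from Python's standard library so the test stays cheap. It shows one possible shape, not a recommended design.

```python
import sqlite3
from itertools import product
from typing import List

def registration_errors(username: str, age: int) -> List[str]:
    """Pure validation logic: no I/O, so it can be checked over a whole input grid."""
    errors = []
    if not username.strip():
        errors.append("username is empty")
    if age < 13:
        errors.append("user is too young")
    return errors

def register(conn: sqlite3.Connection, username: str, age: int) -> bool:
    """Thin impure shell: validate, then write the user to the database."""
    if registration_errors(username, age):
        return False
    conn.execute("INSERT INTO users (username, age) VALUES (?, ?)", (username, age))
    conn.commit()
    return True

def make_db() -> sqlite3.Connection:
    # In-memory database: fast, isolated, and gone when the connection closes.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, age INTEGER)")
    return conn

def test_validation_grid() -> None:
    # Small grid over both inputs, including the boundary ages 12 and 13.
    for username, age in product(["", "  ", "ann"], [0, 12, 13, 99]):
        should_pass = username == "ann" and age in (13, 99)
        assert (registration_errors(username, age) == []) == should_pass

def test_registration_side_effect() -> None:
    conn = make_db()
    assert register(conn, "ann", 30) is True
    # Check the side effect itself, not only the return value.
    assert conn.execute("SELECT username, age FROM users").fetchall() == [("ann", 30)]
    # A rejected registration must leave the table unchanged.
    assert register(conn, "", 30) is False
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1

if __name__ == "__main__":
    test_validation_grid()
    test_registration_side_effect()
    print("ok")
```

Because the validation is pure, the grid test needs no database at all. Only the thin `register` shell needs one, and a single test of its side effect covers it.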



