How to Debug Systems (HtDS) — An Introduction

Introduction

Core Ideas

Problems, Solutions, and Probing Space

Layers of Abstraction & Switching Gears

Chain of Assumptions

Core Steps

Getting the System

Producing the Bug

  1. Most bugs are simple, and assuming otherwise can waste a lot of time.
  2. You will find the simplest scenario that triggers most bugs.
  3. Most of the time, you can eliminate big parts of the system very quickly.
  4. You will find very tricky simple bugs that are easy to produce in the simplest scenarios.
  5. You will start understanding and working with systems as a combination of smaller parts that are easier to reason about and diagnose.
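
As a rough illustration of finding the simplest scenario and eliminating big parts of the system quickly, here is a minimal sketch of a greedy input shrinker. The `shrink` helper, the `hypothetical_bug` predicate, and the step names are all made up for this example; a real predicate would run the system and report whether the bug still shows up.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shrink(items: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedily drop chunks of `items` while `fails` still reports the bug."""
    current = list(items)
    if not fails(current):
        raise ValueError("the starting scenario must produce the bug")

    chunk = max(len(current) // 2, 1)  # start by trying to drop big parts
    while True:
        i, removed_any = 0, False
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if fails(candidate):
                # The bug survives without this chunk, so the chunk is irrelevant.
                current, removed_any = candidate, True
            else:
                i += chunk  # this chunk is needed; move on to the next one
        if chunk == 1 and not removed_any:
            return current  # no single item can be removed anymore
        chunk = max(chunk // 2, 1)


def hypothetical_bug(steps: List[str]) -> bool:
    # Assumption for the example: the bug needs both a login and an export.
    return "login" in steps and "export" in steps


steps = ["open", "login", "search", "scroll", "export", "logout", "close", "reopen"]
minimal = shrink(steps, hypothetical_bug)
print(minimal)
```

The result is not guaranteed to be the globally smallest scenario, only one from which no single step can be removed, which is usually enough to reason about the bug.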

Not Producing a Bug

Understanding the Bug

  1. To know the scope of the bug and what parts are affected.
  2. To determine whether the bug is malign or benign.
  3. To better communicate with other members of the team.
  4. To know which type of fix is best (e.g., quarantine, mitigation, treatment).
  5. To know whether it is a disease or a symptom of a root cause.
  6. To know how subjective it is, and whether in a different mood you would consider it a feature.
  7. To know how stealthy it is (e.g., a ninja/silent bug).
  8. To determine how contagious it is to other parts of the system.
  9. To know how much DPM (Damage Per Minute) it is causing.
  10. To know the cost (engineering, lost revenue) as a function of time of both a possible fix and a non-fix.
  11. To know whether it is easier to fix it or to make it worse.
  12. To compare other similar systems and use them as a baseline if possible.
  13. To know how solvable it is and whether an advanced technique is necessary for optimal results (e.g., changing the bug/problem to solve).
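
The DPM and cost-over-time points above can be made concrete with a back-of-the-envelope calculation. This is only a sketch under simple assumptions: damage accrues at a constant rate, the fix has a fixed engineering cost, and damage stops once the fix ships. The function names and all the numbers are invented for illustration.

```python
def cost_of_not_fixing(dpm: float, minutes: float) -> float:
    """Total damage if the bug is left alone for `minutes`."""
    return dpm * minutes


def cost_of_fixing(dpm: float, minutes: float,
                   engineering_cost: float, minutes_to_ship: float) -> float:
    """Engineering cost plus the damage done until the fix ships."""
    exposed_minutes = min(minutes, minutes_to_ship)
    return engineering_cost + dpm * exposed_minutes


def break_even_minutes(dpm: float, engineering_cost: float,
                       minutes_to_ship: float) -> float:
    """Time after which fixing becomes cheaper than not fixing."""
    if dpm <= 0:
        return float("inf")  # a bug that causes no damage never pays back a fix
    return minutes_to_ship + engineering_cost / dpm


# Made-up numbers: 2 units of damage per minute, a fix costing 6000 units
# that takes 600 minutes to ship.
t = break_even_minutes(dpm=2.0, engineering_cost=6000.0, minutes_to_ship=600.0)
print(t, cost_of_not_fixing(2.0, t), cost_of_fixing(2.0, t, 6000.0, 600.0))
```

If the system is expected to live for less time than the break-even point, the do-nothing option starts to look reasonable.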

Closing the Bug

  1. The zero-solution: do nothing. This should always be considered first. You can replace it with something else that has the same effect (e.g., go play Minecraft).
  2. Wait for the bug to disappear. This may be appropriate for transient bugs that have low impact, don’t happen often, and are understood well enough. This can also be appropriate if the bug is hard to understand, hard to produce, or in a part of the system with too many landmines.
  3. Replace it with a different implementation (e.g., an internal or external service). This may be appropriate for non-core parts of the system.
  4. Amputate it: cut out the smallest part of the system that contains the bug. This may be appropriate if it’s part of a system that turned out to be more expensive to get right than expected and it has a low ROI.
  5. Wrap the affected part in a way that handles the bug. This may be appropriate if the system’s behavior is not well understood and this part of the system is not expected to change often; for parts that change a lot, a wrapper can have big negative effects in the long run.
  6. Replace it with one or more bugs. One example is when the bug turns out to be completely different from the one initially reported.
  7. Make the error visible/discoverable. This may be appropriate if the major impact is in the error being invisible and the cost of a fix is too high.
  8. Fix the core where the incorrect result is first returned. This may be appropriate if the affected part is relatively loosely coupled to the rest of the system.
  9. Fix the neighbors. May be appropriate if the core incorrect behavior is important to the neighbors or simplifies the overall system.
  10. Make the bug and incorrect behavior close to it impossible. This may be appropriate if you have a good mental model of the affected part of the system or the desired behavior is too different from the current implementation.
  11. Solve a conceptually different, much easier problem. May be appropriate a looot more often than you might think.
  12. Provide a valid excuse. May be appropriate if logically the desired behavior is impossible to achieve or only non-technical solutions can solve the problem.
  13. Just type the following password as a reply in any standard issue-tracker (Slack could work too): “It works on my machine.” May be appropriate if you are a jerk.
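
To make the "wrap the affected part" and "make the error visible" options above a bit more tangible, here is a minimal sketch. `legacy_parse_price` is a hypothetical stand-in for a component nobody wants to touch, and the comma-decimal bug is invented for the example.

```python
import logging

logger = logging.getLogger("legacy_wrapper")


def legacy_parse_price(text: str) -> float:
    """Hypothetical poorly understood component: it chokes on comma decimals."""
    return float(text)  # raises ValueError for input such as "12,50"


def parse_price(text: str) -> float:
    """Wrapper that handles the known bad case without touching the legacy code."""
    try:
        return legacy_parse_price(text)
    except ValueError:
        # Make the failure visible instead of silently papering over it.
        logger.warning("legacy_parse_price rejected %r; retrying normalized", text)
        return legacy_parse_price(text.replace(",", "."))


print(parse_price("3.25"), parse_price("12,50"))
```

The warning keeps the wrapped bug discoverable, so the workaround does not quietly become the only record that the problem exists.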

Ensuring the Bug Remains Closed

  1. 0 confirmed bugs with agreed-upon incorrect behavior.
  2. 0 recurring bugs.
  1. Deterministic, fast, cost-effective way to test correctness: test-automation for any practical application.
  2. Avoiding expensive automated tests for parts that rarely change.
  3. Avoiding automated tests at a level that changes with the code when another level is more future-proof and is sufficient for most intents and purposes.
  4. Avoiding numerous individual expensive tests when a smaller number of tests is sufficient for most intents and purposes.
  5. Writing tests early because the cases can be forgotten later and the damage may already be done then.
  6. Avoiding slow tests. This can require changing a feature to make most of its functionality testable without expensive operations.
  7. Testing as much of the result and its side effects as possible (e.g., the user data added to the database on registration), because they are the actual desired behavior and not testing them can lead to broken assumptions.
  8. Extracting complex logic, making it as pure as possible, and testing it exhaustively separately.
  9. Minimizing expensive tests (e.g., UI testing) and designing the rest of the system in a way that is easier to test.
  10. Using a driver encapsulating changing UI elements for UI tests.
  11. Focus more on testing all the correct behaviors because the incorrect behaviors are infinite.
  12. Avoid excessive abstractions introduced only to make the system more testable.
  13. Be aware of the state-of-the-art for automating any manual tasks you do and handle them holistically with a short-term/long-term balance.
  14. Be careful of cyclic bugs. This can happen when the problem is intractable and you find yourself closing a bug and then returning to it later. It can be especially hard to recognize these bugs because this can happen over months or years.
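
As one possible reading of "extracting complex logic, making it as pure as possible, and testing it exhaustively", here is a small sketch. The `page_count` function is a made-up example of logic pulled out of a larger feature; because it is pure and cheap, every input in a practical range can be checked in well under a second.

```python
from itertools import product


def page_count(total_items: int, page_size: int) -> int:
    """Pure helper: how many pages are needed to show `total_items`."""
    if total_items < 0 or page_size <= 0:
        raise ValueError("total_items must be >= 0 and page_size must be > 0")
    return (total_items + page_size - 1) // page_size


def check_page_count_exhaustively(max_items: int = 200, max_page_size: int = 20) -> int:
    """Check every combination in range against two properties; return how many ran."""
    checked = 0
    for total, size in product(range(max_items + 1), range(1, max_page_size + 1)):
        pages = page_count(total, size)
        assert pages * size >= total       # all items fit on the pages
        assert (pages - 1) * size < total or total == 0  # no fully empty last page
        checked += 1
    return checked


print(check_page_count_exhaustively())
```

The checks describe the correct behavior as properties rather than listing individual cases, which fits the advice to focus on correct behaviors because the incorrect ones are infinite.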

Conclusion
