How to Debug Systems (HtDS) — An Introduction
Introduction
This is the first post in a series where I write about approaches to analyzing systems, with a focus on debugging applications. In future posts I will break these approaches down into recipes and elaborate on each recipe step by step.
The ideas and recipes will mostly concern the system aspect, which can also assist in working with non-software processes, and people from different backgrounds/roles (e.g: programmers, testers, designers, managers, users) should benefit greatly from them. This point is crucial because even a programmer may not be able to apply all the steps in a recipe for many reasons (e.g: the targeted program is an embedded system inside a consumer device).
The word bug is a misnomer and maybe issue or ticket would be more appropriate, but in any case each person has their own unique concept of each word and I personally feel more comfortable using bug. The word system is used instead of program because it's appropriately broader in both the software and non-software sense.
Each single step in each recipe is a topic of its own and deserves its own post, so I will explain them only briefly to demonstrate the general approach. Also, each step is generalized to cover most bugs/projects, and a way of doing things tailored to your specific situation is always better than the generic one.
Core Ideas
Recipes only go so far; some core ideas are much more useful, but they are also much harder to explain. Recipe steps are more of an educational device, and the optimal strategy for each bug is different.
Problems, Solutions, and Probing Space
You can reason about a system in different roles, different abstractions, and different environments. Each of these can be reasoned about in a different way and at different levels, and depending on what you want to do with a system, it can be trivial in some of these spaces and impractical in others.
Layers of Abstraction & Switching Gears
It's easy to focus only on a single layer of abstraction (e.g: network, system flow, code), much harder to step back and focus on another layer, and harder still to focus on multiple layers at the same time, distributing your attention across them with different weights and tuning those weights in real time while analyzing a system.
Chain of Assumptions
Another idea is the chain of assumptions. A single wrong assumption/step can guide you in the wrong direction or even the opposite direction. For example, a simple bug in the base case can take a lot of time to find because of a strong assumption that the base case is working. Another example is that subtle bugs are harder to fix because the root cause seems too small to consider as a contributing factor.
Core Steps
Some steps are present in most recipes so I will explain them in this post in more detail and reference them in future recipes.
Getting the System
One major hurdle in fixing a bug is getting hold of the version you want to debug. This can be a specific older version or the latest version with a proof-of-concept fix. From first-hand experience, I can tell that it's possible to automate this end-to-end using a CI/CD pipeline for backend systems, frontend systems, and mobile apps with less than a minute of human effort per release/version. The setup for this can naturally take days or weeks of full-time work depending on how far you want to go, but it should still be less than the total time engineers spend doing this manually, and with better results, fewer mistakes, and less time per release.
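To make this concrete, here is a minimal sketch of the kind of automation I mean, assuming a hypothetical git-hosted service that builds into a Docker image; the repository URL, image name, and build steps are all made up and would differ per project:

```python
"""Fetch and build an arbitrary version of a service for debugging.

A minimal sketch, not a real CI/CD pipeline: it assumes the project lives
in git, builds with a plain `docker build`, and tags images by git ref.
"""
import subprocess
import sys

REPO_URL = "https://example.com/acme/service.git"  # hypothetical repository
IMAGE_NAME = "acme/service"                        # hypothetical image name


def run(*cmd, cwd=None):
    """Run a command, echo it, and fail loudly if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def build_version(ref: str, workdir: str = "service-src") -> str:
    """Clone the repo, check out `ref` (tag, branch, or commit), and build an image."""
    run("git", "clone", "--no-checkout", REPO_URL, workdir)
    run("git", "checkout", ref, cwd=workdir)
    image = f"{IMAGE_NAME}:{ref}"
    run("docker", "build", "-t", image, ".", cwd=workdir)
    return image


if __name__ == "__main__":
    # e.g. `python build_version.py v2.3.1` or a raw commit hash
    print("built", build_version(sys.argv[1]))
```

The same idea extends to deploying the resulting image to a scratch environment; the point is that getting any version into your hands should be a single command.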
Producing the Bug
Before even thinking of fixing a bug, you should confirm beyond a shadow of a doubt that it exists. If you can't produce a bug quickly and at will, it can be much harder to fix. Producing a bug is also a critical part of understanding which parts are affected, and a non-reproducible bug may very well be a ghost or an urban legend.
In most cases, a bug is better produced in the large than in the small. What I mean by in-the-large is as high-level and as close to the user as possible, and by in-the-small as low-level and as close to the code as possible. One reason for this is that producing it in the large covers the whole interaction, and it's generally harder to accurately produce a bug in the small.
Assume the simplest scenarios first (a small sketch of this follows the list below). Some major advantages to this approach are the following:
- Most bugs are simple and assuming the opposite can waste a lot of time.
- You will find the simplest scenario for most bugs.
- You can eliminate big parts of the system most of the time very quickly.
- You will find very tricky simple bugs that are easily producible in the simplest scenarios.
- You will start understanding and working with systems as a combination of smaller parts that are easier to reason about and diagnose.
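Here is a minimal sketch of what producing a bug in the large, starting from the simplest scenario, could look like. Everything in it is hypothetical: a web system with a /register endpoint and a report that duplicate registrations are accepted instead of rejected:

```python
"""A minimal sketch of reproducing a hypothetical bug in the large.

Assumptions (all made up): the system exposes an HTTP registration endpoint
at /register, and the reported bug is that registering the same email twice
succeeds instead of being rejected.
"""
import requests

BASE_URL = "http://localhost:8080"  # a local instance of the version under debug


def test_duplicate_registration_is_rejected():
    payload = {"email": "bug-report@example.com", "password": "hunter2"}

    # Simplest scenario first: a single registration should work.
    first = requests.post(f"{BASE_URL}/register", json=payload)
    assert first.status_code == 201

    # Simplest scenario that should fail: the exact same request again.
    second = requests.post(f"{BASE_URL}/register", json=payload)
    assert second.status_code == 409, (
        f"expected duplicate registration to be rejected, got {second.status_code}"
    )
```

The test drives the system through its real interface rather than calling internal code, and it starts from the smallest input that should trigger the behavior; if it passes, grow the scenario before diving into the code.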
Not Producing a Bug
Sometimes you can tell that a bug can happen in a part of the system, but it's hard to produce (e.g: legacy distributed systems). This is only the case if you have an extremely good understanding of that part of the system, the system overall, and even the domain of the system. One reason for working on/fixing bugs without producing them is that the things a system shouldn't do far outnumber the things it should, and focusing too much on the incorrect behavior can result in many problems far in the future.
Understanding the Bug
Given that you can easily produce the bug, you should try to understand it more and play with it. Some reasons why this is important are the following:
- To know the scope of the bug and what parts are affected.
- To determine whether the bug is malign or benign.
- To better communicate with other members of the team.
- To know which type of fix is best (e.g: quarantine, mitigation, treatment).
- To know whether it’s a disease or a symptom for a root-cause.
- To know how subjective it is and whether, in a different mood, you would call it a feature.
- To know how stealthy it is (e.g: a ninja/silent bug).
- To determine how contagious it is to other parts of the system.
- To know how much DPM (Damage Per Minute) it is causing.
- To know the cost (engineering, lost revenue) as a function of time of both a possible fix and a non-fix.
- To know whether it’s easier to fix or make it worse.
- To compare it with other similar systems and use them as a baseline if possible.
- To know how solvable it is and whether an advanced technique is necessary for optimal results (e.g: changing the bug/problem being solved).
Closing the Bug
Now that you understand the bug you should be equipped to close it. I used close instead of fix because fixing is only one of many ways to close a bug. Some of the possible ways to close a bug are the following:
- The zero-solution: do nothing. This should always be considered first. You can replace it with something else that has the same effect (e.g: go play Minecraft).
- Wait for the bug to disappear. This may be appropriate for transient bugs that have low impact, don't happen often, and are understood well enough. This can also be appropriate if the bug is hard to understand or produce, or lives in a part of the system with too many landmines.
- Replace it with a different implementation (e.g: internal, external service). This may be appropriate for non-core parts of the system.
- Amputate it: cut out the smallest part of the system that contains the bug. This may be appropriate if it's part of a system that turned out to be more expensive to get right than expected and it has a low ROI.
- Wrap the affected part in a way that handles the bug (see the sketch after this list). This may be appropriate if the system behavior is not well understood and this part of the system is not expected to change often in the future, because wrapping can have big negative effects in the long run for parts that change a lot.
- Replace it with one or more other bugs. One example is when it turns out to be completely different from the initial bug.
- Make the error visible/discoverable. This may be appropriate if the major impact is in the error being invisible and the cost of a fix is too high.
- Fix the core where the incorrect result is first returned. This may be appropriate if the affected part is relatively loosely coupled from the rest of the system.
- Fix the neighbors. May be appropriate if the core incorrect behavior is important to the neighbors or simplifies the overall system.
- Make the bug, and incorrect behavior near it, impossible. This may be appropriate if you have a good mental model of the affected part of the system or the desired behavior is too different from the current implementation.
- Solve a conceptually different, much easier problem. May be appropriate a looot more often than you might think.
- Provide a valid excuse. May be appropriate if logically the desired behavior is impossible to achieve or only non-technical solutions can solve the problem.
- Just type the following password as a reply in any standard issue-tracker (Slack could work too): “It works on my machine.” May be appropriate if you are a jerk.
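As an illustration of the wrapping option above, here is a minimal sketch. Everything in it is hypothetical: a legacy tax calculation that occasionally returns a negative amount for zero-value orders and is too risky or too expensive to fix directly right now:

```python
"""A minimal sketch of wrapping a buggy part instead of fixing it.

The legacy function and its bug are made-up stand-ins for a part of the
system that is too risky or too expensive to change directly.
"""
import logging

logger = logging.getLogger(__name__)


def legacy_calculate_tax(order_total: float) -> float:
    """Stand-in for the hypothetical buggy legacy implementation."""
    # The (made up) bug: zero-value orders come back with a negative tax.
    if order_total == 0:
        return -0.01
    return round(order_total * 0.2, 2)


def calculate_tax(order_total: float) -> float:
    """Wrapper that guards against the known bug without touching the legacy code."""
    amount = legacy_calculate_tax(order_total)
    if amount < 0:
        # Log loudly so the wrapper doesn't silently hide the problem,
        # then fall back to the value that is known to be correct here.
        logger.warning("legacy tax returned %s for total %s; clamping to 0",
                       amount, order_total)
        return 0.0
    return amount


if __name__ == "__main__":
    print(calculate_tax(0))    # 0.0 instead of the buggy -0.01
    print(calculate_tax(100))  # 20.0, untouched
```

The wrapper logs whenever the bug fires, which keeps the option open to fix the core later and makes it visible how often the workaround is actually exercised.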
Ensuring the Bug Remains Closed
Closing a bug is one thing; making sure it remains closed is everything. There's a profound side to this: if every reported bug is closed and remains so, the system will have no bugs.
Not quite. There are non-technical reasons that can make this false. For example, the same bug can resurface many times dressed differently, or users' and management's eyes can get used to bugs and the system magically becomes bug free.
On the serious side, you should strive for the following metrics:
- 0 confirmed bugs with agreed upon incorrect behavior.
- 0 recurring bugs.
This is achievable precisely because not all bugs are easily discoverable or have agreed upon incorrect behavior. It may also be cost-prohibitive to reach. But still, striving for it makes the system more correct, because writing it off can gradually make the system buggier over time. It can also act as a problem highlighter: for example, if you build an overly complex system or work with requirements that would require an infinite amount of time and money, these metrics will look plain ridiculous to you.
Some ways to ensure bugs remain closed are the following (mostly test-automation related):
- A deterministic, fast, cost-effective way to test correctness, which in any practical application means test automation.
- Avoiding expensive automated tests for parts that rarely change.
- Avoiding automated tests at a level that changes with the code when another level is more future-proof and sufficient for most intents and purposes.
- Avoiding numerous individual expensive tests when a smaller number of tests is sufficient for most intents and purposes.
- Writing tests early, because the cases can be forgotten later and the damage may already be done by then.
- Avoiding slow tests. This can require changing a feature to make most of its functionality testable without expensive operations.
- Testing as much of the result/side-effects as possible (e.g: the user data added to the database on registration), because they are the actual desired behavior and not testing them can lead to broken assumptions.
- Extracting complex logic, making it as pure as possible, and testing it exhaustively on its own (see the sketch after this list).
- Minimizing expensive tests (e.g: UI testing) and designing the rest of the system in a way that is easier to test.
- Using a driver that encapsulates frequently changing UI elements for UI tests.
- Focusing more on testing all the correct behaviors, because the incorrect behaviors are infinite.
- Avoiding excessive abstractions to make the system more testable.
- Being aware of the state of the art for automating any manual tasks you do and handling them holistically with a short-term/long-term balance.
- Being careful of cyclic bugs. These can happen when the problem is intractable and you find yourself closing a bug only to return to it later. They can be especially hard to recognize because the cycle can span months or years.
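As a small illustration of the extract-and-test-separately point above, here is a minimal sketch. The discount rule and its thresholds are hypothetical; the point is that once the logic is pure, it can be tested exhaustively without a database, an HTTP server, or a UI:

```python
"""A minimal sketch of extracting complex logic into a pure function.

Everything here is hypothetical: a discount rule that used to be buried in
a request handler, pulled out so it can be tested cheaply and exhaustively.
"""
import pytest


def discount_rate(order_total: float, is_returning_customer: bool) -> float:
    """Pure function: same inputs always give the same output."""
    if order_total <= 0:
        return 0.0
    rate = 0.05 if is_returning_customer else 0.0
    if order_total >= 100:
        rate += 0.10
    return rate


@pytest.mark.parametrize(
    "total, returning, expected",
    [
        (0, False, 0.0),     # base case: nothing to discount
        (0, True, 0.0),      # loyalty alone doesn't apply to empty orders
        (50, True, 0.05),    # loyalty discount only
        (100, False, 0.10),  # volume discount only, boundary included
        (100, True, 0.15),   # both discounts stack
        (-10, True, 0.0),    # defensive: negative totals get nothing
    ],
)
def test_discount_rate(total, returning, expected):
    assert discount_rate(total, returning) == pytest.approx(expected)
```

Because the function is pure, its whole behavior can be pinned down with cheap tests like these, and the expensive end-to-end tests only need to verify that the surrounding code actually calls it.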
Conclusion
This is the introductory post for the series and will serve as a reference for future posts, each of which will introduce a new recipe. It's by no means exhaustive, but it's definitely a start.