How We Share Troubleshooting Skills

Hello, I’m id:koemu, a Backend Software Engineer on the Merpay Payment Platform Team.

Abstract

This article explains how our team shares troubleshooting knowledge and how we keep improving our troubleshooting skills for various situations, especially incidents on weekends or holidays. Troubleshooting is hard to learn at your desk alone because there are few opportunities to practice. That’s why we share these skills by reading system metrics together at a weekly team meeting.

Sharing the responsibility of troubleshooting with the entire team

Senior engineers can solve problems by analyzing system metrics and logs. For many other engineers, however, troubleshooting can be too difficult to handle. Even when a system monitors various metrics, finding the root cause of a problem or failure takes extensive experience and knowledge.
Therefore, to reduce the burden on senior engineers, we started thinking about how to make sure that the entire team could handle troubleshooting.

Sharing knowledge through reading system metrics

At our weekly All Hands meeting, senior engineers started sharing their troubleshooting knowledge. This approach helps other engineers understand how senior engineers read system metrics when dealing with problems:

First, senior engineers explain what each metric means. Next, they walk through how to locate issues by pointing to specific, telling metric movements from previous troubleshooting incidents. Data from previous incidents helps us understand how metrics actually change and what situations may arise in the future.
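To make the idea of "reading metric movements" a little more concrete, here is a minimal sketch in Python. It is not the tooling our team uses. The function name `find_deviations`, the threshold, and the latency values are all hypothetical and made up for illustration. The sketch treats the first few samples as normal behavior and flags later samples that move far away from it, which is roughly the kind of movement a senior engineer would point out on a graph.

```python
from statistics import mean, stdev


def find_deviations(samples, baseline_size, threshold=3.0):
    """Return indices of samples that deviate from the baseline window.

    samples: metric values in time order (hypothetical data).
    baseline_size: number of leading samples treated as normal behavior.
    threshold: how many standard deviations count as a deviation (assumed value).
    """
    if baseline_size < 2 or baseline_size > len(samples):
        raise ValueError("baseline_size must be between 2 and len(samples)")

    baseline = samples[:baseline_size]
    mu = mean(baseline)
    sigma = stdev(baseline)

    deviations = []
    # Only inspect samples that come after the baseline window.
    for i in range(baseline_size, len(samples)):
        if abs(samples[i] - mu) > threshold * sigma:
            deviations.append(i)
    return deviations


# Made-up latency values in milliseconds, for illustration only.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 300, 101]
suspect = find_deviations(latency_ms, baseline_size=6)
print(suspect)
```

A fixed threshold like this is only a starting point. In practice, the explanation from someone who has seen the incident before is what tells you which metric to look at and which movement matters.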

Achievement

Through this practice, engineers have learned troubleshooting skills they can use to solve problems. However, this isn’t the end. We need to deepen our understanding of different kinds of troubleshooting. To that end, we’ll continue sharing what we did and how we did it at the team’s recurring All Hands meeting.

Conclusion

To sum up, I explained how senior engineers share troubleshooting skills. We can acquire troubleshooting skills by reading metrics with the help of senior engineers’ explanations. Now, not only senior engineers but also many other members are able to cope with problems and issues. At future All Hands meetings, we will keep striving to learn more by sharing knowledge and skills with each other, drawing not only on senior engineers but also on other members across the entire team.