Stability: Monitoring Application Crashes

Crashes are inevitable. For an application with many active users, the number of crashes is probably higher than you’d imagine. Thankfully, there are many in-house and third-party systems that let you track crashes with detailed metadata and stack traces.

It takes some practice and diligence, but you can get very good at managing and fixing the most relevant crashes in your application.

When I started working at Tumblr, I noticed they were using an additional approach to tracking crashes: a real-time, low-cardinality crash event.

Because the event was hooked up to the time-series collection system used for other monitoring, it gave a very up-to-date measure of the crash rate. It was the first time I’d seen crash rate measured this way. Plenty of crash tracking systems let you look at crash rate, but usually only at hourly granularity, because ingestion can take a non-trivial amount of time when it involves parsing, symbolicating, and deobfuscating crash reports.
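To make "low-cardinality" concrete, here is a minimal sketch of what such an event might look like. It is not Tumblr's implementation: the `METRICS` counter stands in for a real time-series client, and the metric name, the tag set, and the helper names are all assumptions made for illustration. The point is that the event carries only a handful of bounded tags and no stack trace, so it can be counted immediately instead of waiting on symbolication.

```python
import sys
from collections import Counter

# Hypothetical in-memory stand-in for a time-series client (for example a
# StatsD-style counter). A real app would send this over the network.
METRICS = Counter()

# Only a small, fixed set of tag values is allowed, which keeps cardinality low.
ALLOWED_PLATFORMS = {"ios", "android", "web"}


def crash_metric_key(platform, app_version):
    """Build a metric key from bounded tags only: no stack trace, no user id."""
    if platform not in ALLOWED_PLATFORMS:
        platform = "other"
    # Keep only major.minor so patch builds do not multiply the number of series.
    major_minor = ".".join(app_version.split(".")[:2])
    return f"app.crash,platform={platform},version={major_minor}"


def record_crash(platform, app_version):
    """Increment the crash counter for this platform and version."""
    METRICS[crash_metric_key(platform, app_version)] += 1


def install_crash_hook(platform, app_version):
    """Count uncaught exceptions, then defer to the previously installed hook."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc, tb):
        record_crash(platform, app_version)
        previous_hook(exc_type, exc, tb)

    sys.excepthook = hook
    return hook
```

The detailed crash report still goes to the regular crash tracking system. This counter only answers the question "how many crashes are happening right now?"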

Real-time crash monitoring is perfect for alerting too. The time-series system already had support for notifying developers or paging on-call people.
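As a rough illustration of the kind of rule such an alert might evaluate, the sketch below checks the crash rate over a sliding window of reporting intervals. The class name, the window size, the 2% threshold, and the minimum-traffic guard are all assumptions chosen for the example. In practice this logic would normally live in the alerting layer of the time-series system rather than in application code.

```python
from collections import deque


class CrashRateAlert:
    """Fires when the crash rate over a sliding window exceeds a threshold."""

    def __init__(self, window_size=5, threshold=0.02, min_sessions=100):
        # Each entry is one reporting interval: (crashes, sessions).
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_sessions = min_sessions

    def observe(self, crashes, sessions):
        """Record one interval and return True if an alert should fire."""
        self.window.append((crashes, sessions))
        total_crashes = sum(c for c, _ in self.window)
        total_sessions = sum(s for _, s in self.window)
        # Too little traffic makes the rate noisy, so stay quiet.
        if total_sessions < self.min_sessions:
            return False
        return total_crashes / total_sessions > self.threshold
```

Summing over a few intervals smooths out single-interval spikes, and the minimum-session guard avoids paging someone at 3 a.m. because two of ten sessions crashed.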

Feature flags are an awesome way to ramp up new functionality in a controlled manner. I’ve seen real-time crash monitoring catch issues moments after a problematic ramp. If you’re using feature flags, you should be using real-time crash reporting as well. It’s enormously beneficial and relatively simple to implement.
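One way to tie the two together is to let the real-time crash rate gate each ramp step. The function below is a hypothetical sketch, not any particular feature flag system's API: the step schedule and the 1.5x tolerance over the baseline crash rate are assumed values for illustration.

```python
def next_ramp_percentage(current_pct, baseline_rate, current_rate,
                         steps=(1, 5, 25, 50, 100), tolerance=1.5):
    """Return the next rollout percentage for a feature flag.

    Rolls back to 0 if the crash rate observed since the last ramp is more
    than `tolerance` times the pre-ramp baseline; otherwise advances to the
    next step in the schedule.
    """
    if current_rate > baseline_rate * tolerance:
        return 0  # Crash rate regressed: turn the flag off.
    for step in steps:
        if step > current_pct:
            return step
    return current_pct  # Already fully ramped.
```

For example, with a baseline crash rate of 1%, a flag at 5% that shows a 1.1% crash rate would advance to 25%, while one that shows 3% would be rolled back to 0.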