Stability: Smarter Monitoring Application Crashes

I had posted about the way Tumblr uses time-series monitoring to alert on crash spikes in the Android and iOS applications. Since then, we’ve done a lot of work to reduce the overall volume of crashes. As a result, we created a new problem: it was possible for a handful of people, caught in crash cycles, to cause our stability alerts to trigger.

Once the stability alert is triggered, we typically start looking in the crash logging systems, like Crashlytics or Sentry, to find more information about the crash. We found an increasing number of occurrences where no particular crash could be easily identified as causing the spike.

Getting paged at 2am because of a stability alert is not great, but not finding a crash was even worse.

The problem was the way we were monitoring. Simply watching all crash events wasn’t good enough. We had to start normalizing the events across the users. Thankfully, we collect events and not simple ticks. We have rich data in the crash event, including a way to group events coming from the same device.

Here is an example of the two styles of monitoring over the last week:

raw count shows a large spike, but the unique device count shows a normal trend

That after-midnight Raw Count spike on Friday would have paged our on-call person if we hadn’t changed to alert on the Unique Device Count instead. We still use the Raw Counts to identify issues and investigate, but we don’t alert on them. We can use the high-cardinality events to zero-in on the cause of the spike. In this case, two (2) people were having a bad experience using their Cubot Echo devices.

Since moving to the new alerting metric, we’ve had far fewer after-hour pages, while still being able to focus on the stability of the applications across our user base.

Performance: Design & Expectations


Providing a great user experience under a variety of performance situations means designing for a variety of expectations. Defining a great user experience should include performance budgets for the various pieces of the experience. Some basic examples include: Application startup, Image loading, and UI responsiveness.

Application Startup

First impressions are important, and startup time (page-load for web apps) is that first impression. There is a lot of information floating around that should convince you of the importance of a fast launch time. Still, we seem to cram more and more cruft into that part of the application. We initialize analytic libraries, load saved preferences, try to re-send failed or queued events and data, and maybe even send crash reports from previous sessions.

Create a startup time performance budget. How long should a person wait before being able to view and interact with real content? Once you set that limit, start moving less critical work out of the critical path. Queued events and data can remain queued a little while longer. Buffer new events before initializing heavy analytics libraries. Consider showing cached content, while downloading new content in the background.

Image Loading

Images are a big part of many applications. Avatars, photos and GIFs are everywhere. We want to display the images as fast as possible. Usually development happens in ideal situations: Best devices and best network speeds – the fast-path. If you’re measuring the real world performance characteristics of your application, you probably know that most of the people using the application don’t have the best devices or fast network speeds – the slow path.

Sometimes we fail to design for the slow-path. We assume it’s infrequent, or worse, we believe the fast-path behavior is correct for the product in any situation. People can just deal with the crummy experience. Remember your performance budget: how long should someone wait for an image to load?

Some common approaches to handling the slow-path:

  • Use server-side caching. This one is pretty obvious, but I have to mention it. Using a Content Delivery Network (CDN) means it takes less time to deliver images to the application because the images are “closer” to the application.
  • Use a more efficient file format. GIF is not known for being a lightweight format. Look into WebP and MP4 as low-bandwidth animated image alternatives that provide great quality.
  • Get better at picking JPEG quality levels. Etsy has a nice write-up on using SSIM (human vision estimation) to pick the lowest level without hurting perceived quality. Google has something called Butteraugli that does something similar.
  • Dynamically size images to fit the target rendering size. Don’t download large images only to reduce the size on the client. For less than excellent networking speeds, request images that are smaller than the target size and upscale them. You can save a lot of bandwidth and render the image quickly, keeping the application usable.
  • Aggressively cache images on the device. Never download the same image more than once. Cached images load quickly and reduce bandwidth usage. Yes, this might mean using 1GB or more for a cache, but if the space is available, it’s always worth it. Modern OSes will try to clear storage-based caches when running low on free space.
  • Consider Tap-to-Play interfaces to delay downloading large animated images until requested. Use a much smaller static image as a placeholder.

Some of these slow-path ideas might be so effective at saving bandwidth or improving image loading speed, that you make them options for the fast-path as well.

UI Responsiveness

Touch-based devices make unresponsive UIs very noticeable. Applications should maintain a responsive UI no matter what other activity is taking place. Use background threads to do the heavy lifting. Keep the UI thread free of any file I/O, networking and any other work that can be pushed to the background.

Remember to design for the slow-path when creating UI actions associated with network APIs. Don’t wait for the network response before changing the state of the UI. If the action fails, you can always say so and flip the state back. Delaying the state change makes the UI appear broken.

Smooth scrolling is another part of a responsive UI. iOS, Android and Web all have best practices for keeping high frame-rates while scrolling. There are also tools for profiling your rendering code.

Watch for situations where a design requirement (animation, layout, whatever) is causing the UI to become unresponsive. Find a way to fast-path/slow-path the requirement. If that’s not possible, get the requirement changed.

Design for Everyone

Never fall into the mindset of designing for only the latest hardware and fastest network speeds. You really need to factor the slow-path into your designs too. Yes, it’s more work and it’s probably not your ideal experience, but it’s far better than trying to force a fast-path design down a slow-path situation. That experience is usually horrible. You can do better!

Performance: The Merits of Measuring

If you can not measure it, you can not improve it. – Lord Kelvin.

There is so much information out there on ways to improve the performance of your mobile application or website. You probably feel you can just dive in and start making changes. But if you’re not measuring your application’s performance, you don’t know if anything is really helping or hurting. How do you know what effect any changes will have on the performance? Most applications are complex enough that we can’t assume our simplistic reasoning accurately reflects the code behavior. You need to measure.

You need measurements from before and after any changes are made. Your application has development phases, so should your measurement plan. Measure in CI to find improvements and regressions as soon as they happen. Measure in the real world to find how variations like network conditions, device fragmentation, and unpredictable user behavior manifests in performance.

Measuring in CI

The point of measuring performance in CI is to control variability and watch for relative differences on each change. Try to reduce the noisy variables like network calls and background services to create a fast and consistent surface on which you can monitor performance changes in a reliable and repeatable manner. The purpose is not to determine the performance your users will encounter. There are too many variables in the real world and you can’t control them well enough for apples-to-apples comparisons.

Use real devices when measuring performance, not emulators or simulators running on host hardware.

You can try to create reliable & repeatable simulations of some real world situations. Network connection speed is one example. You can use a network simulator, like Facebook’s Augmented Traffic Control system to simulate WiFi and mobile network conditions. This is especially useful if your application is designed to react differently under different network conditions. You can also use different types of content in the tests, trying to mimic some high level differences your users might encounter.

If you’re measuring data in CI, you should be storing and displaying it as well. Try to get the CI to alert on regressions, failing changes before they make it into the product.

Some common things to measure in CI:

  • Launch time to show UI
  • Launch time to interactive UI
  • Scroll performance (janky frames)
  • Time to load content
  • Memory usage (startup, after content loads, after scrolling content)

Remember to use multiple (physical) devices and a variety of content types.

Measuring in Real World

While CI measurements come from only a handful of tests & situations, the real world has many, many more situations. Depending on the number of active users, you could have millions of data points with thousands of unique situations. Collecting data from real users, at a large scale, allows you to investigate how things like global regions, network conditions — and even user types — can affect the performance of the application.

The are many third-party systems you can integrate into your code to easily and efficiently collect real world data. It’s not uncommon for companies to grow their own systems as well. In any case, make sure you are validating the data itself. Since real world data is messy, make sure you are vetting the collection systems and the data. Look for problems like payload corruption, clock skew, range errors or other oddities.

Create automated queries and reports, sent out broadly for people to review. Remember to go deeper than high-level summaries. Some of the interesting discoveries happen when you split out data across different dimensions.

Some common things to measure from real world:

  • Network usage, including start time, end time, content type and size of the response. Get detailed connection timing, if possible, for DNS and SSL handshake information.
    • For API endpoints, this is useful for tracking latency and payload size
    • For media loading, this gives a ballpark metric for how long people are staring at an empty box, waiting for an image to load.
  • Event, session and error state data. This can be used to track critical content impressions, but also can be used to learn how people use the application.

Remember to include some common metadata in each measurement so you can split out the data across different dimensions. Things like non-PII identifier, generic geo-location/region, device specs and connection type/speed can help you drill down into the data, looking for trends.

It’s also polite to allow people to opt-out of this type of data collection.