Integration Testing: Time to Reboot

I tried to push a plan for bootstrapping an automated integration test system for our Android and iOS applications. The plan was based on similar strategies I’d used, or seen used, at other companies. It didn’t fit well with the current situation and workflows. I failed to take those differences into account and the initiative failed.

Developers never saw the value of spending their time on the automated integration tests or fitting them into their workflow. Test engineers were overwhelmed by the amount of manual regression testing required for each release. Even with the manual tests, we have gaps in regression coverage that have led to severe defects shipping to users. We are wasting valuable manual testing time on hundreds of manual regression tests that rarely break, when we should be focusing those people on new feature and exploratory testing.

Looking Forward

We want to be able to do automated testing on our iOS and Android client apps, from the simplest “does the application start” smoke test to more complicated tests of critical features and functionality. More test automation means:

  1. Finding bugs faster
  2. Focusing manual testing on high value tasks (new features and exploratory testing)
  3. Shipping releases faster with higher quality

Test engineering is highly motivated to do more integration testing as a way to reduce the number of manual regression test cases. Though they don’t have development experience, those folks want to start creating the tests, so we want to keep the barrier to writing tests very low. As we automate regression tests, we want to focus manual testing on new features, exploratory testing, and ad-hoc edge cases.

Objectives for our automated integration testing reboot:

  • Doesn’t require knowledge of how the applications are built or the languages used to develop them.
  • Requires little knowledge of how the applications’ UI is structured.
  • Reuses the integration testing framework, code, and knowledge across all application platforms.
  • Reduces the amount of manual regression testing as much as possible.

Approach

We intend to use a black-box approach to installing, launching, and driving the applications. The plan includes:

  • Use Python-based Appium scripts as the framework for integration tests. Python is a good entry-level programming language and Appium can black-box test both Android and iOS clients, so we get the same language and framework for both mobile platforms (a minimal test sketch follows this list).
  • Use emulators & simulators to run smoke and integration tests. They’re easy to set up and run locally, while also capable in CI.
  • Run the tests several times a day using CI, but not on each PR. The focus is on reducing manual regression testing, while not adding friction to developer workflows.
  • Send only consistently failing tests to QA for manual verification and ticket filing.
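For a sense of what these tests look like, here is a minimal “does the application start” smoke test in Python with Appium. This is a sketch, not one of our real tests: it assumes an Appium server on localhost and an emulator that is already booted, the package and activity names are placeholders, and the capability keys and client API vary between Appium client versions.

    # smoke_test.py -- a minimal "does the application start" check (a sketch, not
    # one of our real tests). Assumes an Appium server on localhost:4723 and an
    # Android emulator that is already booted. Package and activity names are
    # placeholders; capability keys and the client API vary by Appium version.
    from appium import webdriver

    CAPS = {
        "platformName": "Android",
        "automationName": "UiAutomator2",
        "deviceName": "Android Emulator",
        "appPackage": "com.example.mobileapp",   # hypothetical package name
        "appActivity": ".MainActivity",          # hypothetical launch activity
        "newCommandTimeout": 120,
    }

    def test_app_launches():
        # Older desired-capabilities style; newer Appium clients use options objects.
        driver = webdriver.Remote("http://localhost:4723/wd/hub", CAPS)
        try:
            # If the main activity is in the foreground, the app started.
            assert "MainActivity" in driver.current_activity
        finally:
            driver.quit()

    if __name__ == "__main__":
        test_app_launches()
        print("smoke test passed")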

Milestones

Milestone 1 is about creating a solid foundation for the approach. We’ve completed the proof of concept:

  • Test engineers are building scripts using cross-platform tools — and learning to code.
  • Developers have added a few testing hooks into the clients to allow faster, more robust tests.
  • Python scripts have been created for over 200 integration tests using Appium.
  • Tests are running in CI several times a day.
  • Test engineering created a simple system to send consistent failures to QA.
  • Reliability is better than the previous Espresso/XCUITest test suite.

We’ve already saved several tester-hours a day from manual regression testing.

Milestone 2 expands the test coverage from only high-priority test cases to medium- and even low-priority test cases. We’re also expanding the tooling to support running the Appium tests on both Alpha and Beta channels, as well as self-service support for running on pull requests. Some additional tasks:

  • Get better at controlling feature flags for more deterministic test flows
  • Start mocking API responses for faster testing and fewer variations due to live data (a minimal mock-server sketch follows this list)
  • Intercept outgoing requests to track and verify more analytics
  • Create a smaller, faster suite for PR testing
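On the API mocking task, the rough idea is to stand up a tiny local server with canned JSON responses and point the client at it through a testing hook. Here is a minimal sketch of that mock server; the endpoint paths and payloads are made up, and it assumes the app exposes a hook to override its API base URL.

    # mock_api.py -- a throwaway local mock of our backend for Appium runs (sketch
    # only). Assumes the client app has a testing hook to point its API base URL
    # at http://<host>:8099; the endpoint paths and payloads are made up.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CANNED = {
        "/v1/feed": {"items": [{"id": 1, "title": "hello"}]},   # hypothetical endpoint
        "/v1/profile": {"name": "test-user"},                   # hypothetical endpoint
    }

    class MockHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(CANNED.get(self.path, {"error": "not mocked"})).encode()
            self.send_response(200 if self.path in CANNED else 404)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep test output quiet

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8099), MockHandler).serve_forever()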

Web ADB: Simple Web-based Access to Devices

I’ve had a number of occasions where I needed direct access to an Android device that wasn’t connected to the computer in front of me. I can usually SSH into the remote host machine and use ADB to try to debug the situation. If the simple stuff doesn’t work, I eventually start using ADB screencap to get a look at what’s on the device. If I’m lucky, I can remote desktop to the host. If not, I end up copying the images back to my machine to view them.

[Image: connecting to a remote host with some Android devices attached to it]

Surely there must be an easier way.

There is! I found the OpenSTF project, which basically gives you web-based control of Android and iOS devices. Just install the system on the host machine and install an agent on the Android devices. It looks pretty cool, but always seemed like overkill when I was in a remote debug situation.

So I decided I’d start hacking together a really simple system in Python. I started with the simplest Python API server I could find. Then I added a fairly basic webapp front-end. The result is Web ADB.

It’s a very minimal Python API server, which also serves up a basic single-page webapp. The approach is pretty simple: run ADB commands via Python, parse the output, send the results back through the API response.

The API supports getting attached devices, getting a screenshot of a device, sending key presses and screen taps, and even rebooting a device. The webapp just uses the API to make something useful. Maybe the only cool feature is that clicking on a screenshot sends a tap to the device, then updates the screenshot. I have some ideas for other features, as time permits.
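To make the approach concrete, here is a trimmed-down sketch of the idea: a few Flask routes that shell out to adb, parse the output, and return the results. The route names are illustrative rather than the project’s actual API.

    # webadb_sketch.py -- a trimmed-down sketch of the Web ADB idea: wrap adb in a
    # tiny HTTP API. Route names are illustrative, not the project's actual API.
    import subprocess
    from flask import Flask, Response, jsonify

    app = Flask(__name__)

    def adb(*args):
        """Run an adb command and return its raw stdout."""
        return subprocess.run(["adb", *args], capture_output=True, check=True).stdout

    @app.route("/devices")
    def devices():
        # `adb devices` prints a header line, then "<serial>\t<state>" per device.
        lines = adb("devices").decode().strip().splitlines()[1:]
        return jsonify([line.split("\t")[0] for line in lines if line.strip()])

    @app.route("/devices/<serial>/screenshot")
    def screenshot(serial):
        # exec-out avoids the newline mangling that `adb shell screencap` can add.
        png = adb("-s", serial, "exec-out", "screencap", "-p")
        return Response(png, mimetype="image/png")

    @app.route("/devices/<serial>/tap/<int:x>/<int:y>", methods=["POST"])
    def tap(serial, x, y):
        adb("-s", serial, "shell", "input", "tap", str(x), str(y))
        return jsonify({"ok": True})

    if __name__ == "__main__":
        app.run(port=5000)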

Stability: Monitoring Application Crashes

Crashes are inevitable. For an application with many people actively using it, the number of crashes is probably higher than you’d imagine. Thankfully, there are many in-house and third-party systems that allow you to track crashes with detailed metadata and crash stacks.

It takes some practice and diligence, but you can get very good at managing and fixing the most relevant crashes in your application.

When I started working at Tumblr, I noticed they were using an additional approach to tracking crashes: a real-time, low-cardinality crash event.

Hooked up to the time-series collection system used for other monitoring, you get a very up-to-date measure of the crash rate. It was the first time I’d seen crash rate measured this way. Plenty of crash tracking systems allow you to look at crash rate, but usually at the hourly level, since ingestion can take non-trivial amounts of time for parsing, symbolication, and deobfuscation.
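The pattern is simple enough to sketch, even though our clients are Android and iOS rather than Python: on any uncaught exception, fire a single low-cardinality counter at the time-series collector before letting the crash proceed. The metric name and collector address below are assumptions, and the StatsD wire format is just a stand-in for whatever collection system you already run.

    # crash_beacon.py -- sketch of the real-time, low-cardinality crash event idea.
    # Our clients are Android/iOS, not Python; this only illustrates the pattern:
    # on any uncaught exception, fire one counter at a StatsD-style collector,
    # then let the crash proceed. Metric name and collector address are made up.
    import socket
    import sys

    STATSD_ADDR = ("metrics.example.internal", 8125)   # hypothetical collector

    def _send_crash_counter():
        # StatsD wire format: "<metric>:<value>|c" increments a counter.
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.sendto(b"app.crash:1|c", STATSD_ADDR)
        except OSError:
            pass  # the beacon must never cause its own failure

    def _crash_hook(exc_type, exc, tb):
        _send_crash_counter()
        sys.__excepthook__(exc_type, exc, tb)  # keep normal crash behavior

    sys.excepthook = _crash_hook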

The real-time crash monitoring is perfect for alerting too. The time-series system already had support for notifying developers or paging on-call people.

Feature flags are an awesome way to ramp new functionality in a controlled manner. I’ve seen real-time crash monitoring catch issues moments after a problematic ramp. If you’re using feature flags, you should be using real-time crash reporting as well. It’s enormously beneficial and relatively simple to implement.

Performance: Design & Expectations

Providing a great user experience under a variety of performance situations means designing for a variety of expectations. Defining a great user experience should include performance budgets for the various pieces of the experience. Some basic examples include: Application startup, Image loading, and UI responsiveness.

Application Startup

First impressions are important, and startup time (page-load for web apps) is that first impression. There is a lot of information floating around that should convince you of the importance of a fast launch time. Still, we seem to cram more and more cruft into that part of the application. We initialize analytic libraries, load saved preferences, try to re-send failed or queued events and data, and maybe even send crash reports from previous sessions.

Create a startup time performance budget. How long should a person wait before being able to view and interact with real content? Once you set that limit, start moving less critical work out of the critical path. Queued events and data can remain queued a little while longer. Buffer new events before initializing heavy analytics libraries. Consider showing cached content, while downloading new content in the background.
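As a rough illustration of that budget-driven approach, here is a sketch of buffering analytics events and pushing heavy initialization off the critical path. The function bodies are stand-ins; only the shape of the pattern matters.

    # startup_budget.py -- sketch of keeping startup inside a performance budget by
    # buffering analytics events and deferring heavy initialization off the
    # critical path. The function bodies are stand-ins; only the shape matters.
    import queue
    import threading
    import time

    pending_events = queue.Queue()     # events tracked before analytics is ready

    def track(event_name):
        """Cheap during startup: enqueue and return immediately."""
        pending_events.put(event_name)

    def init_analytics():
        time.sleep(0.5)                # stand-in for an expensive library init

    def deferred_startup_work():
        # Everything here happens after the first content is on screen.
        init_analytics()
        while not pending_events.empty():
            print("analytics event:", pending_events.get())

    def on_app_launch():
        print("showing cached content")          # user sees something immediately
        track("app_start")
        threading.Thread(target=deferred_startup_work, daemon=True).start()

    if __name__ == "__main__":
        on_app_launch()
        time.sleep(1)                  # keep the demo alive for the worker thread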

Image Loading

Images are a big part of many applications. Avatars, photos and GIFs are everywhere. We want to display the images as fast as possible. Usually development happens in ideal situations: Best devices and best network speeds – the fast-path. If you’re measuring the real world performance characteristics of your application, you probably know that most of the people using the application don’t have the best devices or fast network speeds – the slow path.

Sometimes we fail to design for the slow-path. We assume it’s infrequent, or worse, we believe the fast-path behavior is correct for the product in any situation. People can just deal with the crummy experience. Remember your performance budget: how long should someone wait for an image to load?

Some common approaches to handling the slow-path:

  • Use server-side caching. This one is pretty obvious, but I have to mention it. Using a Content Delivery Network (CDN) means it takes less time to deliver images to the application because the images are “closer” to the application.
  • Use a more efficient file format. GIF is not known for being a lightweight format. Look into WebP and MP4 as low-bandwidth animated image alternatives that provide great quality.
  • Get better at picking JPEG quality levels. Etsy has a nice write-up on using SSIM (human vision estimation) to pick the lowest level without hurting perceived quality. Google has something called Butteraugli that does something similar.
  • Dynamically size images to fit the target rendering size. Don’t download large images only to reduce the size on the client. For less than excellent networking speeds, request images that are smaller than the target size and upscale them (see the sizing sketch after this list). You can save a lot of bandwidth and render the image quickly, keeping the application usable.
  • Aggressively cache images on the device. Never download the same image more than once. Cached images load quickly and reduce bandwidth usage. Yes, this might mean using 1GB or more for a cache, but if the space is available, it’s always worth it. Modern OSes will try to clear storage-based caches when running low on free space.
  • Consider Tap-to-Play interfaces to delay downloading large animated images until requested. Use a much smaller static image as a placeholder.
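Here is a small sketch of the dynamic sizing idea: pick the download width from the render size and the connection quality, and upscale on the client when the network is slow. The “?w=” resizing parameter is an assumed CDN feature, and the scale factors are arbitrary starting points.

    # image_sizing.py -- sketch of picking a download width from the render size
    # and the connection quality; the "?w=" resize parameter is an assumed CDN
    # feature, and the scale factors are arbitrary starting points.
    def choose_download_width(render_width_px, connection_type):
        """On slow connections, request a smaller image and upscale on the client."""
        scale = {"wifi": 1.0, "4g": 1.0, "3g": 0.75, "2g": 0.5}.get(connection_type, 0.75)
        return max(64, int(render_width_px * scale))

    def image_url(base_url, render_width_px, connection_type):
        width = choose_download_width(render_width_px, connection_type)
        return f"{base_url}?w={width}"

    if __name__ == "__main__":
        # A 1080px slot on a 3G connection -> ask the CDN for an 810px image.
        print(image_url("https://cdn.example.com/photo.jpg", 1080, "3g"))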

Some of these slow-path ideas might be so effective at saving bandwidth or improving image loading speed, that you make them options for the fast-path as well.

UI Responsiveness

Touch-based devices make unresponsive UIs very noticeable. Applications should maintain a responsive UI no matter what other activity is taking place. Use background threads to do the heavy lifting. Keep the UI thread free of any file I/O, networking and any other work that can be pushed to the background.

Remember to design for the slow-path when creating UI actions associated with network APIs. Don’t wait for the network response before changing the state of the UI. If the action fails, you can always say so and flip the state back. Delaying the state change makes the UI appear broken.
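The optimistic-update pattern is easier to see in code. This is a generic sketch rather than real mobile UI code: flip the state immediately, make the (stand-in) network call, and roll back with a message if it fails. In a real client the call would run off the UI thread.

    # optimistic_ui.py -- sketch of the "flip the UI first, roll back on failure"
    # pattern; like_post() is a stand-in for a real network call, which in a real
    # client would run off the UI thread.
    import random

    ui_state = {"liked": False}

    def like_post(post_id):
        """Stand-in network call; fails randomly to exercise the rollback path."""
        return random.random() > 0.2

    def on_like_tapped(post_id):
        previous = ui_state["liked"]
        ui_state["liked"] = True                 # update the UI immediately
        if not like_post(post_id):               # network call happens after
            ui_state["liked"] = previous         # roll back and tell the user
            print("Couldn't like the post. Try again.")

    if __name__ == "__main__":
        on_like_tapped(42)
        print(ui_state)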

Smooth scrolling is another part of a responsive UI. iOS, Android and Web all have best practices for keeping high frame-rates while scrolling. There are also tools for profiling your rendering code.

Watch for situations where a design requirement (animation, layout, whatever) is causing the UI to become unresponsive. Find a way to fast-path/slow-path the requirement. If that’s not possible, get the requirement changed.

Design for Everyone

Never fall into the mindset of designing for only the latest hardware and fastest network speeds. You really need to factor the slow-path into your designs too. Yes, it’s more work and it’s probably not your ideal experience, but it’s far better than trying to force a fast-path design down a slow-path situation. That experience is usually horrible. You can do better!

Performance: The Merits of Measuring

If you cannot measure it, you cannot improve it. – Lord Kelvin

There is so much information out there on ways to improve the performance of your mobile application or website. You probably feel you can just dive in and start making changes. But if you’re not measuring your application’s performance, you don’t know if anything is really helping or hurting. How do you know what effect any changes will have on the performance? Most applications are complex enough that we can’t assume our simplistic reasoning accurately reflects the code behavior. You need to measure.

You need measurements from before and after any changes are made. Your application has development phases, and so should your measurement plan. Measure in CI to find improvements and regressions as soon as they happen. Measure in the real world to find how variations like network conditions, device fragmentation, and unpredictable user behavior manifest in performance.

Measuring in CI

The point of measuring performance in CI is to control variability and watch for relative differences on each change. Try to reduce the noisy variables like network calls and background services to create a fast and consistent surface on which you can monitor performance changes in a reliable and repeatable manner. The purpose is not to determine the performance your users will encounter. There are too many variables in the real world and you can’t control them well enough for apples-to-apples comparisons.

Use real devices when measuring performance, not emulators or simulators running on host hardware.

You can try to create reliable & repeatable simulations of some real world situations. Network connection speed is one example. You can use a network simulator, like Facebook’s Augmented Traffic Control system to simulate WiFi and mobile network conditions. This is especially useful if your application is designed to react differently under different network conditions. You can also use different types of content in the tests, trying to mimic some high level differences your users might encounter.

If you’re measuring data in CI, you should be storing and displaying it as well. Try to get the CI to alert on regressions, failing changes before they make it into the product.

Some common things to measure in CI:

  • Launch time to show UI (a launch-time measurement sketch follows below)
  • Launch time to interactive UI
  • Scroll performance (janky frames)
  • Time to load content
  • Memory usage (startup, after content loads, after scrolling content)

Remember to use multiple (physical) devices and a variety of content types.
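As one example of the launch-time items in the list above, Android’s activity manager can report a cold-start time directly. The sketch below shells out to adb and parses the TotalTime value; the package and activity names are placeholders, and it assumes a real device is attached.

    # launch_time.py -- sketch of one CI measurement: cold-start time via
    # `adb shell am start -W`, which reports TotalTime in milliseconds.
    # Package and activity names are placeholders; assumes a device is attached.
    import re
    import subprocess

    PACKAGE = "com.example.mobileapp"            # hypothetical
    ACTIVITY = ".MainActivity"                   # hypothetical

    def measure_cold_start():
        subprocess.run(["adb", "shell", "am", "force-stop", PACKAGE], check=True)
        out = subprocess.run(
            ["adb", "shell", "am", "start", "-W", "-n", f"{PACKAGE}/{ACTIVITY}"],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"TotalTime:\s+(\d+)", out)
        return int(match.group(1)) if match else None

    if __name__ == "__main__":
        samples = [measure_cold_start() for _ in range(5)]
        print("cold start (ms):", samples)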

Measuring in the Real World

While CI measurements come from only a handful of tests & situations, the real world has many, many more situations. Depending on the number of active users, you could have millions of data points with thousands of unique situations. Collecting data from real users, at a large scale, allows you to investigate how things like global regions, network conditions — and even user types — can affect the performance of the application.

There are many third-party systems you can integrate into your code to easily and efficiently collect real world data. It’s not uncommon for companies to grow their own systems as well. In any case, make sure you are validating the data itself. Real world data is messy, so vet both the collection systems and the data. Look for problems like payload corruption, clock skew, range errors, or other oddities.

Create automated queries and reports, sent out broadly for people to review. Remember to go deeper than high-level summaries. Some of the interesting discoveries happen when you split out data across different dimensions.

Some common things to measure from the real world:

  • Network usage, including start time, end time, content type and size of the response. Get detailed connection timing, if possible, for DNS and SSL handshake information.
    • For API endpoints, this is useful for tracking latency and payload size
    • For media loading, this gives a ballpark metric for how long people are staring at an empty box, waiting for an image to load.
  • Event, session and error state data. This can be used to track critical content impressions, but also can be used to learn how people use the application.

Remember to include some common metadata in each measurement so you can split out the data across different dimensions. Things like a non-PII identifier, generic geo-location/region, device specs, and connection type/speed can help you drill down into the data, looking for trends.

It’s also polite to allow people to opt-out of this type of data collection.

Always Be Shipping – Expect the Unexpected

Normal releases are consistent and predictable. Scheduled releases benefit developers, testers, support, and PR. Unpredictable releases can cause communication problems, stress, and fatigue, which can lead to poor software quality and developer turnover.

Sometimes we need to deal with unexpected issues that can’t wait for a normal release. Some examples include:

  • High volume crashes
  • Broken functionality
  • Security issues
  • Special date-based features

Anyone should be able to suggest an off-cycle release, so make sure there’s a straightforward, simple process for doing it. Identify that a special release is really necessary. Maybe the issue can wait for the next normal release. Consider using an approval process to decide if the release is warranted. An approval process creates a small hurdle that forces some justification. An off-cycle release is not cheap and has potential to derail the normal release process. Don’t put the normal release cycle at risk.

Some things to keep in mind:

  • Clearly identify the need. If you can’t, you probably don’t need the release.
  • Limit the scope of work to just what needs to be done for the issue. Be laser focused.
  • Make sure the work can be completed within the shortened cycle. Otherwise, just let the work happen in the normal release flow.
  • Choose an owner to drive the release and a set of stakeholders that need to track the release.
  • Triage frequently to make sure the short cycle stays on track. Over-communicate.
  • Test and verify the code changes. By limiting the scope, you should also be limiting the amount of required testing.

Be ready for the unexpected. Get really good at it. The best releases are boring releases.

Always Be Shipping

We all want to ship as fast as possible, while making sure we can control the quality of our product. Continuous deployment means we can ship at any time, right? Well, we still need to balance the unstable and stable parts of the codebase.

Web Deploys vs Application Deploys

The ability to control changes in your stable codebase is usually the limiting factor in how quickly and easily you can ship your product to people. For example, web products can ship frequently because it’s somewhat easy to control the state of the product people are using. When something is updated on the website, users get the update when loading the content or refreshing the page. With mobile applications, it can be harder to control the version of the product people are using. After pushing an update to the store, people need to update the application on their devices. This takes time and it’s disruptive. It’s typical for several versions of a mobile application to be active at any given time.

It’s common for mobile application development to use time-based deployment windows, such as 2 or 4 weeks. Every few weeks, the unstable codebase is promoted to the stable codebase, and tasks (features and bug fixes) which are deemed stable are made ready to deploy. Getting ready to deploy could mean running a short Beta to test the release candidate with a larger, more varied test group.

It’s important to remember, these deployment windows are not development sprints! They are merely opportunities to deploy stable code. Some features or bug fixes could take many weeks to complete. Once complete, the code can be deployed at the next window.

Tracking the Tasks

Just because you use 2 week deployment windows doesn’t mean you can really ship a quality product every 2 weeks. The deployment window is an artificial framework we create to add some structure to the process. At the core, we need to be able to track the tasks. What is a task? Let’s start with something that’s easy to visualize: a feature.

What work goes into getting a feature shipped?

  • Planning: Define and scope the work.
  • Design: Design the UI and experience.
  • Coding: Do the implementation. Iterate with designers & product managers.
  • Reviewing: Examine & run the code, looking for problems. Code is ready to land after a successful review. Otherwise, it goes back to coding to fix issues.
  • Testing: Test that the feature is working correctly and nothing broke in the process. Defects might require sending the work back to development.
  • Push to Stable: Once implemented, tested and verified, the code can be moved to the stable codebase.

In the old days, this was a waterfall approach. These days, we can use iterative, overlapping processes. A flow might crudely look like this:

[Image: feature-cycle]

Each of these steps takes a non-zero amount of time. Some have to be repeated. The goal is to create a feature that has the desired behavior and at a known level of quality. Note that landing the code is not the final step. The work can only be called complete when it’s been verified as stable enough to ship.

Bug fixes are similar to features. The flow might look like this:

[Image: bug-cycle]

Imagine you have many of these flows happening at the same time. All ongoing work happens on the unstable codebase. As work is completed, tested, and verified at an acceptable level of quality, it can be moved to the stable codebase. Try very hard to keep work on the stable codebase to a minimum – usually disabling/enabling code or backing out unstable code.

Crash Landings

One practice I’ve seen happen on development teams is attempting to crash land code right before a deployment window. This is bad for a few reasons:

  • It forces many code reviews to happen simultaneously across the team, leading to delays since code review is an iterative cycle.
  • It forces large amounts of code to be merged during a short time period, likely leading to merge conflicts – leading to more delays.
  • It forces a lot of testing to happen at the same time, leading to backlogs and delays. Especially since testing, fixing and verifying is an iterative cycle.

The end result is anticlimactic for everyone: code landed at a deployment window is almost never shipped in that window. In fact, the delays caused by crash landing lead to a lot of code missing the deployment window.

[Image: crash-landing]

Smooth Landings

A different approach is to spread out the code landings. Allow code reviews and testing/fixing cycles to happen in a more balanced manner. More code is verified as stable and can ship in the deployment window. Code that is not stable is disabled via build-time or runtime flags, or in extreme cases, backed out of the stable codebase.

[Image: smooth-landing]

This balanced approach also reduces the stress that accompanies rushing code reviews and testing. The process becomes more predictable and even enjoyable. Teams thrive in healthy environments.

Once you get comfortable with deployment windows and sprints being very different things, you could even start getting more creative with deployments. Could you deploy weekly? I think it’s possible, but the limiting factor becomes your ability to create stable builds, test and verify them, and submit them to the store. Yes, you still need to test the release candidates and react to any unexpected outcomes from the testing. Testing the release candidates with a larger group (Beta testing) will usually turn up issues not found in other testing. At larger scales, many things thought to be only hypothetical become reality and might need to be addressed. Allowing for this type of beta testing improves quality, but may limit how short a deployment window can be.

Remember, it’s difficult to undo or remove an unexpected issue from a mobile application user population. Users are just stuck with the problem until they get around to updating to a fixed version.

I’ve seen some companies use short deployment window techniques for internal test releases, so it’s certainly possible. Automation has to play a key role, as does tracking and triaging the bugs. Risk assessment is a big part of shipping software. Know your risks, ship your software.

Fun with Telemetry: Improving Our User Analytics Story

My last post talks about the initial work to create a real user analytics system based on the UI Telemetry event data collected in Firefox on Mobile. I’m happy to report that we’ve had much forward progress since then. Most importantly, we are no longer using the DIY setup on one of my Mac Minis. Working with the Mozilla Telemetry & Data team, we have a system that extracts data from UI Telemetry via Spark, imports the data into Presto-based storage, and allows SQL queries and visualization via Re:dash.

With data accessible via Re:dash, we can use SQL to focus on improving our analyses:

  • Track Active users, daily & monthly
  • Explore retention & churn
  • Look into which features lead to retention
  • Calculate user session length & event counts per session
  • Use funnel analysis to evaluate A/B experiments

[Image: loadurl-types]

[Image: loadurl-retention-effect]

[Image: dropoff-rate]

Roberto posted about how we’re using Parquet, Presto, and Re:dash to create an SQL-based query and visualization system.

Fun with Telemetry: DIY User Analytics Lab in SQL

Firefox on Mobile has a system to collect telemetry data from user interactions. We created a simple event and session UI telemetry system, built on top of the core telemetry system. The core telemetry system has been mainly focused on performance and stability. The UI telemetry system is really focused on how people are interacting with the application itself.

Event-based data streams are commonly used to do user data analytics. We’re pretty fortunate to have streams of events coming from all of our distribution channels. I wanted to start doing different types of analyses on our data, but first I needed to build a simple system to get the data into a suitable format for hacking.

One of the best one-stop sources for a variety of user analytics is the Periscope Data blog. There are posts on active users, retention and churn, and lots of other cool stuff. The blog provides tons of SQL examples. If I could get the Firefox data into SQL, I’d be in a nice place.

Collecting Data

My first step is performing a little ETL (well, the E & T parts) on the raw data using the Spark/Python framework for Mozilla Telemetry. I wanted to create two datasets:

  • clients: Dataset of the unique clients (users) tracked in the system. Besides containing the unique clientId, I wanted to store some metadata, like the profile creation date. (script)
  • events: Dataset of the event stream, associated to each client. The event data also has information about active A/B experiments. (script)

Building a Database

I installed Postgres on a Mac Mini (powerful stuff, I know) and created my database tables. Because I was periodically collecting the data via my Spark scripts, I couldn’t guarantee I wouldn’t re-collect data from previous jobs, so I couldn’t just bulk insert the data. I wrote some simple Python scripts to quickly import the data (clients & events) without creating any duplicates.
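The duplicate-safe import boils down to an insert that ignores rows already present. Here is a sketch for the clients dataset using psycopg2; the table, column, and CSV field names are assumptions, and ON CONFLICT needs Postgres 9.5 or newer.

    # import_clients.py -- sketch of the duplicate-safe import. Table, column, and
    # CSV field names are assumptions, and ON CONFLICT needs Postgres 9.5+.
    import csv
    import psycopg2

    INSERT_SQL = """
        INSERT INTO clients (client_id, profile_created)
        VALUES (%s, %s)
        ON CONFLICT (client_id) DO NOTHING
    """

    def import_clients(csv_path):
        conn = psycopg2.connect(dbname="telemetry")
        with conn, conn.cursor() as cur, open(csv_path) as f:
            rows = [(r["clientId"], r["profileDate"]) for r in csv.DictReader(f)]
            cur.executemany(INSERT_SQL, rows)

    if __name__ == "__main__":
        import_clients("clients.csv")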

[Image: fennec-telemetry-data]

I decided to start with 30 days of data from our Nightly and Beta channels. Nightly was relatively small (~330K rows of events), but Beta was more significant (~18M rows of events).

Analyzing and Visualizing

Now that I had my data, I could start exploring. There are a lot of analysis/visualization/sharing tools out there. Many are commercial and have lots of features. I stumbled across a few open-source tools:

  • Airpal: A web-based query execution tool from Airbnb. Makes it easy to save and share SQL analysis queries. Works with Facebook’s PrestoDB, but doesn’t seem to create any plots.
  • Re:dash: A web-based query, visualization and collaboration tool. It has tons of visualization support. You can set it up on your own server, but it was a little more than I wanted to take on over a weekend.
  • SQLPad: A web-based query and visualization tool. Simple and easy to set up, so I tried using it.

Even though I wanted to use SQLPad as much as possible, I found myself spending most of my time in pgAdmin: debugging queries, using EXPLAIN to make queries faster, and setting up indexes was all easier there. Once I got the basic things figured out, I was able to use SQLPad more efficiently. Below are some screenshots using the Nightly data:

[Image: sqlpad-query]

[Image: sqlpad-chart]
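For a flavor of the Periscope-style analyses, here is a sketch of a daily active clients query run against the local Postgres from Python. The events schema (an event_time timestamp and a client_id column) is an assumption; the real queries lived in SQLPad.

    # daily_actives.py -- sketch of a Periscope-style daily active clients query
    # against the local Postgres. The events schema (event_time timestamp,
    # client_id column) is an assumption.
    import psycopg2

    DAU_SQL = """
        SELECT date_trunc('day', event_time) AS day,
               COUNT(DISTINCT client_id)     AS dau
        FROM events
        GROUP BY 1
        ORDER BY 1
    """

    if __name__ == "__main__":
        with psycopg2.connect(dbname="telemetry") as conn, conn.cursor() as cur:
            cur.execute(DAU_SQL)
            for day, dau in cur.fetchall():
                print(day.date(), dau)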

Next Steps

Now that I have Firefox event data in SQL, I can start looking at retention, churn, active users, engagement and funnel analysis. Eventually, we want this process to be automated, data stored in Redshift (like a lot of other Mozilla data) and exposed via easy query/visualization/collaboration tools. We’re working with the Mozilla Telemetry & Data Pipeline teams to make that happen.

A big thanks to Roberto Vitillo and Mark Reid for the help in creating the Spark scripts, and Richard Newman for double-dog daring me to try this.

Firefox on Mobile: A/B Testing and Staged Rollouts

We have decided to start running A/B testing in Firefox for Android. These experiments are intended to optimize specific outcomes, as well as inform our long-term design decisions. We want to create the best Firefox experience we can, and these experiments will help.

The system will also allow us to throttle the release of features, called staged rollout or feature toggles, so we can monitor new features in a controlled manner across a large user base and a fragmented device ecosystem. If we need to roll back a feature for some reason, we can do that quickly, without needing people to update software.

Technical details:

  • Mozilla Switchboard is used to control experiment segmenting and staged rollout.
  • UI Telemetry is used to collect metrics about an experiment.
  • Unified Telemetry is used to track active experiments so we can correlate to application usage.

What is Mozilla Switchboard?

Mozilla Switchboard is based on Switchboard, an open source SDK for doing A/B testing and staged rollouts from the folks at KeepSafe. It connects to a server component, which maintains a list of active experiments.

The SDK does create a UUID, which is stored on the device. The UUID is sent to the server, which uses it to “bucket” the client, but the UUID is never stored on the server. In fact, the server does not store any data. The server we are using was ported to Node from PHP and is being hosted by Mozilla.
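To illustrate how a stable, client-side UUID can drive segmenting and staged rollouts, here is a generic bucketing sketch. This is the general idea only, not Switchboard’s actual algorithm: hash the UUID together with the experiment name into a bucket from 0 to 99, then compare it to the rollout percentage.

    # bucketing.py -- the general idea of UUID-based experiment bucketing and
    # staged rollout percentages; this is not Switchboard's actual algorithm.
    import hashlib
    import uuid

    def bucket(client_uuid, experiment_name, num_buckets=100):
        """Deterministically map (client, experiment) to a bucket in [0, num_buckets)."""
        digest = hashlib.sha256(f"{client_uuid}:{experiment_name}".encode()).hexdigest()
        return int(digest, 16) % num_buckets

    def is_enabled(client_uuid, experiment_name, rollout_percent):
        return bucket(client_uuid, experiment_name) < rollout_percent

    if __name__ == "__main__":
        client = uuid.uuid4()
        # Ramp a feature to 25% of clients; the same client always gets the same answer.
        print(is_enabled(client, "new-onboarding", 25))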

We decided to start using Switchboard because it’s simple, open source, has client code for Android and iOS, saves no data on the server and can be hosted by Mozilla.

Planning Experiments

The Mobile Product and UX teams are the primary drivers for creating experiments, but as is common on the Mobile team, ideas can come from anywhere. We have been working with the Mozilla Growth team, getting a better understanding of how to design the experiments and analyze the metrics. UX researchers also have input into the experiments.

Once Product and UX complete the experiment design, Development lands code in Firefox to implement the desired variations of the experiment. Development also lands code in the Switchboard server to control the configuration of the experiment: On what channels is it active? How are the variations distributed across the user population?

Since we use Telemetry to collect metrics on the experiments, the Beta channel is likely our best place to run experiments. Telemetry is on by default on Nightly, Aurora, and Beta, and Beta has the largest user base of those three channels.

Once we decide which variation of the experiment is the “winner”, we’ll change the Switchboard server configuration for the experiment so that 100% of the user base will flow through the winning variation.

Yes, a small percentage of the Release channel has Telemetry enabled, but it might be too small to be useful for experimentation. Time will tell.

What’s Happening Now?

We are trying to be very transparent about active experiments and staged rollouts. We have a few active experiments right now.

  • Onboarding A/B experiment with several variants.
  • Easy entry points for accessing History and Bookmarks on the main menu.
  • Experimenting with the awesomescreen behavior when displaying the search results page.

You can always look at the Mozilla Switchboard configuration to see what’s happening. Over time, we’ll be adding support to Firefox for iOS as well.