Integration Testing: Time to Reboot

I tried to push a plan for bootstrapping an automated integration test system for our Android and iOS applications. The plan was based on strategies I’d used, or seen used, at other companies, but it didn’t fit well with our current situation and workflows. I failed to take those differences into account, and the initiative failed.

Developers never saw the value in spending their time and workflow on the automated integration tests. Test engineers were overwhelmed by the number of manual regression tests required for each release. Even with the manual tests, we have gaps in regression coverage that have let some severe defects ship to users. We are wasting valuable manual testing time on hundreds of regression tests that rarely break, when we should be focusing those people on new-feature and exploratory testing.

Looking Forward

We want to be able to do automated testing on our iOS and Android client apps, from the simplest “does the application start” smoke test to more complicated tests around critical features and functionality. More test automation means:

  1. Finding bugs faster
  2. Focusing manual testing on high value tasks (new features and exploratory testing)
  3. Shipping releases faster with higher quality

Test engineering is highly motivated to do more integration testing as a way to reduce the number of manual regression test cases. Though they don’t have development experience, those folks want to start creating the tests, so we want to keep the barrier to writing tests very low. As we automate regression tests, we want to focus manual testing on new features, exploratory testing, and ad-hoc edge cases.

Objectives for our automated integration testing reboot:

  • Require no knowledge of how to build the applications or of the languages used to develop them.
  • Require little knowledge of how the applications’ UI is structured.
  • Reuse the integration testing framework, code, and knowledge across all application platforms.
  • Reduce the amount of manual integration testing as much as possible.

Approach

We intend to use a black-box approach to installing, launching, and driving the applications. The plan includes:

  • Using Python-based Appium scripts as the framework for integration tests (a minimal example follows this list). Python is a good entry-level programming language, and Appium can black-box test both Android and iOS clients, so we get the same language and framework on both mobile platforms.
  • Using emulators & simulators to run smoke and integration tests. They are easy to set up and run locally, while also being capable in CI.
  • Running the tests several times a day in CI, but not on each PR. The focus is on reducing manual regression testing without adding friction to developer workflows.
  • Sending only consistently failing tests to QA for manual verification and ticket filing.
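
To make the barrier concrete, here is a minimal sketch of what one of these tests can look like. It assumes an Appium server running locally; the capability values, package/activity names, and element IDs are placeholders, and the exact client API varies by Appium Python client version.

```python
# Minimal "does the application start" smoke test using the Appium Python client.
import unittest
from appium import webdriver
from appium.webdriver.common.appiumby import AppiumBy

CAPS = {
    "platformName": "Android",          # or "iOS"
    "automationName": "UiAutomator2",   # "XCUITest" on iOS
    "deviceName": "emulator-5554",
    "appPackage": "com.example.app",    # hypothetical package / activity
    "appActivity": ".MainActivity",
}

class SmokeTest(unittest.TestCase):
    def setUp(self):
        # Classic desired-capabilities style; newer Appium clients use an Options object.
        self.driver = webdriver.Remote("http://localhost:4723/wd/hub", CAPS)

    def tearDown(self):
        self.driver.quit()

    def test_app_starts_and_shows_home(self):
        # Wait for a home-screen element to confirm the app launched.
        self.driver.implicitly_wait(10)
        home = self.driver.find_element(AppiumBy.ACCESSIBILITY_ID, "home_feed")
        self.assertTrue(home.is_displayed())

if __name__ == "__main__":
    unittest.main()
```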

Milestones

Milestone 1 is about creating a solid foundation for the approach. We’ve completed the proof of concept:

  • Test engineers are building scripts using cross-platform tools — and learning to code.
  • Developers have added a few testing hooks into the clients to allow faster, more robust tests.
  • Python scripts have been created for over 200 integration tests using Appium.
  • Tests are running in CI several times a day.
  • Test engineering created a simple system to send consistent failures to QA.
  • Reliability is better than that of the previous Espresso/XCUITest test suite.

We’ve already saved several tester-hours a day from manual regression testing.

Milestone 2 expands the test coverage from only high-priority test cases to medium- and even low-priority test cases. We’re also expanding the tooling to support running the Appium tests on both the Alpha and Beta channels, as well as adding self-service support for running on pull requests. Some additional tasks:

  • Get better at controlling feature flags for more deterministic test flows
  • Start mocking API responses for faster testing and fewer variations due to live data (a rough sketch follows this list)
  • Intercept outgoing requests to track and verify more analytics
  • Create a smaller, faster suite for PR testing
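
As a rough illustration of the API-mocking idea, a canned-response server can be as small as the sketch below. The endpoint path and payload are hypothetical stand-ins; the real work is pointing a test build of the app (or a proxy) at it.

```python
# Serve canned JSON responses so tests see deterministic data instead of live content.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CANNED = {
    "/api/dashboard": {"posts": [{"id": 1, "title": "fixture post"}]},
}

class MockApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(CANNED.get(self.path, {"error": "not mocked"})).encode()
        self.send_response(200 if self.path in CANNED else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Point a test build of the app (or a proxy) at http://<host>:8080
    HTTPServer(("0.0.0.0", 8080), MockApiHandler).serve_forever()
```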

Shipping Faster: The Hackday Mentality

We recently held our Summer Hackday at Tumblr and the results were impressive. I started to think about the mentality of a Hackday and how it differs from a more traditional product feature workflow.

It’s amazing what can be accomplished in a day.

Individuals or small groups start planning their projects in the days leading up to the event. On the day of the event, they’re off and running. They have 24 hours to get something working and demo it to the rest of the company.

Dead-end technical approaches are quickly discarded for alternatives, usually something simpler — the clock is ticking. There is no time for complexity or grand schemes. Get the basics working so you can impress your coworkers.

[Image: Tumblr Hackometer measures completeness & shipping potential]

After the demo presentations, there is always discussion about “how close is this project to shipping?” or “what’s left to do before we could release it?” or some other notion that we could productize certain projects after a little clean-up work.

Productize — the death knell of the Hackday project. But why? I think it’s scope creep: scope of the purpose, but also scope of the code.

The constraint of limited time is a gift, forcing some decisions and removing others, which creates a better environment for completing the project. Hackday projects are often more aligned with the core purpose of the product as well.

  • Focus on a singular purpose — try to be good at one thing.
  • No time or space for complexity — you can’t build whole new architectures.
  • Built on existing frameworks, patterns, and primitives — it fits into the existing product structure.

The Hackday mentality seems like a better process for building better products. It reminds me of the “fixed time, variable scope” principle from Basecamp’s Shape Up, a book describing their product process. They use six week time-boxes for any project.

Constraints limit our options without requiring us to do any of the cognitive work. With fewer decisions involved when we’re constrained, we’re less prone to decision fatigue. Constraints can actually speed up development.

Ship faster.

Thoughts on Organizational Culture

I’ve been thinking a lot about why it seems so hard to effect change in organizations. The change I’m referring to could be related to product strategy, processes, or improving engineering / operational excellence.

I’ve come to realize that in many situations our efforts and plans don’t always align with the organization’s culture. When that happens, change is difficult.

I’m using culture here to mean something deeper than espresso machines, foosball tables, and edgy office decor — the visible parts of an organization’s culture. I’m talking about an organization’s beliefs, values, and basic assumptions — the things people take for granted and that guide decisions. These may have started with the founders, but they’ve evolved over time as we praise and recognize specific behaviors.

From Edgar Schein’s “Organizational Culture and Leadership”:

The only thing of real importance that leaders do is to create and manage culture. If you do not manage culture, it manages you, and you may not even be aware of the extent to which this is happening.

We need to become aware of the organization’s culture and learn to manage it in the direction of our desired outcomes.

From Schein’s framework for changing culture:

Change creates learning anxiety. The higher the learning anxiety, the stronger the resistance.

  • The only way to overcome resistance is to reduce the learning anxiety by making the learner feel “psychologically safe”.
  • The change goal must be defined concretely in terms of the specific problem you are trying to fix, not as culture change.

I find myself trying to learn what people value about an existing behavior and how it relates to a purpose or mission. If I want to change to a different behavior, I must show a higher value in the new behavior. This can sometimes be easier if I can create an association to our existing basic assumptions.

From a Kellan Elliott-McCrea post:

Culture is what you celebrate. Rituals are the tools you use to shape culture.

Celebrate work and actions that align with strategy. We need to reinforce what we think is important. Reinforcement requires consistent messaging.

  • Create a brief, to-the-point mission or high-level purpose
  • Establish a few simple & crisp principles that support the mission

Use these as a framework to scope & define objectives and strategy. They also provide a foundation for shaping culture.


Thoughts on Dependency Injection

I hear some strong opinions on dependency injection (DI). I’ve never really thought too much about DI specifically, but it is part of an Inversion of Control strategy, which I think about a lot.

Focus on the developer experience, low-friction maintenance, and code-health outcomes. What’s important to me:

  • Loosely coupled code
  • Easy to test code
  • Simple code
  • Easy to maintain code

Many folks seem to focus on constructor- or method-based DI, and I agree the approach works great for shallow code hierarchies. I’d argue that loosely coupled, easily testable code requires constructor/method DI. But trying to inject everything across deep call stacks gets more painful the deeper you go, creating friction for developers trying to update code and possibly inhibiting code-health refactors.
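
For shallow hierarchies, constructor injection really is hard to beat. A tiny illustrative sketch (the class and method names are made up):

```python
# Constructor injection keeps the dependency explicit and easy to fake in tests.
class HttpClient:
    def get(self, url: str) -> str:
        raise NotImplementedError

class FeedLoader:
    def __init__(self, http: HttpClient):   # dependency passed in, not created here
        self._http = http

    def load(self) -> str:
        return self._http.get("https://example.com/feed")

# In a test, swap in a stub without touching FeedLoader internals.
class FakeHttp(HttpClient):
    def get(self, url: str) -> str:
        return '{"items": []}'

assert FeedLoader(FakeHttp()).load() == '{"items": []}'
```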

Singletons are usually considered pure evil — hiding code details, creating global state, and making it difficult to test code. That said, they work nicely for accessing basic services and configuration from anywhere.

Service locators sit somewhere between pure DI and singletons. They make it pretty easy to swap concrete and mock services, though you are adding a dependency on the locator itself wherever it’s used. TBH, I think of things like Dagger as annotation-based service locator tools.
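
Here is roughly what I mean, as an illustrative sketch rather than a recommendation of any particular library; the registry and service names are made up:

```python
# A tiny service-locator sketch: one registry dependency instead of threading
# every service through deep call stacks.
class ServiceLocator:
    _services = {}

    @classmethod
    def register(cls, name, service):
        cls._services[name] = service

    @classmethod
    def get(cls, name):
        return cls._services[name]

# Production wiring...
ServiceLocator.register("logger", lambda msg: print(f"[app] {msg}"))

def deep_in_the_call_stack():
    # ...and any code, however deep, can reach the service without new parameters.
    ServiceLocator.get("logger")("network check failed")

# Tests swap in a fake with one line.
captured = []
ServiceLocator.register("logger", captured.append)
deep_in_the_call_stack()
assert captured == ["network check failed"]
```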

The tl;dr: code gets complicated, so instead of being too idealistic about implementation details, know when to be pragmatic and focus on the higher-level objectives. Consider the pros & cons of different approaches. Legacy code is not an ideal situation, but it’s one you need to handle; it’s a rare treat to get to work in brand-new code. That means making pragmatic choices.

  • Make the best compromise to increase the testability of your code.
  • Avoid thousand-line code changes just to add logging or a network check in one place.
  • Keep the code simple and clean, focusing on a code-maintenance point of view.

Don’t blindly follow idealistic dogma. Make choices that deliver the best impact.

Web ADB: Simple Web-based Access to Devices

I’ve had a number of occasions where I needed direct access to an Android device that wasn’t connected to the computer in front of me. I can usually SSH into the remote host machine and use ADB to try to debug the situation. If the simple stuff doesn’t work, I eventually start using ADB screencap to get a look at what’s on the device. If I’m lucky, I can remote desktop to the host. If not, I end up copying the images back to my machine to view them.

[Image: Connecting to a remote host with some Android devices attached to it]

Surely there must be an easier way.

There is! I found the OpenSTF project, which gives you web-based control of Android and iOS devices. Just install the system on the host machine and an agent on the Android devices. It looks pretty cool, but it always seemed like overkill when I was in a remote debugging situation.

So I decided I’d start hacking together a really simple system in Python. I started with the simplest Python API server I could find. Then I added a fairly basic webapp front-end. The result is Web ADB.

It’s a very minimal Python API server, which also serves up a basic single-page webapp. The approach is pretty simple: run ADB commands via Python, parse the output, send the results back through the API response.
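
The core of it looks roughly like the sketch below. The function names are illustrative, not Web ADB’s actual code, but the ADB commands are the standard ones for listing devices, capturing the screen, and sending taps.

```python
# Shell out to adb, parse the output, and return data the API layer can serialize.
import subprocess

def adb(serial, *args):
    """Run an adb command against a specific device and return its stdout."""
    cmd = ["adb", "-s", serial, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def list_devices():
    """Parse `adb devices` output into a list of attached serial numbers."""
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    return [line.split()[0] for line in out.splitlines()[1:] if line.strip()]

def tap(serial, x, y):
    """Send a screen tap, the same thing the webapp does when you click a screenshot."""
    adb(serial, "shell", "input", "tap", str(x), str(y))

def screenshot(serial, path="screen.png"):
    """Capture the device screen and pull it back to the host."""
    adb(serial, "shell", "screencap", "-p", "/sdcard/screen.png")
    adb(serial, "pull", "/sdcard/screen.png", path)
    return path
```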

The API supports getting the attached devices, grabbing a screenshot of a device, sending key presses and screen taps, and even rebooting a device. The webapp just uses the API to make something useful. Maybe the only cool feature is that clicking on a screenshot sends a tap to the device and then updates the screenshot. I have some ideas for other features, as time permits.

Stability: Monitoring Application Crashes

Crashes are inevitable. For an application with many people actively using it, the number of crashes is probably higher than you’d imagine. Thankfully, there are many in-house and third-party systems that let you track crashes with detailed metadata and crash stacks.

It takes some practice and diligence, but you can get very good at managing and fixing the most relevant crashes in your application.

When I started working at Tumblr, I noticed they were using an additional approach to tracking crashes: a real-time, low-cardinality crash event.

Hooked up to the time-series collection system used for other monitoring, that event gives you a very up-to-date measure of the crash rate. It was the first time I’d seen crash rate measured this way. Plenty of crash-tracking systems let you look at crash rate, but usually at an hourly granularity, since ingestion can take a non-trivial amount of time once parsing, symbolication, and deobfuscation are involved.
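
As a hedged sketch of the idea (not Tumblr’s actual implementation): on crash, fire a single low-cardinality counter at a StatsD-style collector before handing off to the regular crash reporter. The metric name and dimensions here are made up.

```python
# Emit one low-cardinality crash counter over UDP (StatsD line protocol).
import socket
import sys

def emit_crash_event(host="127.0.0.1", port=8125, platform="android", app_version="1.2.3"):
    # One counter, tagged only with coarse dimensions, keeps cardinality low.
    metric = f"app.crash.{platform}.{app_version.replace('.', '_')}:1|c"
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(metric.encode(), (host, port))

def crash_hook(exc_type, exc_value, exc_tb):
    emit_crash_event()
    sys.__excepthook__(exc_type, exc_value, exc_tb)

# Install as a last-chance handler; a mobile client would do the equivalent in its
# uncaught-exception handler before the crash reporter takes over.
sys.excepthook = crash_hook
```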

The real-time crash monitoring is perfect for alerting too. The time-series system already had support for notifying developers or paging on-call people.

Feature flags are an awesome way to ramp new functionality in a controlled manner. I’ve seen real-time crash monitoring catch issues moments after a problematic ramp. If you’re using feature flags, you should be using real-time crash reporting as well. It’s enormously beneficial and relatively simple to implement.

Decisions, Decisions, Decisions

A big part of leading people and teams is making decisions. You can’t move forward without making decisions. As the lead, or manager, it’s part of your job. Don’t pass it down to someone on your team. They’re looking to you.

  • Get the facts: Try to be well informed about the situation, the inputs affecting the decision, and the likely outcomes of the decision. If you’re not well informed, start asking people who are. They might have the information, or parts of it, but aren’t in a position to make the decision.
  • Trust your gut: It’s unlikely you’ll ever have 100% of the information you need. That’s fine; the 40/70 rule says to decide once you have between 40% and 70% of the information, because waiting for 100% is waiting too long.
  • Know the risks: Not every decision will be the right one. Mistakes happen, but the trick is minimizing the cost of a potential mistake by knowing and lowering the risk. People point to the 10/10/10 approach (how will you feel about the decision in 10 minutes, 10 months, and 10 years?) as one way to do this.
  • Be consistent: Try to have a process, convention, guideline, or philosophy to fall back on to help shape your decision. Exceptions can happen, but decisions are easier to make if you have a well-known starting place to build from.
  • Don’t delay: The longer it takes to make a decision, the more problematic a situation can become. In the long run, every decision is short term. Delays keep the team from moving forward. Brian Valentine, lead developer on Windows 2000, had a famous quote: “Decisions in 10 minutes or less, or the next one is free.”

These guidelines can be useful when you’re in a period of heavy decision making. Decision fatigue is a real condition. It can reduce your ability to make trade-offs and understand risk, or even lead you to avoid making the decision altogether. If you’re not prepared, your decisions will suffer and so will your team.

Random Thoughts on Team Structure


I’ve written previously about my thoughts on team structure. I’m a fan of product-centric teams — multidisciplinary teams that embed members from functional groups on the same team, all working together to create and ship a software product.

Team Evolution

At some point, a team might grow large enough that you want to split into smaller groups, each with a primary focus. You’re still building a single product, but now you have a collection of product-centric teams working on specific features. How did you get here?

  • Teams get harder to manage and coordinate as they grow in size.
  • Product-drivers feel like it’s a struggle to get development focus on their features.

I’ve been able to work in both situations: single product-centric team, and multiple feature-based teams. My preference is still the single product-based team. The downsides of feature-based teams outweigh the advantages.

  • Teams become silos and stop focusing on the product as a whole.
  • Issues without a clear owner become someone else’s problem.
  • Cross-team communication becomes more difficult as more groups are created.
  • Individual team ambitions inadvertently dilute the primary focus of the product.

Conway’s Law tells us that organizations tend to build products based on the organization’s structure. Using several small teams, with a focus on specific features, will have an effect on the final product. It might not be a desired effect.

Mindful Divisions

I’m not suggesting teams grow beyond 7 to 10 people. There is plenty of literature, and experience, that tells us that would be bad, and even less efficient. But how you divide teams is important. Some divisions are more natural than others:

  • By platform (Desktop, Android, iOS, Web): Make sure there is some product consistency across platforms.
  • By front-end / back-end: Make sure both sides are part of defining the interaction APIs.
  • By application / UI widgets: Make sure both sides are part of defining the component APIs.

These separations are clean and easier to identify.

Feature Survival

Single product-centric teams bring us back to the issue of product-drivers fighting for development focus. I think this is a good thing.

When features are implemented through a single team, you need to be good at prioritizing. It shouldn’t be easy to add every little feature to the product. By making all features compete for priority, you make sure the best features get the attention.

I believe this makes the product stronger.

Performance: Design & Expectations


Providing a great user experience under a variety of performance situations means designing for a variety of expectations. Defining a great user experience should include performance budgets for the various pieces of the experience. Some basic examples include: Application startup, Image loading, and UI responsiveness.

Application Startup

First impressions are important, and startup time (page-load for web apps) is that first impression. There is a lot of information floating around that should convince you of the importance of a fast launch time. Still, we seem to cram more and more cruft into that part of the application. We initialize analytic libraries, load saved preferences, try to re-send failed or queued events and data, and maybe even send crash reports from previous sessions.

Create a startup time performance budget. How long should a person wait before being able to view and interact with real content? Once you set that limit, start moving less critical work out of the critical path. Queued events and data can remain queued a little while longer. Buffer new events before initializing heavy analytics libraries. Consider showing cached content, while downloading new content in the background.
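
As an illustration of the buffering idea, here is a hypothetical sketch: events recorded during startup are queued cheaply and flushed once the heavy analytics client has been initialized off the critical path.

```python
# Buffer analytics events until the expensive client init finishes in the background.
import queue
import threading

class BufferedAnalytics:
    def __init__(self):
        self._pending = queue.Queue()
        self._client = None

    def track(self, name, **props):
        if self._client is None:
            self._pending.put((name, props))   # cheap: just remember the event
        else:
            self._client.track(name, **props)

    def init_later(self, build_client):
        def _init():
            self._client = build_client()      # heavy init, off the startup path
            while not self._pending.empty():
                name, props = self._pending.get()
                self._client.track(name, **props)
        threading.Thread(target=_init, daemon=True).start()
```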

Image Loading

Images are a big part of many applications. Avatars, photos and GIFs are everywhere. We want to display the images as fast as possible. Usually development happens in ideal situations: best devices and best network speeds – the fast path. If you’re measuring the real-world performance characteristics of your application, you probably know that most of the people using the application don’t have the best devices or fast network speeds – the slow path.

Sometimes we fail to design for the slow-path. We assume it’s infrequent, or worse, we believe the fast-path behavior is correct for the product in any situation. People can just deal with the crummy experience. Remember your performance budget: how long should someone wait for an image to load?

Some common approaches to handling the slow-path:

  • Use server-side caching. This one is pretty obvious, but I have to mention it. Using a Content Delivery Network (CDN) means it takes less time to deliver images to the application because the images are “closer” to the application.
  • Use a more efficient file format. GIF is not known for being a lightweight format. Look into WebP and MP4 as low-bandwidth animated image alternatives that provide great quality.
  • Get better at picking JPEG quality levels. Etsy has a nice write-up on using SSIM (human vision estimation) to pick the lowest level without hurting perceived quality. Google has something called Butteraugli that does something similar.
  • Dynamically size images to fit the target rendering size. Don’t download large images only to reduce the size on the client. For less-than-excellent network speeds, request images that are smaller than the target size and upscale them (a sketch of this follows the list). You can save a lot of bandwidth and render the image quickly, keeping the application usable.
  • Aggressively cache images on the device. Never download the same image more than once. Cached images load quickly and reduce bandwidth usage. Yes, this might mean using 1GB or more for a cache, but if the space is available, it’s always worth it. Modern OSes will try to clear storage-based caches when running low on free space.
  • Consider Tap-to-Play interfaces to delay downloading large animated images until requested. Use a much smaller static image as a placeholder.
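
As an illustration of the dynamic-sizing idea, the client can pick a request width from the render size and connection type. The width query parameter here is hypothetical; real CDNs use their own parameters.

```python
# Request a smaller image on slow connections and upscale on the client.
def image_request_width(target_px: int, connection: str) -> int:
    scale = {"wifi": 1.0, "4g": 1.0, "3g": 0.66, "2g": 0.5}.get(connection, 0.66)
    return max(64, int(target_px * scale))

def image_url(base_url: str, target_px: int, connection: str) -> str:
    return f"{base_url}?w={image_request_width(target_px, connection)}"

# e.g. a 640px slot on a 3G connection requests a ~422px image and upscales it.
print(image_url("https://cdn.example.com/photo.jpg", 640, "3g"))
```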

Some of these slow-path ideas might be so effective at saving bandwidth or improving image loading speed, that you make them options for the fast-path as well.

UI Responsiveness

Touch-based devices make unresponsive UIs very noticeable. Applications should maintain a responsive UI no matter what other activity is taking place. Use background threads to do the heavy lifting. Keep the UI thread free of any file I/O, networking and any other work that can be pushed to the background.

Remember to design for the slow-path when creating UI actions associated with network APIs. Don’t wait for the network response before changing the state of the UI. If the action fails, you can always say so and flip the state back. Delaying the state change makes the UI appear broken.
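
A small sketch of that optimistic-update pattern; the ui and api objects here are hypothetical stand-ins for your view layer and network client.

```python
# Flip the UI state immediately, do the network call in the background,
# and revert only if the request fails.
import threading

def toggle_like(ui, api, post_id):
    ui.set_liked(post_id, True)                 # update the UI right away

    def _send():
        try:
            api.like(post_id)                   # slow-path network call
        except Exception:
            ui.set_liked(post_id, False)        # revert and surface the error
            ui.show_error("Couldn't like this post. Try again?")

    threading.Thread(target=_send, daemon=True).start()
```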

Smooth scrolling is another part of a responsive UI. iOS, Android and Web all have best practices for keeping high frame-rates while scrolling. There are also tools for profiling your rendering code.

Watch for situations where a design requirement (animation, layout, whatever) is causing the UI to become unresponsive. Find a way to fast-path/slow-path the requirement. If that’s not possible, get the requirement changed.

Design for Everyone

Never fall into the mindset of designing for only the latest hardware and fastest network speeds. You really need to factor the slow-path into your designs too. Yes, it’s more work and it’s probably not your ideal experience, but it’s far better than trying to force a fast-path design down a slow-path situation. That experience is usually horrible. You can do better!

Performance: The Merits of Measuring

If you cannot measure it, you cannot improve it. – Lord Kelvin.

There is so much information out there on ways to improve the performance of your mobile application or website. You probably feel you can just dive in and start making changes. But if you’re not measuring your application’s performance, you don’t know if anything is really helping or hurting. How do you know what effect any changes will have on the performance? Most applications are complex enough that we can’t assume our simplistic reasoning accurately reflects the code behavior. You need to measure.

You need measurements from before and after any changes are made. Your application has development phases, and so should your measurement plan. Measure in CI to find improvements and regressions as soon as they happen. Measure in the real world to find out how variations like network conditions, device fragmentation, and unpredictable user behavior manifest in performance.

Measuring in CI

The point of measuring performance in CI is to control variability and watch for relative differences on each change. Try to reduce the noisy variables like network calls and background services to create a fast and consistent surface on which you can monitor performance changes in a reliable and repeatable manner. The purpose is not to determine the performance your users will encounter. There are too many variables in the real world and you can’t control them well enough for apples-to-apples comparisons.

Use real devices when measuring performance, not emulators or simulators running on host hardware.

You can try to create reliable & repeatable simulations of some real world situations. Network connection speed is one example. You can use a network simulator, like Facebook’s Augmented Traffic Control system to simulate WiFi and mobile network conditions. This is especially useful if your application is designed to react differently under different network conditions. You can also use different types of content in the tests, trying to mimic some high level differences your users might encounter.

If you’re measuring data in CI, you should be storing and displaying it as well. Try to get the CI to alert on regressions, failing changes before they make it into the product.

Some common things to measure in CI:

  • Launch time to show UI
  • Launch time to interactive UI
  • Scroll performance (janky frames)
  • Time to load content
  • Memory usage (startup, after content loads, after scrolling content)

Remember to use multiple (physical) devices and a variety of content types.
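
As one hedged example of a CI measurement, cold-start time on Android can be sampled with `adb shell am start -W`, which reports launch timing. The package and activity names below are placeholders.

```python
# Measure cold-start time via adb and return the reported TotalTime in milliseconds.
import re
import subprocess

def measure_cold_start(serial, package="com.example.app", activity=".MainActivity"):
    # Force-stop first so we measure a cold start, not a warm resume.
    subprocess.run(["adb", "-s", serial, "shell", "am", "force-stop", package], check=True)
    out = subprocess.run(
        ["adb", "-s", serial, "shell", "am", "start", "-W", "-n", f"{package}/{activity}"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"TotalTime:\s*(\d+)", out)
    return int(match.group(1)) if match else None

# Run it a few times per device and store the numbers so CI can alert on regressions.
```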

Measuring in Real World

While CI measurements come from only a handful of tests & situations, the real world has many, many more situations. Depending on the number of active users, you could have millions of data points with thousands of unique situations. Collecting data from real users, at a large scale, allows you to investigate how things like global regions, network conditions — and even user types — can affect the performance of the application.

There are many third-party systems you can integrate into your code to easily and efficiently collect real-world data. It’s not uncommon for companies to grow their own systems as well. In any case, make sure you are validating the data itself. Real-world data is messy, so vet both the collection systems and the data, looking for problems like payload corruption, clock skew, range errors, or other oddities.

Create automated queries and reports, sent out broadly for people to review. Remember to go deeper than high-level summaries. Some of the interesting discoveries happen when you split out data across different dimensions.

Some common things to measure from real world:

  • Network usage, including start time, end time, content type and size of the response. Get detailed connection timing, if possible, for DNS and SSL handshake information.
    • For API endpoints, this is useful for tracking latency and payload size
    • For media loading, this gives a ballpark metric for how long people are staring at an empty box, waiting for an image to load.
  • Event, session and error state data. This can be used to track critical content impressions, but also can be used to learn how people use the application.

Remember to include some common metadata in each measurement so you can split the data across different dimensions. Things like a non-PII identifier, a coarse geo-location/region, device specs, and connection type/speed help you drill down into the data, looking for trends.
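
As an illustration, a real-world network timing event might be shaped like the sketch below. The field names are hypothetical; the point is attaching the coarse metadata that lets you slice the data later.

```python
# Hypothetical shape of a network timing event with coarse, sliceable metadata.
import time
import uuid

CLIENT_ID = str(uuid.uuid4())  # in a real client, generated once per install, not per event

def network_timing_event(url, status, bytes_received, duration_ms,
                         device_model, connection_type, region):
    return {
        "event": "network_request",
        "ts": int(time.time() * 1000),
        "client_id": CLIENT_ID,        # non-PII identifier
        "url": url,                    # strip query strings / anything user-specific first
        "status": status,
        "bytes": bytes_received,
        "duration_ms": duration_ms,
        "device_model": device_model,  # e.g. "Pixel 4a"
        "connection": connection_type, # e.g. "wifi", "4g"
        "region": region,              # coarse geo only
    }
```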

It’s also polite to allow people to opt-out of this type of data collection.