Guiding Teams to Outcomes


You work in an organization that sets some high-level goals. Your team might be accountable for some of those goals. However, to hit the goals, you’ll need cooperation from groups outside of your team.

What do you do? How do you get everyone on the path to achieving those shared outcomes?

Situations like this happen a lot. Some ideas:

  • Make sure the path is clearly marked.
  • Make it easy for people to stay on the path.
  • Make it hard for people to go off the path.
  • Be the voice of encouragement.
  • Be the voice of recognition.
  • Assume people want to be on the path, but they might also be busy with other problems.

Managing “friction” can be a useful technique for getting everyone working toward the goals. Reduce friction on anything that moves you toward the outcomes, and add friction to anything that works against them.

  • Centralize documentation for checklist processes. Better yet, automate as many of the steps as possible. Even better, fold the remaining manual steps into your automation so there is only one true list.
  • Do more checks in your continuous integration (CI) system, especially adding automated tests (unit, integration and performance). Stop regressions ASAP.
  • Make sure the output of your process is being measured and is clearly visible to everyone. Put up monitors with charts and graphs in your open office spaces. Showing progress and trends helps to reinforce the importance of everyone’s role in hitting goals.
  • Add anomaly detection to the measurement data. Don’t count on people to find problems in real time. (A minimal sketch follows this list.)
  • Don’t be surprised if you need to keep repeating the plan.
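
For the anomaly detection item, here’s a minimal sketch of the idea in Python, assuming a simple rolling mean/standard deviation check; the window and threshold values are invented for illustration, not from any real monitoring setup:

    from statistics import mean, stdev

    def find_anomalies(values, window=20, threshold=3.0):
        # Flag points outside mean +/- threshold * stdev of the
        # preceding `window` samples (made-up thresholds).
        anomalies = []
        for i in range(window, len(values)):
            history = values[i - window:i]
            mu, sigma = mean(history), stdev(history)
            if sigma and abs(values[i] - mu) > threshold * sigma:
                anomalies.append((i, values[i]))
        return anomalies

    # Example: a steady metric with one spike at index 25.
    series = [100.0 + (i % 5) for i in range(30)]
    series[25] = 250.0
    print(find_anomalies(series))  # -> [(25, 250.0)]

Even a crude check like this catches the obvious spikes without waiting for a human to notice a chart.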

Communicating via IRC/Slack


I’ve used messaging tools like IRC and Slack at work every day for the last 11+ years. For much of that time, I was working remotely. Even my co-workers, working in offices, used messaging tools as much as I did.

When I first started working for Mozilla and visited one of the main offices, it was weird to see this happening. People sitting a few desks away from each other were chatting via messaging instead of just speaking. There are many reasons why this happens, but a primary outcome was that it allowed remote workers to be included in almost all discussions.

Spending so much time in messaging tools has downsides. Communicating efficiently via text-based messaging requires learning how to be a better communicator. Text-based messaging loses almost all the tone and nuance of a spoken conversation. A lot of context and expression happens via non-verbal communication. That is missing from text-based messaging and it can lead to communication problems.

Assume people are generally good: It’s easy to read a message and assume the worst. That rarely ends well.

  • You’re disrupting a person’s workflow. You don’t know what they are doing or their mood. Be gentle, not demanding.
  • When things start to get complicated, confusing or heated, switch to face-to-face or video chat.
  • Make your intent clear. When people are left to infer, things can go poorly. For example:
      me: Is your project almost finished?
    them: (is he asking for status or thinking I'm too slow?)

Be careful delivering critical feedback: Giving someone feedback can be tricky in face-to-face conversations. Over text messaging, it can be dangerous. It’s easy for feedback to be seen as a personal attack.

  • Use questions, not statements. This encourages more discussion and doesn’t put people on the defensive as much.
  • Use examples of your own failures. Show you know people make mistakes.
  • Focus on the outcome, not the current implementation. Are we getting to the outcome we want?
  • Use ‘we’ instead of ‘you’ when possible. We are in this together.

Be ready to moderate: Sometimes people aren’t on the same page. If you see a conversation getting heated, try to defuse it. It’s important to keep communication channels a ‘safe space’ for everybody.

  • Use private channels to let people know a conversation is off-track. People can be unaware of how bad things have become.
  • Be public when needed. Other people in the channel will benefit from knowing the limits of acceptable behavior.
  • Moderation is a form of critical feedback. See above.

Not everyone will get the joke: Using humor over text-based messaging can backfire. Trying to be funny can lead to situations where no one is laughing, or even worse, people could be offended. Use caution.

Always Be Shipping – Expect the Unexpected

Normal releases are consistent and predictable. Scheduled releases benefit developers, testers, support and PR. Unpredictable releases can cause communication problems, stress and fatigue, which in turn lead to poor software quality and developer turnover.

Sometimes we need to deal with unexpected issues that can’t wait for a normal release. Some examples include:

  • High volume crashes
  • Broken functionality
  • Security issues
  • Special date-based features

Anyone should be able to suggest an off-cycle release, so make sure there’s a simple, clearly documented process for doing it. Verify that a special release is really necessary; maybe the issue can wait for the next normal release. Consider using an approval process to decide whether the release is warranted. An approval process creates a small hurdle that forces some justification. An off-cycle release is not cheap and has the potential to derail the normal release process. Don’t put the normal release cycle at risk.

Some things to keep in mind:

  • Clearly identify the need. If you can’t, you probably don’t need the release.
  • Limit the scope of work to just what needs to be done for the issue. Be laser focused.
  • Make sure the work can be completed within the shortened cycle. Otherwise, just let the work happen in the normal release flow.
  • Choose an owner to drive the release and a set of stakeholders that need to track the release.
  • Triage frequently to make sure the short cycle stays on track. Over-communicate.
  • Test and verify the code changes. By limiting the scope, you should also be limiting the amount of required testing.

Be ready for the unexpected. Get really good at it. The best releases are boring releases.

Always Be Shipping

We all want to ship as fast as possible, while making sure we can control the quality of our product. Continuous deployment means we can ship at any time, right? Well, we still need to balance the unstable and stable parts of the codebase.

Web Deploys vs Application Deploys

The ability to control changes in your stable codebase is usually the limiting factor in how quickly and easily you can ship your product to people. For example, web products can ship frequently because it’s somewhat easy to control the state of the product people are using. When something is updated on the website, users get the update when loading the content or refreshing the page. With mobile applications, it can be harder to control the version of the product people are using. After pushing an update to the store, people need to update the application on their devices. This takes time and it’s disruptive. It’s typical for several versions of a mobile application to be active at any given time.

It’s common for mobile application development to use time-based deployment windows, such as 2 or 4 weeks. Every few weeks, tasks (features and bug fixes) that are deemed stable are promoted from the unstable codebase to the stable codebase and made ready to deploy. Getting ready to deploy could mean running a short Beta to test the release candidate with a larger, more varied test group.

It’s important to remember that these deployment windows are not development sprints! They are merely opportunities to deploy stable code. Some features or bug fixes could take many weeks to complete. Once complete, the code can be deployed at the next window.

Tracking the Tasks

Just because you use 2-week deployment windows doesn’t mean you can really ship a quality product every 2 weeks. The deployment window is an artificial framework we create to add some structure to the process. At the core, we need to be able to track the tasks. What is a task? Let’s start with something that’s easy to visualize: a feature.

What work goes into getting a feature shipped?

  • Planning: Define and scope the work.
  • Design: Design the UI and experience.
  • Coding: Do the implementation. Iterate with designers & product managers.
  • Reviewing: Examine & run the code, looking for problems. Code is ready to land after a successful review. Otherwise, it goes back to coding to fix issues.
  • Testing: Test that the feature is working correctly and nothing broke in the process. Defects might require sending the work back to development.
  • Push to Stable: Once implemented, tested and verified, the code can be moved to the stable codebase.

In the old days, this was a waterfall approach. These days, we can use iterative, overlapping processes. A flow might crudely look like this:

[Figure: feature cycle flow]

Each of these steps takes a non-zero amount of time. Some have to be repeated. The goal is to create a feature with the desired behavior at a known level of quality. Note that landing the code is not the final step. The work can only be called complete when it’s been verified as stable enough to ship.

Bug fixes are similar to features. The flow might look like this:

[Figure: bug fix cycle flow]

Imagine many of these flows happening at the same time. All ongoing work happens on the unstable codebase. As work is completed, tested and verified at an acceptable level of quality, it can be moved to the stable codebase. Try very hard to keep work on the stable codebase to a minimum – usually just disabling/enabling code or backing out unstable code.
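
Flags are worth a quick illustration. Here’s a minimal sketch, in Python for brevity, of what runtime feature flags can look like; this is a generic pattern with made-up flag names, not how Firefox itself gates features:

    # Hypothetical runtime feature flags: unstable work ships dark, and
    # flipping a flag (not landing new code) turns it on or off.
    FLAGS = {
        "new-search-ui": False,  # still unstable: stays off in stable builds
        "bookmark-sync": True,   # verified: enabled for everyone
    }

    def is_enabled(flag):
        # Unknown flags default to off, so half-landed code stays dark.
        return FLAGS.get(flag, False)

    def render_search():
        if is_enabled("new-search-ui"):
            return "new search UI"
        return "old search UI"  # the stable fallback path

    print(render_search())  # -> "old search UI" until the flag flips

The payoff is that landing and shipping are decoupled: code can land on the unstable codebase weeks before the flag ever flips in a stable build.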

Crash Landings

One practice I’ve seen happen on development teams is attempting to crash land code right before a deployment window. This is bad for a few reasons:

  • It forces many code reviews to happen simultaneously across the team, leading to delays since code review is an iterative cycle.
  • It forces large amounts of code to be merged during a short time period, likely leading to merge conflicts – leading to more delays.
  • It forces a lot of testing to happen at the same time, leading to backlogs and delays. Especially since testing, fixing and verifying is an iterative cycle.

The end result is anticlimactic for everyone: code landed at a deployment window is almost never shipped in that window. In fact, the delays caused by crash landing cause a lot of code to miss the deployment window entirely.

[Figure: crash landing]

Smooth Landings

A different approach is to spread out the code landings. Allow code reviews and testing/fixing cycles to happen in a more balanced manner. More code is verified as stable and can ship in the deployment window. Code that is not stable is disabled via build-time or runtime flags or, in extreme cases, backed out of the stable codebase.

[Figure: smooth landing]

This balanced approach also reduces the stress that accompanies rushing code reviews and testing. The process becomes more predictable and even enjoyable. Teams thrive in healthy environments.

Once you get comfortable with deployment windows and sprints being very different things, you could even start getting more creative with deployments. Could you deploy weekly? I think it’s possible, but the limiting factor becomes your ability to create stable builds, test and verify those builds and submit those builds to the store. Yes, you still need to test the release candidates and react to any unexpected outcomes from the testing. Testing the release candidates with a larger group (Beta testing) will usually turn up issues not found in other testing. At larger scales, many things thought to be only hypothetical become reality and might need to be addressed. Allowing for this type of beta testing improves quality, but may limit how short a deployment window can be.

Remember, it’s difficult to undo or remove an unexpected issue from a mobile application user population. Users are just stuck with the problem until they get around to updating to a fixed version.

I’ve seen some companies use short deployment window techniques for internal test releases, so it’s certainly possible. Automation has to play a key role, as does tracking and triaging the bugs. Risk assessment is a big part of shipping software. Know your risks, ship your software.

On the Merits of Bug Tracking

Yes, I wrote a whole blog post about bugs. Bugs are boring and managing bugs can be mind-numbing. However, all software has bugs and managing those bugs helps you understand the health and quality of your software, helps you understand the risk associated with new features, and helps you figure out if you’re ready to ship or not.

At some point in their careers, most developers want to fix every bug before releasing. This might work for small projects or in situations where you don’t have a lot of testing. For larger projects, especially as projects mature, it’s just not possible to fix all the bugs before releasing a new version, so it’s time to manage your backlog. Creating good bugs reduces the time it takes to manage and fix them.

  • Summary – Be explicit and contextual. This is the text that shows up in bug lists. Something too vague, like “Crash when posting”, will require people to always open the details to figure out the context.
  • Steps to reproduce – Be clear and precise. What was the expected behavior vs actual behavior? Does it happen all the time or is it intermittent and hard to reproduce?
  • Description – Is the bug a crash, broken behavior, performance or regression? Make sure you add these details. For UI related issues, add a screenshot or video. Can you provide a minimal test case for the issue?

You can’t fix them all, so it’s time to triage. Bug debt contributes to the risk of shipping, so you need to manage the set of bugs like many other aspects of your development process. Don’t be frightened by a large quantity of bugs in your backlog. It just means people are testing your software, which is a good thing.

Bug triage is the process of going through the list to find bugs that need assistance, escalation, or follow-up. This is usually done in a group, but sometimes individually to clean incoming bugs. Through this process the nastiest, riskiest bugs are identified.

  • Prioritize – Don’t guess. Use a decision tree, or some other system, to determine a real priority (see the sketch after this list).
  • Estimate – Don’t guess. If it’s too hard to figure out, you should break the work up into smaller tasks. Link those sub-tasks back to the original bug.
  • Adjust – Bug metadata is not set in stone. Situations change over time, so can the bug priority.
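
As an illustration of taking the guesswork out of prioritization, here’s a toy decision tree in Python; the criteria and priority levels are invented for this sketch, and real triage rules will differ per project:

    def priority(is_crash, users_affected, has_workaround):
        # Toy decision tree for bug priority (criteria are made up).
        if is_crash and users_affected > 0.01:        # >1% of users crashing
            return "P1"
        if users_affected > 0.05 and not has_workaround:
            return "P1"
        if is_crash or users_affected > 0.01:
            return "P2"
        return "P3"

    # A rare crash still gets attention, but not the top priority.
    print(priority(is_crash=True, users_affected=0.002, has_workaround=False))  # P2

The exact rules matter less than the fact that everyone applies the same ones; the output is a priority you can defend, not a guess.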

Bugs have their own social networks. New code always spawns bugs, so link those regressions back to the source feature or fix. Link duplicates back to the original issue; sometimes those are not 100% duplicates, and it’s good to verify all the duplicates are really fixed. Link code landings back to bugs. Code archeology is a real thing, so make it easier by creating a map of bugs to code. You should be able to start with a line of code and easily find out why/when it was added. You should also be able to start with a bug and find the code used to fix the issue.

The bug metadata should be factual, but separate from the decision to ship. Don’t lower a bug priority just to make the decision to ship easier for someone else. Let those people own the decision to ship with a known level of quality.

Triage helps keep bug status up to date, which is how real-time roadmaps are created. In a time-oriented release schedule, release roadmaps can change because some features aren’t ready to ship. When enough code has landed and the blocking regressions have been fixed, a feature is ready to ship. Triaging bugs and managing feature status frequently allows you to be proactive about changes in roadmaps, not reactive.

Leaving Mozilla

I joined Mozilla in 2006 wanting to learn how to build & ship software at a large scale, to push myself to the next level, and to have an impact on millions of people. Mozilla also gave me an opportunity to build teams, lead people, and focus on products. It’s been a great experience and I have definitely accomplished my original goals, but after nearly 10 years, I have decided to move on.

One of the most unexpected joys from my time at Mozilla has been working with contributors and the Mozilla Community. The mentorship and communication with contributors creates a positive environment that benefits everyone on the team. Watching someone get excited and engaged by the process of landing code in Firefox is an awesome feeling.

People of Mozilla, past and present: Thank you for your patience, your trust and your guidance. Ten years creates a lot of memories.

Special shout-out to the Mozilla Mobile team. I’m very proud of the work we (mostly you) accomplished and continue to deliver. You’re a great group of people. Thanks for helping me become a better leader.

[Photo: It’s a Small World – Orlando All Hands, Dec 2015]

Pitching Ideas – It’s Not About Perfect

I realized a long time ago that I was not the type of person who could create, build & polish ideas all by myself. I need collaboration with others to hone and build ideas. More often than not, I’m not the one who starts the idea. I pick up something from someone else – bend it, twist it, and turn it into something different.

Like many others, I have a problem with ‘fear of rejection’, which kept me from shepherding my ideas from inception to shipped. If I couldn’t finish an idea myself or share it within my trusted circle, the idea would likely die. I had the most success when sharing ideas with others. I have been working to increase the size of the trusted circle, but it still has limits.

Some time last year, Mozilla was doing some annual planning for 2016 and Mark Mayo suggested creating informal pitch documents for new ideas, and we’d put those into the planning process. I created a simple template and started turning ideas into pitches, sending the documents out to a large (it felt large to me) list of recipients. To people who were definitely outside my circle.

The world didn’t end. In fact, it’s been a very positive experience, thanks in large part to the quality of the people I work with. I no longer worry that an idea isn’t ready for others to see. I get to collaborate at a larger scale.

Writing the ideas into pitches also forces me to craft a clear message and define objectives & outcomes. I have 1x1s with a variety of folks during the week, and we end up talking about the idea, allowing me to further build and hone the document before sending it out to a larger group.

I’m hooked! These days, I send out pitches quite often. Maybe too often?

Fun with Telemetry: Improving Our User Analytics Story

My last post talks about the initial work to create a real user analytics system based on the UI Telemetry event data collected in Firefox on Mobile. I’m happy to report that we’ve had much forward progress since then. Most importantly, we are no longer using the DIY setup on one of my Mac Minis. Working with the Mozilla Telemetry & Data team, we have a system that extracts data from UI Telemetry via Spark, imports the data into Presto-based storage, and allows SQL queries and visualization via Re:dash.

With data accessible via Re:dash, we can use SQL to focus on improving our analyses:

  • Track active users, daily & monthly (a sample query follows this list)
  • Explore retention & churn
  • Look into which features lead to retention
  • Calculate user session length & event counts per session
  • Use funnel analysis to evaluate A/B experiments
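
As a small example of the kind of SQL behind the first item, here’s a sketch of a daily active users query wrapped in a Python helper; the events table and its client_id/event_date columns are assumed names for illustration, not the actual Telemetry schema:

    # Sketch of a "daily active users" query. Table and column names
    # (events, client_id, event_date) are placeholders.
    DAU_SQL = """
    SELECT event_date,
           COUNT(DISTINCT client_id) AS dau
    FROM events
    GROUP BY event_date
    ORDER BY event_date
    """

    def daily_active_users(conn):
        # Works with any DB-API connection (Presto, Postgres, ...).
        cur = conn.cursor()
        cur.execute(DAU_SQL)
        return cur.fetchall()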

[Charts: load URL event types, load URL retention effect, drop-off rate]

Roberto posted about how we’re using Parquet, Presto and Re:dash to create an SQL-based query and visualization system.

Fun with Telemetry: DIY User Analytics Lab in SQL

Firefox on Mobile has a system to collect telemetry data from user interactions. We created a simple event and session UI telemetry system, built on top of the core telemetry system. The core telemetry system has been mainly focused on performance and stability. The UI telemetry system is really focused on how people are interacting with the application itself.

Event-based data streams are commonly used to do user data analytics. We’re pretty fortunate to have streams of events coming from all of our distribution channels. I wanted to start doing different types of analyses on our data, but first I needed to build a simple system to get the data into a suitable format for hacking.

One of the best one-stop sources for a variety of user analytics is the Periscope Data blog. There are posts on active users, retention and churn, and lots of other cool stuff. The blog provides tons of SQL examples. If I could get the Firefox data into SQL, I’d be in a nice place.

Collecting Data

My first step was performing a little ETL (well, the E & T parts) on the raw data using the Spark/Python framework for Mozilla Telemetry. I wanted to create two datasets (a rough sketch follows the list):

  • clients: Dataset of the unique clients (users) tracked in the system. Besides containing the unique clientId, I wanted to store some metadata, like the profile creation date. (script)
  • events: Dataset of the event stream, associated to each client. The event data also has information about active A/B experiments. (script)
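
The linked scripts are the real source; as a rough sketch of the transform step, here’s roughly what building the two datasets can look like in PySpark, assuming each raw ping has already been parsed into a dict (the clientId, profileDate and events field names are placeholders, not the actual ping schema):

    from pyspark import SparkContext

    sc = SparkContext(appName="ui-telemetry-etl")
    pings = sc.parallelize([])  # in practice: an RDD of parsed telemetry pings

    # clients: one row per unique clientId, keeping some profile metadata.
    clients = (pings
               .map(lambda p: (p["clientId"], p.get("profileDate")))
               .reduceByKey(lambda a, b: a or b))

    # events: the flattened event stream, keyed back to its client.
    events = pings.flatMap(
        lambda p: [(p["clientId"], e) for e in p.get("events", [])])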

Building a Database

I installed Postgres on a Mac Mini (powerful stuff, I know) and created my database tables. Because I was collecting the data periodically via my Spark scripts, I couldn’t guarantee I wouldn’t re-collect data from previous jobs, so I couldn’t just bulk insert the data. I wrote some simple Python scripts to quickly import the data (clients & events), making sure not to create any duplicates.
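
A minimal sketch of such a dedup-safe import, assuming psycopg2, a Postgres version with ON CONFLICT support (9.5+), and an invented clients schema keyed on client_id:

    import psycopg2

    # Re-running the import is a no-op for rows that already exist:
    # the primary key on client_id absorbs duplicate inserts.
    conn = psycopg2.connect(dbname="telemetry")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS clients (
                client_id    TEXT PRIMARY KEY,
                profile_date DATE
            )""")
        rows = [("abc-123", "2015-11-01")]  # normally read from the Spark output
        cur.executemany(
            "INSERT INTO clients (client_id, profile_date) VALUES (%s, %s) "
            "ON CONFLICT (client_id) DO NOTHING",
            rows)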

[Diagram: Fennec telemetry data tables]

I decided to start with 30 days of data from our Nightly and Beta channels. Nightly was relatively small (~330K rows of events), but Beta was more significant (~18M rows of events).

Analyzing and Visualizing

Now that I had my data, I could start exploring. There are a lot of analysis/visualization/sharing tools out there. Many are commercial and have lots of features. I stumbled across a few open-source tools:

  • Airpal: A web-based query execution tool from Airbnb. Makes it easy to save and share SQL analysis queries. Works with Facebook’s PrestoDB, but doesn’t seem to create any plots.
  • Re:dash: A web-based query, visualization and collaboration tool. It has tons of visualization support. You can set it up on your own server, but it was a little more than I wanted to take on over a weekend.
  • SQLPad: A web-based query and visualization tool. Simple and easy to set up, so I tried using it.

Even though I wanted to use SQLPad as much as possible, I found myself spending most of my time in pgAdmin: debugging queries, using EXPLAIN to make queries faster, and setting up indexes were all easier there. Once I got the basic things figured out, I was able to use SQLPad more efficiently. Below are some screenshots using the Nightly data:

[Screenshots: SQLPad query and chart views]

Next Steps

Now that I have Firefox event data in SQL, I can start looking at retention, churn, active users, engagement and funnel analysis. Eventually, we want this process to be automated, data stored in Redshift (like a lot of other Mozilla data) and exposed via easy query/visualization/collaboration tools. We’re working with the Mozilla Telemetry & Data Pipeline teams to make that happen.

A big thanks to Roberto Vitillo and Mark Reid for the help in creating the Spark scripts, and Richard Newman for double-dog daring me to try this.

Firefox on Mobile: A/B Testing and Staged Rollouts

We have decided to start running A/B testing in Firefox for Android. These experiments are intended to optimize specific outcomes, as well as inform our long-term design decisions. We want to create the best Firefox experience we can, and these experiments will help.

The system will also allow us to throttle the release of features, called staged rollout or feature toggles, so we can monitor new features in a controlled manner across a large user base and a fragmented device ecosystem. If we need to roll back a feature for some reason, we’d have the ability to do that quickly, without needing people to update their software.

Technical details:

  • Mozilla Switchboard is used to control experiment segmenting and staged rollout.
  • UI Telemetry is used to collect metrics about an experiment.
  • Unified Telemetry is used to track active experiments so we can correlate to application usage.

What is Mozilla Switchboard?

Mozilla Switchboard is based on Switchboard, an open source SDK for doing A/B testing and staged rollouts from the folks at KeepSafe. It connects to a server component, which maintains a list of active experiments.

The SDK creates a UUID, which is stored on the device. The UUID is sent to the server, which uses it to “bucket” the client, but the UUID is never stored on the server. In fact, the server does not store any data. The server we are using was ported from PHP to Node and is being hosted by Mozilla.
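
I won’t claim this is Switchboard’s actual code, but the general bucketing technique looks something like this sketch: hash the device UUID into a stable bucket and compare it against the experiment’s rollout percentage, so nothing about the client ever needs to be stored server-side:

    import hashlib
    import uuid

    def bucket(device_uuid, num_buckets=100):
        # Map a UUID to a stable bucket 0..99 (illustrative only).
        digest = hashlib.sha256(device_uuid.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_buckets

    def in_experiment(device_uuid, rollout_percent):
        # The same UUID always lands in the same bucket, so rollouts are sticky.
        return bucket(device_uuid) < rollout_percent

    device = str(uuid.uuid4())        # generated once and stored on the device
    print(in_experiment(device, 25))  # roughly 25% of devices are enrolled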

We decided to start using Switchboard because it’s simple, open source, has client code for Android and iOS, saves no data on the server and can be hosted by Mozilla.

Planning Experiments

The Mobile Product and UX teams are the primary drivers for creating experiments, but as is common on the Mobile team, ideas can come from anywhere. We have been working with the Mozilla Growth team, getting a better understanding of how to design the experiments and analyze the metrics. UX researchers also have input into the experiments.

Once Product and UX complete the experiment design, Development lands code in Firefox to implement the desired variations of the experiment. Development also lands code in the Switchboard server to control the configuration of the experiment: On what channels is it active? How are the variations distributed across the user population?

Since we use Telemetry to collect metrics on the experiments, the Beta channel is likely our best channel for running experiments. Telemetry is on by default on Nightly, Aurora and Beta, and Beta has the largest user base of those three channels.

Once we decide which variation of the experiment is the “winner”, we’ll change the Switchboard server configuration for the experiment so that 100% of the user base will flow through the winning variation.

Yes, a small percentage of the Release channel has Telemetry enabled, but it might be too small to be useful for experimentation. Time will tell.

What’s Happening Now?

We are trying to be very transparent about active experiments and staged rollouts. We have a few active experiments right now.

  • Onboarding A/B experiment with several variants.
  • Easy entry points for accessing History and Bookmarks on the main menu.
  • Experimenting with the awesomescreen behavior when displaying the search results page.

You can always look at the Mozilla Switchboard configuration to see what’s happening. Over time, we’ll be adding support to Firefox for iOS as well.