Tracking Work is Fundamental

“Developers should only need GitHub Issues and Pull Requests to do their job” — Why should anyone need more than that to track work?

Small companies and startups have small engineering teams. Understanding the ongoing and planned work takes little effort, because a small team simply can’t take on too much and still succeed. Failure weeds out the companies that take on too much, too soon.

Companies succeed and grow, and so do the engineering teams. At some point, multiple engineering teams are created or evolve. Ideally, these teams are self-sufficient and isolated from each other, creating modular and decoupled output. This ideal state rarely lasts and soon cross-team projects start to appear. Teams continue to evolve into product and platform functions, creating more opportunities for cross-team dependencies.

At this point, work can no longer be tracked at the developer level alone. Success requires collaboration and coordination. Companies without a cohesive work tracking system that spans individual teams start to slow down. Requirements and dependencies become difficult to track and often fail to meet expectations, which leads to rework and churn. Deliverables miss their guesstimated timelines and drag on.

Making work visible is a core attribute of many different methodologies and processes, even the ad-hoc ones. If you don’t have a bird’s eye view of the engineering work happening at your company, what can you say about your situation? Very little. Try to ascertain the status of a given cross-team project without asking someone. If it takes you longer than 5 minutes, you’re in trouble, and the people you would have asked don’t really know either. All of the work required to figure out a project’s status is wasting people’s time.

Work tracking isn’t hard to introduce, and it provides value quickly. It doesn’t require any extra work from developers, yet it also provides value to team leads, project managers, and senior leadership.

“Not Jira!” The cry goes out across engineering. It doesn’t need to be Jira, but don’t hate a tool for being successful at what it does. Just because most companies don’t put enough effort into running Jira well doesn’t make work tracking tools bad in general. Pick something else — except fucking spreadsheets!

“You either die a hero lightweight tool, or you live long enough to become the villain bloated enterprise-ready system.”


See also: Merits of Bug Tracking

Thoughts on Dependency Injection

I hear some strong opinions on dependency injection (DI). I’ve never really thought too much about DI specifically, but it is part of an Inversion of Control strategy, which I think about a lot.

My focus is on developer experience, low-friction maintenance, and code health outcomes. What’s important to me:

  • Loosely coupled code
  • Easy to test code
  • Simple code
  • Easy to maintain code

Many folks seem to focus on constructor- or method-based DI. The approach works great for shallow code hierarchies, and it does encourage loosely coupled, easily testable code. But trying to inject everything across deep call stacks can get painful the deeper you go. It creates friction for developers trying to update code, possibly inhibiting code health refactors.
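
As a minimal sketch of what constructor injection buys you (the interface and class names here are hypothetical, not from any particular framework):

    // A dependency abstracted behind an interface
    interface HttpClient {
        fun get(url: String): String
    }

    // Production implementation does real networking (stubbed here)
    class RealHttpClient : HttpClient {
        override fun get(url: String): String = "response from $url"
    }

    // Tests supply a canned fake instead
    class FakeHttpClient(private val response: String) : HttpClient {
        override fun get(url: String): String = response
    }

    // Constructor DI: ProfileLoader neither knows nor cares which client it gets
    class ProfileLoader(private val http: HttpClient) {
        fun load(userId: String): String = http.get("/profile/$userId")
    }

    // Production: ProfileLoader(RealHttpClient())
    // Tests:      ProfileLoader(FakeHttpClient("{}"))

No mocking framework is needed to swap in the fake, which is much of the appeal.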

Singletons are usually considered pure evil — hiding code details, creating global state, and making it difficult to test code. That said, they work nicely for accessing basic services and configuration from anywhere.

Service locators sit somewhere in between pure DI and singletons. It’s pretty easy to swap concrete and mock services, but you are adding a single dependency wherever the locator is used. TBH, I think of things like Dagger as annotation-based service locator tools.
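
A rough sketch of that middle ground (a hypothetical hand-rolled registry, not how Dagger actually works): call sites pull from one registry instead of threading dependencies through every constructor, and tests swap in mocks by re-registering.

    // A hypothetical service locator: one registry for shared services
    object Services {
        private val registry = mutableMapOf<Class<*>, Any>()

        fun <T : Any> register(type: Class<T>, instance: T) {
            registry[type] = instance
        }

        @Suppress("UNCHECKED_CAST")
        fun <T : Any> get(type: Class<T>): T =
            registry[type] as? T ?: error("No service registered for $type")
    }

    interface Logger {
        fun log(msg: String)
    }

    // Deep in a call stack, the only added dependency is the locator itself
    fun syncData() {
        Services.get(Logger::class.java).log("sync started")
    }

    // App startup registers the real logger; a test registers a mock:
    // Services.register(Logger::class.java, realLogger)

The trade-off is visible in the sketch: syncData hides its Logger dependency from its signature, which is exactly what makes locators both convenient and controversial.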

The tl;dr is that code gets complicated. Instead of being too idealistic about the implementation details, know when to be pragmatic and focus on the higher-level objectives. Consider the pros & cons of different approaches. Legacy code is not an ideal situation, but it is one you need to handle; it’s a rare treat when you get to work in brand new code. That means making pragmatic choices:

  • Make the best compromise to increase the testability of your code.
  • Avoid thousand-line code changes to add logging or a network check in one place.
  • Keep the code simple and clean, focusing on a code-maintenance point of view.

Don’t blindly follow idealistic dogma. Make choices that deliver the best impact.

Decisions, Decisions, Decisions

A big part of leading people and teams is making decisions. You can’t move forward without making decisions. As the lead, or manager, it’s part of your job. Don’t pass it down to someone on your team. They’re looking to you.

  • Get the facts: Try to be well informed about the situation, the inputs affecting the decision, and the outcomes from the decision. If you’re not well informed, start asking people who are. They might have the information, or parts of it, but aren’t in the position to make the decision.
  • Trust your gut: It’s unlikely you’ll ever have 100% of the information you need. That’s fine; the 40/70 rule (make the call once you have between 40% and 70% of the information) says waiting for 100% is waiting too long.
  • Know the risks: Not every decision will be the right decision. Mistakes happen, but the trick is minimizing the cost of a potential mistake by knowing and lowering the risk. People point to the 10/10/10 approach (how will you feel about the outcome in 10 minutes, 10 months, and 10 years?) as a way to do this.
  • Be consistent: Try to have a process, convention, guideline, or philosophy to fall back on to help shape your decision. Exceptions can happen, but decisions are easier to make if you have a well-known starting place to build from.
  • Don’t delay: The longer it takes to make a decision, the more problematic a situation can become. In the long run, every decision is short term. Delays keep the team from moving forward. Brian Valentine, lead developer on Windows 2000, had a famous quote: “Decisions in 10 minutes or less, or the next one is free.”

These guidelines can be useful when you’re in a period of heavy decision making. Decision fatigue is a real condition. It can reduce your ability to make trade-offs and understand risk, or push you to avoid making decisions altogether. If you’re not prepared, your decisions will suffer and so will your team.

Random Thoughts on Team Structure


I’ve written previously about my thoughts on team structure. I’m a fan of product-centric teams — multidisciplinary teams that embed members from functional groups on the same team, all working together to create and ship a software product.

Team Evolution

At some point, a team might grow large enough that you want to split into smaller groups, each with a primary focus. You’re still building a single product, but now you have a collection of product-centric teams working on specific features. How did you get here?

  • Teams get harder to manage and coordinate as they grow in size.
  • Product-drivers feel like it’s a struggle to get development focus on their features.

I’ve been able to work in both situations: a single product-centric team, and multiple feature-based teams. My preference is still the single product-centric team. The downsides of feature-based teams outweigh the advantages:

  • Teams become silos and stop focusing on the product as a whole.
  • Issues without a clear owner become someone else’s problem.
  • Cross-team communication becomes more difficult as more groups are created.
  • Individual team ambitions inadvertently dilute the primary focus of the product.

Conway’s Law tells us that organizations tend to build products based on the organization’s structure. Using several small teams, with a focus on specific features, will have an effect on the final product. It might not be a desired effect.

Mindful Divisions

I’m not suggesting teams grow beyond 7 to 10 people. There is plenty of literature, and experience, telling us that large teams are bad and less efficient. But how you divide teams is important. Some divisions are more natural than others:

  • By platform (Desktop, Android, iOS, Web): Make sure there is some product consistency across platforms.
  • By front-end / back-end: Make sure both sides are part of defining the interaction APIs.
  • By application / UI widgets: Make sure both sides are part of defining the component APIs.

These separations are cleaner and easier to identify than feature-based splits.

Feature Survival

Single product-centric teams bring us back to the issue of product-drivers fighting for development focus. I think this is a good thing.

When features are implemented through a single team, you need to be good at prioritizing. It shouldn’t be easy to add every little feature to the product. By making all features compete for priority, you make sure the best features get the attention.

I believe this makes the product stronger.

Performance: Design & Expectations


Providing a great user experience under a variety of performance situations means designing for a variety of expectations. Defining a great user experience should include performance budgets for the various pieces of the experience. Some basic examples include: Application startup, Image loading, and UI responsiveness.

Application Startup

First impressions are important, and startup time (page-load for web apps) is that first impression. There is a lot of information floating around that should convince you of the importance of a fast launch time. Still, we seem to cram more and more cruft into that part of the application. We initialize analytic libraries, load saved preferences, try to re-send failed or queued events and data, and maybe even send crash reports from previous sessions.

Create a startup time performance budget. How long should a person wait before being able to view and interact with real content? Once you set that limit, start moving less critical work out of the critical path. Queued events and data can remain queued a little while longer. Buffer new events before initializing heavy analytics libraries. Consider showing cached content, while downloading new content in the background.
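
A sketch of what moving work off the critical path can look like (coroutines are just one way to do it, and every function name below is illustrative):

    import kotlinx.coroutines.CoroutineScope
    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.launch

    class AppStartup(
        private val background: CoroutineScope = CoroutineScope(Dispatchers.Default)
    ) {
        fun onCreate() {
            showCachedContent()            // critical path: pixels on screen fast

            background.launch {            // everything else can wait
                initAnalytics()            // buffer new events until this finishes
                flushQueuedEvents()        // queued data stays queued a little longer
                sendPendingCrashReports()  // crash reports from previous sessions
            }
        }

        private fun showCachedContent() { /* render cache, refresh in background */ }
        private fun initAnalytics() { /* heavy library init */ }
        private fun flushQueuedEvents() { /* re-send failed or queued events */ }
        private fun sendPendingCrashReports() { }
    }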

Image Loading

Images are a big part of many applications. Avatars, photos and GIFs are everywhere. We want to display the images as fast as possible. Usually development happens in ideal situations: best devices and best network speeds – the fast-path. If you’re measuring the real world performance characteristics of your application, you probably know that most of the people using the application don’t have the best devices or fast network speeds – the slow-path.

Sometimes we fail to design for the slow-path. We assume it’s infrequent, or worse, we believe the fast-path behavior is correct for the product in any situation. People can just deal with the crummy experience. Remember your performance budget: how long should someone wait for an image to load?

Some common approaches to handling the slow-path:

  • Use server-side caching. This one is pretty obvious, but I have to mention it. Using a Content Delivery Network (CDN) means it takes less time to deliver images to the application because the images are “closer” to the application.
  • Use a more efficient file format. GIF is not known for being a lightweight format. Look into WebP and MP4 as low-bandwidth animated image alternatives that provide great quality.
  • Get better at picking JPEG quality levels. Etsy has a nice write-up on using SSIM (human vision estimation) to pick the lowest level without hurting perceived quality. Google has something called Butteraugli that does something similar.
  • Dynamically size images to fit the target rendering size. Don’t download large images only to reduce the size on the client. For less than excellent networking speeds, request images that are smaller than the target size and upscale them. You can save a lot of bandwidth and render the image quickly, keeping the application usable (see the sketch after this list).
  • Aggressively cache images on the device. Never download the same image more than once. Cached images load quickly and reduce bandwidth usage. Yes, this might mean using 1GB or more for a cache, but if the space is available, it’s always worth it. Modern OSes will try to clear storage-based caches when running low on free space.
  • Consider Tap-to-Play interfaces to delay downloading large animated images until requested. Use a much smaller static image as a placeholder.
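
For the dynamic-sizing idea above, a sketch of the request-side logic (this assumes, hypothetically, an image server that accepts a width parameter):

    // Pick a request size based on connection quality
    enum class Connection { FAST, SLOW }

    fun imageUrlFor(baseUrl: String, targetWidthPx: Int, connection: Connection): String {
        val requestWidth = when (connection) {
            Connection.FAST -> targetWidthPx       // full size for the fast-path
            Connection.SLOW -> targetWidthPx / 2   // smaller image, upscaled on device
        }
        return "$baseUrl?w=$requestWidth"
    }

    // A 600px target on a slow connection only requests a 300px image:
    // imageUrlFor("https://example.com/photo.jpg", 600, Connection.SLOW)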

Some of these slow-path ideas might be so effective at saving bandwidth or improving image loading speed, that you make them options for the fast-path as well.

UI Responsiveness

Touch-based devices make unresponsive UIs very noticeable. Applications should maintain a responsive UI no matter what other activity is taking place. Use background threads to do the heavy lifting. Keep the UI thread free of any file I/O, networking and any other work that can be pushed to the background.
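
A sketch of the pattern with coroutines (the file name and render function are placeholders):

    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.withContext
    import java.io.File

    // Heavy lifting happens on the IO dispatcher; only the result
    // comes back to the caller's (UI) context
    suspend fun loadAndShowPreferences() {
        val prefs = withContext(Dispatchers.IO) {
            File("prefs.json").readText()   // file I/O stays off the UI thread
        }
        render(prefs)                       // back on the UI thread
    }

    fun render(text: String) { /* update the UI */ }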

Remember to design for the slow-path when creating UI actions associated with network APIs. Don’t wait for the network response before changing the state of the UI. If the action fails, you can always say so and flip the state back. Delaying the state change makes the UI appear broken.
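
A sketch of that optimistic-update flow (the API and UI functions are hypothetical):

    interface LikeApi {
        suspend fun like(postId: String)   // throws on network failure
    }

    class LikeController(private val api: LikeApi) {
        suspend fun onLikeTapped(postId: String) {
            setLiked(postId, liked = true)        // flip the UI immediately
            val ok = runCatching { api.like(postId) }.isSuccess
            if (!ok) {
                setLiked(postId, liked = false)   // the action failed: say so, flip back
                showError("Couldn't like the post. Try again.")
            }
        }

        private fun setLiked(postId: String, liked: Boolean) { /* update UI state */ }
        private fun showError(message: String) { /* toast / snackbar */ }
    }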

Smooth scrolling is another part of a responsive UI. iOS, Android and Web all have best practices for keeping high frame-rates while scrolling. There are also tools for profiling your rendering code.

Watch for situations where a design requirement (animation, layout, whatever) is causing the UI to become unresponsive. Find a way to fast-path/slow-path the requirement. If that’s not possible, get the requirement changed.

Design for Everyone

Never fall into the mindset of designing only for the latest hardware and fastest network speeds. You need to factor the slow-path into your designs too. Yes, it’s more work, and it’s probably not your ideal experience, but it’s far better than trying to force a fast-path design into a slow-path situation. That experience is usually horrible. You can do better!

Performance: The Merits of Measuring

“If you cannot measure it, you cannot improve it.” – Lord Kelvin

There is so much information out there on ways to improve the performance of your mobile application or website. You probably feel you can just dive in and start making changes. But if you’re not measuring your application’s performance, you don’t know if anything is really helping or hurting. How do you know what effect any changes will have on the performance? Most applications are complex enough that we can’t assume our simplistic reasoning accurately reflects the code behavior. You need to measure.

You need measurements from before and after any changes are made. Your application has development phases, and so should your measurement plan. Measure in CI to find improvements and regressions as soon as they happen. Measure in the real world to find how variations like network conditions, device fragmentation, and unpredictable user behavior manifest in performance.

Measuring in CI

The point of measuring performance in CI is to control variability and watch for relative differences on each change. Try to reduce the noisy variables like network calls and background services to create a fast and consistent surface on which you can monitor performance changes in a reliable and repeatable manner. The purpose is not to determine the performance your users will encounter. There are too many variables in the real world and you can’t control them well enough for apples-to-apples comparisons.

Use real devices when measuring performance, not emulators or simulators running on host hardware.

You can try to create reliable & repeatable simulations of some real world situations. Network connection speed is one example. You can use a network simulator, like Facebook’s Augmented Traffic Control system to simulate WiFi and mobile network conditions. This is especially useful if your application is designed to react differently under different network conditions. You can also use different types of content in the tests, trying to mimic some high level differences your users might encounter.

If you’re measuring data in CI, you should be storing and displaying it as well. Try to get the CI to alert on regressions, failing changes before they make it into the product.

Some common things to measure in CI:

  • Launch time to show UI
  • Launch time to interactive UI
  • Scroll performance (janky frames)
  • Time to load content
  • Memory usage (startup, after content loads, after scrolling content)

Remember to use multiple (physical) devices and a variety of content types.
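
However the numbers are collected, the comparison step can stay simple. A sketch of the budget/regression check a CI job might run (the metric names and the 5% tolerance are made up):

    data class Metric(val name: String, val millis: Double)

    // Fail the build when a metric regresses beyond the tolerance
    fun checkRegression(current: Metric, baseline: Metric, tolerance: Double = 0.05) {
        val change = (current.millis - baseline.millis) / baseline.millis
        check(change <= tolerance) {
            "${current.name} regressed ${"%.1f".format(change * 100)}%: " +
                "${baseline.millis}ms -> ${current.millis}ms"
        }
    }

    // Passes (about 2% over baseline):
    // checkRegression(Metric("launch-to-ui", 1430.0), Metric("launch-to-ui", 1400.0))
    // Fails the build (10% regression):
    // checkRegression(Metric("launch-to-ui", 1540.0), Metric("launch-to-ui", 1400.0))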

Measuring in Real World

While CI measurements come from only a handful of tests & situations, the real world has many, many more situations. Depending on the number of active users, you could have millions of data points with thousands of unique situations. Collecting data from real users, at a large scale, allows you to investigate how things like global regions, network conditions — and even user types — can affect the performance of the application.

There are many third-party systems you can integrate into your code to easily and efficiently collect real world data. It’s not uncommon for companies to grow their own systems as well. In any case, validate the data itself. Real world data is messy, so vet both the collection systems and the data. Look for problems like payload corruption, clock skew, range errors, or other oddities.

Create automated queries and reports, sent out broadly for people to review. Remember to go deeper than high-level summaries. Some of the interesting discoveries happen when you split out data across different dimensions.

Some common things to measure from the real world:

  • Network usage, including start time, end time, content type and size of the response. Get detailed connection timing, if possible, for DNS and SSL handshake information.
    • For API endpoints, this is useful for tracking latency and payload size.
    • For media loading, this gives a ballpark metric for how long people are staring at an empty box, waiting for an image to load.
  • Event, session and error state data. This can be used to track critical content impressions, but also can be used to learn how people use the application.

Remember to include some common metadata in each measurement so you can split out the data across different dimensions. Things like a non-PII identifier, generic geo-location/region, device specs, and connection type/speed help you drill down into the data, looking for trends.
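
As a sketch, each measurement payload might carry its dimensions like this (the field names are illustrative):

    // Every measurement carries the same metadata dimensions,
    // so reports can split the data along any of them later
    data class TelemetryEvent(
        val name: String,          // e.g. "image_load"
        val durationMs: Long,
        val clientId: String,      // random, non-PII identifier
        val region: String,        // coarse geo, e.g. "EU"
        val deviceClass: String,   // e.g. "low-end", "high-end"
        val connection: String     // e.g. "wifi", "3g"
    )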

It’s also polite to allow people to opt-out of this type of data collection.

Guiding Teams to Outcomes


You work in an organization that sets some high-level goals. Your team might be accountable for some of those goals. However, to hit the goals, you’ll need cooperation from groups outside of your team.

What do you do? How do you get everyone on the path to finishing the shared outcomes?

Situations like this happen a lot. Some ideas:

  • Make sure the path is clearly marked.
  • Make it easy for people to stay on the path.
  • Make it hard for people to go off the path.
  • Be the voice of encouragement.
  • Be the voice of recognition.
  • Assume people want to be on the path, but they might also be busy with other problems.

Managing “friction” can be a useful technique in getting everyone working toward the goals. Try to reduce friction on anything that positively affects getting to the outcomes, but add friction to those things that are negative.

  • Centralize documentation for checklist processes. Better yet, automate as many of the steps as possible. Even better might be to add the manual steps to your automated steps so you only have one true list.
  • Do more checks in your continuous integration (CI) system, especially adding automated tests (unit, integration and performance). Stop regressions ASAP.
  • Make sure the output of your process is being measured and is clearly visible to everyone. Put up monitors with charts and graphs in your open office spaces. Showing progress and trends helps to reinforce the importance of everyone’s role in hitting goals.
  • Add anomaly detection to the measurement data. Don’t count on people to find the problems in real-time.
  • Don’t be surprised if you need to keep repeating the plan.

Communicating via IRC/Slack


I’ve used messaging tools like IRC and Slack at work every day for the last 11+ years. For much of that time, I was working remotely. Even my co-workers, working in offices, used messaging tools as much as I did.

When I first started working for Mozilla and visited one of the main offices, it was weird to see this happening. People sitting a few desks away from each other were chatting via messaging instead of just speaking. There are many reasons why this happens, but a primary outcome was that it allowed remote workers to be included in almost all discussions.

Spending so much time in messaging tools has downsides. Communicating efficiently via text-based messaging requires learning how to be a better communicator. Text-based messaging loses almost all the tone and nuance of a spoken conversation. A lot of context and expression happens via non-verbal communication. That is missing from text-based messaging and it can lead to communication problems.

Assume people are generally good: It’s easy to read something and assume the worst. This generally doesn’t end well.

  • You’re disrupting a person’s workflow. You don’t know what they are doing or their mood. Be gentle, not demanding.
  • When things start to get complicated, confusing or heated, switch to face-to-face or video chat.
  • Make your intent clear. When people are left to infer, things can go poorly. For example:
      me: Is your project almost finished?
    them: (is he asking for status or thinking I'm too slow?)

Be careful delivering critical feedback: Giving someone feedback can be tricky in face-to-face conversations. Over text messaging, it can be dangerous. It’s easy for feedback to be seen as a personal attack.

  • Use questions, not statements. This encourages more discussion and doesn’t put people on the defensive as much.
  • Use examples of your own failures. Show you know people make mistakes.
  • Focus on the outcome, not the current implementation. Are we getting to the outcome we want?
  • Use ‘we’ instead of ‘you’ when possible. We are in this together.

Be ready to moderate: Sometimes people aren’t on the same page. If you see a conversation getting heated, try to defuse it. It’s important to keep communication channels a ‘safe space’ for everybody.

  • Use private channels to let people know a conversation is off-track. People can be unaware of how bad things have become.
  • Be public when needed. Other people in the channel will benefit from knowing the limits of acceptable behavior.
  • Moderation is a form of critical feedback. See above.

Not everyone will get the joke: Using humor over text-based messaging can backfire. Trying to be funny can lead to situations where no one is laughing, or even worse, people could be offended. Use caution.

Always Be Shipping – Expect the Unexpected

Normal releases are consistent and predictable. Scheduled releases benefit developers, testers, support, and PR. Unpredictable releases can cause communication problems, stress, and fatigue. Those can lead to poor software quality and developer turnover.

Sometimes we need to deal with unexpected issues that can’t wait for a normal release. Some examples include:

  • High volume crashes
  • Broken functionality
  • Security issues
  • Special date-based features

Anyone should be able to suggest an off-cycle release, so make sure there’s a straightforward, simple process for doing it. First, identify whether a special release is really necessary; maybe the issue can wait for the next normal release. Consider using an approval process to decide if the release is warranted. An approval process creates a small hurdle that forces some justification. An off-cycle release is not cheap and has the potential to derail the normal release process. Don’t put the normal release cycle at risk.

Some things to keep in mind:

  • Clearly identify the need. If you can’t, you probably don’t need the release.
  • Limit the scope of work to just what needs to be done for the issue. Be laser focused.
  • Make sure the work can be completed within the shortened cycle. Otherwise, just let the work happen in the normal release flow.
  • Choose an owner to drive the release and a set of stakeholders that need to track the release.
  • Triage frequently to make sure the short cycle stays on track. Over-communicate.
  • Test and verify the code changes. By limiting the scope, you should also be limiting the amount of required testing.

Be ready for the unexpected. Get really good at it. The best releases are boring releases.

Always Be Shipping

We all want to ship as fast as possible, while making sure we can control the quality of our product. Continuous deployment means we can ship at any time, right? Well, we still need to balance the unstable and stable parts of the codebase.

Web Deploys vs Application Deploys

The ability to control changes in your stable codebase is usually the limiting factor in how quickly and easily you can ship your product to people. For example, web products can ship frequently because it’s somewhat easy to control the state of the product people are using. When something is updated on the website, users get the update when loading the content or refreshing the page. With mobile applications, it can be harder to control the version of the product people are using. After pushing an update to the store, people need to update the application on their devices. This takes time and it’s disruptive. It’s typical for several versions of a mobile application to be active at any given time.

It’s common for mobile application development to use time-based deployment windows, such as 2 or 4 weeks. Every few weeks, the unstable codebase is promoted to the stable codebase and tasks (features and bug fixes) which are deemed stable are made ready to deploy. Getting ready to deploy could mean running a short Beta, to test the release candidate with a larger, more varied, test group.

It’s important to remember, these deployment windows are not development sprints! They are merely opportunities to deploy stable code. Some features or bug fixes could take many weeks to complete. Once complete, the code can be deployed at the next window.

Tracking the Tasks

Just because you use 2 week deployment windows doesn’t mean you can really ship a quality product every 2 weeks. The deployment window is an artificial framework we create to add some structure to the process. At the core, we need to be able to track the tasks. What is a task? Let’s start with something that’s easy to visualize: a feature.

What work goes into getting a feature shipped?

  • Planning: Define and scope the work.
  • Design: Design the UI and experience.
  • Coding: Do the implementation. Iterate with designers & product managers.
  • Reviewing: Examine & run the code, looking for problems. Code is ready to land after a successful review. Otherwise, it goes back to coding to fix issues.
  • Testing: Test that the feature is working correctly and nothing broke in the process. Defects might require sending the work back to development.
  • Push to Stable: Once implemented, tested and verified, the code can be moved to the stable codebase.

In the old days, this was a waterfall approach. These days, we can use iterative, overlapping processes. A flow might crudely look like this:

[feature-cycle diagram]

Each of these steps takes a non-zero amount of time. Some have to be repeated. The goal is to create a feature that has the desired behavior and at a known level of quality. Note that landing the code is not the final step. The work can only be called complete when it’s been verified as stable enough to ship.

Bug fixes are similar to features. The flow might look like this:

[bug-cycle diagram]

Imagine many of these flows happening at the same time. All ongoing work happens on the unstable codebase. As work is completed, tested, and verified at an acceptable level of quality, it can be moved to the stable codebase. Try very hard to keep work on the stable codebase to a minimum – usually disabling/enabling code or backing out unstable code.
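
A sketch of the disable/enable idea with a runtime flag (a hypothetical hard-coded registry; real systems usually read this from build-time or remote config):

    // Unstable work ships dark inside the stable codebase
    object Flags {
        private val enabled = setOf("new-profile-ui")   // from config in practice

        fun isEnabled(name: String): Boolean = name in enabled
    }

    fun showProfile() {
        if (Flags.isEnabled("new-profile-ui")) {
            // new, still-unstable path: stays off until verified
        } else {
            // current stable path
        }
    }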

Crash Landings

One practice I’ve seen happen on development teams is attempting to crash land code right before a deployment window. This is bad for a few reasons:

  • It forces many code reviews to happen simultaneously across the team, leading to delays since code review is an iterative cycle.
  • It forces large amounts of code to be merged during a short time period, likely leading to merge conflicts – leading to more delays.
  • It forces a lot of testing to happen at the same time, leading to backlogs and delays. Especially since testing, fixing and verifying is an iterative cycle.

The end result is anticlimactic for everyone: code landed at a deployment window is almost never shipped in that window. In fact, the delays caused by crash landing lead to a lot of code missing the deployment window.

[crash-landing diagram]

Smooth Landings

A different approach is to spread out the code landings. Allow code reviews and testing/fixing cycles to happen in a more balanced manner. More code is verified as stable and can ship in the deployment window. Code that is not stable is disabled via build-time or runtime flags, or in extreme cases, backed out of the stable codebase.

[smooth-landing diagram]

This balanced approach also reduces the stress that accompanies rushing code reviews and testing. The process becomes more predictable and even enjoyable. Teams thrive in healthy environments.

Once you get comfortable with deployment windows and sprints being very different things, you can even start getting more creative with deployments. Could you deploy weekly? I think it’s possible, but the limiting factor becomes your ability to create stable builds, then test, verify, and submit those builds to the store. Yes, you still need to test the release candidates and react to any unexpected outcomes from the testing. Testing the release candidates with a larger group (Beta testing) will usually turn up issues not found in other testing. At larger scales, many things thought to be only hypothetical become reality and might need to be addressed. Allowing for this type of beta testing improves quality, but may limit how short a deployment window can be.

Remember, it’s difficult to undo or remove an unexpected issue from a mobile application user population. Users are just stuck with the problem until they get around to updating to a fixed version.

I’ve seen some companies use short deployment window techniques for internal test releases, so it’s certainly possible. Automation has to play a key role, as does tracking and triaging the bugs. Risk assessment is a big part of shipping software. Know your risks, ship your software.