Work-as-Imagined vs Work-as-Done

With engineering focus on reducing incidents and improving operational reliability, I frequently come back to the realization that humans are fallible and we should be learning ways to nudge people toward success rather than failure.

There are whole industries and research machines built around the study of Human Factors and how to improve safety, reliability, and quality. One topic that struck me as extremely useful to software engineering is the concept of Work-as-Imagined (WAI) versus Work-as-Done (WAD). Anytime you’ve heard “the system failed because someone executed a process differently than it was documented,” you could be looking at a WAI vs WAD issue. This comes up a lot in healthcare, manufacturing, and transportation, where accidents can have horrible consequences.

Full disclosure: There are actually many varieties of human work, but WAI and WAD are good enough to make the point. Steven Shorrock covers the subject so well on his blog: Humanistic Systems

Work-as-Imagined

When thinking about a process or set of tasks that make up work, we need to imagine the steps and work others must do to accomplish the tasks. We do this for many good reasons, like scheduling, planning, and forecasting. WAI is usually formed by past experiences of actually doing work. While this is a good starting point, it’s likely the situation, assumptions, and variables are not the same.

To a greater or lesser extent, all of these imaginations – or mental models – will be wrong; our imagination of others’ work is a gross simplification, is incomplete, and is also fundamentally incorrect in various ways, depending partly on the differences in work and context between the imaginer and the imagined. — The Varieties of Human Work

Work-as-Done

Work-as-Done is literally the work people do. It happens in the real world, under a variety of different conditions and variables. It’s hard to document WAD because of the unique situation in which the work was done and the specific adjustments and tradeoffs required to complete the work for a given situation.

In any normal day, a work-as-done day, people:

Adapt and adjust to situations and change their actions accordingly
Deal with unintended consequences and unexpected situations
Interpret policies and procedures and apply them to match the conditions
Detect and correct when something is about to go wrong and intervene to prevent it from happening

Mind the Gap

Monitoring the gap between the WAI and the WAD of a given task has been highlighted as an important practice for organizations aiming to achieve high reliability. The gap between WAI and WAD can result in “human error” conditions. We frequently hear about incidents and accidents that were caused by “human error” in a variety of situations:

Air traffic near misses at airports
Train derailments and accidents
Critical computer systems taken offline
Mistakes made during medical procedures

It’s natural for us to blame the problem on the gap — people didn’t follow the process — and try to improve reliability and reduce errors by focusing on stricter adherence to WAI. Perhaps unsurprisingly, this results in more rules and processes which can certainly slow down overall productivity, and even increase the gap between WAI and WAD.

Safety management must correspond to Work-As-Done and not rely on Work-As-Imagined. — Can We Ever Imagine How Work is Done?

In recent decades, the focus has shifted toward WAD: examining the reasons why the gap exists, working to align WAI more closely with WAD, and embracing the reality of how the work is done and formalizing it. Instead of optimizing for the way we imagine work is done, we acknowledge the way work is actually done.

Closing the Gap

In my work, production incidents in software systems are an easy place to find WAI vs WAD happening. Incident management and postmortems have best practices that usually involve blameless reviews of what led to the incident. In many cases, the easiest answer to “how can we stop this incident from happening again?” is better documentation and more process.

Modern incident management focuses more on learning from incidents and less on root-cause analysis. One reason is that incidents rarely happen the exact same way in the future. Focusing on fixing a specific incident yields less value than learning about how the system worked to create the incident in the first place. Learning about how your system works in production is harder, but yields more impact in discovering weak parts of the system.

This section could be an entire book, or at least several posts, so I’ll leave it to you to read some of the links.

Desire Paths

The whole WAI vs WAD discussion reminds me of desire paths, which visually show the difference between the planned and actual outcomes.

Desire paths typically emerge as convenient shortcuts where more deliberately constructed paths take a longer or more circuitous route, have gaps, or are non-existent

Tying together desire paths with WAI & WAD, some universities and cities have reportedly waited to see which routes people would take regularly before deciding where to pave additional pathways across their campuses and walking paths.

April 9, 2024 (updated April 21, 2024)

Information Flows in Organizations

I’ve had cause to look into research and ideas about the ways information flows within organizations. Discussions about transparency, decision making, empowering teams, and trust seem to intersect at organizational communication and information flows.

One of my favorite people to follow in this space is Gene Kim (Phoenix Project, DevOps Handbook, Accelerate, and DORA Reports). He has done a few podcasts that focused on relevant topics and concluded that you can predict whether an organization is a high performer or a low performer, just by looking at the communication paths of an organization, as well as their frequency and intensity. (Episode 16, @54 min)

Some of these ideas might resonate with you. There are generally two forms of information flows:

Slow flows where we need detailed granularity and accuracy of information. Leadership usually needs to be involved in these discussions so communication tends to escalate up and down hierarchies.
Fast flows where frequency and speed tend to be more important. These flows occur in the operational realm, where work is executed, and happen directly between teams using existing interfaces.

In the ideal case, a majority of the communication is happening within and between teams using fast flows. Forcing escalation up and down the hierarchy means getting people involved who probably don’t have a solid grasp of the details. Interactions are slow and likely lead to poor decisions. On the other hand, when teammates talk to each other or where there are sanctioned ways for teams to work with each other with a shared goal, integrated problem solving is very fast.

This doesn’t mean all information flows should be fast. There are two phases where slow flows are critical: Upfront planning and Retrospective assessment. Planning and preparation are the activities where we need leaders to be thoughtful about defining the goals and then defining responsibilities and the structures to support them. Later, slow communications come back when we assess and improve our performance and outcomes.

Thinking, Fast and Slow

I want to be clear that fast and slow information flows are different concepts than the fast and slow modes of thinking explored in Daniel Kahneman’s book Thinking, Fast and Slow. The book explores two systems of thinking that drive the way humans make decisions.

System 1 (Fast Thinking): This system is intuitive, automatic, and operates quickly with little effort or voluntary control. It’s responsible for quick decisions, habits, and reactions.
System 2 (Slow Thinking): This system is deliberate, analytical, and requires effortful mental activity. It’s used for complex computations, learning new information, and solving difficult problems.

Kahneman discusses how these two systems can work together, but sometimes lead to biases and errors in judgment. He talks about how these modes can affect decision-making and offers suggestions for how we can become more aware of these biases to make better decisions.

This is obviously another area worth exploring to understand how organizations can support people in creating better outcomes.

April 8, 2022

Project: Networked LED Pixel Display

I have been wanting to play around with an ESP32-based micro for a while. Once I became comfortable with Adafruit’s microcontrollers and CircuitPython, I thought I’d try out some of their ESP32 offerings. I bought a few Airlift (ESP32) Featherwings to use with the Feather RP2040 boards I was experimenting with.

I’ve also been messing around with some WS2818 / NeoPixel LED 8×8 and 16×16 grids, so I thought it might be interesting to work on a web-based pixel display.

CircuitPython has some very handy libraries for building LED animations on strings or grids of WS / Neopixels. The Airlift ESP32 also has a library to create network clients and access points—surely this wouldn’t be too hard.

Here are the shapes I used to 3D print the parts:

(I really need to improve my enclosure design skills)

Something I’ve picked up from other people making LED grid displays: Use a lattice grid and diffuser to create an even “pixel” instead of a bright point of light (depending on your tastes). It took a few tries to get the lattice to match nicely with the LED matrix circuit board. While most people use an acrylic diffuser, I just used a piece of card stock paper.

Using the CircuitPython LED animations library, it was relatively easy to try out a variety of different LED matrix animation patterns. Building on the primitives in the library, I created additional functionality that supports text-based and sprite-based animations.

With so many different patterns, sprites, and options to manage, I decided to use a web-based UI to handle the experience.

The UI is served from the device itself. This turned out to be more challenging than you might think. I’ll do a separate post on serving a web-based UI while also running LED animations on the device.

January 9, 2022

Project: LED Fiber Optic Lamp

I’m looking back at one of the first real projects I attempted that combined 3D printing and microprocessors. I received a Creality Ender 3 V2 a year ago and after playing around with some test prints, I wanted to try building some more interesting and complex projects. I came across this fiber optic LED lamp project via Instructables. It was just the right amount of 3D printing, microprocessors, and coding I was looking for at the time.

I tend to use components from Adafruit. They have a strong focus on learning. The guides and CircuitPython are great for getting started. So given the great set of instructions, my challenge was basically recreating the 3D models and porting to an Adafruit microprocessor running CircuitPython. The author already provided the 3D models as STLs and in Tinkercad (I also really like the simplicity of Tinkercad), but I wanted to reduce the number of fiber strands and make the lamp slightly smaller.

I figured out the general structure and process of the original model pieces by investigating the Tinkercad project. It didn’t take long to recreate some shapes that I could use to build the lamp structure.

CircuitPython has great support for individually addressable WS2818 / NeoPixel strands, so it was relatively simple to get some code working that would create some simple LED animations. I was using a Trinket M0, which is so tiny. I ran into some space issues where I couldn’t add all the animation support code I wanted onto the board. If I ever revisit this project, I’ll probably switch to a QT Py RP2040 or QT Py ESP32-S2, both of which have plenty of space, way more power, and the ESP32-S2 board would even allow for some network/web configuration UX.

Here are a few photos of the assembly process of the LEDs and optic fiber in the frame.

The fiber optic cable is a “side glow” type used for decorations. It’s designed to create a glow.

Here is the base with the wires and a breadboard for the Trinket M0 (not inserted yet), along with a small button which can be used to change the animation modes.

The CircuitPython code is very simple and is available in a GitHub repo. I’m pretty happy with the finished project. Some things I’d want to address if I decide to work on a revision:

Using CircuitPython doesn’t leave much room for user code on the Trinket M0, so I’d probably just bump up to one of the newer QT Py models. I’ll be able to add more animation modes too.
Hot gluing the breadboard into the base isn’t sturdy enough. I’ll need to attach the next board with screws/nuts.
Selecting animation modes using the button is not very friendly. If I bump up to a QT Py ESP32-S2, I’ll add a web setup UI.

December 5, 2021

CircuitPython, LEDs, and Animations

I’ve been playing around with some WS2818 / NeoPixel LED 8×8 grids and CircuitPython. The CircuitPython ecosystem is really rich and Adafruit makes some very handy support libraries. I was using the LED Animation library to create some patterns on the NeoPixel grid, but wanted to try adding more capabilities. The time-slicing approach made it easy to add other code without blocking program execution while the animations were running.
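To make the time-slicing idea concrete, here is a minimal sketch in plain Python. It is not the LED Animation library’s code: the class `TimeSlicedAnimation`, its `frames`, `speed`, and `clock` parameters, and the `run` helper are all hypothetical names I made up for illustration. The point is only that `animate()` returns right away when a frame isn’t due, so the main loop stays free for other work.

```python
import time


class TimeSlicedAnimation:
    """Hypothetical stand-in for an animation that uses time-slicing."""

    def __init__(self, frames, speed, clock=time.monotonic):
        self.frames = frames          # sequence of frames to cycle through
        self.speed = speed            # minimum seconds between frames
        self.clock = clock            # injectable so the sketch can be tested
        self.index = 0
        self.next_update = clock()    # first frame is due immediately

    def animate(self):
        """Advance one frame if it is due; never sleep or block.

        Returns the frame that was shown, or None when it is not time yet.
        """
        now = self.clock()
        if now < self.next_update:
            return None
        frame = self.frames[self.index]
        self.index = (self.index + 1) % len(self.frames)
        self.next_update = now + self.speed
        return frame


def run(animation, other_work, iterations):
    """Main loop: the animation and the other work share every pass."""
    shown = []
    for _ in range(iterations):
        frame = animation.animate()
        if frame is not None:
            shown.append(frame)       # on a device this would update the LEDs
        other_work()                  # e.g. check a button or poll the network
    return shown
```

On real hardware the loop would run forever and the frame would be written to the pixels, but the shape is the same: every pass through the loop gives each task a quick turn, and nothing calls `time.sleep()`.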

In particular, I wanted to add bitmap sprite animations and text scrolling. There are great libraries and examples in Adafruit’s collection of Learn tutorials, but I didn’t see anything that played well with the time-slicing. I took a crack at building some of my own support.

Animation Extras is a couple of simple code helpers that add bitmap sprite and text scrolling support by building on the LED Animation library.

The LED Animation library has some slick ways of grouping individual pixels to create patterns. I added helpers to create the rectangular animation pattern based on those grouping primitives. Check out the repo for some example usage.

October 18, 2021

Tracking Work is Fundamental

“Developers should only need GitHub Issues and Pull Requests to do their job.” Why should anyone need more than that to track work?

Small companies and startups have small engineering teams. The amount of effort required to understand the ongoing and planned work is low due to sheer lack of ability to take on too much and succeed. Failure weeds out the companies that take on too much, too soon.

Companies succeed and grow, and so do the engineering teams. At some point, multiple engineering teams are created or evolve. Ideally, these teams are self-sufficient and isolated from each other, creating modular and decoupled output. This ideal state rarely lasts and soon cross-team projects start to appear. Teams continue to evolve into product and platform functions, creating more opportunities for cross-team dependencies.

At this point, work can no longer be tracked at the developer level alone. Success requires collaboration and coordination. Companies without a cohesive work tracking system that can span individual teams start to slow down. Requirements and dependencies become difficult to track and often fail to meet expectations, which leads to rework and churn. Deliverables miss their guesstimated timelines and drag on.

Making work visible is a core attribute of many different methodologies and processes, even the ad-hoc ones. If you don’t have a bird’s eye view of the engineering work happening at your company, what can you say about your situation? Very little. Try to ascertain the status of a given cross-team project without asking someone. If it takes you longer than 5 minutes, you’re in trouble and the people you would have asked don’t really know either. All of this work required to figure out a project status is wasting people’s time.

Work tracking isn’t hard to introduce and provides value. It doesn’t need to add extra work for developers, and it also starts to provide value to team leads, project managers, and senior leadership.

“Not Jira!” The cry goes out across engineering. It doesn’t need to be Jira, but don’t hate a tool for being successful at what it does. Just because most companies don’t put enough effort into running Jira well doesn’t make work tracking tools bad in general. Pick something else — except fucking spreadsheets!

“You die a ~~hero~~ lightweight tool, or you live long enough to become the ~~villain~~ bloated enterprise-ready system”

Continuously Doing a Thing

Practice makes perfect — Anonymous Parent

A theme that keeps popping up in my world is the idea that how often an action is done correlates with how well it is done.

Deploying application and system code
Releasing application distributions
Triaging issues
Testing product behavior
Creating objectives
Running experiments
Executing migrations

A lot has been written about high-performing engineering teams. Accelerate is a great resource for exploring the behaviors of such teams. Deploy frequency is one of the leading indicators and was chosen by the authors as a key metric. With enough practice, deploys become low-risk and low-stress.

Small batches are another trait of successful teams. Deploying more frequently usually means fewer changes ship each time, and those smaller batches can actually improve overall quality.

Rotating a large group of people through activity shifts, like handling issue triage or the application release process, allows the group to share the burden, but there are downsides too. If the activity isn’t part of the group’s primary deliverable, it’s likely not a priority. If there are long stretches of time between any given person taking on the activity, there might only be enough time to just do the work, but never to think about how to improve the process or tooling. There is no time to become good at the process.

The DevOps Handbook talks a lot about the benefits of shorter feedback loops across many different aspects of engineering organizations. In most situations, shorter feedback loops happen when an activity becomes a more continuous process.

If you have an area that could be improved, maybe you could ask yourself if the process could happen more often.

July 25, 2020 (updated July 26, 2020)

Being an Effective Engineering Leader

I often wonder if I’m being effective at my job. It might be related to my impostor syndrome, but in engineering management, the signals of effectiveness aren’t always clear. I have some basic, high-level criteria I try to think about monthly or so to provide some insight.

Providing a clear direction

Lack of clear direction can sometimes be seen when teams are doing medium-term/quarterly planning. If the objectives aren’t aligned with upper management, it’s probably my fault for not creating clear direction and expected outcomes. Try not to be too prescriptive, but make sure the goals are clearly defined.

Try to have a good narrative for each of these levels:

Vision: How the team(s) create impact
Mission: Role of the team(s) within the company
Objectives: Think about the next year

Shipping what matters

Do the same problems keep coming up? Make sure we are prioritizing the right work. Make sure we are completing the work. Talk to senior engineers about problems that seem to be holding us back.

Focus on capabilities. Keep improving the operational capabilities of the company. Feature projects are built on capabilities.

Maintain a healthy mix of project sizes. Big projects can stall shipping momentum. Make sure big projects are broken into smaller milestones and iterations. Small projects might feel like low impact, but sometimes are just what people are asking to see.

Helping people grow

Have honest conversations about expectations and performance, and provide actionable feedback.

Make space for other people by getting out of the way. For projects and meetings where I’m getting invited as a point-of-contact, look for other people I can delegate the role to.

Surveys and Feedback Loops

Workplaces typically have company-wide engagement surveys to get feedback on many aspects. Those usually have a management section, and this feedback can be a gift. Interpreting feedback in a positive way, and not as a personal attack, might be a learned skill, but it’s worth learning.

Thanks to Nick DiStefano for a reminder that manager surveys are also a useful way to get regular feedback on how things are going. Manager surveys can happen more frequently than company-wide engagement surveys and are usually more focused at the team-level.

April 18, 2020 (updated May 2, 2020)

Stability: Smarter Monitoring of Application Crashes

I had posted about the way Tumblr uses time-series monitoring to alert on crash spikes in the Android and iOS applications. Since then, we’ve done a lot of work to reduce the overall volume of crashes. As a result, we created a new problem: it was possible for a handful of people, caught in crash cycles, to cause our stability alerts to trigger.

Once the stability alert is triggered, we typically start looking in the crash logging systems, like Crashlytics or Sentry, to find more information about the crash. We found an increasing number of occurrences where no particular crash could be easily identified as causing the spike.

Getting paged at 2am because of a stability alert is not great, but not finding a crash was even worse.

The problem was the way we were monitoring. Simply watching all crash events wasn’t good enough. We had to start normalizing the events across the users. Thankfully, we collect events and not simple ticks. We have rich data in the crash event, including a way to group events coming from the same device.

Here is an example of the two styles of monitoring over the last week:

raw count shows a large spike, but the unique device count shows a normal trend

That after-midnight Raw Count spike on Friday would have paged our on-call person if we hadn’t changed to alert on the Unique Device Count instead. We still use the Raw Counts to identify issues and investigate, but we don’t alert on them. We can use the high-cardinality events to zero in on the cause of the spike. In this case, two people were having a bad experience using their Cubot Echo devices.

Since moving to the new alerting metric, we’ve had far fewer after-hours pages, while still being able to focus on the stability of the applications across our user base.
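As a rough sketch of the difference between the two counts (not our actual monitoring pipeline), here is how raw and unique-device counts could be computed per time bucket. The event fields `timestamp` and `device_id`, the one-hour bucket, and the function names are assumptions made up for this example.

```python
from collections import defaultdict


def crash_counts(events, bucket_seconds=3600):
    """Return {bucket_start: (raw_count, unique_device_count)}.

    Each event is a dict with hypothetical keys `timestamp` (epoch seconds)
    and `device_id`.
    """
    raw = defaultdict(int)
    devices = defaultdict(set)
    for event in events:
        # Round the timestamp down to the start of its bucket.
        bucket = event["timestamp"] - event["timestamp"] % bucket_seconds
        raw[bucket] += 1
        devices[bucket].add(event["device_id"])
    return {b: (raw[b], len(devices[b])) for b in sorted(raw)}


def buckets_to_page(counts, threshold):
    """Alert on unique devices only; raw counts stay for investigation."""
    return [b for b, (_, unique) in counts.items() if unique >= threshold]
```

With this shape, two devices stuck in a crash loop can push the raw count of a bucket into the hundreds while its unique device count stays at two, so the bucket never crosses the paging threshold.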

April 6, 2020

Engineering Productivity: Being Actionable

There is a lot of information about engineering productivity out there. No one says it’s easy, but it can be downright difficult to turn the practices you hear about into plans you can put into action. What follows is an example of how we can create an actionable plan to increase our productivity.

Let’s define engineering productivity as how effectively your engineering team can get important and valuable work done.

How do you determine important and valuable work? — goals and objectives.
How do you effectively get work done? — remove wasted time and effort from the delivery cycle

Goals, Planning, and Prioritizing

If productivity is an organizational goal, you need to make sure people understand why and how it affects them. You need to communicate the message over and over in as many venues as possible. The more developers understand the goals and the direction, the more engaged they’ll be with the work.

Engineering teams should try to set ambitious goals, focused through the lens of the team’s mission statement. We also try to create measures for defining success — yes, this is the OKR framework, but any goal or strategy planning can be used. We try to keep goals (objectives) and measures (key results) from being project to-do lists. Projects are tasks we can use to move the measures. Goals are bigger than projects.

Engineers need to clearly understand the importance of their work. Large backlogs of work create decision fatigue about what work to prioritize. Without planning and prioritizing, we can end up with teams that aren’t aligned — the opposite of productivity. Use your goals, even the high-level organizational goals, as a guide to prioritize work.

Removing Wasted Time and Effort

A great resource for exploring engineering performance is the book Accelerate. Based on years of research and collected data (State of DevOps reports), the book sets out to find a way to measure software delivery performance — and what drives it. Some important measures include:

Lead Time: time it takes to go from a customer making a request to the request being satisfied.
Deployment Frequency: frequency as a proxy for batch size since it is easy to measure and typically has low variability. In other words: smaller batches correlate with higher deploy frequency and higher quality.
Time to Restore: given software failures are expected, it makes more sense to measure how quickly teams recover from failure.
Change Fail Percentage: a proxy measure for quality throughout the process.

Each of these measures could be a goal we want to focus on and improve. Each measure has an impact on our ability to deliver software faster with better quality. Let’s also call out that these measures are somewhat overlapping and interdependent.
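To show how the four measures could fall out of ordinary deploy records, here is a small sketch. It is only an illustration, not how Accelerate or the State of DevOps reports compute them: the `Deploy` record, its field names, and the use of a median are my own assumptions, and the function expects at least one deploy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    requested_at: datetime                  # when the work was requested
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did this change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored


def delivery_metrics(deploys, period_days):
    """Summarize the four measures for a non-empty list of deploys."""
    lead_times = [d.deployed_at - d.requested_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at
    ]
    return {
        "lead_time": median(lead_times),
        "deploys_per_day": len(deploys) / period_days,
        "time_to_restore": median(restore_times) if restore_times else None,
        "change_fail_pct": 100 * len(failures) / len(deploys),
    }
```

All four numbers come from the same records, which is one way to see why the measures overlap: shipping smaller changes more often moves lead time, deploy frequency, and usually the failure numbers together.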

Creating an Action Plan

Picking a Goal

As an experiment, let’s take one, Lead Time, and see how we could brainstorm ways to improve it. In a different favorite book, The DevOps Handbook, we’re presented with ways to effect change in Lead Time. A short summary that does not do justice to the depth presented in the book:

Reduce toil with automation
Reduce number of hand-offs
Find and remove non-value time
Create fast and frequent feedback loops

Let’s think about what’s involved between filing a ticket to start work and delivering the work to the end user. Many different tasks and activities happen within this cycle. This becomes the scope we can work within. Some high-level things come to mind:

Designing
Coding
Reviewing
Testing
Bug Fixing
Ramping
Monitoring

Picking Measures

We should be thinking about ways to measure success and failure of these activities. This should be independent of the work we intend to undertake. We can draw upon the pain and stumbles that have happened in the past. Finding good measurements can be a very hard process itself. Let’s not be unrealistic about our expectations on manual processes — we’re only human and people make mistakes. Think about ways to make it easy to succeed and hard to fail:

Find more defects in pre-release than post-release: We’re always going to have bugs, but let’s try to find and fix more of them before releasing.
Reduce the times a project gets bumped to next release: This happens a lot and for many different reasons. We should be better at hitting the desired timeline.
Reduce the time it takes people to be exposed to a feature release: It can take days or weeks for people to “see” new features appear in the apps when ramping a feature flag. This also makes A/B testing painful.
Reduce the times a feature flag is rolled back: Finding problems after we ramp a feature in production is costly, painful, and slows the release of the feature.
Reduce time to detect and time to mitigate incidents: We’ll always have breaking incidents, but we need to minimize the disruptions to people using the product. Minutes, not days.
Reduce amount of non-value time: It’s hard to say “code should be reviewed in X minutes”, or “bugs should be found in Y hours”, but it’s easier to identify dead-time in those activities.
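For the last measure, dead time is mostly the gap between one activity ending and the next one starting. Here is a minimal sketch of that calculation, assuming each ticket carries an ordered list of `(stage, started_at, finished_at)` tuples. That record shape and the function name `waiting_time` are hypothetical, not something from our tooling.

```python
from datetime import datetime


def waiting_time(stages):
    """Return the idle gap between each stage ending and the next starting.

    `stages` is an ordered list of (name, started_at, finished_at) tuples.
    """
    gaps = []
    for (prev, _, prev_end), (nxt, next_start, _) in zip(stages, stages[1:]):
        gaps.append((f"{prev} -> {nxt}", next_start - prev_end))
    return gaps


ticket = [
    ("Coding", datetime(2020, 4, 1, 9), datetime(2020, 4, 1, 12)),
    ("Reviewing", datetime(2020, 4, 2, 9), datetime(2020, 4, 2, 10)),
    ("Testing", datetime(2020, 4, 2, 10), datetime(2020, 4, 2, 15)),
]
for handoff, gap in waiting_time(ticket):
    print(handoff, gap)
```

In this made-up ticket the code sat for 21 hours waiting on a review, while testing started as soon as the review ended. The sketch says nothing about how long a review should take, only where the work was sitting idle.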

Brainstorming Projects

With our objective and measures sketched out, let’s think about the activities and tasks we want to change. Some are manual. Many involve multiple teams. There are a lot of hand-offs. Let’s create smaller affinity groups based on the tasks and activities using the framework.

Reduce toil with automation

Fast and continuous integration/UI testing
Canary monitoring and alerting
Simple hands-off deployments
Easy low risk feature ramping

Reduce number of hand-offs by keeping cross-functional teams informed and involved

Spec and requirement generation
Test plan generation and updates
Pre-release testing setup

Find and remove non-value time, usually the gaps between stages

Fast edit/build/test cycles for developers
Timely code reviews
All code merges ready for QA next day
File new defect tickets ASAP
Prioritized pre-release defect tickets
Merging green code

Create fast and frequent feedback loops

Timely code reviews
Fast and continuous integration/UI testing
All code merges ready for QA next day
File new defect tickets ASAP
Fast, short feature ramps
Canary monitoring and alerting

This level of grouping is perfect to start brainstorming actual project ideas. We’ve started at an organization-level objective (Increase engineering productivity), focused on a contributing factor (Lead time), and created a nice list of projects that could be used to affect the factor. This is important — we’re not focused on a single large project! We have many potential small, diverse projects. This dramatically increases the probability that we will succeed, to some degree. A single project is an all-or-nothing situation and lowers your probability of success. Most projects fail to complete, for one reason or another.

We also see that some idea groups appear multiple times. This allows us to leverage work to create impact in more ways.
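Spotting the overlap by eye works for a list this size, but it is also easy to check mechanically. This sketch copies the four groups from the lists above into a dict and counts which ideas show up in more than one group. The names `GROUPS` and `shared_projects` are just for illustration.

```python
from collections import Counter

GROUPS = {
    "Reduce toil with automation": [
        "Fast and continuous integration/UI testing",
        "Canary monitoring and alerting",
        "Simple hands-off deployments",
        "Easy low risk feature ramping",
    ],
    "Reduce number of hand-offs": [
        "Spec and requirement generation",
        "Test plan generation and updates",
        "Pre-release testing setup",
    ],
    "Find and remove non-value time": [
        "Fast edit/build/test cycles for developers",
        "Timely code reviews",
        "All code merges ready for QA next day",
        "File new defect tickets ASAP",
        "Prioritized pre-release defect tickets",
        "Merging green code",
    ],
    "Create fast and frequent feedback loops": [
        "Timely code reviews",
        "Fast and continuous integration/UI testing",
        "All code merges ready for QA next day",
        "File new defect tickets ASAP",
        "Fast, short feature ramps",
        "Canary monitoring and alerting",
    ],
}


def shared_projects(groups):
    """Return the ideas that appear in more than one group, sorted by name."""
    counts = Counter(idea for ideas in groups.values() for idea in set(ideas))
    return sorted(idea for idea, n in counts.items() if n > 1)


for idea in shared_projects(GROUPS):
    print(idea)
```

Five ideas land in two groups each, which makes them reasonable candidates to look at first if the goal is to get more than one kind of impact out of a single project.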

If you take anything away from this post, I hope it’s that improving engineering productivity is an actionable goal. We can be systematic and measure results.

Accelerate and The DevOps Handbook cover a lot more than what I’ve presented here. The information on organizational culture and its effects on performance is also very enlightening. I’d recommend both books to anyone who wants to learn more about ways to improve engineering productivity.

stark raving finkle
