Project: Networked LED Pixel Display

I have been wanting to play around with an ESP32-based micro for a while. Once I became comfortable with Adafruit’s microcontrollers and CircuitPython, I thought I’d try out some of their ESP32 offerings. I bought a few Airlift (ESP32) Featherwings to use with the Feather RP2040 boards I was experimenting with.

I’ve also been messing around with some WS2812 / NeoPixel LED 8×8 and 16×16 grids, so I thought it might be interesting to work on a web-based pixel display.

CircuitPython has some very handy libraries for building LED animations on strings or grids of WS2812 / NeoPixel LEDs. The AirLift ESP32 also has a library to create network clients and access points. Surely this wouldn’t be too hard.

Here are the shapes I used to 3D print the parts:

(I really need to improve my enclosure design skills)

Something I’ve picked up from other people making LED grid displays: use a lattice grid and diffuser to create an even “pixel” instead of a bright point of light (depending on your tastes). It took a few tries to get the lattice to match nicely with the LED matrix circuit board. While most people use an acrylic diffuser, I just used a piece of card stock paper.

Using the CircuitPython LED animations library, it was relatively easy to try out a variety of different LED matrix animation patterns. Building on the primitives in the library, I created additional functionality that supports text-based and sprite-based animations.

With so many different patterns, sprites, and options to manage, I decided to use a web-based UI to handle the experience.

The UI is served from the device itself. This turned out to be more challenging than you might think. I’ll do a separate post on serving a web-based UI while also running LED animations on the device.

Project: LED Fiber Optic Lamp

I’m looking back at one of the first real projects I attempted, which combined 3D printing and microprocessors. I received a Creality Ender 3 V2 a year ago and, after playing around with some test prints, I wanted to try building some more interesting and complex projects. I came across this fiber optic LED lamp project via Instructables. It was just the right amount of 3D printing, microprocessors, and coding I was looking for at the time.

I tend to use components from Adafruit. They have a strong focus on learning. The guides and CircuitPython are great for getting started. So given the great set of instructions, my challenge was basically recreating the 3D models and porting to an Adafruit microprocessor running CircuitPython. The author already provided the 3D models as STLs and in Tinkercad (I also really like the simplicity of Tinkercad), but I wanted to reduce the number of fiber strands and make the lamp slightly smaller.

I figured out the general structure and process of the original model pieces by investigating the Tinkercad project. It didn’t take long to recreate some shapes that I could use to build the lamp structure.

CircuitPython has great support for individually addressable WS2812 / NeoPixel strands, so it was relatively simple to get some code working that would create some simple LED animations. I was using a Trinket M0, which is so tiny that I ran into space issues and couldn’t add all the animation support code I wanted onto the board. If I ever revisit this project, I’ll probably switch to a QT Py RP2040 or QT Py ESP32-S2, both of which have plenty of space and way more power, and the ESP32-S2 board would even allow for some network/web configuration UX.

Here are a few photos of the assembly process of the LEDs and optic fiber in the frame.

The fiber optic cable is a “side glow” type used for decorations. It’s designed to create a glow.

Here is the base with the wires and a breadboard for the Trinket M0 (not inserted yet), along with a small button which can be used to change the animation modes.

The CircuitPython code is very simple and is available in a GitHub repo. I’m pretty happy with the finished project. Some things I’d want to address if I decide to work on a revision:

  • Using CircuitPython doesn’t leave much room for user code on the Trinket M0, so I’d probably just bump up to one of the newer QT Py models. I’ll be able to add more animation modes too.
  • Hot gluing the breadboard into the base isn’t sturdy enough. I’ll need to attach the next board with screws/nuts.
  • Selecting animation modes using the button is not very friendly. If I bump up to a QT Py ESP32-S2, I’ll add a web setup UI.

CircuitPython, LEDs, and Animations

I’ve been playing around with some WS2812 / NeoPixel LED 8×8 grids and CircuitPython. The CircuitPython ecosystem is really rich and Adafruit makes some very handy support libraries. I was using the LED Animation library to create some patterns on the NeoPixel grid, but wanted to try adding more capabilities. The library’s time-slicing approach made it easy to add other code without blocking program execution while the animations were running.
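The time-slicing idea can be sketched in plain Python. This is only an illustration of the pattern, not the LED Animation library’s code: `Blink`, `run`, and the injectable `clock` argument are hypothetical names. Each call to `animate()` returns immediately unless a frame is due, so the main loop is free to do other work on every pass.

```python
import time


class Blink:
    """Toy time-sliced animation: toggles a flag every `speed` time units."""

    def __init__(self, speed, clock=time.monotonic):
        self.speed = speed
        self._clock = clock
        self._next_update = clock()
        self.lit = False
        self.frames_drawn = 0

    def animate(self):
        """Draw one frame if it is due; otherwise return right away."""
        now = self._clock()
        if now < self._next_update:
            return False  # not time yet, hand control back to the caller
        self.lit = not self.lit
        self.frames_drawn += 1
        self._next_update = now + self.speed
        return True


def run(animation, other_work, iterations):
    """Main loop: the animation and the other work share every pass."""
    for _ in range(iterations):
        animation.animate()
        other_work()
```

On a board, `other_work` could be polling a button or a network socket; nothing in the loop ever sleeps waiting for the next frame.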

In particular, I wanted to add bitmap sprite animations and text scrolling. There are great libraries and examples in Adafruit’s collection of Learn tutorials, but I didn’t see anything that played well with the time-slicing. I took a crack at building some of my own support.

Animation Extras is a couple of simple code helpers that add bitmap sprite and text scrolling support by building on the LED Animation library.
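As a rough picture of how text scrolling can fit the same time-sliced style, here is a minimal sketch in plain Python. It is not the Animation Extras code: `scroll_window` and `scroll_frames` are hypothetical names, and the bitmap is assumed to be a list of rows that is wider than the display. Each frame is just a window into the wide bitmap, shifted by one column.

```python
def scroll_window(bitmap, offset, width):
    """Return `width` columns of every row, starting at `offset` and wrapping."""
    total = len(bitmap[0])
    return [
        [row[(offset + col) % total] for col in range(width)]
        for row in bitmap
    ]


def scroll_frames(bitmap, width):
    """Yield one window per column offset, for a full pass over the bitmap."""
    for offset in range(len(bitmap[0])):
        yield scroll_window(bitmap, offset, width)
```

A time-sliced animation would advance the offset by one each time a frame is due and copy the window onto the grid.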

The LED Animation library has some slick ways of grouping individual pixels to create patterns. I added helpers to create a rectangular animation pattern based on those grouping primitives. Check out the repo for some example usage.
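To give a feel for what grouping pixels into a rectangle involves, here is a small sketch in plain Python. It makes two assumptions that may not match a given board or the repo’s helpers: the grid is wired in a serpentine (zig-zag) layout, and a “rectangle” means the outline only. `grid_index` and `rectangle_indices` are hypothetical names.

```python
def grid_index(x, y, width, serpentine=True):
    """Map (x, y) grid coordinates to an index along the LED strip."""
    if serpentine and y % 2 == 1:
        x = width - 1 - x  # odd rows run right-to-left
    return y * width + x


def rectangle_indices(x0, y0, w, h, width, serpentine=True):
    """Return strip indices for the outline of a w-by-h rectangle."""
    indices = []
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            on_edge = x in (x0, x0 + w - 1) or y in (y0, y0 + h - 1)
            if on_edge:
                indices.append(grid_index(x, y, width, serpentine))
    return indices
```

Calling `rectangle_indices(0, 0, 8, 8, 8)` gives the outer ring of an 8×8 grid, and shrinking the rectangle by one pixel per side gives the next ring in, which is enough to build a concentric pattern.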

Tracking Work is Fundamental

“Developers should only need GitHub Issues and Pull Requests to do their job.” Why should anyone need more than that to track work?

Small companies and startups have small engineering teams. The amount of effort required to understand the ongoing and planned work is low due to sheer lack of ability to take on too much and succeed. Failure weeds out the companies that take on too much, too soon.

Companies succeed and grow, and so do the engineering teams. At some point, multiple engineering teams are created or evolve. Ideally, these teams are self-sufficient and isolated from each other, creating modular and decoupled output. This ideal state rarely lasts and soon cross-team projects start to appear. Teams continue to evolve into product and platform functions, creating more opportunities for cross-team dependencies.

At this point, work can no longer be tracked at the developer level alone. Success requires collaboration and coordination. Companies without a cohesive work tracking system that can span individual teams start to slow down. Requirements and dependencies become difficult to track and often fail to meet expectations, which leads to rework and churn. Deliverables miss their guesstimated timelines and drag on.

Making work visible is a core attribute of many different methodologies and processes, even the ad-hoc ones. If you don’t have a bird’s-eye view of the engineering work happening at your company, what can you say about your situation? Very little. Try to ascertain the status of a given cross-team project without asking someone. If it takes you longer than 5 minutes, you’re in trouble and the people you would have asked don’t really know either. All of this work required to figure out a project status is wasting people’s time.

Work tracking isn’t hard to introduce and it provides value. It doesn’t require adding any extra work for developers, and it also starts to provide value to team leads, project managers, and senior leadership.

“Not Jira!” The cry goes out across engineering. It doesn’t need to be Jira, but don’t hate a tool for being successful at what it does. Just because most companies don’t put enough effort into running Jira well doesn’t make work tracking tools bad in general. Pick something else — except fucking spreadsheets!

“You die a hero (a lightweight tool), or you live long enough to become the villain (a bloated enterprise-ready system).”


See also: Merits of Bug Tracking

Continuously Doing a Thing

Practice makes perfect — Anonymous Parent

A theme that keeps popping up in my world is the idea that how often an action is done is correlated with how well the action is done:

  • Deploying application and system code
  • Releasing application distributions
  • Triaging issues
  • Testing product behavior
  • Creating objectives
  • Running experiments
  • Executing migrations

A lot has been written about high-performing engineering teams. Accelerate is a great resource for exploring the behaviors of such teams. Deploy frequency is one of the leading indicators and was chosen by the authors as a key metric. With enough practice, deploys become low-risk and low-stress.

Small batches are another trait of successful teams. Performing a deployment more frequently usually means there are fewer changes happening each time. These small batches can actually improve overall quality because fewer changes happen in each cycle.

Rotating a large group of people through activity shifts, like handling issue triage or the application release process, allows the group to share the burden, but there are downsides too. If the activity isn’t part of the group’s primary deliverable, it’s likely not a priority. If there are long stretches of time between any given person taking on the activity, there might only be enough time to just do the work, but never to think about how to improve the process or tooling. There is no time to become good at the process.

The DevOps Handbook talks a lot about the benefits of shorter feedback loops across many different aspects of engineering organizations. In most situations, shorter feedback loops happen when an activity becomes a more continuous process.

If you have an area that could be improved, maybe you could ask yourself if the process could happen more often.

Being an Effective Engineering Leader

I often wonder if I’m being effective at my job. It might be related to my impostor syndrome, but in engineering management, the signals of effectiveness aren’t always clear. I have some basic, high-level criteria I try to think about every month or so to provide some insight.

Providing a clear direction

Lack of clear direction can sometimes be seen when teams are doing medium-term/quarterly planning. If the objectives aren’t aligned with upper management, it’s probably my fault for not creating clear direction and expected outcomes. Try not to be too prescriptive, but make sure the goals are clearly defined.

Try to have a good narrative for each of these levels:

  • Vision: How the team(s) create impact
  • Mission: Role of the team(s) within the company
  • Objectives: Think about the next year

Shipping what matters

Do the same problems keep coming up? Make sure we are prioritizing the right work. Make sure we are completing the work. Talk to senior engineers about problems that seem to be holding us back.

Focus on capabilities. Keep improving the operational capabilities of the company. Feature projects are built on capabilities.

Maintain a healthy mix of project sizes. Big projects can stall shipping momentum. Make sure big projects are broken into smaller milestones and iterations. Small projects might feel low-impact, but sometimes they are just what people are asking to see.

Helping people grow

Have honest conversations about expectations and performance, and provide actionable feedback.

Make space for other people by getting out of the way. For projects and meetings where I’m getting invited as a point-of-contact, look for other people I can delegate the role to.

Surveys and Feedback Loops

Workplaces typically have company-wide engagement surveys to get feedback on many aspects. Those usually have a management section, and this feedback can be a gift. Interpreting feedback in a positive way, and not as a personal attack, might be a learned skill, but it’s worth learning.

Thanks to Nick DiStefano for a reminder that manager surveys are also a useful way to get regular feedback on how things are going. Manager surveys can happen more frequently than company-wide engagement surveys and are usually more focused at the team level.

Stability: Smarter Monitoring of Application Crashes

I had posted about the way Tumblr uses time-series monitoring to alert on crash spikes in the Android and iOS applications. Since then, we’ve done a lot of work to reduce the overall volume of crashes. As a result, we created a new problem: it was possible for a handful of people, caught in crash cycles, to cause our stability alerts to trigger.

Once the stability alert is triggered, we typically start looking in the crash logging systems, like Crashlytics or Sentry, to find more information about the crash. We found an increasing number of occurrences where no particular crash could be easily identified as causing the spike.

Getting paged at 2am because of a stability alert is not great, but not finding a crash was even worse.

The problem was the way we were monitoring. Simply watching all crash events wasn’t good enough. We had to start normalizing the events across the users. Thankfully, we collect events and not simple ticks. We have rich data in the crash event, including a way to group events coming from the same device.

Here is an example of the two styles of monitoring over the last week:

Chart: the raw count shows a large spike, but the unique device count shows a normal trend.

That after-midnight Raw Count spike on Friday would have paged our on-call person if we hadn’t changed to alert on the Unique Device Count instead. We still use the Raw Counts to identify issues and investigate, but we don’t alert on them. We can use the high-cardinality events to zero in on the cause of the spike. In this case, two (2) people were having a bad experience using their Cubot Echo devices.
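The difference between the two metrics is easy to show with a small sketch. This is not our monitoring code: the event shape (a time bucket plus a device identifier), the function names, and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def crash_counts(events):
    """Summarise crash events per time bucket.

    `events` is an iterable of (bucket, device_id) pairs. Returns a dict
    mapping each bucket to (raw_count, unique_device_count).
    """
    raw = defaultdict(int)
    devices = defaultdict(set)
    for bucket, device_id in events:
        raw[bucket] += 1
        devices[bucket].add(device_id)
    return {bucket: (raw[bucket], len(devices[bucket])) for bucket in raw}


def should_alert(counts, bucket, device_threshold):
    """Alert on the number of affected devices, not on raw events."""
    _, unique = counts.get(bucket, (0, 0))
    return unique >= device_threshold
```

Fifty crashes from two devices stuck in a crash loop produce a raw count of 50 but a unique device count of 2, so a device-based threshold stays quiet while the raw numbers remain available for investigation.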

Since moving to the new alerting metric, we’ve had far fewer after-hour pages, while still being able to focus on the stability of the applications across our user base.

Engineering Productivity: Being Actionable

There is a lot of information about engineering productivity out there. No one says it’s easy, but it can be downright difficult to turn the practices you hear about into plans you can put into action. What follows is an example of how we can create an actionable plan to increase our productivity.

Let’s define engineering productivity as how effectively your engineering team can get important and valuable work done.

  • How do you determine important and valuable work? Goals and objectives.
  • How do you effectively get work done? Remove wasted time and effort from the delivery cycle.

Goals, Planning, and Prioritizing

If productivity is an organizational goal, you need to make sure people understand why and how it affects them. You need to communicate the message over and over in as many venues as possible. The more developers understand the goals and the direction, the more engaged they’ll be with the work.

Engineering teams should try to set ambitious goals, focused through the lens of the team’s mission statement. We also try to create measures for defining success — yes, this is the OKR framework, but any goal or strategy planning can be used. We try to keep goals (objectives) and measures (key results) from being project to-do lists. Projects are tasks we can use to move the measures. Goals are bigger than projects. 

Engineers need to clearly understand the importance of their work. Large backlogs of work create decision fatigue about what work to prioritize. Without planning and prioritizing, we can end up with teams that aren’t aligned — the opposite of productivity. Use your goals, even the high-level organizational goals, as a guide to prioritize work.

Removing Wasted Time and Effort

A great resource for exploring engineering performance is the book Accelerate. Based on years of research and collected data (State of DevOps reports), the book sets out to find a way to measure software delivery performance — and what drives it. Some important measures include:

  • Lead Time: time it takes to go from a customer making a request to the request being satisfied.
  • Deployment Frequency: frequency as a proxy for batch size since it is easy to measure and typically has low variability. In other words: smaller batches correlate with higher deploy frequency and higher quality.
  • Time to Restore: given software failures are expected, it makes more sense to measure how quickly teams recover from failure.
  • Change Fail Percentage: a proxy measure for quality throughout the process.
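To make the four measures concrete, here is a minimal sketch of how they could be computed from simple records. The record shapes, the function names, and the choice of a median are my assumptions for illustration; the book defines the measures, not this code.

```python
from datetime import datetime, timedelta
from statistics import median


def median_duration(pairs):
    """Median of (end - start) over a list of (start, end) datetimes.

    Works for Lead Time with (requested, delivered) pairs and for
    Time to Restore with (failure_started, service_restored) pairs.
    """
    seconds = median((end - start).total_seconds() for start, end in pairs)
    return timedelta(seconds=seconds)


def deploys_per_day(deploy_times, period_days):
    """Deployment Frequency: number of deploys divided by days observed."""
    return len(deploy_times) / period_days


def change_fail_percentage(deploy_outcomes):
    """Share of deploys that caused a failure.

    `deploy_outcomes` is a list of booleans, True if the deploy failed.
    """
    if not deploy_outcomes:
        return 0.0
    return 100.0 * sum(deploy_outcomes) / len(deploy_outcomes)
```

Even rough numbers from a spreadsheet export or a deploy log are enough to establish a baseline and see whether the trend moves.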

Each of these measures could be a goal we want to focus on and improve. Each measure has an impact on our ability to deliver software faster with better quality. Let’s also call out that these measures are somewhat overlapping and interdependent.

Creating an Action Plan

Picking a Goal

As an experiment, let’s take one, Lead Time, and see how we could brainstorm ways to improve it. In a different favorite book, The DevOps Handbook, we’re presented with ways to effect change in Lead Time. A short summary that does not do justice to the depth presented in the book:

  • Reduce toil with automation
  • Reduce number of hand-offs
  • Find and remove non-value time
  • Create fast and frequent feedback loops

Let’s think about what’s involved between filing a ticket to start work and delivering the work to the end user. Many different tasks and activities happen within this cycle. This becomes the scope we can work within. Some high-level things come to mind:

  • Designing
  • Coding
  • Reviewing
  • Testing
  • Bug Fixing
  • Ramping
  • Monitoring

Picking Measures

We should be thinking about ways to measure success and failure of these activities. This should be independent of the work we intend to undertake. We can draw upon the pain and stumbles that have happened in the past. Finding good measurements can be a very hard process itself. Let’s not be unrealistic about our expectations on manual processes — we’re only human and people make mistakes. Think about ways to make it easy to succeed and hard to fail:

  • Find more defects in pre-release than post-release: We’re always going to have bugs, but let’s try to find and fix more of them before releasing.
  • Reduce the times a project gets bumped to next release: This happens a lot and for many different reasons. We should be better at hitting the desired timeline.
  • Reduce the time it takes people to be exposed to a feature release: It can take days or weeks for people to “see” new features appear in the apps when ramping a feature flag. This also makes A/B testing painful.
  • Reduce the times a feature flag is rolled back: Finding problems after we ramp a feature in production is costly, painful, and slows the release of the feature.
  • Reduce time to detect and time to mitigate incidents: We’ll always have breaking incidents, but we need to minimize the disruptions to people using the product. Minutes, not days.
  • Reduce amount of non-value time: It’s hard to say “code should be reviewed in X minutes”, or “bugs should be found in Y hours”, but it’s easier to identify dead-time in those activities.
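Dead time is easier to spot once it is written down as arithmetic. The sketch below assumes each stage of a ticket is recorded as a (name, start, end) tuple with times in hours since the ticket was opened; the record shape and the name `handoff_gaps` are hypothetical.

```python
def handoff_gaps(stages):
    """Return the waiting time between consecutive stages.

    `stages` is an ordered list of (name, start, end) tuples. Overlapping
    stages count as zero waiting time rather than a negative gap.
    """
    gaps = []
    for (prev_name, _, prev_end), (name, start, _) in zip(stages, stages[1:]):
        gaps.append((f"{prev_name} -> {name}", max(start - prev_end, 0)))
    return gaps
```

For a ticket with ten hours of coding, two hours of review, and eight hours of testing, gaps of 24 hours before review and 24 hours before testing mean the work waited more than twice as long as anyone actively spent on it.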

Brainstorming Projects

With our objective and measures sketched out, let’s think about the activities and tasks we want to change. Some are manual. Many involve multiple teams. There are a lot of hand-offs. Let’s create smaller affinity groups based on the tasks and activities using the framework.

Reduce toil with automation

  • Fast and continuous integration/UI testing
  • Canary monitoring and alerting
  • Simple hands-off deployments
  • Easy low risk feature ramping

Reduce number of hand-offs by keeping cross-functional teams informed and involved

  • Spec and requirement generation
  • Test plan generation and updates
  • Pre-release testing setup

Find and remove non-value time, usually the gaps between stages

  • Fast edit/build/test cycles for developers
  • Timely code reviews
  • All code merges ready for QA next day
  • File new defect tickets ASAP
  • Prioritized pre-release defect tickets
  • Merging green code

Create fast and frequent feedback loops

  • Timely code reviews
  • Fast and continuous integration/UI testing
  • All code merges ready for QA next day
  • File new defect tickets ASAP
  • Fast, short feature ramps
  • Canary monitoring and alerting

This level of grouping is perfect to start brainstorming actual project ideas. We’ve started at an organization-level objective (Increase engineering productivity), focused on a contributing factor (Lead time), and created a nice list of projects that could be used to affect the factor. This is important — we’re not focused on a single large project! We have many potential small, diverse projects. This dramatically increases the probability that we will succeed, to some degree. A single project is an all-or-nothing situation and lowers your probability of success. Most projects fail to complete, for one reason or another.

We also see that some idea groups appear multiple times. This allows us to leverage work to create impact in more ways.
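Finding the overlapping ideas is a simple counting exercise. As a small sketch, using the four groups from this post as data (the function name `shared_ideas` and the short group keys are just for illustration):

```python
from collections import Counter

GROUPS = {
    "automation": [
        "Fast and continuous integration/UI testing",
        "Canary monitoring and alerting",
        "Simple hands-off deployments",
        "Easy low risk feature ramping",
    ],
    "hand-offs": [
        "Spec and requirement generation",
        "Test plan generation and updates",
        "Pre-release testing setup",
    ],
    "non-value time": [
        "Fast edit/build/test cycles for developers",
        "Timely code reviews",
        "All code merges ready for QA next day",
        "File new defect tickets ASAP",
        "Prioritized pre-release defect tickets",
        "Merging green code",
    ],
    "feedback loops": [
        "Timely code reviews",
        "Fast and continuous integration/UI testing",
        "All code merges ready for QA next day",
        "File new defect tickets ASAP",
        "Fast, short feature ramps",
        "Canary monitoring and alerting",
    ],
}


def shared_ideas(groups):
    """Return the ideas that appear in more than one group, sorted by name."""
    counts = Counter(idea for ideas in groups.values() for idea in set(ideas))
    return sorted(idea for idea, seen in counts.items() if seen > 1)
```

Running `shared_ideas(GROUPS)` surfaces five ideas that sit in two groups each, which makes them natural candidates to start with.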


If you take anything away from this post, I hope it’s that improving engineering productivity is an actionable goal. We can be systematic and measure results.

Accelerate and The DevOps Handbook cover a lot more than what I’ve presented here. The information on organizational culture and its effects on performance is also very enlightening. I’d recommend both books to anyone who wants to learn more about ways to improve engineering productivity.

Integration Testing: Time to Reboot

I tried to push a plan for bootstrapping an automated integration test system for our Android and iOS applications. The plan was based on similar strategies I’d used, or seen used, at other companies. It didn’t fit well with the current situation and workflows. I failed to take those differences into account and the initiative failed.

Developers never saw the value of using their time and workflow on the automated integration tests. Test engineers were overwhelmed by the amount of manual regression tests required for each release. Even with the manual tests, we have gaps in regression coverage leading to some severe defects shipped to users. We are wasting valuable manual testing time on hundreds of manual regression tests that rarely break, when we should be focusing those people on new feature and exploratory testing.

Looking Forward

We want to be able to do automated testing on our iOS and Android client apps, from the simplest “does the application start” smoke test to more complicated tests around critical features and functionality. More test automation means:

  1. Finding bugs faster
  2. Focusing manual testing on high value tasks (new features and exploratory testing)
  3. Shipping releases faster with higher quality

Test engineering is highly motivated to do more integration testing as a way to reduce the number of manual regression test cases. Though they don’t have development experience, those folks want to start creating the tests. So we want to keep the barrier to writing tests very low. As we automate regression tests, we want to focus manual testing on new features, exploratory testing, and ad-hoc edge-case testing.

Objectives for our automated integration testing reboot:

  • Doesn’t require knowledge of building applications or the languages used to develop the applications.
  • Requires little knowledge of the structure used to build the UI of the applications.
  • Reuse integration testing framework, code, and knowledge across all application platforms.
  • Reduce the amount of manual integration testing as much as possible.

Approach

We intend to use a black-box approach to installing, launching, and driving the applications. The plan includes:

  • Using Python-based Appium scripts as the framework for integration tests. Python is a good entry-level programming language and Appium has capabilities to black-box test Android and iOS clients. We’re leveraging the same language and framework for both mobile platforms.
  • Using emulators & simulators to run smoke and integration tests. They’re easy to set up and run locally, while also capable in CI.
  • Running the tests several times a day using CI, but not on each PR. The focus is on reducing manual regression testing, while not adding friction to developer workflows.
  • Sending only consistently failing tests to QA for manual verification and ticket filing.
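The “consistently failing” filter in the last item can be sketched in a few lines. This is an illustration rather than our actual system: the history format, the name `consistently_failing`, and the default of three consecutive failures are assumptions.

```python
def consistently_failing(history, min_consecutive=3):
    """Return the tests whose most recent runs all failed.

    `history` maps a test name to a list of booleans, oldest run first,
    where True means the run passed. A test is flagged only if it has at
    least `min_consecutive` runs and the latest ones are all failures.
    """
    flagged = []
    for name, results in history.items():
        recent = results[-min_consecutive:]
        if len(recent) == min_consecutive and not any(recent):
            flagged.append(name)
    return sorted(flagged)
```

A flaky test that alternates between passing and failing never reaches the threshold, so QA only sees failures that are likely to be real.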

Milestones

Milestone 1 is about creating a solid foundation for the approach. We’ve completed the proof of concept:

  • Test engineers are building scripts using cross-platform tools — and learning to code.
  • Developers have added a few testing hooks into the clients to allow faster, more robust tests.
  • Python scripts have been created for over 200 integration tests using Appium.
  • Tests are running in CI several times a day.
  • Test engineering created a simple system to send consistent failures to QA.
  • Reliability is better than the previous Espresso/XCUITest test suite.

We’ve already saved several tester-hours a day from manual regression testing.

Milestone 2 is really just expanding the test coverage from only high priority test cases to medium and even low priority test cases. We’re also expanding the tooling to support running the Appium tests on both Alpha and Beta channels, as well as self-service support for running on pull requests. Some additional tasks:

  • Get better at controlling feature flags for more deterministic test flows
  • Start mocking API responses for faster testing and fewer variations due to live data
  • Intercept outgoing requests to track and verify more analytics
  • Create a smaller, faster suite for PR testing

Shipping Faster: The Hackday Mentality

We recently held our Summer Hackday at Tumblr and the results were impressive. I started to think about the mentality of a Hackday and how it differs from a more traditional product feature workflow.

It’s amazing what can be accomplished in a day.

Individuals or small groups start planning their projects in the days leading up to the event. On the day of the event, they’re off and running. They have 24 hours to get something working and demo it to the rest of the company.

Dead-end technical approaches are quickly discarded for alternatives, usually something more simple — the clock is ticking. There is no time for complexity or grand schemes. Get the basics working so you can impress your coworkers.

Tumblr Hackometer measures completeness & shipping potential

After the demo presentations, there is always discussion about “how close is this project to shipping?” or “what’s left to do before we could release this project?” or some other notion that we could productize certain projects after a little clean-up work.

Productize — The death knell of the Hackday project. But why? I think it’s scope creep. Scope of the purpose, but also scope of the code.

The constraint of limited time is a gift, forcing or removing decisions which create a better environment for completing the project. Hackday projects are often more aligned with the core purpose of the product as well.

  • Focus on a singular purpose — try to be good at one thing.
  • No time or space for complexity — you can’t build whole new architectures.
  • Built on existing frameworks, patterns, and primitives — it fits into the existing product structure.

The Hackday mentality seems like a better process for building better products. It reminds me of the “fixed time, variable scope” principle from Basecamp’s Shape Up, a book describing their product process. They use six-week time-boxes for any project.

Constraints limit our options without requiring us to do any of the cognitive work. With fewer decisions involved when we’re constrained, we’re less prone to decision fatigue. Constraints can actually speed up development.

Ship faster.