Exploring LLMs as Agents: WebAgent Gets a Refactor

My fifth post on LLMs as Agents. I guess it’s a series now.

After refactoring ToolAgent, I decided to do the same with WebAgent, rebuilding it on top of the ToolAgent system. For more context and a refresher on WebAgent, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
  • Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.
  • Tools & Benchmarking: I refactored the ToolAgent code into a reusable base class, added some additional mock tools, and even some basic benchmarking.

Take a look at the repository to see the code.

WebAgent

After refactoring ToolAgent, I decided to see if I could build WebAgent on top of ToolAgent. I think it turned out pretty nicely. Again, Copilot helped too.

WebAgent now derives from ToolAgent, and PageManager (the wrapper around Playwright for browser automation) became a ToolProvider. All of its Playwright automation methods became tools the agent could use. This worked better than I expected.

Starting task: Navigate to wikipedia.org and find the featured article of the day

Thought: "Okay, I will start by navigating to wikipedia.org. Then I will analyze the page content to find the featured article of the day. Finally, I will extract the text of the featured article and output it."
Tool: "navigate", Param: "https://www.wikipedia.org"

Thought: "Now that I have navigated to the Wikipedia homepage, I need to identify the element containing the 'Featured article'. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"

Thought: "The HTML content I retrieved is for the main wikipedia.org page, which is just a language selection portal. I need to navigate to the English Wikipedia to find the featured article. I will click the English link.",
Tool: "click", Param: "#js-link-box-en"

Thought": "Now that I have clicked the link, I need to wait for the page to load.",
Tool: "wait_for_navigation"

Thought: "Now that the English Wikipedia page has loaded, I need to find the 'Featured article' section. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"

Thought": "I have analyzed the page content and found that the 'Featured article' is located within a div with the id 'mp-tfa'. I will now use the 'get_text' tool to extract the text from this element.",
Tool: "get_text", Param: "#mp-tfa"

Thought: "I have successfully extracted the text of the featured article. Now I will output the result."

===== TASK SUMMARY =====
Task: Navigate to wikipedia.org and find the featured article of the day
Status: incomplete
Duration: 12.39 seconds
Input tokens: 68486
Output tokens: 827

The token usage was noticeably higher than in my ToolAgent experiments. This was almost entirely due to get_page_content, which was pushing the entire HTML content of the page into the LLM.

I had to fix that.

Learnings: Managing Context

As I said, my initial implementation of get_page_content just returned the entire HTML content back into the LLM. This wasn’t good. Because the LLM was using the page content to decide what HTML elements it needed to interact with, I could take some steps to limit which HTML elements I returned.

Step 1: Regex

I apologize up front that I even tried this, but I just wanted to see if I could get some quick wins. I asked Copilot to create an implementation of get_page_content that filtered out any elements that were not form elements, elements with IDs, elements with roles or aria-labels, or elements with a small set of special class names (like .btn).

The result was a much smaller content context, but there were issues. Of course there were issues! I lived through the “Use regex on HTML” era and it’s not good.

Step 2: Playwright

I was already using Playwright, so why not use it in a way similar to how we’d use BeautifulSoup? One well-crafted prompt later, Copilot had a very functional version.

Copilot: You’re absolutely right! Using Playwright directly to select elements gives us much better control and maintains the structure and order of the original page.

Even Copilot was relieved we moved away from regex.

The Playwright-based version had some other benefits:

  • Elements are extracted in the same order they appear in the DOM, preserving visual hierarchy
  • Filtering allowed me to only include important attributes like id, class, role, etc.
  • Executing some code in the DOM allowed me to only include visible elements

These helped the LLM make better decisions about the HTML elements it needed to interact with during the task. It also dramatically reduced the amount of input tokens sent to the LLM.
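
Here’s a minimal sketch of what a Playwright-based extractor might look like. This is not the actual WebAgent code; the function name, selector list, and attribute set are my own illustrative choices.

from playwright.sync_api import Page

# Attributes and element types worth keeping for an LLM deciding what to interact with
KEEP_ATTRS = ["id", "class", "name", "role", "aria-label", "href", "type", "placeholder"]
SELECTOR = "a, button, input, select, textarea, [id], [role], [aria-label]"

def extract_interactive_elements(page: Page) -> str:
    """Return a trimmed summary of visible, interactive elements in DOM order."""
    lines = []
    for element in page.query_selector_all(SELECTOR):
        if not element.is_visible():  # only include elements the user could actually see
            continue
        tag = element.evaluate("el => el.tagName.toLowerCase()")
        attrs = " ".join(
            f'{attr}="{value}"'
            for attr in KEEP_ATTRS
            if (value := element.get_attribute(attr))
        )
        text = (element.inner_text() or "").strip()[:80]  # cap text to keep tokens down
        lines.append(f"<{tag} {attrs}>{text}</{tag}>")
    return "\n".join(lines)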

Step 3: Greedy LLM

The last thing I noticed and corrected was the LLM’s desire to ask for the page content whenever it got confused with CSS selectors. And yes, it gets confused. The LLM would fetch the page content even when the page had not changed. Token usage went up, and for the most part, the LLM didn’t zero in on the right CSS selector. It likely just started guessing.

I added an is_maybe_dirty check in PageManager, which was set to false after any call to get_page_content. I set is_maybe_dirty to true after: navigate, click, and type_text.

If is_maybe_dirty is false when get_page_content is called, I return an error result of “Use previously extracted page content”. The LLM correctly interpreted the result and knew it should stick with the HTML content it already had.
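
A rough sketch of the flag, building on the extractor sketched above; the class and method bodies here are simplified versions of the real PageManager.

class PageManager:
    def __init__(self, page):
        self.page = page
        self.is_maybe_dirty = True  # nothing extracted yet, so content is "dirty"

    def navigate(self, url):
        self.page.goto(url)
        self.is_maybe_dirty = True  # page content may have changed

    def click(self, selector):
        self.page.click(selector)
        self.is_maybe_dirty = True

    def type_text(self, selector, text):
        self.page.fill(selector, text)
        self.is_maybe_dirty = True

    def get_page_content(self):
        if not self.is_maybe_dirty:
            # Nudge the LLM to reuse the content it already has
            return {"error": "Use previously extracted page content"}
        self.is_maybe_dirty = False
        return {"content": extract_interactive_elements(self.page)}  # extractor sketched above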

Here’s an example run: “Use ‘https://duckduckgo.com’, search for ‘python tutorial’, and return the title of the first result found” (yes, it’s too fast to read so use the scrubber)

What’s Next

If I keep pushing on ToolAgent and WebAgent, I run the risk of starting to build simple frameworks. That might not be the worst outcome, but it wasn’t my intention when I started these projects.

Instead, I want to explore some embedding and vector search projects. I’ve been using Simon Willison’s llm library, and it already includes some support for creating embeddings. Both SQLite and DuckDB have support for vector search. Sounds like I’ll have a lot to keep me busy.

I also want to begin using some local models. Again, llm already supports this, so I’m unblocked.

Exploring LLMs as Agents: Tools & Benchmarking

I spent some time refactoring the Tool Agent code, added some additional mock tools, and even did some basic benchmarking. For more context, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
  • Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.

Take a look at the repository to see the code.

Code Refactor

The biggest change to the code is the somewhat large refactor to extract the agent-specific code into a ToolAgent class and the tool-specific code into a set of ToolProvider classes. ToolProvider classes are registered with the ToolAgent:

# Register tool providers
agent = ToolAgent(model_name='gemini-2.0-flash')
agent.register_provider(UtilityToolProvider())
agent.register_provider(AppointmentToolProvider())
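
For reference, here’s a hedged sketch of what a ToolProvider might look like; the real base class in the repo likely differs in method names and tool metadata format.

class WeatherToolProvider:
    """Groups related tools and describes them so the agent can list them in its system prompt."""

    def get_tools(self):
        # Map tool names to a description (for the prompt) and a callable (for execution)
        return {
            "get_weather": {
                "description": "Get the current weather for a zipcode. Param: zipcode string.",
                "function": self.get_weather,
            },
        }

    def get_weather(self, zipcode):
        # Mocked result, in the spirit of the project's other mock tools
        return {"temperature": "75 F", "conditions": "Sunny"}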

ToolAgent can then be used in different ways. tool_agent_test.py is a way to run the agent in a terminal window. I’ve been playing with using the agent in a chat system too, but more on that next time.

python tool_agent_test.py

Tools

I wanted to keep adding more tools, which contributed to the code refactor. I couldn’t keep adding more code into the one Python file. With the new tools, I explored two new areas (for me):

  1. LLM-based Tools: A tool that was implemented using a secondary LLM. For a given set of topics, I wanted to “guess” which topics might be of interest to a user based on their input. LLMs are good at that stuff, so I made a simple tool call that used a separate system prompt. (I just realized that I didn’t add token tracking to that tool!) A sketch of this kind of tool follows the list below.
  2. Multi-Parameter Tool Calls: Up until now, my tool calls took zero or one parameters. I wanted to start using multiple parameters, so I introduced an object parameter type with a simple schema. In the future, I want the agent to provide feedback when required parameters are not provided.
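
Here’s a rough sketch of the LLM-based tool idea, using the llm library. The model name, topic list, and prompt wording are illustrative rather than what’s in the repo.

import json
import llm

TOPICS = ["hair", "dentist", "vision"]

def guess_topics_of_interest(user_input):
    """Use a secondary LLM call with its own system prompt to guess relevant topics."""
    model = llm.get_model("gemini-2.0-flash")
    system = (
        "You classify user requests. Return a JSON array containing only topics "
        f"from this list that are relevant to the user's message: {TOPICS}"
    )
    response = model.prompt(user_input, system=system)
    try:
        return json.loads(response.text())
    except json.JSONDecodeError:
        return []  # fall back to no guesses rather than failing the tool call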

Benchmarking

I mentioned in the last post that I needed to start doing some benchmarking on the agent. The code refactor made it somewhat easy for Copilot (yes, I do some vibe coding too) to whip up a very basic benchmarking harness. Some details (more in the readme):

  • Test Cases: Run multiple predefined test cases
  • Success Metrics: Track success rates, execution times, and token usage
  • Reporting: Get detailed reports on test results
  • CSV/JSON Export: Export results for further analysis
  • Token Usage: Track and analyze token consumption

I added simple token tracking because I was running the benchmarking suite quite a bit and wanted to know what kind of money I was spending. Turns out, it was less than a penny a run, but it’s good to know.

Learnings

I was running the benchmark suite trying to figure out why tests kept failing. The benchmarking framework measures success based on two metrics: (1) Did the agent use the minimum set of required tools? (2) Did the agent output a few key words in the final response? Yes, those aren’t the best success checks, but they’re what I have for now.
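
Something like this is all the success check amounts to; the field names here are illustrative, not the exact harness schema.

def is_successful(result, required_tools, expected_keywords):
    """Pass only if every required tool was called and every keyword shows up in the output."""
    tools_ok = set(required_tools).issubset(result["tools_called"])
    output = result["final_output"].lower()
    keywords_ok = all(keyword.lower() in output for keyword in expected_keywords)
    return tools_ok and keywords_ok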

More Jedi Prompt Tricks

The agent was usually failing at using the required tools. Looking at the debug [thought] output from the agent, I noticed that it would assume knowledge instead of using the tools. I found a blog post from Pinecone that talks about this when using LangChain too. They suggested adding explicit items to the system prompt. I already had something along the lines of “ALWAYS use the tools. NEVER assume information” but that wasn’t enough. I added the following:

IMPORTANT: Do not assume or guess or hallucinate information. Call the tools to get ALL needed information.
    - You DO NOT know the current date or time. Use tools to determine the current date and time.
    - You CANNOT do math. Use tools to calculate math.

Temperature

Regarding the randomness of the output, I saw this note in the same Pinecone blog post:

Here we initialize the LLM with a temperature of 0. A low temperature is useful when using tools as it decreases the amount of “randomness” or “creativity” in the generated text of the LLMs, which is ideal for encouraging it to follow strict instructions — as required for tool usage.

Honestly, I had not played with any LLM options yet, but I added the temperature=0 setting and it did help.
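
With the llm library, the option can be passed straight into the prompt call. This is a sketch; whether a given option like temperature is accepted depends on the model plugin.

import llm

model = llm.get_model("gemini-2.0-flash")
response = model.prompt(
    "Which tool should you call next? Respond with JSON only.",
    temperature=0,  # less "creativity" makes it easier to follow strict tool-use instructions
)
print(response.text())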

More JSON Output Handling

I already had some code to handle the times when the LLM returned JSON wrapped in Markdown code fences, but there were times when the LLM would apologize for a mistake and append the JSON. The JSON was correct, but my code failed to extract it! So I added more defensive measures to safely extract JSON from LLM responses. Yes, I should switch to the LLM’s built-in support for JSON schemas at this point.
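
The defensive extraction is roughly along these lines; a sketch, not the exact code in the repo.

import json
import re

def extract_json(response_text):
    """Pull a JSON object out of an LLM response, tolerating code fences and extra chatter."""
    text = response_text.strip()
    # Strip Markdown code fences like ```json ... ```
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # If the model apologized first, fall back to the outermost {...} span
    if not text.startswith("{"):
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start:end + 1]
    return json.loads(text)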

With these changes, I regularly had 100% success in my benchmarking runs! This means I need to make the tests harder.

You: I want to get a haircut tomorrow morning

Thought: I need to find out the current date to determine what tomorrow is. I will use the get_datetime tool to get the current date and time.
Thought: The current date is 2025-03-15. I need to check for available appointments for a haircut. Since there is no haircut specialty, I should first check the available specialties.
Thought: Okay, 'hair' is a valid specialty. I will now check for available 'hair' appointments.
Thought: I have a list of available hair appointments. The user wants an appointment tomorrow morning, which is March 16th. There are two appointments available on March 16th: 11:00 AM and 2:00 PM. The user wants a morning appointment, so the 11:00 AM appointment is the best fit. I will ask the user if they want to book appointment ID 11.

Agent: I see two appointments available for hair on March 16th: 11:00 AM and 2:00 PM. Would you like to book appointment ID 11 for 11:00 AM?

What’s Next

As I started to explore some non-chat user experiences, I discovered that I will need the agent to provide more structured output. Right now the output is very conversational, which works well for chat, but won’t be easy to use with other types of UX. I found some discussion on taming the output of an LLM in this blog post from Tryolabs.

I’ve also started using ToolAgent in my own chat server. I’m sure this branch of experimentation will continue as well.

Exploring LLMs as Agents: Planning via Prompting

My exploration of how to use LLMs as Agents continues. In this post, I’m looking at ways to improve the outcomes by using better approaches to planning. For more context, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.

Take a look at the repository to see the code.

Planning & Reasoning

I have seen a lot of people talking about “planning” and “reasoning” as crucial parts of an LLM agent system, so I started to read and learn more about different strategies. There is obviously a ton written about LLMs and Agents all over the internet, but I stumbled upon a useful web site that collects a lot of the information and has references to where the ideas originated.

Prompt Engineering Guide, not surprisingly, focuses on many different prompting strategies. It turns out, most of the different planning and reasoning strategies are based on different prompting strategies. Here are some of the strategies I progressed through:

  • Zero-shot: I think of this strategy as “hoping the LLM was trained well enough to just know the answer”. Hope is not a strategy, but you’d be surprised how many people believe the LLM can just figure everything out on its own.
  • Few-shot: This strategy enables in-context learning and depends on adding some examples to your prompt to help guide the model to better outcomes. My initial tool_agent used this approach before I even knew what it was called.
  • ReAct (Reasoning + Acting): This strategy combines reasoning and acting with LLMs into the prompt. ReAct prompts LLMs to generate reasoning and actions for a task. Asking the LLM to “think about the step” is enough to create significant improvements to the outcomes.

Funny thing: I thought I wanted to try using something called Chain of Thought (CoT), which is another prompt strategy to get the LLM to think about the steps. CoT turned out to be a pain to implement in a simple way. I ended up wanting to merge the “thought” (reasoning) and “action” parts together into the same response, which is basically ReAct.
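
To make that concrete, here’s a hedged sketch of the kind of instructions that merge the thought and the action into a single JSON reply; the exact schema in my prompts differs a bit.

REACT_FORMAT = """
For every step, respond with a single JSON object:
{"thought": "<your reasoning about this step>",
 "type": "call_function" or "output",
 "tool": "<tool name, when type is call_function>",
 "param": "<tool parameter, if any>",
 "value": "<final answer, when type is output>"}

Example:
{"thought": "I need the zipcode before I can look up the weather.",
 "type": "call_function", "tool": "get_zipcode", "param": "Beverly Hills"}
"""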

Example Flows

I honestly don’t know if the ReAct strategy is yielding better outcomes than the Few-shot approach I was using. I guess I need to set up benchmarks next? The flows are promising though, and the ReAct approach didn’t add much to the size of the code.

You: What's the weather in Beverly Hills?

Thought: The user wants to know the weather in Beverly Hills. First, I need to get the zipcode for Beverly Hills, then I can use that to get the weather.
Thought: Now that I have the zipcode for Beverly Hills, I can get the weather.
Thought: I have the weather information. Now I need to present it to the user.

Agent: The weather in Beverly Hills is 75 F and Sunny.

Here’s a longer one that uses the appointment scheduling tools:

You: I want to book a haircut

Thought: The user wants to book a haircut. I need to find the appropriate specialty to book an appointment for. I should first check the available specialties.
Thought: The available specialties are hair, dentist, and vision. Since the user wants a haircut, the relevant specialty is 'hair'. Now I need to find available appointments for 'hair'.
Thought: I have found the available appointments for a haircut. Now I need to present them to the user and ask them to choose one.

Agent: Here are the available appointments for a haircut:

* ID: 10, Date: 2025-03-09, Time: 10:30 AM
* ID: 11, Date: 2025-03-10, Time: 2:00 PM
* ID: 12, Date: 2025-03-16, Time: 11:00 AM
* ID: 13, Date: 2025-03-16, Time: 3:00 PM

Please tell me the ID of the appointment you would like to book.

You: 12

Thought: The user has selected appointment ID 12. Now I need to book this appointment.
Thought: The appointment has been successfully booked. I need to inform the user.

Agent: Great! Your appointment has been booked successfully for March 16, 2025, at 11:00 AM. The appointment ID is 12.

Learnings

Reflection and Correction

There are other ways we can improve the outcomes of the LLM agent beyond planning and reasoning. Reflection and correction is a way to help the model learn from its mistakes. I already had some basic checks in place to catch invalid JSON and unknown tools, but I made some additions to the code to help:

  • Validation of the JSON actions for expected fields
  • Validation loops to tell the LLM about any validation failures and allow it to try again (a sketch of such a loop follows this list).
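
A sketch of that validate-and-retry loop, using the llm library’s conversation object; the function name, field checks, and error messages are illustrative rather than the exact repo code.

import json

MAX_RETRIES = 3

def get_valid_action(conversation, prompt_text, known_tools):
    """Ask the LLM for an action and re-prompt with the failure reason until it validates."""
    response = conversation.prompt(prompt_text)
    for _ in range(MAX_RETRIES):
        try:
            action = json.loads(response.text())
        except json.JSONDecodeError as error:
            response = conversation.prompt(f"That was not valid JSON ({error}). Try again.")
            continue
        if "type" not in action:
            response = conversation.prompt("Your JSON is missing the required 'type' field. Try again.")
            continue
        if action["type"] == "call_function" and action.get("tool") not in known_tools:
            response = conversation.prompt(
                f"Unknown tool '{action.get('tool')}'. Choose one of: {sorted(known_tools)}. Try again.")
            continue
        return action
    raise RuntimeError("No valid action after several retries")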

Prompt & Tool Consistency

I also realized, once again, that I was confusing the LLM with some of the ways I framed the prompt and tools.

  • I referred to both “tools” and “functions” in the initial prompt and subsequent conversational prompts. I decided to only use “tool” and changed all “function” references the LLM saw to “tool”.
  • I made some of the tool names ambiguous and the LLM would try to use tools that were not appropriate. An easy one is get_specialties which returns the list of specialties used for appointment booking. The LLM would sometimes avoid using the tool to figure out what types of appointments could be booked and would try other tools or guess. I changed the tool name to get_appointment_specialties and the LLM always used it correctly.

What’s Next?

I want to add more tools. I want to add tools that use secondary LLMs and I want to add tools that allow me to use the Agent for my own tasks.

I want to explore different UX around agents. Chat is an obvious choice, and I will likely expose tool_agent in a simple web chat system to play with it more and try to use it for actual tasks.

As mentioned above, I may look into benchmarks too.

Exploring LLMs as Agents: Taking Action

I’m still exploring how to use LLMs to build agents. In the first post, I described some of the motivation and approach. I started working on two agents:

  • web_agent: A basic agent that completes a web-based task using browser automation actions.
  • tool_agent: A basic agent that completes a task using a set of supplied tools or methods.

I’ve been focused more on tool_agent, trying to expand the types of tools I can provide. I had a suggestion to add some read/write type tools and scenarios, so I decided to try that out.

Taking Action

an agent is characterized by the environment it operates in and the set of actions it can perform. — Chip Huyen Agents

I wanted to create a more realistic set of actions for the agent to use. Scheduling appointments seemed like a straightforward scenario with some interesting possible flows. I added a few tool methods and a mocked appointment data store.

  • get_specialties(): Retrieves a list of available specialties for scheduling appointments.
  • get_available_appointments(specialty): Retrieves available appointments for a given specialty.
  • book_appointment(booking_data): Books an appointment based on provided booking data. The agent can pass in an appointment ID or a combination of a date and time.
  • get_my_appointments(): Retrieves the list of appointments booked by the user.
  • cancel_appointment(appointment_id): Cancels a booked appointment based on the provided appointment ID.

Check out the tool_agent.py code for more detail on the way the tools work and the way the tools are added to the system prompt.
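
The gist of how tools end up in the system prompt looks something like this; a sketch, since the real format in tool_agent.py differs.

TOOLS = {
    "get_specialties": "Retrieves a list of available specialties for scheduling appointments. No parameters.",
    "get_available_appointments": "Retrieves available appointments for a given specialty. Param: specialty name.",
    "book_appointment": "Books an appointment. Param: an appointment ID, or a date and time.",
    "get_my_appointments": "Retrieves the list of appointments booked by the user. No parameters.",
    "cancel_appointment": "Cancels a booked appointment. Param: appointment ID.",
}

def build_system_prompt():
    """Describe each tool to the LLM and spell out the JSON action format it must use."""
    tool_lines = "\n".join(f"- {name}: {description}" for name, description in TOOLS.items())
    return (
        "You are an assistant that completes tasks by calling tools.\n"
        f"Available tools:\n{tool_lines}\n"
        'Respond with JSON like {"type": "call_function", "tool": "<name>", "param": "<value>"} '
        'or {"type": "output", "value": "<final answer>"}.'
    )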

Example Flows

I was pleased with how well this rudimentary system worked. Without many tweaks (more on that below), I was able to get some useful exchanges.

You: what kind of appts can I book?
Action raw: {"type": "call_function", "tool": "get_specialties"}
Function result: ["dentist", "vision", "hair"]
Action raw: {"type": "output", "value": "You can book appointments for dentist, vision, and hair."}
Agent: You can book appointments for dentist, vision, and hair.

The LLM did a good job of taking some ambiguity (“appts” and “book”) to figure out that it needed to use get_specialties to answer the question.

You: I want to get a haircut
Action raw: {"type": "call_function", "tool": "get_specialties"}
Function result: ["vision", "dentist", "hair"]
Action raw: {"type": "call_function", "tool": "get_available_appointments", "param": "hair"}
Function result: [... list of appointments ...]
Action raw: {"type": "output", "value": "... output ..."}
Agent: Okay, here are the available hair appointments:

* ID: 10, Date: 2025-03-01, Time: 10:30 AM
* ID: 11, Date: 2025-03-02, Time: 2:00 PM
* ID: 12, Date: 2025-03-08, Time: 11:00 AM
* ID: 13, Date: 2025-03-08, Time: 3:00 PM

In this exchange, the agent was able to figure out that it needed to display “hair” appointments. Interesting: the agent came up with the output display format on its own. When I saw the agent using this format, I added the ability to reference appointments by ID in the tool methods.

You: show me my appointments
Action raw: {"type": "call_function", "tool": "get_my_appointments"}
Function result: [{"id": "3", "date": "2025-03-02", "time": "11:00 AM", "specialty": "dentist"}]
Action raw: {"type": "output", "value": "... output ..."}
Agent: Here are your booked appointments:

* ID: 3, Date: 2025-03-02, Time: 11:00 AM, Specialty: dentist

You: cancel the dentist appointment
Action raw: {"type": "call_function", "tool": "cancel_appointment", "param": "3"}
Function result: {"status": "success", "message": "Appointment canceled successfully."}
Action raw: {"type": "output", "value": "... output ..."}
Agent: Your appointment has been canceled successfully.

I thought I could easily confuse the agent by trying to cancel an appointment by the type (specialty) of the appointment, but the agent used the correct ID for the appointment when cancelling via the tool method.

Learning

At one point, the agent started returning invalid JSON. I assumed I needed to be more strict in my system prompt, but I discovered an error. Since I was surrounding my Python strings with double quotes, the inner quotes were single quotes. My JSON examples were using single quotes — which is illegal in JSON. The LLM was relying on those examples, so I was causing the LLM to emit invalid JSON.

As I was looking at the code for similar issues, I noticed I was sending Python-formatted object output back into the conversation. The LLM was handling it well enough, but I decided to change the output to be legal JSON:

-  response = conversation.prompt(f"Function result: {function_result}")
+  function_result_json = json.dumps(function_result)
+  response = conversation.prompt(f"Function result: {function_result_json}")

What’s Next?

Most of Chip Huyen’s post on Agents talks about “planning”, but I have not really added any planning-specific code to tool_agent yet. Right now, I am getting by with whatever amount of planning the LLM can create itself.

I want to learn more about planning, and how to add a little code to help the agent deal with even more complicated scenarios.

Exploring LLMs as Agents: A Minimalist Approach

Large Language Models (LLMs) are powerful tools for generating text, answering questions, and coding. We’ve moved beyond generating content, and LLMs are now being used to take actions as agents — independent entities that can act, use tools, and interact with their environment. You probably already know all of this.

I wanted to explore using LLMs as agents, but I like to get an understanding of the underlying components before using high-level frameworks that hide all of the minutiae and streamline the process of building production-ready systems. Understanding how the different components work and interact is important to my own learning process.

That’s exactly what my LLM Agents project sets out to do. Instead of relying on frameworks that abstract away the details, the project takes a bare-bones approach to learning how LLMs can function as agents. By minimizing dependencies, I can get a clearer understanding of the challenges, possibilities, and mechanics of building LLM-powered agents.

Why Minimal Dependencies Matter (To Me)

Many existing frameworks promise powerful LLM agent capabilities, but they often come at the cost of hiding the underlying complexities. While these frameworks can be useful, starting with minimal dependencies allows us to:

  • Understand the fundamentals: How does an LLM process information to take actions? How does the system prompt impact the effectiveness of the agent?
  • Explore limitations: What challenges arise when an agent tries to perform a multi-step task? How does the shape of the tools (functions or APIs) impact how the agent can process the flow?
  • Control the design: Without being boxed into a framework’s way of doing things, we can experiment freely. We can then use this knowledge to help pick the right type of framework for more advanced and production use cases.

This project keeps things simple, using only two dependencies: Simon Willison’s great llm Python library (code & docs) for the LLM interactions, and Playwright for handling web automation.

These agents are not production ready, but they are trimmed down enough to see the mechanisms at work.

Meet the Agents

The repository contains two primary agents:

Web Agent: Navigating the Web with LLMs

The Web Agent is designed to interact with websites using the Playwright Python library. Instead of treating a webpage as structured data, this agent lets an LLM interpret raw HTML and decide what actions to take—whether that means clicking a button, typing into a form, or extracting text. I wanted to see how well an agent could navigate something as confusing as a modern website. If you’ve ever done a “view source” or “inspect” on a modern webpage, you know what I mean.

How It Works

  • A PageManager class handles the browser automation using Playwright.
  • The LLM generates the next action based on the current page content and the assigned task.
  • Two modes are available:
    • Non-conversational mode: Every step is processed independently.
    • Conversational mode: The agent maintains memory across multiple interactions, reducing the need to repeat context.
Example Task:
web_agent_conversation("gemini-2.0-flash", "Search for 'LLM agents' and return the first result's title.", "https://duckduckgo.com/")

This runs a search query and extracts the first result’s title, all without predefined scraping rules.

How Did It Go

At one point, the agent was not using valid CSS selector syntax and couldn’t “click” the search button. In spite of not getting to the search results page, the agent returned a “successful” answer. I wondered if the LLM was somehow using its trained knowledge to find a valid answer, but I could not find the result anywhere. I searched DuckDuckGo and Google for the title.

I added the “Explain how you solved the task” prompt and the agent replied that since it was not able to get to the search results, it created a hypothetical answer.

I did two things:

  • I told the agent it was not allowed to make up answers. Just fail gracefully.
  • I gave the agent examples of valid CSS selectors for id, class, and attribute selectors. This really improved the CSS selector accuracy. I had hoped the LLM’s training would have been good enough.

The conversational mode, unsurprisingly, could finish tasks with fewer steps. Memory and retained context matter.

Tool Agent: Using LLMs to Call Functions

The Tool Agent extends an LLM’s capabilities by allowing it to call external functions. Instead of just answering questions, it can interact with a set of predefined tools—simulating API calls, performing calculations, retrieving weather data, and more.

How It Works:

  • A registry of tool functions provides capabilities like:
    • Web search (search_web)
    • Weather lookup (get_weather)
    • Date and time retrieval (get_datetime)
  • The agent follows a conversational loop:
    1. Receives a user query.
    2. Decides whether a tool is needed.
    3. Calls the tool and processes the response.
    4. Outputs the final answer.
Example Interaction:
You: What's the weather in Beverly Hills?
Function result: {'zipcode': '90210'}
Function result: {'temperature': '75 F', 'conditions': 'Sunny'}
Agent: The weather in Beverly Hills (zipcode 90210) is 75 F and Sunny.

Here, the LLM autonomously determines that it needs to retrieve a zip code first before getting the weather.

How Did It Go

It’s not easy to get an LLM to only and always respond using structured output, such as JSON. Some models do better than others, and there are lots of ways to use the system prompt to help get the results you want. I found that I still need to check for Markdown code fences in the output, and remove those.

Note: I saw Simon Willison just updated llm to support schemas, to make structured output easier.

Getting the agent to use the tools (Python functions) required not only being specific about the JSON format and the parameters, but also showing examples. The examples seemed to help a lot. I found some discussions about using an XML-formatted block to describe the set of tools in the system prompt. Something about LLMs being able to handle XML better than JSON. Maybe that is outdated?

I was pretty happy to see the agent use two successive tools (as shown above) to complete a task. I want to play around more to see how that type of chaining can be improved and expanded.

What’s Next?

This has been a fun project and I think there are a few more things I want to try before moving on to the real frameworks:

  • Expanding the set of tools to include real API integrations.
  • Using separate agents to implement tools.
  • Fine-tuning the prompt engineering for better decision-making.
  • Improving the agent’s ability to recover from errors.

Further Reading

  • Simon Willison’s blog is a great place to learn about LLMs and stay up to date
  • browser-use is a full-featured Python framework for creating agents that browse the web to do research
  • PydanticAI is a full-featured Python library that makes it easy to get started building tool-using agents

Work-as-Imagined vs Work-as-Done

With an engineering focus on reducing incidents and improving operational reliability, I frequently come back to the realization that humans are fallible and we should be learning ways to nudge people toward success rather than failure.

There are whole industries and research machines built around the study of Human Factors and how to improve safety, reliability, and quality. One topic that struck me as extremely useful to software engineering is the concept of Work-as-Imagined (WAI) versus Work-as-Done (WAD). Anytime you’ve heard “the system failed because someone executed a process differently than it was documented,” you’re probably looking at a WAI vs WAD issue. This comes up a lot in healthcare, manufacturing, and transportation — where accidents can have horrible consequences.

Full disclosure: There are actually many varieties of human work, but WAI and WAD are good enough to make the point. Steven Shorrock covers the subject so well on his blog: Humanistic Systems 

Work-as-Imagined

When thinking about a process or set of tasks that make up work, we need to imagine the steps and work others must do to accomplish the tasks. We do this for many good reasons, like scheduling, planning, and forecasting. WAI is usually formed by past experiences of actually doing work. While this is a good starting point, it’s likely the situation, assumptions, and variables are not the same.

To a greater or lesser extent, all of these imaginations – or mental models – will be wrong; our imagination of others’ work is a gross simplification, is incomplete, and is also fundamentally incorrect in various ways, depending partly on the differences in work and context between the imaginer and the imagined. — The Varieties of Human Work

Work-as-Done

Work-as-Done is literally the work people do. It happens in the real world, under a variety of different conditions and variables.  It’s hard to document WAD because of the unique situation in which the work was done and the specific adjustments and tradeoffs required to complete the work for a given situation.

In any normal day, a work-as-done day, people:

  • Adapt and adjust to situations and change their actions accordingly
  • Deal with unintended consequences and unexpected situations
  • Interpret policies and procedures and apply them to match the conditions
  • Detect and correct when something is about to go wrong and intervene to prevent it from happening

Mind the Gap

Monitoring the gap between the WAI and the WAD of a given task has been highlighted as an important practice for organizations aiming to achieve high reliability. The gap between WAI and WAD can result in “human error” conditions. We frequently hear about incidents and accidents that were caused by “human error” in a variety of situations:

  • Air traffic near misses at airports
  • Train derailments and accidents
  • Critical computer systems taken offline
  • Mistakes made during medical procedures

It’s natural for us to blame the problem on the gap — people didn’t follow the process — and try to improve reliability and reduce errors by focusing on stricter adherence to WAI. Perhaps unsurprisingly, this results in more rules and processes which can certainly slow down overall productivity, and even increase the gap between WAI and WAD.

Safety management must correspond to Work-As-Done and not rely on Work-As-Imagined. — Can We Ever Imagine How Work is Done?

In recent decades, there has been more focus on WAD: examining the reasons why the gap between WAI and WAD exists, working to align WAI more closely with WAD, and embracing the reality of how the work is done so it can be formalized. Instead of optimizing for the way we imagine work is done, we acknowledge the way work is actually done.

Closing the Gap

In my work, production incidents that occur in software systems are an easy area to find WAI vs WAD happening. Incident management and postmortems have best practices that usually involve blameless reviews of what led to the incident. In many cases, the easiest answer to “how can we stop this incident from happening again?” is better documentation and more process.

Modern incident management is focusing more on learning from incidents and less on root-cause analysis. One reason is that incidents rarely happen the exact same way in the future. Focusing on fixing a specific incident yields less value than learning about how the system worked to create the incident in the first place. Learning about how your systems work in production is harder, but yields more impact by uncovering the weak parts of those systems.

This section could be an entire book, or at least several posts, so I’ll leave it to you to read some of the links.

Desire Paths

The whole WAI vs WAD discussion reminds me of desire paths, which visually show the difference between the planned and actual outcomes.

Desire paths typically emerge as convenient shortcuts where more deliberately constructed paths take a longer or more circuitous route, have gaps, or are non-existent

Tying together desire paths with WAI & WAD, some universities and cities have reportedly waited to see which routes people would take regularly before deciding where to pave additional pathways across their campuses and walking paths.

Information Flows in Organizations

I’ve had cause to look into research and ideas about the ways information flows within organizations. Discussions about transparency, decision making, empowering teams, and trust seem to intersect at organizational communication and information flows.

One of my favorite people to follow in this space is Gene Kim (Phoenix Project, DevOps Handbook, Accelerate, and DORA Reports). He has done a few podcasts that focused on relevant topics and concluded that you can predict whether an organization is a high performer or a low performer just by looking at its communication paths, as well as their frequency and intensity. (Episode 16, @54 min)

Some of these ideas might resonate with you. There are generally two forms of information flows:

  • Slow flows where we need detailed granularity and accuracy of information. Leadership usually needs to be involved in these discussions so communication tends to escalate up and down hierarchies.
  • Fast flows where frequency and speed tend to be more important. These flows occur in the operational realm, where work is executed, and happen directly between teams using existing interfaces.

In the ideal case, a majority of the communication is happening within and between teams using fast flows. Forcing escalation up and down the hierarchy means getting people involved who probably don’t have a solid grasp of the details. Interactions are slow and likely lead to poor decisions. On the other hand, when teammates talk to each other or where there are sanctioned ways for teams to work with each other with a shared goal, integrated problem solving is very fast.

This doesn’t mean all information flows should be fast. There are two phases where slow flows are critical: Upfront planning and Retrospective assessment. Planning and preparation are the activities where we need leaders to be thoughtful about defining the goals and then defining responsibilities and the structures to support them. Later, slow communications come back when we assess and improve our performance and outcomes.

Thinking, Fast and Slow

I want to be clear that fast and slow information flows are different concepts than the fast and slow modes of thinking explored in Daniel Kahneman’s book Thinking, Fast and Slow. The book explores two systems of thinking that drive the way humans make decisions.

  • System 1 (Fast Thinking): This system is intuitive, automatic, and operates quickly with little effort or voluntary control. It’s responsible for quick decisions, habits, and reactions.
  • System 2 (Slow Thinking): This system is deliberate, analytical, and requires effortful mental activity. It’s used for complex computations, learning new information, and solving difficult problems.

Kahneman discusses how these two systems can work together, but sometimes lead to biases and errors in judgment. He talks about how these modes can affect decision-making and offers suggestions into how we can become more aware of these biases to make better decisions.

Obviously another area worth exploring to help understand how organizations can support people to create better outcomes.

Project: Networked LED Pixel Display

I have been wanting to play around with an ESP32-based micro for a while. Once I became comfortable with Adafruit’s microcontrollers and CircuitPython, I thought I’d try out some of their ESP32 offerings. I bought a few Airlift (ESP32) Featherwings to use with the Feather RP2040 boards I was experimenting with.

I’ve also been messing around with some WS2812 / NeoPixel LED 8×8 and 16×16 grids, so I thought it might be interesting to work on a web-based pixel display.

CircuitPython has some very handy libraries for building LED animations on strings or grids of WS / Neopixels. The Airlift ESP32 also has a library to create network clients and access points—surely this wouldn’t be too hard.

Here are the shapes I used to 3D print the parts:

(I really need to improve my enclosure design skills)

Something I’ve picked up from other people making LED grid displays: Use a lattice grid and diffuser to create an even “pixel” instead of a bright point of light (depending on your tastes). It took a few tries to get the lattice to match nicely with the LED matrix circuit board. While most people use an acrylic diffuser, I just used a piece of card stock paper.

Using the CircuitPython LED animations library, it was relatively easy to try out a variety of different LED matrix animation patterns. Building on the primitives in the library, I created additional functionality that supports text-based and sprite-based animations.

With so many different patterns, sprites, and options to manage, I decided to use a web-based UI to handle the experience.

The UI is served from the device itself. This turned out to be more challenging than you might think. I’ll do a separate post on serving a web-based UI while also running LED animations on the device.

Project: LED Fiber Optic Lamp

Looking back at one of the first real projects I attempted which combined 3D printing and microprocessors. I received a Creality Ender 3 V2 a year ago, and after playing around with some test prints, I wanted to try building some more interesting and complex projects. I came across this fiber optic LED lamp project via Instructables. It was just the right amount of 3D printing, microprocessors, and coding I was looking for at the time.

I tend to use components from Adafruit. They have a strong focus on learning. The guides and CircuitPython are great for getting started. So given the great set of instructions, my challenge was basically recreating the 3D models and porting to an Adafruit microprocessor running CircuitPython. The author already provided the 3D models as STLs and in Tinkercad (I also really like the simplicity of Tinkercad), but I wanted to reduce the number of fiber strands and make the lamp slightly smaller.

I figured out the general structure and process of the original model pieces by investigating the Tinkercad project. It didn’t take long to recreate some shapes that I could use to build the lamp structure.

CircuitPython has great support for individually addressable WS2812 / NeoPixel strands, so it was relatively simple to get some code working that would create some simple LED animations. I was using a Trinket M0, which is so tiny. I ran into some space issues where I couldn’t add all the animation support code I wanted onto the board. If I ever revisit this project, I’ll probably switch to a QT Py RP2040 or QT Py ESP32-S2, both of which have plenty of space, way more power, and the ESP32-S2 board would even allow for some network/web configuration UX.

Here are a few photos of the assembly process of the LEDs and optic fiber in the frame.

The fiber optic cable is a “side glow” type used for decorations. It’s designed to create a glow.

Here is the base with the wires and a breadboard for the Trinket M0 (not inserted yet), along with a small button which can be used to change the animation modes.

The CircuitPython code is very simple and is available in a GitHub repo. I’m pretty happy with the finished project. Some things I’d want to address if I decide to work on a revision:

  • Using CircuitPython doesn’t leave much room for user code on the Trinket M0, so I’d probably just bump up to one of the newer QT Py models. I’ll be able to add more animation modes too.
  • Hot gluing the breadboard into the base isn’t sturdy enough. I’ll need to attach the next board with screws/nuts.
  • Selecting animation modes using the button is not very friendly. If I bump up to a QT Py ESP32-S2, I’ll add a web setup UI.

CircuitPython, LEDs, and Animations

I’ve been playing around with some WS2812 / NeoPixel LED 8×8 grids and CircuitPython. The CircuitPython ecosystem is really rich, and Adafruit makes some very handy support libraries. I was using the LED Animation library to create some patterns on the NeoPixel grid, but wanted to try adding more capabilities. The time-slicing approach made it easy to add other code without blocking program execution while the animations were running.
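
The pattern looks roughly like this; the pin, pixel count, and animation choice here are just assumptions for the sketch.

import board
import neopixel
from adafruit_led_animation.animation.comet import Comet

pixels = neopixel.NeoPixel(board.D6, 64, brightness=0.2, auto_write=False)
comet = Comet(pixels, speed=0.05, color=(255, 0, 255), tail_length=8)

while True:
    comet.animate()  # advances at most one frame and returns immediately
    # Other work (polling buttons, scrolling text, handling network requests)
    # runs here between frames instead of being blocked by the animation.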

In particular, I wanted to add bitmap sprite animations and text scrolling. There are great libraries and examples in Adafruit’s collection of Learn tutorials, but I didn’t see anything that played well with the time-slicing. I took a crack at building some of my own support.

Animation Extras is a couple of simple code helpers that add bitmap sprite and text scrolling support by building on the LED Animation library.

The LED Animation library has some slick ways of grouping individual pixels to create patterns. I added helpers to create rectangular animation patterns based on those grouping primitives. Check out the repo for some example usage.