Exploring LLMs as Agents: Google Agent Dev Kit

Up to this point, I have been experimenting with my own home-grown agent framework, based on Simon Willison’s great LLM project to handle wrapping the different LLMs. Armed with some experience, I want to start looking at real frameworks. Google released an Agent Development Kit at Google Next. It supports many of the features I have been playing with, including tool calling, planning, MCP, and local models. It also supports some advanced concepts like agent orchestration.

TLDR: Here’s the code repo for my initial ADK exploration.

Features

Google’s ADK has many features that are becoming common in agent frameworks:

  • Agent with support for multiple LLMs, including local using LiteLLM
  • Tools that support custom methods, a set of built-ins, 3rd party wrappers (including MCP), and even an easy AgentTool
  • Session and Memory capabilities with in-memory and persistent examples
  • Orchestration with different types of Agents to handle parallel, loop, and sequential workflows
  • Built-in system of Evals for testing your Agents

ADK has a command line interface, which is somewhat unique among agent frameworks. It reminds me of the Flask command line system:

adk run <agent>
adk web
adk api_server

ADK also supports a mechanism to deploy your agents to Google infrastructure (Agent Engine or Cloud Run). I’m hoping this could be used to deploy to other infrastructure too.

Converting ToolAgent

I wanted to see how easily I could convert my ToolAgent to use ADK. The core Agent was fairly small and simple:

# Imports for the pieces used below (from the ADK version I used; paths may differ in newer releases)
from google.adk.agents import Agent
from google.adk.planners import PlanReActPlanner
from google.adk.tools import FunctionTool
from google.genai import types

root_agent = Agent(
    name="multi_tool_agent",
    model="gemini-2.0-flash",
    instruction=SYSTEM_PROMPT,
    generate_content_config=types.GenerateContentConfig(
        temperature=0 # More deterministic output
    ),
    planner=PlanReActPlanner(),
    tools=[
        FunctionTool(func=get_weather),
        FunctionTool(func=get_datetime),
        FunctionTool(func=get_current_location),
        FunctionTool(func=get_zipcode),
        FunctionTool(func=calculate),
        FunctionTool(func=get_appointment_specialties),
        FunctionTool(func=get_available_appointments),
        FunctionTool(func=get_appointment_details),
        FunctionTool(func=book_appointment),
        FunctionTool(func=cancel_appointment),
    ],
)

This code uses a much simpler system prompt because much of the “PLEASE USE TOOLS” prompt is handled by the tool calling system. I use a ReAct planner to get the “thinking” aspects I had in ToolAgent. The tools themselves are still simple, mocked Python methods. ADK, like other frameworks, uses Python Typing and Docstrings to extract the specification details for the tool methods, so I changed my code (for the better):

def get_available_appointments(specialty: str) -> list[dict]:
    """
    Gets the available appointments for the given specialty.
    Args:
      specialty (str): the specialty name
    Returns:
      list: a list of available appointments
    Example:
        >>> get_available_appointments('dentist')
        [{'id': '1', 'date': '2022-01-01', 'time': '10:00 AM'}, ...]
    """
    <IMPL REMOVED>

Here’s a simple exchange that required a few tools:

user: what dentist appointments are available tomorrow?

[multi_tool_agent]: /*PLANNING*/
1. Get the current date using the `get_datetime` tool.
2. Extract the date from the response and calculate tomorrow's date.
3. Get the available dentist appointments using the `get_available_appointments` tool with the specialty set to "dentist".
4. Filter the appointments to only include those that are on tomorrow's date.
5. Respond to the user with the available appointments for tomorrow.

/*ACTION*/

[multi_tool_agent]: /*REASONING*/
1. The current date is 2025-04-13.
2. Tomorrow's date is 2025-04-14.
3. Now I need to find the available dentist appointments.

/*ACTION*/

[multi_tool_agent]: /*REASONING*/
I have the available dentist appointments. I need to filter the appointments to only include those that are on tomorrow's date, which is 2025-04-14.
The available appointments for tomorrow are:
- id: 3, time: 11:00 AM
- id: 4, time: 3:00 PM

/*FINAL_ANSWER*/
The available dentist appointments for tomorrow, 2025-04-14, are:
- id: 3, time: 11:00 AM
- id: 4, time: 3:00 PM

Overall, I thought it was pretty easy to get a basic agent working. Maybe I’ll try to create a WebAgent too. I’ll keep exploring some of the more advanced concepts to see how those can be used.

Check out the code repo

Exploring LLMs as Agents: Local Models

I finally got around to exploring local models, which is surprisingly simple to set up. I wanted to see how well a local model would perform in ToolAgent compared to the remote models (mostly Gemini) I have been using.

For more context on my explorations, check out the previous posts: Minimalist Approach, Taking Action, Planning via Prompting, Tools & Benchmarking, WebAgent Gets a Refactor, and Model Context Protocol (MCP). Take a look at the repository to see the code.

Local Models

Local large language models (LLMs) are models that can be run directly on a personal device, such as a laptop, desktop computer, or smartphone, rather than relying on cloud-based servers. They offer features like privacy, security, compliance, offline availability, and lower operating costs.

One of the easiest ways to get started with local models is with Ollama, a system to download, manage, and run models on your laptop. Another frequently used resource is Hugging Face, which I’ve heard called the “GitHub of Models”.

My ToolAgent is based on Simon Willison’s llm project, which has plugins for several different local model systems, including Ollama (llm-ollama) and Hugging Face models via MLX (llm-mlx). I was able to use both approaches to test out some local models.

Which Models

There are a mind-boggling number of local models and variations to choose from. Hugging Face has a leaderboard. Based on what I saw other people talking about, I decided to start with a few:

  • Ollama mistral-small3.1:24B (4bit) - 15 GB
  • Ollama gemma3:27B (4bit) - 17 GB
  • mlx-community/Qwen2.5-0.5B-Instruct-4bit - 278 MB
  • mlx-community/Mistral-7B-Instruct-v0.3-4bit - 4.08 GB
  • mlx-community/Mistral-Small-24B-Instruct-2501-4bit - 13.26 GB
  • mlx-community/Llama-3.3-70B-Instruct-4bit - 40 GB

You’ll notice a few specifications, like 24B and 4bit. These relate to some performance tradeoffs you need to consider when running LLMs on your laptop.

Model size is specified by the number of parameters the model contains: 7B is 7 billion and 27B is 27 billion. Generally, the more parameters a model has, the better it is at reasoning and following instructions.

Quantization (4-bit, 8-bit, 16-bit) is about how precisely the model stores its weights. Lower-precision quantization means smaller file sizes and less RAM required to run the model. The downside is reduced accuracy and reasoning ability.
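
A rough back-of-the-envelope way to see how those two numbers combine (my own approximation, ignoring runtime overhead like the KV cache) is parameters × bits per weight:

# Approximate download/RAM size in GB: parameters (in billions) * bits-per-weight / 8 bits-per-byte
def approx_size_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8

print(approx_size_gb(24, 4))  # ~12 GB, in the ballpark of the 13-15 GB downloads listed above
print(approx_size_gb(70, 4))  # ~35 GB, close to the 40 GB Llama-3.3-70B listing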

Results

Once I had a few local models installed and ready to go, it was time to try them out with ToolAgent! Because ToolAgent is based on the llm library, I just had to pass the model name using the rules for the plugin (llm-ollama or llm-mlx) I wanted to use.
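
For example, the same ToolAgent constructor takes a different model id string depending on the plugin (a sketch; the exact id formats come from the llm-ollama and llm-mlx plugins and may differ slightly from what I show here):

# Remote Gemini model (what I had been using)
agent = ToolAgent(model_name='gemini-2.0-flash')

# Local model served by Ollama via the llm-ollama plugin
agent = ToolAgent(model_name='mistral-small3.1:24b')

# Local model run on Apple silicon via the llm-mlx plugin
agent = ToolAgent(model_name='mlx-community/Mistral-Small-24B-Instruct-2501-4bit')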

Speed: Local models are slower than cloud-based models. Ollama and MLX use the GPU on my MacBook Pro M1 with 64 GB of RAM, but it’s still sluggish. MLX is built for Apple silicon, so I expect the models were running as well as they could on a single machine.

Reasoning: The smaller models (Qwen2.5-0.5B and Mistral-7B) were not able to function as an agent. They couldn’t follow the instructions in the system prompt and they couldn’t call any tool functions. The larger models (Mistral-Small-24B and Llama-3.3-70B) were able to function as agents. They didn’t work as smoothly as the cloud-based models, but they were very functional.

An exchange with Mistral-Small-24B:

You: what is tomorrow's date?

Thought: "I need to determine the current date and time to calculate tomorrow's date.",
"type": "call_tool", "tool": "get_datetime", "param": null

Thought: "I have the current date, which is 2025-04-13. To find tomorrow's date, I need to add one day to the current date.",
"type": "call_tool", "tool": "calculate", "param": "2025-04-13 + 1 day"

Thought: "I cannot calculate dates directly. I need to determine the date by using the current date and adding one day manually.",
"type": "output", "value": "... output text ..."

Agent: Tomorrow's date is April 14, 2025.

You: what dentist appointments are available tomorrow?
Thought: "I need to determine the available dentist appointments for tomorrow. First, I need to check the available specialties to ensure 'dentist' is one of them.",
"type": "call_tool", "tool": "get_appointment_specialties", "param": null

Thought: "I have confirmed that 'dentist' is an available specialty. Next, I need to find the available dentist appointments for tomorrow.",
"type": "call_tool", "tool": "get_available_appointments", "param": "dentist"

Thought: "I have the list of available dentist appointments. I need to filter the appointments for tomorrow's date, which is April 14, 2025.",
"type": "output", "value": "... output text ..."

Agent: Here are the available dentist appointments for tomorrow, April 14, 2025:
Appointment ID: 3, Time: 11:00 AM
Appointment ID: 4, Time: 3:00 PM

It made a bad assumption when calculating the next day, but recovered. Otherwise, it worked well.

What’s Next

I have started to look at real agent frameworks, like Pydantic-AI. I think I have learned enough of the basics to be dangerous and it’s time to move beyond my own toy framework. Google released an Agent Development Kit at Google Next. It supports many of the features I have been playing with, including tool calling, planning, MCP, and local models. It also supports some advanced concepts like agent orchestration.

Let’s start exploring Google’s ADK!

Exploring LLMs as Agents: Model Context Protocol

I will likely run out of topics to cover in the “LLMs as Agents” series, eventually. I wanted to explore creating a chat UI and seeing what I could do with embeddings, but I didn’t get to that. For more context, check out the previous posts: Minimalist Approach, Taking Action, Planning via Prompting, Tools & Benchmarking, and WebAgent Gets a Refactor. Take a look at the repository to see the code.

I’m back to cover the new hotness in the agent space: Model Context Protocol (MCP) from Anthropic.

MCP is having its jQuery moment, and that’s not an insult. Even OpenAI is adopting MCP. I’ve learned about hooking tools up to LLM-based agents, and it was always clear that competing standards for how to connect LLM-based agents to tools would create fragmentation and hurt growth. MCP is an open protocol that standardizes how agents connect to tools and data. The MCP system contains:

  • MCP Hosts: Programs like Claude Desktop, IDEs, or other tools that contain a client
  • MCP Clients: Protocol clients that maintain 1:1 connections with servers
  • MCP Servers: Lightweight programs that each expose specific capabilities through the standardized Model Context Protocol

There is a ton of information available for you to learn about MCP. This post is not teaching you about MCP.

This post is about MCP-enabling the ToolProvider and ToolAgent framework I created! That’s right, I’m exposing ToolProviders as MCP Servers and ToolAgent as an MCP Client — and I’m doing it with zero dependencies on any MCP framework. I like to see how things work under the covers. Having spent time building Browsers, I also believe that any open protocol needs to have multiple implementations, even crappy ones like mine.

Getting Started

Before I started building an MCP Client and Server from scratch, I made a few decisions:

  • I was going to use Copilot to help do the work. It just seemed silly to try this otherwise.
  • I copy/pasted many sections of the MCP Specification into a single, long Markdown document. I had to give Copilot the right context. I also used the MCP Schema.
  • I was going to implement the stdio transport. HTTP SSE and WebSockets (I think) are also available, but a local system was good enough to learn how this works.
  • I was only going to support “Tools” in my Client and Server for now. You can also support “Resources” and “Prompts”.

ToolProvider MCP Server

Copilot handled all the heavy lifting here. I asked it to create an MCP Server, using the MCP specification, implemented on top of ToolProviders. ToolProviders already supported everything an MCP Server needed to support tools, so Copilot just had to map from my basic JSON schema to the JSON schema used by MCP.

Copilot created a basic stdio-based request/response system and added the JSON-RPC message passing. It was able to glean some of the messages and the message flows from the MCP specification.
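
To give a feel for the shape of the result, here’s a heavily simplified sketch of the stdio loop (illustrative only, not the generated code; it skips the initialize handshake and error handling, and call_provider_tool is a stand-in for dispatching into a ToolProvider):

import json
import sys

def call_provider_tool(name, arguments):
    # Stand-in for dispatching into the registered ToolProviders
    return {"temperature": "75 F", "conditions": "Sunny"}

def handle(request):
    method = request.get("method")
    if method == "tools/list":
        # Map each ToolProvider tool spec into MCP's tool shape: name, description, inputSchema
        return {"tools": [{
            "name": "get_weather",
            "description": "Gets the weather for a zipcode",
            "inputSchema": {"type": "object", "properties": {"zipcode": {"type": "string"}}},
        }]}
    if method == "tools/call":
        params = request.get("params", {})
        result = call_provider_tool(params["name"], params.get("arguments", {}))
        return {"content": [{"type": "text", "text": json.dumps(result)}]}
    return {}  # initialize and other messages elided

# stdio transport: newline-delimited JSON-RPC requests on stdin, responses on stdout
for line in sys.stdin:
    request = json.loads(line)
    response = {"jsonrpc": "2.0", "id": request.get("id"), "result": handle(request)}
    print(json.dumps(response), flush=True)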

I installed Claude Desktop to test it out. It didn’t work on the first try. We missed some nuance of the JSON-RPC message passing, and we had to update some message types based on the newer schema. It did not take long before we had my ToolProvider tools running in Claude as an MCP Server.

ToolAgent MCP Client

Again, Copilot did most of the work. We already had a basic stdio-based system working, and we had several other examples of ToolAgent being used in different ways: terminal chat, web-based chat, and some tests — so it didn’t take long to have a basic system with MCPToolProvider (since ToolAgent connects to ToolProviders, we have to map MCP Servers to ToolProviders) and a ToolAgentMCPClient that wraps ToolAgent itself.

I also installed a 3rd party MCP Server (filesystem) so I could test a Server that wasn’t something I wrote.

We had to tweak the Client stdio request/response mechanism to get things working reliably. Timing, buffering, and flushing are important! Because the Client is spawning Servers as processes, I had to add some code to make sure the Client was cleaning up the spawned Servers on shutdown or in failure situations.
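
The process cleanup ended up looking roughly like this (a simplified sketch of the idea, not the exact code from the repo):

import atexit
import subprocess

class MCPServerProcess:
    """Spawn an MCP Server over stdio and make sure it gets cleaned up."""

    def __init__(self, command):
        self.process = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered so each JSON-RPC message flushes promptly
        )
        atexit.register(self.close)  # also called explicitly on shutdown and in error paths

    def close(self):
        if self.process.poll() is None:  # still running
            self.process.terminate()
            try:
                self.process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                self.process.kill()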

I was able to get both ToolProvider MCP Server and filesystem MCP Server working in my ToolAgent-based MCP Client.

Connecting to MCP server: filesystem
Connected to secure-filesystem-server v0.2.0
Available tools: 11
Connecting to MCP server: mock
Connected to ToolProvider MCP Server v0.1.0
Available tools: 19

Chat with the agent (type 'exit' to quit, 'tools' to list tools):
> What files do I have in my Desktop folder?

Thought: "I need to find out which directories I have access to before I can list the files in the Desktop folder.",
"type": "call_tool", "tool": "list_allowed_directories"

Thought: "I can now list the files in the Desktop folder since I know I have access to it.",
"type": "call_tool", "tool": "list_directory", "param": "/Users/mfinkle-mac/Desktop"

Thought: "I made a mistake in the previous JSON format. The parameter for list_directory should be an object.",
"type": "call_tool", "tool": "list_directory", "param": { "path": "/Users/mfinkle-mac/Desktop"}

Thought: "I have successfully listed the files and directories in the Desktop folder.",

Agent: Here are the files and directories in your Desktop folder:
.DS_Store
.localized
Reactions (directory)
Screen Recording 2024-06-24 at 5.12.08 PM.mov
Screen Recording 2024-07-19 at 10.30.18 AM.mov
Screen Recording 2024-11-04 at 12.53.58 PM.mov
Screen Recording 2025-03-17 at 11.56.51 AM.mov
Stuff (directory)
live-chat-sample.mov

You can see ToolAgent processing the requests, but it’s using the filesystem MCP Server to perform the actions.

The code for both the ToolProvider MCP Server and the ToolAgent MCP Client is in the LLMs Agents repo.

Learnings

MCP is a very handy mechanism to connect LLM-based agents to different types of tools and resources. We’ve seen “broker” or “middleware” systems like this in the industry before, and there is usually a lot of benefit for the ecosystem while a dominant player exists. I remember using COM and ODBC.

There are a lot of MCP Servers being created every day (curated list). Many are of questionable value, but that’s the beauty of the early-adopter phase. We’ll start to see some examples that push the boundaries, and that will be interesting.

There are fewer MCP Clients, but that will certainly change. I wonder how people (or IT/Security) will feel about all of this data being moved around. It makes me want to start looking into local models (another topic I failed to start).

What’s Next

I want to refactor the ToolProvider approach to defining tools. The MCP approach is cleaner and would make the mapping code (from ToolProvider to MCP Server) simpler too. The Playwright team released a Playwright MCP Server, which, when used in my ToolAgent MCP Client, could mean WebAgent doesn’t need to exist anymore.

With the growing number of easy-to-connect MCP Servers, I want to start investigating ways to orchestrate all those tools into a meaningful and useful Agent.

Also:

  • Local Models
  • Chat UI (and non-Chat UI) experiences
  • Embedding and Vector search

Exploring LLMs as Agents: WebAgent Gets a Refactor

My fifth post on LLMs as Agents. I guess it’s a series now.

After refactoring ToolAgent, I decided to do the same with WebAgent, building it on the ToolAgent system. For more context and a refresher on WebAgent, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts, before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
  • Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.
  • Tools & Benchmarking: I refactored the ToolAgent code into a reusable base class, added some additional mock tools, and even some basic benchmarking.

Take a look at the repository to see the code.

WebAgent

After refactoring ToolAgent, I decided to see if I could build WebAgent on top of ToolAgent. I think it turned out pretty nicely. Again, Copilot helped too.

WebAgent now derives from ToolAgent, and PageManager (the wrapper around Playwright for browser automation) became a ToolProvider. All of the wrappers for Playwright automation became tools the agent could use. This worked better than I thought.
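
The overall shape is roughly this (a structural sketch, not the actual class definitions; the import path is a guess at how the repo is organized):

from tool_agent import ToolAgent, ToolProvider  # module name is a guess

class PageManager(ToolProvider):
    """Wraps Playwright; each browser action becomes a tool the agent can call."""

    def navigate(self, url: str):
        ...  # page.goto(url)

    def click(self, selector: str):
        ...  # page.click(selector)

    def get_page_content(self):
        ...  # filtered snapshot of the interesting elements (more on this below)

class WebAgent(ToolAgent):
    """ToolAgent plus a registered PageManager, so the planning loop stays shared."""

    def __init__(self, model_name: str):
        super().__init__(model_name=model_name)
        self.register_provider(PageManager())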

Starting task: Navigate to wikipedia.org and find the featured article of the day

Thought: "Okay, I will start by navigating to wikipedia.org. Then I will analyze the page content to find the featured article of the day. Finally, I will extract the text of the featured article and output it."
Tool: "navigate", Param: "https://www.wikipedia.org"

Thought: "Now that I have navigated to the Wikipedia homepage, I need to identify the element containing the 'Featured article'. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"

Thought: "The HTML content I retrieved is for the main wikipedia.org page, which is just a language selection portal. I need to navigate to the English Wikipedia to find the featured article. I will click the English link.",
Tool: "click", Param: "#js-link-box-en"

Thought": "Now that I have clicked the link, I need to wait for the page to load.",
Tool: "wait_for_navigation"

Thought: "Now that the English Wikipedia page has loaded, I need to find the 'Featured article' section. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"

Thought": "I have analyzed the page content and found that the 'Featured article' is located within a div with the id 'mp-tfa'. I will now use the 'get_text' tool to extract the text from this element.",
Tool: "get_text", Param: "#mp-tfa"

Thought: "I have successfully extracted the text of the featured article. Now I will output the result."

===== TASK SUMMARY =====
Task: Navigate to wikipedia.org and find the featured article of the day
Status: incomplete
Duration: 12.39 seconds
Input tokens: 68486
Output tokens: 827

The token usage was noticeably higher than in my ToolAgent experiments. This was almost entirely due to get_page_content, which was pushing the entire HTML content of the page into the LLM.

I had to fix that.

Learnings: Managing Context

As I said, my initial implementation of get_page_content just returned the entire HTML content back into the LLM. This wasn’t good. Because the LLM was using the page content to decide what HTML elements it needed to interact with, I could take some steps to limit which HTML elements I returned.

Step 1: Regex

I apologize up front that I even tried this, but I just wanted to see if I could get some quick wins. I asked Copilot to create an implementation of get_page_content that filtered out any elements that were not: form elements, elements with IDs, elements with roles or aria-labels, or elements with a small set of special class names (like .btn).

The result was a much smaller content context, but there were issues. Of course there were issues! I lived through the “Use regex on HTML” era and it’s not good.

Step 2: Playwright

I was already using Playwright, so why not use it in a way similar to how we’d use BeautifulSoup? One well-crafted prompt later, Copilot had a very functional version.

Copilot: You’re absolutely right! Using Playwright directly to select elements gives us much better control and maintains the structure and order of the original page.

Even Copilot was relieved we moved away from regex.

The Playwright-based version had some other benefits:

  • Elements are extracted in the same order they appear in the DOM, preserving visual hierarchy
  • Filtering allowed me to include only important attributes like id, class, role, etc.
  • Executing some code in the DOM allowed me to include only visible elements

These helped the LLM make better decisions about the HTML elements it needed to interact with during the task. It also dramatically reduced the amount of input tokens sent to the LLM.
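
A simplified sketch of the approach (the real version filters more element types and attributes than shown here):

# Use Playwright itself to pick out only the interesting, visible elements
INTERESTING = "a, button, input, select, textarea, form, [id], [role], [aria-label], .btn"

def get_page_content(page):
    summary = []
    for element in page.query_selector_all(INTERESTING):  # returned in DOM order
        if not element.is_visible():
            continue  # only include visible elements
        # Keep only the attributes the LLM needs to build a CSS selector
        attrs = {name: element.get_attribute(name)
                 for name in ("id", "class", "role", "aria-label")
                 if element.get_attribute(name)}
        tag = element.evaluate("el => el.tagName.toLowerCase()")
        text = (element.inner_text() or "")[:80]  # trim long text blocks
        summary.append({"tag": tag, "attrs": attrs, "text": text})
    return summary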

Step 3: Greedy LLM

The last thing I noticed and corrected was the LLM’s desire to ask for the page content whenever it got confused with CSS selectors. And yes, it gets confused. The LLM would fetch the page content even when the page had not changed. Token usage went up, and for the most part, the LLM didn’t zero in on the right CSS selector. It likely just started guessing.

I added an is_maybe_dirty check in PageManager, which was set to false after any call to get_page_content. I set is_maybe_dirty to true after navigate, click, and type_text.

If is_maybe_dirty=false in get_page_content, I return an error result of “Use previously extracted page content”. The LLM correctly interpreted the result and knew it should stick with the current HTML content it already had.
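
In code, the idea is roughly this (extending the PageManager sketch from earlier; names are approximate):

class PageManager(ToolProvider):
    def __init__(self):
        super().__init__()
        self.is_maybe_dirty = True  # force a real fetch on first use

    def navigate(self, url: str):
        self.page.goto(url)
        self.is_maybe_dirty = True  # page content (probably) changed

    def click(self, selector: str):
        self.page.click(selector)
        self.is_maybe_dirty = True

    def type_text(self, selector: str, text: str):
        self.page.fill(selector, text)
        self.is_maybe_dirty = True

    def get_page_content(self):
        if not self.is_maybe_dirty:
            return {"error": "Use previously extracted page content"}
        self.is_maybe_dirty = False
        return self.extract_interesting_elements()  # the filtered extraction shown earlier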

Here’s an example run: “Use ‘https://duckduckgo.com’, search for ‘python tutorial’, and return the title of the first result found” (yes, it’s too fast to read so use the scrubber)

What’s Next

If I keep pushing on ToolAgent and WebAgent, I run the risk of starting to build simple frameworks. That might not be the worst outcome, but it wasn’t my intention when starting these projects.

Instead, I want to explore some embedding and vector search projects. I’ve been using Simon Willison’s llm library, and it already includes some support for creating embeddings. Both SQLite and DuckDB have support for vector search. Sounds like I’ll have a lot to keep me busy.

I also want to begin using some local models. Again, llm already supports this, so I’m unblocked.

Exploring LLMs as Agents: Tools & Benchmarking

I spent some time refactoring the Tool Agent code, added some additional mock tools, and even added some basic benchmarking. For more context, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts, before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
  • Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.

Take a look at the repository to see the code.

Code Refactor

The biggest change to the code is the somewhat large refactor to extract the agent-specific code into a ToolAgent class and the tool-specific code into a set of ToolProvider classes. ToolProvider classes are registered with the ToolAgent:

# Register tool providers
agent = ToolAgent(model_name='gemini-2.0-flash')
agent.register_provider(UtilityToolProvider())
agent.register_provider(AppointmentToolProvider())

ToolAgent can then be used in different ways. tool_agent_test.py is a way to run the agent in a terminal window. I’ve been playing with using the agent in a chat system too, but more on that next time.

python tool_agent_test.py

Tools

I wanted to keep adding more tools, which contributed to the code refactor. I couldn’t keep adding more code into the one Python file. With the new tools, I explored two new areas (for me):

  1. LLM-based Tools: A tool that was implemented using a secondary LLM. For a given set of topics, I wanted to “guess” which topics might be of interest to a user based on their input. LLMs are good at that stuff, so I made a simple tool call that used a separate system prompt. (I just realized that I didn’t add token tracking to that tool!)
  2. Multi-Parameter Tool Calls: Up until now, my tool calls took zero or one parameters. I wanted to start using multiple parameters, so I introduced an object parameter type with a simple schema. In the future, I want the agent to provide feedback when required parameters are not provided.
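
As an illustration of the multi-parameter case (a sketch of the idea; the actual schema format in the repo may differ), a tool description with an object parameter might look like this:

# Describing a multi-parameter tool with a simple object schema
book_appointment_tool = {
    "name": "book_appointment",
    "description": "Books an appointment using an appointment ID, or a specialty plus date and time.",
    "param": {
        "type": "object",
        "properties": {
            "appointment_id": {"type": "string"},
            "specialty": {"type": "string"},
            "date": {"type": "string"},
            "time": {"type": "string"},
        },
    },
}

# ...and the agent is expected to emit a matching action, e.g.:
# {"type": "call_tool", "tool": "book_appointment", "param": {"appointment_id": "12"}}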

Benchmarking

I mentioned in the last post that I needed to start doing some benchmarking on the agent. The code refactor made it somewhat easy for Copilot (yes, I do some vibe coding too) to whip up a very basic benchmarking harness. Some details (more in the readme):

  • Test Cases: Run multiple predefined test cases
  • Success Metrics: Track success rates, execution times, and token usage
  • Reporting: Get detailed reports on test results
  • CSV/JSON Export: Export results for further analysis
  • Token Usage: Track and analyze token consumption

I added simple token tracking because I was running the benchmarking suite quite a bit and wanted to know what kind of money I was spending. Turns out, it was less than a penny a run, but it’s good to know.

Learnings

I was running the benchmark suite trying to figure out why tests kept failing. The benchmarking framework measures success based on two metrics: (1) Did the agent use the minimum set of required tools? (2) Did the agent output a few key words in the final response? Yes, those aren’t the best success checks, but it’s what I have for now.

More Jedi Prompt Tricks

The agent was usually failing at using the required tools. Looking at the debug [thought] output from the agent, I noticed that it would assume knowledge instead of using the tools. I found a blog post from Pinecone that talks about this when using LangChain too. They suggested adding explicit items to the system prompt. I already had something along the lines of “ALWAYS use the tools. NEVER assume information” but that wasn’t enough. I added the following:

IMPORTANT: Do not assume or guess or hallucinate information. Call the tools to get ALL needed information.
    - You DO NOT know the current date or time. Use tools to determine the current date and time.
    - You CANNOT do math. Use tools to calculate math.

Temperature

As for the randomness of the output, I saw this note in the same Pinecone blog post:

Here we initialize the LLM with a temperature of 0. A low temperature is useful when using tools as it decreases the amount of “randomness” or “creativity” in the generated text of the LLMs, which is ideal for encouraging it to follow strict instructions — as required for tool usage.

Honestly, I had not played with any LLM options yet, but I added the temperature=0 setting and it did help.
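
With the llm library, applying it was a one-liner (assuming the model’s plugin exposes a temperature option):

# Pass temperature as a model option when prompting (option support varies by plugin)
response = conversation.prompt(prompt_text, temperature=0)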

More JSON Output Handling

I already had some code to handle the times when the LLM returned JSON using Markdown code fences, but there were times when the LLM would apologize for a mistake and append the JSON. The results were correct, but my code failed to extract it! So I added more defensive measures to safely extract JSON from LLM responses. Yes, I should update to use LLM built-in support for JSON schemas at this point.
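
The extraction helper ended up as a set of layered fallbacks, roughly like this (a simplified sketch):

import json
import re

def extract_json(text: str):
    """Pull a JSON object out of an LLM response, tolerating code fences and extra prose."""
    # 1. Prefer a fenced ```json block if one exists
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # 2. Try the text as-is
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 3. Fall back to the outermost {...} span (handles "Sorry about that, here is the JSON: {...}")
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])
    raise ValueError("No JSON object found in response")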

With these changes, I regularly had 100% success in my benchmarking runs! This means I need to make the tests harder.

You: I want to get a haircut tomorrow morning

Thought: I need to find out the current date to determine what tomorrow is. I will use the get_datetime tool to get the current date and time.
Thought: The current date is 2025-03-15. I need to check for available appointments for a haircut. Since there is no haircut specialty, I should first check the available specialties.
Thought: Okay, 'hair' is a valid specialty. I will now check for available 'hair' appointments.
Thought: I have a list of available hair appointments. The user wants an appointment tomorrow morning, which is March 16th. There are two appointments available on March 16th: 11:00 AM and 2:00 PM. The user wants a morning appointment, so the 11:00 AM appointment is the best fit. I will ask the user if they want to book appointment ID 11.

Agent: I see two appointments available for hair on March 16th: 11:00 AM and 2:00 PM. Would you like to book appointment ID 11 for 11:00 AM?

What’s Next

As I started to explore some non-chat user experiences, I discovered that I will need the agent to provide more structured output. Right now the output is very conversational, which works well for chat, but won’t be easy to use with other types of UX. I found some discussion on taming the output of an LLM in this blog post from Tryolabs.

I’ve also started using ToolAgent in my own chat server. I’m sure this branch of experimentation will continue as well.

Exploring LLMs as Agents: Planning via Prompting

My exploration of how to use LLMs as Agents continues. In this post, I’m looking at ways to improve the outcomes by using better approaches to planning. For more context, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts, before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.

Take a look at the repository to see the code.

Planning & Reasoning

I have seen a lot of people talking about “planning” and “reasoning” as crucial parts of an LLM agent system, so I started to read and learn more about different strategies. There is obviously a ton written about LLMs and Agents all over the internet, but I stumbled upon a useful web site that collects a lot of the information and has references to where the ideas originated.

Prompt Engineering Guide, not surprisingly, focuses on many different prompting strategies. It turns out, most of the different planning and reasoning strategies are based on different prompting strategies. Here are some of the strategies I progressed through:

  • Zero-shot: I think of this strategy as “hoping the LLM was trained well enough to just know the answer”. Hope is not a strategy, but you’d be surprised how many people believe the LLM can just figure everything out on its own.
  • Few-shot: This strategy enables in-context learning and depends on adding some examples to your prompt to help guide the model to better outcomes. My initial tool_agent used this approach before I even knew what it was called.
  • ReAct (Reasoning + Acting): This strategy combines reasoning and acting with LLMs into the prompt. ReAct prompts LLMs to generate reasoning and actions for a task. Asking the LLM to “think about the step” is enough to create significant improvements to the outcomes.

Funny thing: I thought I wanted to try using something called Chain of Thought (CoT), which is another prompt strategy to get the LLM to think about the steps. CoT ended up being a pain to implement in a simple way, and I found myself wanting to merge the “thought” (reasoning) and “action” parts together into the same response — which turns out to basically be ReAct.

Example Flows

I honestly don’t know if the ReAct strategy is yielding better outcomes than the Few-shot approach I was using. I guess I need to set up benchmarks next? The flows are promising though, and the ReAct approach didn’t add much to the size of the code.

You: What's the weather in Beverly Hills?

Thought: The user wants to know the weather in Beverly Hills. First, I need to get the zipcode for Beverly Hills, then I can use that to get the weather.
Thought: Now that I have the zipcode for Beverly Hills, I can get the weather.
Thought: I have the weather information. Now I need to present it to the user.

Agent: The weather in Beverly Hills is 75 F and Sunny.

Here’s a longer one that uses the appointment scheduling tools:

You: I want to book a haircut

Thought: The user wants to book a haircut. I need to find the appropriate specialty to book an appointment for. I should first check the available specialties.
Thought: The available specialties are hair, dentist, and vision. Since the user wants a haircut, the relevant specialty is 'hair'. Now I need to find available appointments for 'hair'.
Thought: I have found the available appointments for a haircut. Now I need to present them to the user and ask them to choose one.

Agent: Here are the available appointments for a haircut:

* ID: 10, Date: 2025-03-09, Time: 10:30 AM
* ID: 11, Date: 2025-03-10, Time: 2:00 PM
* ID: 12, Date: 2025-03-16, Time: 11:00 AM
* ID: 13, Date: 2025-03-16, Time: 3:00 PM

Please tell me the ID of the appointment you would like to book.

You: 12

Thought: The user has selected appointment ID 12. Now I need to book this appointment.
Thought: The appointment has been successfully booked. I need to inform the user.

Agent: Great! Your appointment has been booked successfully for March 16, 2025, at 11:00 AM. The appointment ID is 12.

Learnings

Reflection and Correction

There are other ways we can improve the outcomes of the LLM agent beyond planning and reasoning. Reflection and correction is a way to help the model learn from its mistakes. I already had some basic checks in place to catch invalid JSON and unknown tools, but I made some additions to the code to help:

  • Validation of the JSON actions for expected fields
  • Validation loops to tell the LLM about any validation failures and allow the LLM to try again.
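
The retry loop is small; it looks roughly like this (a sketch, not the exact ToolAgent code; helper names are illustrative):

KNOWN_TOOLS = {"get_weather", "get_zipcode", "get_datetime", "calculate"}  # built from the registered tools

def validate_action(action: dict):
    """Return an error message if the parsed action is malformed, else None."""
    if "type" not in action:
        return "Missing required field: type"
    if action["type"] == "call_tool" and action.get("tool") not in KNOWN_TOOLS:
        return f"Unknown tool: {action.get('tool')}"
    return None

# Validation loop: feed failures back to the LLM and let it try again a few times
for attempt in range(3):
    action = parse_action_json(response.text())  # however the JSON gets extracted
    error = validate_action(action)
    if error is None:
        break
    response = conversation.prompt(f"Your last action was invalid: {error}. Try again.")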

Prompt & Tool Consistency

I also realized, once again, that I was confusing the LLM with some of the ways I framed the prompt and tools.

  • I referred to both “tools” and “functions” in the initial prompt and subsequent conversational prompts. I decided to only use “tool” and changed all “function” references the LLM saw to “tool”.
  • I made some of the tool names ambiguous and the LLM would try to use tools that were not appropriate. An easy one is get_specialties which returns the list of specialties used for appointment booking. The LLM would sometimes avoid using the tool to figure out what types of appointments could be booked and would try other tools or guess. I changed the tool name to get_appointment_specialties and the LLM always used it correctly.

What’s Next?

I want to add more tools. I want to add tools that use secondary LLMs and I want to add tools that allow me to use the Agent for my own tasks.

I want to explore different UX around agents. Chat is an obvious choice, and I will likely expose tool_agent into a simple Web Chat system to play with it more and try to use it for actual tasks.

As mentioned above, I may look into benchmarks too.

Exploring LLMs as Agents: Taking Action

I’m still exploring how to use LLMs to build agents. In the first post, I described some of the motivation and approach. I started working on two agents:

  • web_agent: A basic agent that completes a web-based task using browser automation actions.
  • tool_agent: A basic agent that completes a task using a set of supplied tools or methods.

I’ve been focused more on tool_agent, trying to expand the types of tools I can provide. I had a suggestion to add some read/write type tools and scenarios, so I decided to try that out.

Taking Action

an agent is characterized by the environment it operates in and the set of actions it can perform. — Chip Huyen, Agents

I wanted to create a more realistic set of actions for the agent to use. Scheduling appointments seemed like a straightforward scenario with some interesting possible flows. I added a few tool methods and a mocked appointment data store.

  • get_specialties(): Retrieves a list of available specialties for scheduling appointments.
  • get_available_appointments(specialty): Retrieves available appointments for a given specialty.
  • book_appointment(booking_data): Books an appointment based on provided booking data. The agent can pass in an appointment ID or a combination of a date and time.
  • get_my_appointments(): Retrieves the list of appointments booked by the user.
  • cancel_appointment(appointment_id): Cancels a booked appointment based on the provided appointment ID.

Check out the tool_agent.py code for more detail on how the tools work and how they are added to the system prompt.
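
To give a flavor of the mechanics (a simplified sketch; see tool_agent.py for the real thing), the tools live in a registry and their descriptions get folded into the system prompt:

# A registry of tool functions plus the descriptions the LLM sees
TOOLS = {
    "get_specialties": {
        "function": get_specialties,  # the Python functions described above
        "description": "Retrieves a list of available specialties for scheduling appointments.",
        "param": None,
    },
    "get_available_appointments": {
        "function": get_available_appointments,
        "description": "Retrieves available appointments for a given specialty.",
        "param": "specialty",
    },
}

def build_tool_prompt():
    lines = [f"- {name}(param: {spec['param']}): {spec['description']}"
             for name, spec in TOOLS.items()]
    return "You can call these tools:\n" + "\n".join(lines)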

Example Flows

I was pleased with how well this rudimentary system worked. Without many tweaks (more on that below), I was able to get some useful exchanges.

You: what kind of appts can I book?
Action raw: {"type": "call_function", "tool": "get_specialties"}
Function result: ["dentist", "vision", "hair"]
Action raw: {"type": "output", "value": "You can book appointments for dentist, vision, and hair."}
Agent: You can book appointments for dentist, vision, and hair.

The LLM did a good job of taking some ambiguity (“appts” and “book”) to figure out that it needed to use get_specialties to answer the question.

You: I want to get a haircut
Action raw: {"type": "call_function", "tool": "get_specialties"}
Function result: ["vision", "dentist", "hair"]
Action raw: {"type": "call_function", "tool": "get_available_appointments", "param": "hair"}
Function result: [... list of appointments ...]
Action raw: {"type": "output", "value": "... output ..."}
Agent: Okay, here are the available hair appointments:

* ID: 10, Date: 2025-03-01, Time: 10:30 AM
* ID: 11, Date: 2025-03-02, Time: 2:00 PM
* ID: 12, Date: 2025-03-08, Time: 11:00 AM
* ID: 13, Date: 2025-03-08, Time: 3:00 PM

In this exchange, the agent was able to figure out that it needed to display “hair” appointments. Interesting: the agent came up with the output display format on its own. When I saw the agent using this format, I added the ability to reference appointments by ID in the tool methods.

You: show me my appointments
Action raw: {"type": "call_function", "tool": "get_my_appointments"}
Function result: [{"id": "3", "date": "2025-03-02", "time": "11:00 AM", "specialty": "dentist"}]
Action raw: {"type": "output", "value": "... output ..."}
Agent: Here are your booked appointments:

* ID: 3, Date: 2025-03-02, Time: 11:00 AM, Specialty: dentist

You: cancel the dentist appointment
Action raw: {"type": "call_function", "tool": "cancel_appointment", "param": "3"}
Function result: {"status": "success", "message": "Appointment canceled successfully."}
Action raw: {"type": "output", "value": "... output ..."}
Agent: Your appointment has been canceled successfully.

I thought I could easily confuse the agent by trying to cancel an appointment by the type (specialty) of the appointment, but the agent used the correct ID for the appointment when cancelling via the tool method.

Learnings

At one point, the agent started returning invalid JSON. I assumed I needed to be more strict in my system prompt, but I discovered an error. Since I was surrounding my Python strings with double quotes, the inner quotes were single quotes, which meant my JSON examples were using single quotes — illegal in JSON. The LLM was relying on those examples, so I was causing it to emit invalid JSON.

As I was looking at the code for similar issues, I noticed I was sending Python-formatted object output back into the conversation. The LLM was handling it well enough, but I decided to change the output to be legal JSON:

-  response = conversation.prompt(f"Function result: {function_result}")
+  function_result_json = json.dumps(function_result)
+  response = conversation.prompt(f"Function result: {function_result_json}")

What’s Next?

Most of Chip Huyen’s post on Agents talks about “planning”, but I have not really added any planning-specific code to tool_agent yet. Right now, I am getting by with whatever amount of planning the LLM can create itself.

I want to learn more about planning, and how to add a little code to help the agent deal with even more complicated scenarios.

Exploring LLMs as Agents: A Minimalist Approach

Large Language Models (LLMs) are powerful tools for generating text, answering questions, and coding. We’ve moved beyond generating content, and LLMs are now being used to take actions as agents — independent entities that can act, use tools, and interact with their environment. You probably already know all of this.

I wanted to explore using LLMs as agents, but I like to get an understanding of the underlying components before using high-level frameworks that hide all of the minutiae and streamline the process of building production-ready systems. Understanding how the different components work and interact is important to my own learning process.

That’s exactly what my LLM Agents project sets out to do. Instead of relying on frameworks that abstract away the details, the project takes a bare-bones approach to learning how LLMs can function as agents. By minimizing dependencies, I can get a clearer understanding of the challenges, possibilities, and mechanics of building LLM-powered agents.

Why Minimal Dependencies Matter (To Me)

Many existing frameworks promise powerful LLM agent capabilities, but they often come at the cost of hiding the underlying complexities. While these frameworks can be useful, starting with minimal dependencies allows us to:

  • Understand the fundamentals: How does an LLM process information to take actions? How does the system prompt impact the effectiveness of the agent?
  • Explore limitations: What challenges arise when an agent tries to perform a multi-step task? How does the shape of the tools (functions or APIs) impact how the agent can process the flow?
  • Control the design: Without being boxed into a framework’s way of doing things, we can experiment freely. We can then use this knowledge to help pick the right type of framework for more advanced and production use cases.

This project keeps things simple, using only a lightweight LLM library, Simon Willison’s great llm Python library (code & docs), and Playwright for handling web automation.

These agents are not production ready, but they are trimmed down enough to see the mechanisms at work.

Meet the Agents

The repository contains two primary agents:

Web Agent: Navigating the Web with LLMs

The Web Agent is designed to interact with websites using the Playwright Python library. Instead of treating a webpage as structured data, this agent lets an LLM interpret raw HTML and decide what actions to take—whether that means clicking a button, typing into a form, or extracting text. I wanted to see how well an agent could navigate something as confusing as a modern website. If you’ve ever done a “view source” or “inspect” on a modern webpage, you know what I mean.

How It Works

  • A PageManager class handles the browser automation using Playwright.
  • The LLM generates the next action based on the current page content and the assigned task.
  • Two modes are available:
    • Non-conversational mode: Every step is processed independently.
    • Conversational mode: The agent maintains memory across multiple interactions, reducing the need to repeat context.
Example Task:
web_agent_conversation("gemini-2.0-flash", "Search for 'LLM agents' and return the first result's title.", "https://duckduckgo.com/")

This runs a search query and extracts the first result’s title, all without predefined scraping rules.

How Did It Go

At one point, the agent was not using valid CSS selector syntax and couldn’t “click” the search button. In spite of not getting to the search results page, the agent returned a “successful” answer. I wondered if the LLM was somehow using its trained knowledge to find a valid answer, but I could not find the result anywhere. I searched DuckDuckGo and Google for the title.

I added the “Explain how you solved the task” prompt and the agent replied that since it was not able to get to the search results, it created a hypothetical answer.

I did two things:

  • I told the agent it was not allowed to make up answers. Just fail gracefully.
  • I gave the agent examples of valid CSS selectors for id, class, and attribute selectors. This really improved the CSS selector accuracy. I had hoped the LLM’s training would have been good enough.
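
The selector examples I added to the prompt were along these lines (not the exact wording):

Valid CSS selector examples:
  #search-button       (select by id)
  .result-title        (select by class)
  input[name="q"]      (select by attribute)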

The conversational mode, unsurprisingly, could finish tasks with fewer steps. Memory and retained context matter.

Tool Agent: Using LLMs to Call Functions

The Tool Agent extends an LLM’s capabilities by allowing it to call external functions. Instead of just answering questions, it can interact with a set of predefined tools—simulating API calls, performing calculations, retrieving weather data, and more.

How It Works:

  • A registry of tool functions provides capabilities like:
    • Web search (search_web)
    • Weather lookup (get_weather)
    • Date and time retrieval (get_datetime)
  • The agent follows a conversational loop:
    1. Receives a user query.
    2. Decides whether a tool is needed.
    3. Calls the tool and processes the response.
    4. Outputs the final answer.
Example Interaction:
You: What's the weather in Beverly Hills?
Function result: {'zipcode': '90210'}
Function result: {'temperature': '75 F', 'conditions': 'Sunny'}
Agent: The weather in Beverly Hills (zipcode 90210) is 75 F and Sunny.

Here, the LLM autonomously determines that it needs to retrieve a zip code first before getting the weather.
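
The loop driving this is tiny; stripped of error handling it looks roughly like this (a sketch, not the exact tool_agent code; the tool registry lookup is illustrative):

import json

def run_tool_agent(conversation, user_query: str):
    """Keep prompting until the LLM emits an 'output' action instead of a function call."""
    response = conversation.prompt(user_query)
    while True:
        action = json.loads(response.text())  # the LLM replies with a JSON action
        if action["type"] == "output":
            return action["value"]  # final answer for the user
        tool = TOOL_FUNCTIONS[action["tool"]]  # illustrative registry of Python functions
        result = tool(action["param"]) if action.get("param") else tool()
        response = conversation.prompt(f"Function result: {json.dumps(result)}")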

How Did It Go

It’s not easy to get an LLM to only and always respond using structured output, such as JSON. Some models do better than others, and there are lots of ways to use the system prompt to help get the results you want. I found that I still need to check for Markdown code fences in the output, and remove those.

Note: I saw Simon Willison just updated llm to support schemas, to make structured output easier.

Getting the agent to use the tools (Python functions) required not only being specific about the JSON format and the parameters, but also showing examples. The examples seemed to help a lot. I found some discussions about using an XML-formatted block to describe the set of tools in the system prompt. Something about LLMs being able to handle XML better than JSON. Maybe that is outdated?

I was pretty happy to see the agent use two successive tools (as shown above) to complete a task. I want to play around more to see how that type of chaining can be improved and expanded.

What’s Next?

This has been a fun project and I think there are a few more things I want to try before moving on to the real frameworks:

  • Expanding the set of tools to include real API integrations.
  • Using separate agents to implement tools.
  • Fine-tuning the prompt engineering for better decision-making.
  • Improving the agent’s ability to recover from errors.

Further Reading

  • Simon Willison’s blog is a great place to learn about LLMs and keep updated
  • browser-use is a full-featured Python framework for creating browser-using research agents
  • PydanticAI is a full-featured Python library that makes it easy to get started building tool-using agents

Work-as-Imagined vs Work-as-Done

With an engineering focus on reducing incidents and improving operational reliability, I frequently come back to the realization that humans are fallible and that we should be learning ways to nudge people toward success rather than failure.

There are whole industries and research machines built around the study of Human Factors, and how to improve safety, reliability, and quality. One topic that struck me as extremely useful to software engineering was the concept of Work-as-Imagined (WAI) versus Work-as-Done (WAD). Anytime you’ve heard “the system failed because someone executed a process differently than it was documented,” you may be looking at a WAI vs WAD issue. This comes up a lot in healthcare, manufacturing, and transportation — where accidents can have horrible consequences.

Full disclosure: There are actually many varieties of human work, but WAI and WAD are good enough to make the point. Steven Shorrock covers the subject so well on his blog: Humanistic Systems 

Work-as-Imagined

When thinking about a process or set of tasks that make up work, we need to imagine the steps and work others must do to accomplish the tasks. We do this for many good reasons, like scheduling, planning, and forecasting. WAI is usually formed by past experiences of actually doing work. While this is a good starting point, it’s likely the situation, assumptions, and variables are not the same.

To a greater or lesser extent, all of these imaginations – or mental models – will be wrong; our imagination of others’ work is a gross simplification, is incomplete, and is also fundamentally incorrect in various ways, depending partly on the differences in work and context between the imaginer and the imagined. — The Varieties of Human Work

Work-as-Done

Work-as-Done is literally the work people do. It happens in the real world, under a variety of different conditions and variables. It’s hard to document WAD because of the unique situation in which the work was done and the specific adjustments and tradeoffs required to complete it.

In any normal day, a work-as-done day, people:

  • Adapt and adjust to situations and change their actions accordingly
  • Deal with unintended consequences and unexpected situations
  • Interpret policies and procedures and apply them to match the conditions
  • Detect and correct when something is about to go wrong and intervene to prevent it from happening

Mind the Gap

Monitoring the gap between the WAI and the WAD of a given task has been highlighted as an important practice for organizations aiming to achieve high reliability. The gap between WAI and WAD can result in “human error” conditions. We frequently hear about incidents and accidents that were caused by “human error” in a variety of situations:

  • Air traffic near misses at airports
  • Train derailments and accidents
  • Critical computer systems taken offline
  • Mistakes made during medical procedures

It’s natural for us to blame the problem on the gap — people didn’t follow the process — and try to improve reliability and reduce errors by focusing on stricter adherence to WAI. Perhaps unsurprisingly, this results in more rules and processes which can certainly slow down overall productivity, and even increase the gap between WAI and WAD.

Safety management must correspond to Work-As-Done and not rely on Work-As-Imagined. — Can We Ever Imagine How Work is Done?

In recent decades, there has been more focus on WAD: examining the reasons why the gap exists, working to align WAI more closely with WAD, and embracing the reality of how the work is done so it can be formalized. Instead of optimizing for the way we imagine work is done, we acknowledge the way work is actually done.

Closing the Gap

In my work, production incidents in software systems are an easy place to find WAI vs WAD at play. Incident management and postmortems have best practices that usually involve blameless reviews of what led to the incident. In many cases, the easiest answer to “how can we stop this incident from happening again?” is better documentation and more process.

Modern incident management focuses more on learning from incidents and less on root-cause analysis. One reason is that incidents rarely happen the exact same way again. Focusing on fixing a specific incident yields less value than learning how the system worked to create the incident in the first place. Learning how your systems work in production is harder, but it has more impact in discovering the weak parts of those systems.

This section could be an entire book, or at least several posts, so I’ll leave it to you to read some of the links.

Desire Paths

The whole WAI vs WAD discussion reminds me of desire paths, which visually show the difference between the planned and actual outcomes.

Desire paths typically emerge as convenient shortcuts where more deliberately constructed paths take a longer or more circuitous route, have gaps, or are non-existent

Tying together desire paths with WAI & WAD, some universities and cities have reportedly waited to see which routes people would take regularly before deciding where to pave additional pathways across their campuses and walking paths.

Information Flows in Organizations

I’ve had cause to look into research and ideas about the ways information flows within organizations. Discussions about transparency, decision making, empowering teams, and trust seem to intersect at organizational communication and information flows.

One of my favorite people to follow in this space is Gene Kim (Phoenix Project, DevOps Handbook, Accelerate, and DORA Reports). He has done a few podcasts that focused on relevant topics and concluded that you can predict whether an organization is a high performer or a low performer, just by looking at the communication paths of an organization, as well as their frequency and intensity. (Episode 16, @54 min)

Some of these ideas might resonate with you. There are generally two forms of information flows:

  • Slow flows where we need detailed granularity and accuracy of information. Leadership usually needs to be involved in these discussions so communication tends to escalate up and down hierarchies.
  • Fast flows where frequency and speed tend to be more important. These flows occur in the operational realm, where work is executed, and happen directly between teams using existing interfaces.

In the ideal case, a majority of the communication happens within and between teams using fast flows. Forcing escalation up and down the hierarchy means getting people involved who probably don’t have a solid grasp of the details. Interactions are slow and likely lead to poor decisions. On the other hand, when teammates talk to each other, or when there are sanctioned ways for teams to work with each other toward a shared goal, integrated problem solving is very fast.

This doesn’t mean all information flows should be fast. There are two phases where slow flows are critical: Upfront planning and Retrospective assessment. Planning and preparation are the activities where we need leaders to be thoughtful about defining the goals and then defining responsibilities and the structures to support them. Later, slow communications come back when we assess and improve our performance and outcomes.

Thinking, Fast and Slow

I want to be clear that fast and slow information flows are different concepts than the fast and slow modes of thinking explored in Daniel Kahneman’s book Thinking, Fast and Slow. The book explores two systems of thinking that drive the way humans make decisions.

  • System 1 (Fast Thinking): This system is intuitive, automatic, and operates quickly with little effort or voluntary control. It’s responsible for quick decisions, habits, and reactions.
  • System 2 (Slow Thinking): This system is deliberate, analytical, and requires effortful mental activity. It’s used for complex computations, learning new information, and solving difficult problems.

Kahneman discusses how these two systems can work together, but sometimes lead to biases and errors in judgment. He talks about how these modes can affect decision-making and offers suggestions on how we can become more aware of these biases to make better decisions.

Obviously another area worth exploring to help understand how organizations can support people to create better outcomes.