Exploring LLMs as Agents: Taking Action

I’m still exploring how to use LLMs to build agents. In the first post, I described some of the motivation and approach. I started working on two agents:

  • web_agent: A basic agent that completes a web-based task using browser automation actions.
  • tool_agent: A basic agent that completes a task using a set of supplied tools or methods.

I’ve been focused more on tool_agent, trying to expand the types of tools I can provide. I had a suggestion to add some read/write type tools and scenarios, so I decided to try that out.

Taking Action

“An agent is characterized by the environment it operates in and the set of actions it can perform.” — Chip Huyen, Agents

I want to create a more realistic set of actions for the agent to use. Scheduling appointments seemed like a straightforward scenario with some interesting possible flows. I added a few tool methods and a mocked appointment data store.

  • get_specialties(): Retrieves a list of available specialties for scheduling appointments.
  • get_available_appointments(specialty): Retrieves available appointments for a given specialty.
  • book_appointment(booking_data): Books an appointment based on provided booking data. The agent can pass in an appointment ID or a combination of a date and time.
  • get_my_appointments(): Retrieves the list of appointments booked by the user.
  • cancel_appointment(appointment_id): Cancels a booked appointment based on the provided appointment ID.

Check out the tool_agent.py code for more detail on how the tools work and how they are added to the system prompt.
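
For illustration, here is a minimal sketch of how the mocked data store and a few of these tools might look. The data shapes and helper names are assumptions, not the actual tool_agent.py code:

# Mocked appointment store (illustrative shapes, not the real code)
APPOINTMENTS = [
    {"id": "10", "specialty": "hair", "date": "2025-03-01", "time": "10:30 AM", "booked": False},
    {"id": "11", "specialty": "hair", "date": "2025-03-02", "time": "2:00 PM", "booked": False},
]

def get_specialties():
    # Unique list of specialties the agent can offer
    return sorted({appt["specialty"] for appt in APPOINTMENTS})

def get_available_appointments(specialty):
    # Only unbooked appointments for the requested specialty
    return [a for a in APPOINTMENTS if a["specialty"] == specialty and not a["booked"]]

def book_appointment(booking_data):
    # Accepts an appointment ID; the real tool also accepts a date/time combination
    for appt in APPOINTMENTS:
        if appt["id"] == booking_data.get("id") and not appt["booked"]:
            appt["booked"] = True
            return {"status": "success", "message": "Appointment booked successfully."}
    return {"status": "error", "message": "Appointment not found."}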

Example Flows

I was pleased with how well this rudimentary system worked. Without many tweaks (more on that below), I was able to get some useful exchanges.

You: what kind of appts can I book?
Action raw: {"type": "call_function", "tool": "get_specialties"}
Function result: ["dentist", "vision", "hair"]
Action raw: {"type": "output", "value": "You can book appointments for dentist, vision, and hair."}
Agent: You can book appointments for dentist, vision, and hair.

The LLM did a good job of resolving the ambiguity (“appts” and “book”) and figuring out that it needed to use get_specialties to answer the question.

You: I want to get a haircut
Action raw: {"type": "call_function", "tool": "get_specialties"}
Function result: ["vision", "dentist", "hair"]
Action raw: {"type": "call_function", "tool": "get_available_appointments", "param": "hair"}
Function result: [... list of appointments ...]
Action raw: {"type": "output", "value": "... output ..."}
Agent: Okay, here are the available hair appointments:

* ID: 10, Date: 2025-03-01, Time: 10:30 AM
* ID: 11, Date: 2025-03-02, Time: 2:00 PM
* ID: 12, Date: 2025-03-08, Time: 11:00 AM
* ID: 13, Date: 2025-03-08, Time: 3:00 PM

In this exchange, the agent was able to figure out that it needed to display “hair” appointments. Interestingly, the agent came up with the output display format on its own. When I saw the agent using this format, I added the ability to reference appointments by ID in the tool methods.

You: show me my appointments
Action raw: {"type": "call_function", "tool": "get_my_appointments"}
Function result: [{"id": "3", "date": "2025-03-02", "time": "11:00 AM", "specialty": "dentist"}]
Action raw: {"type": "output", "value": "... output ..."}
Agent: Here are your booked appointments:

* ID: 3, Date: 2025-03-02, Time: 11:00 AM, Specialty: dentist

You: cancel the dentist appointment
Action raw: {"type": "call_function", "tool": "cancel_appointment", "param": "3"}
Function result: {"status": "success", "message": "Appointment canceled successfully."}
Action raw: {"type": "output", "value": "... output ..."}
Agent: Your appointment has been canceled successfully.

I thought I could easily confuse the agent by trying to cancel an appointment by its type (specialty), but the agent used the correct appointment ID when canceling via the tool method.

Learning

At one point, the agent started returning invalid JSON. I assumed I needed to be stricter in my system prompt, but then I discovered an error. Since I was surrounding my Python strings with double quotes, the inner quotes were single quotes, which meant the JSON examples in my system prompt used single quotes. Single quotes are illegal in JSON, and the LLM was following my examples, so I was the one causing it to emit invalid JSON.
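
Here is a minimal illustration of the trap and the fix; the prompt text is hypothetical, not my actual system prompt:

import json

# The f-string uses double quotes, so the inline JSON example ended up
# with single quotes -- and single quotes are not legal JSON.
bad_prompt = "Respond with an action like {'type': 'output', 'value': '...'}"

# Safer: build the example with json.dumps, which always emits double quotes.
example = json.dumps({"type": "output", "value": "..."})
good_prompt = f"Respond with an action like {example}"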

As I was looking at the code for similar issues, I noticed I was sending Python formatted object output back into the conversation. The LLM was handling it well enough but I decided to change the output to be legal JSON output:

-  response = conversation.prompt(f"Function result: {function_result}")
+  function_result_json = json.dumps(function_result)
+  response = conversation.prompt(f"Function result: {function_result_json}")

What’s Next?

Most of Chip Huyen’s post on Agents talks about “planning”, but I have not really added any planning-specific code to tool_agent yet. Right now, I am getting by with whatever amount of planning the LLM can come up with on its own.

I want to learn more about planning, and how to add a little code to help the agent deal with even more complicated scenarios.

Exploring LLMs as Agents: A Minimalist Approach

Large Language Models (LLMs) are powerful tools for generating text, answering questions, and coding. We’ve moved beyond generating content, and LLMs are now being used to take actions as agents — independent entities that can act, use tools, and interact with their environment. You probably already know all of this.

I wanted to explore using LLMs as agents, but I like to get an understanding of the underlying components before reaching for high-level frameworks that hide all of the minutiae and streamline the process of building production-ready systems. Understanding how the different components work and interact is important to my own learning process.

That’s exactly what my LLM Agents project sets out to do. Instead of relying on frameworks that abstract away the details, the project takes a bare-bones approach to learning how LLMs can function as agents. By minimizing dependencies, I can get a clearer understanding of the challenges, possibilities, and mechanics of building LLM-powered agents.

Why Minimal Dependencies Matter (To Me)

Many existing frameworks promise powerful LLM agent capabilities, but they often come at the cost of hiding the underlying complexities. While these frameworks can be useful, starting with minimal dependencies allows us to:

  • Understand the fundamentals: How does an LLM process information to take actions? How does the system prompt impact the effectiveness of the agent?
  • Explore limitations: What challenges arise when an agent tries to perform a multi-step task? How does the shape of the tools (functions or APIs) impact how the agent can process the flow?
  • Control the design: Without being boxed into a framework’s way of doing things, we can experiment freely. We can then use this knowledge to help pick the right type of framework for more advanced and production use cases.

This project keeps things simple, using only Simon Willison’s great llm Python library (code & docs) for model access and Playwright for handling web automation.

These agents are not production ready, but they are trimmed down enough to see the mechanisms at work.

Meet the Agents

The repository contains two primary agents:

Web Agent: Navigating the Web with LLMs

The Web Agent is designed to interact with websites using the Playwright Python library. Instead of treating a webpage as structured data, this agent lets an LLM interpret raw HTML and decide what actions to take—whether that means clicking a button, typing into a form, or extracting text. I wanted to see how well an agent could navigate something as confusing as a modern website. If you’ve ever done a “view source” or “inspect” on a modern webpage, you know what I mean.

How It Works

  • A PageManager class handles the browser automation using Playwright (a rough sketch follows this list).
  • The LLM generates the next action based on the current page content and the assigned task.
  • Two modes are available:
    • Non-conversational mode: Every step is processed independently.
    • Conversational mode: The agent maintains memory across multiple interactions, reducing the need to repeat context.
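
Here is a rough sketch of what a PageManager wrapper around Playwright could look like; the method names and structure are assumptions for illustration, not the actual class:

from playwright.sync_api import sync_playwright

class PageManager:
    # Minimal Playwright wrapper exposing the actions the LLM can request
    def __init__(self, url):
        self._pw = sync_playwright().start()
        self._browser = self._pw.chromium.launch()
        self.page = self._browser.new_page()
        self.page.goto(url)

    def content(self):
        # Raw HTML handed to the LLM so it can decide the next action
        return self.page.content()

    def click(self, selector):
        self.page.click(selector)

    def type_text(self, selector, text):
        self.page.fill(selector, text)

    def close(self):
        self._browser.close()
        self._pw.stop()
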
Example Task:
web_agent_conversation("gemini-2.0-flash", "Search for 'LLM agents' and return the first result's title.", "https://duckduckgo.com/")

This runs a search query and extracts the first result’s title, all without predefined scraping rules.

How Did It Go?

At one point, the agent was not using valid CSS selector syntax and couldn’t “click” the search button. In spite of not getting to the search results page, the agent returned a “successful” answer. I wondered if the LLM was somehow using its trained knowledge to find a valid answer, but I could not find the result anywhere when I searched DuckDuckGo and Google for the title.

I added the “Explain how you solved the task” prompt and the agent replied that since it was not able to get to the search results, it created a hypothetical answer.

I did two things:

  • I told the agent it was not allowed to make up answers. Just fail gracefully.
  • I gave the agent examples of valid id, class, and attribute CSS selectors (like the ones sketched below). This really improved the selector accuracy. I had hoped the LLM’s training would have been good enough on its own.
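
The examples were along these lines (illustrative selectors, not the exact prompt text):

#search-button        an id selector
.result-title         a class selector
input[name="q"]       an attribute selector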

The conversational mode, unsurprisingly, could finish tasks with fewer steps. Memory and retained context matter.

Tool Agent: Using LLMs to Call Functions

The Tool Agent extends an LLM’s capabilities by allowing it to call external functions. Instead of just answering questions, it can interact with a set of predefined tools—simulating API calls, performing calculations, retrieving weather data, and more.

How It Works

  • A registry of tool functions provides capabilities like:
    • Web search (search_web)
    • Weather lookup (get_weather)
    • Date and time retrieval (get_datetime)
  • The agent follows a conversational loop (sketched after this list):
    1. Receives a user query.
    2. Decides whether a tool is needed.
    3. Calls the tool and processes the response.
    4. Outputs the final answer.
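
A minimal sketch of that loop, assuming the JSON action format shown in the transcripts; the model name, tool registry, and helper names here are placeholders:

import json
import llm  # Simon Willison's llm library

# Hypothetical registry; the real tools live in tool_agent.py
TOOLS = {
    "get_datetime": lambda param=None: {"datetime": "2025-03-01 10:30 AM"},
}

def run_agent(user_query, system_prompt):
    model = llm.get_model("gemini-2.0-flash")
    conversation = model.conversation()
    response = conversation.prompt(user_query, system=system_prompt)
    while True:
        action = json.loads(response.text())
        if action["type"] == "output":
            return action["value"]
        # The model asked for a tool call; run it and feed the result back
        result = TOOLS[action["tool"]](action.get("param"))
        response = conversation.prompt(f"Function result: {json.dumps(result)}")
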
Example Interaction:
You: What's the weather in Beverly Hills?
Function result: {'zipcode': '90210'}
Function result: {'temperature': '75 F', 'conditions': 'Sunny'}
Agent: The weather in Beverly Hills (zipcode 90210) is 75 F and Sunny.

Here, the LLM autonomously determines that it needs to retrieve a zip code first before getting the weather.

How Did It Go?

It’s not easy to get an LLM to always, and only, respond with structured output such as JSON. Some models do better than others, and there are lots of ways to use the system prompt to help get the results you want. I found that I still needed to check for Markdown code fences in the output and remove them; a sketch of that cleanup is below.
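
The cleanup I mean looks roughly like this (a sketch, not the exact code from the repo):

import re

def strip_code_fences(text):
    # Remove a surrounding Markdown fence like ```json ... ``` if present
    match = re.match(r"^```[a-zA-Z]*\s*\n(.*?)\n```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()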

Note: I saw Simon Willison just updated llm to support schemas, to make structured output easier.

Getting the agent to use the tools (Python functions) required not only being specific about the JSON format and the parameters, but also showing examples. The examples seemed to help a lot (the sketch below shows the general shape). I also found some discussions about using an XML-formatted block to describe the set of tools in the system prompt, something about LLMs handling XML better than JSON. Maybe that is outdated?
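
Roughly the shape of tool description I mean, as a hypothetical excerpt rather than the actual prompt from the repo:

SYSTEM_PROMPT = """You can call tools to help answer questions.
Always respond with a single JSON object and nothing else.

Tools:
- get_weather(zipcode): current weather for a US zip code

To call a tool:
{"type": "call_function", "tool": "get_weather", "param": "90210"}

To give the final answer:
{"type": "output", "value": "your answer here"}
"""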

I was pretty happy to see the agent use two successive tools (as shown above) to complete a task. I want to play around more to see how that type of chaining can be improved and expanded.

What’s Next?

This has been a fun project and I think there are a few more things I want to try before moving on to the real frameworks:

  • Expanding the set of tools to include real API integrations.
  • Using separate agents to implement tools.
  • Fine-tuning the prompt engineering for better decision-making.
  • Improving the agent’s ability to recover from errors.

Further Reading

  • Simon Willison’s blog is a great place to learn about LLMs and stay up to date
  • browser-use is a full-featured Python framework for creating browser-using research agents
  • PydanticAI is a full-featured Python library that makes it easy to get started building tool-using agents