Large Language Models (LLMs) are powerful tools for generating text, answering questions, and coding. We’ve moved beyond generating content, and LLMs are now being used to take actions as agents — independent entities that can act, use tools, and interact with their environment. You probably already know all of this.
I wanted to explore using LLMs as agents, but I like to get an understanding of the underlying components before using high-level frameworks that hide all of the minutiae and streamline the process of building production-ready systems. Understanding how the different components work and interact is important to my own learning process.
That’s exactly what my LLM Agents project sets out to do. Instead of relying on frameworks that abstract away the details, the project takes a bare-bones approach to learning how LLMs can function as agents. By minimizing dependencies, I can get a clearer understanding of the challenges, possibilities, and mechanics of building LLM-powered agents.
Why Minimal Dependencies Matter (To Me)
Many existing frameworks promise powerful LLM agent capabilities, but that power often comes at the cost of hiding the underlying complexities. While these frameworks can be useful, starting with minimal dependencies allows us to:
- Understand the fundamentals: How does an LLM process information to take actions? How does the system prompt impact the effectiveness of the agent?
- Explore limitations: What challenges arise when an agent tries to perform a multi-step task? How does the shape of the tools (functions or APIs) impact how the agent handles the flow of a task?
- Control the design: Without being boxed into a framework’s way of doing things, we can experiment freely. We can then use this knowledge to help pick the right type of framework for more advanced and production use cases.
This project keeps things simple, using only Simon Willison’s great, lightweight llm Python library (code & docs) and Playwright for handling web automation.
These agents are not production ready, but they are trimmed down enough to see the mechanisms at work.
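To ground things, here is roughly what talking to a model through the llm Python library looks like; the model name matches the one used later in this post, and the prompts are throwaway examples:

import llm

# Load any model installed for the llm library (Gemini models come from the llm-gemini plugin).
model = llm.get_model("gemini-2.0-flash")

# One-shot prompt with a system prompt; text() returns the completion as a string.
response = model.prompt(
    "Summarize what an LLM agent is in one sentence.",
    system="You are a concise assistant.",
)
print(response.text())

# A conversation keeps context across turns, which the conversational agents below rely on.
conversation = model.conversation()
print(conversation.prompt("Remember the number 42.").text())
print(conversation.prompt("What number did I ask you to remember?").text())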
Meet the Agents
The repository contains two primary agents:
Web Agent: Navigating the Web with LLMs
The Web Agent is designed to interact with websites using the Playwright Python library. Instead of treating a webpage as structured data, this agent lets an LLM interpret raw HTML and decide what actions to take—whether that means clicking a button, typing into a form, or extracting text. I wanted to see how well an agent could navigate something as confusing as a modern website. If you’ve ever done a “view source” or “inspect” on a modern webpage, you know what I mean.
How It Works
- A PageManager class handles the browser automation using Playwright.
- The LLM generates the next action based on the current page content and the assigned task.
- Two modes are available:
- Non-conversational mode: Every step is processed independently.
- Conversational mode: The agent maintains memory across multiple interactions, reducing the need to repeat context.
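To make the loop concrete, here is a stripped-down sketch of the idea. The JSON action format, the step cap, and the prompts are my own simplification, not the exact code from the repository:

import json
import llm
from playwright.sync_api import sync_playwright

TASK = "Search for 'LLM agents' and return the first result's title."
SYSTEM = (
    "You control a web page. Reply with JSON only, e.g. "
    '{"action": "click", "selector": "..."}, {"action": "fill", "selector": "...", "text": "..."}, '
    'or {"action": "done", "answer": "..."}.'
)

model = llm.get_model("gemini-2.0-flash")
conversation = model.conversation()

with sync_playwright() as p:
    page = p.chromium.launch(headless=True).new_page()
    page.goto("https://duckduckgo.com/")

    for _ in range(10):  # cap the number of steps
        html = page.content()[:20000]  # truncate so the prompt stays manageable
        reply = conversation.prompt(f"Task: {TASK}\n\nCurrent HTML:\n{html}", system=SYSTEM)
        step = json.loads(reply.text())  # assumes bare JSON; real code also strips Markdown fences

        if step["action"] == "done":
            print(step["answer"])
            break
        elif step["action"] == "click":
            page.click(step["selector"])
        elif step["action"] == "fill":
            page.fill(step["selector"], step["text"])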
Example Task:
web_agent_conversation("gemini-2.0-flash", "Search for 'LLM agents' and return the first result's title.", "https://duckduckgo.com/")
This runs a search query and extracts the first result’s title, all without predefined scraping rules.
How Did It Go?
At one point, the agent was not using valid CSS selector syntax and couldn’t “click” the search button. In spite of not getting to the search results page, the agent returned a “successful” answer. I wondered if the LLM was somehow using its trained knowledge to find a valid answer, but I could not find the result anywhere. I searched DuckDuckGo and Google for the title.
I added the “Explain how you solved the task” prompt and the agent replied that since it was not able to get to the search results, it created a hypothetical answer.
I did two things:
- I told the agent it was not allowed to make up answers. Just fail gracefully.
- I gave the agent examples of valid id, class, and attribute CSS selectors. This really improved the selector accuracy. I had hoped the LLM’s training would have been good enough.
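The examples I added were nothing fancy; roughly this kind of addition to the system prompt (the wording and the BASE_SYSTEM_PROMPT name here are illustrative):

SELECTOR_EXAMPLES = """
Use valid CSS selector syntax. Examples:
- by id:        #searchbox_input
- by class:     form.search button.submit
- by attribute: input[name="q"], button[type="submit"]
Do not invent ids or classes that are not present in the HTML.
If no selector works, report failure instead of guessing.
"""

SYSTEM_PROMPT = BASE_SYSTEM_PROMPT + SELECTOR_EXAMPLES  # appended to the agent's instructions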
The conversational mode, unsurprisingly, could finish tasks with fewer steps. Memory and retained context matter.
Tool Agent: Using LLMs to Call Functions
The Tool Agent extends an LLM’s capabilities by allowing it to call external functions. Instead of just answering questions, it can interact with a set of predefined tools—simulating API calls, performing calculations, retrieving weather data, and more.
How It Works:
- A registry of tool functions provides capabilities like:
  - Web search (search_web)
  - Weather lookup (get_weather)
  - Date and time retrieval (get_datetime)
- The agent follows a conversational loop:
- Receives a user query.
- Decides whether a tool is needed.
- Calls the tool and processes the response.
- Outputs the final answer.
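A rough sketch of that loop, with stand-in tool implementations (the tool names come from the project; the JSON call format, the registry, and the run() helper are my own simplification):

import json
from datetime import datetime

import llm

# Stand-in tool implementations; the real ones live in the repository.
def search_web(query: str) -> dict:
    return {"results": [f"stub result for {query}"]}

def get_weather(zipcode: str) -> dict:
    return {"temperature": "75 F", "conditions": "Sunny"}

def get_datetime() -> dict:
    return {"now": datetime.now().isoformat()}

TOOLS = {"search_web": search_web, "get_weather": get_weather, "get_datetime": get_datetime}

SYSTEM = (
    "You may call a tool by replying with JSON like "
    '{"tool": "get_weather", "arguments": {"zipcode": "90210"}}. '
    'When you have the final answer, reply with {"answer": "..."}.'
)

def run(question: str, model_name: str = "gemini-2.0-flash") -> str:
    conversation = llm.get_model(model_name).conversation()
    message = question
    for _ in range(5):  # allow a few chained tool calls
        reply = json.loads(conversation.prompt(message, system=SYSTEM).text())
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply.get("arguments", {}))
        print("Function result:", result)
        message = f"Function result: {json.dumps(result)}"
    return "Stopped after too many tool calls."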
Example Interaction:
You: What's the weather in Beverly Hills?
Function result: {'zipcode': '90210'}
Function result: {'temperature': '75 F', 'conditions': 'Sunny'}
Agent: The weather in Beverly Hills (zipcode 90210) is 75 F and Sunny.
Here, the LLM autonomously determines that it needs to retrieve a zip code first before getting the weather.
How Did It Go?
It’s not easy to get an LLM to only and always respond using structured output, such as JSON. Some models do better than others, and there are lots of ways to use the system prompt to help get the results you want. I found that I still need to check for Markdown code fences in the output, and remove those.
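The fence check is only a few lines; something along these lines (my own sketch, not the project’s exact helper):

import json
import re

def parse_json_reply(reply: str) -> dict:
    """Strip optional ```json ... ``` fences before parsing a model reply."""
    text = reply.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

parse_json_reply('```json\n{"tool": "get_weather"}\n```')  # -> {'tool': 'get_weather'}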
Note: I saw Simon Willison just updated llm to support schemas, making structured output easier.
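I have not reworked the agents around that yet, but as I understand the new API, a schema (for example a Pydantic model) can be passed straight to prompt(), and the reply comes back as JSON matching it:

import llm
from pydantic import BaseModel

class ToolCall(BaseModel):
    tool: str
    arguments: dict

model = llm.get_model("gemini-2.0-flash")
response = model.prompt("What's the weather in Beverly Hills?", schema=ToolCall)
print(response.text())  # JSON conforming to the ToolCall schema, no fences to strip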
Getting the agent to use the tools (Python functions) required not only being specific about the JSON format and the parameters, but also showing examples. The examples seemed to help a lot. I also found some discussions about using an XML-formatted block to describe the set of tools in the system prompt, something about LLMs handling XML better than JSON. Maybe that is outdated?
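For reference, the XML-style tool description those discussions suggest looks roughly like this (the tag names and the BASE_SYSTEM_PROMPT name are illustrative):

TOOLS_XML = """
<tools>
  <tool name="get_weather">
    <description>Look up current weather for a US zipcode.</description>
    <parameter name="zipcode" type="string" required="true"/>
  </tool>
</tools>
"""

# One alternative to listing the tools as JSON in the system prompt.
SYSTEM_PROMPT = BASE_SYSTEM_PROMPT + TOOLS_XML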
I was pretty happy to see the agent use two successive tools (as shown above) to complete a task. I want to play around more to see how that type of chaining can be improved and expanded.
What’s Next?
This has been a fun project, and there are a few more things I want to try before moving on to the real frameworks:
- Expanding the set of tools to include real API integrations.
- Using separate agents to implement tools.
- Fine-tuning the prompt engineering for better decision-making.
- Improving the agent’s ability to recover from errors.
Further Reading
- Simon Willison’s blog is a great place to learn about LLMs and stay up to date
- browser-use is a full-featured Python framework for creating browser-using research agents
- PydanticAI is a full-featured Python library that makes it easy to get started building tool-using agents