My fifth post on LLMs as Agents. I guess it's a series now.
After refactoring ToolAgent, I decided to do the same with WebAgent, building it on the ToolAgent system. For more context and a refresher on WebAgent, check out the previous posts:
- Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts before using a do-it-all framework.
- Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
- Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.
- Tools & Benchmarking: I refactored the
ToolAgent
code into a reusable base class, added some additional mock tools, and even some basic benchmarking.
Take a look at the repository to see the code.
WebAgent
After refactoring ToolAgent, I decided to see if I could build WebAgent on top of ToolAgent. I think it turned out pretty nicely. Again, Copilot helped too.

WebAgent now derives from ToolAgent, and PageManager (the wrapper around Playwright for browser automation) became a ToolProvider. All of the wrappers for Playwright automation became tools the agent could use. This worked better than I thought.
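To make the structure concrete, here's a minimal sketch of how the pieces fit together. The ToolAgent and ToolProvider base classes come from the earlier refactoring; the import path, constructor arguments, and method names below are my illustration, not the actual repository API.

```python
# Hypothetical sketch: module path, parameter names, and tool methods are illustrative only.
from tool_agent import ToolAgent, ToolProvider  # assumed import from the earlier posts
from playwright.sync_api import Page


class PageManager(ToolProvider):
    """Wraps Playwright; each wrapper method is exposed to the agent as a tool."""

    def __init__(self, page: Page):
        self.page = page

    # Each of these becomes a named tool the LLM can request.
    def navigate(self, url: str) -> str:
        self.page.goto(url)
        return f"Navigated to {url}"

    def click(self, selector: str) -> str:
        self.page.click(selector)
        return f"Clicked {selector}"

    def get_text(self, selector: str) -> str:
        return self.page.inner_text(selector)


class WebAgent(ToolAgent):
    """A ToolAgent whose tools happen to drive a browser."""

    def __init__(self, llm_client, page: Page):
        super().__init__(llm_client, tool_providers=[PageManager(page)])
```

Here's what a run looked like: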
Starting task: Navigate to wikipedia.org and find the featured article of the day
Thought: "Okay, I will start by navigating to wikipedia.org. Then I will analyze the page content to find the featured article of the day. Finally, I will extract the text of the featured article and output it."
Tool: "navigate", Param: "https://www.wikipedia.org"
Thought: "Now that I have navigated to the Wikipedia homepage, I need to identify the element containing the 'Featured article'. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"
Thought: "The HTML content I retrieved is for the main wikipedia.org page, which is just a language selection portal. I need to navigate to the English Wikipedia to find the featured article. I will click the English link.",
Tool: "click", Param: "#js-link-box-en"
Thought": "Now that I have clicked the link, I need to wait for the page to load.",
Tool: "wait_for_navigation"
Thought: "Now that the English Wikipedia page has loaded, I need to find the 'Featured article' section. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"
Thought": "I have analyzed the page content and found that the 'Featured article' is located within a div with the id 'mp-tfa'. I will now use the 'get_text' tool to extract the text from this element.",
Tool: "get_text", Param: "#mp-tfa"
Thought: "I have successfully extracted the text of the featured article. Now I will output the result."
===== TASK SUMMARY =====
Task: Navigate to wikipedia.org and find the featured article of the day
Status: incomplete
Duration: 12.39 seconds
Input tokens: 68486
Output tokens: 827
The token usage was noticeably higher than in my ToolAgent experiments. This was almost entirely due to get_page_content, which was pushing the entire HTML content of the page into the LLM. I had to fix that.
Learnings: Managing Context
As I said, my initial implementation of get_page_content just returned the entire HTML content back to the LLM. This wasn't good. Because the LLM was using the page content to decide which HTML elements it needed to interact with, I could take some steps to limit which HTML elements I returned.
Step 1: Regex
I apologize up front that I even tried this, but I just wanted to see if I could get some quick wins. I asked Copilot to create an implementation of get_page_content that filtered out any elements that were not: form elements, elements with IDs, elements with roles or aria-labels, and a small set of special class names (like .btn).
The result was a much smaller content context, but there were issues. Of course there were issues! I lived through the “Use regex on HTML” era and it’s not good.
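For illustration, here's roughly what that kind of regex filter looks like. This is a reconstruction of the idea, not the code Copilot actually produced, and it shares all the usual regex-on-HTML fragility.

```python
import re

# Keep only "interesting" opening tags: form controls, anything with an id,
# role, or aria-label, and a few special class names like .btn.
KEEP_TAG = re.compile(
    r"<(?:input|button|select|textarea|form|a)\b[^>]*>"       # form-ish elements and links
    r"|<\w+\b[^>]*\bid=\"[^\"]+\"[^>]*>"                      # elements with IDs
    r"|<\w+\b[^>]*\b(?:role|aria-label)=\"[^\"]+\"[^>]*>"     # roles / aria-labels
    r"|<\w+\b[^>]*\bclass=\"[^\"]*\bbtn\b[^\"]*\"[^>]*>",     # special class names
    re.IGNORECASE,
)


def filter_page_content(html: str) -> str:
    """Return only the matching opening tags, one per line."""
    return "\n".join(KEEP_TAG.findall(html))
```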
Step 2: Playwright
I was already using Playwright, so why not use it in a way similar to how we'd use BeautifulSoup? One well-crafted prompt later, and Copilot had a very functional version.
Copilot: You’re absolutely right! Using Playwright directly to select elements gives us much better control and maintains the structure and order of the original page.
Even Copilot was relieved we moved away from regex.
The Playwright-based version had some other benefits:
- Elements are extracted in the same order they appear in the DOM, preserving visual hierarchy
- Filtering allowed me to only include important attributes like id, class, role, etc.
- Executing some code in the DOM allowed me to only include visible elements
These helped the LLM make better decisions about the HTML elements it needed to interact with during the task. It also dramatically reduced the amount of input tokens sent to the LLM.
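Here's roughly what the Playwright-based version looks like. This is a sketch based on the description above, not the code from the repo; the selector list, attribute allowlist, and function shape are my assumptions.

```python
from playwright.sync_api import Page

# My guesses at what gets filtered on: "interesting" elements and the attributes worth keeping.
INTERESTING = "a, button, input, select, textarea, form, [id], [role], [aria-label], .btn"
KEEP_ATTRS = ("id", "class", "role", "aria-label", "name", "type", "href", "placeholder")


def get_page_content(page: Page) -> str:
    """Return a trimmed, DOM-ordered view of the page for the LLM."""
    lines = []
    for el in page.query_selector_all(INTERESTING):     # DOM order is preserved
        if not el.is_visible():                         # code runs in the page; hidden elements are skipped
            continue
        tag = el.evaluate("e => e.tagName.toLowerCase()")
        attrs = " ".join(
            f'{name}="{value}"'
            for name in KEEP_ATTRS
            if (value := el.get_attribute(name))
        )
        text = (el.inner_text() or "").strip()[:80]     # short text hint for the LLM
        lines.append(f"<{tag}{' ' + attrs if attrs else ''}>{text}</{tag}>")
    return "\n".join(lines)
```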
Step 3: Greedy LLM
The last thing I noticed and corrected was the LLM's desire to ask for the page content whenever it got confused with CSS selectors. And yes, it gets confused. The LLM would fetch the page content even when the page had not changed. Token usage went up, and for the most part, the LLM didn't zero in on the right CSS selector. It likely just started guessing.
I added an is_maybe_dirty check in PageManager, which was set to false after any call to get_page_content. I set is_maybe_dirty to true after navigate, click, and type_text.

If is_maybe_dirty=false in get_page_content, I return an error result of "Use previously extracted page content". The LLM correctly interpreted the result and knew it should stick with the HTML content it already had.
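Sketched out, the guard looks something like this; a minimal sketch where the extraction helper and the rest of PageManager are assumed.

```python
class PageManager:
    """Sketch of the dirty-flag logic only; the rest of the class is omitted."""

    def __init__(self, page):
        self.page = page
        self.is_maybe_dirty = True      # nothing extracted yet, so treat content as dirty

    def navigate(self, url: str) -> str:
        self.page.goto(url)
        self.is_maybe_dirty = True      # page changed
        return f"Navigated to {url}"

    def click(self, selector: str) -> str:
        self.page.click(selector)
        self.is_maybe_dirty = True
        return f"Clicked {selector}"

    def type_text(self, selector: str, text: str) -> str:
        self.page.fill(selector, text)
        self.is_maybe_dirty = True
        return f"Typed into {selector}"

    def get_page_content(self) -> str:
        if not self.is_maybe_dirty:
            # Nudge the LLM to reuse what it already has instead of re-fetching.
            return "Error: Use previously extracted page content"
        self.is_maybe_dirty = False
        return extract_page_content(self.page)   # hypothetical helper; see the Step 2 sketch
```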
Here’s an example run: “Use ‘https://duckduckgo.com’, search for ‘python tutorial’, and return the title of the first result found” (yes, it’s too fast to read so use the scrubber)
What’s Next
If I keep pushing on ToolAgent and WebAgent, I run the risk of starting to build simple frameworks. That might not be the worst outcome, but it wasn't my intention when starting these projects.
Instead, I want to explore some embedding and vector search projects. I've been using Simon Willison's llm library, and it already includes some support for creating embeddings. Both SQLite and DuckDB have support for vector search. Sounds like I'll have a lot to keep me busy.
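As a taste of what that might look like (hedged: the embedding model ID below assumes the OpenAI plugin and an API key are configured; swap in whatever model you have available):

```python
import llm

# Assumes "3-small" (OpenAI text-embedding-3-small) is available via the llm plugin/key setup.
model = llm.get_embedding_model("3-small")
vector = model.embed("LLMs as agents, part five")
print(len(vector))   # dimensionality of the embedding vector
```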
I also want to begin using some local models. Again, llm already supports this, so I'm unblocked.