Exploring LLMs as Agents: WebAgent Gets a Refactor

My fifth post on LLMs as Agents. I guess it’s a series now.

After refactoring ToolAgent, I decided to do the same with WebAgent, building it on the ToolAgent system. For more context and a refresher on WebAgent, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by making two agents using the bare-minimum dependencies. I wanted to learn the concepts before using a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
  • Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.
  • Tools & Benchmarking: I refactored the ToolAgent code into a reusable base class, added some additional mock tools, and even some basic benchmarking.

Take a look at the repository to see the code.

WebAgent

After refactoring ToolAgent, I decided to see if I could build WebAgent on top of ToolAgent. I think it turned out pretty nicely. Again, Copilot helped too.

WebAgent now derives from ToolAgent, and PageManager (the wrapper around Playwright for browser automation) became a ToolProvider. All of the wrappers for Playwright automation became tools the agent could use. This worked better than I thought.
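
Roughly, the new shape looks like the sketch below. It’s illustrative rather than the exact code from the repo: ToolAgent’s constructor and the register_provider hook are stand-ins for the real plumbing.

from playwright.sync_api import sync_playwright

class PageManager:
    """Wraps a Playwright page; each method is exposed to the agent as a tool."""

    def __init__(self):
        self._pw = sync_playwright().start()
        self._page = self._pw.chromium.launch().new_page()

    def navigate(self, url: str) -> str:
        """Tool: navigate -- load a URL."""
        self._page.goto(url)
        return f"Navigated to {url}"

    def click(self, selector: str) -> str:
        """Tool: click -- click the element matching a CSS selector."""
        self._page.click(selector)
        return f"Clicked {selector}"

    def get_text(self, selector: str) -> str:
        """Tool: get_text -- extract the text of an element."""
        return self._page.inner_text(selector)

class WebAgent(ToolAgent):  # ToolAgent is the base class from the last refactor (API assumed)
    def __init__(self, model):
        super().__init__(model)
        self.register_provider(PageManager())  # hypothetical registration hook

With that wiring in place, a run looks like this: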

Starting task: Navigate to wikipedia.org and find the featured article of the day

Thought: "Okay, I will start by navigating to wikipedia.org. Then I will analyze the page content to find the featured article of the day. Finally, I will extract the text of the featured article and output it."
Tool: "navigate", Param: "https://www.wikipedia.org"

Thought: "Now that I have navigated to the Wikipedia homepage, I need to identify the element containing the 'Featured article'. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"

Thought: "The HTML content I retrieved is for the main wikipedia.org page, which is just a language selection portal. I need to navigate to the English Wikipedia to find the featured article. I will click the English link.",
Tool: "click", Param: "#js-link-box-en"

Thought": "Now that I have clicked the link, I need to wait for the page to load.",
Tool: "wait_for_navigation"

Thought: "Now that the English Wikipedia page has loaded, I need to find the 'Featured article' section. I will use the 'get_page_content' tool to get the HTML content of the page and analyze it to find the appropriate CSS selector.",
Tool: "get_page_content"

Thought": "I have analyzed the page content and found that the 'Featured article' is located within a div with the id 'mp-tfa'. I will now use the 'get_text' tool to extract the text from this element.",
Tool: "get_text", Param: "#mp-tfa"

Thought: "I have successfully extracted the text of the featured article. Now I will output the result."

===== TASK SUMMARY =====
Task: Navigate to wikipedia.org and find the featured article of the day
Status: incomplete
Duration: 12.39 seconds
Input tokens: 68486
Output tokens: 827

The token usage was noticeably higher than in my ToolAgent experiments. This was almost entirely due to get_page_content, which was pushing the entire HTML content of the page into the LLM.

I had to fix that.

Learnings: Managing Context

As I said, my initial implementation of get_page_content just returned the entire HTML content of the page to the LLM. This wasn’t good. Because the LLM was using the page content to decide which HTML elements it needed to interact with, I could take some steps to limit which elements I returned.

Step 1: Regex

I apologize up front that I even tried this, but I just wanted to see if I could get some quick wins. I asked Copilot to create an implementation of get_page_content that filtered out any elements that were not: form elements, elements with IDs, elements with roles or aria-labels, or elements with a small set of special class names (like .btn).
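
Something along these lines (a simplified sketch; the tag list and keep-list here are a reconstruction, not the exact pattern Copilot produced):

import re

KEEP_PATTERN = re.compile(
    r"<(?:input|button|select|textarea|form|a)\b[^>]*>"      # form elements and links
    r'|<[^>]+\bid="[^"]+"[^>]*>'                             # elements with an id
    r'|<[^>]+\b(?:role|aria-label)="[^"]+"[^>]*>'            # roles / aria-labels
    r'|<[^>]+\bclass="[^"]*\bbtn\b[^"]*"[^>]*>',             # special classes like .btn
    re.IGNORECASE,
)

def filter_page_content(html: str) -> str:
    """Keep only tags matching the keep-list (loses nesting and page structure)."""
    return "\n".join(match.group(0) for match in KEEP_PATTERN.finditer(html))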

The result was a much smaller content context, but there were issues. Of course there were issues! I lived through the “Use regex on HTML” era and it’s not good.

Step 2: Playwright

I was already using Playwright, so why not use it in a way similar to how we’d use BeautifulSoup? One well-crafted prompt later, Copilot had a very functional version.

Copilot: You’re absolutely right! Using Playwright directly to select elements gives us much better control and maintains the structure and order of the original page.

Even Copilot was relieved we moved away from regex.

The Playwright-based version had some other benefits:

  • Elements are extracted in the same order they appear in the DOM, preserving visual hierarchy
  • Filtering allowed me to only include important attributes like id, class, role, etc.
  • Executing some code in the DOM allowed me to only include visible elements

These helped the LLM make better decisions about the HTML elements it needed to interact with during the task. It also dramatically reduced the number of input tokens sent to the LLM.
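
Here’s a rough sketch of that extraction approach. The selector list, the attribute keep-list, and the extract_page_outline name are illustrative; the actual implementation in the repo differs in the details.

INTERESTING = "input, button, select, textarea, form, a, [id], [role], [aria-label]"
KEEP_ATTRS = {"id", "class", "name", "role", "aria-label", "href", "type", "placeholder"}

def extract_page_outline(page) -> str:
    """Build a trimmed HTML outline of the visible, interactive elements, in DOM order."""
    lines = []
    for element in page.query_selector_all(INTERESTING):
        if not element.is_visible():            # only include visible elements
            continue
        tag = element.evaluate("el => el.tagName.toLowerCase()")
        attrs = element.evaluate(
            "el => Object.fromEntries([...el.attributes].map(a => [a.name, a.value]))"
        )
        kept = " ".join(f'{k}="{v}"' for k, v in attrs.items() if k in KEEP_ATTRS)
        text = element.inner_text().strip()[:80]   # a little text for context
        lines.append(f"<{tag} {kept}>{text}</{tag}>")
    return "\n".join(lines)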

Step 3: Greedy LLM

The last thing I noticed and corrected was the LLM’s desire to ask for the page content whenever it got confused with CSS selectors. And yes, it gets confused. The LLM would fetch the page content even when the page had not changed. Token usage went up, and for the most part, the LLM didn’t zero in on the right CSS selector. It likely just started guessing.

I added an is_maybe_dirty check in PageManager, which was set to false after any call to get_page_content. I set is_maybe_dirty to true after: navigate, click, and type_text.

If is_maybe_dirty is false when get_page_content is called, I return an error result of “Use previously extracted page content”. The LLM correctly interpreted the result and knew it should stick with the HTML content it already had.
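
In sketch form (method names and wiring are illustrative, and extract_page_outline is the hypothetical helper from the earlier sketch):

class PageManager:
    def __init__(self, page):
        self.page = page
        self.is_maybe_dirty = True    # nothing cached yet, so allow the first fetch

    def navigate(self, url):
        self.page.goto(url)
        self.is_maybe_dirty = True

    def click(self, selector):
        self.page.click(selector)
        self.is_maybe_dirty = True

    def type_text(self, selector, text):
        self.page.fill(selector, text)
        self.is_maybe_dirty = True

    def get_page_content(self):
        if not self.is_maybe_dirty:
            # Nudge the LLM to reuse the HTML it already has instead of re-fetching.
            return "Error: Use previously extracted page content"
        content = extract_page_outline(self.page)   # e.g. the Playwright filter above
        self.is_maybe_dirty = False
        return content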

Here’s an example run: “Use ‘https://duckduckgo.com’, search for ‘python tutorial’, and return the title of the first result found” (yes, it’s too fast to read, so use the scrubber)

What’s Next

If I keep pushing on ToolAgent and WebAgent, I run the risk of starting to build simple frameworks. That might not be the worst outcome, but it wasn’t my intention when I started these projects.

Instead, I want to explore some embedding and vector search projects. I’ve been using Simon Willison’s llm library, and it already includes some support for creating embeddings. Both SQLite and DuckDB have support for vector search. Sounds like I’ll have a lot to keep me busy.

I also want to begin using some local models. Again, llm already supports this, so I’m unblocked.
