Exploring LLMs as Agents: Tools & Benchmarking

I spent some time refactoring the Tool Agent code, added a few more mock tools, and even built some basic benchmarking. For more context, check out the previous posts:

  • Minimalist Approach: I kicked off my exploration by building two agents with bare-minimum dependencies. I wanted to learn the concepts before reaching for a do-it-all framework.
  • Taking Action: I decided to focus on the tool_agent and add some read/write tools for the agent to use. I also discovered the need to send clear and consistent context in prompts.
  • Planning via Prompting: I looked at ways to improve the agent outcomes by using better approaches to planning. I settled on using ReAct, with some Few-Shot prompts, and had the LLM do some Chain of Thought (CoT) output.

Take a look at the repository to see the code.

Code Refactor

The biggest change to the code is the somewhat large refactor to extract the agent-specific code into a ToolAgent class and the tool-specific code into a set of ToolProvider classes. ToolProvider classes are registered with the ToolAgent:

# Register tool providers
agent = ToolAgent(model_name='gemini-2.0-flash')
agent.register_provider(UtilityToolProvider())
agent.register_provider(AppointmentToolProvider())
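
As a rough sketch of what a provider might look like (the method names here are my guesses; the actual interface is in the repo):

# Rough sketch of a provider; method names are guesses, not the repo's actual interface.
import datetime

class UtilityToolProvider:
    """Bundles small, general-purpose tools the agent can call."""

    def get_tool_definitions(self) -> list[dict]:
        # Definitions the agent folds into its system prompt.
        return [{
            "name": "get_datetime",
            "description": "Returns the current date and time in ISO 8601 format.",
            "parameters": [],
        }]

    def execute_tool(self, name: str, args: dict) -> str:
        # The agent routes tool calls here by registered tool name.
        if name == "get_datetime":
            return datetime.datetime.now().isoformat()
        raise ValueError(f"Unknown tool: {name}")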

ToolAgent can then be used in different ways. tool_agent_test.py is a way to run the agent in a terminal window. I’ve been playing with using the agent in a chat system too, but more on that next time.

python tool_agent_test.py

Tools

I wanted to keep adding tools, which is part of what prompted the refactor; I couldn’t keep piling code into a single Python file. With the new tools, I explored two areas that were new to me:

  1. LLM-based Tools: A tool that was implemented using a secondary LLM. For a given set of topics, I wanted to “guess” which topics might be of interest to a user based on their input. LLMs are good at that stuff, so I made a simple tool call that used a separate system prompt. (I just realized that I didn’t add token tracking to that tool!)
  2. Multi-Parameter Tool Calls: Until now, my tool calls took at most one parameter. I wanted to start passing multiple parameters, so I introduced an object parameter type with a simple schema. In the future, I want the agent to ask for clarification when required parameters are missing. A sketch of both ideas follows this list.
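
Here is a rough sketch of both ideas. The tool names, the schema shape, and the secondary-model call are illustrative assumptions, not the definitions used in the repo:

# Illustrative only: tool names, schema shape, and the SDK call are assumptions.
import google.generativeai as genai

# 1. LLM-based tool: a secondary model call with its own system prompt.
TOPIC_PROMPT = ("Given the user's message, reply with a comma-separated list of "
                "matching topics from: hair, nails, massage. Reply with topics only.")

def guess_topics(user_message: str) -> str:
    model = genai.GenerativeModel("gemini-2.0-flash", system_instruction=TOPIC_PROMPT)
    return model.generate_content(user_message).text  # token usage not tracked here yet!

# 2. Multi-parameter tool: parameters described with a simple object schema.
BOOK_APPOINTMENT_DEF = {
    "name": "book_appointment",
    "description": "Books an appointment for the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "appointment_id": {"type": "integer", "description": "ID from the availability list"},
            "name": {"type": "string", "description": "Name to book under"},
        },
        "required": ["appointment_id", "name"],
    },
}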

Benchmarking

I mentioned in the last post that I needed to start doing some benchmarking on the agent. The code refactor made it fairly easy for Copilot (yes, I do some vibe coding too) to whip up a very basic benchmarking harness. Some details (more in the readme), with a rough sketch of the harness after the list:

  • Test Cases: Run multiple predefined test cases
  • Success Metrics: Track success rates, execution times, and token usage
  • Reporting: Get detailed reports on test results
  • CSV/JSON Export: Export results for further analysis
  • Token Usage: Track and analyze token consumption
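
To make those bullets concrete, here is roughly how such a harness could be wired up. The test-case fields, the agent.run() entry point, and the result attributes (total_tokens, etc.) are my assumptions rather than the harness's actual schema; the real details are in the repo's readme.

import csv
import time

# Field names, agent.run(), and the result attributes below are assumptions,
# not the actual schema from the repo.
TEST_CASES = [
    {
        "name": "haircut_tomorrow",
        "prompt": "I want to get a haircut tomorrow morning",
        "required_tools": {"get_datetime", "get_available_appointments"},
        "expected_keywords": ["hair", "11:00"],
    },
]

def run_benchmark(agent, cases, check_success, out_path="benchmark_results.csv"):
    """Run each test case, time it, and export the results to CSV."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "passed", "seconds", "tokens"])
        writer.writeheader()
        for case in cases:
            start = time.time()
            result = agent.run(case["prompt"])  # assumed ToolAgent entry point
            writer.writerow({
                "name": case["name"],
                "passed": check_success(case, result),
                "seconds": round(time.time() - start, 2),
                "tokens": result.total_tokens,  # assumed token-tracking attribute
            })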

I added simple token tracking because I was running the benchmarking suite quite a bit and wanted to know what kind of money I was spending. Turns out, it was less than a penny a run, but it’s good to know.
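
Because I was curious how “less than a penny” falls out of the token counts, here is the back-of-the-envelope math. The per-million-token rates below are placeholders for illustration, not actual Gemini pricing:

# Placeholder rates for illustration only; check the current Gemini price list.
INPUT_PRICE_PER_M = 0.10    # USD per 1M input tokens (placeholder)
OUTPUT_PRICE_PER_M = 0.40   # USD per 1M output tokens (placeholder)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost for a run, given total input and output token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A run using ~20k input and ~2k output tokens stays well under a cent:
print(f"${estimate_cost(20_000, 2_000):.4f}")  # 0.0028 with these placeholder rates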

Learnings

I kept running the benchmark suite, trying to figure out why tests were failing. The benchmarking framework measures success with two checks: (1) Did the agent use the minimum set of required tools? (2) Did the agent include a few expected keywords in the final response? Yes, those aren’t the best success checks, but they’re what I have for now.
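
For reference, a minimal version of those two checks (the kind of check_success callable the harness sketch above expects) could look like this, assuming the agent’s result exposes the tools it called and the final response text:

def check_success(case: dict, result) -> bool:
    """Pass only if all required tools were used and the final response
    contains every expected keyword. Result fields are assumptions."""
    used_required_tools = case["required_tools"] <= set(result.tools_used)
    has_keywords = all(k.lower() in result.text.lower()
                       for k in case["expected_keywords"])
    return used_required_tools and has_keywords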

More Jedi Prompt Tricks

The agent was usually failing the required-tools check. Looking at the debug [thought] output from the agent, I noticed that it would assume knowledge instead of calling the tools. I found a blog post from Pinecone that discusses the same problem when using LangChain; they suggested adding explicit instructions to the system prompt. I already had something along the lines of “ALWAYS use the tools. NEVER assume information,” but that wasn’t enough. I added the following:

IMPORTANT: Do not assume or guess or hallucinate information. Call the tools to get ALL needed information.
    - You DO NOT know the current date or time. Use tools to determine the current date and time.
    - You CANNOT do math. Use tools to calculate math.

Temperature

Regarding randomness in the output, I saw this note in the same Pinecone blog post:

Here we initialize the LLM with a temperature of 0. A low temperature is useful when using tools as it decreases the amount of “randomness” or “creativity” in the generated text of the LLMs, which is ideal for encouraging it to follow strict instructions — as required for tool usage.

Honestly, I had not played with any of the LLM’s generation options yet, but I added the temperature=0 setting and it did help.
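
For reference, assuming the google-generativeai SDK (the repo may wire this up differently), passing the temperature looks roughly like this:

# Assumes the google-generativeai SDK; the repo may configure this differently.
import google.generativeai as genai

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    generation_config=genai.GenerationConfig(temperature=0),
)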

More JSON Output Handling

I already had some code to handle the times when the LLM wrapped its JSON in Markdown code fences, but there were also times when the LLM would apologize for a mistake and then append the JSON. The result was correct, but my code failed to extract it! So I added more defensive measures to safely extract JSON from LLM responses. Yes, at this point I should switch to the LLM’s built-in support for JSON schemas.
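
The defensive extraction ended up along these lines; this is an illustrative sketch rather than the exact code in the repo. It strips any Markdown code fences and ignores preamble text (like an apology) before the first brace:

import json
import re

def extract_json(response: str) -> dict:
    """Pull a JSON object out of an LLM response, tolerating Markdown code
    fences and any preamble text (such as an apology) before the JSON."""
    # Drop Markdown code fences if present, keeping their contents.
    text = re.sub(r"```(?:json)?", "", response)
    # Take everything from the first '{' through the last '}'.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in response")
    return json.loads(text[start:end + 1])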

With these changes, I regularly had 100% success in my benchmarking runs! That means I need to make the tests harder.

You: I want to get a haircut tomorrow morning

Thought: I need to find out the current date to determine what tomorrow is. I will use the get_datetime tool to get the current date and time.
Thought: The current date is 2025-03-15. I need to check for available appointments for a haircut. Since there is no haircut specialty, I should first check the available specialties.
Thought: Okay, 'hair' is a valid specialty. I will now check for available 'hair' appointments.
Thought: I have a list of available hair appointments. The user wants an appointment tomorrow morning, which is March 16th. There are two appointments available on March 16th: 11:00 AM and 2:00 PM. The user wants a morning appointment, so the 11:00 AM appointment is the best fit. I will ask the user if they want to book appointment ID 11.

Agent: I see two appointments available for hair on March 16th: 11:00 AM and 2:00 PM. Would you like to book appointment ID 11 for 11:00 AM?

What’s Next

As I started to explore some non-chat user experiences, I discovered that I will need the agent to provide more structured output. Right now the output is very conversational, which works well for chat but won’t be easy to use with other types of UX. I found some discussion on taming the output of an LLM in this blog post from Tryolabs.

I’ve also started using ToolAgent in my own chat server. I’m sure this branch of experimentation will continue as well.
