I finally got around to exploring local models, which turned out to be surprisingly simple to set up. I wanted to see how well a local model would perform in ToolAgent compared to the remote models (mostly Gemini) I have been using.
For more context on my explorations, check out the previous posts: Minimalist Approach, Taking Action, Planning via Prompting, Tools & Benchmarking, WebAgent Gets a Refactor, and Model Context Protocol (MCP). Take a look at the repository to see the code.
Local Models
Local large language models (LLMs) are models that run directly on a personal device, such as a laptop, desktop computer, or smartphone, rather than relying on cloud-based servers. They offer benefits like privacy, security, compliance, offline availability, and lower operating costs.
One of the easiest ways to get started with local models is with Ollama, a system to download, manage, and run models on your laptop. Another frequently used resource is Hugging Face, which I’ve heard called the “GitHub of Models”.
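If you want a quick programmatic taste before wiring anything into an agent, Ollama also ships an official Python client. My ToolAgent tests went through llm instead (see below), so this client and the exact model tag are just illustrative:

```python
# Sketch only: assumes Ollama is installed and running, and the
# official Python client is available (pip install ollama).
import ollama

# Download a model, equivalent to `ollama pull` on the command line.
ollama.pull("mistral-small3.1:24b")

# Send a single chat message and print the reply.
response = ollama.chat(
    model="mistral-small3.1:24b",
    messages=[{"role": "user", "content": "What is tomorrow's date?"}],
)
print(response["message"]["content"])
```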
My ToolAgent is based on Simon Willison's llm project, which has plugins for several different local model systems, including Ollama (llm-ollama) and Hugging Face models via MLX (llm-mlx). I was able to use both approaches to test out some local models.
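Both plugins install through llm itself, and after that a local model is addressed by name like any other llm model. Here is a minimal sketch of the Python API; the naming conventions differ per plugin (Ollama tags for llm-ollama, Hugging Face repo ids for llm-mlx), and the specific names below are just examples:

```python
# Plugin installation happens once on the command line:
#   llm install llm-ollama
#   llm install llm-mlx
import llm

# llm-ollama exposes whatever `ollama pull` has downloaded, by its tag.
ollama_model = llm.get_model("gemma3:27b")

# llm-mlx exposes MLX models by their Hugging Face repo id
# (fetched first with: llm mlx download-model <repo-id>).
mlx_model = llm.get_model("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

print(ollama_model.prompt("Say hello in five words.").text())
print(mlx_model.prompt("Say hello in five words.").text())
```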
Which Models
There are a mind-boggling number of local models and variations to choose from. Hugging Face has a leaderboard. Based on what I saw other people talking about, I decided to start with a few:
Ollama mistral-small3.1:24B (4-bit) - 15 GB
Ollama gemma3:27B (4-bit) - 17 GB
mlx-community/Qwen2.5-0.5B-Instruct-4bit - 278 MB
mlx-community/Mistral-7B-Instruct-v0.3-4bit - 4.08 GB
mlx-community/Mistral-Small-24B-Instruct-2501-4bit - 13.26 GB
mlx-community/Llama-3.3-70B-Instruct-4bit - 40 GB
You’ll notice a few specifications above, like 24B and 4-bit. These describe performance tradeoffs you need to consider when running LLMs on your laptop.
Model size is specified by the number of parameters: 7B is 7 billion and 27B is 27 billion. Generally, the more parameters a model has, the better it is at reasoning and following instructions.
Quantization (4-bit, 8-bit, 16-bit) describes how precisely the model stores its weights. Lower precision means smaller file sizes and less RAM required to run the model. The downside is reduced accuracy and reasoning ability.
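A quick back-of-the-envelope check makes the tradeoff concrete: weight memory is roughly parameters times bits-per-weight divided by 8 bytes, which lines up with the download sizes listed above:

```python
def approx_weight_size_gb(params_billion: float, bits: int) -> float:
    """Rough weight-only memory estimate: params * bits / 8 bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

# 24B parameters at 4-bit -> ~12 GB of weights, in line with the
# 13-15 GB downloads above (the extra comes from embeddings,
# metadata, and per-layer quantization scale factors).
print(approx_weight_size_gb(24, 4))  # 12.0
print(approx_weight_size_gb(70, 4))  # 35.0, close to the 40 GB Llama file
print(approx_weight_size_gb(7, 4))   # 3.5, close to the 4.08 GB Mistral file
```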
Results
Once I had a few local models installed and ready to go, it was time to try them out with ToolAgent! Because ToolAgent is based on the llm library, I just had to pass the model name using the naming rules for the plugin (llm-ollama or llm-mlx) I wanted to use.
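Since everything goes through llm, a multi-turn exchange with a local model uses exactly the same API as a remote one. A minimal sketch, without ToolAgent's real system prompt and tool plumbing (so the model can only chat here, not call tools):

```python
import llm

# Any installed local model name works; this tag is from the list above.
model = llm.get_model("mistral-small3.1:24b")

# llm's conversation object keeps chat history between turns, which is
# the mechanism an agent loop uses to feed tool results back in.
conversation = model.conversation()
print(conversation.prompt("what is tomorrow's date?").text())
print(conversation.prompt("what dentist appointments are available tomorrow?").text())
```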
Speed: Local models are noticeably slower than cloud-based models. Ollama and MLX both use the GPU on my MacBook Pro M1 with 64 GB of RAM, but generation is still sluggish. MLX is built specifically for Apple silicon, so I expect the models were running about as well as they can on a single machine.
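For a rough sense of speed, timing a generation takes only a few lines; this gives a crude words-per-second estimate, not a real tokens-per-second benchmark:

```python
import time
import llm

model = llm.get_model("mistral-small3.1:24b")  # model tag from the list above

start = time.time()
text = model.prompt("Summarize the plot of Hamlet in three sentences.").text()
elapsed = time.time() - start

# Crude throughput: whitespace-split words per second, not true tokens.
print(f"{len(text.split()) / elapsed:.1f} words/sec over {elapsed:.1f}s")
```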
Reasoning: The smaller models (Qwen2.5-0.5B and Mistral-7B) were not able to function as agents. They couldn't follow the instructions in the system prompt and they couldn't call any tool functions. The larger models (Mistral-Small-24B and Llama-3.3-70B) were able to function as agents. They didn't work as smoothly as the cloud-based models, but they were very functional.
An exchange with Mistral-Small-24B:
```
You: what is tomorrow's date?
Thought: "I need to determine the current date and time to calculate tomorrow's date.",
"type": "call_tool", "tool": "get_datetime", "param": null
Thought: "I have the current date, which is 2025-04-13. To find tomorrow's date, I need to add one day to the current date.",
"type": "call_tool", "tool": "calculate", "param": "2025-04-13 + 1 day"
Thought: "I cannot calculate dates directly. I need to determine the date by using the current date and adding one day manually.",
"type": "output", "value": "... output text ..."
Agent: Tomorrow's date is April 14, 2025.

You: what dentist appointments are available tomorrow?
Thought: "I need to determine the available dentist appointments for tomorrow. First, I need to check the available specialties to ensure 'dentist' is one of them.",
"type": "call_tool", "tool": "get_appointment_specialties", "param": null
Thought: "I have confirmed that 'dentist' is an available specialty. Next, I need to find the available dentist appointments for tomorrow.",
"type": "call_tool", "tool": "get_available_appointments", "param": "dentist"
Thought: "I have the list of available dentist appointments. I need to filter the appointments for tomorrow's date, which is April 14, 2025.",
"type": "output", "value": "... output text ..."
Agent: Here are the available dentist appointments for tomorrow, April 14, 2025:
Appointment ID: 3, Time: 11:00 AM
Appointment ID: 4, Time: 3:00 PM
```
It made a bad assumption when calculating the next day, but it recovered. Otherwise, it worked well.
What’s Next
I have started to look at real agent frameworks, like Pydantic-AI. I think I have learned enough of the basics to be dangerous, and it's time to move beyond my own toy framework. Google released an Agent Development Kit (ADK) at Google Next. It supports many of the features I have been playing with, including tool calling, planning, MCP, and local models. It also supports some advanced concepts like agent orchestration.
Let’s start exploring Google’s ADK!