Agents: A Systematic Introduction to Agents

Two aspects determine an agent's capabilities: the tools it has access to and its ability to plan.

Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. This means that the characteristics of an agent depend on the environment it is in and the range of actions it can perform. If an agent is developed to play games (e.g., Minecraft, Go, Dota), then the game is its environment. If you want the agent to scrape documents from the internet, then the internet is its environment. The environment for an autonomous vehicle agent is the road system and its surrounding areas.

The range of actions that an AI agent can perform is enhanced by the tools it has access to. ChatGPT is an agent that can search the web, execute Python code, and generate images. RAG systems are also agents—text retrievers, image retrievers, and SQL executors are their tools.

There is a strong dependency between an agent's environment and its toolkit. The environment determines the tools that the agent may use. For example, if the environment is a game of chess, the only possible actions for the agent are valid chess moves. Conversely, the agent's toolkit restricts the environments it can operate in. For instance, if a robot's only action is swimming, it will be restricted to aquatic environments.

SWE-agent is a coding agent whose environment is the computer and whose actions include navigating the repository, searching, viewing files, and editing.

Compared to non-agent use cases, agents typically require more powerful models for two reasons:

  1. Compounding errors: Agents often need to perform multiple steps to complete a task, and as the number of steps increases, overall accuracy declines. If the model has an accuracy of 95% at each step, after 10 steps the accuracy drops to about 60%, and after 100 steps to only 0.6% (see the arithmetic sketch after this list).
  2. Higher risk: With tools, agents can perform more impactful tasks, but any failure can have more serious consequences.
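
The compounding effect is just repeated multiplication of the per-step success rate. A quick sketch (the per-step accuracy and step counts mirror the example above):

per_step_accuracy = 0.95  # probability that a single step succeeds

for num_steps in (1, 10, 100):
    overall = per_step_accuracy ** num_steps   # steps assumed independent
    print(f"{num_steps:>3} steps -> {overall:.1%}")

# Output:   1 steps -> 95.0%
#          10 steps -> 59.9%
#         100 steps -> 0.6%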

Whenever I talk about autonomous AI agents to a group of people, someone always mentions self-driving cars. "What if someone hacks the car and kidnaps you?" While the example of self-driving cars seems intuitive due to their physical nature, AI systems can cause harm without existing in the physical world.

Any organization looking to leverage artificial intelligence needs to take safety and security issues seriously. However, this does not mean that AI systems should never be given the ability to act in the real world. If we can trust machines to send us into space, I hope that one day, safety measures will be sufficient for us to trust autonomous AI systems. Additionally, humans can also make mistakes. Personally, I trust self-driving cars more than I would trust a stranger to give me a ride.

Planning and execution can be combined in the same prompt. For example, you can give the model a prompt that asks it to think step by step (such as a chain-of-thought prompt) and then execute those steps, all within a single prompt. But what if the model devises a 1,000-step plan and fails to achieve the goal? Without supervision, the agent might run these steps for hours, wasting time and API costs, until you realize it is making no progress.

To avoid ineffective execution, planning should be decoupled from execution: require the agent to generate a plan first, and execute it only after the plan has been validated. Plans can be validated using heuristics. For example, a simple heuristic is to exclude plans that contain invalid actions: if the generated plan requires a Google search and the agent cannot access Google search, the plan is invalid. Another simple heuristic is to exclude any plan with more than X steps.
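
A minimal sketch of such heuristic checks, assuming a plan is simply a list of action names (the allowed tool set and the step limit are illustrative):

VALID_ACTIONS = {"fetch_top_products", "fetch_product_info",
                 "generate_query", "generate_response"}
MAX_STEPS = 10  # illustrative limit on plan length

def is_valid_plan(plan: list[str]) -> bool:
    # Reject plans that use actions the agent does not have access to.
    if any(action not in VALID_ACTIONS for action in plan):
        return False
    # Reject plans that are suspiciously long.
    return len(plan) <= MAX_STEPS

is_valid_plan(["fetch_product_info", "generate_response"])  # True
is_valid_plan(["google_search", "generate_response"])       # False: invalid action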

Decoupling planning (Planner) from execution (Executor) ensures that only validated (Evaluator) plans are executed.

The planning system now consists of three components: one for generating plans, one for validating plans, and another for executing plans. If each component is viewed as an agent, this can be seen as a multi-agent system. Since most agent workflows are complex enough to involve multiple components, most agents are multi-agent.

Solving a task typically involves the following process. It is important to note that reflection is not a strict requirement for agents, but it significantly enhances their performance.

  1. Plan generation: Create a plan to complete the task. A plan is a series of manageable actions, so this process is also known as task decomposition.
  2. Reflection and error correction: Evaluate the generated plan. If the plan is poor, generate a new plan.
  3. Execution: Take action according to the generated plan. This usually involves calling specific functions.
  4. Reflection and error correction: After receiving the results of the actions, evaluate those results and determine whether the goal has been achieved. Identify and correct errors. If the goal is not completed, generate a new plan.
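
A minimal sketch of this loop, with the planner, evaluators, and executor passed in as plain callables standing in for model calls (all names here are illustrative):

from typing import Callable

def run_agent(task: str,
              generate_plan: Callable[[str], list[str]],
              plan_is_good: Callable[[str, list[str]], bool],
              execute_plan: Callable[[list[str]], str],
              goal_achieved: Callable[[str, str], bool],
              max_attempts: int = 3) -> str | None:
    for _ in range(max_attempts):
        plan = generate_plan(task)            # 1. plan generation
        if not plan_is_good(task, plan):      # 2. reflection on the plan
            continue                          #    poor plan: generate a new one
        result = execute_plan(plan)           # 3. execution
        if goal_achieved(task, result):       # 4. reflection on the outcome
            return result
    return None                               # attempt budget exhausted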

The simplest way to transform a model into a plan generator is through prompt engineering. Suppose you create an agent to help customers understand Kitty Vogue's products. This agent has access to three external tools: retrieving products by price, retrieving popular products, and retrieving product information. Below is an example of a prompt for plan generation. This prompt is for illustrative purposes only; prompts in actual production may be more complex.

Propose a plan to solve the task. You have access to 5 actions:

* get_today_date()
* fetch_top_products(start_date, end_date, num_products)
* fetch_product_info(product_name)
* generate_query(task_history, tool_output)
* generate_response(query)

The plan must be a sequence of valid actions.

Examples
Task: "Tell me about Fruity Fedora"
Plan: [fetch_product_info, generate_query, generate_response]

Task: "What was the best selling product last week?"
Plan: [fetch_top_products, generate_query, generate_response]

Task: {USER INPUT}
Plan:

Many model providers offer tool usage capabilities for their models, effectively transforming their models into agents, where tools are functions. Therefore, calling tools is often referred to as function calling. Generally, the operation of function calling is as follows:

  1. Create a list of tools. Declare all tools that the model might want to use. Each tool is described by its execution entry point (e.g., its function name), parameters, and its documentation (e.g., the function's capabilities and required parameters).
  2. Specify the tools that the agent can use for queries. Since different queries may require different tools, many APIs allow specifying a list of declared tools to use for each query.
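
As a rough sketch, a tool declaration in the OpenAI-style chat completions format looks something like the following (the function and its schema are illustrative; other providers use similar but not identical formats):

tools = [
    {
        "type": "function",
        "function": {
            "name": "lbs_to_kg",
            "description": "Convert a weight from pounds to kilograms.",
            "parameters": {
                "type": "object",
                "properties": {
                    "lbs": {"type": "number", "description": "Weight in pounds."}
                },
                "required": ["lbs"],
            },
        },
    }
]

# The declared tools are then passed along with each query, e.g.:
# response = client.chat.completions.create(model=..., messages=..., tools=tools)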


For example, given the user query "How many kilograms are 40 pounds?", the agent might decide that it needs to use the tool lbs_to_kg_tool and pass in the parameter value 40. The agent's response might look like this.

response = ModelResponse(
    finish_reason='tool_calls',
    message=chat.Message(
        content=None,
        role='assistant',
        tool_calls=[
            ToolCall(
                function=Function(
                    arguments='{"lbs":40}',
                    name='lbs_to_kg'),
                type='function')
        ])
)
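
The application is still responsible for actually running the tool: parse the arguments, call the corresponding function, and return the result to the model. A minimal sketch, assuming a response object shaped like the one above:

import json

def lbs_to_kg(lbs: float) -> float:
    return lbs * 0.45359237

# Hypothetical registry mapping tool names to local functions.
TOOLS = {"lbs_to_kg": lbs_to_kg}

def dispatch(tool_call) -> str:
    func = TOOLS[tool_call.function.name]
    kwargs = json.loads(tool_call.function.arguments)  # '{"lbs":40}' -> {"lbs": 40}
    result = func(**kwargs)                            # 40 lbs -> ~18.14 kg
    return str(result)  # sent back to the model as a tool message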

A plan is a roadmap outlining the steps needed to complete a task. The roadmap can have different levels of granularity. When planning for a year, quarterly plans are higher-level than monthly plans, while monthly plans are higher-level than weekly plans.

Detailed plans are harder to create but easier to execute; higher-level plans are easier to create but more challenging to execute. One way to avoid this trade-off is to adopt hierarchical planning. First, use a planner to generate a high-level plan, such as a quarterly plan. Then, for each quarter, use the same or a different planner to further refine the monthly plan.

A plan can be expressed as a sequence of exact function calls or in more natural language (e.g., "fetch the most popular products" rather than fetch_top_products(start_date, end_date, num_products)). Using more natural language helps the plan generator adapt better to changes in the tool API: if a function is renamed or its signature changes, the natural-language plan can stay the same. If the model is primarily trained on natural language, it is also likely to be better at understanding and generating natural-language plans and less likely to hallucinate.

The downside of this approach is the need for a translator to convert each natural language action into executable commands. Chameleon (Lu et al., 2023) refers to this translator as a program generator. However, translation is much simpler than planning and can be accomplished by weaker models, with a lower risk of hallucination.
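
A minimal sketch of such a translator, here using a hand-written mapping rather than a model (the phrasings and tool names are illustrative; in practice the translation step itself can be delegated to a smaller model):

# Map natural-language plan steps to executable actions.
STEP_TO_ACTION = {
    "get the most popular products": "fetch_top_products",
    "look up the product details": "fetch_product_info",
    "answer the user": "generate_response",
}

def translate(natural_language_plan: list[str]) -> list[str]:
    return [STEP_TO_ACTION[step] for step in natural_language_plan]

translate(["get the most popular products", "answer the user"])
# -> ["fetch_top_products", "generate_response"]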

Control flow includes sequences, parallelism, if statements, and for loops. When evaluating agent frameworks, check the control flow they support. For example, if the system needs to browse ten websites, can it do so simultaneously? Parallel execution can significantly reduce user-perceived latency.
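
For instance, here is a sketch of browsing pages in parallel with Python's asyncio (fetch_page is a stand-in for whatever tool actually retrieves a page):

import asyncio

async def fetch_page(url: str) -> str:
    # Stand-in for a real browsing tool; simulate network latency.
    await asyncio.sleep(1)
    return f"contents of {url}"

async def browse_all(urls: list[str]) -> list[str]:
    # All fetches run concurrently, so ten pages take roughly as long
    # as the slowest single page rather than ten times as long.
    return await asyncio.gather(*(fetch_page(u) for u in urls))

pages = asyncio.run(browse_all([f"https://example.com/{i}" for i in range(10)]))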


Even the best plans require constant evaluation and adjustment to maximize the chances of success. While reflection is not a strict necessity for agent operation, it is essential for agent success.

Reflection and error correction are two complementary mechanisms. Reflection generates insights that help identify errors that need to be corrected. Reflection can be accomplished by the same agent using self-critique prompts. It can also be done through a separate component, such as a dedicated scorer: a model that outputs specific scores for each result.

Interleaving reasoning with action, as proposed by ReAct (Yao et al., 2022), has become a common pattern for agents. Yao et al. use the term "reasoning" to encompass both planning and reflection. At each step, the agent is asked to explain its thinking (planning), take an action, and then analyze the observed result (reflection), until the agent believes the task is complete.

A ReAct agent in action.

Reflection mechanisms can be implemented in multi-agent environments: one agent is responsible for planning and executing actions, while another agent evaluates the results after each step or several steps. If the agent's response fails to complete the task, you can prompt the agent to reflect on the reasons for the failure and how to improve. Based on this suggestion, the agent generates a new plan. This allows the agent to learn from its mistakes.
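
A minimal sketch of this evaluate-reflect-retry pattern, with the model calls abstracted behind generic callables (all prompts and names here are illustrative):

from typing import Callable

def solve_with_reflection(task: str,
                          llm: Callable[[str], str],
                          is_solved: Callable[[str, str], bool],
                          max_attempts: int = 3) -> str:
    reflections: list[str] = []
    answer = ""
    for _ in range(max_attempts):
        prompt = f"Task: {task}\n"
        if reflections:
            prompt += "Lessons from previous failed attempts:\n" + "\n".join(reflections) + "\n"
        answer = llm(prompt)
        if is_solved(task, answer):
            return answer
        # Ask the model to reflect on why the attempt failed and how to improve.
        reflections.append(llm(
            f"Task: {task}\nAttempt: {answer}\n"
            "The attempt failed. Explain why, and suggest how to do better next time."
        ))
    return answer  # best effort after exhausting the attempt budget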

In the Reflexion framework, reflection is divided into two modules: an evaluator that scores the outcome and a self-reflection module that analyzes what went wrong. The framework uses the term "trajectory" to refer to a plan. After each step, following evaluation and self-reflection, the agent proposes a new trajectory.

An example of how Reflexion works.

Reflection is relatively easier to implement compared to plan generation and can lead to surprising performance improvements. The downside of this approach is latency and cost. Thinking, observing, and sometimes even acting may require generating a large number of tokens, increasing costs and user-perceived latency, especially for tasks with many intermediate steps. To encourage their agents to follow the format, the authors of ReAct and Reflexion used numerous examples in their prompts. This increases the cost of computing input tokens and reduces the contextual space available for other information.

More tools will give agents more capabilities. However, the more tools there are, the harder it becomes to use them efficiently. This is similar to the increasing difficulty humans face when mastering a large number of tools. Adding tools also means increasing tool descriptions, which may not fit within the model's context.

Like many other decisions when building AI applications, tool selection requires experimentation and analysis. Here are some considerations that can help you make decisions:

  1. Compare agent performance across different toolsets;
  2. Conduct ablation studies to see how much agent performance declines if a certain tool is removed. If a tool can be removed without degrading performance, remove it;
  3. Look for tools that the agent frequently struggles with. If a tool is too difficult for the agent to use, for example, even extensive prompting or fine-tuning cannot teach the model to use it, then change the tool;
  4. Map the distribution of tool calls to see which tools are used the most and which are used the least;
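
For the last point, a sketch of mapping the tool-call distribution from logs (the log format here is an assumption):

from collections import Counter

# Assumed log format: one record per tool call, e.g. collected from agent traces.
tool_call_log = [
    {"tool": "fetch_product_info"},
    {"tool": "fetch_top_products"},
    {"tool": "fetch_product_info"},
]

usage = Counter(record["tool"] for record in tool_call_log)
for tool, count in usage.most_common():
    print(f"{tool}: {count}")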

Different models and tasks exhibit different tool usage patterns. Different tasks require different tools. ScienceQA (science question-answering task) relies more on knowledge retrieval tools than TabMWP (table math problem-solving task). Different models have different tool preferences. For example, GPT-4 seems to choose a broader range of tools than ChatGPT. ChatGPT appears to prefer image descriptions, while GPT-4 leans more towards knowledge retrieval.

Planning is difficult and can fail in many ways. The most common mode of planning failure is tool usage failure. An agent may generate a plan that contains one or more of the following errors.

  • Invalid tool
    For example, it generates a plan that includes bing_search, which is not in the tool list.
  • Valid tool, invalid parameters.
    For example, it calls lbs_to_kg with two parameters when that function only requires one parameter, lbs.
  • Valid tool, incorrect parameter values.
    For example, it calls lbs_to_kg with a parameter lbs, but uses the value 100 to represent pounds when it should have used 120.
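
The first two failure modes can be caught mechanically before execution by checking each call against the registered tools' signatures. A minimal sketch (the third mode, incorrect parameter values, generally cannot be caught this way):

import inspect

def lbs_to_kg(lbs: float) -> float:
    return lbs * 0.45359237

TOOLS = {"lbs_to_kg": lbs_to_kg}

def check_call(tool_name: str, arguments: dict) -> str | None:
    """Return an error message, or None if the call looks valid."""
    if tool_name not in TOOLS:                          # invalid tool
        return f"unknown tool: {tool_name}"
    try:
        inspect.signature(TOOLS[tool_name]).bind(**arguments)
    except TypeError as e:                              # invalid parameters
        return f"bad arguments for {tool_name}: {e}"
    return None

check_call("bing_search", {"query": "x"})      # "unknown tool: bing_search"
check_call("lbs_to_kg", {"lbs": 40, "x": 1})   # "bad arguments for lbs_to_kg: ..."
check_call("lbs_to_kg", {"lbs": 100})          # None (a wrong value goes undetected)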

Another mode of planning failure is goal failure: the agent fails to achieve the goal. This may be because the plan does not address the task, or although it addresses the task, it does not follow the constraints. For example, suppose you ask the model to plan a two-week trip from San Francisco to India with a budget of $5000. The agent might plan a trip from San Francisco to Vietnam, or it might plan a two-week trip from San Francisco to India, but the cost far exceeds the budget.

In agent evaluation, a commonly overlooked constraint is time. In many cases, the time it takes for the agent to complete a task is not that important, as tasks can be handed off to the agent and checked upon completion. In many other cases, however, the value of the agent's output diminishes over time. For example, if you ask the agent to prepare a grant proposal and it finishes only after the funding deadline, its help is not very useful.

An agent may generate an effective plan to complete a task using the correct tools, but it may not be efficient. Here are several aspects you might want to track to evaluate agent efficiency:

  1. How many steps does the agent typically need to complete a task?
  2. What is the average cost of the agent completing a task?
  3. How long does each operation typically take? Are there particularly time-consuming or expensive operations?
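
A minimal sketch of tracking these per-operation metrics, assuming every tool call is routed through a single wrapper (the cost accounting is a placeholder):

import time
from collections import defaultdict

step_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0, "cost": 0.0})

def timed_call(name: str, func, *args, cost_per_call: float = 0.0, **kwargs):
    # Wrap a tool call to record how often it runs, how long it takes, and what it costs.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    stats = step_stats[name]
    stats["calls"] += 1
    stats["seconds"] += time.perf_counter() - start
    stats["cost"] += cost_per_call
    return result

# After a run, step_stats shows which operations dominate step count, latency, and cost.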

You can compare these metrics against your baseline, which could be another agent or a human. When comparing AI agents to humans, keep in mind that the operational modes of humans and AI are vastly different, so behaviors that are efficient for humans may be inefficient for AI, and vice versa. For example, visiting 100 web pages may be inefficient for a human who can only visit one page at a time, but it is a breeze for an AI agent that can access all pages simultaneously.

Original link: https://huyenchip.com/2025/01/07/agents.html
