LLM Agents

Agent

Required Readings:

A Survey on Large Language Model based Autonomous Agents

More Readings:

Position Paper: Agent AI Towards a Holistic Intelligence

Tool Use in LLMs

Practices for Governing Agentic AI Systems

Emergent autonomous scientific research capabilities of large language models

What Makes a Dialog Agent Useful?

Blog: In this session, our blog covers:

Position Paper: Agent AI Towards a Holistic Intelligence

1     Introduction

  1. Agent AI is an intelligent agent capable of autonomously executing appropriate and contextually relevant actions based on sensory input, whether in a physical, virtual, or mixed-reality environment.
  2. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI).

2     Agent AI Fundamentals

  1. Learning: An agent needs to observe its environment, understand the impact of its actions on that environment, and learn from human demonstrations.
  2. Memory: Long-term memory is the whole knowledge base of an agent; short-term memory is the history of actions taken and perceptions observed during the actions.
  3. Action: Real-world operations often cannot be completed in one shot and thus require multi-round interactions between humans or the environment and the agent.
  4. Perception: Visual and video perception are crucial.
  5. Planning: Planning should be goal-oriented so that it can enable flexible operations.
  6. Cognitive Aspects: The agent's ability to focus on the utility of the system as a whole.

3     Agent AI Categorization

  1. Manipulation Action: low-level fine action manipulation.
  2. Intention Action: high-level transmission of a robot's or human's intent, e.g., an instruction.

4     Robotics

SayCan

  1. A significant weakness of language models is that they lack real-world experience.
  2. SayCan extracts and leverages the knowledge within LLMs in physically-grounded tasks.
  3. Instruction Relevance with LLMs: The probability that a skill makes progress toward actually completing the instruction
  4. Skill Affordances with Value Functions: The probability of completing the skill successfully from the current state
  5. Given a high-level instruction, SayCan combines the probabilities from an LLM with the probabilities from a value function to select the skill to perform. This emits a skill that is both possible and useful.
  6. The process is repeated by appending the selected skill to the response and querying the models again, until the output step is to terminate.
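
Below is a minimal Python sketch of this selection rule. The llm_logprob and value_fn interfaces are illustrative assumptions, not SayCan's actual API:

```python
import math

def select_skill(instruction, state, skills, llm_logprob, value_fn):
    """SayCan-style selection: pick the skill that is both useful and feasible."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        p_say = math.exp(llm_logprob(skill, instruction))  # p(skill | instruction): "Say"
        p_can = value_fn(skill, state)                     # p(success | state): "Can"
        if p_say * p_can > best_score:
            best_skill, best_score = skill, p_say * p_can
    return best_skill
```

In the full loop, the selected skill is appended to the prompt and selection is repeated until the model emits a termination step.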

5     Gaming

ArK (Augmented Reality with Knowledge Interactive Emergent Ability)

  1. ArK leverages knowledge-memory to generate scenes in unseen physical-world and virtual-reality environments.
  2. At inference time, we first generate an image from the input text to learn the prior.
  3. The knowledge agent then generates a question and answer tuple which is fed as an input to GPT-3.5.
  4. The output of GPT-3.5 is an enhanced version of the input text with added information from external knowledge sources.
  5. This text is then given to ChatGPT, which outputs the spatial arrangements and low-level program-synthesis code.
  6. Finally, this code is rendered using the Unity engine to output the desired 3D object.
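
The pipeline above can be summarized in a hedged sketch; every helper function here is a hypothetical placeholder, not ArK's actual API:

```python
def ark_pipeline(input_text, text_to_image, knowledge_qa,
                 call_gpt35, call_chatgpt, render_in_unity):
    """Illustrative end-to-end sketch of the ArK steps described above."""
    image_prior = text_to_image(input_text)           # 1. generate an image to learn the prior
    qa_tuple = knowledge_qa(input_text, image_prior)  # 2. knowledge agent builds a (Q, A) tuple
    enriched_text = call_gpt35(input_text, qa_tuple)  # 3. GPT-3.5 adds external knowledge
    scene_code = call_chatgpt(enriched_text)          # 4. ChatGPT emits layout + program code
    return render_in_unity(scene_code)                # 5. Unity renders the 3D object
```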

6     Interactive HealthCare

  1. Diagnostic Agents: Medical chatbots offer a pathway to improve healthcare for millions of people across languages, cultures, and health conditions. Initial results using healthcare-knowledgeable LLMs trained on large-scale web data show promise, but these agents suffer from hallucinations.
  2. Knowledge Retrieval Agents: Pairing diagnostic agents with medical knowledge-retrieval agents can reduce hallucinations and improve response quality and precision.
  3. Telemedicine and Remote Monitoring: Agent-based AI in Telemedicine and Remote Monitoring can enhance healthcare access, improve communication between healthcare providers and patients, and increase the efficiency of doctor-patient interactions.

7     Conclusion and Future Directions

  1. We already have some great work on agent AI in robotics, but other fields are still under exploration.
  2. There are many potential research directions, such as:
    • Exploring new paradigms
    • Developing methodologies for grounding different modalities
    • Generating intuitive human interfaces
    • Better taming LLMs/VLMs
    • Bridging the gap between simulation and reality.

What Are Tools Anyway? A Survey from the Language Model Perspective

1     Introduction

  1. Language models often struggle to perform tasks that require complex skills, and they cannot solve tasks that require access to information not included in their training data.
  2. Thus, more and more work is turning to language models enhanced with tools.

2     What are Tools?

  1. Tools are often computer programs that are executable in the corresponding environment of the language model.
  2. Definition: A language-model-used tool is a function interface to a computer program that runs externally to the language model; the language model generates the function calls and the input arguments to use the tool.
  3. Tools either extend the language model's capabilities or facilitate task solving.

3     Why are Tools Helpful?

  1. Tools help task-solving in a variety of ways​.
  2. There are three main categories of tools:
    • Perception
      • Provide and collect information from the environment
    • Action
      • Exert actions on the environment and change its state
    • Computation
      • Use programs to tackle complex computational tasks​
  3. Tools can fall into multiple categories; a toy sketch of the three categories follows.
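
As one illustration (all function names here are invented, not from the paper), each category can be thought of as a callable exposed to the LM:

```python
def get_weather(city: str) -> str:
    """Perception: provides/collects information from the environment."""
    raise NotImplementedError("would query a weather API")

def send_email(to: str, body: str) -> None:
    """Action: exerts an action on the environment and changes its state."""
    raise NotImplementedError("would call a mail service")

def mean(xs: list[float]) -> float:
    """Computation: offloads an exact calculation to a program."""
    return sum(xs) / len(xs)
```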

4     Tool Use Paradigm

  1. At each step of the output process, the language model decides whether to generate text or a tool call.
  2. Thus, shifting between text-generation mode and tool-execution mode is key; a minimal loop is sketched below.
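
A minimal sketch of this loop, assuming an lm_generate function and a simple "TOOL:" calling convention (both are illustrative assumptions, not a standard):

```python
def parse_call(step):
    """Parse 'TOOL: name(arg1, arg2)' into ('name', ['arg1', 'arg2'])."""
    call = step[len("TOOL:"):].strip()
    name, rest = call.split("(", 1)
    args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
    return name.strip(), args

def agent_loop(lm_generate, tools, prompt, max_steps=10):
    """Alternate between text generation and tool execution until done."""
    context = prompt
    for _ in range(max_steps):
        step = lm_generate(context)           # the LM decides: text or tool call
        if step.startswith("TOOL:"):          # e.g. "TOOL: calculator(2+2)"
            name, args = parse_call(step)
            result = tools[name](*args)       # switch to tool-execution mode
            context += f"\n{step}\nRESULT: {result}"
        else:
            return step                       # plain text: treat as the final answer
    return context
```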

  1. Language models learn to use tools in two ways:
    • Inference-time prompting
      • In-context learning
      • Provide task instructions and example pairs of queries and solutions that use tools​
    • Learning by training​
      • Trained on examples that use tools​
      • LMs trained to generate tool-using solutions​

5     Scenarios For Tools

  1. The paper includes a chart of tool categories with useful examples of each category.

  2. Tools are less useful on tasks that cannot easily be performed with non-ML methods.

    • These are tasks that a powerful LM can already perform alone, such as sentiment analysis.
    • For such tasks the leveraged tools are themselves neural networks, offering limited advantage over the base LM.

6     Tool Selection and Usage

  1. How do we choose which tools to use for our tasks? There are three main scenarios:
    • Tools designated for task​
      • If there are specific tools designed for your task, no tool selection is necessary.
    • If we have a small number of tools (5-10) in our toolbox:
      • Provide metadata and use cases of tools as input contexts along with user query​
      • LM directly selects​
    • If we have a large toolbox (>10 tools):
      • Train a separate retriever model to shortlist the most relevant tools (see the sketch below)
      • Then provide that shortlist to the LM
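
A hedged sketch of the retriever step, assuming an embed function such as any off-the-shelf sentence-embedding model (the interface is an assumption):

```python
import numpy as np

def shortlist_tools(query, tool_descriptions, embed, k=5):
    """Return the k tool names whose descriptions are most similar to the query."""
    q = embed(query)
    scored = []
    for name, desc in tool_descriptions.items():
        d = embed(desc)
        sim = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((sim, name))                 # cosine similarity per tool
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```

The metadata of the shortlisted tools is then placed in the LM's context, as in the small-toolbox case.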

7     Tools in Programmatic Contexts

  1. Code language models can solve problems by generating programs​
  2. Tools can be seen as compositions of basic functions​
  3. Some main categories and examples of programmatic tools are shown in a figure in the paper; a toy composition example follows.
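
As a toy illustration of tools as compositions of basic functions (all names invented for this sketch):

```python
def locate(entity: str) -> str:
    """Basic function: look up an entity's location."""
    raise NotImplementedError("would query a knowledge base")

def forecast(city: str) -> str:
    """Basic function: query a weather service."""
    raise NotImplementedError("would call a weather API")

def weather_at_entity(entity: str) -> str:
    """Composed tool: chain locate -> forecast as one reusable shortcut."""
    return forecast(locate(entity))
```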

8     Tool Creation

  1. Language models can be used to make tools for tasks that do not have readily available ones.​
  2. Examples​:
    • Compose frequently-used-together actions as shortcut tools​
    • Design an automatic learning curriculum to make and use Java program tools​

9     Evaluating Tool Usage

  1. One way to evaluate tool usage is to utilize repurposed existing datasets that can additionally benefit from tools.
    • These are tasks that LMs can solve, but only with difficulty.
  2. Another way is to design newly crafted benchmarks that necessitate tool use​.
    • Perform example generation given a selected set of tools. These examples are either:
      • Human annotated​
      • Created using LMs

10     Properties

  1. The main properties that are being measured for tools at the moment are:
    • Task completion​
    • Tool selection​
    • Tool reusability
  2. The authors argue that the following properties are missing and should also be measured:
    • Efficiency of tool integration​
    • Tool quality​
    • Reliability of unstable tools​
    • Reproducible testing​
    • Safe Usage​​

11     Results

  1. Tasks that cover multiple domains show the largest gains when utilizing tools.
  2. The best results came from ToolAlpaca, a framework designed to generate a diverse tool-use corpus.
  3. The worst results came from multilingual tasks, which showed degradation.
  4. Training time vs inference time cost is a consideration
    • Training only needs to be completed once, whereas inference happens every usage

Emergent autonomous scientific research capabilities of large language models​

1     Introduction

2     Overview of the system architecture​

3     Web search for synthesis planning​

Figure: Agent’s capabilities in the synthesis planning task. A. Ibuprofen synthesis. B. Aspirin​ synthesis. C. Suzuki reaction mechanism study, where the Agent had to choose how to study the​ mechanism. D. Aspartame synthesis.

4     Vector search for document retrieval​

5     Mastering Automation: multi-instrument systems controlled by natural language

Figure: A. Overview of the Agent's configuration. B-E. Drawing geometrical figures. F. The Agent solves a color identification problem using UV-Vis data.

6     Discussion

A Survey on Large Language Model based Autonomous Agents

1     Overview:

Autonomous Agents

This survey is a systematic review of existing studies in the field of LLM-based agents, focusing on three aspects: agent construction, application, and evaluation.

2     LLM-based Autonomous Agent Construction:

LLM-based autonomous agents are expected to effectively perform diverse tasks by leveraging the human-like capabilities of LLMs. To achieve this goal, there are two significant aspects: (1) which architecture should be designed to better use LLMs, and (2) given the designed architecture, how to enable the agent to acquire capabilities for accomplishing specific tasks. The overall structure of the framework is illustrated in Figure 2 of the survey.

2.1   Profiling Module:

The profiling module aims to indicate the profiles of the agent roles, which are​ usually written into the prompt to influence the LLM behaviors.​ ​
Profile Contents:
  • basic information such as age, gender, and career​​
  • psychology information, reflecting the personalities of the agents​
  • social information, detailing the relationships between agents​

Generation Strategies:
  1. Handcrafting Method: Agent profiles are manually specified. For instance, to design agents with different personalities, one can use "you are an outgoing person" or "you are an introverted person" to profile the agent.
  2. LLM-generation Method: Agent profiles are automatically generated based on LLMs. Typically, it begins by indicating the profile generation rules, elucidating the composition and attributes of the agent profiles​ within the target population. ​
  3. Dataset Alignment Method: Here, agent profiles are obtained from real-world datasets.

2.2   Memory Module:

The memory module can help the agent to accumulate experiences, self-evolve,​ and behave in a more consistent, reasonable, and effective manner.​
Memory Structures: ​​ ​
  1. Unified Memory: Simulates human short-term memory, usually realized via in-context learning; the memory information is written directly into the prompts.
  2. Hybrid Memory: Explicitly models human short-term and long-term memory. Short-term memory temporarily buffers recent perceptions, while long-term memory consolidates important information over time.

Memory Formats: ​​ ​
  1. Natural Languages: Memory information is described directly in raw natural language.
  2. Embeddings: Memory information is encoded into embedding vectors, which enhances memory retrieval and reading efficiency.
  3. Databases: Memory information is stored in databases, allowing the agent to manipulate memories efficiently and comprehensively.
  4. Structured Lists: Memory information is organized into lists, conveying the semantics of memory in an efficient and concise manner.

Memory Operations: ​​ ​
  1. Memory Reading: The objective of memory reading is to extract meaningful information from memory to enhance the agent's actions, for example reusing previously successful actions to achieve similar goals. The literature commonly scores memories with an extraction equation (a reconstruction is sketched after this list).
  2. Memory Writing: The purpose of memory writing is to store information about the perceived environment in memory. Two potential problems must be carefully addressed: (a) memory duplication and (b) memory overflow.
  3. Memory Reflection: To independently summarize and infer more abstract, complex, and high-level information.
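
The extraction equation is not reproduced in these notes. A common formulation from the literature (e.g., the retrieval score popularized by Generative Agents; the notation and weights below are a hedged reconstruction, not necessarily the survey's exact equation) scores each memory m in the store M against the current query q:

```latex
m^{*} = \arg\max_{m \in M} \;
        \alpha \, s^{\mathrm{rec}}(q, m)
      + \beta  \, s^{\mathrm{rel}}(q, m)
      + \gamma \, s^{\mathrm{imp}}(q, m)
```

where s^rec, s^rel, and s^imp score a memory's recency, relevance to the query, and importance, and α, β, γ are scalar weights; the highest-scoring memories are read into the prompt.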

2.3   Planning Module:

The planning module aims to empower the agents with the human capability of decomposing a task into subtasks, which is expected to make the agent behave more reasonably, powerfully, and reliably.
Planning without Feedback:
  1. Single-path Reasoning: In this strategy, the final task is decomposed into several intermediate steps. These steps are connected in a cascading manner, with each step leading to only one subsequent step. LLMs follow these steps to achieve the final goal.​
  2. Multi-path Reasoning: In this strategy, the reasoning steps for generating the final plans are organized into a tree-like structure. Each intermediate step may have multiple subsequent steps. This approach is analogous to human thinking, as individuals may have multiple choices at each reasoning step
  3. External Planner: Despite the demonstrated power of LLMs in zero-shot planning, effectively generating plans for domain-specific problems remains highly challenging. To address this challenge, researchers turn to external planners. These tools are well-developed and employ efficient search algorithms to rapidly identify correct, or even optimal, plans.

Planning with Feedback: To tackle complex human tasks, individual​ agents may iteratively make and revise their plans based on external​ feedback.​ ​
  1. Environmental Feedback: This feedback is obtained from the​ objective world or virtual environment. ​​
  2. Human Feedback: Directly interacting with humans is also a very intuitive strategy to enhance the agent's planning capability.
  3. Model Feedback: Apart from the aforementioned environmental and human feedback, which are external signals, researchers have also investigated internal feedback from the agents themselves. A plan-revision loop covering these feedback types is sketched below.
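
A minimal sketch of this make-and-revise loop; the lm_plan, execute, and get_feedback interfaces are assumptions for illustration:

```python
def plan_and_revise(lm_plan, execute, get_feedback, task, max_rounds=3):
    """Iteratively make and revise a plan from external/internal feedback."""
    plan = lm_plan(task, feedback=None)          # initial plan
    outcome = None
    for _ in range(max_rounds):
        outcome = execute(plan)                  # act in the environment
        feedback = get_feedback(outcome)         # environmental, human, or model signal
        if feedback is None:                     # no complaints: plan succeeded
            return outcome
        plan = lm_plan(task, feedback=feedback)  # revise the plan with the feedback
    return outcome
```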

2.4   Action Module:

The action module is responsible for translating the agent’s​ decisions into specific outcomes.
Action goal:​ what are the intended outcomes of the actions?
  1. Task Completion: In this scenario, the agent’s actions are aimed at accomplishing specific tasks, such as crafting an iron pickaxe in Minecraft.​
  2. Communication: In this case, the actions are taken to communicate with the other agents or real humans for sharing information or collaboration. For example, the agents in ChatDev may communicate with each other to collectively accomplish software development tasks.
  3. Exploration: In this example, the agent aims to explore unfamiliar environments to expand its perception and strike a balance between exploring and exploiting. For instance, the agent in Voyager may explore unknown skills in their task completion process, and continually refine the skill execution code based on environment feedback through trial and error.

Action Production: how are the actions generated? ​
  1. Action via Memory Recollection: In this strategy, the action is generated by extracting information from the agent ​ memory according to the current task. The task and the extracted memories are ​ used as prompts to trigger the agent actions.​ ​​
  2. Action via Plan Following: In this strategy, the agent takes actions following its pre-generated plan.

Action space: what are the available actions?​ ​
  1. External Tools: APIs, databases and knowledge bases, and external models.
  2. Internal Knowledge: Planning capability, conversation capability, and common-sense understanding capability.

Action impact: what are the consequences of the actions?​ ​
  1. Changing Environments: Agents can directly alter environment states through actions, such as moving positions, collecting items, constructing buildings, etc.
  2. Altering Internal States: Actions taken by the agent can also change the agent itself, including updating memories, forming new plans, acquiring novel knowledge, and more.
  3. Triggering New Actions: In the task-completion process, one agent action can be triggered by another.

3     Agent Capability Acquisition

Considering LLMs as an operating system (OS), we have so far seen the 'hardware' perspective. Now we dive into the 'software' perspective, which can be interpreted as acquiring a specific task-solving ability (capability).

Capabilities can be acquired 1) with fine-tuning or 2) without fine-tuning.

3.1   Capability Acquisition with Fine-tuning

To fine-tune the model, we can use 1) human-annotated datasets, 2) LLM-generated datasets, or 3) real-world datasets.


  1. Human Annotated Datasets
    • Chain of Hindsight (CoH): Uses human preferences to label answers as 'good' or 'bad', then uses this hindsight information to produce better answers.
    • WebShop: Web-shopping simulation with human experts.
    • EduChat: Fine-tuned with a well-curated education dataset.
    • SWIFTSAGE: Fine-tuned with a human-annotated dataset to solve interactive reasoning tasks.

  2. LLM Generated Datasets
    • ToolBench: An LLM generates tasks such as solving math problems or web shopping. Agents learn to use tools (calculator API, web coding API) to solve the generated tasks; if a task is solved, the solution trajectory is saved into ToolBench (the LLM-generated dataset).
    • SandBox: Simulation for social capability; each agent has its own persona and interacts with other agents.

  3. Real World Datasets
    • MIND2WEB: Collects human-annotated data from real-world websites, for example the solution trajectory for booking a flight on a real website.

3.2   Capability Acquisition without Fine-tuning

To improve task-solving ability without fine-tuning, we can use 1) prompt engineering or 2) mechanism engineering.

We further categorize mechanism engineering into 1) trial-and-error, 2) crowd-sourcing, 3) experience accumulation, and 4) self-driven evolution.

3.2.1   Prompt Engineering

There are advanced prompt-engineering techniques such as Chain of Thought (CoT), Self-Consistency (CoT-SC), Tree of Thought (ToT), Graph of Thought (GoT), and Retrospective Prompting.

Retrospective Prompting uses a self-feedback (reflection) system that goes over previous answers and questions.

Figure: (left) without retrospection; (right) with retrospection.

For example, retrospecting on previous responses lets the agent better role-play the demon character 'Bogus'.

Retroformer uses a separate language model to produce the reflection response from a reflection prompt; a generic retrospection loop is sketched below.
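
A generic retrospection loop might look as follows (the prompt wording is illustrative only; Retroformer would route the critique step through a separate LM):

```python
def retrospective_answer(lm, question, rounds=2):
    """Answer, self-critique, and revise: a minimal reflection loop."""
    answer = lm(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        critique = lm(f"Question: {question}\nAnswer: {answer}\n"
                      "Critique this answer and point out mistakes:")
        answer = lm(f"Question: {question}\nPrevious answer: {answer}\n"
                    f"Critique: {critique}\nImproved answer:")
    return answer
```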

3.2.2   Mechanism Engineering

  1. Trial-and-error: Take action and get feedback
    • RAH: The agent serves as an assistant and gets feedback from humans.
    • DEPS: The agent plans and executes its plans; from a failed plan, the agent gets feedback.
    • RoCo: Multiple robots collaborate with each other via language interaction; each agent proposes a sub-plan.
    • PREFER: The agent evaluates its performance on a subset of data to solve a task; if the task fails, the agent generates feedback information from the failure.

  2. Crowd-sourcing: Debate and reach consensus
    • Self-consistency with multiple agents: A generalization of self-consistency to multiple agents. Each round, answers are checked for consistency across agents; if they disagree, the agents proceed to another round that incorporates the reasoning of the other agents (sketched below).
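
A hedged sketch of this debate procedure, treating each agent as a callable that maps a prompt to an answer string (an assumption for illustration):

```python
from collections import Counter

def debate(agents, question, max_rounds=3):
    """Multi-agent self-consistency: answer, compare, re-answer until consensus."""
    answers = [agent(question) for agent in agents]      # round 1: independent answers
    for _ in range(max_rounds - 1):
        top, freq = Counter(answers).most_common(1)[0]
        if freq == len(agents):                          # all agents agree
            return top
        shared = "\n".join(answers)                      # share everyone's reasoning
        answers = [agent(f"{question}\nOther agents answered:\n{shared}\nYour answer:")
                   for agent in agents]
    return Counter(answers).most_common(1)[0][0]         # fall back to majority vote
```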

GITM: The big difference from RL is that the agent does not directly use raw information from the environment; planning and execution are done by the LLM agent.

Voyager: Solves tasks using a skill library; once a higher-level task is successfully solved, the skill is saved to the skill library.

  3. Experience Accumulation: Explore and use memory
    • GITM: The agent explores to gather experience for problem solving. Once it accomplishes a task, the experience is stored in memory; when the agent encounters a similar task, it reuses the relevant memory.
    • Voyager: The agent has a skill library in which each skill is represented as executable code. Based on feedback from the environment, the agent learns how to use skills (a toy skill library is sketched below).
    • MemPrompt: Users provide feedback, which is stored in the agent's memory.
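
A toy Voyager-style skill library, assuming an embed function that maps text to a vector (this interface is an assumption, not Voyager's actual implementation):

```python
import numpy as np

class SkillLibrary:
    def __init__(self, embed):
        self.embed = embed
        self.skills = []                         # list of (description vector, code)

    def add(self, description: str, code: str):
        """Save a successfully verified skill as executable code for later reuse."""
        self.skills.append((self.embed(description), code))

    def retrieve(self, task: str):
        """Return the stored skill code most similar to the new task, if any."""
        if not self.skills:
            return None
        q = self.embed(task)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v, _ in self.skills]
        return self.skills[int(np.argmax(sims))][1]
```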

NLSOM: Multiple VLMs interact with each other. An organizer LLM aggregates answers and generates a better prompt (a 'mindstorm'); a leader LLM outputs an answer given the better prompt. This is self-driven learning among multiple agents.

LMA3: The agent sets goals with a goal generator and gets feedback from a reward function. A relabeler adjusts the reward function based on knowledge from the LLM agent; a policy executes actions and interacts with the external world.

  4. Self-driven Evolution: Set goals for themselves and use self-motivation
    • LMA3: The agent sets goals for itself and improves by exploring the environment and receiving feedback.
    • SALLM-MS: Uses multi-agent systems; capability is acquired through a self-driven system among the agents.
    • CLMTWA: A teacher-student scheme in which a strong LLM serves as the teacher and a weak LLM as the student.
    • NLSOM: A multi-agent system with VLMs.

4   Application

We examine LLM agents' applications in three distinct areas: 1) social science, 2) natural science, and 3) engineering.

4.1   Social Science

Social science is devoted to the study of societies and the relationships among individuals within those societies. LLM-based autonomous agents can mimic human-like comprehension, reasoning, and problem-solving skills.

SandBox: Each agent has their own persona and personality designed by prompts. The agent interacts with other agents based on their persona.

4.2   Natural Science

Natural science is concerned with the description, understanding and prediction of natural phenomena, based on empirical evidence from observation and experimentation.

ChatMOF: The agent uses toolkits to search data, predict property, and generate metal-organic frameworks (MOF).

EduChat: Agent provides personalized, equitable, and empathetic educational support to teachers, students, and parents through dialogue.

4.3   Engineering

LLM-based autonomous agents have shown great potential in assisting and enhancing engineering research and applications.

ChatDev: Each agent has its own role specified to solve a specific task, such as developing a program. The agents interact with each other, and the simulation evolves on top of these interactions.

When2Ask: Coordinates the interaction between the agent and the LLM based on the Planner-Actor-Mediator framework. This framework can be used for robotics and embodied AI.

5   Evaluation

We introduce two evaluation strategies, 1) subjective and 2) objective evaluation, to assess the effectiveness of LLM agents.

5.1   Subjective Evaluation

It is suitable for scenarios where there are no evaluation datasets or where it is very hard to design quantitative metrics, for example evaluating the agent's intelligence or user-friendliness.


Measures such as emotion must be evaluated with human involvement.

5.2   Objective Evaluation

Objective metrics aim to provide concrete, measurable insights into agent performance, along three aspects: 1) evaluation metrics, 2) protocols, and 3) benchmarks.
One such benchmark evaluates interaction with other agents or humans by examining how the agent requests various kinds of help.

6   Challenges
