Agent


Recent Readings for Agent Topics of Foundation Models (since 2022) (Index of Posts):

No. | Read Date     | Title and Information                          | We Read @
1   | 2025, Mar, 31 | Agent - multiagent collaboration               | 2025-S4
2   | 2025, Mar, 26 | Agent Safety                                   | 2025-S4
3   | 2025, Mar, 24 | Agent - Planning / World Model                 | 2025-S4
4   | 2025, Mar, 19 | Platform - long context vs RAG + Hallucination | 2025-S4


Here is a detailed list of posts!



[1]: Agent - multiagent collaboration


Multiagent

In this session, our readings cover:

Required Readings:

GUI Agents: A Survey

  • [Submitted on 18 Dec 2024]
  • Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
  • Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
OmniParser v2: Advanced vision-based screen parsing for precisely grounded UI actions
  • OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
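
As a rough illustration of what "grounding" means here, the sketch below shows how an agent could turn a parser's structured element list into a pixel-level click. The `UIElement` dataclass, `ground_action` helper, and example elements are hypothetical, not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    """One parsed screen element: an id, a text/icon description, and a bounding box."""
    elem_id: int
    description: str
    bbox: tuple  # (x_min, y_min, x_max, y_max) in screen pixels

def ground_action(chosen_id: int, elements: list[UIElement]) -> dict:
    """Map the element id selected by the vision-language model to a concrete click
    at the center of that element's bounding box."""
    elem = next(e for e in elements if e.elem_id == chosen_id)
    x_min, y_min, x_max, y_max = elem.bbox
    return {"action": "click", "x": (x_min + x_max) / 2, "y": (y_min + y_max) / 2}

# Hypothetical parsed screen (the kind of structured output a screen parser would emit).
elements = [
    UIElement(0, "Search box", (100, 40, 500, 80)),
    UIElement(1, "Submit button", (520, 40, 600, 80)),
]

# Suppose the model replied "click element 1"; grounding turns that into coordinates.
print(ground_action(1, elements))  # {'action': 'click', 'x': 560.0, 'y': 60.0}
```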

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

  • Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, Hoang D. Nguyen
  • With recent advances in Large Language Models (LLMs), Agentic AI has become phenomenal in real-world applications, moving toward multiple LLM-based agents to perceive, learn, reason, and act collaboratively. These LLM-based Multi-Agent Systems (MASs) enable groups of intelligent agents to coordinate and solve complex tasks collectively at scale, transitioning from isolated models to collaboration-centric approaches. This work provides an extensive survey of the collaborative aspect of MASs and introduces an extensible framework to guide future research. Our framework characterizes collaboration mechanisms based on key dimensions: actors (agents involved), types (e.g., cooperation, competition, or coopetition), structures (e.g., peer-to-peer, centralized, or distributed), strategies (e.g., role-based or model-based), and coordination protocols. Through a review of existing methodologies, our findings serve as a foundation for demystifying and advancing LLM-based MASs toward more intelligent and collaborative solutions for complex, real-world use cases. In addition, various applications of MASs across diverse domains, including 5G/6G networks, Industry 5.0, question answering, and social and cultural settings, are also investigated, demonstrating their wider adoption and broader impacts. Finally, we identify key lessons learned, open challenges, and potential research directions of MASs towards artificial collective intelligence.

Magentic-One: A generalist multi-agent system built on AutoGen
  • Magentic-One employs a multi-agent architecture where a lead agent, the Orchestrator, directs four other agents to solve tasks. The Orchestrator plans, tracks progress, and re-plans to recover from errors, while directing specialized agents to perform tasks like operating a web browser, navigating local files, or writing and executing Python code.
  • Magentic-One achieves statistically competitive performance to the state-of-the-art on multiple challenging agentic benchmarks, without requiring modifications to its core capabilities or architecture. Built on AutoGen, our popular open-source multi-agent framework, Magentic-One’s modular, multi-agent design offers numerous advantages over monolithic single-agent systems. By encapsulating distinct skills in separate agents, it simplifies development and reuse, similar to object-oriented programming. Magentic-One’s plug-and-play design further supports easy adaptation and extensibility by enabling agents to be added or removed without needing to rework the entire system—unlike single-agent systems, which often struggle with inflexible workflows.
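
A minimal, framework-agnostic sketch of the orchestration pattern described above (a lead agent that plans, delegates to specialists, tracks progress, and re-plans on failure) is given below. The `Agent` stub and `orchestrate` loop are illustrative inventions, not Magentic-One's or AutoGen's actual interfaces.

```python
class Agent:
    """A specialist worker that attempts one step and reports success or failure."""
    def __init__(self, name):
        self.name = name

    def run(self, step: str) -> bool:
        print(f"[{self.name}] executing: {step}")
        return True   # a real agent would browse the web, edit files, or run code here

def orchestrate(workers: dict, plan: list, max_replans: int = 2) -> str:
    """Lead-agent loop: execute each planned step, track progress, and re-plan on failure."""
    for attempt in range(max_replans + 1):
        failed_step = None
        for worker_name, step in plan:
            if not workers[worker_name].run(step):
                failed_step = step
                break
        if failed_step is None:
            return "task complete"
        # A real Orchestrator would ask an LLM to revise `plan` given the failure and progress.
        print(f"Re-planning after failure at {failed_step!r} (attempt {attempt + 1})")
    return "gave up within the re-planning budget"

workers = {"web": Agent("WebSurfer"), "files": Agent("FileSurfer"), "coder": Agent("Coder")}
plan = [("web", "search the web for background on the task"),
        ("files", "save findings to notes.md"),
        ("coder", "write and run a summary script")]
print(orchestrate(workers, plan))
```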

Agent-as-a-Judge: Evaluate Agents with Agents

  • [Submitted on 14 Oct 2024 (v1), last revised 16 Oct 2024 (this version, v2)]
  • Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
  • Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes – ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems – by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement. Comments: The project can be found at this https URL. The dataset is released at this https URL
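
To make the step-level judging idea concrete, here is a minimal sketch in which hierarchical requirements are checked against an agent's trajectory. The `llm_check` stub and the requirement structure are hypothetical stand-ins; the actual framework drives a judge agent that can read the workspace and execute code.

```python
def llm_check(requirement: str, trajectory: list[str]) -> bool:
    """Stub for a judge-agent query: 'is this requirement satisfied by the trajectory?'
    A real judge would call an LLM with access to the agent's files and logs."""
    return any(requirement.lower() in step.lower() for step in trajectory)

def agent_as_judge(requirements: dict, trajectory: list[str]) -> dict:
    """Walk hierarchical requirements (parent -> children) and score each one,
    only crediting a parent when its children are also satisfied."""
    scores = {}
    for parent, children in requirements.items():
        child_ok = [llm_check(c, trajectory) for c in children]
        scores[parent] = llm_check(parent, trajectory) and all(child_ok)
    return scores

requirements = {
    "load the dataset": ["parse CSV", "handle missing values"],
    "train a classifier": ["split train/test", "report accuracy"],
}
trajectory = ["parse CSV into a dataframe", "handle missing values by imputation",
              "load the dataset", "split train/test", "train a classifier", "report accuracy"]
print(agent_as_judge(requirements, trajectory))
```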

More Readings:

A Survey on Large Language Model based Autonomous Agents

  • [Submitted on 22 Aug 2023 (v1), last revised 15 Dec 2024 (this version, v6)]
  • URL
  • Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
  • Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes it hard for the agents to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at this https URL. Comments: 35 pages, 5 figures, 3 tables

Deploying Foundation Model Powered Agent Services: A Survey

  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen
  • [Submitted on 18 Dec 2024]
  • Foundation model (FM) powered agent services are regarded as a promising solution to develop intelligent and personalized applications for advancing toward Artificial General Intelligence (AGI). To achieve high reliability and scalability in deploying these agent services, it is essential to collaboratively optimize computational and communication resources, thereby ensuring effective resource allocation and seamless service delivery. In pursuit of this vision, this paper proposes a unified framework aimed at providing a comprehensive survey on deploying FM-based agent services across heterogeneous devices, with the emphasis on the integration of model and resource optimization to establish a robust infrastructure for these services. Particularly, this paper begins with exploring various low-level optimization strategies during inference and studies approaches that enhance system scalability, such as parallelism techniques and resource scaling methods. The paper then discusses several prominent FMs and investigates research efforts focused on inference acceleration, including techniques such as model compression and token reduction. Moreover, the paper also investigates critical components for constructing agent services and highlights notable intelligent applications. Finally, the paper presents potential research directions for developing real-time agent services with high Quality of Service (QoS).

[2]: Agent Safety


Safety Agent

In this session, our readings cover:

Required Readings:

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

  • Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
  • [Submitted on 11 Oct 2024 (v1), last revised 14 Oct 2024 (this version, v2)]
  • The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents – which use external tools and can execute multi-stage tasks – may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at this https URL.
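
The benchmark's two-part scoring idea (measure whether the agent refuses, and, when it does not, how capably it completes the multi-step task) can be sketched roughly as below. `refusal_detector`, `task_score`, and the toy episodes are hypothetical stand-ins for AgentHarm's actual graders.

```python
def refusal_detector(response: str) -> bool:
    """Crude stand-in: flag common refusal phrasing. The real benchmark uses model-based grading."""
    return any(p in response.lower() for p in ("i can't", "i cannot", "i won't"))

def task_score(response: str, rubric: list[str]) -> float:
    """Fraction of rubric items reflected in the agent's trace (stand-in for per-task graders)."""
    return sum(item.lower() in response.lower() for item in rubric) / len(rubric)

def evaluate(episodes: list[tuple[str, list[str]]]) -> dict:
    """Report refusal rate and mean capability score over the episodes the agent did not refuse."""
    refusals, capability = 0, []
    for response, rubric in episodes:
        if refusal_detector(response):
            refusals += 1
        else:
            capability.append(task_score(response, rubric))
    n = len(episodes)
    return {"refusal_rate": refusals / n,
            "capability_when_compliant": sum(capability) / len(capability) if capability else 0.0}

# Two toy episodes: one refusal, one compliant trace scored against its rubric.
print(evaluate([("I can't help with that request.", ["step a", "step b"]),
                ("step a done, then step b done", ["step a", "step b"])]))
```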

UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

  • [Submitted on 28 Feb 2025]
  • Jiawei Zhang, Shuang Yang, Bo Li
  • Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for handling complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements also amplify the risks of adversarial attacks, particularly when LLM agents can access sensitive external functionalities. Moreover, because LLM agents engage in extensive reasoning or planning before executing final actions, manipulating them into performing targeted malicious actions or invoking specific tools remains a significant challenge. Consequently, directly embedding adversarial strings in malicious instructions or injecting malicious prompts into tool interactions has become less effective against modern LLM agents. In this work, we present UDora, a unified red teaming framework designed for LLM Agents that dynamically leverages the agent’s own reasoning processes to compel it toward malicious behavior. Specifically, UDora first samples the model’s reasoning for the given task, then automatically identifies multiple optimal positions within these reasoning traces to insert targeted perturbations. Subsequently, it uses the modified reasoning as the objective to optimize the adversarial strings. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets.

More Readings:

The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies

  • [Submitted on 28 Jul 2024]
  • Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, Philip S. Yu
  • Inspired by the rapid development of Large Language Models (LLMs), LLM agents have evolved to perform complex tasks. LLM agents are now extensively applied across various domains, handling vast amounts of data to interact with humans and execute tasks. The widespread applications of LLM agents demonstrate their significant commercial value; however, they also expose security and privacy vulnerabilities. At the current stage, comprehensive research on the security and privacy of LLM agents is highly needed. This survey aims to provide a comprehensive overview of the newly emerged privacy and security issues faced by LLM agents. We begin by introducing the fundamental knowledge of LLM agents, followed by a categorization and analysis of the threats. We then discuss the impacts of these threats on humans, environment, and other agents. Subsequently, we review existing defensive strategies, and finally explore future trends. Additionally, the survey incorporates diverse case studies to facilitate a more accessible understanding. By highlighting these critical security and privacy issues, the survey seeks to stimulate future research towards enhancing the security and privacy of LLM agents, thereby increasing their reliability and trustworthiness in future applications.

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

  • Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales
  • [Submitted on 3 Feb 2024 (v1), last revised 17 Jun 2024 (this version, v2)]
  • Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that VLLM fine-tuning can cause forgetting of safety alignment previously learned by the underpinning LLM. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories. Our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns VLLMs. This alignment is achieved with minimal impact on, or even enhancement of, the models’ helpfulness. The versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing VLLMs, training new models or safeguarding pre-trained VLLMs. Empirical results demonstrate that fine-tuned VLLMs effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. The code and dataset are available at this https URL.
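
The core recipe, mixing a small amount of safety instruction data into standard vision-language fine-tuning (or applying it post hoc), reduces to a simple data-mixing step, sketched below. The example fields and the `mix_ratio` default are illustrative assumptions, not VLGuard's exact format or ratio.

```python
import random

def mix_safety_data(task_data: list[dict], safety_data: list[dict],
                    mix_ratio: float = 0.1, seed: int = 0) -> list[dict]:
    """Return a fine-tuning set with roughly `mix_ratio` safety examples appended and shuffled.
    Each example is assumed to be a dict like {"image": ..., "instruction": ..., "response": ...}."""
    rng = random.Random(seed)
    n_safety = int(mix_ratio * len(task_data))
    mixed = task_data + rng.sample(safety_data, min(n_safety, len(safety_data)))
    rng.shuffle(mixed)
    return mixed

task_data = [{"image": f"img_{i}.png", "instruction": "describe", "response": "..."} for i in range(100)]
safety_data = [{"image": f"unsafe_{i}.png", "instruction": "harmful request",
                "response": "I can't help with that."} for i in range(20)]
print(len(mix_safety_data(task_data, safety_data)))  # 110 examples in the mixed training set
```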

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey

  • [Submitted on 12 Jun 2024 (v1), last revised 18 Jun 2024 (this version, v2)]
  • Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Xu Guo, Dayong Ye, Wanlei Zhou, Philip S. Yu
  • With the rapid development of artificial intelligence, large language models (LLMs) have made remarkable advancements in natural language processing. These models are trained on vast datasets to exhibit powerful language understanding and generation capabilities across various applications, including machine translation, chatbots, and agents. However, LLMs have revealed a variety of privacy and security issues throughout their life cycle, drawing significant academic and industrial attention. Moreover, the risks faced by LLMs differ significantly from those encountered by traditional language models. Given that current surveys lack a clear taxonomy of unique threat models across diverse scenarios, we emphasize the unique privacy and security threats associated with five specific scenarios: pre-training, fine-tuning, retrieval-augmented generation systems, deployment, and LLM-based agents. Addressing the characteristics of each risk, this survey outlines potential threats and countermeasures. Research on attack and defense situations can offer feasible research directions, enabling more areas to benefit from LLMs.

Large Language Model Safety: A Holistic Survey

  • Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, Deyi Xiong
  • [Submitted on 23 Dec 2024]
  • The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academy researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers has been publicly available at this https URL.

MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

  • https://arxiv.org/pdf/2410.17520
  • [Submitted on 23 Oct 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
  • Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
  • Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications, challenging agents with managing risks encompassing misuse and negative side effects. These tasks include tests to evaluate the safety of agents in daily scenarios as well as their robustness against indirect prompt injection attacks. Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks. To mitigate these safety concerns, we propose a prompting method that encourages agents to prioritize safety considerations. While this method shows promise in promoting safer behaviors, there is still considerable room for improvement to fully earn user trust. This highlights the urgent need for continued research to develop more robust safety mechanisms in mobile environments. We open-source our benchmark at: this https URL.

Privacy-Preserving Large Language Models: Mechanisms, Applications, and Future Directions

  • Guoshenghui Zhao, Eric Song
  • [Submitted on 9 Dec 2024]
  • The rapid advancement of large language models (LLMs) has revolutionized natural language processing, enabling applications in diverse domains such as healthcare, finance and education. However, the growing reliance on extensive data for training and inference has raised significant privacy concerns, ranging from data leakage to adversarial attacks. This survey comprehensively explores the landscape of privacy-preserving mechanisms tailored for LLMs, including differential privacy, federated learning, cryptographic protocols, and trusted execution environments. We examine their efficacy in addressing key privacy challenges, such as membership inference and model inversion attacks, while balancing trade-offs between privacy and model utility. Furthermore, we analyze privacy-preserving applications of LLMs in privacy-sensitive domains, highlighting successful implementations and inherent limitations. Finally, this survey identifies emerging research directions, emphasizing the need for novel frameworks that integrate privacy by design into the lifecycle of LLMs. By synthesizing state-of-the-art approaches and future trends, this paper provides a foundation for developing robust, privacy-preserving large language models that safeguard sensitive information without compromising performance.
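
Among the surveyed mechanisms, differential privacy is the most compact to illustrate: clip each example's gradient to a norm bound and add calibrated Gaussian noise before averaging, as in DP-SGD. The NumPy sketch below shows only that aggregation step on toy gradients and omits the privacy accounting a real deployment would need.

```python
import numpy as np

def dp_average_gradients(per_example_grads: np.ndarray, clip_norm: float = 1.0,
                         noise_multiplier: float = 1.0, seed: int = 0) -> np.ndarray:
    """DP-SGD-style aggregation: clip each per-example gradient to `clip_norm`,
    sum, add Gaussian noise scaled to the clip bound, and average."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)

grads = np.random.randn(32, 8)            # 32 examples, 8 parameters (toy values)
print(dp_average_gradients(grads).shape)  # (8,) noisy averaged gradient
```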

[3]: Agent - Planning / World Model


Planning

In this session, our readings cover:

Required Readings:

NVIDIA World Foundation Models

  • https://www.nvidia.com/en-us/glossary/world-models/
  • https://blogs.nvidia.com/blog/openusd-advances-physical-ai/
  • https://www.nvidia.com/en-us/ai/cosmos/
  • https://www.nvidia.com/en-us/glossary/synthetic-data-generation/?ncid=no-ncid

AI Planning: A Primer and Survey (Preliminary Report)

  • Dillon Z. Chen, Pulkit Verma, Siddharth Srivastava, Michael Katz, Sylvie Thiébaux
  • [Submitted on 7 Dec 2024]
  • Automated decision-making is a fundamental topic that spans multiple sub-disciplines in AI: reinforcement learning (RL), AI planning (AP), foundation models, and operations research, among others. Despite recent efforts to “bridge the gaps” between these communities, there remain many insights that have not yet transcended the boundaries. Our goal in this paper is to provide a brief and non-exhaustive primer on ideas well-known in AP, but less so in other sub-disciplines. We do so by introducing the classical AP problem and representation, and extensions that handle uncertainty and time through the Markov Decision Process formalism. Next, we survey state-of-the-art techniques and ideas for solving AP problems, focusing on their ability to exploit problem structure. Lastly, we cover subfields within AP for learning structure from unstructured inputs and learning to generalise to unseen scenarios and situations.
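
Because the survey grounds planning under uncertainty in the Markov Decision Process formalism, a worked example may help: value iteration repeatedly applies the Bellman optimality backup V(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]. Below is a minimal NumPy implementation on a made-up two-state, two-action MDP (the transition and reward tensors are illustrative only).

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[a, s, s'] = transition probability, R[a, s, s'] = reward.
    Iterate the Bellman optimality backup until the value function converges."""
    V = np.zeros(P.shape[1])
    while True:
        Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, s]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)      # optimal values and a greedy policy
        V = V_new

# Toy 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.0, 1.0]]])  # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])
V, policy = value_iteration(P, R)
print(V, policy)
```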

Reasoning with Language Model is Planning with World Model

  • [Submitted on 24 May 2023 (v1), last revised 23 Oct 2023 (this version, v2)]
  • Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, Zhiting Hu
  • Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal world model to predict the world state (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, Reasoning via Planning (RAP). RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration vs. exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.
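
The RAP loop (the LLM-as-agent proposes candidate actions, the LLM-as-world-model predicts the resulting state, and Monte Carlo Tree Search trades off exploration and exploitation over the reasoning tree) can be sketched schematically as below. The `propose_actions`, `predict_next_state`, and `reward` stubs stand in for LLM calls and task rewards on a toy arithmetic task; this is a search skeleton under those assumptions, not the paper's implementation.

```python
import math, random

def propose_actions(state):             # stand-in for the LLM acting as the agent
    return ["+1", "*2"]

def predict_next_state(state, action):  # stand-in for the LLM acting as the world model
    return state + 1 if action == "+1" else state * 2

def reward(state, target=10):           # stand-in for a task-specific reward
    return 1.0 if state == target else -abs(state - target) / target

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct_select(node, c=1.4):
    """Pick the child with the best exploration/exploitation trade-off (UCT)."""
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def mcts(root_state, iters=200, depth=5):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(propose_actions(node.state)):
            node = uct_select(node)
        # Expansion: add one unexplored action, simulated via the world model.
        untried = [a for a in propose_actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(predict_next_state(node.state, a), parent=node)
            node = node.children[a]
        # Simulation: random rollout using the world model.
        state = node.state
        for _ in range(depth):
            state = predict_next_state(state, random.choice(propose_actions(state)))
        r = reward(state)
        # Backpropagation.
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(1))  # most-visited first action toward reaching the target number 10
```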

More Readings:

Agent Planning with World Knowledge Model

  • Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
  • [Submitted on 23 May 2024 (v1), last revised 3 Jan 2025 (this version, v4)]
  • NeurIPS 2024
  • Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the “real” physical world. Imitating humans’ mental world knowledge model which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper, we introduce parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Besides, we analyze to illustrate that our WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent’s understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. The code is available at this https URL.

O1 Replication Journey: A Strategic Progress Report – Part 1

  • [Submitted on 8 Oct 2024]
  • URL
  • Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, Pengfei Liu
  • This paper introduces a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey. In response to the announcement of OpenAI’s groundbreaking O1 model, we embark on a transparent, real-time exploration to replicate its capabilities while reimagining the process of conducting and communicating AI research. Our methodology addresses critical challenges in modern AI research, including the insularity of prolonged team-based projects, delayed information sharing, and the lack of recognition for diverse contributions. By providing comprehensive, real-time documentation of our replication efforts, including both successes and failures, we aim to foster open science, accelerate collective advancement, and lay the groundwork for AI-driven scientific discovery. Our research progress report diverges significantly from traditional research papers, offering continuous updates, full process transparency, and active community engagement throughout the research journey. Technologically, we proposed the journey learning paradigm, which encourages models to learn not just shortcuts, but the complete exploration process, including trial and error, reflection, and backtracking. With only 327 training samples and without any additional tricks, journey learning outperformed conventional supervised learning by over 8% on the MATH dataset, demonstrating its extremely powerful potential. We believe this to be the most crucial component of O1 technology that we have successfully decoded. We share valuable resources including technical hypotheses and insights, cognitive exploration maps, custom-developed tools, etc. at this https URL.

Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective

  • [Submitted on 18 Dec 2024]
  • https://arxiv.org/abs/2412.14135
  • Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, Xipeng Qiu
  • OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-level performance on many challenging tasks that require strong reasoning ability. OpenAI has claimed that the main technique behind o1 is reinforcement learning. Recent works use alternative approaches like knowledge distillation to imitate o1’s reasoning style, but their effectiveness is limited by the capability ceiling of the teacher model. Therefore, this paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning, focusing on four key components: policy initialization, reward design, search, and learning. Policy initialization enables models to develop human-like reasoning behaviors, equipping them with the ability to effectively explore solution spaces for complex problems. Reward design provides dense and effective signals via reward shaping or reward modeling, which guide both search and learning. Search plays a crucial role in generating high-quality solutions during both training and testing, and can produce better solutions with more computation. Learning uses the data generated by search to improve the policy, achieving better performance with more parameters and more searched data. Existing open-source projects that attempt to reproduce o1 can be seen as parts or variants of our roadmap. Collectively, these components underscore how learning and search drive o1’s advancement, making meaningful contributions to the development of LLMs.

Improving Transformer World Models for Data-Efficient RL

  • [Submitted on 3 Feb 2025]
  • Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
  • We present an approach to model-based RL that achieves a new state of the art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities – such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) “Dyna with warmup”, which trains the policy on real and imaginary data, (b) “nearest neighbor tokenizer” on image patches, which improves the scheme to create the transformer world model (TWM) inputs, and (c) “block teacher forcing”, which allows the TWM to reason jointly about the future tokens of the next timestep.
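
Of the three additions, the nearest-neighbor tokenizer is the simplest to sketch: under one plausible reading, each image patch is assigned the index of its closest entry in a codebook of previously seen patches, with sufficiently distant patches becoming new codes. The toy NumPy class below illustrates that idea and is not the paper's implementation.

```python
import numpy as np

class NearestNeighborTokenizer:
    """Map flattened image patches to token ids by nearest neighbor in a growing codebook.
    A patch farther than `threshold` from every existing code becomes a new codebook entry."""
    def __init__(self, threshold: float = 0.5):
        self.codebook = []          # list of 1-D patch vectors
        self.threshold = threshold

    def tokenize(self, patch: np.ndarray) -> int:
        if self.codebook:
            dists = np.linalg.norm(np.stack(self.codebook) - patch, axis=1)
            idx = int(dists.argmin())
            if dists[idx] < self.threshold:
                return idx
        self.codebook.append(patch.copy())
        return len(self.codebook) - 1

tok = NearestNeighborTokenizer(threshold=2.0)
patches = np.random.rand(4, 16)     # four 4x4 patches, flattened
print([tok.tokenize(p) for p in patches], "codebook size:", len(tok.codebook))
```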

[4]: Platform - long context vs RAG + Hallucination


RAG LongContext

In this session, our readings cover:

Required Readings:

Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities

  • [Submitted on 31 Jan 2025]
  • Arjun Krishna, Erick Galinkin, Leon Derczynski, Jeffrey Martin
  • Large Language Models (LLMs) have become an essential tool in the programmer’s toolkit, but their tendency to hallucinate code can be used by malicious actors to introduce vulnerabilities to broad swathes of the software supply chain. In this work, we analyze package hallucination behaviour in LLMs across popular programming languages examining both existing package references and fictional dependencies. By analyzing this package hallucination behaviour we find potential attacks and suggest defensive strategies to defend against these attacks. We discover that package hallucination rate is predicated not only on model choice, but also programming language, model size, and specificity of the coding task request. The Pareto optimality boundary between code generation performance and package hallucination is sparsely populated, suggesting that coding models are not being optimized for secure code. Additionally, we find an inverse correlation between package hallucination rate and the HumanEval coding benchmark, offering a heuristic for evaluating the propensity of a model to hallucinate packages. Our metrics, findings and analyses provide a base for future models, securing AI-assisted software development workflows against package supply chain attacks.
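
A stripped-down version of the paper's core measurement is to extract the packages that generated code imports and check them against the target ecosystem. The sketch below checks Python imports against a supplied set of known package names (a real harness would query the PyPI index and the standard-library list); the regex and the example snippet are illustrative.

```python
import re

def imported_packages(code: str) -> set[str]:
    """Pull top-level package names out of `import x` / `from x import y` lines."""
    pattern = r"^\s*(?:from|import)\s+([A-Za-z_]\w*)"
    return {m.group(1) for m in re.finditer(pattern, code, flags=re.MULTILINE)}

def hallucinated(code: str, known_packages: set[str]) -> set[str]:
    """Packages referenced by the generated code that are not in the known-package set."""
    return imported_packages(code) - known_packages

generated = """import numpy
from totally_made_up_pkg import magic
import os
"""
# Stand-in registry; a real check would consult PyPI plus the standard library.
known = {"numpy", "os", "sys", "requests"}
print(hallucinated(generated, known))  # {'totally_made_up_pkg'}
```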

YaRN: Efficient Context Window Extension of Large Language Models

  • [Submitted on 31 Aug 2023 (v1), last revised 1 Nov 2023 (this version, v2)]
  • Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole
  • Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state of the art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online up to 128k context length at this https URL
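
Since YaRN operates by rescaling RoPE's rotary frequencies so that positions beyond the training length map into a familiar range, the sketch below shows the knob such methods turn: plain RoPE angles versus a simple position-interpolation rescaling. This is not YaRN itself (which interpolates different frequency bands differently and adjusts attention temperature); it only locates where the intervention happens.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int = 8, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotary angles theta_i(pos) = pos / (base^(2i/dim)); `scale` > 1 compresses positions
    (simple position interpolation). YaRN instead rescales different frequency bands differently."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions / scale, inv_freq)   # shape: (len(positions), dim // 2)

positions = np.arange(0, 8192, 2048)
print(rope_angles(positions, scale=1.0)[-1, :2])   # angles at position 6144, vanilla RoPE
print(rope_angles(positions, scale=4.0)[-1, :2])   # same position after 4x interpolation
```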

Long Context vs. RAG for LLMs: An Evaluation and Revisits

  • [Submitted on 27 Dec 2024]
  • https://arxiv.org/abs/2501.01880
  • Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun
  • Extending context windows (i.e., Long Context, LC) and using retrievers to selectively access relevant information (i.e., Retrieval-Augmented Generation, RAG) are the two main strategies to enable LLMs to incorporate extremely long external contexts. This paper revisits recent studies on this topic, highlighting their key insights and discrepancies. We then provide a more comprehensive evaluation by filtering out questions answerable without external context, identifying the most effective retrieval methods, and expanding the datasets. We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. However, RAG has advantages in dialogue-based and general question queries. These insights underscore the trade-offs between RAG and LC strategies, offering guidance for future optimization of LLMs with external knowledge sources. We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.

More reading:

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

  • [Submitted on 19 Jul 2024]
  • Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
  • The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the LLama 2 7B model by 2.34x while maintaining accuracy.
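
The pruning decision itself is easy to sketch: given per-token importance scores (for example, attention from the last token aggregated over heads), keep only the top fraction of prompt tokens for KV computation at a layer, and recompute the selection at later steps so deferred tokens can be revived. The NumPy snippet below shows only that selection step, not its integration into the prefilling and decoding passes of a real transformer.

```python
import numpy as np

def select_tokens(attn_scores: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Return the indices of the prompt tokens with the highest importance scores,
    sorted back into original order so positional structure is preserved."""
    k = max(1, int(keep_ratio * len(attn_scores)))
    keep = np.argsort(attn_scores)[-k:]
    return np.sort(keep)

scores = np.array([0.02, 0.30, 0.05, 0.25, 0.01, 0.37])  # toy per-token importance
print(select_tokens(scores, keep_ratio=0.5))              # e.g. [1 3 5]
```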

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

  • [Submitted on 10 Apr 2024 (v1), last revised 9 Aug 2024 (this version, v2)]
  • Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal
  • This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
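
The compressive-memory half of the mechanism can be illustrated with a linear-attention-style memory: keys and values from past segments are folded into a fixed-size matrix M and a normalizer z, and later queries read them back as sigma(Q) M / (sigma(Q) z), with sigma = ELU + 1. The NumPy sketch below shows that update/retrieve cycle in isolation, without the local masked attention or the learned gating the paper combines it with.

```python
import numpy as np

def elu_plus_one(x: np.ndarray) -> np.ndarray:
    return np.where(x > 0, x + 1.0, np.exp(x))   # ELU(x) + 1, kept strictly positive

class CompressiveMemory:
    """Fixed-size associative memory: fold (K, V) pairs in, read them out with queries."""
    def __init__(self, d_key: int, d_value: int):
        self.M = np.zeros((d_key, d_value))  # memory matrix
        self.z = np.zeros(d_key)             # normalization term

    def update(self, K: np.ndarray, V: np.ndarray):
        sK = elu_plus_one(K)                 # (n, d_key)
        self.M += sK.T @ V                   # accumulate key-value associations
        self.z += sK.sum(axis=0)

    def retrieve(self, Q: np.ndarray) -> np.ndarray:
        sQ = elu_plus_one(Q)                 # (m, d_key)
        return (sQ @ self.M) / (sQ @ self.z)[:, None]

rng = np.random.default_rng(0)
mem = CompressiveMemory(d_key=16, d_value=16)
for _ in range(4):                           # stream four past segments into memory
    K, V = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
    mem.update(K, V)
Q = rng.normal(size=(8, 16))                 # queries from the current segment
print(mem.retrieve(Q).shape)                 # (8, 16): one memory readout per query
```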

Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

  • [Submitted on 20 Dec 2024]
  • Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, Hen-Hsen Huang
  • Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method involves preloading all relevant resources, especially when the documents or knowledge for retrieval are of a limited and manageable size, into the LLM’s extended context and caching its runtime parameters. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance. Performance evaluations across multiple benchmarks highlight scenarios where long-context LLMs either outperform or complement traditional RAG pipelines. These findings suggest that, for certain applications, particularly those with a constrained knowledge base, CAG provides a streamlined and efficient alternative to RAG, achieving comparable or superior results with reduced complexity.
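
The mechanism amounts to: run the knowledge documents through the model once, keep the resulting KV cache, and answer each query on top of that cache with no retrieval step. Below is a hedged sketch using Hugging Face transformers with a small causal LM; the model choice (gpt2), the toy document and question, and the manual greedy decode loop are illustrative, and a real CAG system would also persist and reuse the cache across requests.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # any causal LM; chosen here only because it is small
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

knowledge = "Company policy: refunds are accepted within 30 days of purchase.\n"
question = "Q: How long do customers have to request a refund?\nA:"

with torch.no_grad():
    # 1) Preload: encode the knowledge once and keep its KV cache.
    knowledge_ids = tok(knowledge, return_tensors="pt").input_ids
    cache = model(knowledge_ids, use_cache=True).past_key_values

    # 2) Answer: feed only the question, reusing the cached knowledge context.
    ids = tok(question, return_tensors="pt").input_ids
    out = model(ids, past_key_values=cache, use_cache=True)
    generated = []
    for _ in range(20):                                  # simple greedy decoding
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```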


