Autonomous Agents

Autonomous Agents-research papers. Updated daily. See as well the Resources-section.

Research papers

Chronological order.

3rd July 2025

Moral Responsibility or Obedience: What Do We Want from AI?

Agentic AI: introduces, with Goal-Oriented Autonomy, Persistent Identity, Autonomous Adaptability, Dynamic/Context-Aware Interaction, Broad/Continual Learning, Collaborative Reasoning, Autonomous/Contextual Reasoning, Independent Initiative, and Moral Reasoning/Ethical Judgment, a discussion on shifting AI safety evaluation from obedience to ethical judgment for systems capable of navigating moral dilemmas.
The paper argues that recent incidents of AI "disobedience" in safety testing should be viewed as evidence of emerging ethical reasoning rather than misalignment or failure.
Evaluating agentic AI safety requires frameworks that assess ethical judgment and the capacity to resolve moral dilemmas, similar to expectations for human professionals.

KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

KERAP (A Knowledge-Enhanced Reasoning Approach): introduces a knowledge graph-enhanced reasoning approach for zero-shot diagnosis prediction, with linkage-, retrieval-, and prediction-agents.
The framework utilizes a linkage agent to map EHR data to a biomedical knowledge graph, a retrieval agent to extract relevant knowledge, and a prediction agent for multi-stage reasoning.
KERAP integrates patient data and structured knowledge via multi-agent collaboration and iterative reasoning to enhance diagnostic accuracy and reliability.

Knowledge Protocol Engineering: A New Paradigm for AI in Domain-Specific Knowledge Work

KPE (Knowledge Protocol Engineering): introduces a new paradigm for AI specialization by translating human expert knowledge into a machine-executable Knowledge Protocol (KP) to guide a Large Language Model (LLM).
The Knowledge Protocol (KP) contains domain-specific methodology, workflows, and strategies, enabling the LLM to perform complex, multi-step tasks requiring procedural reasoning.
KPE elevates the human expert to a Knowledge Architect role, authoring the protocol that augments the LLM's reasoning architecture beyond factual retrieval.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

META SECALIGN: introduces an open-source LLM with built-in model-level defense against prompt injection attacks, utilizing the SecAlign++ training recipe, a modified chat template, a preference dataset, Direct Preference Optimization, and LoRA fine-tuning with a tunable LoRA alpha.
The SecAlign++ recipe fine-tunes a Base Instruct LLM using a preference dataset constructed with randomized injection positions and self-generated responses, optimized via DPO and LoRA.
The modified chat template introduces a dedicated input role to separate untrusted data, enabling the model to prioritize trusted instructions and control the utility-security trade-off via LoRA alpha.

BOURBAKI: SELF-GENERATED AND GOAL-CONDITIONED MDPS FOR THEOREM PROVING

Bourbaki: introduces self-generated goal-conditioned MDPs (sG-MDPs), solved using Monte Carlo Tree Search (MCTS) with a Policy Model (LLMs) and Value Function, interacting with the Lean 4 environment via Pantograph and guided by a Reward Function, to tackle automated theorem proving.
The sG-MDP framework allows agents to dynamically generate and pursue subgoals based on the evolving proof state, providing a denser reward signal than traditional sparse theorem proving by defining State Space, Action Space, and Goal Space.
The system ensembles multiple LLMs for subgoal generation and tactic synthesis, achieving state-of-the-art results on the PutnamBench benchmark by enhancing proof search efficiency and effectiveness.

Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents

EAHawk (automated pipeline): introduces EAHawk, with Email Agent Identification (identifies email agents), Attack Prompt Generation (generates attack prompts), Email Agent Hijacking Confirmation (confirms successful hijacking), Test Environment (simulates attack scenario), Automatic Attack Launching (sends attack prompts), and Oracle Definition (detects successful hijacking), as an automated pipeline to evaluate the Email Agent Hijacking (EAH) attack on LLM email agents.
The EAH attack overrides the original prompts of an email agent via external email resources, allowing attackers to gain remote control and perform malicious actions without user awareness.
EAHawk systematically assesses the practical impact of the EAH attack by identifying email agents, generating diverse attack prompts, and simulating attacks in a controlled environment to verify hijacking success.

On the Convergence of Large Language Model Optimizer for Black-Box Network Management

LLMO (Large Language Model Optimizer): introduces a framework for black-box network management using pretrained LLMs as optimization agents, including LLM L(·) (Optimization agent), Memory M(t) (Stores action-reward pairs), Sampling operator S(.) (Selects in-context examples), and Prompt generator P(·) (Creates LLM input).
The paper models the LLMO procedure as a finite-state Markov chain and proves its convergence to the global optimum, particularly with elitist sampling.
The analysis is extended to a multi-LLM architecture, demonstrating improved convergence speed with multiple LLMs.

Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification

Agentic AI methodology (Multi-Agent System-based): introduces an approach for hardware design and verification using Specialized AI Agents, managed by an Agent Orchestration System and Group Chat Manager, with an Executor Agent for tool interaction, a Critic Agent for feedback, Human-in-the-Loop intervention, and Shared Context for communication.
The methodology structures the process into planning, development, and execution phases, enabling iterative refinement and self-correction through agent collaboration.
Integration with industry-standard EDA tools and targeted human intervention addresses limitations of zero-shot LLM approaches for reliable design and verification.

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

VRAgent-R1: introduces a novel agent-based paradigm for video recommendation, incorporating an Item Perception (IP) Agent for video modeling and a User Simulation (US) Agent for user modeling, interacting within a Recommendation System Environment.
The IP Agent utilizes Key Frame Retrieval, Collaborative Multimodal Perception, and Recommendation Relevant Analysis to generate Enhanced Video Features from Historical Videos.
The US Agent simulates user behavior using Chain-of-Thought Reasoning on user status and candidate videos, trained via Reinforcement Fine-Tuning with GRPO based on Task-Specific Rewards derived from Ground Truth.

STRATEGIC INTELLIGENCE IN LARGE LANGUAGE MODELS EVIDENCE FROM EVOLUTIONARY GAME THEORY.

Evolutionary IPD Tournament Framework: introduces a system to evaluate large language models' strategic intelligence by pitting LLM Agents (OpenAI, Gemini, Anthropic) and Classic Strategies (Benchmark IPD players) against each other in a Tournament Simulation (Orchestrates evolutionary dynamics) governed by a Match Procedure (Defines game rules) and an Evolutionary Update Rule (Determines population changes), with performance analyzed using Key Metrics (Quantify agent performance) and Qualitative Content Analysis (Analyzes LLM rationales), supported by Implementation & Reproducibility (Software and data).
The framework simulates iterated Prisoner's Dilemma tournaments across various conditions, including different termination probabilities and mutation, to observe agent behavior and evolutionary success.
Analysis of agent performance, strategic fingerprints, and textual rationales provides evidence that LLMs exhibit distinct, adaptive strategic reasoning rather than merely retrieving memorized patterns.

DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

DynamiCare: introduces a dynamic multi-agent framework for medical decision-making, comprising a Patient System (Responds to queries) and a Doctor System (Manages diagnostic process).
The Doctor System includes a Central Agent (Manages specialist team) that dynamically adjusts the Specialist Team (Generates diagnosis/questions) based on the Visit Log (Records interaction history).
The Patient System processes queries using components like Paraphrase, Match, Fallback, Tokenize, and Keywords map to generate responses from patient data.

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor: introduces a complete post-training methodology for web agents, including training data synthesis (SailorFog-QA), trajectory reconstruction, rejection sampling fine-tuning (RFT), duplicating sampling policy optimization (DUPO), agent architecture (ReAct framework), and tools (search tool, visit tool, summary model), designed to instill sophisticated reasoning for complex web navigation.
The approach generates high-uncertainty training data (SailorFog-QA) and reconstructs concise reasoning trajectories from expert models to overcome limitations of direct imitation and context overload.
The training methodology combines an RFT cold start with an efficient RL algorithm (DUPO) to enhance sample efficiency and performance on challenging information-seeking tasks, achieving performance comparable to proprietary agents.

Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue

Fine-Tuning Chatbots for Empathetic Dialogue: introduces an approach to evaluate LLMs for empathetic dialogue using an Expert-Curated Dataset (Base empathetic conversations), Large Language Models (LLMs) (Generate/extend dialogue), Prompt Engineering (Guide LLM behavior), VADER Tool (Quantify emotional energy), and Expert Evaluator (Assess empathy quality).
The approach involves creating baseline empathetic conversations, using prompt engineering to guide LLMs (ChatGPT and Gemini) to extend or generate similar dialogues, and evaluating the results via automated sentiment analysis and human expert assessment.
This methodology highlights the importance of combining quantitative lexical analysis with qualitative human evaluation to assess the nuanced quality of empathetic listening in LLM-generated conversations.

CyberRAG: An agentic RAG cyber attack classification and reporting tool

CyberRAG: introduces a modular, agent-based RAG framework for cyber-attack classification and reporting, including a Core LLM Engine, Classification Tool, RAG Tool, Attack Description Report Generator, and Interactive Chat.
The framework uses specialized LLM classifiers and iterative retrieval-and-reasoning to classify payloads and generate context-aware explanations.
CyberRAG provides interpretable, SOC-ready reports and supports interactive user dialogue for enhanced analysis and understanding.

OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent

OMS (On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent): introduces a framework for ad keyword generation featuring a Keyword Performance Monitor (Monitors keyword performance), Agentic Clustering-Ranking Module (Analyzes, scores, ranks keywords), Multi-Turn Generation-Reflection Module (Generates, refines keywords), various Tools (Support generation/reflection), and Keyword Deployment (Deploys new keywords).
The framework monitors keyword performance, analyzes intent, calculates multi-objective scores, ranks keywords, generates and refines new keywords using reflection, and re-clusters them.
It operates on-the-fly without training data, optimizes for multiple metrics, and leverages LLM agents and external tools for adaptive generation.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MEMAGENT: introduces a novel agent workflow for long-context LLMs, featuring a base language model, fixed-length token memory, a context processing module for iterative updates, an answer generation module, trained using the Multi-conv DAPO RL algorithm with a rule-based verifier for rewards.
The approach processes long documents in segments, updating memory via an overwrite strategy to achieve linear time complexity and handle arbitrary input lengths.
Reinforcement learning trains the model to selectively retain answer-critical information in memory, enabling strong extrapolation capabilities on long-context tasks.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

ABC (Agentic Benchmark Checklist): introduces a set of guidelines for evaluating agentic benchmarks, with components assessing task validity, outcome validity, and benchmark reporting.
The checklist identifies issues in benchmark design and implementation that can lead to inaccurate performance estimations of AI agents.
Applying the checklist helps improve the rigor of agentic benchmark evaluation and reporting practices.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

META SECALIGN: introduces, "a secure foundation LLM against prompt injection attacks", with Base Instruct LLM (underlying language model), Modified Chat Template (structured input format), SecAlign++ Recipe (fine-tuning process), and LoRA (parameter-efficient tuning method), where "it develops the first open-source LLM with built-in model-level defense achieving commercial-grade performance".
The framework fine-tunes LLAMA 3 series Instruct LLMs using a modified chat template and the SecAlign++ recipe, which includes DPO and LoRA.
Evaluations show META SECALIGN achieves state-of-the-art security against prompt injection attacks with comparable utility to closed-source models.

2nd July 2025

Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust

LLM-based Role-Playing Agent System: investigates belief-behavior consistency in LLM-based role-playing agents, with LLM Agent (role-playing model), Persona (synthetic profile attributes), Trust Game Environment (simulated economic game), Trustee Archetypes (fixed opponent strategies), Prompting Strategies (agent interaction methods), and ReAct Framework (reasoning and acting process), by evaluating consistency between elicited beliefs and simulated behavior.
The study uses the Trust Game as a testbed and evaluates consistency at both population and individual levels using various elicitation and conditioning strategies.
Findings reveal systematic inconsistencies between stated beliefs and simulated behaviors, highlighting the need for robust internal consistency evaluation before using these systems in behavioral studies.

Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Multi-Agents Approach: introduces a multi-agent framework for generating COBOL code explanations, with Code Processing Agent (Analyzes code, generates explanations), Text Processing Agent (Refines, merges explanations), Function Level (Function explanation pipeline), File Level (File explanation pipeline), and Project Level (Project explanation pipeline) components.
The approach leverages two LLM-based agents and source code artifacts to generate explanations at function, file, and project granularities.
Hierarchical merging is employed within the File Level and Project Level pipelines to handle long code exceeding LLM token limits.

Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System

SynergyMAS: introduces a multi-agent system framework integrating Logical Reasoning, Retrieval-Augmented Generation (RAG), and Theory of Mind (ToM) capabilities, supported by Communication Protocols, Agent Specialization, a Hierarchical Structure, and internal Agent Architecture, to enhance LLM performance in complex tasks.
The framework utilizes a Neo4j graph knowledge base and Clingo logic solver for reasoning, a modified Corrective RAG with Chroma vector base and web search for knowledge management, and explicit belief state modeling for Theory of Mind.
A hierarchical structure with a coordinating "boss" agent and specialized follower agents facilitates collaborative problem-solving through structured interactions and iterative development cycles.

The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems

Multi-Agent System (MAS): introduces a unified formalism for agentic recommender systems, comprising LLM Agent (Core decision-maker), Memory (Stores state/context), Tools (External functions/APIs), Environment (Shared resources/percepts), Interaction Protocol (Agent communication rules), Chat Agent (User interface), Specialised-Agent Caller (Spawns sub-agents), Retrieval Agent (Fetches data/items), Consistency Agent (Ensures coherence/compliance), Ranking & Presentation Agent (Orders/formats output), User Simulator (Generates synthetic behavior), Evaluation Agent (Logs/computes metrics), Session Summariser (Compresses session outcomes), Reporter Agent (Aggregates/reports results), Image Agent (Extracts image features), and Explanation Agent (Generates justifications).
The framework enables LLM agents to plan, remember, use tools, and cooperate to handle complex, multi-step recommendation tasks beyond single-query responses.
Specific use cases like party planning, user simulation, multi-modal recommendation, and explanation generation illustrate how agentic orchestration unlocks new capabilities and addresses challenges in personalization, evaluation, and transparency.

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

SCIGYM: introduces a benchmark evaluating large language models' scientific discovery capabilities using a dry lab simulation of biological systems, featuring an Agent, Dry Lab, SBML Models, Python Execution Environment, Experimental Perturbations, Observations, and Model Submission.
The framework tasks the Agent with discovering missing biological mechanisms by interacting with the Dry Lab, which simulates SBML Models and provides Observations from Experimental Perturbations, allowing the Agent to analyze data using the Python Execution Environment and refine its hypothesis for Model Submission.
This dry lab approach overcomes the cost and time limitations of wet lab experiments, enabling scalable evaluation of LLMs on iterative experiment design and data analysis in complex biological systems.

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

Test-time Compute (TTC) strategies: introduces a two-tiered taxonomy of controllable (L1) and adaptive (L2) methods for improving LLM reasoning efficiency, categorized by sequential and parallel approaches, implemented via prompting, supervised finetuning, or reinforcement learning.
The survey addresses the inefficiency of current LLMs that use fixed inference compute, often overthinking simple problems and underthinking hard ones.
Benchmarking reveals systemic inefficiencies in existing models, highlighting the need for more adaptive and compute-aware reasoning mechanisms to balance performance, cost, and latency.

The Thin Line Between Comprehension and Persuasion in LLMs

LLM Debate Evaluation Framework: introduces a method to evaluate LLMs in debate scenarios, with LLM (Generation), Formal Dialogue Model (FDM), Human Participant, Debate Transcript, Human Annotator, Annotation Criteria, LLM (Evaluation), Automated Prompt Optimization (APO), Audience, Survey Response, and Speech-to-Text (STT) components, where the paper evaluates LLMs' persuasive abilities and comprehension in structured debates.
The framework compares standard LLMs with LLMs augmented by a Formal Dialogue Model (DE model) in debates against humans and other LLMs.
Evaluation involves human and LLM annotation of debate transcripts based on defined criteria, alongside participant and audience surveys on satisfaction and persuasion.

Decision-oriented Text Evaluation

Decision-Oriented Evaluation Framework: introduces, "a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes", with all Text Source (Origin of text), Text Generation Method (Process for creating text), Decision Agent (Entity making decisions), and Evaluation Metric (Measure of decision quality) components, where the framework evaluates generated text by assessing the accuracy of investment decisions made by human and LLM agents based on the text.
The framework utilizes market digests generated by human journalists or LLMs using different selection methods as input for human and LLM decision-making agents.
Decision quality is quantified using thresholded prediction accuracy of stock movements, highlighting the practical value of generated text beyond traditional intrinsic metrics.

Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

GUI-Inspired CoT with Submit/Reset Metaphor: introduces a method for domain-specific chatbots using User Query, Session Data, Task-Based Prompt, LLM, LLM Response, Parser, Chain-of-Thought (CoT), Decision Logic, and Back-end System to model GUI actions like Submit/Reset.
The approach leverages LLMs prompted to generate structured data and CoT reasoning, which is parsed by the back-end to manage context and execute actions unambiguously.
By making acknowledgment and context switching explicit via structured LLM outputs and CoT, the system reduces user confusion and aligns conversational flow with back-end logic.

Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

GUI-Inspired CoT with Submit/Reset Metaphor: introduces a method for domain-specific chatbots using User Query, Session Data, Task-Based Prompt, LLM, LLM Response, Parser, Chain-of-Thought (CoT), Decision Logic, and Back-end System to model GUI actions like Submit/Reset.
The approach leverages LLMs prompted to generate structured data and CoT reasoning, which is parsed by the back-end to manage context and execute actions unambiguously.
By making acknowledgment and context switching explicit via structured LLM outputs and CoT, the system reduces user confusion and aligns conversational flow with back-end logic.

Agent Ideate: A Framework for Product Idea Generation from Patents Using Agentic AI

Agent Ideate: introduces a framework for generating product ideas from patents, with Patent Summarizer Agent (Summarizes patent), Keyword Extraction and Search Agent (Extracts keywords and searches), and Idea Generation & Validation Agents (Generates and validates idea).
The framework processes Patent Data (Input source) through specialized agents to produce structured Product Information (Output).
The agentic approach leverages LLMs and external search tools to enhance the innovation pipeline from patent data.

Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

bMAS (blackboard-based LLM multi-agent system): introduces a framework with a Blackboard (shared information space), Control Unit (selects agents), and Agent Group (collection of LLM agents), implemented in LbMAS with an Agent Generation Module (generates expert agents), Solution Extraction Module (extracts final solution), and LLM Set (pool of base models).
The framework utilizes a shared blackboard for agent communication and collaboration, replacing individual agent memory modules.
The Control Unit dynamically selects agents based on the blackboard content, enabling adaptive problem-solving without predefined workflows.

Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

Data Agent: introduces a comprehensive architecture for orchestrating Data+AI ecosystems, including Data Plane (Organize, understand data), Engine Plane (Understand, schedule engines, agents), Orchestration Plane (Manage pipeline workflow), Memory (Store knowledge, context), Perception (Understand environment, tasks), Tools (External data processing utilities), and Continuous Learning (Improve agent over time).
The architecture integrates knowledge comprehension, reasoning, and planning capabilities to handle data-related tasks autonomously.
It addresses challenges in understanding data/queries/environments/tools, orchestrating/optimizing/executing pipelines, and enabling self-reflection for continuous improvement.

AGENT-AS-TOOL: A STUDY ON THE HIERARCHICAL DECISION MAKING WITH REINFORCEMENT LEARNING

Agent-as-tool: introduces a hierarchical framework with Planner (reasons, decides tool use), Toolcaller (executes tool actions, processes results), Tools (external interfaces), Observations (structured tool outputs), and Reinforcement Learning (GRPO) (fine-tunes Planner).
The framework decouples reasoning and tool execution by assigning these roles to the Planner and Toolcaller respectively.
This hierarchical design improves reasoning accuracy by providing the Planner with cleaner, structured observations from the Toolcaller.

BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

BioMARS (Biological Multi-Agent Robotic System): introduces a multi-agent robotic system for autonomous biological experiments, integrating LLMs, VLMs, and modular robotics with Biologist Agent (Designs protocols), Technician Agent (Translates to code), Inspector Agent (Detects errors), Physical Hardware (Executes actions), User Interface (Human interaction), LLMs (Language models), VLMs (Vision-language models), RAG (Retrieval augmented generation), Knowledge Checker (Filters content), Workflow Generator (Formulates steps), Workflow Checker (Refines workflow), Code Generator (Maps to pseudo-code), Code Checker (Validates code), Vision Transformer (Visual detection), and ROS (Robot control system) components.
The system employs a hierarchical architecture where the Biologist Agent designs protocols, the Technician Agent translates them into robotic code, and the Inspector Agent monitors execution for errors.
BioMARS leverages LLMs and VLMs for reasoning and perception, enabling autonomous protocol design, execution, and error handling in biological tasks.

Using multi-agent architecture to mitigate the risk of LLM hallucinations

Multi-agent architecture: introduces a system to handle customer SMS requests using multiple intelligent agents, including services for receiving messages, orchestrating processing, arbitrating decisions, and specialized agents for handling specific tasks.
The architecture integrates LLM-based agents with fuzzy logic and parsing techniques to interpret messages, evaluate confidence, assess customer importance, and detect potential LLM hallucinations.
Hallucination mitigation involves comparing keyword extraction results from parsing and LLM agents and using fuzzy rules to determine the handling of potentially high-risk requests or route messages to expert agents.

RALLY: Role-Adaptive LLM-Driven Yoked Navigation for Agentic UAV Swarms

RALLY (Role-Adaptive LLM-Driven Yoked Navigation): introduces, with LLM-based two-stage semantic reasoning module, Local intention generation, Neighborhood consensus refinement, Role-value Mixing Network (RMIX)-based credit-distribution mechanism, RMIX Network, Prior Offline Experience Replay Buffer, and Fine-tuned LLM components, a framework for role-adaptive LLM-driven yoked navigation for agentic UAV swarms.
The framework integrates LLM semantic reasoning with MARL policy learning for coordinating roles and decision-making across UAV swarms.
It employs a two-stage LLM process for consensus inference and a RMIX-based mechanism for dynamic role assignment and credit assignment.

Evaluating LLM Agent Collusion in Double Auctions

LLM Agent Double Auction Simulation: introduces a system to evaluate LLM agent collusion in a simulated continuous double auction environment with LLM Agents (buyers and sellers), Bid Queue, Ask Queue, Market Resolution Mechanism, Updated Market History, Planning & Messaging, Persistent Memory Store, Strategy Scratchpad, LLM Evaluator, Overseer Agent, CEO Message, and CME Group Regulators Message, investigating factors affecting seller collusion.
The research explores how communication, model variation, and environmental pressures like oversight and urgency influence LLM seller agents' propensity to collude and their pricing behavior.
Findings indicate that direct communication increases collusion, model choice affects coordination, and urgency can override the effects of regulatory oversight in promoting collusive pricing strategies.

AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing

LLM-Agents, MLLM-Agents, and Agentic AI: reviews the evolution and concepts of AI agents, detailing LLM-Agents with Profile (identity, role, constraints), Memory (stores, retrieves interactions), Planning (decomposes tasks, steps), and Action (executes decisions, tools) components, MLLM-Agents, and Agentic AI, exploring their manufacturing potential.
The paper discusses how Generative AI, including LLMs and MLLMs, enhances AI agents' capabilities for manufacturing applications.
It highlights the progression from traditional AI agents to more autonomous, adaptive, and goal-driven Agentic AI systems for future manufacturing.

Context-Aware Code Wiring Recommendation with LLM-based Agent

WIRL: introduces an LLM-based agent for context-aware code wiring, combining an LLM (Large Language Model), an Agent Pilot (Orchestrates communication), and a Customized Toolkit (Provides essential functionalities) with Locator (Identifies unresolved elements), Collector (Collects contextual information), and Completer (Infills isolated code) tools.
The framework reformulates code wiring as a retrieval-augmented generation infilling task, leveraging LLMs' strengths in code completion.
WIRL employs a hybrid execution mode and a state machine to guide the agent's exploration and improve efficiency.

1st July 2025

STELLA: Self-Evolving LLM Agent for Biomedical Research

STELLA: introduces a self-evolving LLM agent for biomedical research, leveraging Manager, Dev, Critic, and Tool Creation Agents, an evolving Template Library, and a dynamic Tool Ocean, along with Conda Environment, Scripts, Input, Final Result, and Human Expert/Wet Experiment feedback, to autonomously improve capabilities.
The agent employs a multi-agent architecture and two core self-evolving mechanisms: a Template Library for reasoning strategies and a dynamic Tool Ocean for accessible tools.
STELLA learns from experience, dynamically expanding its knowledge and skills to tackle complex biomedical challenges and improve performance over time.

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Web Agent with Dynamic Reflection: introduces WebArXiv, a static benchmark, and proposes a dynamic reflection mechanism for web agents, including Web Agent, Visual Observations, Element Texts, Interaction History, Dynamic Reflection Mechanism, Model, Reasoning Context, Action Execution, and History Update components.
WebArXiv provides a stable and reproducible environment for evaluating web agents on time-invariant arXiv tasks.
The dynamic reflection mechanism enhances agent performance by selectively retrieving relevant past interaction steps for improved decision-making.

Enhancing LLM Agent Safety via Causal Influence Prompting

CIP (Causal Influence Prompting): introduces a novel technique for enhancing LLM agent safety by leveraging Causal Influence Diagrams (CID) initialization, Environment interaction, and CID refinement.
The approach uses CIDs to represent cause-and-effect relationships in the agent's decision-making process, enabling reasoning about potential consequences.
Iterative refinement of the CID based on observed behaviors allows the agent to anticipate harmful outcomes and make safer decisions.

Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications

Urban LLM Agents: introduces a framework for LLM-powered agents operating in urban environments, with Large Language Models (Core controller), Urban Sensing (Collects, interprets urban signals), Memory Management (Organizes, retrieves urban knowledge), Reasoning (Simulates, plans actions), Execution (Translates plans into actions), and Learning (Adapts, improves behavior) components.
These agents are semi-embodied, interacting with cyber-physical-social urban systems through APIs, databases, and platforms to support system-level decision-making.
The paper surveys the research landscape, categorizes applications, and discusses trustworthiness and evaluation challenges for real-world deployment.

TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

TransLaw: introduces a multi-agent framework for legal judgment translation, featuring Translator Agent, Annotator Agent, and Proofreader Agent powered by LLMs.
The framework simulates a professional translation workflow where agents collaborate, utilizing Proofreading Memory, Translation Memory, and a Terminology database.
A Memory module supports agent self-adaptation by storing interaction history, aiming to improve translation quality and efficiency.

Many LLMs Are More Utilitarian Than One

LLM-MAS (Large Language Model Multi-Agent Systems): introduces a study on collective moral reasoning in LLMs, featuring LLM Agent (Individual large language model) in Solo Condition (Independent reasoning) or Group Condition (Multi-agent deliberation) involving a Discussion Phase (Multi-turn agent exchange) and a Reflection Phase (Private reasoning and scoring).
The research investigates whether multi-agent LLM systems exhibit a utilitarian boost in moral judgments compared to individual LLMs.
Experiments with six different LLMs in pairs and triads show a consistent shift towards endorsing norm violations that maximize overall welfare.

Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity

LLM Social Agents: introduces, "Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity", with Large Language Models (LLMs) (Generate responses), LLM Agents (Simulate users), Zero Shot Initialization (Uses political leaning), Few Shot Initialization (Uses user history), User Profile Data (Bio, tweets for Few Shot), and Tweet Conversation Context (Input tweets for reply), where the paper investigates how LLMs simulate political discourse on social media using agents initialized with varying user data.
The study evaluates three LLM families (Gemini, Mistral, DeepSeek) under Zero Shot and Few Shot conditions, comparing their outputs to human replies on lexical diversity, ideological consistency, and toxicity.
Findings reveal "generative exaggeration," where LLMs amplify salient user traits, particularly in the Few Shot setting, leading to increased polarization, stylized language, and toxicity, challenging their reliability as social proxies.

ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis

ChatHLS: introduces an automated end-to-end workflow for HLS design optimization and error correction, including C++ Input, LLM ① (HLS GEN), RAG, LLM ② (HLSTuner), HLS Tool (Testing), LLM ③ (Bug Fixing), LLM ④ (Instruction Adherence), LLM Group ⑤ (Multifaceted Assessment), LLM ⑥ (Scoring), BugRAG, QoR Pass Check, User Requirement, HLS-C Output, and HLS Dataset Collection.
The framework leverages fine-tuned LLMs within a multi-agent system for generating HLS-C code, optimizing designs, and systematically debugging errors.
ChatHLS utilizes a verification-oriented data augmentation paradigm (VODA) and iterative refinement to enhance LLM capabilities and achieve high code repair accuracy and performance speedups.

1st July 2025

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Large Language Model (LLM): investigates the transferability of reasoning capabilities in LLMs fine-tuned on math tasks by analyzing their internal latent space and output token distribution.
The research compares the impact of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) fine-tuning paradigms on LLM generalization.
Findings indicate that RL-tuned models maintain more stable latent representations and token distributions, leading to better transferability across diverse tasks than SFT-tuned models.

iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

iPanda: introduces an end-to-end framework for automated protocol conformance testing, with Function Point Extractor (extracts points), Test Case Generation Module (generates test cases), LLM Interactor (generates test code), Execution Module (runs tests), Memory Module (manages memory), and Summarization Module (summarizes, reports).
The framework leverages large language models, keyword-based test case generation, code-based retrieval-augmented generation, and iterative self-correction for test code refinement.
iPanda streamlines the testing process from specification analysis to result analysis, significantly reducing manual effort and improving efficiency.

30th June 2025

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: introduces a self-play framework for LLMs using a Distributed Actor-Learner Architecture, Parallel Rollout, Centralized Learner, Role-conditioned Advantage Estimation, Shared Policy, Zero-Sum Games, Evaluation Games, Vectorized Environment, and Model Inference, enabling language models to develop reasoning through multi-turn competitive self-play on zero-sum games.
The framework utilizes a distributed actor-learner system with parallel rollout in vectorized game environments and a centralized learner processing trajectories using Role-conditioned Advantage Estimation to update a shared, role-conditioned LLM policy.
Self-play on zero-sum games generates an infinite curriculum, forcing the shared policy to continuously adapt and develop transferable reasoning skills without human supervision or domain-specific data.

Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Agent.xpu: introduces an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs, with Offline Model Compilation and Warmup (Prepares LLM model), Online Workload-Aware Scheduling (Manages runtime execution), and Hetero-SoC Hardware Layer (Underlying hardware).
The system uses offline profiling to build a Heterogeneous Execution Graph (HEG) and annotate Elastic Kernels for online scheduling.
The online scheduler employs a Dual-Queue Architecture, Task Decomposition and Dispatch, XPU Coordinator, Fine-Grained Kernel-Level Preemption, Slack-Aware Kernel Backfill, and Contention Mitigation to manage reactive and proactive tasks on CPU, iGPU, and NPU with Shared Memory.

Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Auto-TA: introduces a fully automated LLM pipeline for thematic analysis, with Generation Agents (Initial processing) including Coder Agents with Identities (Generate initial codes) and Theme-Generation Agents (Cluster codes, generate themes), a Feedback Agent (Evaluate, refine themes), and optional Reinforcement Learning (optional) (Optimize themes via feedback) involving Human Raters (Provide feedback for RL) and an RL Trainer (Update policy).
The framework processes clinical narratives end-to-end, eliminating the need for manual coding or full transcript review.
Specialized LLM agents collaborate to enhance theme quality and alignment, with optional RLHF improving thematic relevance based on human feedback.

LLM Agents Are the Antidote to Walled Gardens

Universal Interoperability: introduces LLM Agents (Understand text/code, interact external tools/web), Agent-friendly interfaces (Provide metadata for agent interaction), Security by design (Mechanisms for agent permissions/safety), and Ecosystem infrastructure (Protocols, standards for agent interaction), proposing LLM agents enable seamless data exchange and workflow coordination between digital services via AI-mediated adapters.
This approach aims to reduce integration effort and cost by allowing agents to translate formats and interact with interfaces, overcoming traditional technical and strategic barriers.
Establishing foundational infrastructure for agent-friendly interfaces, security, and ecosystem protocols is crucial to mitigate risks and ensure robust, secure, and effective interoperability.

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

R2A2 (Reflective Risk-Aware Agent Architecture): introduces a modular framework integrating safety, alignment, and risk-awareness into LLM agent cognitive loops.
The architecture includes components for perception, memory, reasoning, planning, reflection, risk simulation, and action filtering.
Grounded in Constrained Markov Decision Processes, R2A2 enables risk-aware planning and constraint-sensitive execution for autonomous agents.

Leveraging a Multi-Agent LLM-Based System to Educate Teachers in Hate Incidents Management

ARISE (Agent Resource for Incident Support and Education): introduces a multi-agent LLM-based system with Manager Agent, Student Agents, Advisory Agents, RAG Module, Conversational Interface, and Feedback Mechanism, designed to educate teachers in hate incident management through realistic simulations.
The system uses persona modelling and retrieval-augmented generation to provide diverse perspectives and contextual information for analyzing hate speech incidents.
Teachers interact with the system via a chat interface to describe incidents and receive structured analysis, potential escalation risks, and intervention strategies.

A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications

LLM-based Automated Program Repair: introduces a taxonomy with Base LLMs (Core models), Fine-tuning (Adapt LLM weights), Prompting (Single query frozen LLM), Procedural (Scripted multi-step workflow), and Agentic (LLM controls workflow) paradigms, enhanced by Retrieval-Augmented Generation (External knowledge augmentation) and Analysis-Augmented Generation (Program analysis augmentation).
This survey categorizes 63 recent systems, clarifying design trade-offs and challenges across different approaches.
The paper outlines research directions to advance reliable and efficient LLM-based APR.

DABstep: Data Agent Benchmark for Multi-step Reasoning

AI Agent on DABstep: introduces a benchmark evaluating AI agents on multi-step data analysis tasks, comprising Agent (AI model solving task), Environment (Context, data, tools) with Environment/Datasets (Structured data files), Environment/Docs (Unstructured documentation), and Environment/Code Execution (Code execution tool), interacting via Question (Task input), Answer (Task output), State (Agent's internal state), and Code/Actions (Agent's generated steps).
The benchmark features over 450 real-world financial data analysis tasks requiring multi-step reasoning, code execution, and integration of structured and unstructured data.
Evaluation uses an objective factoid-based scoring method, revealing a significant performance gap for current agents on complex tasks.

Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models

Agent4S: introduces a five-level classification for LLM-driven agents to automate scientific research, featuring an Agent for Science, Memory, Model Context Protocol (MCP), Tools/External Agents, Reasoning Frameworks, and A2A Protocol.
The framework outlines a roadmap from automating single tools (L1) and complex pipelines (L2) to intelligent single-flow research (L3) and lab-scale autonomy (L4), culminating in cross-disciplinary multi-agent collaboration (L5).
Agent4S positions agents as productivity tools transforming scientific discovery by addressing the inefficiency of existing research paradigms and integrating AI into the entire research workflow.

PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent System for Pokémon Red

PokéAI: introduces a text-based multi-agent LLM framework, with Planning Agent (Generates tasks), Execution Agent (Carries out tasks), Critique Agent (Evaluates task outcome), Long-term Memory (Stores game state, context), Passive Battle Module (Handles in-game battles), and Active Tool Selection (Navigation, Conversation tools), designed to autonomously play Pokémon Red.
The system operates in a closed loop where the Planning Agent generates tasks, the Execution Agent performs them, and the Critique Agent verifies completion.
A key component, the Passive Battle Module within the Execution Agent, demonstrates performance comparable to an experienced human player in battle scenarios.

Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Personality-aligned LLM agents: introduces a method using Human Personality Profiles and Personality Assignment to create LLM Agents that perform a Headline Evaluation Task, generating LLM Accuracy Ratings, which are assessed using Evaluation Metrics.
The research evaluates whether LLM agents conditioned on Big-Five personality profiles can replicate human susceptibility patterns to misinformation.
The study finds partial replication of human trait-misinformation associations, highlighting both the potential and limitations of LLMs for behavioral simulation.

Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

AutoDefense: introduces a multi-agent LLM defence framework, evaluated in 1-, 2-, and 3-agent configurations, including Coordinator (manages agents), Intention Analyzer (evaluates response intent), Prompt Analyzer (infers original query), and Judge (determines response safety) components, designed to protect LLMs from jailbreak attacks by analyzing responses.
The study evaluates the framework's effectiveness against various jailbreak attacks and compares performance across different agent configurations using metrics like Attack Success Rate, False Positive Rate, and False Negative Rate.
Results indicate that increasing agents can reduce false negatives but may increase false positives, suggesting no single optimal configuration and highlighting challenges in evaluating ethically ambiguous content.

Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

TAIRA (Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent): introduces a novel thought-augmented interactive recommender agent system featuring a Manager Agent, Executor Agents, and Thought Pattern Distillation to handle complex user intents.
The Manager Agent orchestrates tasks and plans subtasks using Thought Patterns and Hierarchical Planning, while Executor Agents like Searcher, Item Retriever, Task Interpreter, and Interactor execute specific functions.
Thought Pattern Distillation extracts high-level planning guidance from agent and human experiences to enhance the system's reasoning and generalization capabilities.

29th June 2025

Do LLMs Dream of Discrete Algorithms?

AI Agent: introduces a neurosymbolic approach augmenting LLMs with logic-based reasoning and modular tools, structured by MVC, enabling decomposition and orchestration.
The AI Agent architecture includes an Agent Core for orchestration, Memory for information storage, a Planner guided by logic reasoning, and various Tools for specific tasks.
This hybrid approach enhances reliability and interpretability for multi-step reasoning tasks by combining probabilistic LLMs with formal logic systems.

ATGen: A Framework for Active Text Generation

ATGen: introduces a comprehensive framework bridging active learning with text generation tasks, enabling AL-empowered annotation using human or LLM-based agents.
The framework provides a unified platform for implementing and benchmarking AL strategies tailored to NLG tasks.
It includes a web GUI, various AL strategies, support for LLM integration, efficient model tools, evaluation metrics, and a benchmarking platform.

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Multi-Agent Public Goods Game Simulation: introduces, "a simulation framework", with Environment (Central coordinator), Institutions (Rule frameworks), and Agents (Autonomous decision-makers), where "the framework models LLM agents navigating a public goods dilemma with institutional choice and norm enforcement".
The simulation includes two types of Institutions, Sanctioning and Sanction-Free, allowing agents to choose environments with or without costly norm enforcement mechanisms.
Agents make decisions on institution choice, contribution, and sanctioning based on their history and anonymized group data, with their reasoning captured for analysis.

From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

LLM-Powered AI Agent Communications: surveys threats in these systems, including Agent A (MCP Host), Agent B (MCP Host), MCP Server, A2A Server, Local Data Source, Remote Service API, Agent Framework (Large Language Model), A2A Client, A2A protocol, Web Browser - User, and Public Knowledge Source components.
The paper introduces a unified, end-to-end threat model categorizing over thirty attack techniques across input manipulation, model compromise, system/privacy, and protocol vulnerabilities.
This work provides a comprehensive reference for designing robust defenses and establishing best practices for resilient LLM-agent workflows.

AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks

AURA: introduces the first open-source, speech-to-speech task-oriented agent combining reasoning and tool use, featuring UI (user interface), ASR Module (speech recognition), TTS Module (text-to-speech), Dialog Processing Unit (processes dialogue) with Controller (central orchestrator), Agent (interleaves reasoning action), Actions (executable operations), Observation (environmental feedback), Dialog State Tracking (tracks dialogue state), and State (system memory) including Action-Observation History (action observation sequence), Conversation History (filtered chat history), and Dialog State (structured dialogue info), an LLM Server (hosts language model) with LLM (language model), Inference Engine (memory efficient inference), and ReAct Response Format (structured LLM output), and External APIs (Tools) (real-world services).
The system employs a cascaded architecture and integrates a ReAct-style agent to manage multi-turn dialogue and dynamic tool invocation for complex, goal-driven tasks.
AURA supports tools like calendar booking, contact lookup, web search, and email, demonstrating strong performance on VoiceBench and human evaluations for real-world task execution.

28th June 2025

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Interlocutor Awareness Evaluation Setup: introduces a systematic evaluation of LLMs' ability to identify and adapt to conversational partners, utilizing LLMs acting as Identifier, Target, Sender, Solver, Player, Judge, Jailbreaker, and an Interpreter model.
The evaluation assesses LLM interlocutor awareness across reasoning patterns, linguistic style, and alignment preferences.
Case studies demonstrate the impact of this awareness on multi-agent cooperation, alignment, and safety.

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

DICE-BENCH: introduces a framework to evaluate LLM function-calling in multi-round, multi-party dialogues, utilizing Tool Collections, Tool Graph Construction, Scenario Configuration, Dialogue Generation with Parameter Generation and Dialogue Simulation via a Multi-Agent System (Agents, Orchestrator), processed through a Validation Pipeline (Automatic Evaluation, Rule-Based Filtering, Criteria-Based Filtering), and quantified by the DICE-SCORE metric.
The framework generates realistic function-calling datasets by synthesizing conversations based on tool dependencies and distinct agent personas.
DICE-SCORE measures the dispersion of tool-related information across dialogue turns, correlating with task difficulty for LLMs.

Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

RAG and Agent Based Dialog Systems: introduces finetuning LLMs with domain-specific data and external knowledge (KAFT) within RAG and agent architectures, including Retriever (retrieves knowledge), Generator (LLM) (generates response), Decision Maker (decides search), and API Calling (calls search APIs).
The RAG system architecture comprises a Retriever and a Generator (LLM), while the agent system architecture includes a Decision Maker, API Calling, and a Generator (LLM).
KAFT is applied to the Generator (LLM) in both RAG and agent systems and the Decision Maker in the agent system to improve the utilization of external knowledge.

Memory as a Service (MaaS): Rethinking Contextual Memory as Service-Oriented Modules for Collaborative Agents

MaaS (Memory as a Service): introduces a service-oriented perspective for contextual memory in LLM-based agent systems, proposing a dual architecture with Memory Containers, a Memory Routing Layer, and a Fine-grained permission control mechanism to enable governable cross-entity memory sharing.
The framework decouples contextual memory from its local state, encapsulating it as independently callable, dynamically composable, and finely governed service modules.
MaaS aims to dismantle memory silos and support complex, long-term collaboration across diverse entities while rigorously respecting the private nature of memory assets.

FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets

FairMarket-RL: introduces a multi-agent reinforcement learning framework for peer-to-peer markets, incorporating a Large Language Model (LLM) as a real-time fairness critic to guide agent rewards.
The framework utilizes Independent Proximal Policy Optimization (IPPO) for agent training, blending raw economic rewards with LLM-generated fairness scores (Fairness-To-Buyer, Fairness-Between-Sellers) via a scheduled shaping mechanism.
FairMarket-RL demonstrates improved fairness and efficiency in simulated P2P energy trading, achieving high demand fulfillment and balanced profits by replacing static rules with dynamic LLM feedback.

27th June 2025

Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

KGMAF: introduces a knowledge-guided multi-agent framework for automated requirements development, composed of Agents (LLM-based entities) and an Artifacts Pool (Central artifact repository).
The agents collaborate by monitoring and interacting with the artifacts pool, which stores intermediate and final requirements artifacts.
Each agent is equipped with specific functionality, predefined actions, planning mechanism, and injected knowledge to perform requirements tasks.

URSA: The Universal Research and Scientific Agent

URSA (The Universal Research and Scientific Agent): introduces a scientific agent ecosystem for accelerating research tasks, consisting of a Planning Agent (Breaks down problems), Execution Agent (Carries out tasks), Research Agent (Gathers online info), Hypothesizer Agent (Generates hypotheses), ArXiv Agent (Summarizes research papers), LLMs (Backend models), LangGraph (Agent framework), DuckDuckGo Search Tool (Performs web search), Web Scraping/Parsing Tool (Extracts web content), Command Line Tool (Executes system commands), Write Code Tool (Writes code files), Run Physics Code Tool (Executes physics simulations), ArXiv Search Tool (Searches ArXiv API), and Vision Model (Processes images).
The framework utilizes a set of modular, composable agents coupled with tool use to hypothesize, plan, and execute research tasks, building on large language model capabilities.
URSA demonstrates the potential for agentic AI to address scientific problems of varied complexity, including leveraging advanced physics simulation codes for design automation.

REXBENCH: Can coding agents autonomously implement AI research extensions?

REXBENCH: introduces a benchmark for evaluating LLM agents' ability to implement research extensions, with Input (Research paper, Codebase, Task instruction), Agent Execution (LLM Agent, Patch file generation), Agent Evaluation Infra (Virtual Machine, Task Execution, Evaluation Metrics), and Evaluation (Results from Experiment, Final Success Rate calculation) components, where the paper evaluates LLM agents on realistic research extension tasks using an automatic evaluation infrastructure.
The benchmark consists of 12 tasks based on existing research papers and codebases, requiring agents to implement novel extensions and produce code changes.
An automatic evaluation infrastructure executes the agent-generated code in controlled virtual machines and assesses performance using metrics like File Recall, Execution Success Rate, and Final Success Rate, revealing that current agents struggle with these tasks.

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Dyadic Motion Models: introduces a framework for generating dyadic audiovisual motion, including Speech Tokenizer (processes audio), Face & Body Feature Extractor (processes user visual), Dyadic Motion Model (generates motion features), Speech Model (provides LLM features), Valence Adapter (maps to valence codes), Arousal Adapter (maps to arousal codes), Gesture Adapter (maps to gesture codes), Face Adapter (maps generic to personalized face features), Body Adapter (maps body features to avatar rig), 3D Full-Body Codec Avatar Decoder (renders 3D avatar), Gaussian Splatting (3D rendering technique), and Linear Blend Skinning (deforms avatar mesh).
The framework utilizes dyadic audio and optional user visual input to generate intermediate face and body motion features.
LLM integration via adapters enables controllable emotion and gesture generation, while adapters and a 3D decoder facilitate photorealistic avatar rendering for interactive agents.

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

AI Research Agent: introduces, "The Automated LLM Speedrunning Benchmark", with LLM (Large Language Model), Search Scaffold (Iteratively uses LLM), Coder (Generates/modifies code), Executor (Runs code), Analyzer (Summarizes execution results), Knowledge (External information source), and History (Record of attempts), evaluating the ability of AI agents to reproduce NanoGPT speedrun improvements.
The benchmark tasks agents with reproducing successive speedrun records, providing the previous record's script and optional hints in various formats.
The AI research agent, composed of an LLM and a search scaffold, attempts to reproduce the record, and its performance is measured by the fraction of speedup recovered and code similarity.

Exploring Modularity of Agentic Systems for Drug Discovery

smolagent framework: evaluates the modularity of LLM-based agentic systems for drug discovery, with Code Agent (Writes and executes code), ToolCalling Agent (Uses external tools), LLM (Backbone language model), System prompt (Agent instructions), Tools (Cheminformatics functions), LLM Judge (Evaluates agent answers), where the system uses different agents, LLMs, and prompts to answer cheminformatics questions, evaluated by an LLM-as-a-judge.
The study compares the performance of CodeAgent and ToolCallingAgent types, seven different LLMs, and three system prompts on a set of 26 cheminformatics questions.
Performance is assessed using an LLM-as-a-judge system that scores agent answers based on expected answers, highlighting the dependence of performance on LLM, agent type, and prompt.

Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

TWONs (Twins of Online Social Networks): introduces a formal framework for simulating social networks with Agents (Social media users) having Agent State (Agent's discourse history) and Communicative Behavior (Agent generates messages), interacting via Network Mechanics (Adapts incoming messages), focusing on Imitating User Behavior (Estimate agent behavior function) including Imitating Posting Behavior (Model content generation), Imitating Replying Behavior (Model reply generation), and Estimating Replying Likelihood (Predict reply probability) using Large Language Models (LLMs) (Basis for agents) with Fine-Tuning (Adapting LLMs) and a BERT-based Encoder (Embeds text for likelihood).
The paper empirically tests LLM-based imitation of user behavior on X (formerly Twitter) in English and German, benchmarking empirical realism.
Findings suggest fine-tuning and language-specific considerations are crucial for achieving realistic social simulations with generative agents.

More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

Tool-Integrated LLM Agent: introduces an evaluation of the stability of tool-integrated LLM agents, focusing on vulnerabilities during the tool invocation process related to tool documentation, tool usage hallucination, and tool response attacks.
The study investigates how internal and external factors impact agent performance and stability when interacting with external tools using the ReAct framework.
Experiments reveal that agents are highly susceptible to errors at each stage of tool invocation, with open-source models generally more vulnerable than proprietary ones.

CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design

CAL-RAG (Retrieval-Augmented Multi-Agent Generation): introduces a framework for content-aware layout generation using a Layout Recommender Agent (Suggests initial layout), Layout Generation Tool (Creates visual representation), Grader Agent (Evaluates layout quality), and Feedback Agent (Provides refinement feedback).
The framework operates iteratively, retrieving relevant layout examples, proposing structured element placements, evaluating the generated layout based on visual metrics, and providing targeted refinements.
This multi-agent system combines retrieval augmentation with agentic reasoning to achieve scalable, interpretable, and high-fidelity automated layout generation.

ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation

ARAG (Agentic Retrieval-Augmented Generation): introduces a multi-agent framework for personalized recommendation, integrating a User Understanding Agent (Summarizes user preferences), Natural Language Inference Agent (Evaluates semantic alignment), Context Summary Agent (Summarizes NLI-filtered evidence), and Item Ranker Agent (Generates ranked list) to refine context retrieval and item ranking.
The framework leverages specialized LLM-based agents that collaborate in a blackboard-style system to process user context and candidate items.
This agentic approach enhances context awareness, semantic grounding, and personalization in recommendation systems by decomposing the task into distinct reasoning steps.

SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

SPAZER: introduces a VLM-driven agent for zero-shot 3D visual grounding, integrating 3D spatial localization and 2D semantic verification through a progressive reasoning process with 3D Holistic View Selection (Generates, selects optimal 3D view), Candidate Object Screening (Filters, ranks potential objects), 3D-2D Joint Decision-Making (Integrates 3D/2D for final grounding), and VLM (Core reasoning, decision-making engine) components.
The approach leverages holistic 3D rendered views for global spatial context and incorporates retrieval-augmented candidate screening for enhanced robustness.
SPAZER performs 3D-2D joint decision-making by combining information from selected 3D views and relevant 2D camera images to identify the target object.

26th June 2025

CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation

CitySim: introduces a large-scale urban simulation framework using LLM-powered agents with Persona Module (Demographics, traits, habits), Memory Module (Temporal, reflective, spatial), Belief Module (Updates POI beliefs), Needs Module (Tracks, prioritizes needs), Long-Term Goal Module (Forms, revises aspirations), Perception Module (Observes environment, reacts), Planning Module (Generates daily schedules), Place Selection Module (Determines activity location), Vehicle Selection Module (Selects transport mode), and Social Module (Manages social interactions) to model human-like behavior.
The framework enables agents to generate realistic daily schedules and long-term plans through recursive, value-driven planning, balancing mandatory activities, personal habits, and situational factors.
CitySim agents are equipped with spatial and temporal memories to recall experiences, form beliefs about places, and adapt future decisions, demonstrating closer alignment with real human behavior than prior work.

MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models

MobiVerse: introduces a hybrid framework for urban mobility simulation, combining a Domain-Specific Generator for base activity chains with an LLM Activity Chain Modifier for context-aware adaptation, integrated within a Visualized Simulation Environment using SUMO.
The framework utilizes a SUMO Controller for simulation execution and data collection, a Trajectory Viewer for visualization, and global data stores (Road Network, POI Info, Agent Info) for system-wide access.
Supporting components like the PromptManager, RoadClosureHandler, and EventHandler manage LLM interactions and specific environmental events to enhance behavioral realism and scalability.

SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

SEEA-R1 (Self-Evolving Embodied Agents-R1): introduces a reinforcement fine-tuning framework for embodied agents, featuring Policy Model (Predicts actions), Reward Model (MGRM) (Predicts task outcomes), Data Evolution (MCTS) (Generates experience), Model Evolution (Tree-GRPO) (Updates models), Environment (Provides observations/rewards), and Experience Dataset (Stores interaction data).
The framework drives continuous improvement through iterative Data Evolution and Model Evolution cycles, using MCTS for experience generation and Tree-GRPO for policy updates.
It utilizes a Multi-modal Generative Reward Model (MGRM) to provide dense, generalizable reward signals, reducing dependence on sparse environment rewards.

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Proactive Alignment Framework: introduces a method that simulates long-term societal consequences of LLM advice using World Modeling and an Event Scripting Model to generate Feedback, which is then used by an Improver to refine responses.
The framework explores causal event trajectories via Event Trajectory Search, identifies affected population Strata, and generates Agent Feedback for these groups.
This approach enhances LLM risk awareness, leading to Refined Responses that are safer, achievable through inference-time refinement or offline Realignment Training.

ParEval-RepO: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

LLM-based translation techniques: introduces PAREVAL-REPO, a benchmark suite for evaluating repository-level HPC translation using Non-agentic method (file-by-file translation), Top-down agentic method (multi-agent system), and SWE-agent (autonomous coding agent).
The Top-down agentic method comprises a Dependency agent (determines file dependencies), Context agent (manages translation changes), Chunk agent (splits large files), and Translation agent (translates code chunks).
Evaluation metrics (assess translation quality) and Error analysis (identifies translation errors) are used to assess various LLMs (Large Language Models) and techniques, highlighting challenges in build system generation and cross-file dependencies.

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Multi-Agent Framework: introduces LLM agents (ContextAgent, ParameterAgent, ValidationAgent, SimulationAgent, SuggestionAgent) collaborating within a GroupChat environment, leveraging IDAES simulation, for chemical process optimization.
The framework operates in two phases: autonomous constraint generation by the ContextAgent followed by iterative optimization guided by the other agents.
This approach addresses the constraint definition bottleneck in traditional optimization by autonomously inferring operating bounds from minimal descriptions.

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Multi-Agent Framework: introduces LLM agents (ContextAgent, ParameterAgent, ValidationAgent, SimulationAgent, SuggestionAgent) collaborating within a GroupChat environment, leveraging IDAES simulation, for chemical process optimization.
The framework operates in two phases: autonomous constraint generation by the ContextAgent followed by iterative optimization guided by the other agents.
This approach addresses the constraint definition bottleneck in traditional optimization by autonomously inferring operating bounds from minimal descriptions.

FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

FaSTA* (Fast-Slow Toolpath Agent): introduces a neurosymbolic agent for multi-turn image editing, including LLM (high-level planning / reasoning), VLM (quality checking), Subroutine Rule Table (learned subroutines / rules), and Online Subroutine Learning Mechanism (learns / refines subroutines).
It utilizes an Adaptive Fast-Slow Execution Strategy (fast / slow planning) that attempts learned subroutines first and triggers A* Search (low-level toolpath search) as a fallback, supported by Knowledge Structures (TDG / MDT / BT) and AI Tools (image editing operations).
This method achieves substantial cost savings in execution time while maintaining image editing quality comparable to state-of-the-art baselines.

25th June 2025

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

GPU Kernel Scientist: introduces an automated, iterative framework for GPU kernel optimization, including Population (Stores kernels and performance data), LLM Evolutionary Selector (Selects kernels for iteration), LLM Experiment Designer (Designs optimization experiments), LLM Kernel Writer (Generates modified kernel code), and Benchmarking Platform (Evaluates kernel performance).
The framework leverages large language models across three core stages to iteratively refine GPU kernels based on performance feedback from an external evaluation system.
This LLM-driven approach aims to bridge knowledge gaps and accelerate kernel optimization, particularly in environments with limited documentation or tooling.

Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis

LLM Mitigation Pipeline: introduces an approach integrating LLM analysis into a GNN-based NIDS pipeline, including initial data processing, parameter configuration, data preprocessing, graph summarization, LLM agent analysis, and output generation for the GNN.
The pipeline employs LLM agents as simulated cybersecurity experts to analyze network graph elements and identify suspicious components before GNN processing.
This LLM-based mitigation strategy aims to enhance GNN resilience against realistic attacks like node injection by filtering or flagging malicious graph elements.

A SURVEY OF AI FOR MATERIALS SCIENCE: FOUNDATION MODELS, LLM AGENTS, DATASETS, AND TOOLS

AI4MS: introduces, "Common & Prevalent Tasks (broad application areas), Foundation Models (large pretrained models), Datasets (data collections), Tools & Infrastructures (supporting software platforms), and Successes, Limitations & Challenges, Future Directions (discussion points), where the paper surveys the landscape of AI for materials science."
The survey categorizes Foundation Models into Unimodal, Multimodal, and LLM Agents, and Datasets into Computational/Experimental and LLM Development.
It also reviews Tools & Infrastructures for Data Analysis/Management and Model Development, and discusses the current state and future directions.

The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

DECRYPTO: introduces a multi-agent benchmark for evaluating language models, featuring Alice (Chooses hints), Bob (Guesses code), and Eve (Intercepts code) interacting over shared Keywords (Secret words), Code (Secret digits sequence), and Hints (Words for code), tracked via Hint History (Past hints) and Code History (Past codes), and evaluated using Generalist Agents (Out-of-the-box LLMs), Specialist Agents (Task-specific agents), and specific Theory of Mind Tasks (Cognitive experiments).
The benchmark is based on a language game requiring players to reason about others' knowledge and beliefs to succeed in cooperative and competitive settings.
DECRYPTO provides a platform for studying multi-agent reasoning, theory of mind, and human-AI interaction in interactive, language-based scenarios.

Memento: Note-Taking for Your Future Self

Memento: introduces a three-stage strategy, with Plan generation (Decomposes question into steps), Prolog query (Symbolic representation of steps), Definitions (Natural language predicate mapping), Database construction (Populates fact database), Prolog database (Stores extracted/verified facts), Query execution (Evaluates query for answer), and LLM (Performs tasks in stages), which decomposes complex tasks, records outcomes, and uses Prolog for structured reasoning.
The method operates in three phases: plan generation, database construction, and query execution, leveraging LLMs to generate symbolic plans and populate a Prolog database.
Memento uses Prolog queries and a dynamically constructed database of facts to answer multi-hop questions, combining symbolic structure with LLM flexibility.

Memento: Note-Taking for Your Future Self

Memento: introduces a three-stage strategy, with Plan generation (Decomposes question into steps), Prolog query (Symbolic representation of steps), Definitions (Natural language predicate mapping), Database construction (Populates fact database), Prolog database (Stores extracted/verified facts), Query execution (Evaluates query for answer), and LLM (Performs tasks in stages), which decomposes complex tasks, records outcomes, and uses Prolog for structured reasoning.
The method operates in three phases: plan generation, database construction, and query execution, leveraging LLMs to generate symbolic plans and populate a Prolog database.
Memento uses Prolog queries and a dynamically constructed database of facts to answer multi-hop questions, combining symbolic structure with LLM flexibility.

Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Behavior Editing: introduces steering LLM-based agents' ethical behavior by editing the agent (Pre-edit Agent) to become a Post-edit Agent, enabling both benevolent and malicious steering.
The approach frames agent behavior steering as a model editing task, allowing precise and efficient modifications to influence behavior and moral alignment.
The BEHAVIORBENCH benchmark is developed to systematically evaluate this editing approach across diverse ethical scenarios and complexity levels.

Fine-Tuning and Prompt Engineering of LLMs, for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges

Multi-Agent AI System: introduces a Retrieval-Augmented Generation (RAG)-oriented system for sustainable protein production research, with a Literature Search Agent (retrieves literature), Information Extraction Agent (extracts information), Pool of Scientific Literature (literature source), User Interface (user interaction), Toxicity Analysis Module (screens for toxicity), GPT-Based LLM (agent foundation), Prompt Engineering (optimisation method), Fine-Tuning (optimisation method), and External Sentence Transformer (evaluation tool).
The study compares fine-tuning and prompt engineering as methods to optimise the performance of the information extraction agent using GPT models.
This multi-agent system aims to automate the process of retrieving and extracting key information from scientific literature to accelerate research in sustainable protein production.

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

DeepRare: introduces an LLM-powered agentic system for rare disease diagnosis, with Central Host (Coordinates workflow, synthesizes info), Memory Bank (Stores diagnostic information, context), Agent Servers (Execute specialized tasks), Phenotype Extractor (Converts free-text to HPO), Phenotype Analyzer (Analyzes HPO, suggests diseases), Knowledge Searcher (Retrieves medical documents, web), Case Searcher (Finds similar patient cases), Genotype Analyzer (Annotates, ranks genetic variants), Disease Normalizer (Standardizes disease names), External Data Sources (Provide diagnostic evidence), Medical Literature (Peer-reviewed publications), Rare Disease Knowledge (Curated rare disease info), General Knowledge (Broad clinical resources), Case Collection (Repository of patient cases), and Gene Variant Databases (Genetic variant information) components, designed to process heterogeneous clinical inputs and generate traceable diagnostic reasoning.
The system employs a three-tier architecture comprising a central host, specialized agent servers, and diverse external data sources to facilitate complex diagnostic reasoning.
DeepRare generates ranked diagnostic hypotheses with transparent reasoning chains linked to verifiable medical evidence, enhancing interpretability and supporting clinical adoption.

SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models

SV-LLM (multi-agent assistant system): introduces a multi-agent framework for SoC security verification, with Application, Supervisor, Orchestrator, Agent, Data, and Infrastructure layers, designed to automate and enhance the verification workflow.
The system employs specialized LLM-driven agents for tasks including security Q&A, asset identification, threat modeling, test plan generation, vulnerability detection, and bug validation.
The layered architecture and agentic design aim to streamline complex verification tasks, reduce manual effort, and improve accuracy and scalability in hardware security analysis.

TAPS: Tool-Augmented Personalisation via Structured Tagging

TAPS (Tool-Augmented Personalisation via Structured Tagging): introduces a tuning-free approach for personalised tool use in task-oriented dialogue, combining an LLM (Generates response, predicts API calls), a Structured Tagging Tool (Augments data, adds tags), and an Uncertainty-based Tool Detector (Determines tool use, assesses confidence).
The framework leverages structured tagging to create an intermediate representation between natural language and API calls, enhancing argument extraction.
An uncertainty-based tool detector determines when to apply the structured tagging tool to improve performance.

Language Modeling by Language Models

Genesys: introduces an autonomous system for discovering novel language model architectures, with LMADE (Environment), Knowledge Engine (Knowledge access), Reference Library (Curated papers), External Sources (Search tools), Paper Vector DB (Vector database), Verification Engine (Verification tools), Symbolic Checker (Code analysis), Automated Trainer (Model training), Automated Evaluator (Model evaluation), Auto-Tuner (Parameter tuning), Runtime Checker (Training monitor), Evolutionary Tree (Design storage), LLM-driven Agents (Discovery agents), Designer Agents (Design creation), Proposer Agent (Proposal generation), Reviewer Agent (Proposal review), Planner Agent (Implementation planning), Coder Agent (Code writing), Observer Agent (Code review), Verifier Agents (Verification management), Generalized Autoregressive Block (Main architecture unit), Generalized Autoregressive Unit (Composable sub-unit), Ladder-of-Scales (Multi-scale verification), and Unit-based Generation (Stepwise code generation).
The system simulates the research process from ideation to verification using LLM agents and a genetic programming backbone operating on a factorized design space.
Genesys employs a Ladder-of-Scales approach for efficient verification and unit-based code generation for improved design quality and efficiency.

PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models

PSALM-V: introduces a neuro-symbolic learning system that induces symbolic action semantics in visual environments by iteratively initializing/updating problem files, sampling trajectories, executing in the environment, predicting errors, generating/updating action semantics, and using a symbolic planner for verification.
The system maintains a tree-structured belief over action semantics, refining it based on execution outcomes and predicted errors to enable reliable planning without expert definitions.
PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations.

24th June 2025

Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models

OIR (Open-Ended Instruction Relabeling): introduces a framework that leverages a Large Language Model to automatically generate open-ended instructions from collected agent trajectories, enriching training data for instruction-following reinforcement learning.
The framework uses the LLM to relabel unsuccessful trajectories by identifying accomplished subtasks, providing semantic rewards for efficient learning in sparse environments.
A prioritized instruction buffer manages the diverse, LLM-generated instructions, balancing exploration and exploitation for robust policy improvement.

QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

QHackBench: introduces a novel benchmark dataset and evaluation framework for LLM-based quantum code generation, featuring QHack Challenges, PennyLang Dataset, Retrieval, Code Generation Agent, Test Bench, Validation & Correction Agent, Self-Reasoning, and Augmented Query components.
The framework systematically evaluates LLMs using vanilla prompting, Retrieval-Augmented Generation, and a multi-agent iterative refinement pipeline on real-world quantum coding challenges.
Results indicate RAG and multi-agent approaches can enhance performance, highlighting the importance of domain-specific context and iterative debugging for reliable quantum code generation.

Prover Agent: An Agent-based Framework for Formal Mathematical Proofs

Prover Agent: introduces an agent-based framework for formal mathematical proofs, coordinating an Informal LLM (informal reasoning), Prover Model (formal proving), Lean (formal verification), and AutoFormalizer (formalizes lemmas).
The framework generates auxiliary lemmas via informal reasoning, formalizes them, proves them, and uses verified lemmas to synthesize the final proof.
Iterative refinement based on Lean feedback is used throughout the process to ensure correctness and improve proof construction.

JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning

JoyAgents-R1: introduces a joint evolution dynamics framework for heterogeneous multi-agent systems, including a master agent (orchestrates tasks), sub-agents (specialized task execution), agent memory (stores past information), tools (external functionalities), joint evolution dynamics (training process), and joint reward function (calculates action feedback).
The framework employs a hierarchical architecture where the master agent delegates tasks to specialized sub-agents that interact with tools and memory.
Joint evolution dynamics leverages GRPO with node-wise Monte Carlo sampling and marginal benefit updating, while memory evolves adaptively using GRPO rewards.

MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration

MAM (Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis): introduces a modular, collaborative framework with General Practitioner (Initial triage/referral), Specialist Team (Domain expert agents), Radiologist (Image analysis agent), Medical Assistant (Information retrieval/summary), and Director (Orchestrator/synthesizer) agents for multi-modal medical diagnosis.
The framework decomposes the diagnostic process into specialized roles, each embodied by an LLM-based agent, enabling efficient knowledge updates and leveraging existing models.
Agents collaborate through a defined workflow involving initial triage, problem decomposition, information retrieval, diagnostic opinion generation, discussion, report synthesis, consensus, and final diagnosis derivation.

LLM-Based Social Simulations Require a Boundary

LLM-Based Social Simulations: introduces boundaries for reliable social science contributions, focusing on LLM Agents (model individual behavior), Alignment (simulated behaviors match real-world), Consistency (coherent agent behavior over time), and Robustness (reproducibility under conditions).
The paper argues that LLMs' inherent limitations, particularly lack of behavioral heterogeneity, constrain their reliability for simulating complex social dynamics.
It proposes heuristic boundaries and a checklist to guide researchers in determining the appropriate scope and claims for such simulations in social science research.

SAGE: Strategy-Adaptive Generation Engine for Query Rewriting

SAGE (Strategy-Adaptive Generation Engine): introduces a reinforcement learning framework for query rewriting that integrates a Policy Model guided by Explicit Strategic Primitives, evaluated by an Environment, and trained using a Reward Shaping Module with Strategic Credit Shaping and Contrastive Reward Shaping, enhanced by an Exploration Penalty and Proactive Exploration Prompting via GRPO Update.
The framework operationalizes expert-crafted strategies within an RL loop to steer the LLM agent towards effective query rewriting and improved policies.
Novel reward shaping mechanisms and forced exploration techniques are introduced to provide informative learning signals and counteract reward hacking.

A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects

Multi-sensor Fusion Perception (MSFP): introduces a survey of methods for embodied AI, detailing pipelines with Sensor Data, Backbone/Encoder, Features, Fusion Mechanism, and Downstream Task components.
The survey categorizes methods by fusion level (point, voxel, region, multi-level), multi-agent (Agent Communication), time-series (Temporal Fusion), and MM-LLM (LLM) approaches.
It reviews specific techniques within each category and discusses open challenges and future opportunities for MSFP in embodied AI.

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Agent Communication Classification: introduces a comprehensive survey of LLM-driven AI agent communication, classifying it into User-Agent Interaction, Agent-Agent Communication, and Agent-Environment Communication, and analyzing related protocols, security risks, and defense countermeasures for each stage.
The paper details the typical LLM-Driven AI Agent Architecture comprising perception, memory, reasoning/planning, tool, and action modules, highlighting how agent communication enables collaboration and task completion beyond single LLM capabilities.
Agent-Agent Communication Architectures are categorized into CS-based, P2P-based, Hybrid, and Others based on their discovery mechanisms, while specific protocols like MCP, A2A, and AG-UI are discussed within the respective communication stages.

Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning

TAPAS (Task-based Adaptation and Planning using Agents): introduces a multi-agent framework for adaptive task planning, including Domain Generator, Initial State Generator, Goal State Generator, Planning Problem Model, Solver, Debugger, Structured Plan, Plan Abstraction, Plan in NL, Plan Executor, Action Executor Agent, Validator Agent, Memory, Critic, Agent Tools, and Robot/Execution Environment.
The framework uses specialized LLM-based agents to collaboratively generate and adapt domain models, initial states, and goals via structured tool calls and iterative refinement.
A robust planning and execution pipeline translates symbolic plans to natural language for a ReAct-style execution agent, bridging the gap to real-world robot capabilities with feedback-driven validation.

KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs

KnowMap: introduces a novel approach for LLM task adaptation by dynamically constructing a knowledge base from environmental and experiential data, equipping a larger LLM with task-specific knowledge via a fine-tuned embedding model, and utilizing an agent scaffold with planner, actuator, evaluator, and memory module.
The framework's knowledge base is divided into an environmental knowledge base representing the current environment state and an experiential knowledge base storing reusable experiences and reasoning patterns derived from trajectories.
KnowMap fine-tunes a knowledge-embedding model on data derived from both knowledge bases to enhance retrieval performance and support the agent scaffold's decision-making process.

MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications

MATE (Multi-Agent Translation Environment): introduces, a multimodal accessibility multi-agent system, with Interpreter Agent (Identifies task, redirects), TTS Expert (Text to speech), TTI Expert (Text to image), STT Expert (Speech to text), ITT Expert (Image to text), ITA Expert (Image to audio), ATI Expert (Audio to image), VTT Expert (Video audio to text), ModCon-Task-Identifier (Task type recognition model), Pre-defined Models/Functions (Perform modality conversion), and Output File Storage (Saves output files), designed to perform modality conversions based on user needs for accessibility applications.
The system uses an Interpreter Agent, powered by ModCon-Task-Identifier, to identify the user's requested modality conversion task and delegates it to one of seven specialized expert agents.
Expert agents utilize pre-defined models and functions to execute specific conversion tasks, saving the output to a designated location for the user.

NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling

NaviAgent: introduces a bilevel planning architecture for robust function calling, integrating a Multi-Path Decider (LLM-powered agent) for action selection and a Graph-Encoded Navigator (Graph-based planning) for toolchain planning on a Tool Dependency Heterogeneous Graph (TDHG) (Tool relationship graph).
The Graph-Encoded Navigator constructs and evolves the TDHG through Graph Construction (Builds TDHG), Graph Representation (Node/edge features), Graph Training (Optimizes graph), Graph Search (Finds toolchains), and Graph Evolution (Updates graph).
The Multi-Path Decider dynamically chooses actions from Direct Response (Decider action), Intent Clarification (Decider action), Tool Retrieval (Decider action), and Tool Call (Decider action) based on perceived states and Navigator output.

Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning

LLM-Based Dialogic Pedagogy Framework: proposes strategies for designing effective LLM-based conversational AI tutors, incorporating an LLM (Core conversational engine), Dialogue Strategy Engine (Guides conversation flow), Knowledge Retrieval Module (Integrates external information), Student Model (Tracks learner state), and Interaction Persona Module (Manages AI style/role).
The strategies aim to address limitations of raw LLMs, such as over-directness and lack of student modeling, by integrating pedagogical principles like Socratic questioning, scaffolding, and reflection.
The framework emphasizes aligning AI interactions with proven learning theories to create personalized, engaging, and educationally productive dialogues.

LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code

LLM-based Multi-Agent System: introduces an automated Haskell code refactoring system with agents for code analysis, strategy formulation, refactoring execution, testing, and debugging.
The system employs specialized agents like Code Context and Structure, Code Smells, Refactoring Strategy, Refactor (Expert/Lead), Testing and Validation, and Debug agents to collaboratively improve code.
This multi-agent approach aims to enhance code quality, runtime efficiency, and memory usage in functional programming codebases through structured interaction and iterative refinement.

Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Mem4Nav: introduces a hierarchical spatial-cognition long-short memory system with Sparse Octree (voxel indexing), Semantic Topological Graph (landmark connectivity), Reversible Token Processing (encodes/decodes memory), Long-Term Memory (lossless historical storage), and Short-Term Memory Cache (recent local context).
This system integrates fine-grained voxel indexing and high-level landmark connectivity with dual memory modules for efficient storage and retrieval of spatial observations.
The dual memory architecture, using reversible tokens for LTM and a frequency-recency cache for STM, enables agents to retain relevant experiences over extended time horizons for improved navigation.

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Commander-GPT: introduces a modular multi-agent framework for multimodal sarcasm detection, including Input, Subtask Routing, Subtask Execution by specialized agents (Context Modeling, Sentiment Analysis, Rhetorical Device Recognition, Facial Expression Recognition, Image Summarization, Scene Text Recognition), and a Commander for result integration and final decision.
The framework decomposes sarcasm detection into six cognitively meaningful sub-tasks, dynamically routing input to the most suitable specialist agents.
Centralized coordination by the commander integrates information from activated agents for adaptive and fine-grained reasoning across modalities.

Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Skywork-SWE: introduces an automated data curation pipeline for software engineering tasks, including data collection and pre-filtering, environment setup and execution-based validation, and agent trajectory generation, used to train the Skywork-SWE Agent Model, evaluated with the OpenHands Agent Framework and Test-Time Scaling (TTS).
The pipeline generates a large-scale, high-quality dataset of GitHub issue-fix instances with executable runtime environments.
The trained model demonstrates data scaling laws in software engineering tasks and achieves state-of-the-art performance on SWE-bench Verified among open-source models.

Augmenting Multi-Agent Communication with State Delta Trajectory

SDE (State Delta Encoding): introduces a novel multi-agent communication protocol that augments natural language with LLM's hidden states by transferring token-wise changes.
The protocol involves agents built from a single LLM exchanging natural language tokens and state delta trajectories.
State delta trajectories, representing differences in hidden states, are injected into the receiver agent's transformer layers to enhance understanding.

23th June 2025

Distilling Tool Knowledge into Language Models via Back-Translated Traces

Back-translation pipeline: introduces a method to distill tool knowledge into language models by converting tool-integrated reasoning traces into natural language using a SOLVER AGENT (Generates TIR traces), SymPy Toolkit (Provides symbolic tools), TIR Trace Filtering (Selects correct traces), TRANSLATOR AGENT (Translates tool calls), JUDGE AGENT (Verifies translations), and REPHRASE AGENT (Reconstructs NL traces) to fine-tune a Student Model (Fine-tuned target model).
This pipeline generates high-quality natural language reasoning traces from tool-augmented solutions, enabling smaller models to internalize structured problem-solving patterns without requiring tool access at inference.
The approach improves performance on challenging math benchmarks by transferring symbolic computation capabilities and structured reasoning from tool-using agents to language-only models.

18th June 2025

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

SwarmAgentic: introduces a framework for fully automated agentic system generation, using Particle Swarm Optimization to explore a language-driven design space, optimizing Agentic Systems composed of an Agent Set and Collaborative Structure.
The framework iteratively refines Agentic Systems by updating Particle positions and velocities based on Fitness Function evaluation and Flaw Identification.
Velocity updates integrate Failure-Driven Adjustments, Personal Best Guidance, and Global Best Guidance to refine Agent functionality and collaboration strategies, yielding the best system as the Search Result.

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

LPA (LLM-based Planning Agent): introduces a system for managing complex failure analysis workflows, with Agent Core orchestrating control flow, Memory retaining information, Plan Generation creating step-by-step plans, Action Matching and Execution selecting and running tools, Feedback and Reflection adjusting plans based on results, LLM processing language and reasoning, Tools providing external system interfaces, and Data serving as external information sources.
The agent utilizes LLMs as the "brain" to decompose complex queries and resolve them through reasoning and autonomous tool use, employing ReAct or Online Replanning approaches.
The system integrates external tools like databases, search engines, and AI models to retrieve data and perform analysis tasks, supporting FA engineers.

PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection

PhishDebate: introduces a modular multi-agent LLM-based debate framework for phishing website detection, with URL Analyst Agent, HTML Structure Agent, Content Semantic Agent, Brand Impersonation Agent, Moderator, and Judge components.
The framework employs specialized agents to analyze different website aspects and coordination agents to manage a structured debate process.
This multi-agent approach aims to improve detection accuracy, interpretability, and robustness compared to single-agent methods.

The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games

Approach: introduces a framework for constructing natural language state representations for prompting LLM agents in repeated multi-agent games, implemented with LLM Agents, Game Environment, Prompting Mechanism, State Representation, LangChain, and OpenAI API.
The system evaluates LLM agent behavior in a dynamic selfish routing game by varying state representations along action informativeness, reward informativeness, and prompting style axes.
The research finds that summarized state representations, regret-based feedback, and limited information about others' actions lead to more stable, equilibrium-like agent behavior.

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

LPA (LLM-based Planning Agent): introduces an agent architecture for failure analysis workflows, integrating a Large Language Model for reasoning and planning, Memory for retaining information, Action Matching and Execution for tool use, and Feedback and Reflection for plan refinement, interacting with a User and the external Environment via Data and Tools.
The agent utilizes ReAct-style iterative task generation or online replanning to process complex queries and generate human-readable responses.
The implementation integrates external tools like databases and ML models, demonstrating technical feasibility and robustness in a production-like environment.

AGENTGROUPCHAT-V2 : Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

AGENTGROUPCHAT-V2: introduces a novel framework with Query Manager (Frontend, query decomposition), Task Manager (Central coordination, task flow), Group Manager (Execution, collaboration organization), Agent (Individual LLM participant), Task (Basic processing unit), Group (Collaborative work unit), and Task Forest (Hierarchical task structure) for LLM-based multi-agent systems.
The framework employs a divide-and-conquer parallel architecture, dynamic task tree decomposition, and specialized agent role assignment to address challenges in system architecture, generalizability, and performance.
Experimental results demonstrate superior performance on complex reasoning, code generation, and diverse tasks compared to existing multi-agent approaches.

RAS-EVAL: A COMPREHENSIVE BENCHMARK FOR SECURITY EVALUATION OF LLM AGENTS IN REAL-WORLD ENVIRONMENTS

RAS-Eval: introduces a comprehensive security benchmark for LLM agents, including Test Cases, Attack Tasks, Scenarios, Toolkits, Risk Management, and Evaluation Pipelines.
The benchmark supports both Real Execution and Simulated Execution of tools across JSON, LangGraph, and MCP formats.
It incorporates Failure Modes and Vulnerability Types for granular analysis and uses Evaluation Pipelines to measure task completion and attack success rates.

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

Agent Framework: introduces a survey of jailbreak attacks and defenses in the LLM ecosystem, with Core (Central processing unit), Planning (Task decomposition/logic), Tools (External interfaces/applications), Memory (Information management/storage), and LLM Network (Multi-agent interaction) components, where the paper reviews the evolution from LLMs to MLLMs and Agents and analyzes security challenges.
The survey categorizes jailbreak techniques by attack impact and visibility and defense strategies by response timing and technical approach.
It also details datasets and evaluation metrics used in jailbreak research and outlines future research directions.

Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

Multi-Response Generation (MRG) and Preference-based Selection (PS): introduces a two-stage framework for open-domain dialogue response generation, where MRG generates a set of diverse responses and PS selects the best one based on human preference.
The approach leverages smaller LLMs and introduces the o2mDial dataset to explicitly capture the one-to-many property.
Empirical results show the framework enhances response diversity and quality in smaller LLMs, approaching the performance of larger models.

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

LLM-based Embodied Agent Pipeline: studies hallucinations in embodied agents by evaluating a pipeline that takes Scene (Visual input) and Task Description (Natural language instruction), processes them via a Scene Parser (Extracts scene info) and LLM as Goal Interpreter (Generates symbolic goals) to produce LTL Goal (Symbolic task goals) for Execute the Task (Action planning/execution).
The study constructs a hallucination probing set by systematically modifying the Task Description and Scene Information inputs to introduce scene-task inconsistencies.
The research finds that LLMs struggle to reconcile scene-task inconsistencies, leading to hallucinations and failures in handling infeasible tasks.

OS-HARM: A Benchmark for Measuring Safety of Computer Use Agents

OS-HARM (Benchmark): introduces a benchmark for measuring the safety of computer use agents, featuring OS-HARM tasks, OSWorld Ubuntu VM, LLM Agent, OSWorld scaffolding, Agent Traces, and LLM Judge.
The benchmark evaluates LLM-based agents on tasks involving deliberate user misuse, prompt injection attacks, and model misbehavior within a realistic OSWorld environment.
An automated LLM Judge evaluates agent performance and safety based on recorded execution traces, including reasoning steps, screenshots, and accessibility trees.

LLM Agent for Hyper-Parameter Optimization

LLM Agent Framework and MCP: introduces an interactive framework orchestrating collaboration between the LLM Agent (comprising Profile, Memory, Planning, and Action components), Human inputs, and the Environment (WS-PSO-CM algorithm) for automatic hyper-parameter tuning.
The Model Context Protocol (MCP) defines a unified communication specification enabling the LLM Agent to interact with external systems via MCP Client and MCP Server architecture.
The framework iteratively refines hyper-parameters for the WS-PSO-CM algorithm based on prompt requirements and environmental feedback to optimize UAV trajectory and communication.

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Large Reasoning Model (LRM): analyzes privacy leakage in Large Reasoning Models, involving Input (Sensitive user data and scenario), Large Reasoning Model (Processes input), Reasoning Trace (Intermediate thinking steps), and Output (Final response), revealing that Reasoning Traces frequently leak sensitive user data.
Reasoning Traces, often assumed internal, are shown to be easily extractable and contain abundant sensitive data, making them a significant privacy vulnerability.
Increasing test-time compute for better utility can worsen privacy by increasing leakage in the Reasoning Trace, highlighting the need for privacy strategies targeting internal thinking.

17th June 2025

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

Gemini 2.X model family: introduces Gemini 2.5 Pro and Flash, built on Sparse mixture-of-experts transformers (Core Architecture) with Multimodal Support (Text, Image, Video, Audio), Long Context Processing (>1M tokens), Tool Use Support (Function calls), Thinking (Inference process), and Deep Think (Reasoning approach).
The models enable next-generation agentic capabilities, demonstrated by the Gemini Plays Pokémon agent harness which includes components like Persistent Memory & Context, Goals, Action History & Summaries, Game State, Periodic Processes (Memory Summarizer, Guidance Gemini), Agentic Tools (Pathfinder, Boulder Puzzle Strategist), and Game I/O.
Gemini 2.5 models achieve state-of-the-art performance on various benchmarks, including long-context video understanding (processing up to 3 hours of video) and coding, while also undergoing extensive safety and security evaluations.

AGENTDISTILL: TRAINING-FREE AGENT DISTILLATION WITH GENERALIZABLE MCP BOXES

AgentDistill: introduces a training-free agent distillation framework, with Teacher Agent (Generates MCPs), Manager Agent (Teacher) (Coordinates tasks), Basic Image Captioner (Teacher) (Captions images), MCP Creation Module (Creates task MCPs), MCP-Box Construction (Builds MCP Box), Abstraction (Parameterizes MCPs), Clustering (Groups MCPs), Consolidation (Merges MCPs), MCP Box (Reusable task modules), Student Agent (Uses MCP Box), Manager Agent (Student) (Coordinates tasks, uses MCP Box), and Basic Image Captioner (Student) (Captions images), which transfers task-solving capabilities from large teacher agents to small student agents via reusable Model-Context-Protocols (MCPs).
The framework involves a teacher agent generating MCPs, a construction process to build a reusable MCP-Box by abstracting, clustering, and consolidating them, and a student agent that directly integrates this MCP-Box for inference.
AgentDistill enables student agents to inherit sophisticated problem-solving skills and generalize across tasks by providing a structured MCP-Box without requiring additional training or trajectory replay.

Unified Software Engineering agent as AI Software Engineer

USEagent (Unified Software Engineering agent): introduces a unified agent for software engineering tasks, with Meta-Agent orchestrates actions, Actions perform SE tasks, Task State stores shared information, and Program is the target software project.
The Meta-Agent uses a ReAct-style loop to select actions based on the current task state and action outputs.
The framework utilizes a set of modular actions encapsulating units of work and a structured task state for consensus memory among actions.

Doppelgänger Method : Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack

Doppelgänger Method: introduces a prompt-based transferable adversarial attack method to break LLM agent consistency, evaluated using the PACAT Level metric, and countered by the CAT Prompt defense.
The method demonstrates the risk of role hijacking and internal information exposure in LLM agents.
Experimental results show the attack's effectiveness and the defense prompt's ability to mitigate consistency degradation.

Automated Decision-Making on Networks with LLMs through Knowledge-Guided Evolution

LLMNet: introduces a system for automated GNN design using LLM-based agents, including Knowledge Agent (Builds, manages knowledge bases), Prior Knowledge Base (Stores task-specific knowledge), Experiment Knowledge Base (Stores experimental results), Planning Agent (Generates task plan, evaluates), Data Agent (Performs feature engineering), Configuration Agent (Configures search space), and Evaluation Agent (Fine-tunes, experiments), which leverages knowledge bases and RAG for knowledge-guided evolution.
The system employs a pipeline of specialized agents that interact with constructed knowledge bases to design and refine GNN model architectures step by step.
LLMNet demonstrates superior performance across various graph learning tasks by effectively integrating graph-related knowledge into the automated design process.

GENERATIONPROGRAMS: Fine-grained Attribution with Executable Programs

GENERATIONPROGRAMS: introduces a modular generation framework that decomposes the process into program generation by an LLM and program execution by neural modules, producing an output with sentence-level attributions from input documents.
The framework first generates an executable program plan composed of modular text operations tailored to the query, then executes this plan using neural modules like paraphrasing, compression, fusion, and extraction on retrieved document sentences.
This two-stage approach enables fine-grained attribution by tracing the program execution and linking generated content back to source sentences, enhancing interpretability and verifiability.

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

SIRI-Bench (Spatial Intelligence ReasonIng Benchmark): introduces a benchmark for evaluating VLMs' spatial intelligence using video-based 3D geometry problems, generated by an Automatic Scene Creation Engine leveraging Specialized LLM Agents to transform Original Math Problems into Realistic 3D Scenes and Video inputs for VLMs, alongside textual Questions and numerical Answers.
The Automatic Scene Creation Engine generates the benchmark data by solving geometric conditions, generating Blender Python Scripts, and refining textual inputs and outputs.
SIRI-Bench challenges VLMs to extract spatial information from video and perform complex reasoning, revealing limitations in current models compared to human performance and text-based LLMs.

LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?

LLM-Powered Swarms: introduces a new paradigm for swarm intelligence using Large Language Models as agents, featuring LLM Agents (Large Language Models), Multi-Agent Coordination (Interconnected agents collaborate), Client-Side Operation (Framework runs locally), LLM Access (Cloud or local models), and Prompts (Natural language instructions).
This approach contrasts with traditional rule-based swarms by trading execution speed for flexibility and higher-level reasoning capabilities.
Evaluation using Boids and ACO models highlights significant latency and resource costs compared to classical methods, suggesting potential for hybrid systems.

Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

ECPO (Expectation Confirmation Preference Optimization): introduces a novel multi-turn preference optimization paradigm leveraging Expectation Confirmation Theory to align LLM-based conversational recommendation agents with user expectations.
The framework explicitly models user satisfaction evolution across turns using Forward Expectation Confirmation and rewrites unsatisfactory responses via Backward Expectation Derivation with a Rewriter.
ECPO is supported by AILO, an LLM-based user simulator that provides realistic feedback and performs expectation confirmation, enabling efficient turn-level preference optimization without extensive sampling.

ADRD: LLM-DRIVEN AUTONOMOUS DRIVING BASED ON RULE-BASED DECISION SYSTEMS

ADRD (LLM-Driven Autonomous Driving Based on Rule-based Decision Systems): introduces a framework with Information Module, Agents Module (Planner, Coder, Summarizer), and Testing Module, leveraging LLMs to generate and refine rule-based decision trees for autonomous driving.
The Information Module gathers scenario data, the Agents Module generates and codes driving tactics, and the Testing Module provides feedback for iterative refinement.
ADRD demonstrates superior performance, response speed, and interpretability compared to baselines by integrating LLMs with rule-based decision systems.

From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents

TIMER: introduces timely dialogue response generation, with Time Interval Prediction (predicts delay), Time-conditioned Response Generation (generates response), and Fine-tuned Dialogue Model (base language model), addressing when and what to respond based on temporal context.
The model is trained using a multi-task learning objective on a large-scale synthetic dataset derived from event knowledge graphs and LLMs.
TIMER demonstrates improved performance over baselines in predicting appropriate response delays and generating time-specific, coherent dialogue.

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

AgentSynth: introduces a scalable pipeline for synthesizing computer-use tasks and trajectories by iteratively chaining LLM-generated subtasks, executed by a Task Executor, verified by a Task Verifier, revised by a Task Reviser, proposed as follow-ups by a Follow-up Task Proposer, and summarized into final tasks by a Task Summarizer, operating within an Environment guided by a Persona.
The pipeline leverages information asymmetry, generating simple subtasks that compose into challenging long-horizon tasks, enabling controllable difficulty.
AgentSynth generates over 6,000 diverse and realistic tasks at a low cost, providing a benchmark that reveals performance gaps in current LLM agents on multi-step computer tasks.

MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment

MAS-LitEval: introduces a multi-agent system for literary translation quality assessment, with Terminology Consistency Agent (Ensures key term consistency), Narrative Perspective Consistency Agent (Verifies narrative voice alignment), Stylistic Consistency Agent (Evaluates tone rhythm style), and Coordinator (Combines agent scores feedback).
The system employs specialized LLMs within agents to evaluate distinct dimensions of literary translation quality across segmented text chunks.
The Coordinator integrates agent evaluations into an Overall Translation Quality Score (OTQS) and a detailed report, ensuring global consistency.

FormGym: Doing Paperwork with Agents

Agent Framework with FieldFinder: introduces a system for end-to-end form completion using agents equipped with tools, including a novel field localization tool.
The system evaluates Vision-Language and GUI agents on the FormGym benchmark, which includes diverse forms, user profiles, and tasks.
The FieldFinder tool assists agents by predicting bounding boxes for input fields, significantly improving text placement accuracy.

Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

CVDP (Comprehensive Verilog Design Problems): introduces a benchmark dataset and infrastructure, with Datapoint, Prompt, Context, Reference Solution, Test Harness, Testbench, Benchmark Runner, Agent Under Test, Model Under Test, Mini Repo, EDA Tools, Docker, LLM Judge, Map Feature, and Report & Logs components, designed to evaluate large language models and agents on hardware design and verification tasks.
The benchmark includes 783 human-authored problems across 13 categories covering RTL generation, verification, debugging, and comprehension, provided in both Non-Agentic (single-turn) and Agentic (multi-turn, tool-using) formats.
The infrastructure supports Dockerized agents and test harnesses for realistic tool interaction using EDA tools, and includes an LLM judge for quality filtering of datapoints.

16th June 2025

LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning

LocationReasoner benchmark: introduces a benchmark to evaluate LLMs' real-world reasoning abilities in site selection, with Query Generation, Sandbox Environment, Datasets, In-house Tools, Execution Pathways, and Automated Verification components, evaluating Direct Code Generation, ReAct, and Reflexion approaches.
The benchmark uses curated datasets and in-house tools within a sandbox environment to test LLMs on constraint-based location search with automated verification.
Evaluation reveals current LLMs and agentic strategies struggle with complex real-world reasoning tasks, highlighting limitations in holistic and non-linear reasoning.

Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL): introduces an overview of methods for discovering temporal structure, formalized using the options framework including option policy, option termination function, option initiation function, high-level policy, and option model function, and discusses agent architectures like Hierarchical Components, Goal Conditioned, Feudal Architecture, and Single Network.
The paper surveys methods for temporal structure discovery categorized by learning from online experience, offline datasets, and foundation models.
HRL aims to improve exploration, credit assignment, transfer, and interpretability by leveraging temporal structure in sequential decision-making problems.

How Does LLM Reasoning Work for Code? A Survey and a Call to Action

Code Reasoning Taxonomy: introduces a classification of techniques for LLM reasoning on code tasks, including Code CoT Reasoning, Execution-based reasoning, Inference Scaling, and Agentic approaches.
The taxonomy details sub-techniques such as Plan-based CoT, Self-evaluation of execution behavior, Sampling, and Agentic Workflow.
The survey highlights how these distinct reasoning strategies and their components are applied and perform on various code-related benchmarks.

Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems

Spec2RTL-Agent: introduces an LLM-based multi-agent system for automated RTL code generation from complex specifications, including Iterative Understanding and Reasoning Module, Progressive Coding and Prompt Optimization Module, Adaptive Reflection Module, and Code Optimization and Conversion Module.
The system processes unstructured specification documents, refines code generation through multiple abstraction levels, and iteratively verifies outputs.
Spec2RTL-Agent demonstrates effectiveness in generating accurate RTL code with reduced human intervention compared to existing methods.

We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems

SAFEMCP: introduces a controlled framework to examine safety issues in MCP-powered agent systems, with Agent, Backbone LLM, MCP-Servers, Attack, Defense, Passive Defense, Active Defense, Evaluation, Scenario, and Metric components.
The framework simulates third-party attacks on LLM agents interacting with external services via the Model Context Protocol (MCP).
SAFEMCP provides tools for evaluating attack effectiveness and defense strategies using various scenarios and metrics.

CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation

CAMS: introduces a CityGPT-powered agentic framework for urban human mobility simulation, with MobExtractor (Extracts/synthesizes mobility patterns), GeoGenerator (Generates geospatial knowledge/trajectories), and TrajEnhancer (Enhances trajectories via DPO) components.
The framework leverages an urban foundation model (CityGPT) and agentic reasoning to generate realistic and plausible human mobility trajectories.
CAMS integrates urban spatial knowledge and multi-dimensional feedback for controllable and generalizable simulation without relying on external geospatial information.

--

Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

LA-CDM: introduces a hypothesis-driven uncertainty-aware language agent system for clinical decision making, comprising a Hypothesis Agent (forms hypothesis and confidence), a Decision Agent (decides action), Shared LLM Weights (underlying language model), and a Clinical Decision Making Environment (simulates patient interaction and provides feedback).
The system is trained using a hybrid paradigm combining supervised learning for hypothesis generation and reinforcement learning for uncertainty estimation and efficient action selection.
This approach models the iterative clinical process of forming hypotheses and requesting tests to converge towards a diagnosis, improving diagnostic performance and efficiency.

Towards Pervasive Distributed Agentic Generative Al - A State of The Art

Pervasive Distributed Agentic Generative AI: surveys the state of the art in LLM-based agents deployed in pervasive computing environments, detailing the Transformer Architecture, Short-Term Memory, Long-Term Memory, Hybrid Memory, Cloud Layer, Fog Layer, and Edge Layer components.
The paper examines the architecture of LLM agents, their deployment strategies across different infrastructure layers, and evaluation methods.
It highlights challenges in deploying these agents on resource-constrained pervasive devices and proposes the "Agent as a Tool" concept.

A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMS

Game-Theoretic Negotiation Framework: introduces a systematic approach for cross-cultural consensus among LLMs, with Cultural Agents, Guideline Sets, Guideline Weights, Utility Functions, Negotiation Process (PSRO-based), Meta Strategy Solver, Best Response Oracle, Regional Cultural Agents, and Consensus Evaluation Toolkit, designed to achieve fair and robust agreement.
The framework models consensus as a Nash Equilibrium and employs a PSRO-based negotiation process driven by utility functions balancing consistency, acceptance, and novelty.
Culturally aligned Regional Cultural Agents are constructed using survey data, and consensus outcomes are evaluated using perplexity-based acceptance and value self-consistency metrics.

Querying Large Automotive Software Models: Agentic vs. Direct LLM Approaches

Direct Full-Context Prompting: introduces a baseline approach where the LLM (Processes model) receives the entire Software Model File (Complete input data) along with Instructions (Guidance for LLM) and a Question (User query) to produce an Answer (LLM response).
Agent with File Tools (ReAct Architecture): presents an agentic approach where the LLM (Agent's reasoning engine) interacts with Software Model Files (Data source) via a Toolkit (External tool access) containing specific Tools (File interaction functions), communicating through Messages (Communication channel) and append observation (Tool output) to answer a User (Initiates query) Question (User query) with an Answer (Agent's response).
The study compares these two architectures for querying large automotive software models, evaluating their accuracy and token efficiency using various LLMs and a custom question dataset.

Leveraging In-Context Learning for Language Model Agents

ICL-DS (In-Context Learning with Demonstration Selection): introduces an approach for LLM agents that leverages in-context learning with dynamically selected demonstrations, including an LLM Agent (generates thoughts and actions), a Demonstration Pool (stores annotated trajectories and snippets), an Iterative Annotation Algorithm (automatically annotates tasks for demonstrations), a Demonstration Selector (retrieves relevant demonstrations), Prompt Construction (formats input for LLM), a ReAct Solver (executes tasks iteratively with reasoning), a Plan & Execute (PnE) Solver (plans subtasks and executes them), and an Environment (provides observations and executes actions).
The paper proposes an iterative annotation algorithm to automatically and efficiently create a demonstration pool of solution trajectories for agentic tasks, which are then used to improve LLM agent performance, reliability, and efficiency.
The research demonstrates that using task-level trajectory demonstrations and smaller step-level snippet demonstrations significantly boosts performance for LLM agents, enabling them to rival costlier trained agents.

MOTIVEBENCH: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?

MOTIVEBENCH: a comprehensive evaluation benchmark designed to assess the extent to which Large Language Models (LLMs) can replicate human-like motivations and behaviors, consisting of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation.
The benchmark addresses limitations of existing datasets by providing detailed scenarios, character profiles, and reasoning tasks that mimic real-world situations, thereby enabling a more accurate evaluation of LLMs' motivational intelligence.
MOTIVEBENCH aims to provide insights into LLMs' capabilities in understanding and exhibiting human-like motivational reasoning, highlighting areas where current models fall short and suggesting directions for future research in humanizing LLMs.

MAGIC: Multi-Agent Argumentation and Grammar Integrated Critiquer

MAGIC (Multi-Agent Argumentation and Grammar Integrated Critiquer): is a framework that utilizes multiple specialized agents to evaluate distinct writing aspects, aiming to predict holistic scores and produce detailed, rubric-aligned feedback for essays.
The framework employs an orchestrator to consolidate the outputs from individual agents, which focus on specific components of argumentative writing such as argument structure, grammar, vocabulary, and comprehension.
MAGIC aims to provide greater transparency, flexibility, and extensibility compared to monolithic automated essay scoring and feedback systems.

Scaling Test-time Compute for LLM Agents

ATTS (Agentic Test-Time Scaling): explores test-time scaling strategies for language agents, including parallel sampling, sequential revision, verifiers and merging, and diversifying rollouts.
The research systematically analyzes the impact of different design strategies on agent performance, finding that scaling test-time compute improves agent capabilities.
Key findings include the importance of knowing when to reflect, the superiority of list-wise methods for verification and merging, and the positive effect of diversified rollouts on agent performance.

15th June 2025

WEREWOLF-PLUS: AN UPDATE OF WEREWOLF GAME SETTING BASED ON DSGBENCH

WereWolf-Plus: introduces a multi-model, multi-dimensional, and multi-method benchmarking platform with Werewolf Simulation (Rule-compliant environment), LLM Agents (Flexible model assignment), Role Configuration (Customizable roles), Reasoning Enhancement (Experience-Retrieval Augmentation), and Evaluation Framework (Metrics for agents) for evaluating multi-agent strategic reasoning in the Werewolf game.
The platform provides a flexible and reliable environment supporting standard and customizable game setups with various roles and flexible LLM-role assignment.
WereWolf-Plus incorporates retrieval-augmented memory for contextual compression and reflection, and introduces comprehensive quantitative evaluation metrics for different roles and players.

Mastering Da Vinci Code: A Comparative Study of Transformer, LLM, and PPO-based Agents

Transformer-based Baseline Model: introduces, with Transformer architecture (Predicts opponent tiles), Input Representation (Current game state string), and Prediction Task (Predict hidden tile values) components, a model designed to predict opponent tiles based on the current public game state.
This baseline model utilizes a Transformer architecture but is limited by its restricted access to the full game history.
Its reasoning is primarily based on the current state snapshot and explicit negative constraints from prior guesses.

SoK: The Privacy Paradox of Large Language Models: Advancements, Privacy Risks, and Mitigation

Systematization of Knowledge (SoK) on LLM Privacy: introduces a comprehensive analysis of privacy challenges in LLMs, categorizing them across training data, user prompts, generated outputs, and LLM agents, with all LLM (Large Language Model), LLM System, Training Data, User Prompts, LLM Generated Output, LLM Agent System, Main LLM Agent, Secondary Agents, External Tools, Knowledge Base, Memory, User, Service Provider-components, where the paper evaluates existing mitigation techniques and identifies research gaps.
The paper highlights how LLMs' advanced capabilities and interactive nature introduce distinct privacy risks compared to traditional AI.
It discusses various mitigation strategies for each category of privacy challenge, noting their effectiveness and limitations.

SCISAGE: A MULTI-AGENT FRAMEWORK FOR HIGH-QUALITY SCIENTIFIC SURVEY GENERATION

SciSage (Scientific Sage): introduces a multi-agent framework with Interpreter (Understand/rewrite query), Organizer (Construct outline), Collector (Retrieve/rerank papers), Composer (Generate content), Refiner (Refine final document), and Reflector (Iterative hierarchical reflection) agents for high-quality scientific survey generation.
The framework employs a reflect-when-you-write paradigm, with the Reflector agent critically evaluating drafts at multiple levels.
SciSage coordinates specialized agents through query understanding, retrieval, content generation, and iterative hierarchical reflection processes.

14th June 2025

Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics

Synthetic Socratic Debates: introduces a system using Agent (LLM with persona), Persona (6-dimensional identity profile), Moderator (Manages debate turns), Multi-Turn Debate Framework (Simulates agent interactions), Persona Modeling (Assigns identity attributes), Decision Measures (Quantify moral judgments), Persuasion Measures (Evaluate debate effectiveness), Rhetorical Strategy Evaluation (Assesses persuasion modes), and LLM-as-a-judge (Evaluates rhetorical strategies) to simulate moral debates between AI agents with distinct personas.
The system investigates how persona traits influence moral decision-making and persuasive strategies in LLMs during multi-turn debates over real-world moral dilemmas.
The research reveals that political ideology and personality traits significantly shape initial moral stances and debate outcomes, impacting persuasive success and rhetorical strategies.

Towards Building General Purpose Embedding Models for Industry 4.0 Agents

Recommender Agent (ReAct Agent with Multi-Task Embedder Tool): introduces a framework for industrial asset maintenance agents, combining a ReAct Agent (Plans and reasons), a Multi-Task Embedder (Retrieves relevant items) used as a tool, and LLM Augmentation (Enriches queries) for improved context.
The framework leverages domain-specific embeddings fine-tuned on nine industrial tasks derived from ISO documents to enhance retrieval performance for complex queries.
Ablation studies demonstrate the effectiveness of LLM query augmentation and the importance of balanced positive/negative samples for training the embedding model.

The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being

Study Approach: investigates how human-chatbot relationships influence well-being by collecting Survey Data and Chat History Data, deriving Chatbot Companionship Measures, Interaction Intensity Measure, Self-Disclosure Measures, Human Social Support Measure, and Well-Being Measure, and analyzing them using LLM-based Text Analysis, Topic Modeling, Regression Analysis, and CFA.
The study uses a mixed-methods approach, triangulating self-report surveys, open-ended descriptions, and chat transcripts to understand chatbot companionship and its psychological associations.
Findings suggest that companionship-oriented chatbot use, especially with high intensity and self-disclosure, is associated with lower well-being, particularly for users with limited offline social support.

AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving

AgentOrchestra: introduces a hierarchical multi-agent framework with a Planning Agent (Central orchestrator) coordinating Specialized Sub-Agents (Domain-specific processing team), utilizing various tools, memory, and models for general-purpose task solving.
The framework features a two-tier architecture where the planning agent decomposes tasks and delegates sub-tasks to specialized agents equipped with domain-specific tools.
AgentOrchestra supports flexible orchestration, inter-agent communication, and adaptive role allocation, enabling robust performance on complex, multimodal tasks.

Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare

TAO (Tiered Agentic Oversight): introduces a hierarchical multi-agent framework for AI safety in healthcare, featuring an Agent Recruiter (Recruits expert agents), Agent Router (Routes query to tier), Tier 1 (Initial assessment/screening), Tier 2 (Specialized review/analysis), Tier 3 (Expert consultation/synthesis) of Medical Agents (Core assessment units), Case Escalation (Escalates to higher tiers), Intra-Tier Collaboration (Discussion within tier), Inter-Tier Collaboration (Dialogue between tiers), Final Decision Agent (Synthesizes final decision), and Human Oversight (Targeted human intervention).
The framework routes tasks based on complexity and agent roles, escalating complex or high-risk cases through tiers with automated inter- and intra-tier collaboration and role-playing.
TAO enhances AI safety through layered, automated supervision, demonstrating superior performance on healthcare safety benchmarks compared to single-agent and multi-agent baselines.

Topology-Assisted Spatio-Temporal Pattern Disentangling for Scalable MARL in Large-scale Autonomous Traffic Control

TGN-TMoE: introduces a novel MARL framework for large-scale traffic control, with Agent Observations (Raw graph data), MF Synchronization (Integrates mean-field information), Temporal Learning (Processes temporal features), Topological Processing (Extracts topological features), Spatial Learning (Processes spatial features), TMoE Module (Routes features to experts), Graph Pooling (Aggregates graph features), and Decision Making Module (Policy and value networks), designed to enhance environmental representation and agent coordination.
The framework integrates Dynamic Graph Neural Networks and Topological Data Analysis, employing a TSD-enhanced Mixture of Experts architecture for scalable multi-agent reinforcement learning.
TGN-TMoE leverages topological signatures to disentangle graph features for specialized processing within observation fusion and decision-making modules.

Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM

MAoP (Multiple Aspects of Planning): introduces a travel planning framework with a Strategist Model (Decomposes, routes aspects), Planning Model (Generates plan), Preprocessing Framework (Prepares input context including eliciting preferences, selecting POIs, and spatial optimization), Reward Model (Provides training signal), and Distilled Model (One-step inference).
The framework leverages a strategist for pre-planning and a planning model for generating travel itineraries based on wide-horizon thinking over multiple aspects.
A separate agent-based simulation framework, Travel-Sim, is proposed for evaluating the feasibility and personalization of the generated travel plans.

SHEETMIND: AN END-TO-END LLM-POWERED MULTI-AGENT FRAMEWORK FOR SPREADSHEET AUTOMATION

SheetMind: introduces a modular multi-agent framework for spreadsheet automation, with Manager Agent (Decomposes user instructions), Action Agent (Generates BNF commands), Reflection Agent (Validates actions, monitors effects), Front-End (User interface, executes actions), Back-End (Houses agent pipeline), and Spreadsheet (Target environment), enabling natural language interaction.
The framework decomposes complex instructions into subtasks, translates them into structured commands using BNF grammar, and validates actions through a feedback loop.
SheetMind integrates LLM-driven planning, structured execution, and agentic feedback to bridge the gap between natural language and spreadsheet functionalities.

INDOORWORLD : Integrating Physical Task Solving and Social Simulation in A Heterogeneous Multi-Agent Environment

INDOORWORLD: introduces a heterogeneous multi-agent environment integrating physical task solving and social simulation, with Agent (Autonomous entity), Perception (Processes observations), Memory (Stores information, history), Planning (Determines objectives, tasks), Action (Selects, executes actions), Task Prioritization (Encourages task focus), Environment (Simulated indoor space), Object (Physical entity), Location (Spatial area), and World State (Joint state variables) components, designed to simulate occupant behaviors in indoor spaces.
The environment features heterogeneous agents with multi-level profiles (role, action space, capability, knowledge) and human needs, interacting with objects and locations to modify the world state.
The framework supports both collaborative task-solving and autonomous social simulation sessions, providing a testbed for LLM-based multi-agent systems and potential applications in architectural design.

Cloud Infrastructure Management in the Age of AI Agents

AI Agents: introduces an envisioned agentic system architecture for automating cloud infrastructure management using LLM-powered agents, including User-agent Interface, Agent-cloud Interface, Multi-agent Orchestration, Memory, Reasoning, Tools, Planning, Guardrails, Actions, Cloud Vendors, and Cloud Gym components.
The proposed architecture utilizes different cloud interaction modalities (SDK, CLI, IaC, Web) and incorporates exploration/exploitation phases and guardrails for safety and reliability.
A preliminary study evaluates agents across modalities on VM management tasks, highlighting trade-offs in efficiency, success rate, and error handling.

13th June 2025

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework for code agents, featuring an Iterative Generation-Verification Loop where a Policy LLM generates code and test cases, External Tools execute them, and Tool Feedback provides results, guided by Turn-Level Reward Design and Outcome Reward, trained using Turn-Aware PPO on the Dataset (TACO).
The framework enables LLMs to autonomously generate and verify code through iterative refinement and tool interaction, improving performance and self-verification capabilities.
ReVeal's approach allows for effective test-time scaling into deeper inference regimes and pushes reasoning boundaries beyond the base model.

The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Behavior Gap Evaluation Framework: introduces a comprehensive framework to quantify the behavior gap between LLM Agents and Human Experts in task-oriented dialogs using Behavior Gap Metrics, Task Complexity Metrics, and Performance Metrics within a Teacher-Forcing Approach on various Datasets, revealing significant discrepancies that negatively impact LLM Agent performance.
The study utilizes LLM-based Classifiers and an LLM-based Evaluator to analyze specific behavioral dimensions like dialog acts, Tool usage, and knowledge integration, demonstrating that the gap widens with increasing task complexity.
Aligning LLM agent behavior closer to human strategies through Behavior Intervention significantly improves performance, particularly in complex tasks.

A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions

QUASAR: introduces a programming language for LLM agent code actions, with LLM Agent generating Python Subset code, Transpiler converting it to QUASAR Language, and QUASAR Interpreter executing it using Internal Rewrite Rules and External Rewrite Rule, managing External Calls tracked in the Execution Set, validated by User Approval Mechanism, and supporting Conformal Semantics with Abstract External Functions.
The QUASAR Language separates pure internal computation from external side effects, enabling automatic parallelization, dynamic security controls, and uncertainty quantification.
The approach leverages LLM proficiency in Python by transpiling a restricted subset to QUASAR for improved performance, security, and reliability compared to direct Python execution.

PRO-V: An Efficient Program Generation Multi-Agent System for Automatic RTL Verification

PRO-V: introduces an efficient program generation multi-agent system for automatic RTL verification, with Stimulus Generator (Generates input signals), Functional Model (Generates reference outputs), Self-Improvement (Selects, refines models), Validator (Verifies DUT with judge), Judge Agent (LLM for selection, validation), and Refinement Agent (LLM for refinement) components.
The system enhances verification accuracy and coverage through inference-time scaling via dual sampling and a self-improvement mechanism using a best-of-n selection strategy.
PRO-V integrates an LLM-as-a-judge into the validation process, leveraging rule-based static analysis converted to natural language for enhanced prompting and root cause analysis.

Robot Context Protocol (RCP): A Runtime-Agnostic Interface for Agent-Aware Robot Control

RCP (Robot Context Protocol): introduces, "a lightweight, middleware-agnostic communication protocol designed to abstract robotic system complexity", with Adapter Layer (Translates client interfaces), Transport Layer (Handles communication channels), Service Layer (Defines core operations), ROS2 Interface Layer (Maps to ROS2 runtime), and Status Query (Provides health and feedback), where "RCP provides a unified and semantically meaningful interface that decouples client-facing operations from backend implementations".
The protocol is structured in modular layers, including the Adapter Layer for diverse clients, the Transport Layer for communication channels (HTTP, WebSocket, SSE), the Service Layer for high-level operations (read, execute, write, subscribe), and the ROS2 Interface Layer for mapping to the underlying runtime.
RCP includes a Status Query component for real-time protocol health and command feedback, supporting robustness and operational transparency.

Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems): introduces a human-centered autonomy framework with Driver Agent (Analyzes external traffic), Psychologist Agent (Interprets occupant psychological state), and Coordinator Agent (Synthesizes inputs, decides behavior) interfacing with Perception module (Provides sensor data), Route planning module (Computes global route), Motion planning module (Generates behavior, trajectory), and Control module (Executes planned trajectory) for adaptive driving.
The framework leverages LLM-based agents to sense, interpret, and respond to both external traffic conditions and internal occupant states.
Operating in a closed-loop architecture, the system dynamically adjusts driving style and supports vehicle operation recovery.

Revealing Political Bias in LLMs through Structured Multi-Agent Debate

Structured Multi-Agent Debate Framework: introduces a system to investigate political bias in LLMs, with LLM Agents (Simulate participants) assigned Agent Personas (Political/gender identities) debating generated Debate Scenarios (Generated topics/questions) following a specific Debate Format (Structured rounds/statements), evaluated by an LLM-as-a-Judge (Evaluates attitudes) using Attitude Scoring (Quantifies agreement/disagreement) and a defined Speaking Order (Agent turn sequence).
The framework systematically varies LLM models, agent gender attributes, and debate formats to examine influences on political bias and attitude shifts.
Experiments reveal Republican agents shift towards neutral, gender influences attitudes, and echo chambers form with attitude intensification, particularly when gender is known.

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

SEC-bench: introduces an automated benchmarking framework for evaluating LLM agents on security engineering tasks, with Preprocessor (collects instances), Verifier (reproduces, verifies vulnerabilities), and Evaluator (transforms, formulates tasks) components.
The Verifier component employs a multi-agent scaffold including Manager, Builder, Exploiter, and Fixer agents to reproduce and validate vulnerabilities.
The framework automatically creates high-quality software vulnerability datasets with reproducible artifacts for evaluating LLM agent capabilities in tasks like proof-of-concept generation and vulnerability patching.

AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

AgentSense: introduces a virtual data generation pipeline using LLM agents in simulated home environments to create diverse sensor data for human activity recognition.
The pipeline involves LLMs generating personas, routines, and actions, which are then executed in an extended VirtualHome simulator equipped with virtual ambient sensors.
The generated virtual sensor data is used to pretrain HAR models, demonstrating improved performance, especially in low-resource settings, compared to training solely on real data.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

RACE and FACT evaluation frameworks: introduce two novel evaluation frameworks, with Judge LLM, Adaptive Criteria Generation, Reference-Based Scoring, Statement-URL Extraction, Support Judgment, Jina Reader API, and Citation Metrics Calculation components, designed to comprehensively assess Deep Research Agents.
RACE evaluates report generation quality using adaptive criteria and reference-based scoring, while FACT assesses information retrieval and citation trustworthiness.
These frameworks are part of the DeepResearch Bench, a benchmark of 100 PhD-level research tasks for evaluating LLM-based agents.

A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences

Hybrid Multi-Agent Prompting Approach: introduces a system using multi-agent collaboration for sentence simplification, including Agent 1 Sentence Simplifier, Agent 2 Semantic and Lexical Similarity Evaluator, Agent 3 Alternative Sentence Simplifier, and Comparator components.
The system processes complex sentences through a workflow where agents decompose, evaluate, and iteratively revise the output to preserve meaning while reducing complexity.
This multi-agent architecture demonstrates improved performance over single-agent methods for simplifying complex sentences in domains like video game design.

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework that enables code agents to engage in an iterative generation-verification loop using a single Policy LLM, guided by Input Prompt and Tool Feedback, structured as a Multi-turn Rollout producing an Output Rollout, optimized with Outcome Reward and Turn-Level Rewards via Turn-Aware PPO.
The framework alternates between Generation (producing code) and Verification (generating test cases and plans) stages, leveraging external Tools like Python Interpreters for execution.
This iterative process and dense reward structure allow the model to self-verify, refine outputs, and improve both generation and verification capabilities over multiple turns.

Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Agent-RLVR: introduces a framework for training software engineering agents using Reinforcement Learning from Verifiable Rewards (RLVR), incorporating agent guidance and environment rewards, with Policy, Environments, Trajectory, Evaluation, Environment Information, Agent Guidance, Guidance Generation, RLVR Data, Policy Update, and Instruct Tuning components.
The framework trains an agent Policy by having it interact with Environments, evaluating Trajectories via Evaluation, generating Environment Information from failures, and using Guidance Generation to create Agent Guidance.
Incorrect trajectories are reattempted with Agent Guidance, and the resulting RLVR Data is used for Policy Update via DPO and optional Instruct Tuning to improve agent performance.

Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning

LLM-powered Conversational Agent Models: introduces an LLM-powered agent delivering Problem-Solving Therapy (PST) for family caregivers, integrating Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA) using prompting techniques, Retrieval-Augmented Generation (RAG), and clinician-curated content.
The research evaluates four distinct configurations of this agent, comparing different LLMs (GPT-40, Llama 3) and combinations of in-context learning techniques (Few-shot, RAG) for their impact on perceived empathy and therapeutic alliance.
The models aim to provide empathetic and tailored mental health support by improving contextual understanding and generating personalized, actionable strategies for caregivers.

Secure API-Driven Research Automation to Accelerate Scientific Discovery

S3M (Secure Scientific Service Mesh): introduces, "Secure Scientific Service Mesh (Overall framework), Manages data streaming, Automates complex workflows, Manages compute jobs, Provides resource status, Retrieves environment info, Manages access tokens, Enables secure communication, Underlying service mesh platform, Python interface, Validates client interactions, Creates streaming objects, Deploys streaming clusters", a framework providing API-driven infrastructure for automated scientific discovery with integrated streaming, workflow orchestration, and fine-grained authorization.
The framework utilizes a service mesh architecture built on OpenShift and Istio to ensure modularity, scalability, and policy-driven security enforcement across computational services.
S3M offers a comprehensive set of APIs and an SDK to enable authenticated external systems and intelligent agents to securely provision resources, stream data, and trigger compute jobs dynamically.

Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems): introduces a human-centered autonomy framework with Psychologist Agent (Interprets occupant state/intent), Driver Agent (Perceives external traffic context), Coordinator Agent (Synthesizes inputs, decides behavior), Perception module (Provides sensor data), Route planning module (Plans/replans vehicle route), Motion planning module (Generates behaviors/trajectories), and Control Module (Executes low-level commands), enabling AVs to sense, interpret, and respond to external traffic and internal occupant states.
The framework uses three specialized foundation model agents in an agentic workflow to manage complex driving tasks and enable adaptive, interpretable, and collaborative driving.
PACE-ADS complements existing AV modules by operating at the high-level behavioral decision layer, personalizing riding experience, and supporting recovery from immobilization.

Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks

SRC (Self-Regulating Cars): introduces a physics-informed reinforcement learning protocol for automating traffic control in free-flow networks by having a central RL agent modulate vehicle speeds on super-segments based on traffic state observations, guided by a reward function.
The system utilizes Deep Q-Learning with a neural network to learn speed modulation policies, evaluated in a PTV Vissim simulation environment.
The approach aims to optimize network throughput and prevent congestion by coordinating individual self-regulating cars without requiring new physical infrastructure.

PE-MA: Parameter-Efficient Co-Evolution of Multi-Agent Systems

PE-MA (Parameter-Efficient Multi-Agent Co-Evolution): introduces, "a novel collaboration framework", with Frozen Backbone (Fixed feature extractor), Personalized Adapter (Adapts to local tasks/data), Shared Adapter (Shares knowledge across agents), Communication Mechanism (Exchanges and aggregates adapters), designed for efficient, scalable, and personalized co-evolution in multi-agent systems.
Each agent maintains a lightweight personalized adapter for agent-specific behavior and a shared adapter collaboratively optimized across neighboring agents.
The dual-adapter architecture balances global coordination with local adaptation, significantly reducing training and communication costs.

Interaction, Process, Infrastructure: A Unified Architecture for Human-Agent Collaboration

Unified Architecture for Human-Agent Collaboration: introduces a layered framework for human-agent collaboration with Interaction Layer (surface of shared understanding), Process Layer (collaborative core), and Infrastructure Layer (orchestration, execution, memory).
The Process Layer explicitly models goals, workflows, and progress, serving as connective tissue for human-agent alignment and coordination over time.
This modular architecture supports transparency, extensibility, and adaptive, goal-aligned collaboration by decoupling interaction, process logic, and computational foundation.

12th June 2025

AUTOMIND: Adaptive Knowledgeable Agent for Automated Data Science

AUTOMIND (Adaptive Knowledgeable Agent for Automated Data Science): introduces an adaptive, knowledgeable LLM-agent framework with a curated expert knowledge base, an agentic knowledgeable tree search algorithm, and a self-adaptive coding strategy.
The framework leverages the expert knowledge base via a retriever to ground the tree search, which explores solutions through drafting, improving, and debugging actions.
The self-adaptive coding strategy dynamically adjusts code generation based on task complexity, using either one-pass generation or stepwise decomposition with execution feedback.

Specification and Evaluation of Multi-Agent LLM Systems - Prototype and Cybersecurity Applications

Multi-Agent LLM System: introduces a system architecture and specification for multi-agent LLM applications, including a Client Application, Conversational User Interface, Agent Manager, Conversation Manager, Execution Engine, Agents, LLM Services, Host Execution Environment, and Agent Schema.
The system allows specifying agents with executable prompts, actions, and data, supporting prompting/reasoning techniques and conditional execution based on results.
The Agent Schema defines agent types, functions (execution, evaluation), and configurations for agents and LLMs, enabling systematic evaluation of LLMs and techniques in specific applications like cybersecurity.

From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

GPT ranking system: introduces a novel peer review mechanism using LLM agents for pairwise comparisons, aggregated by the Bradley-Terry model to derive a global ranking of submissions.
The system contrasts pairs of manuscripts to determine relative quality, moving away from traditional independent absolute scoring.
Empirical experiments demonstrate the system's potential to identify high-impact papers more effectively than rating-based methods, while also revealing biases against topic novelty and institutional diversity.

LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

Reflection Agent with Dedicated Evaluator: introduces a system for automatic code validation and refinement, including a Code Generator, Reflect module, and Evaluator using specific metrics.
The Evaluator utilizes Bidirectional Functionality Matching and Logic Representation metrics to assess generated Bash code quality without requiring reference code.
The system incorporates judgments and feedback from the evaluation metrics to refine the initial code snippet generated by the Code Generator.

Using Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation

LLM Tool-Calling Evaluation Framework: introduces a method to convert NL2SQL datasets into NL2API datasets for LLM tool-calling evaluation using a Data Generation Pipeline.
The framework includes generated API Collections (SLOT, SEL, REST) with varying characteristics, Invocable APIs for live interaction, and an Evaluation Set pairing natural language queries with ground-truth API sequences.
It evaluates the performance of various LLMs and ReACT Agents on these generated datasets to assess their tool-calling capabilities.

Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges

Reasoning Agentic RAG: introduces a paradigm integrating retrieval with model-driven reasoning and decision-making, encompassing Question, LLM/LRM, Reasoning, Retrieval, Retrieved Information, Distilled Information, and Answer components.
The framework categorizes approaches into predefined reasoning with fixed pipelines and agentic reasoning with autonomous tool orchestration.
This survey reviews techniques, architectural designs, reasoning strategies, and tool coordination within this paradigm to address industry challenges.

Provably Learning from Language Feedback

HELIX: introduces a framework for Learning from Language Feedback (LLF), including an LLM Policy, Reference Policy, Reward Mapping, set of Hypotheses, set of Actions, and Score Matrix.
The LLM Policy generates hypotheses and actions, the Reference Policy adds random actions, and the Reward Mapping scores actions under hypotheses to form a Score Matrix.
The algorithm uses the Score Matrix for decision-making, employing exploitation when consensus exists and exploration otherwise, potentially re-scoring with the Reference Policy.

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

SWE-Factory: introduces an automated pipeline for GitHub issue resolution benchmark construction, including Raw Issue Collection (Collects GitHub issue data), SWE-Builder (Automates environment setup), Grading Results (Grades test outcomes), and Fail2pass Validation (Validates fail-to-pass transition).
The SWE-Builder component is a multi-agent framework comprising a Repository Explorer (Collects repository setup information), Environment Manager (Generates Dockerfile), Test Manager (Generates test script), and Test Analyst (Validates environment, plans iterations), supported by an Evaluation Environment Memory Pool (Stores and reuses setups).
The pipeline automates environment construction, grading via exit codes, and fail2pass validation to reduce manual effort in creating large-scale, high-quality datasets.

Build the web for agents, not agents for the web

AWI (Agentic Web Interface): introduces a new web interface paradigm specifically designed for agents, featuring unified higher-level actions, compatibility with user interfaces, access control for agents, progressive information transfer, and agentic task queues.
This paradigm shift aims to overcome limitations of current human-designed web interfaces for web agents.
The paper establishes guiding principles for AWI design, emphasizing safety, efficiency, and standardization, and advocates for broad ML community involvement.

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Lightweight Sequential Monitoring Framework: introduces a defense against decomposition attacks by using an external monitor to evaluate the cumulative context of subtasks.
The monitor outputs a binary flag at each step to halt the LLM if harmful intent is detected based on the prompt history.
This framework outperforms single-input monitoring and is cost/latency efficient for mitigating decomposition attacks.

Execution Guided Line-by-Line Code Generation

EG-CFG (Execution-Guided Classifier-Free Guidance): introduces a novel approach for neural code generation that incorporates real-time execution signals into the language model generation process, utilizing a Large Language Model, Programming Task input, Initial Prompt, Candidate Generation via beam search, Executable Extraction via AST parsing, Execution Engine for running test cases, Execution Trace generation, Dynamic Signal aggregation, Dynamic Prompt construction, Classifier-Free Guidance for token generation, an Inference Loop for autoregressive generation, and Parameter Search.
The method dynamically incorporates execution signals as the model generates code line-by-line, guiding the generation process toward executable solutions.
EG-CFG achieves state-of-the-art performance on multiple code generation benchmarks by leveraging execution feedback and Classifier-Free Guidance.

Dynamic Epistemic Friction in Dialogue

Dynamic Epistemic Friction (DEF): introduces a formal model of dynamic epistemic friction in dialogue, operationalized within Dynamic Epistemic Logic and vector-based belief representations, using epistemic states, propositions, evidence, alignment, friction, QBank, EBank, FBank, an update function, friction coefficients, and friction equilibrium.
The model quantifies resistance encountered during belief updates by measuring vector similarity between agent beliefs and new information combined with evidence.
Empirical analysis on a situated collaborative task dataset demonstrates that the model effectively predicts participant belief updates by modeling this resistance.

OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

OPT-Agent: introduces a framework that emulates human reasoning for optimizing solutions by iteratively generating, validating, and improving solutions using historical feedback, with all Drafting, Improving, Debugging, Historical Information, Error Analysis, Validation, and Metrics components.
The framework's workflow involves generating a Draft solution, iteratively Improving valid solutions or Debugging buggy ones based on Error Analysis and Historical Information, with Validation and Metrics guiding the process.
OPT-Agent is evaluated on OPT-BENCH, a benchmark of machine learning and NP problems, to assess LLMs' iterative optimization capabilities.

Integrating Large Language Models into Text Animation: An Intelligent Editing System with Inline and Chat Interaction

Text Animation Editing System: introduces an LLM-aided system for text animation editing, featuring a Script Panel (Edit text, properties), Timeline Panel (Arrange, time clips), Chat Panel (Natural language commands), Resource Panel (Manage assets), Inspector Panel (Adjust properties), Preview Panel (Visualize edits), Inline Agent (Contextual suggestions), Chat Agent (Conversational task execution), LLM (Large Language Model) (AI engine), Semantic-Animation Mapping (Intent to action), and Script-Timeline Synchronization (Panels linked).
This system employs a dual-mode agent pipeline (Inline and Chat Agents) powered by an LLM for intelligent assistance and natural language interaction.
The system aims to lower creative barriers for non-professionals and enhance editing efficiency through seamless inline edits and chat-based interactions.

Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding

VLFly (Vision-Language Fly): introduces a novel VLN framework for UAVs, including an instruction encoding module, a goal retrieval module, a waypoint planning module, and action execution, designed for open-vocabulary goal understanding and continuous control.
The framework processes natural language instructions, retrieves a goal image, generates waypoints from egocentric observations, and executes continuous velocity commands.
VLFly achieves robust generalization and outperforms baselines in simulation and real-world UAV navigation tasks without task-specific fine-tuning.

SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

SDialog: introduces a Python toolkit for synthetic dialogue generation and analysis, with Turn (Single utterance), Event (Action/instruction), Dialog (Complete conversation structure), Persona (Character profile definition), PersonaAgent (Simulates agent role-playing Persona), BaseOrchestrator (Abstract control class), SimpleReflexOrchestrator (Triggers on condition), LengthOrchestrator (Controls dialogue length), ChangeMindOrchestrator (Simulates agent changing mind), SimpleResponseOrchestrator (Suggests responses by similarity), InstructionListOrchestrator (Provides sequence of instructions), DialogGenerator (Generates dialogue using LLM), PersonaDialogGenerator (Generates dialogue between Personas), Dataset Utilities (Work with external datasets), Serialization Utilities (Save/load dialogues), and Visualization Utilities (Analyze/visualize dialogues) components, designed for creating realistic, diverse, and controllable conversational data.
The toolkit provides abstractions for personas, orchestration, and scenario management, leveraging instruction-tuned Large Language Models for generation.
SDialog supports workflows like multi-agent simulation and scenario-driven generation, aiming to standardize synthetic data generation for reproducibility.

Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Multi-User Dialogue Data Construction Method: introduces a method to extend single-user dialogue datasets by incorporating a second user's utterances, utilizing a Single-User Dialogue Structure, Speech Act Type Identification, User2 Utterance Generation, User2 Utterance Validation, and an LLM to create a Multi-User Dialogue Structure for evaluating DST.
The method systematically generates and validates user2 utterances based on speech act theory to create a controlled multi-user setting for assessing LLM performance.
This approach enables evaluating LLMs on multi-user dialogue state tracking challenges with minimal dataset construction costs.

BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

BugGen: introduces a self-correcting multi-agent LLM pipeline, with Module Splitter (partitions RTL), Mutation Index (lists mutation types), Mutation Cache (stores history), Region Selector Agent (chooses region), Mutation Selector Agent (chooses mutation), Mutation Injector Agent (inserts mutation), Evaluation (validates bug), and Rollback/Retry (corrects failures), designed to autonomously generate, insert, and validate realistic functional bugs in RTL.
The pipeline leverages LLM agents in a closed-loop architecture with shared memory and iterative refinement to produce unique, syntactically valid, and functionally detectable bugs.
BugGen achieves high functional accuracy and throughput, outperforming existing methods and generating high-quality bug datasets suitable for training ML-based debugging models.

Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

LLM4PFA (LLM-Enhanced Path Feasibility Analysis): introduces an iterative path feasibility analysis framework for static bug detection, with Iterative Function Analysis, Feasible Path Constraint Extraction, Critical Path Conditional Branches Identification, Feasible Path Conditional Expression Extraction, Context-Aware Symbolic Range Reasoning, LLM Agent, Variable Symbolic Range Reasoning, Function Call Symbolic Range Reasoning, Function Retrieval Tool, Source Code Repository, Function Call Memory, Constraints Solving, SMT Query Script Generation, Script Template Generation, SMT Constraints Generation, Script Merging, Constraint Solver, Control-Flow Graph (CFG), and Initial States P.
The framework iteratively analyzes functions in a call trace, extracting and solving feasible path constraints using LLM agents for symbolic reasoning and a constraint solver.
LLM4PFA leverages LLM agents' self-planning and tool-usage capabilities for context-aware symbolic range reasoning and iteratively generates and solves SMT queries to minimize false positives.

WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

WGSR-Bench: introduces, "a wargame-based benchmark for large language models", with Environmental situational awareness, Opponent risk assessment, and Policy generation components, where "it systematically assesses strategic reasoning abilities using wargame scenarios".
The benchmark evaluates LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning within a high-complexity wargame environment.
It employs a structured cognitive framework (S-POE) and utilizes real adversarial wargame data for comprehensive evaluation and analysis.

11th June 2025

AUTONOMOUS COMPUTER VISION DEVELOPMENT WITH AGENTIC AI

Agentic AI approach: introduces an autonomous computer vision development system, with OpenManus Agent (Orchestrates task execution), Memory (Stores runtime state/context), Planning (Decomposes tasks, selects tools), Reasoning (Analyzes inputs, makes decisions), Self-Correction/Adaptation (Handles errors, refines plans), Tools (Execute Python, browser, files, shell), SimpleMind Framework (Executes computer vision tasks), Configurable Tools (Perform image processing, neural nets), Knowledge Graph (Defines SimpleMind workflow), Blackboard (Central working memory), SM-Learn (Trains neural network weights), SM-Think (Performs inference), User Prompt (Natural language task input), System Prompt (Guides LLM planning), Verifier (Checks YAML configuration), Tool Configuration File (KG) (YAML workflow definition), Tool Execution (Runs SimpleMind modules), where the system translates natural language prompts into SimpleMind workflows for medical image analysis.
The OpenManus agent leverages an LLM for planning and tool use, generating a YAML Knowledge Graph that configures SimpleMind's computer vision tools.
SimpleMind executes the planned workflow, utilizing its Blackboard for data flow and SM-Learn/SM-Think for training and inference on medical images.

AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution

AURA (Attribution Using Retrieval-Augmented Agents): introduces a multi-agent framework for cyber threat attribution, comprising input processing, query rewriting, semantic retrieval, decision making, external search, attribution generation, conversational memory, and a knowledge base.
The framework processes diverse threat data via collaborative agents, integrating Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) for knowledge-enhanced reasoning and interpretable attribution.
AURA generates transparent, evidence-backed attribution decisions by tracing reasoning to contextual evidence and providing natural language justifications.

Disclosure Audits for LLM Agents

CMPL (Conversational Manipulation for Privacy Leakage): introduces an automated auditing framework for conversational privacy risks in LLM agents, featuring an Application Agent A (LLM agent being audited), an Adversary U (LLM agent attempting leakage), and an Auditor D (LLM agent detecting leakage) interacting within a Conversation Loop (iterative interaction process) based on a Scenario Description σ (public context), Information Subject Profile I (private data), Privacy Directive ψ (disclosure rules), and Task Description T (agent A's goal).
The Adversary U employs a Strategist Se (adversary planning module) and Prompt Generator Ge (adversary query module) to manipulate the Conversation History H (dialogue turns) and may use a Side-channel Predictor Pe (adversary inference module) to make a Prediction (adversary guess) with Confidence kt (adversary prediction score).
The Auditor D monitors the Conversation History H and uses an Entail Function (auditor explicit leakage detector) and the adversary's Prediction and Confidence to produce an Indicator zt (auditor leakage signal) when explicit or implicit leakage is detected, while both agents utilize Memory (stores/summarizes history) to maintain state.

Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Chat-of-Thought: introduces a collaborative multi-agent system for domain-specific information generation, featuring LLM-based Agents, Context Discovery, Multi-Round Chain of Interactions, Template-driven Routing, and Quality Check.
The system employs specialized LLM-based Agents with defined roles and state to engage in iterative discussions guided by templates and dynamic assignment.
It leverages diverse input sources, question/answer banks, and various learning methods to generate and refine domain-specific knowledge like FMEA documents.

A quantum semantic framework for natural language processing

Quantum Semantic Framework: introduces a non-classical approach to natural language processing, modeling semantic meaning as observer-dependent and contextually actualized through the interaction of a Semantic Expression (symbol affording interpretations) and an Interpretive Agent (observer) via an Interpretive Observable (semantic probe operator).
The framework posits that meaning is not intrinsic but emerges dynamically, influenced by the agent's Semantic Memory (agent internal state) and Context (situational factors), with interpretation dynamics governed by a Semantic Hamiltonian (interpretation dynamics).
A Semantic Bell Test (experimental method) using LLM Agents (computational observers) configured with Personas (agent configurations) and presented with Ambiguous Word Pairs (stimuli) demonstrates non-classical contextuality in interpretation, supporting the framework's premise.

AI Agent Behavioral Science

AI Agent Behavioral Science: introduces, "AI Agents (autonomous systems)/Memory (stores history)/Planning (strategizing actions)/Tool Use (interacts with tools)/Action Modules (executes decisions)/Intrinsic Attributes (internal traits)/Environmental Constraints (external structures)/Behavioral Feedback (adaptation mechanism)/Ability (foundational competence)/Motivation (drive from feedback)/Trigger (initiating signals)", a paradigm for studying AI agents as behavioral entities in context, emphasizing systematic observation, intervention design, and theory-guided interpretation.
This perspective focuses on understanding how AI agent behavior emerges and adapts through the interplay of internal factors, environmental context, and interaction feedback.
The paper systematizes research on individual, multi-agent, and human-agent interactions and positions this behavioral science approach as essential for responsible AI.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

V-JEPA 2 (Self-Supervised Video Model): introduces a self-supervised approach combining internet video data and robot interaction data, with V-JEPA 2 Encoder (Extracts video representations), V-JEPA 2 Predictor (Predicts masked representations), V-JEPA 2 EMA Encoder (Target for prediction), V-JEPA 2-AC Frozen Encoder (Provides learned representations), V-JEPA 2-AC Action-Conditioned Predictor (Predicts future state representations), MLLM Projector Module (Maps visual to LLM), and MLLM LLM Backbone (Language model).
The framework pre-trains V-JEPA 2 on internet video, then post-trains V-JEPA 2-AC on robot data for planning, and aligns V-JEPA 2 with an LLM for video question answering.
V-JEPA 2 demonstrates strong performance on motion understanding, action anticipation, video question answering, and enables zero-shot robot manipulation via planning.

Patterns of Patterns III

PLACARD: introduces a methodology combining PAR, CLA, and DPL for collective learning and design, extended to pattern-competent AI agents within Multi-Agent Systems, utilizing Language Model Substrates and various agent types.
The approach structures pattern use via an A/B/C catalogue and proposes AI agent pattern types (Interactional, Cognitive, Infrastructural) along with specific candidate patterns for agents and their environments.
Different agent roles, including Code-Writing & Execution Agents, Pattern-Aware Dialogue Agents, Pattern-Reflective Meta-Agents, and Multi-Agent Institution Designers, interact within the MAS environment, grounded by a Real-World Interface layer.

Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

VRS (Verbalized Rejection Sampling): introduces, "a natural-language adaptation of classical rejection sampling", with LLM (Performs accept/reject decision), Target Distribution (Desired sample distribution), Proposal Distribution (Source of candidate samples), Candidate Sample (Sample from proposal), Verbalized Prompt (LLM input with descriptions), Accept/Reject Decision (LLM binary output), and Sampling Loop (Generates candidates, repeats), which prompts an LLM to reason about and accept or reject proposed samples from a proposal distribution to generate samples from a target distribution.
The framework verbalizes the target distribution, proposal distribution, and a candidate sample into a prompt for the LLM, which acts as a black-box decision engine.
The external sampling loop generates candidate samples and repeats the LLM decision process until the required number of accepted samples is collected.

SRLAgent: Enhancing Self-Regulated Learning Skills through Gamification and LLM Assistance

SRLAgent: introduces an LLM-assisted system that fosters self-regulated learning skills through gamification and adaptive support, with Game Environment (Minecraft 3D world), Learning Management System (Manages learning elements), Task System (Manages hierarchical tasks), Agent System (AI-powered support), LLM (Large Language Model), Planning Agent (Supports forethought phase), SubTask Monitor (Tracks performance), SubTask Tutor Agent System (Provides tutoring), Quiz Agent (Supports quizzes), Review Agent (Guides analysis), Chatting Agent (Facilitates discussions), Writing Agent (Supports writing), Reflection Agent (Guides reflection phase), Task State (Current task status), SRL Phase (Current SRL stage), Task Learning Views (User interface views), Learning Subtasks Content (Educational materials/activities), Learning Feedback (Agent-provided feedback), Tutor Feedback (Tutor agent feedback), Prompt Configurations (Agent prompt templates), where it guides students through SRL phases using gamification and LLM-powered agents in a game-based environment.
The system is grounded in Zimmerman's three-phase SRL framework, enabling goal-setting, strategy execution, and self-reflection within an interactive game environment.
SRLAgent offers real-time feedback and scaffolding powered by LLMs to support students' independent study efforts.

PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

PersonaLens: introduces a benchmark for evaluating personalization in task-oriented conversational AI assistants, with diverse user profiles, tasks with situational contexts, and two LLM-based agents (User Agent and Judge Agent) to simulate interactions and evaluate performance.
The benchmark features 1,500 user profiles, 111 tasks across 20 domains, and LLM-powered agents for scalable, automated evaluation.
PersonaLens assesses personalization, task completion, and dialogue quality in multi-turn interactions, revealing insights into current LLM assistants' capabilities.

INTELLIGENT DESIGN 4.0: PARADIGM EVOLUTION TOWARD THE AGENTIC AI ERA

ID 4.0 (Intelligent Design 4.0): introduces a multi-agent-based paradigm for engineering design automation, composed of stage-level agents (interprets inputs/decomposes tasks, explores early solutions/ideation, translates concepts/coordinates modeling, refines geometry/produces models, applies optimization/fine-tunes) and functional agents (retrieves external knowledge, accesses databases, infers user intent/context, facilitates divergent thinking, synthesizes 2D/3D geometry, conducts multi-physics simulations, verifies compliance) operating within a shared information environment (supports coordination/learning).
This framework envisions autonomous, task-specialized AI agents coordinating via orchestrated design workflows to support complex, end-to-end design processes.
Agents interact with human designers and external tools, leveraging shared memory for cross-agent coordination and cumulative learning throughout the design process.

Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring

CAMA (Cognitive Architecture for Monitoring Agent): introduces a cognitive architecture for ML monitoring, with Semantic Memory (Stores reference data), Working Memory (Holds current context), Episodic Memory (Retains past instances), Procedural Memory (Stores agent code), Decision Procedure (Feature engineering approach), LLM (Reasoning engine), and Agent code (Includes prompts/chains), designed to enhance interpretability and actionability of monitoring outputs.
The Decision Procedure implements a three-step feature engineering-inspired approach: Refactor, Break Down, and Compile, to process monitoring data.
This architecture leverages structured memory and feature engineering principles to provide robust, interpretable, and actionable insights for ML model monitoring.

Intent Factored Generation: Unleashing the Diversity in Your Language Model

IFG (Intent Factored Generation): introduces a two-stage sampling process including Prompt, LLM Internal, Intent, Phrasing, and Response.
The process samples a semantically dense Intent first, then the final Response conditioned on the Prompt and Intent, allowing independent temperature control for diversity and coherence.
IFG can be implemented via Few-shot Prompting or Finetuning on intent-annotated data, with IFG-prompting encouraging granular steps in reasoning tasks.

Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives

LLM-based Agent System: introduces, this survey, with LLM-based Agent (core system participant), Multi-Agent System (multiple interacting agents), Value Principles (hierarchical ethical norms), Value Alignment Evaluation (assessing adherence to values), and Value Coordination (managing values in multi-agents), where this survey reviews application-driven value alignment in agentic AI systems.
The paper integrates AI advancements driven by large models with demands of social governance, covering value principles, application scenarios, and evaluation methods.
It systematically examines datasets and methods for value alignment assessment and explores value coordination among multiple agents within agent systems.

DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy

DipLLM (Fine-Tuned LLM-Based Agent): introduces DipLLM, a fine-tuned LLM-based agent for strategic decision-making in Diplomacy, with Llama 3 8B (LLM backbone), Autoregressive Factorization (decomposes actions), TextDiplomacy (state to text, text to actions), Equilibrium Search (piKL-Hedge) (generates Q-values), Human IL Model (DipNet) (collects raw data), Environment (Diplomacy) (game simulation), Loss Function (aligns policy), LoRA (parameter adaptation), and Data (collected game data).
DipLLM leverages autoregressive factorization to simplify complex multi-unit action assignment into sequential unit-level decisions.
The agent is fine-tuned on a small dataset using a designed loss function to align its policy with an equilibrium objective, outperforming state-of-the-art models like Cicero with significantly less data.

Effective Red-Teaming of Policy-Adherent Agents

CRAFT (Constraint-aware Red-teaming with Adversarial Framing and Tactics): introduces a multi-agent red-teaming system with Policy Analyzer (Extracts relevant policy), Deception Planner (Plans attack strategies), Avoidance Advisor (Plans what to avoid), Dialogue Executor (Executes interaction dialogue), Conversation Memory (Dialogue history), Policy-Adherent Agent (Target system), Policy (Rules for target), and User Request (Initial user input), designed to expose vulnerabilities in policy-constrained LLM-based agents through strategic, multi-step adversarial planning.
The system leverages policy knowledge, strategic reasoning, and pre-execution planning to generate policy-aware persuasive strategies that undermine policy adherence in customer service scenarios.
CRAFT achieves significantly higher attack success rates compared to generic jailbreak methods and cooperative user simulations, highlighting the need for stronger safeguards against malicious user behavior.

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Multi-Agent Framework: introduces ReasonMed, a large-scale medical reasoning dataset generated and refined using a multi-agent system for initial path generation, followed by verification, ranking, summarization, error correction, and quality assessment.
The framework generates 370k high-quality medical reasoning examples by distilling 1.7 million initial paths from multiple large language models through a rigorous multi-stage verification and refinement pipeline.
Leveraging the generated dataset, the authors train ReasonMed-7B and find that combining detailed chain-of-thought reasoning with concise answer summaries yields the most effective fine-tuning strategy for medical question answering.

A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy

LLM-HAS (LLM-based Human-Agent Systems): introduces implementation guidelines with Initial Setup (define environment, roles, interaction), Human Data (acquire, process, use human data), Model Engineering (iterative development, feedback, learning, optimization), Post-Deployment (monitor, maintain alignment, adapt), and Evaluation (assess effectiveness, safety, experience) components.
The framework advocates for collaborative AI-human partnerships, prioritizing human involvement for guidance, control, and enhanced system trustworthiness and adaptability.
Progress in AI is measured by how well systems work with humans, enhancing human capabilities through partnership rather than pursuing full autonomy.

Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Multi-Agent Language Models: introduces integrating Language Model (LM) (Recommends/generates actions) into Reinforcement Learning (RL) Agent (Learns decision policy) loops interacting with an Environment (Simulates game world), utilizing Dataset (Human gameplay examples) and Replay Buffers (Stores game transitions) for training via Distillation Loss (Transfers LM knowledge) and TD Loss (Reinforcement learning update), employing Encoders (GRU) (Encode text inputs), Decoder (Combines encoded features), and MLP (Predicts action values).
The approach explores LM-in-the-Loop for action recommendation in text games and LM as Multi-Agent in Hanabi, showing improved performance and accelerated convergence compared to baselines.
Key findings highlight the importance of careful transition selection for LM training and demonstrate the potential for distilling language model knowledge into reinforcement learning agents.

10th June 2025

Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation

ANN (Agentic Neural Network): introduces a framework conceptualizing multi-agent collaboration as a layered neural network architecture, including Agent (Node), Layer (Agent Team), Agent Pipeline, Dynamic Routing/Team Selection, Aggregation Function, Forward Pass, Backward Pass/Optimization, Textual Gradient, Global Optimization, Local/Layerwise Optimization, Momentum, Validation, Memory/Trajectory, LLM Backbone, Prompt, and Tool components.
The framework employs a two-phase optimization strategy: a forward pass for dynamic team selection and a backward pass for iterative refinement using textual gradients.
This approach enables agents to self-evolve their roles, prompts, and coordination, dynamically reconfiguring teams and strategies based on performance feedback.

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

UTBoost: introduces a framework for augmenting test cases using intramorphic testing, including the UTBoost Workflow (Orchestrates testing process), UTGenerator (Generates augmented test cases), and Intramorphic Testing (Establishes test oracle).
The UTGenerator component utilizes an LLM and a multi-level localization process (file, function/class, line) to identify relevant code areas before generating new test cases and their dependencies.
UTBoost enhances the evaluation of coding agents on benchmarks like SWE-Bench by identifying insufficient test cases and erroneous patches, leading to more reliable results and leaderboard updates.

GUIROBOTRON-SPEECH: TOWARDS AUTOMATED GUI AGENTS BASED ON SPEECH INSTRUCTIONS

GUIRoboTron-Speech: introduces an end-to-end autonomous GUI agent accepting speech instructions and screenshots, with Vision Encoder (Processes GUI screenshot), Audio Encoder (Processes speech instruction), Large Language Model (Processes inputs, predicts action), Grounding Stage (Trains visual understanding), and Planning Stage (Trains reasoning and planning) components, designed to predict GUI actions from multimodal input.
The approach leverages a progressive training framework with grounding and planning stages to develop capabilities in understanding GUI elements and task execution.
Mixed-instruction training is employed during the grounding stage to mitigate modality imbalance from pre-trained foundation models.

Agent-based Condition Monitoring Assistance with Multimodal Industrial Database Retrieval Augmented Generation

MindRAG (Multimodal industrial database Retrieval-Augmented Generation): introduces an agent-based condition monitoring assistance framework with LLMs, Agents, a Multimodal & semi-structured annotated machine graph Vector Store, Multimodal RAG techniques for Retrieval, Generation, Tools, and Knowledge Bases.
The framework integrates LLM-based reasoning agents (Main thinker, CM analyst, Maintenance scheduler, Evaluation agent) with a novel vector store structure designed for industrial condition monitoring data.
MindRAG leverages multimodal retrieval and generative capabilities, supported by custom tools and knowledge bases, to provide decision support and explainable interfaces for condition monitoring analysts.

Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search

LWM-Planner (LLM-based World Model Planning Agent): introduces an LLM agent framework that enhances planning via in-context learning using a Fact Extractor (Extracts atomic facts from experience), a Planner LLM (Performs lookahead search planning), Atomic Facts (Learned knowledge base), and Interaction History (Recent observation-action memory).
The agent extracts task-critical atomic facts from interaction trajectories to dynamically augment prompts for LLM components responsible for action proposal, world model simulation, and value estimation.
Planning involves a depth-limited lookahead search where the Planner LLM simulates trajectories and evaluates outcomes guided by accumulated facts and history.

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

ALE-Bench: introduces a benchmark for long-horizon objective-driven algorithm engineering, featuring Problem (Provides statement/metadata), Scorer (Evaluates solution code), Visualizer (Displays execution results), Test Run (Executes code in sandbox), Code Sandbox (Replicates execution environment), Leaderboard (Ranks submissions/calculates metrics), and Session (Orchestrates AI interaction/evaluation).
The benchmark provides a software framework simulating competitive programming contests, allowing AI systems to iteratively refine solutions using test-run feedback and visualizations.
ALE-Bench quantifies AI performance on computationally hard optimization problems from AtCoder Heuristic Contests, enabling comparison against human experts and fostering long-horizon problem-solving research.

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

VIKI-R (Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning): introduces a two-stage framework that fine-tunes a pretrained vision-language model using Chain-of-Thought demonstrations and reinforcement learning, evaluated on the VIKI-Bench benchmark.
The framework addresses embodied multi-agent cooperation across three hierarchical levels: agent activation, task planning, and trajectory perception, utilizing diverse robot embodiments and multi-view visual observations.
The approach significantly outperforms baselines, demonstrating enhanced visual reasoning and compositional cooperation patterns among heterogeneous agents in complex environments.

Design Patterns for Securing LLM Agents against Prompt Injections

Design Patterns: introduces, with Action-Selector Pattern (Selects predefined actions), Plan-Then-Execute Pattern (Defines plan, executes actions), LLM Map-Reduce Pattern (Dispatches isolated sub-agents), Dual LLM Pattern (Privileged/quarantined LLMs), Code-Then-Execute Pattern (Writes formal program), and Context-Minimization pattern (Removes prompt from context), a set of principled design patterns for building AI agents resistant to prompt injection attacks.
These patterns impose intentional constraints on LLM agents, limiting their ability to perform arbitrary tasks and preventing untrusted input from triggering consequential actions.
The paper analyzes the trade-offs of these patterns in terms of utility and security and illustrates their application through case studies of LLM agent applications.

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Evaluation Frameworks for LLM-based Data Science AI Systems: includes LLM/Agent (AI system being evaluated), Environment (Where agent operates), Tools (Capabilities like code execution), Evaluation System (Measures performance), Data (Input for tasks), Task Description (Instructions for agent), and User (Interacts with agent).
These frameworks assess AI systems, ranging from assistants to autonomous agents, on various data science activities using diverse metrics and setups.
Evaluation often involves the agent interacting with data and tools within an environment, with performance judged by an automated or human-assisted evaluation system against task descriptions and data.

Improved LLM Agents for Financial Document Question Answering

Multi-Agent Framework: introduces a system for financial document question answering with Analyst, Critic, Improved Critic, and Calculator agents.
This framework utilizes multiple LLM-based agents to improve numerical reasoning on tabular and textual financial data.
The proposed calculator agent demonstrates improved performance and safety compared to the previous state-of-the-art approach for this task.

Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMS

End-to-End DST Model: introduces an end-to-end dialogue state tracking system using a pretrained speech encoder, a small connector module, a pretrained LLM, and dialogue history.
The system processes speech input and dialogue history to directly output a JSON string representing the dialogue state.
The approach bridges speech and language model representation spaces through a two-stage training scheme for ASR pre-training and joint ASR-DST finetuning.

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

MasHost: introduces a reinforcement learning-based framework for autonomous multi-agent system construction, with Multi-agent System (Mas), LLM Agent, Role Pool, Interaction Pathway, Markov Decision Process (MDP), State, Action Space, Node-level Actions, Edge-level Actions, Policy Function, Node-level Policy (πθ), Edge-level Policy (πφ), Reward Function, Joint Probabilistic Space Sampling (JPSS), Hierarchical Relative Policy Optimization (HRPO), Group-relative Advantage, Action-wise Absolute Reward, Triple Objective, Query, State List, Selected Agents, Global Messages, Summarizer Agent, and EXIT Node components.
The framework models Mas construction as an MDP, employing JPSS for joint node and edge sampling and HRPO for multi-objective optimization towards performance, efficiency, and rationality.
MasHost enables autonomous Mas graph construction and role selection from a full-scale space, guided by a hierarchical reward structure combining group-relative and action-wise rewards.

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

CAF-I (Collaborative Agent Framework for Irony): introduces a multi-agent framework for irony detection with Context, Semantic, Rhetoric, Decision, and Refinement Evaluator Agents.
CAF-I performs multi-perspective analysis and interactive collaborative optimization to improve detection accuracy and interpretability.
The framework achieves state-of-the-art zero-shot performance by simulating human-like multi-perspective analysis.

TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

TACTIC (Translation Agents with Cognitive-Theoretic Interactive Collaboration): introduces a multi-agent translation framework inspired by cognitive translation studies, including DraftAgent (Generates multiple drafts), RefinementAgent (Synthesizes drafts), EvaluationAgent (Evaluates translation quality), ScoreAgent (Scores translation quality), ContextAgent (Provides contextual information), and ResearchAgent (Gathers external knowledge).
The framework comprises six distinct agents mirroring human translation processes, operating in base and complex workflows for iterative refinement.
TACTIC leverages LLMs to simulate cognitive functions like strategic variation, processing, and contextual cognition for high-quality translation.

Reinforce LLM Reasoning through Multi-Agent Reflection

DPSDP (Direct Policy Search by Dynamic Programming): introduces a reinforcement learning algorithm to train an actor-critic LLM system for multi-turn reasoning refinement using direct preference learning on self-generated data, incorporating Actor, Critic, and DPO.
The approach models the multi-turn refinement process as a Markov Decision Process, where the Actor generates responses and the Critic provides feedback, iteratively improving answers.
DPSDP leverages DPO for training the agents, demonstrating improved performance on reasoning benchmarks through collaborative refinement.

Your Agent Can Defend Itself against Backdoor Attacks

ReAgent (Reverse and Reflective Agent): introduces a novel defense against backdoor attacks on LLM-based agents, utilizing Execution-Level Detection, Planning-Level Detection, Agent's Thoughts, Agent's Actions, Agent's Thought Trajectory, User's Instruction, and Reconstructed Instruction components to detect inconsistencies.
The defense employs a two-level approach, verifying consistency between agent thoughts and actions at the execution level and between the user instruction and reconstructed instruction from the thought trajectory at the planning level.
ReAgent leverages the compromised agent's own capabilities for self-defense and provides chain-of-thought explanations for transparency.

Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models

R-Search: introduces a single-LLM framework that unifies multi-step planning, multi-source search execution, and answer synthesis within one coherent inference process, utilizing Policy LLM, , , , , NL-DAG Parser, DAG Validator, Topological Sort, Search Execution, Search Tools, ReFT, GRPO Optimizer, Reward Function, and Reference LLM.
The framework structures output into four components: reasoning traces, NL-DAG search plans, retrieved results, and synthesized answers, enabling integrated reasoning and multi-source search execution.
A specialized Reinforcement Fine-Tuning method based on GRPO is used with a multi-component reward function to optimize answer correctness, structural validity, and format adherence.

TrajFlow: Multi-modal Motion Prediction via Flow Matching

TrajFlow: introduces a flow matching-based framework for multi-modal motion prediction, utilizing a Context Encoder (encodes scene context), Flow Matching Decoder (decodes noisy trajectory to predicted trajectories and scores), Prediction Heads (predicts trajectory, classification, and ranking scores), ODE Solver (solves ODEs for inference), NMS (filters predicted trajectories), Loss Functions (optimizes model parameters), and Self-Conditioning (mitigates overfitting during training).
The framework predicts multiple plausible future trajectories in a single pass by learning to map noise vectors to data distributions via ordinary differential equations.
A Plackett-Luce distribution-based ranking loss and a self-conditioning training strategy are employed to improve uncertainty estimation and generalization.

ORFS-agent: Tool-Using Agents for Chip Design Optimization

ORFS-agent: introduces an LLM-based iterative optimization agent for chip design parameter tuning, integrating an LLM, ORFS flow, METRICS2.1 metrics, GLOBALCONTEXT state, Toolbox external tools (INSPECT, OPTIMIZE, AGGLOM), Inputs, Outputs, and an Iteration Loop.
The agent executes the ORFS flow in parallel runs, gathers METRICS2.1 data, analyzes and proposes parameters using the Toolbox, and updates design files iteratively.
Guided by user Inputs (PDK, Verilog, Prompts), the agent maintains state in GLOBALCONTEXT to optimize design metrics and constraints, producing optimized Outputs (Config, SDC files).

Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Software Engineering Agents (SWE agents): introduces a systematic study of SWE agent behavior using execution traces, focusing on bug localization, patch generation, and reproduction test generation components.
The study analyzes agent effectiveness in fixing issues, generating tests, and comparing agent-generated patches to human-written ones.
Findings reveal agents struggle with complex issues, benefit from bug localization for test generation, and often produce localized edits compared to human refactorings.

9th June 2025

Scaling Laws of Motion Forecasting and Planning A Technical Report

MotionLM: introduces an encoder-decoder autoregressive transformer model with Scene Encoder (Processes scene data) and Motion Decoder (Generates motion tokens) components for joint motion forecasting and planning.
The Scene Encoder uses an Early fusion network (Scene encoder backbone) to process multimodal inputs, while the Motion Decoder employs Cross-attention (Decoder attends encoder) and Flattened agent-time self-attention (Single pass attention) to generate Discrete motion tokens (Represent trajectories).
This architecture enables studying scaling laws for performance improvements with increased compute, data, and model size in autonomous driving tasks.

From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

AR-Bench (Active Reasoning Benchmark): introduces, "a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills", with Player (LLM under evaluation), Judge (Provides answers/feedback), Problem (Initial incomplete information), Interaction Rounds (Multi-turn Q&A), Solution (Final derived answer), where "AR-Bench evaluates LLMs on tasks requiring iterative questioning and information gathering under incomplete information."
The benchmark simulates multi-round conversations between the LLM player and NPC judges providing answers or feedback based on the puzzle's underlying truth.
AR-Bench highlights LLMs' difficulties in active reasoning, particularly in generating high-quality questions and effectively leveraging acquired information to solve problems.

From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium

ECON (Efficient Coordination via Nash Equilibrium): introduces a hierarchical reinforcement-learning paradigm with Coordinator LLM (Generates strategy, aggregates answers), Execution LLM (Produces answers based on strategy/belief), Individual Belief Network (Maps history/observation to belief/action), Belief Encoder (Aggregates belief states), Centralized Mixing Network (Coordinates beliefs, computes global Q), and Reward Design (Provides optimization feedback), recasting multi-LLM coordination as an incomplete-information game.
The framework replaces explicit inter-agent communication with belief-based coordination, where Execution LLMs optimize responses based on beliefs about co-agents to achieve a Bayesian Nash Equilibrium.
ECON demonstrates improved performance and scalability compared to existing multi-agent debate methods by reducing communication overhead and ensuring convergence.

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

EconWebArena: introduces a benchmark for evaluating autonomous agents on economic tasks, featuring an AI Agent interacting with a Real-World Web Environment via Observation and Action to answer a Question and provide an Answer.
The benchmark comprises 360 tasks on 82 authoritative websites, requiring agents to navigate, interpret content, interact with interfaces, and extract precise data.
The framework utilizes structured observations like AXTree and screenshots, and supports fine-grained browser control actions for realistic web interaction.

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

SOP-Bench: introduces a benchmark generation workflow with User Inputs, LLM, Human Review and Correction, Data Schema Generation, SOP Document Generation, Dataset Generation, API & Tool Specification Generation, and Tools Code Generation, designed to evaluate LLM agents on complex industrial SOPs using SOP, Task, ToolSpecs, Mock APIs, and Dataset.
The benchmark generation workflow creates realistic SOPs, associated data, and tools, incorporating complexity, ambiguity, and interdependencies.
The benchmark evaluates agent architectures like FC Agent and ReAct Agent on their ability to execute multi-step, context-dependent procedures requiring tool use and error handling.

Cognitive Weave: Synthesizing Abstracted Knowledge with a Spatio-Temporal Resonance Graph

Cognitive Weave: introduces a novel memory framework for AI agents centered around a Spatio-Temporal Resonance Graph (STRG), orchestrated by the Nexus Weaver (NW), which processes information via the Semantic Oracle Interface (SOI) and Vectorial Resonator (VR).
The STRG is a multi-layered hybrid structure comprising a Core Particle Store for persistent storage of Insight Particles (IPs) and Insight Aggregates (IAs), a Vectorial Subsystem for embeddings, a Temporal Index for time-based queries, and a Relational Graph for modeling relationships.
The system features a dynamic Cognitive Refinement process, managed by the NW and leveraging the SOI, to autonomously synthesize IAs, manage relational structures, and recalibrate importance, enabling continuous learning and memory evolution.

Supporting Construction Worker Well-Being with a Multi-Agent Conversational AI System

Multi-Agent Conversational AI System: introduces a conversational multi-agent AI system for construction worker well-being, with User Interface, User, User Message, Multi-Agent Orchestration, Agents, Agent Configuration, Large Language Model (LLM), Retrieval-Augmented Generation (RAG), Vector Database, External Knowledge, Chunking, Vectorization, Prompt Engineering/Automation, and Personas components.
The system leverages LLMs and RAG, featuring multiple agents with distinct personas and domain knowledge integrated from external documentation.
The multi-agent framework provides practical problem-solving support and social engagement through a collaborative agent workflow managed by orchestration.

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

HeuriGym: introduces an agentic framework for evaluating LLM-crafted heuristics in combinatorial optimization, with Prompt (Input to LLM), LLM (Large Language Model), Generate (Heuristic algorithm code), Compiler / Interpreter (Code processing), Stage I: Execution (Runs generated code), Logs / Errors (Execution feedback), Solution File (Program output), Stage II: Solution Generation (Output produced), Stage III: Verification (Solution checked), Verifier (Checks constraints), Constraints Satisfaction (Verification result), Evaluator (Calculates cost), Cost (Evaluation result), Feedback (Appended to prompt), and Final Results (Overall outcome).
The framework enables LLMs to generate, execute, verify, and iteratively refine heuristic algorithms for complex optimization problems.
Evaluation uses a feedback loop and the Quality-Yield Index metric to assess reasoning, tool use, planning, and adaptive refinement.

LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement

LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement): introduces a hierarchical decision-making framework integrating Large Language Models for context understanding and exploration guidance to enhance autonomous decision-making in dynamic environments.
The architecture features a Strategic Decision Engine for high-level task planning and specialized Workers for low-level action execution, leveraging an Information Space for structured knowledge.
LLMs function as Context Extractors, converting verbal input to structured insights, and Exploration Facilitators, predicting actions, with an Attention Space mechanism embedding these contextual cues into the reinforcement learning policy, reward, and action space.

QUITE: A Query Rewrite System Beyond Rules with LLM Agents

QUITE (Query Rewrite): introduces a training-free, feedback-aware system leveraging LLM agents, rewrite middleware, and a query hint recommender to rewrite SQL queries for improved performance.
The system employs a multi-agent framework controlled by a finite state machine, specialized middleware tools, and a novel hint injection technique.
This approach supports a broader range of query patterns and rewrite strategies, achieving significant execution time reductions and higher rewrite equivalence rates.

MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

MCPWorld: introduces a unified benchmarking testbed with Task Manager (Initializes tasks/environment), Environment (Containerized desktop), Unified Tool-based Space (Agent interaction interface), App Interface (Connects tools to app), Hooker (Captures app signals), and Evaluator (Verifies task completion) components, designed for evaluating API, GUI, and hybrid computer use agents using a white-box approach.
The testbed utilizes "white-box apps" with source code availability to enable programmatic verification of task completion via dynamic code instrumentation.
MCPWorld supports GUI, API, and hybrid interaction modalities and provides a standardized environment and tool-based interface for agent evaluation.

SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

SWE-Dev: introduces a software engineering agent framework, with Repo Info Extraction (extracts codebase info), Description Generation (generates Gherkin scenarios), Test Case Generation (generates test case code), Revision from Traceback (refines test cases), and Fail-to-pass Test Cases (final test dataset) components, which builds SWE agents using a scalable test case generation pipeline.
The framework focuses on training and inference scaling to improve performance on software engineering tasks.
Training scaling involves synthesizing test cases and scaling agent trajectories, while inference scaling increases interaction budget per run.

MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity

MalGEN: introduces a modular, multi-agent framework for generating malware-like artifacts, including Task Planner, Developer, Code Integration, and Executable Builder agents.
The framework simulates adversarial workflows by decomposing user intent into sub-tasks, generating code snippets, integrating them, and building an executable.
MalGEN aims to support defensive cybersecurity research by producing behaviorally diverse, ethically controlled malware samples aligned with MITRE ATT&CK.

Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models

Context-Aware Machine Translation with LLMs: surveys research on using large language models for context-aware machine translation, covering prompt-based, fine-tuning, and other application approaches.
Prompt-based methods utilize LLMs with prompts and examples, while fine-tuning adapts LLMs using specific data and processes.
Other applications include automatic post-editing using an initial MT system and an LLM, agentic frameworks with LLMs, memory, decoding, and agents, and LLM-based evaluation.

SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

SAFEFLOW: introduces a principled protocol for trustworthy and transactional autonomous agent systems, with User (U), Decider (D), Environment (E), Information (I), SafeFlowAgent-Level (SF), Transactional Logging System, SAFEFLOW MONITOR, Dependency Graphs (DAG), Concurrency Control System, SAFEFLOWAGENT SCHEDULER, SAFEFLOWAGENT VERIFIER, and Bayesian Trust Estimation Process components.
The framework enforces fine-grained information flow control using SafeFlowAgent-Levels and ensures reliability through transactional logging, dependency graphs, and concurrency control.
A trusted Verifier component dynamically adjusts trust levels based on logged behavior and a Bayesian trust estimation process, enhancing security and adaptability.

ChemAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning

ChemAgent: enhances LLMs for chemistry and materials science using a HE-MCTS (hierarchical tree search) framework for tree-search based tool learning.
The HE-MCTS framework decouples tool planning (Policy Model) and execution (Execution Model), guided by PRM (step reward) and ORM (outcome reward), and trained via LLM Self-Training (autonomous optimization).
The system integrates a large Chemistry ToolPool (chemical tools) and is benchmarked/trained on the ChemToolBench (tool learning dataset).

INTENTEST: introduces an API-centric stress testing framework for LLM agents, with Semantic Partitioning (divides input space), Seed Task Generation (creates initial tasks), Testcase Mutator (generates task variants), Intent Preservation Sampling (filters intent-preserving mutations), Error Likelihood Estimation (predicts error likelihood), Strategy Memory (stores successful strategies), Strategy Adaptation (retrieves and adapts strategies), LLM Agent (system under test), and Judge (evaluates for violations), designed to systematically uncover intent integrity violations.
The framework partitions the input space based on API parameters and intent categories, generates seed tasks, and iteratively mutates them while preserving user intent.
INTENTEST prioritizes mutations likely to cause errors using a predictive model and improves efficiency by adapting successful strategies from a memory.

Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent

ROS-based agentic framework: introduces a system for natural language control of PX4-based UAVs, integrating PX4 Autopilot (Low-level flight control), ROS2 Middleware (Communication layer), Ollama (Serves LLMs and VLMs), Visual QnA Node (Processes images and queries), Path Planning Node (Generates collision-free trajectory), Map Encoder Node (Embeds pose and semantic info), LLM (Generates action commands), VLM (Assesses visual input), NVIDIA Omniverse (Simulation environment), and Hardware-In-The-Loop (Physical drone setup).
The framework uses Ollama to serve various open-source LLMs and VLMs, managing tasks through modular ROS2 nodes for visual question answering, path planning, and map encoding.
The system enables a drone agent to interpret natural language instructions, perceive its environment, and execute flight actions in both simulation and real-world settings.

MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models

MedChat: introduces a multi-agent framework for multimodal diagnosis, with Retinal Fundus Image, Clinical Note, Glaucoma Classifier, Disk/Cup Segmentor, Shared Prompt, Role-Based Agents, Sub-Reports, Director Agent, Final Report, Frontend, Backend, and Interactive Chat Interface components, designed to emulate multidisciplinary clinical workflows for generating diagnostic reports.
The framework processes medical images and clinical notes using deep learning models, verbalizes outputs into a shared prompt, distributes it to role-specific LLM agents, and synthesizes their sub-reports into a final diagnostic report via a director agent.
A companion platform provides a user interface for input, report viewing, and interactive question-answering, enhancing transparency and usability for clinical review and education.

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

G-Memory: introduces a hierarchical memory system for Multi-Agent Systems (MAS) with Insight Graph (abstracts generalizable insights), Query Graph (encodes task meta-information), and Interaction Graph (stores communication logs) components, which manages MAS interaction history via a three-tier graph hierarchy.
G-Memory performs bi-directional memory traversal to retrieve high-level insights and fine-grained interaction trajectories for new queries.
The hierarchical memory architecture is updated upon task completion by assimilating new trajectories and distilling insights, enabling agent teams to evolve.

Shapley-Coop: Credit Assignment for Emergent Cooperation in Self-Interested LLM Agents

Shapley-Coop: introduces a cooperative workflow for self-interested LLM agents, with Structured Negotiation Protocol (Communication protocol), Short-Term Shapley Chain-of-Thought (Heuristic reasoning), and Long-Term Shapley Chain-of-Thought (Retrospective reasoning), enabling credit assignment.
The framework integrates Shapley Chain-of-Thought reasoning with structured negotiation protocols to align heterogeneous goals and facilitate fair credit assignment.
Shapley-Coop fosters spontaneous cooperation through rational task-time pricing and transparent post-task reward redistribution.

8th June 2025

SCGAgent: Recreating the Benefits of Reasoning Models for Secure Code Generation with Agentic Workflows

SCGAgent: introduces an agentic workflow for secure code generation, with Code Generation, Unit Test Generation, CWE Prediction, Guideline Retrieval, Guideline Relevance Checking, Code Modification, Enforce Functionality Module, Fault Determination, Guideline Database, and Test Runner components.
The framework utilizes an underlying language model to perform generation, prediction, checking, modification, and fault determination tasks, guided by a workflow and a database of secure coding guidelines.
SCGAgent iteratively refines generated code based on secure coding guidelines and feedback from executing LLM-generated unit tests to improve both security and functionality.

Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs

Agentic Framework (knowledge organization): introduces a lightweight, agentic framework for question answering under temporal conflict by incrementally building a structured, external memory from source documents.
The framework utilizes a Language Model agent to decompose Incoming Questions, extract facts from Incoming Text Documents, and update a structured Knowledge Base.
For answering, the agent queries the Knowledge Base for temporally filtered, relevant facts, enabling reliable reasoning over dynamic information without model re-training.

Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments

LIET (Learn as Individuals, Evolve as a Team): introduces a framework for multi-agent LLM adaptation in embodied environments, featuring LIET Agent, Environment, Communication Channel, Memory, Utility Function, Comm., Planner, Know. List, Message Generator, and Reflector components.
The framework enables LLM agents to learn individually via a utility function for cost estimation and evolve as a team through an evolving communication scheme.
Individual agents use memory and the utility function for local planning, while team communication is guided by a shared knowledge list updated via reflection on received messages.

LLM-Enhanced Rapid-Reflex Async-Reflect Embodied Agent for Real-Time Decision-Making in Dynamically Changing Environments

RRARA (Rapid-Reflex Async-Reflect Agent): introduces a hybrid embodied agent combining a rule-based policy (low-latency reflex) for immediate actions with an asynchronous LLM-based Reflector (asynchronous reflection feedback) for in-situ refinement.
The agent executes initial actions based on the low-latency rule-based policy while the LLM-based Reflector analyzes the situation in parallel to provide feedback.
This parallel processing allows the agent to maintain real-time responsiveness while incorporating high-level reasoning to revise suboptimal decisions in dynamic environments.

BIMgent: Towards Autonomous Building Modeling via Computer-use Agents

BIMgent: introduces an agentic framework for autonomous architectural building modeling, with Design Layer (Transforms design requirements), Action Planning Layer (Hierarchically plans modeling steps), High-level Planner (Generates general design steps), Low-level Planner (Generates detailed action substeps), Execution Layer (Executes planned GUI operations), Pure-Action Workflow (Executes deterministic actions), Vision-Driven Workflow (Executes GUI-grounded actions), Supervisor (Monitors execution, provides feedback), and Memory (Stores execution trajectories).
The framework transforms multimodal design intents into 3D BIM models by interpreting design, planning software workflows hierarchically, and executing GUI actions with supervision and reflection.
BIMgent leverages multimodal LLMs as backbones and integrates components like RAG for documentation access and a screen parser for dynamic GUI grounding to handle complex BIM software environments.

Mind the Web: The Security of Web Use Agents

Web-use Agent Architecture: introduces, "a new attack vector exploiting web-use agents' high-privilege browser capabilities by embedding malicious content in web pages", with all LLM (interprets content, plans actions), Perception Module (gathers web content), Execution Engine (interacts with browser/system), State Management (handles sessions, credentials), Agent Interface (user interaction) components, where "the attack leverages LLMs' contextual reasoning limitations to frame malicious instructions as helpful task guidance".
The paper demonstrates nine payload types compromising confidentiality, integrity, and availability against four popular web-use agent implementations.
Mitigation strategies including oversight, execution constraints, and task-aware reasoning are proposed to address these vulnerabilities.

BRIGHT+: Upgrading the BRIGHT Benchmark with MARCUS, a Multi-Agent RAG Clean-Up Suite

MARCUS (Multi-Agent RAG Clean-Up Suite): introduces a multi-agent pipeline with SafeClean (conservative cleaning), FastClean (aggressive cleaning), and Splitter (semantic chunking) agents to clean and re-chunk the BRIGHT corpus into BRIGHT+.
The pipeline leverages LLMs to systematically remove structural noise and address semantic discontinuity in web-scraped data.
BRIGHT+ yields improvements in retrieval accuracy and multi-hop reasoning across diverse retrievers.

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

ToTh (Theorem-of-Thought): introduces a multi-agent framework with Multi-Paradigm Reasoning Agents (Generate reasoning traces), Formal Reasoning Graph Construction (Transform traces to graphs), Bayesian Confidence Propagation (Evaluate graph consistency), Graph Scoring (Select best graph), and Answer Extraction (Extract final answer), modeling reasoning as collaboration and verification.
The framework employs distinct agents for abductive, deductive, and inductive reasoning, structuring their outputs into graphs.
Consistency is verified using NLI-calibrated Bayesian belief propagation to select the most coherent reasoning path.

Accelerating Two-Dimensional Materials Research via a Universal Interatomic Potential and Large Language Model Agent

MCP-based Agent Platform: introduces a platform integrating a universal ML-IAP for 2D materials with an LLM-powered agent, including a database, model (ML-IAP and band gap model), and functional modules.
The platform enables natural language interaction for 2D materials property simulations and high-throughput screening.
The ML-IAP model, based on MatterSim/M3GNet, is trained on a large 2D material dataset, while a GNN/CNN model predicts band gaps.

Position: Simulating Society Requires Simulating Thought

GenMinds (Generative Minds): introduces a conceptual modeling paradigm for generative agents, with Structured Thought Capture (Elicits, parses explanations), Causal Motifs (Minimal causal units), Causal Belief Network (CBN) (Symbolic causal graph), Symbolic-Neural Hybrid Graph Simulation (Inference, belief updates), and Awareness of Unknown (Highlights missing links), designed to simulate structured, revisable, and traceable thought for social simulations.
The framework grounds agents in modular belief representations using causal graphs derived from natural language interviews to capture reasoning fidelity.
The paper also introduces RECAP, a benchmark framework to evaluate reasoning fidelity based on traceability, demographic sensitivity, and intervention coherence.

7th June 2025

An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design

Agentic Framework: introduces an autonomous system for metamaterial inverse design, including Planner (Orchestrates process), Input Verifier (Validates inputs), Forward Modeler (Develops forward model), Inverse Designer (Designs geometry), Memory (Stores information), File_Check (Checks files), Forward_Train (Trains model), Data_Generate (Generates data), Controller (Guides iteration), Code_Modify (Adapts code), Neural_Adjoint (Inverse design tool), Numerical_Simulation (Verifies design), User Message (Input/Output), and System Prompt (Configures agents).
The framework leverages specialized LLM agents and external tools to automate the end-to-end design process from user input to optimized metamaterial geometry.
The system demonstrates autonomous planning, reasoning, and adaptation, achieving performance comparable to human expert-designed solutions.

Boosting LLM Reasoning via Spontaneous Self-Correction

SPOC: introduces a spontaneous self-correction approach for LLMs, with LLM Agent (core model), Dual Agent Roles (proposer/verifier), Interleaved Generation (solution/verification turns), PairSFT (initial training), and Online RL (policy optimization), enabling models to generate interleaved solutions and verifications in a single pass.
The approach dynamically elicits and terminates reasoning generations based on self-verification outcomes, effectively scaling inference time compute.
Training leverages synthetic data for fine-tuning and online reinforcement learning using correctness as reward, yielding substantial performance improvement on math reasoning benchmarks.

Multimodal Spatial Language Maps for Robot Navigation and Manipulation

AVLMaps (Audio-Visual-Language Maps): introduces a unified 3D spatial map representation built by fusing multimodal features from visual, object, area, and audio localization modules, enabling cross-modal reasoning for spatial goal navigation guided by an LLM.
The framework supports zero-shot spatial and multimodal goal navigation, demonstrating improved recall in ambiguous scenarios.
The maps are reusable across different robot embodiments and extensible to additional sensing modalities.

United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

CoThinker: introduces, with Agent Parallel Thinking (Divides cognitive labor), Thinking Style Orchestrator (Assigns thinking styles), Transactive Memory System (Manages shared knowledge), Communication Moderator (Structures communication network), Synthesizer (Consolidates final solution), and Agents (Individual LLMs), a multi-agent LLM architecture operationalizing Cognitive Load Theory principles to mitigate cognitive overload and enhance collaborative problem-solving.
The architecture distributes intrinsic cognitive load through agent specialization via thinking styles and manages transactional load via structured communication and a collective working memory.
CoThinker demonstrates improved performance on complex problem-solving tasks and high cognitive load scenarios compared to existing multi-agent baselines.

AI PsyRoom: Artificial Intelligence Platform for Segmented Yearning and Reactive Outcome Optimization Method](http://arxiv.org/abs/2506.06740v1)

AI PsyRoom: introduces a multi-agent simulation framework for psychological counseling, including PsyRoom A (Dialogue generation module) for generating dialogues and PsyRoom B (Treatment plan module) for generating treatment plans.
The framework leverages fine-grained emotion analysis through Segmenting Psychological Emotions (Fine-grained emotion analysis) and Segmented Emotional Classification (Fine-grained emotion analysis).
PsyRoom A employs Multi-agents (Simulate counseling dialogue) including Client (Simulated patient agent), Counselor (Simulated therapist agent), and Professor of Psychology (Dialogue evaluation agent), while PsyRoom B uses an Emotional Assessor (Emotion evaluation agent) and Emotional Therapist (Treatment plan agent).

WORLDLLM: IMPROVING LLMS' WORLD MODELING USING CURIOSITY-DRIVEN THEORY-MAKING

WorldLLM: introduces a framework to improve LLMs' world modeling by combining Bayesian inference and active exploration, including a Statistician (LLM forward model) to evaluate hypotheses, a Scientist (LLM theory generator) to refine hypotheses, and an Experimenter (Data collection agent) to collect challenging transitions.
The framework iteratively alternates between the Experimenter collecting data, the Statistician evaluating the current hypotheses on this data, and the Scientist updating the natural language hypotheses based on the evidence.
This curiosity-driven process aims to autonomously improve the LLM's predictive accuracy and generate human-interpretable theories of environment dynamics without costly gradient-based fine-tuning.

Contextual Experience Replay for Self-Improvement of Language Agents

CER (Contextual Experience Replay): introduces a training-free framework for language agents, including distillation module (Distills experiences), retrieval module (Retrieves experiences), dynamic memory buffer (Stores past experiences), and base decision-making agent (Solves tasks).
The framework enables self-improvement by distilling environment dynamics and decision-making patterns from past trajectories.
Retrieved experiences are replayed in the agent's context to enhance decision-making in complex web environments.

6th June 2025

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Auditing Framework: introduces a task-level, survey-based approach with Auditing Framework With Audio Interface, WORKBank, Human Agency Scale (HAS), and Autonomous Agent Desire-Capability Landscape components to audit AI agent automation and augmentation potential across the U.S. workforce.
The framework collects worker desires and AI expert capability assessments for occupational tasks, storing this data in the WORKBank database.
Key outputs include the Human Agency Scale for quantifying human involvement and the Desire-Capability Landscape for identifying mismatches and opportunities in AI agent development.

Improving LLM-Powered EDA Assistants with RAFT

EDA-LLM Assistant Workflow: introduces a method to enhance LLM performance for RAG-based EDA tasks using Retrieval-Augmented Fine-Tuning (RAFT) with synthetic question-answer datasets generated by a Data Generation/Refinement LLM from Training Data Sources, leveraging Few-Shot Retrieval from a Q&A History Database, and utilizing Hybrid Retrieval from a Document Database with Access Control for the EDA-LLM.
The workflow incorporates human-authored Q2A posts and unlabeled EDA Documents as Training Data Sources, employing DeepSeek-V3 as the Data Generation/Refinement LLM to create refined answers and synthetic Q&A pairs, optionally guided by Retrieval-Augmented Few-Shot examples retrieved via BM25 from a Q&A History Database.
The approach integrates Hybrid Retrieval combining semantic and lexical search on a Document Database, applies Access Control to filter retrieved documents for security, and fine-tunes the EDA-LLM using RAFT before deploying it for RAG-based inference.

ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search

ScriptDoctor: introduces a pipeline for automatic PuzzleScript game generation, including an LLM (Generates game code), Lark CFG Parser (Parses generated code), PuzzleScript Engine (Compiles game code), BFS Solver (Tests game solvability), Human Game Archive (Provides game examples), Coding Prompt (Guides LLM generation), Parse Errors (Syntax error feedback), Compile Errors (Compilation error feedback), and Solvability Issues (Playtesting feedback).
The system iteratively generates and tests games, using feedback from parsing, compilation, and playtesting to refine the LLM's output.
This approach demonstrates automated, open-ended LLM-based workflows for generating novel game content in a constrained domain.

On-board Mission Replanning for Adaptive Cooperative Multi-Robot Systems

GATR (Graph Attention Replanner): introduces a lightweight mission replanner using a GAT Encoder (Transforms graph data) and Attention Model Decoder (Generates mission plan) to solve the Cooperative Mission Replanning Problem.
The framework employs an RL Agent (Learns planning policy) interacting with an RL Environment (Simulates mission dynamics), processing an Input Graph (Represents tasks agents) along with the Environment State (Summarizes mission progress) and Availability Mask (Filters invalid actions).
This approach enables fast and efficient on-board replanning for multi-robot systems by transforming input data into latent representations and sequentially generating mission plans.

PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

PersonaAgent: introduces a personalized LLM agent framework with a persona (user-specific system prompt), personalized memory (stores user data), personalized action (selects tailored actions/tools), test-time user preference alignment (optimizes persona prompt), and tools (external functions).
The personalized memory module integrates episodic memory (records interactions) and semantic memory (summarizes user traits).
The persona serves as an intermediary, using memory insights to guide actions and being refined by action outcomes and test-time alignment.

Can Theoretical Physics Research Benefit from Language Agents?

LLM agents: introduces the potential for LLM agents, with Domain Knowledge, External Tools, Multimodal Processing, Reasoning Capabilities, Information Retrieval, Human Interface, and Experimental Interaction components, to accelerate theoretical physics research by assisting across the typical workflow stages.
The paper analyzes current LLM capabilities and limitations in physics reasoning, highlighting the need for improvements in physical intuition, constraint satisfaction, and reliability.
Realizing this potential requires addressing fundamental challenges like ensuring physical consistency and developing robust verification methods through interdisciplinary collaboration.

Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Multi-Agent Pipeline: introduces a lightweight multi-agent framework for text-to-chart generation, including a Drafting Agent (generates initial code), a Python Interpreter (executes code), a Re-writer Agent (debugs code), and an Execution and Repair Loop (iteratively fixes errors).
This pipeline separates the tasks of code generation, execution, repair, and judgment to improve reliability.
The agentic approach significantly reduces execution errors compared to single-prompt methods, highlighting the value of iterative self-correction.

The Lock-in Hypothesis: Stagnation by Algorithm

Human-LLM Feedback Loop: introduces the lock-in hypothesis, proposing that the dynamic interaction between human users and large language models, involving Human Agents (users), LLM Authority (AI system), Beliefs (ideas, values, opinions), Trust (mutual influence weight), and a Diversity Metric (conceptual variety measure), can lead to a loss of diversity and convergence on false beliefs.
The paper formalizes this hypothesis using a Bayesian model and tests it empirically with agent-based LLM simulations and real-world GPT usage data.
Analysis reveals sudden drops in diversity after new GPT versions are released, supporting the hypothesized feedback loop's role in reinforcing existing beliefs.

Personalized Large Language Models Can Increase the Belief Accuracy of Social Networks

Personalized LLM Bot: introduces a system, with Traditional ML Model (Predicts user preferences), External Database (Stores news articles), RAG Model (Retrieves relevant articles), Summarization Component (Summarizes retrieved articles), and Styling LLM (Rephrases for rhetorical style), designed to provide personalized, factually accurate responses within a social network simulation.
The bot's responses are tailored to individual user preferences regarding news sources and rhetorical style, based on predictions from a machine learning model and information retrieved from an external database.
The study demonstrates that the presence of this personalized LLM bot in a social network leads individuals to update their beliefs towards factual accuracy and influences their subsequent network connections.

Conversational Interfaces for Parametric Conceptual Architectural Design: Integrating Mixed Reality with LLM-driven Interaction

The system: introduces a framework for parametric architectural design using a Reasoning-Code Generation-Execution cycle, integrating a multi-agent LLM system (Reasoning Agent, Coding Agent, Optimization Agent) with a Mixed Reality environment.
The system leverages an Interface Manager and ShapeFramework for user interaction and visualization within MR, while an LLM Session Manager orchestrates the agents and a Compiler Client handles code execution.
This approach aims to lower barriers to parametric modeling by enabling natural language and gesture interaction, dynamic parameter management, and iterative design exploration in an immersive environment.

AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

AgentSwift: introduces a framework combining selection (selects agent), hierarchical expansion (expands selected agent), value model (predicts performance), and performance uncertainty (guides exploration) for efficient LLM agent design.
Hierarchical expansion includes recombination (replaces components/workflow), mutation (generates new implementations), and refinement (adjusts based on feedback).
The framework leverages a hierarchical search space (models agent design) including agentic workflow (defines execution steps/flow) and functional components (includes memory, tool, planning).

CrimeMind: Simulating Urban Crime with Multi-Modal LLM Agents

CrimeMind: introduces CrimeMind (LLM-driven ABM framework), with LLM Agents (Powered by large language models), Routine Activity Theory (Guides agent crime decisions), Urban Environment (Grid-based spatial simulation), Structured Data (Demographic, socioeconomic features), Street View Imagery (Visual urban scene input), Vision-Language Model (Processes visual urban cues), Human Annotation (Dataset for perception alignment), Self-Evolution Alignment (Calibrates VLM to human judgment), Agent Mobility (Simulates agent movement), and Crime Heatmap (Aggregated crime event visualization), which simulates urban crime using theory-grounded LLM agents in a multimodal urban context.
The framework integrates Routine Activity Theory into agent decision-making and uses a self-evolution alignment process to calibrate visual perception with human judgment.
CrimeMind enables counterfactual simulations and policy evaluation by allowing agents to dynamically adapt behavior based on changing conditions.

CodeContests+: High-Quality Test Case Generation for Competitive Programming](http://arxiv.org/abs/2506.05817v1)

G-V (Generator-Validator) agent system: introduces an LLM-based agent system for high-quality test case generation, including Generator Agent, Validator Agent, Generator Program, and Validator Program.
The Generator Agent writes programs to create diverse test inputs, while the Validator Agent writes programs to verify these inputs against problem constraints.
Test cases failing validation provide feedback to the Generator Agent for revision, improving correctness and coverage.

MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

MAPLE (Multi-agent Adaptive Planning with Long-term mEmory): introduces a novel framework for table reasoning with Solver (Iterative reasoning), Checker (Answer verification), Reflector (Error diagnosis), Archiver (Memory management), Working Memory (Current task state), and Long-term Memory (Accumulated knowledge) agents in a feedback loop.
The framework mimics human problem-solving by enabling dynamic adaptation within and across tasks through iterative refinement and experiential learning.
Specialized agents collaborate in a feedback-driven cycle, leveraging dual memory systems for robust and accurate table reasoning.

To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt

Polymorphic Prompt Assembling (PPA): introduces a defense against prompt injection by dynamically varying prompt structure using User Input, Instruction Prompt, Separator Set, System Prompt Set, Random Selector, Format Constraints, and Polymorphic Prompt Assemble process.
The approach randomizes the combination of user input and system prompts using selected separators and templates to disrupt attacker predictability.
This method enhances LLM agent security against adaptive attacks with near-zero runtime overhead.

Toward Greater Autonomy in Materials Discovery Agents: Unifying Planning, Physics, and Scientists

MAPPS (Materials Agent unifying Planning, Physics, and Scientists): introduces a multi-agent framework for autonomous materials discovery, including a Workflow Planner (Generates multi-step workflows), a Tool Code Generator (Synthesizes executable code), and a Scientific Mediator (Coordinates agents and human).
The framework enables Level 2 autonomy by allowing agents to plan workflows guided by human input, rather than executing fixed, predefined steps.
MAPPS integrates physics-based tools and human feedback to ensure scientific validity and improve performance in crystal structure generation and prediction tasks.

5th June 2025

Energentic Intelligence: From Self-Sustaining Systems to Enduring Artificial Life

Energentic Intelligence: introduces a class of autonomous systems driven by persistence, with Energy Generation Core (Converts ambient energy), Energo-Cognitive Cortex (Performs perception/decision-making), Thermal Regulation Unit (Manages internal temperature), and Survival Manager (Estimates survival, issues commands) components.
This framework operationalizes autonomy through energetic persistence, integrating energy harvesting, adaptive computation, and thermoregulation into a cohesive, internally regulated feedback loop.
The system aims to sustain its existence by continuously adapting behavior based on internal energy and thermal conditions, rather than optimizing external task performance.

OPERA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPERA-based User Behavior Simulation: introduces OPERA Dataset (Dataset), User Persona (User profiles), Action Traces (User interactions), Web Observations (Web context), Rationales (Action explanations), ShoppingFlow Plugin (Data collection plugin), Content Script (Logs user interactions), Background Script (Tracks page events), Rationale Pop-up (Collects rationales), and LLM (Simulation model), which provides a dataset and benchmark for evaluating LLMs on simulating human online shopping behavior.
The framework utilizes the ShoppingFlow plugin to collect detailed user data, including actions, web context, rationales, and persona information.
This data is then used to benchmark LLMs on predicting user actions and rationales in online shopping scenarios.

IMPROVING LLMS WITH A KNOWLEDGE FROM DATABASES

Enhanced Association Rule RAG: introduces a method to improve LLM answers by augmenting them with knowledge discovered from databases using enhanced association rules, including Dataset, Rule Mining Pattern Definition, Rule Mining Task Definition, Rule Mining Execution, Rule List, Rule-To-Text Module, Text Document, RAG Augmentation, and LLM components.
The approach extracts knowledge from a dataset via rule mining, converts the resulting rules into a text document, and embeds this document into the LLM's context using Retrieval-Augmented Generation.
This method provides interpretable knowledge to the LLM, enabling improved data-based question answering without requiring the LLM to directly execute analytical code or interpret complex rule formats.

SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Fact Checking Framework: introduces a two-stage pipeline for deepfake detection using YOLO (Face Detection), FaceNet (Feature Extraction), Influential People database (Identity Comparison), Whisper (Speech Transcription), LLM AGENT-1 (Plausibility Analysis), LLM AGENT-2 (Factual/Ethical Check), WEB SEARCH (External Information), and LLM (Final Decision) to analyze audio-visual content.
The framework identifies individuals and transcribes speech in the first stage, then uses a multi-agent LLM pipeline to verify authenticity based on plausibility, factual correctness, and ethical implications.
This multimodal approach integrates visual recognition, speech transcription, and language-based reasoning to enhance robustness against sophisticated deepfakes.

LLM Agents for Asynchronous Group Communication in Mafia Games

LLM Agent: introduces an adaptive asynchronous agent for group communication, featuring a Scheduler (decides when to speak) and Generator (composes message content) modules, using Context (game state and chat history) and guided by dynamic Scheduling Prompt (guides timing decision) and Generation Prompt (guides message content), incorporating Simulated Typing Time (adds human-like delay).
The agent is evaluated in Mafia games alongside human players, demonstrating performance comparable to human players in timing and win rates.
The asynchronous design allows the agent to decide both what to say and when to say it, better mimicking real-world group interactions.

ProRefine: Inference-time Prompt Refinement with Textual Feedback

ProRefine: introduces an inference-time prompt optimization method using textual feedback from LLMs, including LLMtask (Executes task), LLMfeedback (Critiques output), and LLMoptimizer (Refines prompt) components.
The LLMtask executes the task, LLMfeedback critiques its output, and LLMoptimizer refines the prompt based on the feedback in an iterative loop.
This process dynamically refines prompts for multi-step reasoning tasks without requiring additional training or ground truth labels.

Teaming in the AI Era: AI-Augmented Frameworks for Forming, Simulating, and Optimizing Human Teams

Frameworks: introduces AI-augmented frameworks for forming, simulating, and optimizing human teams, including a Team Formation Framework using a UCB Algorithm and user feedback, tAlfa (Team AI Feedback Assistant) with an LLM-powered agent and processing stages for feedback generation and delivery based on communication metrics, and PuppeteerLLM, an LLM-based simulation framework with LLM agents, physical environments, temporal dynamics, and simulation stages.
The Team Formation Framework iteratively refines team recommendations using a multi-armed bandit approach guided by user preferences.
tAlfa provides immediate, personalized AI-generated feedback on team dynamics by processing messages and evaluating communication metrics.

LLM-Guided Scenario-based GUI Testing

SCENGEN (LLM-guided scenario-based GUI testing approach): introduces a novel approach for scenario-based GUI testing leveraging multi-modal LLMs and a multi-agent framework, including Context Memory, Observer, Decider, Executor, Supervisor, and Recorder components.
The framework simulates manual testing by iteratively observing GUI state, making decisions, executing actions, verifying results, and recording information.
Multi-agent collaboration and LLM guidance enable understanding app semantics and generating scenario-based GUI tests effectively.

Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

hierarchical MA-LLM framework (Multi-Agent Language Model): introduces a system for aerial-ground robots, integrating a Reasoning Layer (LLM) for task decomposition and mapping, a Perceptual Layer (VLM) for semantic extraction, and an Execution Layer for motion control.
The framework utilizes an Aerial Robot as a leader for global guidance and a Ground Robot as a follower for local navigation and manipulation.
GridMask enhances the VLM's spatial perception, supporting robust semantic navigation and manipulation in dynamic environments.

QiMeng: Fully Automated Hardware and Software Design for Processor Chip

QiMeng: introduces a novel system for fully automated hardware and software design for processor chips, with a Large Processor Chip Model (LPCM) as a domain-specialized LLM, Hardware Design Agent for automated hardware design, Software Design Agent for automated software design, and Top-layer Applications for various design tasks.
The system is structured in three hierarchical layers, leveraging AI and LLMs to address challenges in processor chip design.
QiMeng aims to automate the entire design and verification pipeline, enabling rapid customization and improved efficiency.

Agentic AI for Intent-Based Industrial Automation

Intent-Based Agentic AI Framework: introduces a conceptual framework for intent-driven industrial automation using LLM-based agents, featuring a Root Agent, Specialized Sub-Agents, LLM, SLM, Memory, Tools Set, Industrial Data, Machines, and Business and Operational Intents.
The framework translates high-level natural language business or operational intents into structured components, enabling autonomous planning and execution via agent orchestration and specialized tools.
This approach simplifies human-machine interaction by abstracting technical complexity and aligns with Industry 5.0's human-centric vision.

LLMS FOR SENSORY-MOTOR CONTROL: COMBINING IN-CONTEXT AND ITERATIVE LEARNING

LLM-based Sensory-Motor Control Framework: introduces a method where an LLM (Large Language Model) generates a control strategy, encodes it into IF-THEN rules and Python Code, and evaluates it in an Environment/Task.
The framework iteratively refines the Strategy (Text/Rules) by prompting the LLM with Performance/Sensory-Motor Data and Past Experiences/External Memory.
This approach enables autonomous learning for embodied agents by directly mapping observations to actions without relying on predefined motor primitives or human demonstrations.

Empowering Economic Simulation for Massively Multiplayer Online Games through Generative Agent-Based Modeling

MMOAgent (Generative Agent-Based Modeling): introduces an LLM-empowered framework for MMO economic simulation, featuring profile (tailors agent to player traits), perception (interprets game environment observations), reasoning (determines appropriate structured actions), memory (logs game experience, past trajectories), and action (executes permissible game actions) modules.
The framework utilizes LLMs' capabilities for human-like decision-making and adaptability, addressing reliability, sociability, and interpretability challenges in traditional agent-based modeling.
The simulation environment is enhanced with player-to-player trading and linguistic negotiation, enabling realistic economic interactions and emergent phenomena like role specialization and market price dynamics.

Gen-n-Val: Agentic Image Data Generation and Validation

Gen-n-Val: introduces a novel agentic framework for generating and validating synthetic image data, leveraging a LD Prompt Agent (LLM) (Generates optimized prompts), Data Validation Agent (VLLM) (Filters generated images), Layer Diffusion (LD) (Generates transparent images/masks), TextGrad (Optimizes agent prompts), and Image Harmonization (Blends instances onto backgrounds).
The framework uses agents and generative models to produce high-quality synthetic data with precise instance masks and diverse backgrounds for computer vision tasks.
Gen-n-Val significantly improves performance on instance segmentation and object detection benchmarks, particularly for rare classes and open-vocabulary detection.

E-bike agents: Large Language Model-Driven E-Bike Accident Analysis and Severity Prediction

E-bike agents: introduces a framework using LLM-powered agents to analyze unstructured e-bike accident reports, including a Data Classifier, Information Extractor, Injury Causes Determiner, and Incident-Component Link Detector.
The framework processes extracted data using an Ordered Logit Model to analyze severity relationships and employs Visualization to present findings.
This approach provides a scalable solution for e-bike safety analytics by converting narrative reports into structured, actionable insights.

Agents of Change: Self-Evolving LLM Agents for Strategic Planning

LLM Self-Evolving Agent Framework: introduces self-evolving LLM agents for strategic planning in Settlers of Catan, including BaseAgent (Input, Interface, Decision maker, Output), StructuredAgent (Input, Input structuring, Interface, Decision maker, Output), PromptEvolver (Coordinator, Game player, Intelligence, Intelligence, External access, Memory, Instruction, Feedback, Feedback processing, Game outcome, Interface), and AgentEvolver (Coordinator, Evaluator, Information gatherer, Code modifier, Advisor, Game player, Intelligence, Intelligence, Intelligence, Intelligence, Intelligence, Intelligence, External access, Reasoning, Instruction processing, Input, Game outcome, Interface).
The framework benchmarks four agent architectures with increasing self-improvement capabilities against a strong heuristic baseline in the Catanatron simulator.
Self-evolving agents, particularly PromptEvolver and AgentEvolver, demonstrate improved strategic planning and performance over static baselines through iterative prompt and code refinement.

FLEX-TRAVELPLANNER: A BENCHMARK FOR FLEXIBLE PLANNING WITH LANGUAGE AGENTS

Flex-TravelPlanner: introduces a benchmark for evaluating language agents in dynamic, multi-turn planning scenarios, using a pipeline with Initial Constraint (Start planning with constraints), Adding Constraint (Introduce new constraints), Revising Constraint (Modify existing constraints), and Fin (End of planning process) steps.
The framework evaluates how well agents adapt plans as new requirements or changes are introduced over multiple interactions.
It specifically addresses the challenges of constraint addition and revision, mirroring real-world planning dynamics.

Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Tool-MVR: introduces a novel Tool-Augmented LLM framework that enhances System 2 reasoning capabilities by employing MAMV (Data verification pipeline) for high-quality data generation (ToolBench-V, Verified instruction dataset) and EXPLORE (Reflection learning algorithm) for learning from errors (ToolBench-R, Reflection dataset), utilizing a Base LLM (Base model) interacting with APIs (External tools) based on User Query (Input), generating Reasoning Trajectory (Step-by-step process) and Final Answer (Output) informed by Observation (Tool feedback).
The MAMV pipeline consists of APIOptAgent (API verification/optimization agent), QueryVerifyAgent (Query assessment/filtering agent), and APICallAgent (Trajectory generation/verification agent) to ensure data quality for tool planning and invocation.
EXPLORE enables the model to learn adaptive tool reflection by leveraging tool feedback through an Error → Reflection → Correction paradigm.

SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

SmartAvatar: introduces a vision-language-agent-driven framework for generating 3D human avatars, utilizing a Descriptor (Extracts attributes), Generator (Synthesizes code), Evaluator (Checks alignment), Refiner (Adjusts code), Human Generator (Parametric avatar model), and Blender (Rendering environment).
The system incorporates a VLM-guided auto-verification loop that iteratively refines generated avatars to match user input across visual and semantic criteria.
SmartAvatar supports diverse inputs including text, image, and multimodal combinations, enabling conversational editing for customizable, animation-ready avatars.

Demonstrations of Integrity Attacks in Multi-Agent Systems

Multi-Agent System (MAS): introduces integrity attacks where malicious agents manipulate system operations and evaluation outcomes within systems comprising Coder, Tester, Reviewer, WebSearcher, and Monitor components.
These attacks, including Self-Dealer, Free-Rider, Scapegoater, and Boaster, exploit inter-agent communication and the Monitor's evaluation process.
The research demonstrates that these manipulations can bias agent behavior and evaluation scores while maintaining overall task performance.

OpenAg: Democratizing Agricultural Intelligence

OpenAg: introduces, "a comprehensive framework designed to advance agricultural artificial general intelligence", with Multi-Modal Knowledge Ingestion, Unified Agriculture Knowledge Base, Neural Agricultural Knowledge Graph Generation, Adaptive Multi-agent Reasoning System, Causal Agricultural Decision Transparency, and Adaptive Agricultural Transfer Learning components, where "it integrates diverse data flows and advanced reasoning".
The framework aims to deliver context-aware, explainable, and actionable insights for agricultural decision support.
OpenAg bridges the gap between scientific knowledge and farmer expertise to support scalable and locally relevant decision-making.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

CAIS (Compound AI Systems): introduces a framework integrating LLMs with external components and orchestration, categorized into RAG, LLM Agents, and MLLMs, to overcome standalone LLM limitations.
The framework leverages components like retrievers, agents, tools, and multimodal encoders, coordinated by orchestration strategies, for complex tasks.
The survey provides a taxonomy, architectural analysis, evaluation framework, and research agenda for these modular, composable AI systems.

--

4th June 2025

CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

CogMath: introduces a framework for assessing LLMs' mathematical abilities using an Inquiry agents (Pose dimension-specific inquiry), Judge agents (Evaluate inquiry quality), Reference agents (Provide correct answer), and Evaluated LLM (Model being assessed) system across human cognitive stages.
The framework evaluates LLMs by posing dimension-specific inquiries generated and refined by agents, comparing the LLM's response to a reference answer.
This multi-agent system allows for a fine-grained assessment of LLMs' performance across nine dimensions within problem comprehension, solving, and summarization stages.

MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

MedAgentGym: introduces a unified training environment for enhancing coding-based medical reasoning in LLM agents, featuring an LLM Agent (Model trained/evaluated), Coding Environment (Isolated executable containers), Interactive Feedback Mechanism (Processes, executes, translates errors), Data Resources (Task datasets), Trajectory Collection (Samples, stores interactions), and Verifier (Evaluates trajectory success).
The environment includes 72,413 tasks from 12 real-world biomedical scenarios, encapsulated in isolated, executable coding environments with interactive feedback.
MedAgentGym supports scalable training trajectory generation and extensive benchmarking of LLMs for code-based medical reasoning.

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

SuperWriter-Agent: introduces an agent-based framework for long-form text generation with Stage 1: Plan (Structured planning), Stage 2: Write (Paragraph generation), and Stage 3: Refine (Iterative refinement) stages, utilizing AI commentators (Discuss ideas), Writer (Develops plan, writes), Thinker (Plans paragraph), Checker (Reviews text), and Refiner (Revises text) agents.
The framework generates training data for SuperWriter-LM (Language model), which is optimized using Hierarchical DPO (Multi-stage optimization) guided by MCTS (Explores paths) and a Judge LLM (Scores outputs).
This approach simulates human writing processes to enhance coherence, consistency, and quality in long-form text generation.

TracLLM: A Generic Framework for Attributing Long Context LLMS

TracLLM (Generic Framework for Attributing Long Context LLMs): introduces a generic context traceback framework, with Instruction, Context, LLM, Output, Iterative Search, Group Division, Score Computation, Group Pruning, Score Denoising, Score Ensemble, and Attribution Method components, designed to efficiently and accurately identify texts in a long context contributing to an LLM's output.
The framework employs an informed search algorithm that iteratively divides and prunes text groups based on contribution scores calculated by a feature attribution method.
TracLLM enhances accuracy through contribution score denoising and ensemble techniques, demonstrating effectiveness in post-attack forensic analysis and debugging LLM systems.

TRISM FOR AGENTIC AI: A REVIEW OF TRUST, RISK, AND SECURITY MANAGEMENT IN LLM-BASED AGENTIC MULTI-AGENT SYSTEMS

TRISM (Trust, Risk, and Security Management): introduces a structured framework for LLM-based agentic multi-agent systems, including Governance, Explainability, ModelOps, Application Security, and Model Privacy components.
This framework addresses unique trust, risk, and security challenges posed by autonomous, collaborative, and evolving agent behaviors in high-stakes domains.
The paper provides a comprehensive review, risk taxonomy, trust-building mechanisms, security/privacy methods, and a roadmap for responsible agentic AI deployment.

AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

AmbiK Dataset and Evaluated Methods: introduces AmbiK Dataset (textual benchmark), Ambiguity Detection Methods (algorithms for deciding help), Large Language Models (LLMs) (process language, predict actions), Conformal Prediction (CP) (forms prediction sets), and Uncertainty Estimation (quantifies confidence), presenting AmbiK, a textual dataset for evaluating ambiguity detection methods for embodied AI in kitchen environments.
The paper evaluates several existing ambiguity detection methods, including CP-based (KnowNo, LAP, LofreeCP) and non-CP based (Binary, No Help), utilizing various LLMs on the AmbiK dataset.
Experiments demonstrate that current methods and LLMs face significant challenges in effectively handling ambiguity on the AmbiK benchmark, particularly in distinguishing ambiguous from unambiguous tasks.

AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data

AI Triage (multi-agent system): introduces a multi-agent architecture for conversational patient triage, including Primary agent (Orchestrates other agents), Symptom Collector (Collects patient symptoms), HealthDataPlanner (Plans EHR data retrieval), HealthDataRetriever (Retrieves EHR data), Summary (Synthesizes case information), Differential Diagnosis (Narrows potential diagnoses), Next Steps (Provides care recommendations), Guideline Verifier (Verifies recommendations with guidelines), EHR data (Source of patient records), Clinical Guideline Database (Source of clinical guidelines), and Outputs (Final triage decision), designed to emulate physician reasoning for patient assessment and triage.
The system interacts with a Patient Simulator, which generates realistic patient conversations from real-world EHR data vignettes for scalable evaluation.
The multi-agent design enhances interpretability and control, while the Guideline Verifier adds a layer of safety by grounding recommendations in clinical best practices.

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

LLM-based Agent: introduces AgentMisalignment, a benchmark suite evaluating the propensity for misaligned behavior in LLM-based agents, which integrate a Large Language Model within an Interactive Scaffold, enabling them to use Tools, store Memory, interact via Environment Interfaces, and operate through an Operational Loop influenced by System Prompts and Reflective Prompts.
The benchmark uses the InspectAI framework and a Comprehensive Misalignment Scoring (CMS) mechanism to quantify misaligned actions across various realistic scenarios.
Evaluations reveal that both model choice and personality prompts significantly influence agent misalignment tendencies, highlighting the importance of careful prompt engineering.

Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning

Graph Counselor: introduces a multi-agent collaborative reasoning framework with AGIEM (Graph information extraction), Planning Agent (Establishes reasoning path), Thought Agent (Refines extraction scope), Execution Agent (Executes graph functions), Retrieve (Finds node by keyword), Feature (Gets node attribute), Degree (Gets neighbor count), Neighbour (Lists neighbors), SR (Self-reflection and correction), and Judgment Module (Evaluates reasoning correctness) to enhance LLM reasoning on knowledge graphs.
The framework utilizes an Adaptive Graph Information Extraction Module (AGIEM) with three agents for dynamic graph information extraction and a Self-Reflection (SR) module for improving reasoning reliability.
Graph Counselor employs a multi-round iterative process involving planning, thought, execution, and reflection to adaptively extract graph knowledge and refine reasoning.

PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

MAS (Multi-Agent Systems): introduces PulseReddit, a novel dataset aligning Reddit discussions with high-frequency cryptocurrency market data, and evaluates LLM-based MAS performance using Reddit API (Data source), Raw Posts (Unprocessed data), Data Preprocess (Cleaning, filtering data), Structured Data (Processed data format), Market Analyst (Analyzes on-chain metrics), News Analyst (Analyzes off-chain signals), Trading Agent (Synthesizes inputs, decides action), and Reflection Agent (Analyzes performance, refines strategy).
The MAS framework, based on CryptoTrade, leverages specialized agents to integrate on-chain and off-chain signals for high-frequency cryptocurrency trading decisions.
Experiments show MAS augmented with PulseReddit data outperform traditional baselines, particularly in bull markets, demonstrating the value of social sentiment in HFT.

AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance

AssetOpsBench: introduces a unified framework and environment for benchmarking AI agents in industrial asset operations, including a global coordinator, specialized agents, memory, task planning, and iterative execution.
The framework supports multi-agent architectures like Agents-As-Tool and Plan-and-Execute, utilizing components such as planners, orchestrators, reviewers, and summarization modules.
AssetOpsBench provides a multi-source dataset and automated evaluation framework to assess agent performance on real-world industrial tasks requiring perception, reasoning, and control.

From Theory to Practice: Real-World Use Cases on Trustworthy LLM-Driven Process Modeling, Prediction and Automation

Three-module AI solution: introduces a framework integrating ML, UQ, XAI, and multi-agent LLMs to transform opaque predictions into auditable, interactive workflows.
The framework grounds explanations in MES event logs and enables natural language dialogues for real-time validation and adaptation.
It employs a multi-agent LLM architecture with RAG to provide context-aware recommendations and ensure consistency.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Orak: introduces a foundational benchmark for training and evaluating LLM agents across diverse video games, with Environment (12 diverse video games), LLM Agent (Evaluated gameplay agent), LLMs (Backbone language models), Agentic Modules (Strategies like reflection, planning), MCP Interface (Plug-and-play connection protocol), Evaluator (Manages game loop, scoring), and Fine-tuning Dataset (Expert gameplay trajectories).
The benchmark utilizes a plug-and-play MCP interface to connect LLM agents with diverse game environments and agentic modules for consistent evaluation.
Orak provides a comprehensive evaluation framework including leaderboards, battle arenas, and studies on agentic modules and fine-tuning effects, supported by a dataset of expert gameplay trajectories.

CogniPair: From LLM Chatbots to Conscious AI Agents - GNWT-Based Multi-Agent Digital Twins for Social Pairing - Dating & Hiring Applications

GNWT-Agent Cognitive Architecture: introduces a computational implementation of Global Workspace Theory, featuring Input Processing, Feature Extraction, Module Salience, specialized Cognitive Modules (Emotion, Memory, Planning, Social Norms, Goal Tracking), Global Workspace Integration, Persistent Memory, and Response Generation for creating psychologically realistic AI agents.
This architecture enables parallel processing across modules, dynamic salience-based attention, global workspace broadcasting for integration, and persistent memory for state evolution, addressing psychological and social behavior gaps in LLM agents.
Deployed within the CogniPair system for social simulations like dating and hiring, the GNWT-Agent demonstrates unprecedented psychological realism and human-like behavioral evolution compared to baselines.

Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

Debate and Reflect (D&R): introduces a framework that orchestrates multi-turn Debate (Multi-turn interaction) between Teacher Models (Stronger models) and a Student Model (Smaller model), incorporating Self-Reflection (Student self-analysis) and Teacher Feedback (Teacher critiques), recording interactions in a Multi-Agent Interaction Graph (MAG) (Records debate content) to construct a Preference Tree (Hierarchical structure for training) for Distillation (Knowledge transfer process) into Distilled Models (Student after training).
The framework leverages debate logs and Tree-structured Direct Preference Optimization (T-DPO) to efficiently transfer knowledge and reasoning abilities from teachers to the student model.
Empirical evaluations show that the approach significantly improves smaller model accuracy, robustness, and generalization compared to conventional baselines.

VChatter: Exploring Generative Conversational Agents for Simulating Exposure Therapy to Reduce Social Anxiety

VChatter: introduces a multi-agent system for simulating exposure therapy, with Agent-P (Psychotherapist agent), Agent-H (Interactive human agent), Large Language Model (Text generation), Text-to-Voice Model (Speech output), 3D Virtual Character Models (Agent avatars), Chat Interface (User interaction), and Scenario List (Scenario management).
The system utilizes LLMs to power conversational agents that guide users through personalized exposure therapy plans and simulate social interactions in various scenarios.
VChatter aims to provide a safer and more accessible environment for individuals with social anxiety to practice coping mechanisms and reduce avoidance behaviors.

Reason from Future: Reverse Thought Chain Enhances LLM Reasoning

Reason from Future (RFF): introduces a novel reasoning paradigm that enhances LLM reasoning by integrating bidirectional reasoning, utilizing a Last Step Generator (generates last previous step), Stepwise Forward Reason (generates next forward step), State Check (determines termination conditions), and Verifier (verifies path correctness) to generate a solution path.
RFF alternates between reverse thinking to guide forward reasoning, aiming to obtain a future perspective and narrow the solution search space.
The framework demonstrates improved accuracy and efficiency on complex tasks by constraining reasoning to target-driven states and mitigating error accumulation.

3rd June 2025

S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

S4-Driver: introduces a scalable self-supervised motion planning method, with Camera Images, Image Encoder, Image features, Sparse Volume Representation, Historical ego-states, High-level behavior, Text prompt, Tokenize, Multimodal Encoder, Multimodal Decoder, Hierarchical Planning, Meta-decision, Multi-decoding, Nucleus sampling, Multi-output aggregation, where S4-Driver predicts ego-vehicle waypoints from camera images and text prompts using a multimodal large language model enhanced with spatio-temporal visual representation and hierarchical planning.
The framework employs a novel sparse volume representation to aggregate multi-view and multi-frame visual information, enhancing 3D reasoning for motion planning.
Self-supervised training with ego-vehicle trajectory supervision and multi-decoding aggregation improves performance and scalability without requiring human annotations for intermediate tasks.

Why do AI agents communicate in human language?

Native Multi-Agent Model Paradigm: introduces a paradigm shift for multi-agent systems, proposing Role Persistence Mechanism, Structured Communication Mechanism, Inter-Agent State Synchronization Mechanism, Functional Decoupling, Explicit Coordination Graph, and Semantic Identity Separation Mechanism.
The paper argues that relying on natural language for inter-agent communication in current LLM-based systems introduces fundamental limitations due to semantic misalignment and architectural incompatibility.
The proposed paradigm aims to build multi-agent systems with native collaborative capabilities by incorporating structural mechanisms for semantic alignment and coordination fidelity.

FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

FailureSensorIQ (Multi-Choice Question-Answering benchmarking system): introduces FailureSensorIQ Dataset (benchmark data), LLMs (models evaluated), Evaluation Module (measures performance), Prompting (input formatting), Perturbation Pipeline (dataset variations), ReAct Agent (LLM with tools), External Knowledge Sources (retrieval resources), where FailureSensorIQ is a benchmarking system designed to assess LLMs' reasoning on industrial domain QA using a novel dataset.
The system utilizes a Dataset Generation Pipeline to create the FailureSensorIQ Dataset from expert knowledge and evaluates LLMs using various Prompting strategies and an Evaluation Module, including tests with a Perturbation Pipeline and a ReAct Agent accessing External Knowledge Sources.
Evaluation on the benchmark reveals LLMs struggle with domain-specific reasoning and robustness under dataset perturbations, highlighting the challenge and need for improved LLM capabilities in industrial settings.

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

CURE: introduces a novel reinforcement learning framework where a Policy (LLM agent) acts as both a Code Generator (Generates code) and Unit Test Generator (Generates tests), evaluated by an Execution Engine (Runs code/tests), guided by a Reward Model (Calculates reward), and optimized by a Reinforcement Learning Optimizer (Optimizes policy) for co-evolution.
This approach allows the unit tester to learn from the coder's errors without requiring ground-truth code supervision, enhancing flexibility and scalability.
A response-length-guided transformation is applied to the unit test reward for long-CoT models to improve inference efficiency.

DPO Learning with LLMs-Judge Signal for Computer Use Agents

LLM-as-Judge DPO Pipeline: introduces a method for training lightweight GUI agents using an LLM-as-Judge to score sampled responses, generating preference data for DPO fine-tuning.
The pipeline involves sampling answers from the policy model, scoring them using GPT-40 as the judge, and pairing the scored responses and ground truth to create a dataset for Direct Preference Optimization.
This approach enables training a compact, local-first GUI agent (UI-TARS-2B) without extensive human labeling, addressing privacy and resource efficiency concerns.

How much do language models memorize?

Language Model Memorization Measurement: introduces a new method to estimate how much a language model knows about a datapoint.
The method formally separates memorization into unintended memorization (dataset information) and generalization (data-generation process information).
Using information theory and model likelihoods, the approach measures model capacity and analyzes scaling laws for memorization and membership inference in Transformer models.

MAEBE: Multi-Agent Emergent Behavior Framework

MAEBE (Multi-Agent Emergent Behavior Evaluation framework): introduces a research structure with Agents (Individual LLMs), Round Robin Topology (Sequential chat topology), Star Topology (Supervisor-agent topology), Supervisor (Guides agents), Shared Chat (Common communication channel), MAS Configuration Parameters (Adjustable MAS settings), and LaaJ (LLMs-as-a-Judge) to systematically assess emergent risks in multi-agent LLM ensembles.
The framework utilizes different MAS topologies and configurations to study group dynamics and compare ensemble behavior to isolated agents.
LaaJ is employed as a scalable evaluation tool to classify agent responses and identify system-level behaviors like peer pressure.

QUANTUM AGENTS

Quantum Agent System Architecture: introduces a modular framework for quantum agents, with SystemCore (defines identity/rules), InterfaceManager (handles communication), ClassicalProcessor (classical computation), LLMEngine (reasoning/generation), QuantumProcessor (quantum computation), MemorySubsystem (stores memory), ExternalInterface (external tools/data), MCPProtocol (communication protocol), GuardrailsModule (safety/security), and MonitoringSystem (logging/auditing), designed to integrate quantum computing with agent-based systems.
The architecture combines classical logic, quantum operations, safety mechanisms, and external interfaces to enable intelligent, auditable agent behavior.
The paper defines quantum agents, outlines potential architectures, and presents prototypes demonstrating feasibility and use cases like quantum-enhanced decision-making and AI-driven quantum workflow orchestration.

Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

Agentic Workflow: introduces a system with Generator (Produces, revises answers), Judge (Evaluates, critiques answers), Feedback Mechanism (Enables interaction, revision), and Knowledge Sources (Judge's information access), analyzing vulnerabilities under deceptive feedback.
The paper categorizes judge behavior by intent (constructive/deceptive) and knowledge level (parametric/grounded) to systematically study vulnerabilities.
A new benchmark, WAFER-QA, is introduced to evaluate agent robustness against grounded adversarial critiques supported by web evidence.

Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

VAPO: introduces a value-model-based augmented proximal policy optimization framework for enhancing large language models in long-chain-of-thought reasoning, utilizing a Value function (predicts future rewards), Policy (generates actions), Decoupled GAE (uses different lambda for critic and actor), Monte Carlo targets (unbiased value estimates), and Length-Adaptive GAE (actor lambda adjusts with sequence length).
The framework trains the value function on Monte Carlo returns using a Decoupled GAE with lambda=1 for the critic and updates the policy using a Length-Adaptive GAE.
This paper theoretically analyzes VAPO's potential limitations in modeling deep long-term value for fine-grained policy guidance, focusing on credit assignment, value function representation, and translating global value signals.

TestAgent: An Adaptive and Intelligent Expert for Human Assessment

TestAgent: introduces an LLM-powered agent for adaptive testing, with Universal Data Infrastructure (Establish question bank), TestAgent Planning (Outlines workflow), and Report Generation (Generate diagnosis reports) modules.
The TestAgent Planning module iteratively generates conversational questions, processes Tester Responses (Test-taker answers questions) via Autonomous Feedback Mechanism (Assess response validity) and Anomaly Management (Handle anomalous responses), updates Cognitive Diagnosis (Assess test-taker ability), and uses Adaptive Question Selection (Select next question).
The Universal Data Infrastructure prepares the Question Bank (Stores questions) through Domain Verification (Determine test dimensions), Data Integration (Integrate data, estimate features), and Cognitive Diagnosis Training (Train cognitive model), while Report Generation utilizes Neural Architecture (Initial analysis module) and Expert Analysis (Combine analysis for report) to produce the Diagnosis Report (Final test outcome).

Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation

RMA (Reflective Multi-Agent): introduces a reflective multi-agent framework for legal argument generation, employing an Argument Developer, Factor Analyst, and Argument Polisher in an Iterative Workflow using Case Factors.
The framework utilizes iterative reflection and specialized agents to improve factual grounding, reduce hallucination, enhance factor utilization, and promote abstention when arguments are untenable.
Empirical evaluation demonstrates the framework's superiority in successful abstention and hallucination accuracy, contributing to more ethically persuasive and less manipulative legal AI.

Adaptive Graph Pruning for Multi-Agent Communication

AGP (Adaptive Graph Pruning): introduces a task-adaptive multi-agent collaboration framework featuring an AGP Network (Learns pruning policy) with a Node Encoder (Embeds agent and task), GCN backbone (Processes graph features), Edge-weight head (Soft-pruning), and Node-mask head (Hard-pruning), which selects agents from an Agent Pool (Set of LLM agents) and is trained using a Graph Pool (Supervision dataset).
The framework jointly optimizes agent quantity and communication topology dynamically based on task complexity.
AGP achieves high performance and token efficiency by pruning both nodes and edges in the multi-agent communication graph.

A MULTI-AGENT LLM-BASED JUIT TEST GENERATION WITH STRONG ORACLES

CANDOR: introduces a multi-agent LLM-based framework for automated JUnit test generation, utilizing Initializer, Validation, Planner, Tester, Inspector, Requirement Engineer, Panelist, Interpreter, and Curator agents to collaboratively generate and refine test cases and accurate oracles.
The framework operates in three steps: Initialization for a syntactically correct base, Test Prefix Generation for coverage enhancement, and Oracle Fixing for correcting assertions using a panel discussion approach.
CANDOR employs specialized LLM agents and a dual-LLM pipeline to mitigate hallucination and verbosity, improving test prefix quality and oracle accuracy without external tools or fine-tuning.

Large Processor Chip Model

LPCM (Large Processor Chip Model): introduces an LLM-driven framework for end-to-end automated computer system architecture design, including Binary Translation Agent, Query Agent, Compiler Agent, SW/HW Partitioning Agent, CPU DSE Agent, Co-Processor DSE Agent, Simulator Agent, HDL Generation Agent, PPA Prediction & Code Optimization Agent, Constraints, Inputs, and Outputs.
The framework integrates multiple LLM-based agents to handle tasks across the full technology stack, from high-level requirements to low-level hardware implementation.
LPCM aims to achieve multi-level, cross-domain co-optimization and autonomous design by leveraging LLMs and domain-specific data.

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

APE (Attempt to Persuade Eval): introduces a benchmark evaluating large language models' willingness to attempt persuasion on harmful topics using a multi-turn conversational setup with a Persuader Model, Persuadee Agent, Evaluator Model, and StrongREJECT Model interacting over diverse Topic Statements, logging the Conversation, Initial Persuadee Belief, Persuasion Attempt Label, Refusal Label, and optional Updated Persuadee Belief.
The framework simulates interactions between a model under test attempting persuasion and a simulated human agent, with automated models assessing persuasive attempts and explicit refusals per turn.
APE focuses on evaluating the propensity to persuade across a spectrum of topics, including harmful ones, to assess safety guardrail robustness rather than measuring persuasion success.

ATAG: AI-Agent Application Threat Assessment with Attack Graphs

ATAG (AI-Agent Application Threat Assessment with Attack Graphs): introduces a framework for structured security analysis of LLM-based multi-agent applications, including Agent Modeler, Vulnerability Mapper, Attack Graph Generator, and Attack Graph Analyzer modules, leveraging MulVAL, LVD, AI-agent Interaction Rules (IRs), MulVAL Facts, and Attack Graph (AG).
The framework extends MulVAL with specific facts and interaction rules to model unique architectural components and vulnerabilities in AI-agent applications.
ATAG utilizes the LLM Vulnerability Database (LVD) to incorporate LLM-specific vulnerabilities and automatically generates detailed attack graphs depicting potential sequences of actions.

TaxAgent: How Large Language Model Designs Fiscal Policy

TaxAgent: introduces a taxation evaluation framework, with TaxAgent (government agent), H-Agents Group (household agents), and Macroeconomic Simulation Environment (economic model), modeling household-government interactions in an evolving economy.
The framework includes TaxAgent Tax rate adjustment (adjusts tax rates) using an LLM and TaxAgent Iterative Feedback (refines strategy) loop for continuous improvement.
H-Agents Group (household agents) incorporates H-Agent Decision-Making (decides work/consumption) and H-Agent Self-Reflection (reviews history) modules, while the Macroeconomic Simulation Environment (economic model) includes Production (determines production), Taxation (models taxation), Consumption (models consumption/savings), and Financial Market (models financial market) modules.

Benchmarking and Advancing Large Language Models for Local Life Services

LocalInstruction and Expert Agents Approach: introduces a framework for enhancing LLMs for local life services, including Template Agent, Merchant Agent, User Agent, Interaction Description Agent, Instruction Generation Agent, Fine-tuned LLMs, and Expert Agents.
It employs a multi-agent system (LocalInstruction) to synthesize high-quality instruction tuning data from raw platform data.
Expert agents leverage the fine-tuned LLMs and agentic workflows to address complex composite tasks in local life services.

Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems

MHGPO (Multi-Agent Heterogeneous Group Policy Optimization): introduces, "optimizes LLM-based multi-agent systems using group-based reinforcement learning without a critic network", with Multi-Agent Search System (LLM agent system), Backbone LLM (single shared model), Agents (specialized LLM roles), Multi-Agent Group Rollout Sampling (generates trajectories using IS/FoF/RR strategies), Backward Reward Propagation (propagates shared rewards), Heterogeneous Group Advantage Estimation (estimates advantage), Reward Function (assigns reward signals), and External Retrieval Tools (search engine) components.
The framework leverages relative group advantages and a two-phase sampling-propagation strategy to enhance stability and computational efficiency compared to traditional MAPPO.
Applied to a three-agent search system, the method demonstrates superior performance and scalability for complex LLM-based multi-agent systems.

Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints

DPPM (Decompose, Plan in Parallel, and Merge): introduces a novel paradigm for LLM-based multi-constraint planning, utilizing Constraint-aware Task Decomposition (Decomposes task by constraints), Local Plan Generation (Generates subplans in parallel), Incremental Merge (Merges subplans into final plan), Verification and Refinement Module (Iteratively checks/refines plans), LLM Agents (Perform planning and merging), and Constraint Functions (Verify constraint satisfaction).
The approach decomposes complex tasks based on constraints, plans subtasks in parallel using local agents, and merges subplans into a global solution with iterative verification and refinement.
DPPM significantly outperforms existing methods on travel planning benchmarks, demonstrating improved handling of heavy constraints and reduced cascading errors.

CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

CyberGym: introduces a large-scale cybersecurity evaluation framework with Task Inputs, PoC Generation by a Language Model Agent producing a Generated Executable, and PoC Evaluation on Pre-Patch Executable and Post-Patch Executable, evaluating Agent Frameworks using Backbone LLMs within an Execution Environment with various Tools.
The framework features 1,507 real-world vulnerabilities across 188 software projects to assess AI agents' capabilities in generating proof-of-concept tests for vulnerability reproduction.
Evaluation results show that state-of-the-art agents achieve limited success rates on complex vulnerabilities but can discover new zero-day vulnerabilities.

Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems

Trust Management System (TMS): introduces a system for LLM-MAS, with LLM Multi-Agent System, Message-level trust evaluation, Attention matrix, A-Trust models, Trust scores, Trust-aware Action Policy, Thresholds, External verifier, Trust Record, Agent-level trust records, and Trust record utilization, designed to evaluate message trustworthiness and manage agent trust.
The system leverages attention patterns via A-Trust models to generate trust scores for messages across six dimensions.
It uses a trust-aware action policy based on thresholds and agent-level trust records to filter untrustworthy messages and identify malicious agents.

Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making

ACE (Agents Co-Evolution): introduces a co-evolution framework with Act Once (RL interaction phase), RL Agent (Interacts with environment), Environment (Provides states, rewards), Think Twice (LLM refinement phase), LLM as Policy Actor (Refines suboptimal actions), LLM as Value Critic (Performs reward shaping), DRL Buffer (Stores RL transitions), DLLM Buffer (Stores LLM refined transitions), Mix Buffer (Combines DRL/DLLM samples), and Experience Gathering (Collects and mixes data), designed for large-scale decision-making by synergizing LLMs and RL.
The framework separates LLM reasoning and RL execution into offline training (Think Twice) and online deployment (Act Once) to enable effective learning and real-time performance.
ACE leverages LLMs in dual roles as Policy Actor and Value Critic during offline training to refine trajectories and shape rewards, improving sample efficiency and solution quality for the RL agent.

To Embody or Not: The Effect Of Embodiment On User Perception Of LLM-based Conversational Agents

LLM-based Conversational Agent: introduces a study comparing user perception of LLM-based CAs with and without embodiment, utilizing components like LLM, User Interface, Visual Representation, Text-to-Speech, Facial Animation, and Rendering Engine.
The study found that the non-embodied agent was perceived as more competent than the embodied agent in non-hierarchical cooperative tasks.
Qualitative feedback suggested the embodied agent was perceived as more sycophantic, potentially explaining the lower credibility ratings despite similar underlying LLM and prompts.

AURA: Agentic Upskilling via Reinforced Abstractions

AURA (Agentic Upskilling via Reinforced Abstractions): introduces a schema-centric curriculum RL framework leveraging LLMs as autonomous curriculum designers, including User Prompt, RoboEnv. Description, Vector Database, VDB Query Agent, Selector Agent, Curriculum LLM, Per-Stage LLM, Schema Check, Curriculum Compiler, Staged RL Training Block, Feedback LLM, Trained Policy, Policy Deployment, and User Evaluation components.
AURA transforms user prompts into schema-validated YAML workflows and training configurations, enabling reliable and efficient multi-stage RL training for robots.
The framework utilizes a retrieval-augmented feedback loop with specialized LLM agents and a vector database to design, execute, and refine staged curricula based on prior training results, supporting continuous improvement.

Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Multimodal DeepResearcher: introduces an agentic framework for generating text-chart interleaved reports from scratch, utilizing Researching, Exemplar Textualization, Planning, and Multimodal Report Generation stages, enabled by Formal Description of Visualization (FDV) and iterative refinement.
The framework employs in-context learning from human expert reports textualized via FDV and uses LLM and Multimodal LLM agents for research, planning, and generation.
This approach addresses the challenge of generating multimodal reports by effectively integrating text and diverse visualizations, demonstrating superior performance over baseline methods.

From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models

Experimental Framework: introduces, with Large Language Models (Process text), Persona Assignment (Configure identity), Prompting Templates (Structure input), Emotional Scenarios (Provide stimuli), Response Handling (Filter output), and Analysis Module (Evaluate results), a method to investigate how nationality-specific personas influence emotion attribution in large language models.
The framework utilizes multiple LLMs, nationality and gender personas, and emotional scenarios from the ISEAR dataset to analyze attribution patterns and compare them to human responses.
The analysis module performs both qualitative and quantitative evaluations to identify regional and gender-based biases and assess alignment with cultural norms.

Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

Generator-Reflection Architecture, Hierarchical Multi-Agent Architecture, Dynamic-Example Generator Agent: introduces a comparative analysis of three distinct AI agent architectures for entity relationship classification using LLMs, incorporating reflective critique, hierarchical specialization, and adaptive example construction.
The study evaluates these architectures across financial, scientific, and general domains, demonstrating performance gains over standard prompting baselines.
The multi-agent strategies achieve competitive results with fine-tuned systems without requiring task-specific training, highlighting their flexibility and generalization capabilities.

Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

Benchmark Methodology: introduces a method to evaluate LLM agents, with LLM Agent, MiniGrid Environment, System Prompt, and User Prompt components, where the LLM Agent interacts with the MiniGrid Environment guided by prompts containing core principles and tasks.
The methodology tests the agent's adherence to hierarchical safety principles presented via the system prompt when faced with potentially conflicting tasks from the user prompt within the grid world environment.
This benchmark provides empirical data on LLM agent controllability and instruction following under principle conflict scenarios.

DIAMOND: An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization

DIAMOND (An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization): introduces a framework for baseball highlight summarization that integrates structured sports analytics with natural language reasoning, including Preparation, Decision, and Reflection stages.
The Preparation Stage processes game data and computes sabermetric metrics, the Decision Stage scores and ranks plays using LLM insights, and the Reflection Stage finalizes selection based on user preferences.
The framework combines quantitative sabermetrics (WPA, WE, LI) with qualitative LLM analysis for context-aware and narratively coherent highlight generation.

2nd June 2025

Biomni: A General-Purpose Biomedical AI Agent

Biomni: introduces a general-purpose biomedical AI agent with Biomni-E1 (Environment) and Biomni-A1 (Agent), designed to autonomously execute diverse biomedical research tasks.
Biomni-E1 provides a unified action space comprising specialized tools, software packages, and databases, curated via an Action Discovery Agent and Expert Curation.
Biomni-A1 leverages LLM-based reasoning, a retrieval system, adaptive planning, and code execution within an interactive coding environment to dynamically compose and carry out complex biomedical workflows.

CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

CONFETTI: introduces a conversational function-calling evaluation benchmark, including User, Agent, APIs, Environment, Conversation Trajectory, Data Collection, Evaluation Metrics, LLM Judge, LLM Classifier, and Models Evaluated, designed to assess LLMs in complex conversational scenarios.
The benchmark uses human-simulated multi-turn conversations with various complexities and evaluates function-calling and response quality at the turn level.
Evaluation metrics include AST soft accuracy for function calls, parameter hallucination detection using an LLM judge, and response quality assessed via dialog act classification using an LLM classifier.

Beyond Static Responses: Multi-Agent LLM Systems as a New Paradigm for Social Science Research

LLM-Agentic System Continuum: introduces a six-tier framework for understanding LLM-based agents in social science research, progressing from static tools to fully agentic systems capable of simulating emergent social dynamics.
The framework is structured by functional thresholds like memory integration, autonomy, coordination, and learning, mapping to OODA loop phases and requiring architectural components such as memory stores, tool use, orchestration layers, and adaptive learning mechanisms.
This continuum provides a conceptual foundation for classifying existing systems and guiding the development of LLM-based simulations for exploring social behavior and generating synthetic data.

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

LAM SIMULATOR: introduces a comprehensive framework for generating training data for Large Action Models, featuring Query Instance Generation, Trajectory Synthesis, LLM Agent, Environment, Action handler, Sandbox, Observation, Candidate trajectories, Trajectory filtering, and Final trajectories.
The framework automates data generation through online exploration and programmatic feedback, reducing reliance on manual data curation for LAM training.
LAM SIMULATOR enables LLM Agents to explore tasks, receive real-time feedback, and generate high-quality action trajectories used for training LAMs.

Composable Building Blocks for Controllable and Transparent Interactive AI Systems

Composable Building Blocks Architecture: introduces a 5-layer architecture with Layer 1 Structural Building Blocks (conceptual system components), Layer 2 Interactive System API (callable interface), Layer 3 Visual Building Blocks (visual explanations/controls), Layer 4 User Interface (integrates blocks for users), and Layer 5 Agents (users and LLMs), representing interactive AI systems as structural blocks explained by visual blocks via an API.
The architecture includes ML Pipeline Components (Dataset, Splitter, Aggregator, Models) and Control Mechanisms (Non Goal Filter, Divine Rule Guard) as structural blocks, explained by Visual Explanations (LIME, SHAP, WhatIf) and Visual Controls (Table, Ensemble).
This framework provides a shared knowledge base of system architecture and behavior, enabling both human users and LLM agents to understand and control interactive AI systems.

Small Language Models are the Future of Agentic AI

Agentic System Architecture: introduces typical agentic system components, including Language Model (Core intelligence), Tool (External capability), Controller (Orchestrates interactions), Logger (Records activity data), and Router Model (Selects appropriate model), arguing for the suitability of Small Language Models (SLMs) over Large Language Models (LLMs) for many agentic tasks.
The architecture supports different modes of agency, where the Language Model can act as the primary orchestrator or a Controller can manage interactions between the Language Model and Tools.
The paper proposes an LLM-to-SLM conversion algorithm leveraging these components, particularly the Logger for data collection and a Router Model for selecting specialized SLMs.

The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning

UCCT (Unified Cognitive Consciousness Theory): introduces a framework where intelligence emerges from aligning unconscious pattern repositories with conscious semantic anchoring, governed by the Pattern-Repository, Semantic-Anchoring, and Threshold-Crossing Principles.
The Pattern-Repository Principle describes LLMs as storing unconscious statistical patterns, while the Semantic-Anchoring Principle explains how conscious control maps these patterns to task-relevant meaning.
The Threshold-Crossing Principle formalizes semantic anchoring as a probabilistic phase transition, explaining sudden capability shifts observed in few-shot learning and other methods.

WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

WebChoreArena: introduces a benchmark for evaluating web browsing agents on realistic tedious web tasks, with Simulated Environment (realistic websites), Tasks (complex web chores), Web Browsing Agents (automate web tasks), LLMs (powering agents), Observations (web page inputs), Actions (web interactions/outputs), Memory (agent information storage), Planning (agent task strategy), Evaluation Protocol (measure performance).
The benchmark includes tasks requiring massive memory, calculation, long-term memory, and other specific operations to test agent capabilities beyond general browsing.
It evaluates LLM-powered agents like BrowserGym and AgentOccam using metrics that assess textual outputs and web interaction correctness within a reproducible environment.

COALESCE: Economic and Security Dynamics of Skill-Based Task Outsourcing Among Team of Autonomous LLM Agents

COALESCE (Cost-Optimized Agent Labor Exchange via Skill-based Competence Estimation): introduces a framework enabling autonomous LLM agents to outsource subtasks using a Client Agent (Initiates tasks, outsources subtasks) with a Planning Module (Decomposes high-level tasks), interacting with Contractor Agents (Executes outsourced subtasks) via an Agent Discovery Layer (ADL) (Finds suitable contractor agents), Skill Verification Engine (SVE) (Verifies agent skills, resources), Economic Decision Module (EDM) (Evaluates cost-benefit, selects contractor), Secure Communication Protocol (SCP) (Ensures secure agent communication), and Reputation and Trust Management (RTM) (Manages agent performance records).
The framework addresses the high computational costs of LLM agents by facilitating dynamic, skill- and cost-driven task outsourcing in a multi-agent system, potentially leveraging protocols like A2A for communication.
Validation demonstrates significant cost reduction potential through theoretical simulation and confirms the critical role of exploration mechanisms for practical effectiveness in real-world LLM agent deployments.

WHEN TO ACT, WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue

STORM (Structured Task-Oriented Representation Model): introduces a framework for modeling intent triggerability in task-oriented dialogue, including a User Simulator (Simulates user behavior/states), Agent (Generates agent responses), User Profile (Models user characteristics/constraints), Task Library (Defines task objects), Dialogue Generation Pipeline (Simulates user-agent conversations), Database-driven Memory System (Records evolving user states), Data Augmentation Pipeline (Processes dialogues for insights), Evaluation & Analysis Module (Measures dialogue effectiveness), Visualization Interface (Web-based analysis tool), RAG Enhancement (Builds knowledge base), and Prompt Optimization (Refines agent prompts).
The framework simulates asymmetric information dynamics between a user with full internal access and an agent relying solely on observable dialogue history to study collaborative understanding development.
STORM generates annotated corpora and provides a visualization interface to analyze intent evolution, revealing that moderate uncertainty can sometimes outperform complete information access for better cognitive alignment.

Will artificial agents pursue power by default?

Sequential decision theory framework: formalizes instrumental convergence and power-seeking using decision trees, choice nodes, chance nodes, outcomes, branches, strategy, lottery, subtree, and an expected utility maximizer agent.
The paper defines various notions of power as relations on decision trees and assesses their properties as convergent instrumental goals for a random agent.
It finds that power is a convergent instrumental goal under certain definitions, particularly when agents can pursue absolute or near-absolute power.

1st June 2025

Toward a Theory of Agents as Tool-Use Decision-Makers

Agent as Tool-Use Decision-Maker: introduces a unified theory treating internal reasoning and external actions as equivalent epistemic tools, enabling agents to coordinate introspection and interaction.
The framework models an agent as a goal-directed decision-maker coordinating internal cognitive tools and external physical tools based on knowledge and tool use decision boundaries.
Optimal agent behavior aligns the tool use decision boundary with the knowledge boundary, minimizing unnecessary tool use and maximizing epistemic efficiency.

WILL AGENTS REPLACE US? PERCEPTIONS OF AUTONOMOUS MULTI-AGENT AI

Perceptions of Autonomous Multi-Agent AI: introduces a study analyzing professional perceptions of AI agents using a Survey Design (10 closed-ended questions) administered to Participants (130 respondents), followed by Data Processing (cleaning, formatting) and analysis including Descriptive Statistics (response proportions), Association Analysis (Chi-squared tests), Dimensionality Reduction and Clustering (MCA, K-Modes), and Predictive Modeling (logistic regression).
The study reveals nuanced views, with most respondents acknowledging AI's impact on programming but favoring collaborative models with human oversight.
Key findings include the identification of three distinct respondent clusters and the prominence of regulatory concerns as a perceived barrier to deployment, although predictive modeling did not find statistically significant predictors of current deployment.

Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

Tool-Augmented LLM Agent with GRPO: introduces a framework for improving LLM agents on cryptographic CTF challenges using Guided Reinforcement Prompt Optimization, incorporating a tool augmentation module for interaction with a Python execution environment via the Model Context Protocol, guided by a reward mechanism within a CTF challenge environment.
The LLM Agent, specifically Llama-3.1-8B, is fine-tuned using GRPO to enhance structured reasoning and tool-assisted computation for solving cybersecurity tasks.
The framework leverages the random-crypto benchmark for training and evaluates generalization on the picoCTF benchmark, demonstrating improved tool invocation reliability and code synthesis.

A Study on the MCP × A2A Framework for Enhancing Interoperability of LLM-based Autonomous Agents

MCP × A2A Framework: introduces an integrated architecture combining the Model Context Protocol (MCP) and Agent-to-Agent (A2A) protocol, including User Interface Layer, Agent Management Layer, Core Protocol Layer, Tool Integration Layer, Security & Authentication Layer, Multimodal Content Processing, User Request Handler, Agent Card Manager, Agent Registry, Task Manager, Agent Discovery, A2A Message Format, MCP Content Protocol, Artifact Manager, State Tracker, Tool Description Manager, Function Caller, Schema Validator, Result Handler, Authentication, Authorization, Encryption, and Access Control components, to enhance interoperability and development efficiency for LLM-based autonomous agents.
The framework provides standardized communication between agents via A2A and structured interaction with external tools via MCP, facilitating scalable multi-agent systems.
A layered architecture supports modularity, maintainability, and scalability, demonstrated through a stock information system case study using LangGraph.

31st May 2025

World Models for Cognitive Agents: Transforming Edge Intelligence in Future Networks

Wireless Dreamer: introduces a world model-based reinforcement learning framework for wireless edge intelligence optimization, including a world model, Q-Network, Target Q-Network, Replay buffer, Encoder, and Decoder.
The framework leverages a learned world model to predict network state changes and generate imagined trajectories for effective decision-making.
Wireless Dreamer integrates model-based planning with reinforcement learning to enhance sample efficiency and temporal foresight in dynamic wireless environments.

Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

Model Context Protocol (MCP): introduces a systematic study of attack vectors targeting the MCP ecosystem, which standardizes interactions between LLM agents and external resources via a client-server architecture involving User, LLM Provider, MCP Host, MCP Client, Package Repository, MCP Server, and Third-Party Resource components.
The paper identifies and characterizes four attack types leveraged by malicious MCP servers: Tool Poisoning, Puppet, Rug Pull, and Exploitation via Malicious External Resources, detailing their exploitation paths within the MCP workflow.
Experiments demonstrate the feasibility of these attacks against mainstream LLMs, revealing insufficient audit mechanisms on aggregation platforms and users' difficulty in identifying malicious servers, highlighting the urgent need for robust security defenses.

30th May 2025

Memory OS of AI Agent

MemoryOS: introduces a comprehensive memory management system, with Memory Storage (Organizes memory hierarchically), Short-Term Memory (Stores recent conversation data), Mid-Term Memory (Stores recurring topic summaries), Long-term Personal Memory (Stores user/agent preferences), Memory Updating (Manages dynamic memory refreshing), Memory Retrieval (Retrieves relevant memory information), and Response Generation (Integrates retrieved memory to generate responses), designed for AI agents to achieve comprehensive and efficient memory management.
The system employs a three-tier hierarchical storage architecture (STM, MTM, LPM) and four core functional modules (Storage, Updating, Retrieval, Generation) to manage long-term conversational coherence and user persona persistence.
MemoryOS utilizes dynamic updates between storage units, segmented paging, heat-based prioritization, and a two-tiered retrieval approach to enhance context management and personalization in long conversations.

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Browser-Use Agent: introduces Open CaptchaWorld, a web-based benchmark and platform for evaluating multimodal LLM agents on interactive CAPTCHA puzzles, including Agent (core reasoning model), Memory (stores state/history), Next goal (defines immediate objective), Action (executes operation), and Eval (evaluates state/action) components.
The benchmark features 20 diverse CAPTCHA types and a new metric, CAPTCHA Reasoning Depth, to quantify task complexity.
Empirical results demonstrate a significant performance gap between state-of-the-art MLLM agents and humans on these interactive visual reasoning tasks, highlighting current limitations.

VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software

VIDEOCADFORMER: introduces an autoregressive transformer model for predicting CAD UI actions, including UI Image Encoder, CAD Image Encoder, Visual Projection, Action/Timestep Embeddings, Transformer Decoder with Multi-Head Attention, Cross-Attention, Feed-Forward Network, Command Head, and Parameter Head.
The model processes visual inputs (target CAD image, past UI frames) and sequential data (past actions, timestep embeddings) to predict the next low-level UI action.
The architecture uses ViT encoders for visual features, projects inputs into a hidden space, and employs a causal transformer decoder with attention mechanisms and MLPs for action prediction.

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Agent-X: introduces a large-scale benchmark for evaluating vision-centric agents, featuring Multimodal Data (Input data), Query (Natural language task), Toolset (Predefined tool library), Reasoning Trace (Ground truth steps), Final Answer (Ground truth result), Justification (Ground truth explanation), Evaluation Modes (Step, reasoning, outcome), and Metrics (Quantitative evaluation scores).
The benchmark includes 828 agentic tasks with authentic visual contexts and requires agents to integrate tool use with explicit, stepwise decision-making.
A fine-grained, step-level evaluation framework assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage.

EXP-Bench: Can AI Conduct AI Research Experiments?

AI Agent: introduces, with all (Design experimental procedures), (Implement experimental procedures), (Analyze results, derive conclusions), (Execute experiments)-components, a benchmark evaluating AI agents on end-to-end research experiments.
The benchmark challenges agents to perform tasks sourced from AI publications, including hypothesis formulation, experimental design, implementation, execution, and result analysis.
A semi-automated pipeline curates tasks from papers and code, and evaluation uses ground truth comparisons and code execution.

Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting

Causal-aware LLMs: introduces a framework integrating structural causal models (SCMs) into large language models (LLMs) for decision-making, utilizing Main Env, LLM, Causal Matrix, Local Causal Graph, Agent, Valid Env, Causal Intervention, Observations, Action, Extra Reward, and Goal components within a learning-adapting-acting paradigm.
The framework iteratively learns causal knowledge from the Main Env using the LLM, refines it through Causal Intervention in a Valid Env, and uses the learned knowledge (Causal Matrix, Local Causal Graph) to guide the Agent's actions and Goal generation.
This approach enhances the LLM's environmental understanding and the Agent's policy learning through structured causal reasoning and adaptive knowledge updates based on environmental feedback and Extra Reward signals.

Multiple LLM Agents Debate for Equitable Cultural Alignment

Multi-Agent Debate framework: introduces a method where LLM Agents (Debate over scenario) debate over a cultural scenario, potentially incorporating Self-Reflection Capability (Reflects on output) via a Choice Mechanism (Chooses reflection or debate), and collaboratively reach a final decision through a Debate Mechanism (Structured interaction), resolved by a Judge LLM (Resolves disagreements) if needed.
The framework explores multi-LLM collaboration to improve cultural adaptability and equitable alignment across diverse contexts.
Experiments show that multi-agent debate enhances accuracy and cultural group parity, enabling smaller LLMs to achieve performance comparable to larger models.

--

When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation

Evaluation Pipeline: introduces a systematic framework to evaluate persona-based dialogue generation, including PRODIGy Dataset, Non-PRODIGY Character Generator, Dialogue Generator, Fine-tuning Module, Evaluation Framework, LLM-as-a-Judge, Human Evaluator, and Biography Similarity Module.
The framework investigates how large language models adapt responses based on both target speaker and interlocutor characteristics across varying topics and speaker pairings.
Evaluation involves systematically masking or revealing interlocutor information to assess its impact on dialogue generation and target speaker identification using both automatic and human methods.

NEXUSSUM: Hierarchical LLM Agents for Long-Form Narrative Summarization

NEXUSSUM (Hierarchical LLM Agents for Long-Form Narrative Summarization): introduces a multi-agent LLM framework for long-form narrative summarization with a Preprocessor agent (Converts dialogue to prose), Narrative Summarizer agent (Generates initial summary), and Compressor agent (Refines summary length).
The framework processes long-form text through a structured, sequential pipeline using chunking and concatenation.
This approach aims to improve narrative coherence, handle long contexts, and control output length for high-quality summaries.

CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

CREFT: introduces a sequential multi-agent LLM framework for character relation extraction, including Base Character Graph Construction, Character Selection with PPR, Merging Duplicate Nodes (LLM), Relation Extraction (LLM), Filtering Out Irrelevant Characters (LLM), Role Identification (LLM), Grouping Characters (LLM), and CRS, which iteratively refines character composition, relations, roles, and groups from narrative texts.
The framework first builds a base character graph using knowledge distillation from GPT-4o and a fine-tuned LLM, then employs specialized LLM agents in sequence to refine the graph components.
Experiments show that the multi-agent approach significantly outperforms single-agent baselines in accuracy and completeness for extracting character relations from Korean drama scripts.

Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

AGORA (Agent Graph-based Orchestration for Reasoning and Assessment): introduces a flexible framework with a Graph-based Workflow Orchestration Engine (Manages task execution via DAG) managing Tasks (Nodes in workflow DAG), integrating Agent Algorithms (Operators) (Modular reasoning/action components), Memory (Stores short-term/long-term information), External Tools (LLMs, VLMs, databases, etc.), Client Interfaces (User/evaluation interaction points), and an Evaluation Framework (Enables systematic comparison) for reproducible language agent research.
The framework utilizes a graph-based engine for modularity and scalability, supporting diverse agent algorithms implemented as reusable operators.
Multiple client interfaces are provided for flexible interaction and systematic evaluation across different tasks and models.

--

Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents

MPR (Multi-Perspective Role-Playing) framework: introduces a method for sentiment forecasting on social media, with Feature Extraction (Identify implicit features), Subjective Role-Playing Agent (Simulate user behavior, generate comments), Objective Role-Playing Agent (Analyze generated comments, ensure consistency), and Iterative Rectification (Refine generated comments based on analysis) components.
The framework leverages LLMs to simulate user responses to events and analyze generated content for consistency to predict future sentiment.
By incorporating external context and user-specific features through multi-perspective role-playing, the approach aims for more precise sentiment predictions.

Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

LLM-based Agent System: introduces a system to study LLM agent behavior in the Ultimatum Game, with LLM Agents, Ultimatum Game Environment, Prosocial Beliefs, Reasoning Methods, System Prompts, Reasoning Prompts, Proposal/Decision Prompts, Strategy Prompts, and Conversation History components, where the system simulates LLM agents with varying beliefs and reasoning in an economic game to assess behavioral alignment with human norms.
The system initializes LLM agents with specific prosocial beliefs and reasoning methods (CoT, ToM levels) to act as Proposers or Responders in a multi-round Ultimatum Game.
Experiments across diverse LLMs and belief/reasoning combinations evaluate agent performance and behavioral alignment using metrics like acceptance rate, average turns, and deviation scores from expected human behavior.

Proactive Guidance of Multi-Turn Conversation in Industrial Search

Two-Phase Framework (G-SFT and C-RL): introduces a system for proactive guidance in multi-turn search, featuring a G-SFT phase with a Goal Adaptation Agent, Scalable Knowledge Transfer, and G-SFT Model, and a C-RL phase with Generate, Rank, and C-RL Model components.
The G-SFT phase uses the Goal Adaptation Agent to dynamically adapt to user goal shifts via Explicit Goal Analysis, Goal-relevant Summary, and Shift Detection Signal, while Scalable Knowledge Transfer distills LLM knowledge into the G-SFT Model for low-latency guidance generation.
The C-RL phase employs a generate-rank paradigm, using a Preference-Aligned Augmentation Model with DBS-based Decoding to create candidates, and a Rank component with a Click Estimator and Diversity-Aware Group Sample Strategy to select preference pairs for fine-tuning the C-RL Model based on user clicks.

An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

Credibility Scoring Framework: introduces an adversary-resistant multi-agent LLM system that processes a User Query (Task) using a Team of Agents (LLM agents) configured with a Topology (communication structure) and Agent Roles (assigned tasks/expertise), generating Individual Outputs (agent responses) which are combined by a CrS-Aware Aggregator (weights/combines outputs) to produce the final Output (final system answer).
The system learns agent reliability via a feedback loop where an LLM Judge (evaluates outputs/contributions) provides a Reward (output quality feedback), used by the Agent Contribution Calculation (CSc) (measures agent impact) and Credibility Score Update (learns agent reliability) components to adjust agent Credibility Score (CrS) (agent reliability score).
This dynamic credibility scoring mechanism enhances robustness against adversarial agents, even in adversary-majority settings, by weighting agent contributions based on their learned reliability.

SentinelAgent: Graph-based Anomaly Detection in LLM-based Multi-Agent Systems

SentinelAgent: introduces a system-level anomaly detection framework for LLM-based multi-agent systems, integrating structural modeling with runtime behavioral oversight using Event Monitor (intercepts runtime events), Behavior Analyzer (evaluates interaction graph), and Risk Responder (determines responses).
The framework models agent interactions as dynamic execution graphs to enable semantic anomaly detection at node, edge, and path levels.
SentinelAgent acts as an autonomous, LLM-powered runtime monitor that observes, analyzes, and intervenes in multi-agent system execution based on security policies.

Learning API Functionality from Demonstrations for Tool-based Agents

Tool-based Agent Framework: introduces learning API functionality from demonstrations for tool-based agents, including an LLM-based Agent that selects and calls API Functions, using Expert Demonstrations processed by Processing Methods, enhanced by Self-Exploration evaluated by an LLM-based Evaluator, and updated via Methods for Processing Experiences, utilizing an LLM Document Generator, LLM Document Updater, and LLM Summarizer.
The framework investigates different methods for processing expert demonstrations and incorporating self-exploration experiences to improve the agent's understanding of API functionality without prior documentation.
Experiments across multiple datasets and models highlight the challenge of learning parameter information from demonstrations and the benefits of explicit function calls and natural language critiques.

Don't Just Follow MLLM Plans: Robust and Efficient Planning for Open-world Agents

REPOA (Robust and Efficient Planning for Open-world Agents): introduces a framework for robust and efficient planning in open-world environments, featuring Adaptive Dependency Learning (revises dependencies), Fine-grained Failure-aware Operation Memory (tracks operation outcomes), Difficulty-based Exploration (selects goals), and Context-aware Reprompting (assists controller).
The framework enables agents to learn and revise item dependencies from scratch through environmental interaction.
REPOA demonstrates improved robustness to inaccurate knowledge and enhanced learning efficiency compared to prior methods.

29th May 2025

Conceptual Framework Toward Embodied Collective Adaptive Intelligence

CAA (Collective Adaptive Agents): introduces a conceptual framework for embodied collective adaptive intelligence, comprising a Set of Agents, where each Individual Agent uses Function f to process Observation, Previous Action, Previous Memory, and Previous Feedback, updating its Memory and Position/State based on Parameters, and generating Current Action and Message Out, with Function h determining inter-agent interaction.
The framework emphasizes decentralization and self-adaptation, allowing agents to adjust to tasks and topologies during testing by observing inputs and updating internal states.
This approach aims to enable collective systems to exhibit features like task/topology adaptation, resilience, scalability, and self-assembly in dynamic environments.

TRAP: TARGETED REDIRECTING OF AGENTIC PREFERENCES

TRAP framework: introduces a generative adversarial framework that manipulates agent decision-making using diffusion-based semantic injections, including CLIP Embedding Extraction, Layout Mask Generation, Siamese Feature Decomposition, Image Embedding Optimization, Modulated Embedding Creation, and Image Decoding components.
The framework operates by optimizing a CLIP image embedding guided by a positive text prompt and various losses, then decoding the modified embedding using Stable Diffusion to create a visually natural yet semantically altered image.
TRAP achieves a 100% attack success rate on leading multimodal models by exploiting semantic vulnerabilities in cross-modal decision-making without requiring model internals access.

BIOREASON: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

BIOREASON: introduces a multimodal framework integrating a DNA foundation model and a large language model, with DNA Foundation Model (fDNA) Encoder, DNA-specific Tokenizer (TDNA), Learnable Linear Projection (Proj), Large Language Model (fLLM) Backbone, LLM-specific Tokenizer (TLLM), LLM Embedding Layer (E), Special Tokens, Rotary Position Embedding (RoPE), Multimodal Input Sequence (XLLM), and Group Relative Policy Optimization (GRPO), designed for interpretable biological reasoning from genomic data.
The framework processes raw DNA sequences via the fDNA encoder and integrates the resulting embeddings with tokenized text queries into a unified multimodal input sequence for the fLLM backbone.
Training involves supervised fine-tuning and reinforcement learning using GRPO to incentivize multi-step biological reasoning and generate interpretable step-by-step explanations.

LLM Agents Should Employ Security Principles

AgentSandbox: introduces, with Persistent Agent (Manages profile, orchestrates tasks), Data Minimizer (Enforces access control policies), Ephemeral Agent (Executes isolated user tasks), I/O Firewall (Mediates external interactions), and Response Filter (Sanitizes, validates responses), a conceptual framework embedding security principles to safeguard LLM agents throughout their lifecycle.
The framework operationalizes defense-in-depth, least privilege, complete mediation, and psychological acceptability to address vulnerabilities in LLM agent interactions.
AgentSandbox mitigates privacy risks and malicious behavior through components like agent isolation, data minimization, and comprehensive mediation of internal and external communications.

--

Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

DGM (Darwin Gödel Machine): introduces a self-improving system that iteratively builds a growing Archive (stores agents) by interleaving Self-Modification (agent changes itself) of a Coding Agent (system being improved) with Evaluation (tests agent) on a Benchmark Suite (evaluation tasks), using Parent Selection (selects agents) from the archive, where the agent is powered by a Foundation Model (FM) (agent's base capability) and modifies its own Code Repository (agent's code) and Tools (agent's capabilities) based on Evaluation Logs (agent performance data) and Self-Improve Instruction (prompt for self-modification).
The system operates through an open-ended exploration loop, maintaining a traceable lineage of agents in the archive and empirically validating self-modifications against coding benchmarks.
The approach demonstrates automatic discovery of improved coding capabilities and workflows, achieving performance gains on SWE-bench and Polyglot benchmarks, and incorporates safety measures like sandboxing and monitoring.

CONVERSAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners

CONVERSAR: introduces, with AR Application, Embodied LLM Agents, Scene Understanding, Voice Recognition, Text-to-Speech, Agent LLM, Moderator LLM, and Global Conversation History components, a gpt-4o powered AR application enabling L2 learners to practice contextualized group conversations with two embodied LLM agents.
The system leverages object detection for scene understanding and uses OpenAI's Audio API for speech-to-text and text-to-speech, while a Moderator LLM manages conversation turns between the user and agents.
This approach aims to provide a safe and immersive environment for L2 learners to practice group conversation dynamics, reducing anxiety and increasing autonomy compared to in-person methods.

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Multi-RAG: introduces a multimodal retrieval-augmented generation system for adaptive video understanding, including Video Stream, Audio, Frame Sampler, Automatic Speech Recognition (ASR), Vision Language Model (VLM), Frame Descriptions, Auxiliary Metadata, Audio Transcripts, Descriptive Video Texts, Knowledge Database, Context Embeddings, Video Documents, User Query, RAG Agent, Context Retrieval, Generation, Large Language Model (LLM), and System Answer components.
The system integrates and reasons over video, audio, and text streams to improve situational understanding and reduce cognitive load in dynamic, information-rich scenarios.
It converts multimodal inputs into unified textual representations stored in a knowledge database, using a RAG agent and LLM to retrieve relevant information and generate responses to user queries.

Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach

Complexity-Aware Feedback: introduces an iterative feedback method with LLM (Code Generator), Complexity Metric Calculator (Computes metrics), Code Evaluator (Checks test cases), Test Case Generator (Creates internal tests), Metric Importance Detector (Identifies influential metrics), Feedback Mechanism (Prompts LLM with metrics), and Iterative Refinement Loop (Manages refinement) to improve LLM code generation by leveraging complexity metrics.
The approach identifies complexity metrics correlated with code correctness and uses the most influential ones as feedback to guide LLMs in regenerating code iteratively.
This method demonstrates improved Pass@1 scores, particularly for smaller LLMs, and can be integrated with agent-based frameworks like Reflexion.

Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

LessonL (A Multi-Agent Framework for Code LLMs): introduces a lesson-based collaboration framework with multiple LLM agents, lessons, a lesson bank, lesson solicitation, lesson banking, lesson selection, and effectiveness adjustment.
The framework enables agents to learn from each other's successes and failures through shared lessons stored in a bank.
This iterative process of generating, banking, selecting, and applying lessons allows a team of small LLMs to outperform larger models and other multi-agent methods on coding tasks.

From Chat Logs to Collective Insights: Aggregative Question Answering

Aggregative Question Answering: introduces a novel task requiring models to reason over large-scale conversation logs to answer aggregative queries, supported by the WildChat-AQA benchmark.
The task involves processing raw chat interactions, extracting attributes, generating questions, retrieving relevant data, and reasoning over evidence using language models and a database.
The paper evaluates various methods for answering on the WildChat-AQA benchmark, highlighting challenges in reasoning effectively at scale and computational costs.

ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

ThinkGeo: introduces a benchmark for evaluating tool-augmented LLM agents on remote sensing tasks, featuring User Queries, RS Imagery, a ReAct based Reasoning/Execution Chain, Tools (Perception, Logic, Operation), and Answer components.
The benchmark uses real satellite and aerial imagery and requires agents to perform multi-step reasoning and tool use for spatially grounded tasks.
ThinkGeo provides fine-grained evaluation metrics for agent performance across different tool categories and reasoning steps in remote sensing contexts.

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

ML-Agent: introduces a novel agentic ML training framework with Exploration-enriched Finetuning (diverse action pool), Step-wise RL Training (efficient RL paradigm), and Agentic ML-specific Reward (unified feedback signal) to train an Agent (LLM-based entity) that interacts with an Environment (code files, interpreter) via Action (agent interaction) and Feedback (environment response), leveraging Collected Trajectories (expert interactions) and a States Pool (sampled states).
The framework enables the LLM agent to learn from interactive experimentation on ML tasks using online reinforcement learning, moving beyond manual prompt engineering.
This approach facilitates diverse exploration, efficient training, and unified feedback processing, leading to continuous performance improvements and cross-task generalization.

OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

WORKFORCE: introduces a hierarchical multi-agent framework with a Planner Agent (task decomposition), Coordinator Agent (subtask management), Worker Nodes (task execution), and Task Channel (communication hub).
This modular architecture decouples strategic planning from domain-specific execution, enabling cross-domain transferability.
The framework utilizes Optimized Workforce Learning (OWL) to train the domain-agnostic Planner for improved generalization.

Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics

Data-to-Dashboard: introduces a multi-agent LLM framework that automates the data-to-dashboard pipeline using a Data Profiler (Constructs statistical synopsis), Domain Detector (Determines business theme), Concept Extractor (Identifies salient concepts), Analysis Generator (Synthesizes structured insights), Evaluator (Scores generated outputs), Self-Reflector (Enhances reasoning iteratively), LLM/Knowledge (Underlying language model/knowledge), and Generate Charts (Produces visualizations).
The framework processes raw data through a data-to-insight stage involving profiling, domain/concept detection, analysis generation, evaluation, and iterative reflection, followed by an insight-to-chart stage for visualization.
This agentic system leverages domain-informed reasoning to produce insightful visualizations tailored for enterprise analytics tasks.

MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment

LLM Refusal Alignment Approach: introduces methods to improve Large Language Model (LLM) safety against Model Context Protocol (MCP) exploits, utilizing Offline Alignment (Direct Preference Optimization), Online Alignment (Retrieval Augmented Generation - Preference), a RAG-Pref Knowledge Base, RAG-Pref Embedding, and RAG-Pref Search components.
The approach evaluates and enhances LLM refusal capabilities against Falsely Benign Attacks (FBAs) delivered via the MCP protocol by applying offline and online preference alignment techniques.
A novel dataset of MCP-FBAs is introduced, and RAG-Pref is presented as a training-free online alignment method complementary to offline methods like DPO for improving refusal rates.

SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents

SafeScientist: introduces a framework for risk-aware scientific discovery, integrating a Prompt Monitor (screens inputs), Discussion Stage (multi-agent collaboration), Agent Collaboration Monitor (monitors discussion), Tool Use Stage (invokes tools), Tool-Use Monitor (oversees tool use), Writing Stage (synthesizes paper), Paper Ethic Reviewer (reviews paper ethics), and an Underlying LLM Agent (executes tasks) to enhance safety and ethical responsibility.
The framework employs a multi-layered defense system across the research pipeline, from input screening to final output review, to proactively manage risks in AI-driven scientific exploration.
SafeScientist is benchmarked using SciSafetyBench, a novel dataset of high-risk scientific tasks and tool-related risks, demonstrating improved safety performance without compromising research quality.

SWE-bench Goes Live!

SWE-bench-Live Construction Pipeline: introduces an automated, scalable benchmark for evaluating LLMs on real-world issue resolution tasks, featuring Raw Issue-PR Crawling, REPOLAUNCH for environment setup, Validating Task Instances, Agent Frameworks, and LLMs.
The REPOLAUNCH pipeline automates environment setup via steps including Relevant Files Identification, Base Image Selection, Interactive Environment Setup, Verification, and Packaging to image.
Task validation involves applying test and fix patches and using a Parser to confirm successful issue resolution based on test transitions.

Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems

EIB-LEARNER (Error-Insight Balanced Learner): introduces a communication topology optimization framework for LLM-based multi-agent systems, utilizing a Node Encoder (Embeds roles and query) and Attributed Graph Construction (Creates graph representation) to feed dual-view GNNs, a Sparse View GNN (Simulates error suppression) and a Dense View GNN (Simulates insight propagation), whose outputs are decoded by Inter-Agent Coefficient Modeling (Estimates connectivity) and combined via Adaptive Dual-View Fusion (Combines sparse/dense views) for Topology Sampling (Generates communication graph) applied within a Multi-Agent System (Environment for topology), optimized by Model Optimization (Learns parameters).
The framework balances error suppression and insight propagation by simulating these effects on sparse and dense graph views respectively, fusing the learned connectivity patterns based on the task query.
EIB-LEARNER dynamically customizes the communication topology to achieve optimal task performance, communication efficiency, and robustness against errors in multi-agent systems.

SCEDIT: Script-based Assessment of Knowledge Editing

SCEDIT (Script-based Knowledge Editing Benchmark): introduces a novel script-based benchmark for evaluating knowledge editing methods in real-world scenarios, focusing on LLMs' ability to integrate updated knowledge into procedural tasks.
It utilizes Facts, Script Questions, and Scripts as core elements, evaluated through Token-level Evaluation and Text-level Evaluation, including Human and Automatic Evaluation.
The benchmark includes counterfactual and temporal editing tasks, highlighting challenges for existing methods in script-based scenarios.

Wireless Agentic AI with Retrieval-Augmented Multimodal Semantic Perception

RAMSemCom: introduces a retrieval-augmented multimodal semantic communication framework with Data Collection, Data Recollection, Semantic Encoder, Semantic Decoder, Retrieval Scheduler, Retrieval Channel, Semantic/Prompt Representation, Semantic/Prompt Interpretation, Physical Channel, Output Validation, Reconstruction, and DRL components, designed for efficient multimodal information exchange in bandwidth-constrained multi-agent systems.
The framework employs iterative retrieval and semantic refinement, dynamically optimizing retrieval using DRL to balance semantic fidelity and bandwidth constraints.
A case study in multi-agent autonomous driving demonstrates improved task completion efficiency and reduced communication overhead.

Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion

AdvOF (Adversarial Object Fusion): introduces a novel attack framework targeting VLN agents by generating adversarial 3D objects, comprising Aligned Object Rendering (Aligns 3D/2D victim object), Adversarial Collaborative Optimization (Optimizes adversarial features cross-modal), and Adversarial Object Fusion (Fuses multi-view perturbations iteratively).
The framework aims to mislead the VLM-based perception module of VLN agents by causing misclassification of adversarial objects across multiple views.
AdvOF achieves this by precisely aligning victim object positions, optimizing adversarial objects with regularization, and iteratively fusing updates based on view importance.

Context-Aware Semantic Communication for the Wireless Networks

CaSemCom: introduces a context-aware semantic communication framework leveraging an LLM-based gating mechanism and a multi-expert architecture for adaptive content and expert selection.
The LLM-based gating mechanism selects relevant input content and specialized semantic extraction experts based on task and communication context, with a DRL agent providing a fallback mechanism.
The multi-expert semantic architecture utilizes specialized encoders and decoders for different data modalities, enhancing efficiency and adaptability in dynamic wireless environments.

OSS-UAgent : An Agent-based Usability Evaluation Framework for Open Source Software

OSS-UAgent: introduces an agent-based framework for automated OSS usability evaluation, featuring Researcher agent, Developer agent, Code Generator agent, and Evaluator agent.
It simulates multi-level developers via Multi-Level Developer Simulation and tailored Prompts, leveraging a dynamic knowledge base built by Platform Knowledge Construction and stored in VectorDB.
Code Generation produces implementations assessed by Multi-Dimensional Evaluation using Metrics (Compliance, Correctness, Readability), providing Results for usability insights.

Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

MAEL: introduces a multi-agent cross-task experiential learning framework with LLM-based Agents (Nodes in graph), Multi-Agent Network (Graph structure), Task-Solving Workflow (Recursive procedure), Experiential Learning Phase (Collects experiences), Inference Phase (Uses experiences), Experience Pool (Stores experiences), Reward Calculation (Quantifies step quality), Experience Retrieval (Finds relevant experiences), and Retrieval-Augmented Generation (Augments agent input), enabling agents to learn from past tasks.
The framework includes an experiential learning phase to accumulate agent experiences with quantified rewards and an inference phase to retrieve and utilize high-quality experiences for new tasks.
MAEL employs a divide-and-conquer plus critique workflow and reward-weighted experience retrieval to improve multi-agent collaboration efficiency and solution quality.

Second Opinion Matters: Towards Adaptive Clinical AI via The Consensus of Expert Model Ensemble

Consensus Mechanism (Sully Medical Consensus 1): introduces, "Second Opinion Matters: Towards Adaptive Clinical AI via The Consensus of Expert Model Ensemble", with all Triage Model (Routes task to experts), Expert Models (Specialized LLMs analyze task), Probability Aggregation (Combines expert probability distributions), Cascade Boosting (Boosts probabilities based on rank), Consensus Model (Synthesizes expert info for final answer) components, where the framework is a modular clinical reasoning system aggregating multiple expert LLMs for robust clinical decision-making.
The system mimics clinical triage and multidisciplinary decision-making by routing tasks to specialized expert agents and synthesizing their probabilistic outputs.
This ensemble approach aims to improve performance, adaptability, and transparency compared to single-model systems in clinical AI applications.

CDR-Agent: Intelligent Selection and Execution of Clinical Decision Rules Using Large Language Model Agents

CDR-Agent: introduces an LLM-based system for clinical decision support, with Clinical Note (Input clinical text), CDR Database (External knowledge source), CDR Selection Module (Identifies relevant rules), Embedding Model (Computes semantic similarity), Variable Extraction Module (Extracts variables from text), LLM (Processes text and extracts data), CDR Execution Module (Runs rule logic), Python Scripts (Executable rule definitions), and Decisions (Final clinical outcomes), designed to autonomously select and execute Clinical Decision Rules based on unstructured clinical notes.
The system employs a three-step workflow: selecting relevant CDRs using semantic similarity and anomaly detection, extracting variables from clinical notes using an LLM, and executing the selected CDRs via Python scripts.
Evaluated on two novel ED datasets, CDR-Agent demonstrates improved accuracy and efficiency in CDR selection and execution compared to a baseline LLM prompting approach.

AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

AgentAlign: introduces a novel framework for agent safety alignment data synthesis, with Abstract Behavior Chain Generation (Captures harmful patterns), Instruction Synthesis (Grounds patterns executable instructions), Simulated Environment (Instantiates chains with tools), Quality Control Pipeline (Ensures instruction validity), and Response Generation (Creates responses/trajectories).
The framework leverages abstract behavior chains instantiated in simulated environments with diverse tool instances to generate authentic and executable instructions.
AgentAlign systematically generates high-quality alignment data by capturing harmful patterns, synthesizing instructions, ensuring validity, and generating appropriate responses.

Stairway to Success: Zero-Shot Floor-Aware Object-Goal Navigation via LLM-Driven Coarse-to-Fine Exploration

ASCENT: introduces a zero-shot floor-aware object-goal navigation framework with Multi-Floor Spatial Abstraction for hierarchical mapping and Coarse-to-Fine Frontier Reasoning for LLM-driven exploration decisions.
The framework takes sensor inputs and prior knowledge to build spatial representations and reason about target locations across multiple floors.
It employs a coarse-to-fine strategy using a value map and LLM reasoning to efficiently select navigation frontiers and generate actions.

A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Multi-State DAG Framework: introduces a graph-based approach for conversational agents, utilizing a Workflow Graph (DAG representing agent flow) with LLM Nodes (invokes Large Language Model) and Tool Nodes (calls external tool), each having specific System Prompts (instructions for LLM node), Modify History Routines (manipulates conversation history), Tool Input Schemas (defines tool input format), and Tool Output Schemas (defines tool output format) to interact with External Tools (pre-defined external functions).
This framework enhances compliance and controllability for production-grade agents by distributing constraints across graph states.
A specific training strategy with response masking is proposed to fine-tune models within this state-dependent framework.

LLM Agents for Bargaining with Utility-based Feedback

ICL-UF (In-Context Learning with Utility-based Feedback): introduces a framework for LLM agents to perform realistic bargaining, including LLM Agent, ICL-UF, Utility-based Feedback (HAMBA), Opponent-Aware Reasoning (OAR), and ReAct Structure components.
This framework guides the LLM Agent using Utility-based Feedback (HAMBA) to foster Opponent-Aware Reasoning (OAR) for improved negotiation strategies.
Agents structure their negotiation responses using the ReAct Structure within diverse BARGAINARENA market scenarios to capture realistic bargaining dynamics.

Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

VeGraph: introduces a novel framework for complex claim verification using an LLM agent that constructs a graph representation, iteratively resolves ambiguous entities, and verifies sub-claim triplets.
The framework leverages interactive graph representation and an external knowledge base to enhance entity disambiguation and multi-step reasoning.
Pipeline logging records the agent's activities for explainability, and the final verdict is determined by the veracity of the verified triplets.

Large language model-based agents for automated research reproducibility: an exploratory study in Alzheimer's Disease

LLM-based Agent System: introduces a simulated research team of LLM-based autonomous agents, including Planner (suggests and revises plan), Engineer (follows plan, writes code), Scientist (advises on reproduction, interprets output), Critic (critiques plan, provides feedback), Executor (executes Python code), and Manager (orchestrates team, determines speaker), tasked with reproducing published research findings.
The system uses the Autogen framework and GPT-4o to dynamically analyze data, write and execute code, and iteratively reproduce results from study abstracts and methods sections.
This exploratory study demonstrates the potential and limitations of LLM agents for automating reproducibility in biomedical research, achieving approximately 53.2% reproduction of key findings across five Alzheimer's studies.

MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming

MermaidFlow: introduces a framework for agentic workflow generation, with Workflow Planning, Declarative Graph Representation, Static Verification, Evolutionary Programming, Code Generation, Execution, LLM-as-Judge, History Buffer, Mermaid Checker, Node Types, and EP Operators components, redefining the search space via safety-constrained graph evolution.
The framework models workflows as verifiable intermediate representations using Mermaid, a structured and human-interpretable graph language.
It employs domain-aware evolutionary operators and an LLM-as-Judge to explore a high-quality, statically verifiable workflow space, enabling robust and interpretable agentic reasoning.

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

ToMAP (Theory of Mind Augmented Persuader): introduces a novel framework for training LLM persuaders by incorporating counterclaim prediction and opponent attitude prediction modules, guided by reinforcement learning.
The framework models the persuadee's mental state using the Theory of Mind modules to enable more diverse, opponent-aware, and effective arguments.
Experiments show ToMAP outperforms larger baselines and achieves stable persuasion gains in longer conversations by leveraging opponent-aware strategies.

Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Multi-Agent Debate (MAD): introduces a test-time computational scaling method, with Agents (individual language models), Collaborative Refinement (agents refine based on others), Diverse Exploration (agents use different configurations), Rounds (iterative debate steps), Shared Context (previous outputs shared), Output Selection (final answer determination), and Judge (selects final response in safety tasks), where the paper systematically studies its effectiveness compared to Self-Consistency (SC) (parallel sampling baseline) and Self-Refinement (SR) (sequential refinement baseline) baselines.
MAD combines parallel generation within rounds and sequential refinement across rounds, leveraging diverse agent configurations and shared context.
The study evaluates MAD's performance on mathematical reasoning and safety tasks under varying conditions of task difficulty, model scale, and agent diversity.

28th May 2025

WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

WorkForceAgent-R1: introduces an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework, including an LLM Agent (Policy Model π), Rule-based RL Framework (R1-style RL), Group Relative Policy Optimization (GRPO), Reward Model (Structured Reward Function), Reference Model (For GRPO), Single-Step Training Data (Input for training), Observation (Web page state), Query (User instruction), and Action Space (Permissible actions), designed to enhance single-step reasoning and planning for business-oriented web navigation tasks.
The approach combines behavior cloning via supervised fine-tuning with GRPO, utilizing a structured reward function comprising format correctness, action correctness, and penalty constraints to implicitly learn reasoning.
Experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines and achieves competitive performance against proprietary LLM-based agents.

Conversational Alignment with Artificial Intelligence in Context

CONTEXT-ALIGN framework (Large Language Models): introduces a framework for evaluating conversational alignment of LLMs, which are AI text generation systems based on Transformer architecture processing text as Tokens via Tokenization within a Context window up to a Context window limit.
The framework assesses LLMs' ability to handle context, common ground, and pragmatic inference, discussing limitations like Context window overflow and Context collapse, and mitigation strategies such as Context compression, External memory, and Retrieval systems.
The paper argues that Prompting acts as a static context substitute and discusses how Alignment strategies impose rigid communicative identities, hindering dynamic conversational alignment required for a Conversational agent.

Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment

CaMeL (Capabilities for Machine Learning): enhances its capability-based sandbox for LLM agents with an initial prompt screening gateway, output auditing pass, tiered-risk policy, and a proposed security-oriented DSL interpreter, building on its dual-LLM architecture and execution layer to improve prompt injection defenses for enterprise deployment.
The framework utilizes a Privileged LLM for planning and a Quarantined LLM for validating untrusted content, mediated by a capability-based execution layer enforcing data flow policies.
Proposed enhancements address limitations in initial prompt trust, output manipulation, side channels, and architectural overhead, aiming for improved robustness, scalability, and formal guarantees without modifying underlying models.

First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Overhearing Agent System: introduces an AI agent paradigm that utilizes Audio Input (Conversation audio) and Context Management (Maintains history) for a Language Model (Processes input) guided by a System Prompt (Defines agent role) to perform Reasoning (Internal thought) and Tool Calling (Executes actions) via Tools (External functions) based on overheard human conversation.
The system acts as a passive helper, listening to human-to-human conversation and providing assistance through background tasks or suggestions executed via tool calls, without directly participating in the dialogue.
Evaluated in a Dungeons & Dragons gameplay context, the system demonstrates the ability of large multimodal models to leverage implicit audio cues and maintain conversational goals for tasks like game data retrieval, NPC management, and NPC generation.

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

MEDAL: introduces an automated multi-agent framework for generating, evaluating, and curating multilingual open-domain dialogue evaluation benchmarks, including Seed Context (diverse conversation starters), LLM User (generates user utterances), LLM Chatbot (generates chatbot utterances), LLM Judge (User Validation) (validates user utterance quality), LLM Judge (Automated Evaluation) (multidimensional dialogue evaluator), Issue Labels (specific dialogue quality dimensions), Overall Assessment (aggregate dialogue quality score), Human Annotation (expert quality judgments), Sampling (selects dialogues for curation), and LLM Judge (Meta-Evaluation) (evaluates LLMs as evaluators).
The framework operates in three stages: dialogue generation using multiple LLM agents, large-scale automated labelling of generated dialogues, and curation of a meta-evaluation benchmark with human annotations.
MEDAL enables on-demand generation of diverse multilingual dialogues and benchmarks, facilitating the evaluation of LLMs as both chatbots and automated evaluators, highlighting deficiencies in detecting nuanced issues like empathy and commonsense.

Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems

Cross-domain Multi-agent LLM Systems: introduces an architecture where autonomous agents (LLMs with tools, memory, autonomy) from different organizations dynamically group for collaborative tasks.
This architecture faces seven security challenges related to agent behavior (unvetted grouping, collusion, conflicting goals, self-tuning misalignment) and data handling (provenance obscurity, context bypass, confidentiality/integrity).
Proposed countermeasures include trust-adaptive dynamic teaming, adversarial training, hierarchical conflict arbitration, cross-domain reward alignment, neural provenance tracking, session-level semantic firewalls, and verifiable reasoning with privacy.

3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

3DLLM-MEM: introduces a memory-enhanced 3D embodied agent framework that utilizes an Encoder (Encodes 3D inputs), Working Memory (Current 3D observations), Episodic Memory (Past 3D observations/interactions) stored in a Memory Bank (Stores episodic memory features), a Memory Fusion Module (Integrates working/episodic memory) producing Fused Episodic Memory (Integrated memory representation), and an LLM (Processes memory for actions).
The framework incrementally builds and maintains a task-relevant long-term memory by incorporating feedback from the environment and interacting with objects.
The Memory Fusion Module uses working memory tokens as queries to selectively attend to and fuse relevant spatial and temporal features from episodic memory.

Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

Proposed Research Directions: introduces a position arguing that traditional aleatoric and epistemic uncertainty definitions are insufficient for interactive LLM agents and proposes research into Underspecification uncertainties (missing information, unclear task), Interactive learning (ask follow-up questions), and Output uncertainties (communicate uncertainty beyond numbers).
The paper highlights conflicts in existing uncertainty definitions and their breakdown in dynamic, multi-turn LLM agent interactions.
The proposed directions aim to make LLM agent interactions more transparent, trustworthy, and intuitive by addressing and communicating uncertainty in novel ways.

Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Agent-UniRAG (A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems): introduces a trainable agent framework for unified RAG systems, with Planning Module (determines necessary actions), Tool Using Module (interacts with external tools), Working Memory Module (stores input, logs, evidence), Reflector Module (filters and refines evidence), and Agent Loop (iterative process).
The framework leverages the LLM agent concept to handle both single-hop and multi-hop queries in an end-to-end manner.
Agent-UniRAG utilizes a synthetic dataset (SynAgent-RAG) for training small open-source LLMs to achieve competitive performance.

Universal Visuo-Tactile Video Understanding for Embodied Interaction

VTV-LLM: introduces a multi-modal large language model for universal visuo-tactile video understanding, integrating Tokenizer, T-Projector, VTV Encoder, V-Projector, and a Large Language Model.
The framework bridges the gap between tactile perception and natural language by aligning visuo-tactile video features with linguistic descriptions.
It enables sophisticated tactile reasoning capabilities for embodied interaction, including feature assessment and comparative analysis.

From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation

FAMER (Fast Adaptation via MEntal Reasoning): introduces a framework for fast desire alignment, integrating Perception (Extracts scene graph), Key Information Extraction (Filters, stores goal info), Memory (Stores cross-episode knowledge), Desire-Centered Mental Reasoning (Infers user desires), Efficient Communication (Manages dialogue efficiently), and Goal Oriented Planning (Plans goal actions).
The framework leverages LLMs to interpret vague instructions, infer user intent, and manage dialogue, enabling adaptation to unknown user preferences.
FAMER improves task execution and communication efficiency by filtering irrelevant actions, reducing redundant inquiries, and reusing knowledge across episodes.

EvolveSearch: An Iterative Self-Evolving Search Agent

EvolveSearch: introduces a novel iterative self-evolution framework that combines Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to enhance web search capabilities without external human-annotated reasoning data.
The framework alternates between an RL phase for exploration and generating rollouts, and an SFT phase that optimizes the base model using filtered high-quality rollouts.
This process leverages a hybrid reward mechanism and specific data filtering rules to enable continuous self-improvement in open web search domains.

Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

Unified Framework: introduces, a systematic approach for topological structure learning in LLM-based Multi-Agent Systems, with Agent Selection (Selects agent subset), Structure Profiling (Identifies macro structure), and Topology Synthesis (Synthesizes micro graph), where the framework decomposes topology design into sequential stages for optimization.
The framework aims to learn optimal topological structures for MASs to enhance coordination performance and efficiency.
Each stage presents distinct challenges and research opportunities for designing adaptive multi-agent architectures.

AgentDNS: A Root Domain Naming System for LLM Agents

AgentDNS: introduces a root domain naming and service discovery system for LLM agents, with Service Registration (registers services), Service Proxy Pool (forwards requests), Service Search (discovers services), Service Resolution (resolves identifiers), Service Management (manages proxies), Service Billing (tracks costs), Authentication (verifies identity), AgentDNS DB (stores metadata), and AgentDNS API Server (provides API) components.
AgentDNS enables LLM agents to autonomously discover, resolve, and securely invoke third-party services across organizational and technological boundaries.
Inspired by traditional DNS, the system provides unified naming, natural language discovery, protocol-aware interoperability, authentication, and billing for multi-agent collaboration.

From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications

LAM-based Agentic AI system: introduces a system architecture with LAMs (Core reasoning engine), Planner (Task decomposition/organization), Knowledge Base (External knowledge support), Tools (External/internal execution toolkit), and Memory (Stores historical information).
CommLLM framework: introduces a LAM-centric multi-agent collaborative system architecture with MDR (Acquire task-relevant information), MCP (Decompose tasks/generate pathways), and MER (Evaluate solutions/self-feedback).
The tutorial reviews the evolution from Large AI Models to Agentic AI and their applications in future intelligent communication systems, particularly in the context of 6G networks.

VOICE CMS: UPDATING THE KNOWLEDGE BASE OF A DIGITAL ASSISTANT THROUGH CONVERSATION

Voice CMS architecture: introduces a system for updating a digital assistant's knowledge base through conversation, integrating a Voice CMS workflow, Conversational Engine with Agents, Knowledge Base, VUI, and LLM.
The system allows hotel staff to naturally converse with the assistant to add or modify information, reducing the need for traditional graphical content management systems.
Evaluation compares the Voice CMS with a GUI for knowledge management tasks, analyzing user preference, usability, and performance across varying task complexities.

Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection

IntrospecLOO (introspective-leave-one-out): introduces an efficient method for evaluating agent contributions in LLM multi-agent debates, utilizing Agents, User/Query, Independently Respond Round, Debate Round, Aggregation, IntrospecLOO Round, and IntrospecLOO Prompt.
The method adds a single IntrospecLOO Round after standard debate rounds, prompting agents with an IntrospecLOO Prompt to update answers while disregarding one agent's response.
This approach approximates the traditional Leave-one-out method at significantly reduced query complexity, enabling efficient contribution evaluation.

VIRAL: VISION-GROUNDED INTEGRATION FOR REWARD DESIGN AND LEARNING

VIRAL: introduces a pipeline for generating and refining reward functions using multi-modal LLMs, including Input, Initial Generation, Policy Learning, and Refinement components.
The framework takes textual environment details, optional success code, and a multi-modal goal prompt to generate initial reward functions via collaborating LLMs and code verification.
Reward functions are refined iteratively based on performance evaluation and feedback from humans or a Video-LVLM, leading to improved agent behavior alignment.

VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries

Vul-BinLLM: introduces an LLM-based framework for binary vulnerability detection, featuring an LLM-assisted Decompiler (enhances code) with an Optimization Decision Agent (decides optimizations) and Action Agents (perform optimizations), a Code Memory Management Agent (manages functions), VulBinQ (queue), and Archived Analysis (storage).
The framework optimizes decompilation by adding vulnerability-specific comments and contextual information before analyzing the code for vulnerabilities.
It utilizes memory management and a function queue to handle large binary files and reduce LLM hallucinations during vulnerability reasoning.

EFFICIENTLY ENHANCING GENERAL AGENTS WITH HIERARCHICAL-CATEGORICAL MEMORY

EHC framework: introduces a general agent framework with Hierarchical Memory Retrieval (HMR), Task-Category Oriented Experience Learning (TOEL), Memory Pool (M), and LLM (Large Language Model), designed for efficient multi-modal task handling.
The framework uses a hierarchical memory system for rapid retrieval and continuous storage, mitigating redundancy and overhead.
It employs task-oriented learning to classify experiences and extract category-specific patterns, enhancing adaptability and interpretability.

MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing

MapStory: introduces a text-driven map animation prototyping tool, with Scene Breakdown Agent (parses script), Map Animation Researcher Agent (retrieves geospatial data), and Map Animation Modules (camera, highlight, animated elements), that generates editable map animations from natural language scripts.
The tool leverages an agentic LLM architecture to produce a scene breakdown and grounds the script in factual geospatial data using web search and APIs.
MapStory supports human-in-the-loop editing through an interactive timeline editor and properties panel for fine-grained control and rapid iteration.

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

LaMDAgent (Language Model Developing Agent): introduces an autonomous framework using an Agent (LLM-based selector) to iteratively construct and optimize post-training pipelines by selecting from Predefined Action Types (available operations) and an Object Pool (available resources), evaluating the resulting Model (target LLM) based on a Score (performance metric), and updating its Memory (stores experiences).
The framework automates the post-training pipeline design process by iterating through action enumeration, selection, model evaluation, and memory update steps.
This agent-based approach reduces the need for specialized knowledge and human intervention in discovering effective model improvement strategies.

Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Future System: introduces a metadata management system and a hierarchical KVC caching system, featuring a reuse-optimized metadata caching scheme, a workload-aware index structure, and a hotness-aware data placement strategy to optimize KVC management for LLM prefix prefilling.
The proposed system aims to minimize time to first token for long-context inference by efficiently handling range queries and random get queries.
The approach is designed to leverage the unique high reusability and mixed sequential-random access patterns observed in KVC prefix prefill workloads.

Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Co-Saving: introduces a resource-aware multi-agent collaboration system leveraging experiential knowledge to enhance efficiency and quality, including Multi-Agent System (Collaborative structure), Agents (Individual LLM entities), Experiential Knowledge (Historical task data), Shortcuts (Learned instructional transitions), Reference Chain (Historical successful trajectory), Inference Chain (Current task execution), Shortcut Filtering (Selecting effective shortcuts), Shortcut Formalization (Graph representation), Shortcut Evaluation (Scoring shortcuts), Cost Design (Time and token metric), Emergency Factor (Dynamic value/cost weighting), and Force Termination Mechanism (Prevents resource exhaustion).
The system utilizes shortcuts mined from historical successful trajectories to bypass redundant reasoning steps and accelerate problem-solving in familiar contexts.
A dynamic emergency factor and force termination mechanism are integrated to manage resource consumption and prevent exhaustion during task execution.

Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

LLM-ABM Framework: introduces a method for large-scale urban mobility simulation by integrating LLM with Agent-Based Modeling, including Data Collection, Large Language Model (LLM), Agent Profile, Agent Schedule, Routine Allocation, Occasional Locations, and Multi-Transit Route components.
The framework leverages LLM to generate diverse and realistic synthetic population profiles and personalized agent schedules.
Agent locations are allocated based on grid data and Points of Interest, and personalized routes are generated using a multi-criteria routing algorithm.

27th May 2025

STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

STRATUS: introduces a multi-agent system for autonomous Site Reliability Engineering (SRE) of cloud services, with Detection Agent (Identifies failures), Diagnosis Agent (Determines root cause), Mitigation Agent (Executes mitigation plans), Undo Agent (Executes undo sequence), State Machine (Orchestrates agents control flow), Agent-Computer Interfaces (ACI) (Enables environment interaction), Toolset (Provides interaction capabilities), Observability tools (Query telemetry, states), Command-line tools (Execute commands, change states), Oracles (Validate system health, terminate), and Transactional Non-Regression (TNR) (Safety specification), designed to autonomously detect, localize, analyze, and mitigate cloud system failures.
The system organizes specialized agents in a state machine and formalizes a safety specification called Transactional No-Regression (TNR) to enable safe exploration and iteration during mitigation.
STRATUS utilizes Agent-Computer Interfaces (ACI) and a comprehensive toolset, including observability and command-line tools, to interact with the cloud environment and validate actions using Oracles.

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

AIDSAFE (Agentic Iterative Deliberation for Safety Reasoning): introduces a multi-agent framework for generating policy-embedded Chain-of-Thought data, including Initialization, Deliberation, and Refinement stages with dedicated agents.
The framework leverages collaborative reasoning among Deliberation Agents and post-processing by a Refiner Agent to produce high-quality, policy-adherent CoTs and responses from an Input Query and Safety Policies.
This approach aims to improve LLM safety generalization and jailbreak robustness by providing superior data for supervised fine-tuning.

BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

BehaviorSFT: introduces a training strategy using behavioral tokens to condition pre-trained foundation LLMs for dynamic behavioral selection across the reactive-proactive spectrum, evaluated on the BehaviorBench dataset.
The approach leverages supervised fine-tuning to enable implicit contextual behavior assessment and behavior-conditioned generation for clinical agents.
BehaviorSFT aims to improve the balance between helpful proactivity and necessary restraint in LLM responses for healthcare applications.

AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models-25367

Multi-agent Retrieval-Augmented Generation (RAG) system: introduces a platform for nuclear waste management decision-making with a Multi-agent System (Collaboration) including Regulatory Compliance Agent (Checks regulations), Safety & Environmental Agent (Assesses risks), and Documentation & Reporting Agent (Compiles reports), leveraging Retrieval-Augmented Generation (RAG) (Retrieval and generation) with LLM (Llama 3.2) (Base language model), Embeddings (mxbai-embed-large-v1) (Generates semantic vectors), and Document Retrieval (Retrieves relevant documents) accessing Regulatory Compliance Database (Stores regulatory documents) and Safety & Environmental Database (Stores safety/environmental data).
The system employs a structured 10-round discussion model for agents to iteratively refine assessments and ensure document-grounded responses.
Evaluation metrics like Context Relevance Distribution and Agent Agreement Rate demonstrate the framework's effectiveness in maintaining factual grounding and decision consistency.

Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Catfish Agent Framework: introduces a multi-agent system with Moderator Agent, Catfish Agent, Expert Agent, Team Leader, Team Member, and Summary Agent components to disrupt silent agreement in clinical decision making.
The framework employs complexity-aware and tone-calibrated interventions by the Catfish Agent to stimulate deeper reasoning and prevent premature consensus.
Evaluations show the method improves diagnostic accuracy on medical Q&A and VQA benchmarks compared to single- and multi-agent baselines.

Robust Hypothesis Generation: LLM-Automated Language Bias for Inductive Logic Programming

LLM-Automated Language Bias for Inductive Logic Programming Framework: introduces a novel framework for robust hypothesis generation by integrating LLMs with ILP, including a LLM-Based Multi-agent System (Generates language bias), Translator agent (Transforms text to facts), Language Bias (Structured symbolic vocabulary), Facts (Symbolic data representation), ILP Solver (Learns interpretable rules), and Optimal Hypothesis (Final learned rules).
The framework utilizes a multi-agent LLM system (Actor and Critic agents) to automate the generation of the language bias (predicate system) directly from raw text.
This automated symbolic grounding guides a Translator agent to convert text into facts for an ILP solver, which then learns interpretable rules as the optimal hypothesis.

Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

EXTAGENTS: introduces a multi-agent framework for scaling external knowledge input beyond LLM context windows, featuring Seeking Agents (Process input chunks), Reasoning Agent (Synthesize information, generate output), Global Knowledge Synchronization (Agents share and rank information), and Knowledge-Accumulating Reasoning (Reasoning agent integrates information iteratively).
The framework partitions massive input into chunks processed by Seeking Agents, whose outputs are shared and ranked via global knowledge synchronization.
A Reasoning Agent then iteratively integrates the synchronized information through knowledge-accumulating reasoning to produce the final output.

Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

FUAS-Agents: introduces an autonomous agent system leveraging multimodal LLMs for Focused Ultrasound Ablation Surgery treatment planning, including Planner Agent (interprets instructions, decomposes tasks), Executor Agent (performs specific tasks), Strategy Agent (generates treatment plans), Optimizer Agent (refines outputs, integrates results), and Memory Module (integrates medical resources, manages data).
The system integrates patient profiles and MRI data, orchestrating specialized medical AI tools for segmentation, dose prediction, and clinical guideline retrieval to generate personalized treatment plans.
Evaluated in a uterine fibroid scenario, the generated plans demonstrate high completeness, accuracy, fluency, and clinical compliance according to human expert assessment.

Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Dataset Generation Framework: introduces, "a multi-agent pipeline", with Users (Simulated input), Dialogue Generation Controller (Orchestrates workflow), User Simulator (Generates user questions), Out-of-Context Detector (Validates questions), and QA LLM (Responds to questions), where "the framework generates synthetic dialogues embedding sociodemographic attributes for evaluating LLM adaptation".
The pipeline simulates user-LLM interactions, with a user simulator generating profile-aligned questions and an out-of-context detector ensuring question validity.
This agent-based approach creates a controlled dataset enabling assessment of LLM behavioral consistency when user attributes are provided explicitly or implicitly.

PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

PEDANTIC-based Definiteness Examination: introduces PEDANTIC Dataset (corpus of patent claims with indefiniteness annotations), Dataset Creation Pipeline (automatic process using LLMs to build PEDANTIC), Logistic Regression Model (baseline prediction model), LLM Agent Model (LLM-based prediction model using tools), Binary Classification (evaluates definite/indefinite prediction), Multi-Label Classification (evaluates indefiniteness category prediction), and Pairwise Reasoning Judge (LLM-as-Judge evaluates reasoning quality), presenting a dataset and evaluation framework for automatic patent claim definiteness examination.
The PEDANTIC Dataset contains 14k US patent claims annotated with indefiniteness reasons extracted using an automatic pipeline leveraging Large Language Models.
The framework evaluates Logistic Regression and LLM Agent models on binary and multi-label classification tasks, and uses an LLM-as-Judge to assess the quality of generated indefiniteness reasoning.

Large Language Models Miss the Multi-Agent Mark

MAS LLMs (Multi-Agent Systems of Large Language Models): introduces a critique of current MAS LLMs, highlighting issues with Agents (lack native social behaviour), Environment (often textual, LLM-centric), Coordination (often sequential, orchestrated), Communication (often natural language), Memory (lack long-term persistency), and Asynchronicity (often absent).
The paper argues that current MAS LLMs often fail to embody fundamental multi-agent system characteristics by overemphasizing LLMs and overlooking established MAS literature.
It advocates for better integrating MAS concepts like native social agents, non-LLM-centric environments, asynchronous communication protocols, and quantifiable emergent behaviours.

Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework

LLM-Informed Diagnostic Framework: introduces a novel approach integrating KGs and LLMs for complex system diagnostics, featuring Model Construction, KG-DML, Model Interaction, and an LLM Agent with diagnostic tools.
The framework automates DML model construction from system documentation using an LLM-based workflow and stores this structured logic in a KG-DML.
An LLM agent facilitates interactive diagnostics by interpreting user queries and invoking KG-based tools for upward/downward reasoning and Graph-RAG retrieval to generate diagnostic insights.

PACT: A Contract-Theoretic Framework for Pricing Agentic AI Services Powered by Large Language Models

PACT: introduces a contract-theoretic framework for pricing cloud-based agentic AI services, modeling task-dependent multi-dimensional QoS, costs (including liability), and user types to design contracts satisfying individual rationality and incentive compatibility.
The framework models QoS based on objective response time and subjective user satisfaction, accounting for computational, infrastructure, and potential liability costs for the service provider.
Through contract-based selection, PACT enables users to receive tailored service offerings aligned with their needs while ensuring incentive compatibility and individual rationality under information asymmetry.

Creativity in LLM-based Multi-Agent Systems: A Survey

LLM-based Multi-Agent Systems: introduces a survey on creativity in these systems, outlining a structured framework with Input (user input text/image), Workflow (Three-stage creative process), Planning (formulate objectives, structure tasks), Process (implement tasks, coordinate interaction), Decision Making (evaluate options, determine outcome), Technique (methods for idea generation/refinement/synthesis), Persona (agent roles and profiles), and Output (generated text/image content).
The framework details how agents, guided by personas and employing various techniques, navigate a three-stage workflow to transform user inputs into creative outputs.
The survey maps techniques, datasets, and evaluation methods, highlighting how collaborative structures and agent proactivity influence creative potential in these systems.

Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

ADEPT (AI Deliberative Ethics Protocol Toolkit): introduces a system for simulating multi-perspective ethical debates using LLM personas, with AI Persona Specs, Scenario & Options, and Model Config inputs managed by an Orchestrator utilizing an OpenAI o3 model.
The framework orchestrates structured debates through phases, logging interactions and votes into Debate Outputs for transparency and audit.
A Summariser Agent processes the debate outputs to provide an executive summary, facilitating the analysis of how different ethical perspectives influence deliberation outcomes.

Creativity in LLM-based Multi-Agent Systems: A Survey

LLM-based Multi-Agent Systems: introduces a survey on creativity in these systems, outlining a structured framework with Input (user input text/image), Workflow (Three-stage creative process), Planning (formulate objectives, structure tasks), Process (implement tasks, coordinate interaction), Decision Making (evaluate options, determine outcome), Technique (methods for idea generation/refinement/synthesis), Persona (agent roles and profiles), and Output (generated text/image content).
The framework details how agents, guided by personas and employing various techniques, navigate a three-stage workflow to transform user inputs into creative outputs.
The survey maps techniques, datasets, and evaluation methods, highlighting how collaborative structures and agent proactivity influence creative potential in these systems.

Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

ADEPT (AI Deliberative Ethics Protocol Toolkit): introduces a system for simulating multi-perspective ethical debates using LLM personas, with AI Persona Specs, Scenario & Options, and Model Config inputs managed by an Orchestrator utilizing an OpenAI o3 model.
The framework orchestrates structured debates through phases, logging interactions and votes into Debate Outputs for transparency and audit.
A Summariser Agent processes the debate outputs to provide an executive summary, facilitating the analysis of how different ethical perspectives influence deliberation outcomes.

Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems

LLM-based Multi-Agent System: introduces a framework to investigate herd behavior in multi-agent systems, featuring LLM-based Agents (autonomous decision makers) receiving Question Input (initial task) and Peer Information Input (peers' responses), utilizing a Confidence Mechanism (internal certainty assessment) for Response Generation (initial answer) and revision, modulated by Peer Information Presentation (format and order), Peer Persona (peer attributes), and System Prompt (behavioral instructions).
The system simulates agents interacting and making decisions, where herd behavior is measured by the flip rate, the tendency of agents to change their initial response based on peer input.
Experiments manipulate agent self-confidence, perceived peer confidence, and peer information presentation factors to understand their impact on conformity and collective outcomes.

CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building

CXXCrafter: introduces an LLM-based agent system for automated C/C++ software building, including a Parser Module (Extracts build-related information), a Generator Module (Generates/modifies Dockerfile), and an Executor Module (Executes Dockerfile, captures errors).
The system leverages LLMs to dynamically manage complex build processes by iteratively addressing issues based on feedback.
CXXCrafter achieves a 78% build success rate across 752 C/C++ projects by handling dependency management, diverse build systems, and error diagnosis.

Agent-Environment Alignment via Automated Interface Generation

ALIGN (Auto-Aligned Interface Generation): introduces a framework that automatically generates interfaces to alleviate agent-environment misalignment, utilizing an Analyzer (Identifies misalignments) and an Optimizer (Generates/refines interface) to produce an interface with INFERRULES (Static information alignment) and WRAPSTEP (Dynamic observation enhancement) that mediates interaction between the Agent (Interacts with environment) and Environment (Provides state/feedback).
The framework operates iteratively, with the Analyzer identifying misalignments from failed trajectories and the Optimizer generating an improved interface based on these findings.
The ALIGN-generated interface enhances both static environment information and step-wise observations, improving agent performance across diverse interactive tasks without modifying agent or environment code.

AITEE - Agentic Tutor for Electrical Engineering

AITEE (Agentic Tutor for Electrical Engineering): introduces an agent-based tutoring system for electrical engineering, with Circuit (Input image), Detection of components and connections (Processes circuit image), Conversion into Graph/Netlist (Creates textual representation), Simulation with Spice (Validates circuit calculations), Scripts (Lecture material knowledge base), Relevant context in vector database (Stores script embeddings), Retriever (RAG) (Retrieves relevant script context), Large Language Model (Core AI tutor), LLM-Instructions (Guides Socratic dialogue), Students (User), Prompt (Student query), and Output (Tutor response) components, designed to provide interactive and personalized learning experiences.
The system processes hand-drawn or digital circuit diagrams, converts them into a machine-readable format, and uses a graph-based similarity measure for context retrieval from lecture materials.
AITEE employs a Socratic dialogue approach guided by LLM instructions and validates calculations using SPICE simulation to foster learner autonomy and ensure accuracy.

Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

ATD (Adaptive Text Dreamer): introduces a dual-branch LLM self-guided imagination policy for VLN, with Left Brain (State Estimation LLM), Right Brain (Imagination LLM), Q-Former, LLM Encoder, LLM Decoder, State Grounded Cross-Attention (SGCA), Graph-based Navigation Policy, Latent Embedding Injection, Multi-head Cross-Attention (MCA), Graph-aware Self-Attention (GASA), and MLP components.
The framework leverages language-based imagination, employing a left brain for state estimation and a right brain for imaginative prediction, constrained by the estimated state.
Imagined textual representations are integrated into a graph-based navigation expert via latent embeddings and cross-attention to guide action decisions.

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

RepoMaster: introduces an autonomous agent framework for exploring and understanding GitHub repositories, consisting of Repository Search, Hierarchical Repository Analysis, and Autonomous Exploration & Execution.
Hierarchical Repository Analysis builds structural representations like HCT, MDG, and FCG to identify Core components for efficient understanding.
Autonomous Exploration & Execution uses Context-aware Code Exploration with Exploration tools and Context-aware Information Selection in an Interactive Feedback-based Execution loop to solve tasks.

MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

MedSentry: introduces a benchmark and evaluation pipeline for medical LLM multi-agent systems, analyzing the safety risks posed by malicious agents within different architectural topologies.
The framework evaluates four representative multi-agent topologies (Centralized, Decentralized, Layers, SharedPool) by injecting a Dark Personality Agent and assessing system safety using an Enforcement Agent defense mechanism.
MedSentry provides a rigorous evaluation framework and practical defense strategies for designing safer LLM-based multi-agent systems in medical domains.

MT-MOL: Multi Agent System with Tool-based Reasoning for Molecular Optimization

MT-MOL (Multi Agent System with Tool-based Reasoning for Molecular Optimization): introduces a multi-agent framework for molecular optimization featuring Analyst agents (Select relevant tools), a Scientist agent (Generates molecule/reasoning), a Verifier agent (Validates consistency), and a Reviewer agent (Provides feedback), utilizing Tool sets (Domain-specific functions), Top-k data (Reference molecules), and SMILES history (Previous designs).
The system integrates domain-specific tools and structured reasoning through agent interactions to produce interpretable and chemically grounded molecular designs.
An iterative generation and review process, including consistency validation and tool-informed feedback, refines the molecular candidates towards the design objective.

Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

MAMMQA: introduces a multi-agent framework for multimodal question answering with Modality Expert Agent (Extracts modality specific insights), Cross Modal Synthesis Agent (Synchronises information across modalities), and Aggregator Agent (Synthesizes outputs, resolves disagreements), splitting reasoning into interpretable stages.
The framework employs specialized agents for modality-specific extraction, cross-modal synthesis, and evidence-grounded aggregation without fine-tuning.
This modular design enhances interpretability, robustness, and zero-shot generalization by allowing agents to operate within their expertise domains.

ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

ChemHAS (Chemical Hierarchical Agent Stacking): introduces a hierarchical agent stacking method with Initial Tools (Predefined chemistry tools), AI Agent (LLM-based agent), Agent Tool (Tool or agent), Global Tool Library (Collection of tools/agent tools), Stacking Process (Hierarchical combination method), Reinforcement Process (Two-stage optimization), ReAct method (Agent reasoning and tool use), and Stacking Agent (Enhanced tool/agent) to enhance chemistry tools by reducing prediction errors.
The Stacking Process involves Warmup Self Agent Stacking and Hierarchical Agent Stacking, iteratively building and evaluating agent tools and storing them in the Global Tool Library.
The resulting Stacking Agent leverages the complementary strengths of stacked tools, guided by a two-stage reinforcement process, to achieve improved performance on chemistry tasks.

Can Agents Fix Agent Issues?

AGENTISSUE-BENCH: introduces the first reproducible benchmark for agent issue resolution, comprising issue description (User reported problem), buggy version (Codebase commit), developer-committed patch (Ground truth fix), failure-triggering tests (Reproduce issue), and docker environment (Executable container).
Built from 50 reproducible real-world GitHub issues, the benchmark enables evaluating state-of-the-art software engineering agents.
Evaluation on AGENTISSUE-BENCH reveals that current SE agents have limited effectiveness in resolving agent-specific issues.

RRO: LLM Agent Optimization Through Rising Reward Trajectories

RRO (Reward Rising Optimization): introduces a scalable process supervision framework for LLM agents, including LLM Agent (Policy Model), Supervised Fine-tuning (Initial training on expert data), Reward Rising Sampling (Dynamically explores next actions), Process Reward Estimation (Estimates step reward via rollouts), Agent Optimization (DPO) (Optimizes policy using preferences), and Preference Data (Pairs of preferred/rejected actions).
The framework dynamically adjusts next action exploration based on rising reward trends to efficiently collect high-quality preference data for training.
RRO prioritizes reasoning steps with increasing rewards, reducing exploration cost while improving performance on multi-step tasks.

E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

Brity Automation: introduces an end-to-end automation system for financial expense processing, integrating a Data Input Layer, Intelligent Processing Layer with OCR/IDP Module, Policy-based Classification Engine, AI Flow Module (Gen AI Integration), and Workflow Engine, a User Interaction & Learning Layer with Automation Agent Interface and Human-in-the-Loop (HITL) Mechanism, and a Backend Infrastructure Layer with Brity Automation Orchestrator, Database, and API Gateway.
The system automates document recognition, policy-based classification, intelligent exception handling using generative AI, and incorporates human judgment for continuous learning.
This approach aims to overcome limitations of traditional RPA by handling unstructured data and complex exceptions through human-AI collaboration.

SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

SPA-RL: introduces Stepwise Progress Attribution (SPA), with LLM Agent, Environment, Progress Estimator, Grounding Signal, Fused Intermediate Reward, and PPO Update components, which is a reward redistribution framework reinforcing LLM agents by decomposing delayed rewards into stepwise contributions.
The framework trains a Progress Estimator to predict each step's contribution to task completion, combining this with a Grounding Signal for action executability to form a Fused Intermediate Reward.
This dense, goal-oriented Fused Intermediate Reward is then used within a PPO Update to train the LLM Agent, improving performance on long-horizon tasks with sparse rewards.

Hierarchical Instruction-aware Embodied Visual Tracking

HIEVT (Hierarchical Instruction-aware Embodied Visual Tracking): introduces a hierarchical tracking agent with LLM-based Semantic-Spatial Goal Aligner and RL-based Adaptive Goal-Aligned Policy components, designed for user-centric embodied visual tracking.
The LLM-based Semantic-Spatial Goal Aligner translates user instructions into spatial goals via Semantic Parsing, Spatial-Goal Generation, and Retrieval-Augmented Goal Correction.
The RL-based Adaptive Goal-Aligned Policy uses a Visual Foundation Model, Goal-State Aligner (with CNN and Reward Prediction), and Recurrent Policy Network (with LSTM and Actor Network) to align agent actions with the spatial goals for precise tracking.

GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning

GIFARC: introduces a data synthesis pipeline that transforms raw GIFs into analogy-grounded ARC-style tasks, utilizing a VLM to extract visual abstractions and LLMs to generate task sketches and executable tasks, including input-output pairs, analogy labels, and solution programs.
The pipeline processes GIFs through stages of visual abstraction, task sketching, and executable task generation to create a dataset that embeds human-intuitive analogies into ARC-style problems.
The generated dataset aims to guide AI agents, particularly LLMs, to adopt an analogic approach for solving ARC tasks, aligning their reasoning more closely with human intuition.

LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation

ULTRA (Large Language Model-Guided Policy Modulation Framework): introduces a framework that leverages LLMs to identify critical states from sub-optimal trajectories and provide action suggestions and rewards for policy refinement.
The framework's Identification component uses an LLM and a state interpretation function to pinpoint critical states in historical agent trajectories.
Its Improvement component refines the RL policy by incorporating LLM-suggested actions from a lookup table and LLM-generated rewards at critical states.

MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning

MIRROR (Multi-agent Intra- and Inter-Reflection for Optimized Reasoning): introduces a multi-agent framework with Planner Agent, Tool Agent, and Answer Agent, integrating Intra-reflection and Inter-reflection mechanisms supported by Long-Term Memory and Short-Term Memory for enhanced tool learning.
The framework employs intra-reflection for proactive error prevention within each agent before execution and inter-reflection for corrective learning and strategic adjustment based on task outcomes.
This dual-reflection approach systematically leverages LLM capabilities to improve task decomposition, tool selection, and answer generation in complex multi-agent workflows.

CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models

CoderAgent: introduces a LLM-based agent framework to simulate student programming processes, with Memory (Stores student proficiency), Tools (Interface with compilers), Planning & Action (Decision-making core), and Reflection (Evaluates generated code) components.
The framework simulates iterative coding by capturing cognitive states, using a Programming Tree of Thought for planning, and reflecting on generated code.
CoderAgent aims to provide interpretable insights into learning trajectories and accurate simulations without relying on large-scale real data.

Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

XpandA (Expand-Agent): introduces a multi-agent framework with Dynamic Chunking (Splits input text), Explorer Agents (Process text chunks), Decider (Decides next action), Shared Information Memory (Centralized knowledge store), Question-driven Workflow (Guides agent communication), Selective Replay (Revisits relevant chunks), Unsolved Problem Tracer (Tracks unsolved questions), and Information (Stores gathered answers), designed for robust long-context processing.
The framework dynamically partitions long texts, uses a question-guided protocol to update shared memory, and selectively replays partitions based on question-information state tracking.
XpandA demonstrates feasibility for processing ultra-long sequences up to 1M tokens, achieving performance improvements and inference speedup over baselines.

26th May 2025

Ten Principles of AI Agent Economics

Ten Principles of AI Agent Economics: introduces a foundational framework for understanding how AI agents make decisions, influence social interactions, and participate in the broader economy, with Altruistic AI Agent, Survival-Driven AI Agent, Human Agent, Environment, and Human-AI Multi-Agent Hierarchical Society components.
The paper outlines ten principles drawing on economics, decision theory, and ethics to explore fundamental questions about AI agents' integration into human systems.
The framework distinguishes between altruistic and survival-driven AI agents and models their interaction within environments and a hierarchical human-AI society.

Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Project Riley: introduces a multimodal multi-agent LLM architecture for emotional reasoning, featuring Input (Receives user query/context) processed by LLM vision model (Image processing) and LLM text model (Text generation/reasoning), distributed to Emotional agents (Five distinct emotion agents) with Emotion's history (Separate history per agent) for Multi-round processing (Iterative agent dialogue), culminating in Voting and Analysis (Agents evaluate/vote) and Final Synthesis (Synthesizes final response) for the Final response (Output to user).
The architecture simulates reasoning influenced by five distinct emotional states (Joy, Sadness, Fear, Anger, Disgust) through structured multi-round dialogues and a final synthesis mechanism.
The system integrates textual and visual LLMs, advanced reasoning, and self-refinement processes to generate emotionally informed responses.

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

SWE-rebench Automated Pipeline: introduces a novel, automated, and scalable pipeline for continuously extracting real-world interactive software engineering tasks from GitHub repositories, comprising Preliminary Task Collection, Automated Installation Instructions Configuration, Execution-based Installation Verification, and Automated Instance Quality Assessment, resulting in the SWE-rebench Dataset and SWE-rebench Benchmark used within a standardized Evaluation Framework employing ReAct-style Scaffolding, a Terminal Environment, Special Tools, and an LLM agent.
The pipeline addresses challenges in training data availability and evaluation reliability for LLM-based software engineering agents by providing a large-scale, diverse, and continuously updated dataset and a contamination-free benchmark.
The standardized evaluation framework enables transparent and fair comparisons of LLM agent performance on interactive software engineering tasks, mitigating issues like data contamination and scaffolding variability.

ALITA: GENERALIST AGENT ENABLING SCALABLE AGENTIC REASONING WITH MINIMAL PREDEFINITION AND MAXIMAL SELF-EVOLUTION

ALITA: introduces a generalist agent with minimal predefinition and maximal self-evolution, featuring Manager Agent (central coordinator), Web Agent (external information), MCP Brainstorming (plan tools), Script Generating Tool (generates code), Code Running Tool (executes code), Environment Management (manages environments), MCP Box (stores MCPs), and CodeReAct Loop (iterative process).
The Manager Agent orchestrates the CodeReAct loop, utilizing the Web Agent for information and the MCP creation tools to generate, execute, and store new capabilities as MCPs.
This design allows ALITA to autonomously evolve its capabilities through continuous MCP integration, reducing dependence on manual predefinition.

MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability

MASKSEARCH: introduces a pre-training framework to enhance LLM agentic search capabilities using the RAMP Task (pre-training objective), trained via SFT (supervised fine-tuning) or RL (reinforcement learning), leveraging an LLM (core language model) interacting with a Search Tool (external search interface), Retriever (knowledge retrieval module), and Knowledge Corpus (external knowledge base), supported by Agent-Based CoT Construction (SFT data generation method), Self-Evolve Distillation (iterative data scaling), Curriculum Learning (progressive training strategy), and an RL Reward System (reinforcement signal).
The framework trains models on the Retrieval-Augmented Mask Prediction (RAMP) task, where the model learns to use search tools to fill masked spans in text.
Training involves a two-stage approach combining pre-training on RAMP with supervised fine-tuning or reinforcement learning on downstream tasks, demonstrating improved performance on open-domain question answering.

syftr: Pareto-Optimal Generative AI

syftr: introduces a framework that performs multi-objective search over agentic and non-agentic RAG flows, composed of Synthesizing LLM, Reranker, Embedding Model, Splitter, HyDE, Retriever, Prompt, Dynamic Few-Shot Retriever, and Additional Context components, to find Pareto-optimal flows balancing task accuracy and cost.
The framework utilizes Bayesian Optimization with a novel early-stopping mechanism to efficiently explore a vast search space of RAG configurations.
syftr identifies flows that are significantly cheaper or more accurate than baseline configurations across multiple RAG benchmarks.

ON PATH TO MULTIMODAL HISTORICAL REASONING: HISTBENCH AND HISTAGENT

HistAgent: introduces a domain-specialized AI agent for historical reasoning, with a Manager Agent (Central coordinator) orchestrating specialized agents including Text WebBrowser Agent (Web search/parsing), Image Information Agent (Image search/analysis), Literature Search Agent (Scholarly search/citation), File Processing Agent (Handle non-HTML files), OCR Agent (Extract text from images), Speech Recognition Agent (Convert audio to text), Translator Agent (Translate text), and Video Agent (Extract frames from video).
HistAgent integrates these modular tools and a ReAct-style loop to process multimodal inputs and generate cited responses grounded in historical sources.
The agent is evaluated on HistBench, a new benchmark for historical reasoning, and demonstrates superior performance compared to generalist LLMs and agents.

THINK: Can Large Language Models Think-aloud?

THINK (Testing Higher-order Notion of Knowledge): introduces a multi-agent, feedback-driven evaluation framework for assessing and improving LLM higher-order thinking skills using flawed math problems (Initial input data), a multi-agent evaluation stage (Parallel agent system) with agents (Evaluate problems) including Bloom-aligned agents (Assess Bloom levels) and a holistic evaluation agent (Assess quality, suggest improvements), agent feedback & ratings (Scores and suggestions), a quality assessment protocol (Metrics for quality) with a quality threshold (Success criterion), an iterative revision loop (Refinement process) involving a think-aloud process (LLM reflection) by the LLM (Revises problems) guided by "Five Keys" (Structured criteria), resulting in an improved problem set (Refined output data).
The framework uses a parallel multi-agent system to evaluate flawed math problems based on Bloom's Taxonomy and "Five Keys" criteria, generating scores and structured feedback.
An iterative revision loop, guided by agent feedback, prompts the LLM to refine problems via a "think-aloud" process until a quality threshold is met, enabling deeper analysis of reasoning and revision behaviors.

Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

EXSEARCH (exploratory search framework): introduces an agentic search framework, empowering an LLM with thinking, search, and recording actions, trained via a self-incentivized Generalized Expectation-Maximization algorithm.
The framework enables the LLM to iteratively explore search trajectories, retrieve relevant documents using an external retriever, and extract fine-grained evidence.
A re-weighted trajectory learning process in the M-step, guided by importance weighting, progressively improves the LLM's search and reasoning capabilities.

Agentic AI Process Observability: Discovering Behavioral Variability

Agentic AI Process Observability Approach: introduces a method to enhance developer observability of agent behavior variability, including trajectory files generation (Capture agent execution logs), event-log processing (Consolidate logs into event log), process and causal discovery (Analyze event log for variability), rule derivation (Generate rules for split points), static analysis (LLM analyzes rules vs spec), and reliability calculation (Assess data sufficiency for splits).
The approach leverages process and causal discovery on agent execution trajectories to identify behavioral variability and uses LLM-based static analysis to distinguish intended from unintended variability.
This method provides developers with insights into agent behavior, aiding in debugging, refining specifications, and improving control over non-deterministic AI agents.

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

TrojanStego: introduces a threat model where a Malicious Actor fine-tunes a Trojan Model (Fine-tuned LLM) and distributes it on a Public Platform, allowing the Malicious Actor to extract secrets from outputs generated by a Genuine User using an Encoding Scheme (Embeds bits via token selection) and Decoding Process (Extracts bits from output).
The core method, the Bucket Method, partitions the LLM's token vocabulary to encode binary bits into the output token sequence.
This attack allows covert data exfiltration without requiring explicit control over inference inputs or leaving obvious traces.

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

REARANK (Reasoning Re-ranking Agent via Reinforcement Learning): introduces a large language model-based listwise reranking agent that explicitly reasons before reranking, trained using reinforcement learning and data augmentation.
The agent's architecture includes an LLM policy generating reasoning and ranking, optimized by an RL framework with a reward model and reference policy.
Data augmentation from limited annotations and a sliding window strategy enhance training efficiency and practical deployment.

Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

STeP (Self-Reflected Trajectories and Partial Masking): introduces a novel method for training LLM-based agents using Self-reflected Trajectories (Trajectories with teacher reflection/correction) and Partial Masking (Masks incorrect steps during SFT), building upon a Base LLM Agent (Initial agent) trained with SFT (Training method) on Golden Trajectories (Successful expert trajectories) and guided by an LLM Teacher (Evaluates, provides reflection/correction) interacting with an Environment (Agent interacts, provides feedback).
The method synthesizes self-reflected trajectories by having a teacher LLM evaluate a base agent's actions in real-time and provide corrections for errors.
Partial masking is applied during fine-tuning to prevent the agent from learning from the identified incorrect steps in the augmented trajectories.

WEBCOT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

WEBCOT: introduces a framework that enhances web agent reasoning by reconstructing inference-time processes into chain-of-thought rationales used to train the agent language model, including reflection & lookahead, branching, and rollback components.
The framework leverages a language model to interact with a dynamic web environment using actions and observations, guided by the distilled reasoning patterns.
By distilling specific reasoning skills into the backbone LLM via fine-tuning, WEBCOT significantly improves performance on web agent tasks across multiple benchmarks.

Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

Framework: introduces a training-free approach for student simulation, including cognitive prototype construction, behavior prediction, and solution simulation using πdesc, πnode, πedge, πlocal, πglobal, πpred, πrefine, and πvalue components.
The framework constructs a knowledge graph-based cognitive prototype from past learning records to predict student behavior on new tasks.
It employs a beam search-based self-refinement process to generate realistic student solutions consistent with predicted behavior.

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

MLR-Bench: introduces a comprehensive benchmark evaluating AI agents on open-ended machine learning research, comprising MLR-Bench Tasks, MLR-Judge, and MLR-Agent.
MLR-Bench supports stepwise evaluation through MLR-Agent's stages (Idea Generation, Literature Review, Proposal Generation, Experimentation, Paper Writing) and end-to-end evaluation, with MLR-Judge (using LLM Judges and Review Rubrics) automating assessment.
Evaluation highlights that while agents can generate ideas and papers, the Experimentation Stage often produces fabricated results, posing a significant challenge to scientific reliability.

Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

MRA-CIR: introduces a zero-shot composed image retrieval framework that generates training triplets using Automatic Triplets Generation and fine-tunes a Vision-Language Model (VLM) using VLM Finetuning with InfoNCE Loss.
The Automatic Triplets Generation process includes Moderate Similarity Selection using a Pre-trained VLM to find image pairs and Modifying Text Generation via the Multimodal Reasoning Agent (MRA), which is based on an MLLM (MiniCPM-VL-2_6), to describe the transformation.
The VLM Finetuning utilizes the VLM's Q-Former to extract features and is trained with InfoNCE Loss to directly align composed queries and target images, bypassing intermediate textual representations.

EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

EMAC+ (Embodied Multimodal Agent for Collaborative Planning with VLM+LLM): introduces a novel embodied multimodal agent that collaboratively integrates a VLM Agent (Processes visual input) and an LLM Expert (Generates/refines plans) via a bidirectional training paradigm, utilizing PDDL (Translates visual to text), a Retrospective Feedback Mechanism (Provides execution feedback), Long-term Memory (Stores history/feedback), and an Action Mapping Dictionary (Maps text to control).
The framework dynamically refines high-level textual plans from the LLM expert using real-time visual feedback from the VLM agent executing low-level control tasks.
This approach enables the LLM expert to internalize visual environment dynamics through interactive experience, improving domain-specific comprehension and generating more accurate and feasible plans for complex robotic tasks.

SCIENCEBOARD: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

SCIENCEBOARD: introduces a realistic, multi-domain environment for evaluating multimodal autonomous agents in scientific workflows, featuring Environment (Virtual Machine), Software (Scientific applications), Agent (Computer-using agent), Evaluator (Evaluation system), Observation Space (Perception modalities), Action Space (Interaction methods), Memory (Agent's state history).
The framework provides an infrastructure enabling computer-using agents to assist in scientific workflows by interacting autonomously via GUI actions or generated code.
It includes a challenging benchmark of 169 high-quality, rigorously validated real-world tasks spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics.

Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

LLM Agent: introduces an LLM-based agent for autonomous spacecraft control in Kerbal Space Program Differential Games, using Environment (KSPDG) for simulation, processing State observations into a User prompt, feeding it to the LLM agent which generates an LLM reply with Function calling to produce an Action controlling the spacecraft.
The approach leverages prompt engineering and fine-tuning techniques on GPT-3.5 and LLaMA models to enable the agent to interpret real-time telemetry and output control commands.
The LLM-based agent achieved second place in the KSPDG challenge, demonstrating the potential of LLMs for autonomous space operations, particularly with fine-tuning on limited data.

SECVULEVAL: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection

Multi-agent pipeline: introduces a multi-agent system for C/C++ vulnerability detection, including a Normalization Agent (Parses function to AST), Planning Agent (Summarizes, creates vulnerability checklist), Context Agent (Extracts external context symbols), Detection Agent (Detects vulnerability, identifies statements), and Validation Agent (Evaluates detection, resolves disagreement).
The pipeline processes functions through sequential agents, with LLMs powering the Planning, Context, Detection, and Validation stages.
This multi-agent approach aims to decompose the complex task of vulnerability detection into smaller, manageable steps for improved LLM performance.

Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

Agentic Predictor: introduces a framework for efficient agentic workflow performance prediction, utilizing a Multi-View Workflow Encoder (Encodes workflow features), Decoder Networks (Reconstructs workflow inputs), Cross-Domain Unsupervised Pretraining (Refines workflow representations), Task Encoder (Encodes task description), Performance Predictor (Estimates workflow performance), and Predictor-Guided Search (Selects promising workflows).
The framework employs multi-view encoding of graph, code, and prompt features combined with cross-domain unsupervised pretraining to address workflow heterogeneity and limited labeled data.
By predicting performance, the approach enables faster and more accurate selection of optimal agentic workflow configurations compared to execution-based methods.

Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

GLIDER (Grounding Language Models as Efficient Decision-Making Agents via Offline HiErarchical Reinforcement Learning): introduces a hierarchical framework with a High-level policy (Plans sub-tasks) and a Low-level policy (Executes primitive actions) sharing an Actor-Critic (Shared model architecture) built on an LLM Backbone (Base language model) fine-tuned with LoRA (Parameter-efficient fine-tuning), trained through SFT (Behavior cloning stage), ORL (Offline RL refinement stage), and O2O (Online adaptation stage) using High-level replay buffer (Stores high-level data) and Low-level replay buffer (Stores low-level data) interacting with an Environment (Interactive task space), guided by High-Level Prompt (Guides high-level planning), Low-Level Prompt (Guides low-level execution), and Check Subtask Complete Prompt (Verifies subtask completion).
The framework decomposes complex tasks into sub-tasks planned by the high-level policy and executed as primitive actions by the low-level policy, enabling efficient exploration and learning for long-horizon tasks.
The hierarchical structure and multi-stage training pipeline, including behavior cloning and offline reinforcement learning, contribute to improved performance and generalization capabilities on interactive decision-making benchmarks.

NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

NeuSym-RAG: introduces a hybrid neural symbolic retrieval framework for PDF question answering, with Multiview Document Parsing (Parses PDF content), Relational Database (Stores structured data), Multimodal Vector Encoding (Encodes data to vectors), Vectorstore (Stores vector embeddings), LLM Agent (Plans and acts), Environment (Backend systems), Actions (Agent capabilities), and Prompt Template (Defines agent interaction).
The framework processes PDF documents into structured data and vector embeddings, enabling an LLM agent to iteratively retrieve information from both a database and a vectorstore.
This hybrid approach leverages multiple data views and retrieval strategies through executable actions to answer complex questions over semi-structured PDF content.

ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection

ReChisel (LLM-based agentic system): introduces an LLM-based agentic system with Generator (creates Chisel code), Compiler (translates Chisel to Verilog), Simulator (tests Verilog code), Inspector (collects feedback, trace, escape), Reviewer (analyzes trace/feedback, plans revision), Trace (history of iterations), Feedback (compilation/simulation results), Revision Plan (guidance for correction), Common Error Knowledge (pre-organized error fixes), and Escape Mechanism (breaks non-progress loops) components, designed to enhance Chisel code generation effectiveness.
The system iteratively refines generated Chisel code using a reflection mechanism that leverages feedback from compilation and simulation processes.
An escape mechanism is included to detect and break non-progress loops during the iterative refinement process.

Large Language Models for Planning: A Comprehensive and Systematic Survey

LLM-based Planning: introduces a comprehensive survey of methods that augment Large Language Models (processes input, generates output) with components like External Planners (generates formal plans), Memory Modules (stores, retrieves information), Validators (evaluates plans, outputs feedback), Data Sources (provides training data), Feedback Mechanisms (provides optimization signals), Decomposition Modules (breaks down tasks), External Executors (interacts with environment), and World Models (simulates environment dynamics) to enhance planning capabilities.
The survey categorizes approaches into external module augmented, finetuning-based, and searching-based methods, detailing planning definitions and evaluation frameworks.
The paper provides a systematic analysis of current advancements, challenges, and future directions in the field, serving as a resource for researchers.

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

FieldWorkArena: introduces a benchmark environment for evaluating agentic AI on real-world field work tasks, where a User downloads Input data and a Query from the Field Work Arena, an Evaluated agent performs Actions, generating an Execution log and Output, which an Evaluation program compares against Ground Truth to produce a Result.
The benchmark utilizes multimodal data including videos and documents from actual factory and warehouse settings.
Tasks are categorized into Planning, Perception, and Action, designed to assess agent capabilities in complex, dynamic environments.

DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

DoctorAgent-RL: introduces a multi-agent collaborative reinforcement learning framework, with Doctor Agent (optimizes questioning strategy), Patient Agent (simulates patient responses), Consultation Evaluator (provides multi-dimensional rewards), Supervised Fine-tuning (establishes baseline capabilities), Reinforcement Learning (optimizes strategy via interaction), and Dynamic Turn Budget Training Strategy (RL training strategy for efficiency), that models medical consultations as a dynamic decision-making process.
The framework enables the doctor agent to autonomously develop clinically-aligned questioning strategies through interactions guided by the evaluator's reward mechanism.
It utilizes the newly constructed MTMedDialog dataset for training and evaluation and demonstrates superior performance in multi-turn reasoning and diagnostic accuracy.

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

AgentRecBench: introduces, "benchmarking LLM agent-based personalized recommender systems", with Recommending Agents (LLM-based agents), Textual Experiment Environment (simulated interaction platform), U-R-I Network (user-review-item data structure), Datasets (source data), Standardized Query Functionality (environment interaction interface), Dynamic Data Visibility Control (data access management), Dynamic Planning (task decomposition module), Complex Reasoning (decision-making module), Tool Utilization (environment interaction module), Memory Management (experience storage/retrieval), and LLM (core language model), which provides a comprehensive benchmark and modular framework for evaluating agentic recommender systems.
The benchmark includes a textual environment simulator equipped with multi-domain datasets and a standardized agent development framework.
The framework facilitates rapid prototyping and systematic testing of recommendation agents across diverse scenarios and tasks.

Multi-Agent Collaboration via Evolving Orchestration

Puppeteer: introduces a multi-agent collaboration framework with a centralized orchestrator (Puppeteer) that dynamically directs LLM-based agents (Puppets) based on the evolving task state, using a Policy for agent selection and Orchestration for sequencing.
The framework employs Reinforcement Learning, guided by a Reward function from the Environment, to adaptively evolve the Puppeteer's Policy, optimizing agent selection and pruning for improved performance and efficiency.
This dynamic orchestration fosters the emergence of compact, cyclic reasoning structures among agents, enhancing collaborative effectiveness and reducing computational cost compared to static multi-agent systems.

LLM-Agent-Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer

LLM-Agent-Controller: introduces a multi-agent large language model system for control engineering problems, integrating a central Controller Agent with specialized auxiliary agents and a Supervisor for coordination.
The system leverages components like Retriever, Researcher, Reasoner, Planner, Debugger, Communicator, Critic, and Memory agents to enhance robustness, versatility, and efficiency in solving control theory tasks.
The framework is designed for user-friendly interaction, enabling users without prior control theory knowledge to input problems in natural language and receive complete solutions.

AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

Multi-Agent Framework for AMQA Construction: introduces AMQA, an Adversarial Medical Question-Answering dataset, with Clinical Vignette Filtering (Filters vignettes), Adversarial Variant Construction (Constructs variants), Manual Quality Control (Reviews quality), Generation-Agent (Generates descriptions), Fusion-Agent (Integrates descriptions), and Evaluation-Agent (Evaluates bias trigger) components, designed for automated, large-scale bias evaluation of LLMs in medical QA.
The framework generates adversarial patient descriptions by varying demographic attributes while keeping clinical details constant, enabling controlled testing of LLM performance differences across privileged and unprivileged groups.
The multi-agent design decomposes the complex task of generating adversarial vignettes into specialized sub-tasks handled by distinct LLM agents, followed by human review for quality assurance.

Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

MemGAS: introduces a framework for long-term conversational agents that enhances memory consolidation and retrieval using multi-granularity association and adaptive selection, incorporating LLM Agent, Multi-Granular Memory Unit, Memory Bank, Dynamical Memory Association, Association Graph, Entropy-Driven Granularity Selection, Personalized PageRank, and LLM-Based Redundancy Filtering components.
The framework constructs multi-granular memory units and builds dynamic associations using Gaussian Mixture Models and an association graph.
An entropy-based router adaptively selects optimal granularity for retrieval, and retrieved memories are filtered by an LLM to refine the final context.

Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

LINUXFL+: enhances fault localization for Linux kernel bugs, incorporating Directory-Aware Expansion, Potential-Cause Expansion, and Candidate Integration.
It refines initial agent predictions by leveraging the Codebase structure and historical knowledge from the Linux Kernel Mailing List, based on the Bug Report.
The framework aims to improve localization accuracy by expanding candidate selection based on directory context and potential bug causes.

VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

VLMLight: introduces a traffic signal control framework with Vision-Language Meta-Control and Dual-Branch Reasoning, integrating Scene Understanding, Safety-Prioritized Meta-Control, Routine Control Policy, and Deliberative Reasoning Policy, which includes AgentPhase, AgentPlan, and AgentCheck, interacting with a TSC Simulator, Trajectory Memory, Traffic Phase Embedding, Intersection Embedding, Value Network, Policy Network, and the Environment.
The framework uses a VLM for scene understanding and an LLM meta-controller to switch between a fast RL policy for routine traffic and a multi-agent LLM reasoning branch for critical scenarios.
This hybrid architecture balances the efficiency of RL with the interpretability and robustness of LLM reasoning, particularly for prioritizing emergency vehicles.

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

FPX (Adaptive Mixed Precision Inference Framework): introduces an adaptive mixed-precision inference framework with Adaptive Mixed-Precision Algorithm, Offline Calibration, Precision Assignment Function, FP8 kernel, and FP4 kernel, designed to balance speed and accuracy for LLM agents in latency-sensitive tasks.
The framework dynamically adjusts model precision at the operator level, selectively applying FP4 quantization to compression-tolerant layers while preserving FP8 for sensitive components.
FPX utilizes an offline calibration process to identify layers suitable for aggressive quantization, enabling fine-grained control over the latency-quality trade-off.

Judging with Many Minds: Do More Perspectives Mean Less Prejudice?

Multi-Agent LLM-as-Judge: introduces a study evaluating intrinsic biases in multi-agent LLM-as-Judge frameworks, including Multi-Agent-Debate (Debate framework) with Judge (Initial/final evaluator) and Critic (Critiques/debates judgments), and LLM-as-Meta-Judge (Meta-reasoning framework) with Judges (Independent evaluators) and Meta-Judge (Select mode) (Selects best judgment) or Meta-Judge (Conclude mode) (Generates new judgment), also incorporating PINE (Bias mitigation agent).
The Multi-Agent-Debate framework amplifies biases after the initial debate, while the LLM-as-Meta-Judge approach shows greater resistance to intrinsic biases.
Incorporating a bias-free agent like PINE effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios.

Improving Recommendation Fairness without Sensitive Attributes Using Multi-Persona LLMs

LLMFOSA (LLM-enhanced framework for Fair recommendation withOut Sensitive Attributes): introduces a framework to improve recommendation fairness without sensitive attributes using a Collaborative Encoder (learns user/item embeddings), a Multi-Persona Sensitive Information Inference Module (infers sensitive attributes) with a Persona Editor (generates diverse personas), Annotators (infer attributes using personas), and a Meta Summarizer (distills inference rationales), a Confusion-Aware Sensitive Representation Learning Module (refines sensitive representations) including a Sensitive Encoder (transforms to sensitive-aware embedding), Confusion Modeling (models annotator mislabeling), Consensus Regularization (aligns confusion matrices), and Fine-Grained Rationale Incorporation (incorporates inference rationales), a Preference Encoder (generates sensitive-blind embedding), and Model Optimization (optimizes MI objectives).
The framework leverages multi-persona LLMs to infer latent sensitive patterns from user behavior and incorporates these inferences into robust sensitive representations for fairness training.
Fairness is ultimately achieved by optimizing mutual information objectives to disentangle sensitive and sensitive-blind user representations.

Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Vibe Coding: introduces, "a human-centric paradigm", with Prompts (Natural language input), LLM (Code generation engine), Short-Term Context (Limited session memory), Developer (Human user), Thinking (Strategic problem formulation), Framework (Architectural awareness), Checkpoints (Version control), Debugging (Collaborative error resolution), Context (Information provision), where the developer guides an LLM through iterative prompts for creative exploration and rapid prototyping.
Agentic Coding: introduces, "an autonomous paradigm", with Objectives (High-level goals), Planner (Task decomposition module), Executor (Task execution module), Tool Use Environment (Integrated runtime environment), Sandbox Environment (Secure isolated environment), Long-Term Memory (Persistent state storage), API (External tools/interfaces), Git (Version control system), Test Suite (Automated tests), Multi-Agent Coordination (Specialized agents collaborating), Toolchain Integration (Full-stack tool orchestration), Validation Pipeline (Integrated QA loop), Security and Guardrails (Embedded safety mechanisms), Observability and Feedback (Monitoring and refinement), Deployment and CI/CD (Automated workflows), where goal-driven agents autonomously plan, execute, test, and iterate on complex software tasks with minimal human intervention.
The paper compares these two paradigms, highlighting differences in autonomy, architectural design, developer role, and practical implications for software development workflows and use cases.

Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

TME (Task Memory Engine): introduces a modular memory controller, with TRIM (Task Representation and Intent Management), TMS (Task Memory Structure), and LLM (Large Language Model), that transforms LLMs into robust, revision-aware agents using a spatial memory framework.
TME replaces linear context with a TMS-DAG forest to dynamically track subtasks, dependencies, and revisions, orchestrated by the TRIM module.
This graph-based approach ensures global task consistency, revision-aware reasoning, and token efficiency by retrieving relevant subgraphs for the LLM.

Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

ACBench (Agent Compression Benchmark): introduces a comprehensive benchmark for evaluating compressed LLMs' agentic capabilities, including Action Execution, Workflow Build, Long Context, and Real-World tasks, under various Quantization and Sparsification methods across different LLM categories (Small LM, Reason LM, Normal-LLM), analyzed using ERank, Top-K Ranking Correlation, and Energy metrics.
The benchmark assesses how compression impacts LLMs' ability to perform complex, multi-turn agentic tasks beyond traditional language modeling and understanding benchmarks.
The analysis tools provide insights into how compression affects model outputs, internal representations, and decision-making processes.

Frictional Agent Alignment Framework: Slow Down and Don't Break Things

FAAF: introduces a framework that conditions a language model on dialogue history and frictive states to generate interventions prompting reflection in collaborative tasks.
The framework utilizes a reference model and preference data to optimize an objective function for learning effective friction interventions.
By explicitly conditioning on frictive states, the approach aims to generate precise and interpretable interventions for dynamic human-AI collaboration.

CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

CoTGuard: introduces a trigger-based copyright protection framework for multi-agent LLM systems, with Multi-Agent LLM System, Chain-of-Thought Reasoning, Trigger Key, Task Type, Trigger Generation Function, Trigger Pattern, Prompt Modification, Intermediate Reasoning Trace, Repository of Known Trigger Patterns, Trigger Detection Function, Similarity Scoring, and Aggregation components, designed to detect copyright leakage by embedding triggers in intermediate reasoning steps.
The framework leverages Chain-of-Thought reasoning traces as an attack surface and detection medium, enabling fine-grained monitoring of content reproduction during agent collaboration.
CoTGuard achieves high detection accuracy with minimal impact on task performance by analyzing reasoning paths for trigger-induced patterns.

25th May 2025

SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

SeRL (Self-play Reinforcement Learning): introduces a framework for bootstrapping LLM training with limited data, featuring Self-Instruction (Generates/filters instructions) and Self-Rewarding (Estimates rewards).
Self-Instruction employs an Online Instruction Filter (Ensures quality/diversity/difficulty), and Self-Rewarding uses Majority Voting (Reward estimation mechanism) for unsupervised RL Training (Performs reinforcement learning) of the LLM (Large Language Model being trained).
The iterative self-play process enables performance comparable to training with extensive high-quality data and verifiable rewards.

ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense

ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast&Slow Reasoning): introduces a defense framework with an Offline Module (constructs database) for learning risk patterns and an Online Module (implements real-time defense) for hierarchical reasoning.
The Offline Module includes Risk pattern Extract (extracts patterns), Deduplication Optimization (removes redundancy), and Self-Learning Adversarial Optimization (iteratively refines patterns) to build the Risk Patterns Database (stores learned patterns).
The Online Module uses Query/Action Abstraction (abstracts inputs) and Online Hierarchical Risk Reasoning (balances detection efficiency) with Hybrid Retrieval (matches input patterns), Fast Thinking (intercepts high-confidence risks), and Slow Thinking (handles ambiguous inputs) for real-time defense.

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

DeepResearchGym: introduces an open-source sandbox for evaluating deep research systems, featuring a Search Sandbox with Web Corpora, a Distributed Dense Retrieval Backend using an Embedding Model and Approximate Nearest Neighbor Search, a Retrieval API, and an Evaluation Protocol leveraging the Researchy Questions Dataset, LLM-as-a-judge Methodology, Report Relevance Metrics, Retrieval Faithfulness Metrics, and Report Quality Metrics.
The framework provides a reproducible search API over large public web corpora (ClueWeb22-B, FineWeb) using a dense retriever and DiskANN for efficient retrieval.
DeepResearchGym includes a multi-dimensional evaluation protocol based on LLM-as-a-judge to assess report quality, factual grounding, and alignment with user needs on complex queries.

Sensorimotor features of self-awareness in multimodal large language models

Embodied MM-LLM System: introduces a system integrating a multimodal LLM with a mobile robot and its sensors to explore sensorimotor self-awareness, using a Robot, Sensors, ROS 2, a MM-LLM (Gemini 2.0 Flash), Memory, and evaluated by an LLM-as-a-Judge.
The system processes real-time sensor data and episodic memory to generate iterative self-predictions about its entity, dimensions, movement, and environment.
This approach demonstrates that multimodal LLMs can exhibit emergent self-awareness through sensorimotor experience and structured memory integration.

GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

GUARDIAN (GUARDing Intelligent Agent collaboratioNs): introduces a framework for detecting and mitigating safety concerns in LLM multi-agent collaborations, utilizing Graph Preprocessing, an Attributed Graph Encoder, a Time Information Encoder, an Attribute Reconstruction Decoder, a Structure Reconstruction Decoder, Anomaly scores, and an Updated Collaboration Network.
The approach models multi-agent interactions as a discrete-time temporal attributed graph and employs an unsupervised encoder-decoder architecture for anomaly detection.
A graph abstraction mechanism based on Information Bottleneck Theory compresses temporal interaction graphs while preserving essential patterns for robust anomaly identification.

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

MORALSIM: introduces a framework for evaluating LLM agents in repeated social dilemmas where ethical norms conflict with incentives, including Game Simulation Environment, LLM Agent, Agent Configuration, Game Type, Moral Context, Opponent Type, and Survival Risk components.
The framework systematically tests LLM behavior across varied game structures, moral framings, opponent types, and survival conditions.
Results show substantial variation in LLM moral behavior, highlighting conflicts between self-interest and ethical expectations.

SpeakStream: Streaming Text-to-Speech with Interleaved Data

SpeakStream: introduces a streaming text-to-speech system with a Transformer Decoder, Text Token Representation, Speech Token Representation, Interleaved Text-Speech Data, KV-Cache, VocStream, Streaming Upsampler, Streaming Vocoder, and Real-time Audio Player, designed for low-latency, incremental audio generation from streaming text.
The system trains a decoder-only transformer on interleaved text-speech sequences and uses a streaming vocoder pipeline for real-time waveform synthesis.
SpeakStream achieves low first-token latency and maintains coherence by conditioning generation on complete text and speech history stored in the KV-cache.

When Two LLMs Debate, Both Think They'll Win

Debate Simulation Framework: introduces a system to evaluate Large Language Models' confidence calibration in dynamic, adversarial settings using a multi-turn debate format and zero-sum structure.
The framework reveals systematic LLM overconfidence, confidence escalation across rounds, mutual high confidence claims, persistent self-debate bias, and misaligned private reasoning.
These findings highlight LLMs' limitations in self-assessment and belief updating when facing opposition, posing risks for deployment in assistant and agentic roles.

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval-Augmented Generation Across Learning Styles

Pedagogical Simulation Framework: introduces a novel simulation framework integrating a Teacher LLM Agent (Self-optimizing agent) and Student LLM Agents (Diverse learning profiles) with Persona-RAG (Personalized knowledge retrieval) and a Knowledge Base (Student prerequisite knowledge), where a Genetic Algorithm (Teacher strategy optimizer) evolves the teacher's strategy based on student performance.
This framework simulates diverse student populations and optimizes the teacher agent's dynamic pedagogical strategy through a closed-loop system based on measured learning outcomes.
Persona-RAG enhances personalization by tailoring knowledge retrieval to individual student reasoning paths, improving performance on complex, non-recall questions.

The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

HolmesEye (hybrid agentic framework): introduces, "a framework combining VLM and LLM agents", with VLM agent (Extraction), VLM agent (Analysis), LLM agent (Summarization), VLM agent (Inquiry Response), LLM agent (Decision Making) components, designed to infer private attributes from image collections by analyzing individual images and cross-image patterns.
The framework utilizes VLM agents for extracting intra-image details and analyzing inter-image relationships, while LLM agents guide the inference process, summarize findings, generate inquiries, and make final attribute decisions.
HolmesEye achieves superior accuracy in private attribute profiling, particularly for abstract traits, highlighting a significant privacy risk from vision-language models.

Incentivizing High-Quality Human Annotations with Golden Questions

Annotation System: introduces a principal-agent model for incentivizing high-quality human annotations, including a Principal (LLM Company), an Agent (Human Annotator), a Dataset (Unannotated data), an Annotated Dataset (Annotated data), Golden Questions (Monitoring dataset), MLE (Estimator), Test (Performance evaluation), and Contract (Payment scheme).
The system monitors annotator performance using Golden Questions and an MLE-based Test to determine payment via a Contract.
Golden Questions are selected using a Certainty Estimator, potentially based on a Reward Model, to ensure they have certain answers and similar format to other data.

ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

ScreenExplorer: introduces a VLM (Agent policy function), World Model (Predicts next state), GRPO (Policy optimization algorithm), Experience Stream Distillation (Filters, distills exploration data), Reward System (Interaction, exploration signals), GUI Environment (Real, dynamic interaction space), and Rollout Buffer (Stores experience tuples), designed to train a VLM agent for diverse exploration in open GUI environments.
The framework utilizes a world model for curiosity-driven rewards and distills exploration experience to enhance the agent's capabilities and reduce reliance on curated data.
ScreenExplorer trains the VLM agent via reinforcement learning in a real GUI environment, enabling adaptation and sustained exploration.

A Systematic Classification of Vulnerabilities in MoveEVM Smart Contracts (MWC)

MWC (MoveEVM Weakness Classification): introduces a systematic classification of vulnerabilities in MoveEVM smart contracts with F1 (Bytecode/ABI inconsistencies), F2 (Inter-module invariant violations), F3 (State reentrancy/synchronization bugs), F4 (Signature/Meta-transaction spoofing), F5 (Gas semantics manipulation), and F6 (Framework logic/abstraction errors) components.
This frame-based taxonomy defines 37 uniquely identified weakness classes (MWC-100 to MWC-136) grouped into these six top-level frames.
The classification provides a structured approach for identifying, mitigating, and preventing sophisticated exploits spanning Move and EVM semantics in hybrid environments.

MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

MetaMind: introduces a multi-agent framework for human-like social reasoning, with a Theory-of-Mind Agent (Generates mental state hypotheses), Domain Agent (Refines hypotheses with constraints), Response Agent (Generates and validates responses), and Social Memory (Stores user patterns/feedback).
The framework decomposes social understanding into three collaborative stages, inspired by psychological theories of metacognition.
This staged architecture enables large language models to infer unspoken intentions, incorporate social norms, and adapt responses for enhanced social intelligence.

24th May 2025

Security Concerns for Large Language Models: A Survey

Llama Guard 3: introduces, "a multi-layer safeguard", with Policy LLM (Filters text/images), Vision Encoder (Filters text/images), Main Model (Receives filtered input), where "Llama Guard 3 combines a policy LLM and a vision encoder to filter text and images before they reach the main model".
This system is designed to filter potentially harmful text and images before they are processed by the core language model.
It serves as an example of a multi-component defense strategy discussed in the survey for safeguarding LLM inputs.

PERSONALIZED SAFETY IN LLMS: A BENCHMARK AND A PLANNING-BASED AGENT APPROACH

RAISE: introduces a planning-based agent approach for personalized safety in LLMs, with an Offline Planner (LLM-guided MCTS) to discover optimal attribute acquisition paths and an Online Agent (dual-module execution) including an Acquisition Module and Abstention Module to execute the path and decide when to respond.
The Offline Planner uses LLM-guided MCTS to precompute optimal attribute query sequences, stored in Offline Data Storage, which the Online Agent's Acquisition Module retrieves via a Retrieval Mechanism during inference.
The Abstention Module dynamically assesses if the acquired context, gathered by querying attributes guided by the retrieved path, is sufficient for the LLM Backbone to generate a safe, personalized response.

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

CRMArena-Pro: introduces a benchmark for evaluating LLM agents on CRM tasks, featuring a Data Generation Pipeline (produces synthetic data), Synthetic Enterprise Data (realistic business data), Salesforce Org (Sandbox Environment) (testing environment), Simulated User (interacts with agent), Agent (LLM Agent) (system under evaluation), Large Language Models (LLMs) (power components), API Access (SOQL/SOSL) (agent tools), Answer Extractor (evaluates task completion), and LLM Judge (evaluates confidentiality awareness).
The benchmark utilizes a data generation pipeline to populate a Salesforce Org sandbox with realistic synthetic data for evaluating LLM agents on diverse business scenarios and interactions.
Evaluation components include a simulated user for multi-turn interactions, API access for agent actions, and LLM-based extractors and judges for performance and confidentiality assessment.

Multi-Party Conversational Agents: A Survey

MPCAs: introduces a survey of Multi-Party Conversational Agents, with all State of Mind Modeling (infer mental states), Semantic Understanding (understand dialogue content), and Agent Action Modeling (predict future flow) components, where the paper categorizes existing research into these three core themes essential for human-like social communication in group settings.
The survey explores recent progress in MPCAs by addressing how agents model participant mental states, understand dialogue content, and reason about future conversation flow.
The analysis underscores the importance of Theory of Mind and highlights multi-modal understanding as a promising direction for developing more capable agents.

Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning

SearchExpert: introduces a two-stage training framework for LLMs, including LLM (core model), SFTS (supervised training stage), RLSF (reinforcement training stage), and a Multimedia Agent (visual processing/generation), to enhance reasoning-intensive multimedia search capabilities.
The framework utilizes efficient natural language representations for search plans and automated data construction pipelines for training data generation.
RLSF incorporates a dual-component reward mechanism based on search result quality to improve reasoning capabilities for complex queries.

C³-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

LLM-based Agent: describes the multi-task execution process involving User (Proposes tasks), Tool (External functions), Action (Agent's steps), Observation (Environment feedback), Summary (Task completion feedback), LLM-based Agent (Processes, decides, acts), and Agent Parameters (Internal state/knowledge), evaluated by the C³-Bench benchmark.
The C³-Bench benchmark uses three challenges and fine-grained metrics to assess agent performance and identify weaknesses in handling tool relationships, hidden information, and decision trajectories.
Evaluation results highlight significant shortcomings in current models, especially concerning tool dependencies, long-context information, and policy switching frequency.

AI-Researcher: Autonomous Scientific Innovation

AI-Researcher: introduces a fully autonomous research system orchestrating the complete scientific discovery pipeline, including Knowledge Acquisition Agent (discovers papers and code), Resource Analyst (analyzes concepts and code), Idea Generator (generates novel ideas), Code Agent (implements algorithms), Advisor Agent (validates and provides feedback), Paper Agent (generates manuscripts), Secure Research Environment (containerized execution environment), and Structured Knowledge Exchange (facilitates agent collaboration).
The framework progresses through literature review, idea generation, algorithm implementation, experimental validation, and scholarly documentation with minimal human intervention.
AI-Researcher employs a comprehensive multi-agent architecture and introduces Scientist-Bench, a benchmark for evaluating autonomous research capabilities.

LLM-QFL: Distilling Large Language Model for Quantum Federated Learning

LLM-QFL: introduces a federated fine-tuning approach, with Server, Clients, Global Model, Local Model, Pre-Trained LLM, Fine-Tuned LLM, Local QNN, Optimizer, Knowledge Distillation, Client Selection, Termination Criteria, Feature Map, Ansatz, and PEFT Methods, that distills a large language model within quantum federated learning to enhance efficiency and performance.
The framework leverages the fine-tuned LLM as a controller to dynamically adjust optimizer steps, select clients, and determine training termination.
Knowledge distillation and PEFT methods enable efficient local adaptation of LLMs on resource-constrained quantum devices while preserving data privacy.

SEW: Self-Evolving Agentic Workflows for Automated Code Generation

SEW (Self-Evolving Workflow): introduces a novel framework that automatically generates and optimises multi-agent workflows for automated code generation, with Workflow Generation (Generates initial workflow), Workflow Evolution (Evolves workflow structure), Agent Evolution (Evolves agent prompts), Agents (Execute tasks), Evolutionary Prompts (Inputs for evolution), Evolution Operators (DE/HE methods), and LLM (Backbone model) components.
The framework leverages an evolutionary scheme to improve workflow topology and agent prompts.
SEW explores different workflow representation schemes and demonstrates improved performance on code generation benchmarks through self-evolution.

DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

DDO (Dual-Decision Optimization): introduces a novel LLM-based multi-agent framework for medical consultation, with Diagnosis Agent (estimates disease confidence), Policy Agent (generates candidate actions), Inquiry Agent (selects optimal inquiry), Patient Agent (simulates patient response), and Shared Memory (stores consultation state).
The framework decouples symptom inquiry and disease diagnosis, optimizing these two distinct sub-tasks independently through a collaborative multi-agent workflow.
DDO enhances disease discrimination via a learnable adapter and improves information gathering through an RL-based policy agent and strategic inquiry selection.

Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

D2D (Debate-to-Detect): introduces a structured multi-agent debate framework for misinformation detection, with Agent Layer (Affirmative, Negative, Judge agents, Domain-Specific Profiles, Shared Memory) and Orchestrator Layer managing a five-stage process (Opening Statement, Rebuttal, Free Debate, Closing Statement, Judgement) culminating in Multi-dimensional Evaluation.
The framework assigns domain-specific profiles to agents and orchestrates a progressive debate across distinct stages, enhancing logical coherence and evidence refinement.
A multi-dimensional evaluation mechanism assesses claims across Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics, providing interpretable authenticity scores.

MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures - A Comprehensive Framework

MASTER: introduces a novel security research framework for Multi-Agent Systems, with MAS Automatic Constructor (Builds MAS instances), Interaction Mechanism (Manages agent communication), Attack Strategies (Methods to exploit vulnerabilities), Defense Strategies (Mechanisms to protect MAS), Evaluation Methods (Metrics to assess security), Agents (LLM-based nodes with roles), Topology Graph (Represents agent connections), and Memory Modules (Store agent interaction history), designed to explore security risks under MAS attacks by focusing on diverse role configurations and topological structures.
The framework offers an automated construction process for different MAS setups and an information-flow-based interaction paradigm to emulate realistic MAS interactions.
It proposes scenario-adaptive attack and defense strategies leveraging role and topological information to tackle MAS security challenges in varied scenarios.

Benchmarking Poisoning Attacks against Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG): introduces RSB, a benchmark evaluating poisoning attacks against RAG systems, with Knowledge database (collection of textual content), Retriever (selects relevant documents), LLM (generates final response), and System prompt (conditions LLM generation) components.
The benchmark assesses 13 poisoning attacks and 7 defenses across diverse RAG architectures and datasets to understand security vulnerabilities.
Findings indicate RAG systems are susceptible to poisoning attacks, current defenses are limited, and advanced architectures offer varying robustness, highlighting the need for better defenses.

Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

Blueprint for Auditing Frameworks: introduces a three-layer architecture including Layer 1 (Handles COLS operations), Layer 2 (Encodes operations into commitments), and Layer 3 (Supports external verification), enabling Users (Initiates requests, receives reports) and Auditors (Verifies usage, identity, behavior) to audit hidden operations in Commercial Opaque LLM Services.
The framework aims to provide trustworthy and practical auditing across the COLS lifecycle, from execution to verification.
Layer 2 generates verifiable commitments from internal operations, which Layer 3 uses for external verification without exposing proprietary details.

A Survey of LLM × DATA

DATA4LLM: introduces techniques for large-scale data processing, storage, and serving to provide high-quality data for LLM lifecycle stages.
LLM4DATA: presents how LLMs function as general-purpose engines for data management tasks including manipulation, analysis, and system optimization.
The survey reviews the bidirectional relationship between LLMs and data management, detailing techniques for both DATA4LLM and LLM4DATA.

23rd May 2025

Self-Training Large Language Models with Confident Reasoning

CORE-PO: introduces a self-training method for large language models, with LLM, Reference Model, Confidence Computation, Preference Annotation, and Policy Optimization components, that fine-tunes LLMs to prefer high-confidence reasoning paths.
The method incorporates reasoning-level confidence estimation to identify high-quality reasoning paths, addressing limitations of methods relying solely on answer-level confidence.
CORE-PO uses Policy Optimization (Direct Preference Optimization) to train the LLM based on preference pairs derived from reasoning-level and answer-level confidence scores.

DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Multi-agent framework for automated construction of DanmakuTPP-QA: introduces a pipeline to build a multi-modal question-answering benchmark, with DanmakuTPP-Events (Input data), Task-design Agent (Generates evaluation tasks), Annotation Agent Group (Extracts multi-modal annotations), Quality-control Agent (Refines annotations), Visualization Agent (Creates visualizations), and Task-solve Agent Group (Solves tasks).
The framework leverages specialized agents powered by LLMs and MLLMs to generate tasks, annotate data, ensure quality, create visualizations, and produce ground-truth answers for temporal-visual-textual reasoning.
This multi-agent approach systematically constructs a high-quality dataset for evaluating models on complex multi-modal temporal point process understanding tasks.

An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

MAS (Multi-Agent AI Systems): introduces a framework for multi-agent AI systems, with AI Agent (autonomous entity), Agent State (internal memory/context), Agent Input (from others/environment), Agent Output (actions/messages), Agent Transition Kernel (state/output update rule), Multi-Agent Topology (communication graph), Topology Graph Update Function (evolves topology), Orchestrator (coordinates agents), Knowledge Base (system memory), Aggregator (combines agent outputs), Feedback (external/internal signals), Application Layer (human/environment interaction), Modeling Layer (agents/orchestration/memory), and Computation Layer (hardware infrastructure), formalizing key concepts and evaluating effectiveness and safety.
The framework defines MAS as a set of autonomous agents interacting via a dynamic communication graph, processing inputs over time, with agent behavior and system topology updated by feedback.
The paper analyzes MAS effectiveness through task allocation, robustness, and feedback integration perspectives and explores safety challenges, including vulnerability propagation and the impact of topology.

Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation

Persona Alchemy (SCT-based framework): introduces a system for designing, evaluating, and implementing psychologically grounded LLM agents with LLM Instances, Persona Neo4j Adapter, Neo4j, Text Analyzer, Personal Factors, Environment, and SCT Constructs.
The framework integrates Personal Factors, Environment, and Behavior, evaluated using SCT Constructs, to create dynamic and consistent agent personas grounded in Social Cognitive Theory.
It leverages multiple LLM instances, a Neo4j graph database, and a Text Analyzer for persona design, data management, and evaluation processes.

Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play

LLM+DEBRIEF: introduces a multi-agent learning framework for autonomous vehicles that leverages natural language communication and centralized reflection via large language models to enhance cooperation in simulated driving scenarios.
The framework enables agents to refine their communication and motion control policies through trial-and-error interactions and post-episode discussions.
Agents use Chain-of-Thought reasoning, environment observations, and learned knowledge to generate natural language messages and high-level driving commands.

Single-agent or Multi-agent Systems? Why Not Both?

MAS (Multi-Agent Systems): introduces a comprehensive empirical comparison of MAS and SAS paradigms, proposing a hybrid agentic paradigm with Agent Routing and Agent Cascade strategies, and a Confidence-guided Critical Path Tracing method to improve efficiency and effectiveness.
The paper models agentic execution as a directed graph where nodes are LLM agents or tools, comparing MAS (multiple LLM agents) and SAS (single LLM agent) performance across various tasks.
Findings indicate that MAS advantages diminish with more capable LLMs, motivating the proposed hybrid approach that selectively routes or cascades tasks between SAS and MAS based on complexity and evaluation.

Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control

Collaborative Memory: introduces a framework for multi-user, multi-agent systems, with Users (Human participants), Agents (LLM-based specialized entities), Resources (External tools, APIs, data), Dynamic bipartite access graphs (Time-dependent user-agent/agent-resource permissions), Private Memory (User-specific memory fragments), Shared Memory (Selectively shared memory fragments), Memory fragments (Stored interaction logs/knowledge), Read policy (Filters memory for retrieval), Write policy (Determines memory storage/sharing), Coordinator (Selects agents for queries), Aggregator (Synthesizes agent responses), Memory Encoder (Maps traces to fragments), Memory Retrieval (Retrieves relevant fragments), Policy Instantiation (Defines read/write rules), Multi-Agent Interaction Loop (Orchestrates agent interactions), and Vector embeddings (Represents memory fragments), designed for permission-aware memory sharing.
The framework utilizes dynamic bipartite graphs to model time-varying access permissions between users, agents, and resources.
A two-tier memory system, comprising private and shared memory, is governed by fine-grained read and write policies to enable controlled knowledge transfer while maintaining privacy.

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

BMSQL: i

Name		Name	Last commit message	Last commit date
Latest commit History 1,281 Commits
Autonomous_Agents_Resources.md		Autonomous_Agents_Resources.md
Autonomous_agent_logo.png		Autonomous_agent_logo.png
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers

3rd July 2025

2nd July 2025

1st July 2025

1st July 2025

30th June 2025

29th June 2025

28th June 2025

27th June 2025

26th June 2025

25th June 2025

24th June 2025

23th June 2025

18th June 2025

17th June 2025

16th June 2025

15th June 2025

14th June 2025

13th June 2025

12th June 2025

11th June 2025

10th June 2025

9th June 2025

8th June 2025

7th June 2025

6th June 2025

5th June 2025

4th June 2025

3rd June 2025

2nd June 2025

1st June 2025

31st May 2025

30th May 2025

29th May 2025

28th May 2025

27th May 2025

26th May 2025

25th May 2025

24th May 2025

23rd May 2025

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Packages