News · 2025-01-03
In the movie "Iron Man," Tony Stark's assistant J.A.R.V.I.S. can help him control various systems and complete tasks automatically, a vision that has left countless viewers envious.
Now, such a super-intelligent assistant is finally becoming a reality!
With the rapid evolution of multimodal large language models, OS Agents have emerged: agents that can operate computers and phones and automatically handle tedious tasks for you.
From Anthropic's Computer Use to Apple's Apple Intelligence, Zhipu AI's AutoGLM, and Google DeepMind's Project Mariner, the ambitions of the tech giants all point to the same goal: building a true operating-system-level intelligent assistant.
OS Agents are no longer just "assistants"; they are rewriting the rules of human-computer interaction.
Recently, Zhejiang University, OPPO, and ten other institutions jointly published a survey, "OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use," which not only explains the core technical architecture of OS Agents in detail, but also reviews their evaluation methods and future challenges.
Could the next big thing in tech be OS Agents?
Super AI assistants like Jarvis, commonly referred to as OS Agents, can automate various tasks on computing devices such as computers and mobile phones through the environments and interfaces provided by operating systems (OS), such as graphical user interfaces (GUIs).
OS Agents have enormous potential to improve the lives of billions of users worldwide. Imagine a world where everyday activities such as online shopping and booking travel are handled seamlessly by these agents, dramatically increasing people's efficiency and productivity.
In the past, AI assistants such as Siri [1], Cortana [2], and Google Assistant [3] demonstrated this potential, but limited by the modeling capabilities of their time, these products could only accomplish a narrow range of tasks.
Fortunately, the development of multimodal large language models, represented by the Gemini [4], GPT [5], Grok [6], Yi [7], and Claude [8] series (ranked according to the Chatbot Arena LLM Leaderboard as of December 22, 2024 [9]), has opened up new possibilities in this field.
(M)LLMs demonstrate an impressive ability to enable OS Agents to better understand complex tasks and execute them on computing devices.
Foundation model companies and phone manufacturers have been very active in this area recently; examples include Anthropic's Computer Use [10], Apple's Apple Intelligence [11], Zhipu AI's AutoGLM [12], and Google DeepMind's Project Mariner [13].
Among them, Computer Use employs Claude [14] to interact directly with the user's computer, achieving seamless task automation.
At the same time, academia has proposed various approaches to building (M)LLM-based OS Agents.
For example, OS-Atlas [15] proposes a foundational GUI model: by synthesizing GUI operation data across multiple platforms, it greatly improves the model's ability to operate GUIs and its performance on out-of-distribution (OOD) tasks.
OS-Copilot [16] is an OS Agent framework that enables agents to automate a broad range of computer tasks with little supervision, demonstrating generalization and self-improvement across a variety of applications.
This paper is a comprehensive review of OS Agents.
First, it clarifies the foundations of OS Agents, discussing their key elements (environment, observation space, and action space) and outlining core capabilities such as understanding, planning, and executing operations.
It then examines methods for constructing OS Agents, focusing on the development of domain-specific foundation models and agent frameworks.
It then reviews evaluation protocols and benchmarks in detail, showing how OS Agents can be evaluated across a variety of tasks.
Finally, the paper discusses current challenges and points out potential directions for future research, including security and privacy, and personalization and self-evolution.
This paper aims to map the current state of OS Agents research and to support both academic research and industrial development.
To further drive innovation in the field, the team also maintains an open-source GitHub repository with 250+ papers on OS Agents and other related resources, which is continuously updated. (Link at the end of the article.)
For OS Agents to achieve general control of computing devices, they must interact through the environments and input/output interfaces provided by the operating system.
To meet this interaction requirement, existing OS Agents rely on three key elements: the environment, the observation space, and the action space.
Given these key elements, interacting with the operating system correctly and effectively puts the agents' own abilities to the test in every respect.
The core capabilities that OS Agents must master can be summarized as understanding, planning, and action execution.
To build OS Agents that can perform tasks efficiently, the key lies in developing well-adapted foundation models.
These models must not only understand complex screen interfaces but also execute tasks in multimodal scenarios.
The architectures and training strategies of these foundation models are summarized below:
Architecture: Mainstream model architectures fall into four categories:
1. Existing LLMs: directly adopt an open-source large language model architecture and feed structured screen-interface information to the LLM as text, so the model can perceive the environment;
2. Existing MLLMs: directly adopt an open-source multimodal large language model architecture, integrating text and visual processing to improve GUI understanding and reduce the information loss caused by converting visual information into text;
3. Concatenated MLLMs: bridge an LLM with a visual encoder, offering greater flexibility, since different language models and visual models can be combined according to task requirements (a minimal sketch follows this list);
4. Modified MLLMs: optimize an existing MLLM architecture to address scenario-specific challenges, for example by adding extra modules (high-resolution visual encoders, image-segmentation modules, etc.) to perceive and understand screen-interface details more precisely.
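To make the "Concatenated MLLM" idea concrete, here is a minimal PyTorch sketch, not any specific model from the survey: a stand-in vision encoder is bridged to a stand-in LLM backbone through a projection layer, and the projected visual token is concatenated with the text tokens. All module sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatenatedMLLM(nn.Module):
    """Toy 'Concatenated MLLM': a vision encoder bridged to an LLM backbone."""
    def __init__(self, vision_dim=256, llm_dim=512):
        super().__init__()
        # Stand-ins for a pretrained vision encoder and language model.
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)  # the "bridge"
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, screenshot, text_embeds):
        # Encode the screenshot, project it into the LLM's embedding space,
        # and prepend the visual token to the text tokens.
        vision_token = self.projector(self.vision_encoder(screenshot)).unsqueeze(1)
        return self.llm(torch.cat([vision_token, text_embeds], dim=1))

model = ConcatenatedMLLM()
out = model(torch.randn(1, 3 * 224 * 224), torch.randn(1, 16, 512))
print(out.shape)  # torch.Size([1, 17, 512])
```

The appeal of this design is that either side of the bridge can be swapped out independently, matching the flexibility the survey highlights.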
Pre-training: Pre-training lays the foundation for the model and builds its understanding of screen interfaces from massive data. Data sources include public datasets and synthetic datasets, and pre-training tasks cover screen grounding, screen understanding, and optical character recognition (OCR); hypothetical sample formats for these tasks are sketched below.
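As an illustration only (the survey does not prescribe a data format), samples for the three pre-training task types might look like this; all field names and values are assumptions:

```python
# Hypothetical screen-grounding sample: the model learns to map a
# referring expression to a region of the screenshot.
grounding_sample = {
    "image": "screenshot_001.png",
    "instruction": "the blue 'Submit' button at the bottom of the form",
    "target_bbox": [412, 880, 560, 928],  # x1, y1, x2, y2 in pixels
}

# A screen-understanding sample pairs the screenshot with a description,
# while an OCR sample pairs a cropped region with its transcribed text.
understanding_sample = {"image": "screenshot_001.png",
                        "caption": "A login form with email and password fields"}
ocr_sample = {"image_region": [100, 40, 300, 64], "text": "Order history"}

print(grounding_sample["target_bbox"])
```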
Supervised Fine-tuning: Supervised fine-tuning adapts the model to GUI scenarios and is an important means of improving OS Agents' planning and execution abilities, for example by recording task-execution trajectories to generate training data, or by rendering screen-interface details with HTML to improve the model's generalization to different GUIs; a sketch of the trajectory-to-training-data idea follows.
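A minimal sketch of converting a recorded trajectory into supervised fine-tuning pairs; the helper name, action strings, and observation format are hypothetical, not from the survey:

```python
# Each step's goal + history + observation becomes the prompt, and the
# demonstrated action becomes the target response.
trajectory = [
    {"observation": "<html>...login page...</html>", "action": "click(id=7)"},
    {"observation": "<html>...dashboard...</html>", "action": "type(id=3, 'report')"},
]

def to_sft_pairs(goal, trajectory):
    pairs, history = [], []
    for step in trajectory:
        prompt = f"Goal: {goal}\nHistory: {history}\nScreen: {step['observation']}"
        pairs.append({"prompt": prompt, "response": step["action"]})
        history.append(step["action"])
    return pairs

print(to_sft_pairs("Open the monthly report", trajectory)[0]["response"])  # click(id=7)
```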
Reinforcement Learning: At this stage, reinforcement learning realizes the paradigm shift from (M)LLMs-as-feature-extractors to (M)LLM-as-Agent, helping OS Agents interact with dynamic environments and continuously optimize decisions based on reward feedback. This approach not only improves agent alignment but also gives visual and multimodal agents stronger generalization and task adaptability.
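The loop below is a minimal sketch of that reward-feedback cycle; sample_action, execute, and update_policy are hypothetical stand-ins for the real model, environment, and optimizer:

```python
import random

def sample_action(policy, observation):
    # A real agent would decode an action from the (M)LLM given the screen.
    return random.choice(policy["actions"])

def execute(action):
    # A real environment would apply the action and score task progress.
    return f"screen after {action}", random.random()

def update_policy(policy, transitions):
    # Placeholder for a policy-gradient or preference-optimization step.
    pass

policy = {"actions": ["click", "type", "scroll"]}
observation, transitions = "initial screen", []
for _ in range(5):
    action = sample_action(policy, observation)
    observation, reward = execute(action)
    transitions.append((observation, action, reward))
update_policy(policy, transitions)
print(f"collected {len(transitions)} reward-labelled transitions")
```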
Recent papers on foundation models for OS Agents are collected in a table in the survey.
In addition to a powerful foundation model, OS Agents also need to be combined with an agent framework that strengthens their perception, planning, memory, and action capabilities.
These modules work together to enable OS Agents to respond efficiently to complex tasks and environments.
The following is a summary of the four key modules in the OS Agents framework:
Perception: Perception acts as the "eyes" of OS Agents, observing the environment through multimodal inputs (e.g., screenshots, HTML documents). We subdivide perception into:
1. Textual perception: converting the state of the operating system into a structured text description, such as a DOM tree or HTML file (a minimal sketch follows this list);
2. Screen-interface perception: using a visual encoder to understand screenshots, accurately identifying key elements through visual grounding (e.g., buttons, menus) and semantic association (e.g., HTML tags).
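A minimal sketch of textual perception, flattening a DOM-like tree into indented text a text-only LLM can read; the tree structure and field names are illustrative assumptions:

```python
# Flatten a DOM-like tree into a structured text description.
dom_tree = {
    "tag": "form",
    "children": [
        {"tag": "input", "attrs": {"id": "user", "placeholder": "Email"}},
        {"tag": "button", "attrs": {"id": "submit"}, "text": "Sign in"},
    ],
}

def describe(node, depth=0):
    attrs = " ".join(f'{k}="{v}"' for k, v in node.get("attrs", {}).items())
    tag = f"<{node['tag']}" + (f" {attrs}>" if attrs else ">")
    lines = [("  " * depth + f"{tag} {node.get('text', '')}").rstrip()]
    for child in node.get("children", []):
        lines.extend(describe(child, depth + 1))
    return lines

print("\n".join(describe(dom_tree)))
```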
Planning: As the "brain" of OS Agents, planning formulates execution strategies for tasks and can be divided into:
1. Global planning: generate a complete plan at once and then execute it;
2. Iterative planning: dynamically adjust the plan as the environment changes, so the agent can adapt to the continuously updated screen interface and task requirements (see the sketch below).
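A minimal sketch contrasting the two strategies; plan_once, plan_next_step, and the toy environment are hypothetical stand-ins for (M)LLM calls and real OS state:

```python
state = {"steps_done": 0}  # toy environment state

def plan_once(goal):
    # Global planning: produce the full plan in one shot.
    return ["open app", "search item", "confirm"]

def plan_next_step(goal, obs):
    # Iterative planning: decide only the next step from the latest observation.
    return f"step {obs['steps_done'] + 1}"

def execute(step):
    state["steps_done"] += 1

def global_planning(goal):
    for step in plan_once(goal):
        execute(step)

def iterative_planning(goal):
    while state["steps_done"] < 3:          # re-observe and re-plan each turn
        execute(plan_next_step(goal, state))

global_planning("buy headphones")
print(state)  # {'steps_done': 3}
```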
Memory: The "memory" of the OS Agents framework stores task data, operation history, and environment state. Memory is divided into three types:
1. Internal Memory: stores operation history, screenshots, status data, and dynamic environment information, supporting contextual understanding and trajectory optimization during task execution, for example parsing screen layout from screenshots or generating decisions based on historical actions;
2. External Memory: provides long-term knowledge support, for example by calling external tools (such as APIs) or knowledge bases to obtain domain background knowledge that assists decisions on complex tasks;
3. Specific Memory: focuses on task-specific knowledge and user needs, such as storing subtask decomposition methods, user preferences, or screen-interface interaction functions, providing highly targeted operational support.
In addition, we also summarize a variety of memory-optimization strategies. (A minimal data-structure sketch follows.)
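One way to picture the three memory types is as fields of a single agent-memory object; this is a hypothetical sketch, and the class, fields, and example values are assumptions, not the survey's design:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    internal: list = field(default_factory=list)   # operation history, screenshots, state
    external: dict = field(default_factory=dict)   # knowledge fetched via tools/APIs
    specific: dict = field(default_factory=dict)   # subtask recipes, user preferences

    def record_step(self, action, screen_state):
        # Internal memory grows with every executed action.
        self.internal.append({"action": action, "screen": screen_state})

    def recall_recent(self, k=5):
        # Context for the next decision: the last k steps.
        return self.internal[-k:]

memory = AgentMemory(specific={"preferred_airline": "any nonstop"})
memory.record_step("click(search)", "results page")
print(memory.recall_recent())
```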
Action: We define the scope of OS Agents' actions as the action space, which covers the ways agents interact with the operating system. We subdivide it into three categories:
1. Input operations: the basis of interaction between OS Agents and digital screen interfaces, mainly including mouse, touch, and keyboard operations;
2. Navigation operations: enable OS Agents to explore and move to the target platform to obtain the information needed for task execution;
3. Extended operations: break through the limitations of traditional screen-interface interaction and give agents more flexible task-execution capabilities, such as code execution and API calls.
(A sketch of such an action space follows.)
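A minimal typed sketch of the three action categories; the class names and fields are illustrative assumptions rather than any benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class InputAction:            # mouse / touch / keyboard
    kind: str                 # e.g. "click", "type", "scroll"
    target: str               # the on-screen element to act on
    text: str = ""            # payload for keyboard input

@dataclass
class NavigationAction:       # move between pages, apps, or platforms
    destination: str

@dataclass
class ExtendedAction:         # beyond the GUI: code execution, API calls
    code: str = ""
    api: str = ""

Action = Union[InputAction, NavigationAction, ExtendedAction]

step: Action = InputAction(kind="type", target="#search", text="OS Agents survey")
print(step)
```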
Recent papers on OS Agent frameworks are collected in a table in the survey.
Scientific evaluation has played a key role in the development of OS Agents, helping developers measure the performance of agents in various scenarios.
Recent papers on OS Agents evaluation benchmarks are summarized in a table in the survey.
The core of OS Agent evaluation can be summarized into two key questions: how the evaluation process should be conducted, and what needs to be evaluated.
The following focuses on these two issues: evaluation protocols and evaluation metrics for OS Agents.
To fully evaluate the performance of OS Agents, researchers have developed a variety of evaluation benchmarks covering different platforms, environment settings, and task categories.
These benchmarks provide a scientific basis for measuring agents' cross-platform adaptability and dynamic task-execution ability.
Evaluation Platform: Evaluation platforms build integrated evaluation environments. Different platforms pose unique challenges and evaluation priorities; we divide them into three categories: mobile, desktop, and web.
Benchmark Setting: This section divides the evaluation environments for OS Agents into two broad categories, static and interactive, and further subdivides interactive environments into simulated and real-world environments. Static environments suit offline evaluation of basic tasks, while interactive environments (especially real-world ones) better test the actual capabilities of OS Agents in complex, dynamic scenarios. Real-world environments emphasize generalization and dynamic adaptability, an important direction for future evaluation.
Tasks: To fully assess the capabilities of OS Agents, current benchmarks integrate a variety of specialized tasks, ranging from system-level tasks (such as installing and uninstalling applications) to everyday application tasks (such as sending emails and shopping online). They can be divided into the following categories:
1. GUI Grounding: evaluates the ability of OS Agents to convert commands into screen-interface operations, i.e., how they interact with specified operable elements in the operating system;
2. Information Processing: evaluates the ability of OS Agents to process and summarize information efficiently, especially extracting useful information from large amounts of data in dynamic and complex environments;
3. Agentic Tasks: evaluate the core capabilities of OS Agents, such as planning and executing complex tasks; this type of task gives the agent a goal or instruction to complete without explicit step-by-step guidance.
(A sketch of two common scoring ideas follows.)
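As a minimal sketch of how such benchmarks might score an agent, assuming a simple episode format that is not from any specific benchmark, two intuitive measures are step-level accuracy (did each predicted action match the reference?) and task success rate (did the whole episode reach its goal?):

```python
# Hypothetical episode records: predicted vs. reference action sequences,
# plus whether the end goal was reached.
episodes = [
    {"pred": ["click", "type", "submit"], "gold": ["click", "type", "submit"], "goal_reached": True},
    {"pred": ["click", "scroll"],         "gold": ["click", "type"],           "goal_reached": False},
]

matches = sum(p == g for ep in episodes for p, g in zip(ep["pred"], ep["gold"]))
total_steps = sum(len(ep["gold"]) for ep in episodes)
step_accuracy = matches / total_steps
success_rate = sum(ep["goal_reached"] for ep in episodes) / len(episodes)
print(f"step accuracy: {step_accuracy:.2f}, task success rate: {success_rate:.2f}")
# step accuracy: 0.80, task success rate: 0.50
```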
This section discusses the main challenges and future directions of OS Agents, focusing on Safety & Privacy and Personalization & Self-Evolution.
Security and privacy are areas that OS Agents development must take seriously.
OS Agents face a variety of attacks, including indirect prompt injection, malicious pop-up windows, and adversarial instruction generation, which can cause the system to perform incorrect operations or leak sensitive information.
Although there are currently security frameworks for LLMs, defense mechanisms for OS Agents are still inadequate.
Current research focuses mainly on designing targeted defenses against specific threats such as injection attacks and backdoor attacks; a comprehensive and extensible defense framework is urgently needed to improve the overall security and reliability of OS Agents.
To evaluate the robustness of OS Agents in different scenarios, agent security benchmarks such as ST-WebAgentBench [17] and MobileSafetyBench [18] have also been introduced to comprehensively test and improve system security.
Personalized OS Agents need to constantly adjust their behavior and functionality based on user preferences.
Multimodal large language models increasingly support understanding user history and adapting dynamically to user needs; OpenAI's Memory feature [19] has made some progress in this direction.
Allowing agents to learn and optimize continuously through user interaction and task execution improves both personalization and performance.
In the future, memory mechanisms will extend to more complex forms such as audio, video, and sensor data, providing more advanced predictive capabilities and decision support.
At the same time, user-data-driven self-optimization will further enhance the user experience.
The development of multimodal large language models has created new opportunities for OS Agents, bringing the idea of implementing advanced AI assistants closer to reality.
This review aims to provide an overview of the fundamentals of OS Agents, including their key components and capabilities.
In addition, the article reviews multiple approaches to building OS Agents, with a special focus on domain-specific foundation models and agent frameworks.
For evaluation protocols and benchmarks, the team analyzed the various evaluation metrics in detail and categorized benchmarks by environment, setting, and task.
Looking ahead, the team identified challenges that require ongoing attention, such as security and privacy, and personalization and self-evolution; these areas remain the focus of further research.
This review summarizes the current state of the field and points out potential directions for future work aimed at contributing to the continued development of OS Agents and enhancing their application value and practical significance in academia and industry.
The authors welcome criticism and corrections of any mistakes, and look forward to exchanges with peers and friends!
Paper link: https://github.com/OS-Agent-Survey/OS-Agent-Survey
Project homepage: https://os-agent-survey.github.io/