Leading the Human-Computer Interaction Revolution? Microsoft Research Team Publishes an 80-Page Overview of Large Model GUI Agents

#News ·2025-01-09

The paper's lead authors, Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, and others, are from the Data, Knowledge, and Intelligence (DKI) team and are core developers of UFO, Microsoft's GUI agent for Windows.

The graphical user interface (GUI), one of the most iconic innovations of the digital age, has greatly simplified human-computer interaction. From simple icons, buttons, and windows to complex multi-application workflows, GUIs provide users with an intuitive and friendly experience. In the process of automation and intelligent upgrading, however, traditional GUI control has always faced technical challenges. Scripted or rule-driven approaches have been helpful in certain scenarios, but as modern application environments become more complex and dynamic, their limitations become increasingly apparent.

In recent years, the rapid development of artificial intelligence and Large Language Models (LLMs) has brought transformative opportunities to this field.

Recently, the Microsoft research team released an 80-page, 30,000-plus-word review paper, "Large Language Model-Brained GUI Agents: A Survey." The paper systematically surveys the current state, technical frameworks, challenges, and applications of GUI agents driven by large models. By combining large language models (LLMs) with visual language models (VLMs), GUI agents can operate graphical interfaces automatically according to natural language instructions and complete complex multi-step tasks. This breakthrough not only overcomes the inherent bottlenecks of traditional GUI automation but also pushes human-computer interaction from "click + input" toward "natural language + intelligent operation".


Link: https://arxiv.org/abs/2411.18279

Limitations and new challenges of traditional GUI automation

In the past few decades, GUI automation technology has relied on two main approaches:

  • Scripting methods: Tools such as Selenium and AutoIt rely on pre-written, fixed scripts to simulate clicking, typing, and so on. This approach works for relatively stable interfaces and flows, but when an interface is updated frequently or its layout changes dynamically, scripts break easily and maintenance costs rise.
  • Rule-driven approaches: GUI components (such as buttons and input fields) are identified according to preset rules, and actions are performed accordingly. This type of approach is inflexible and struggles with complex or non-standardized workflows.

These traditional methods fall short in the face of highly dynamic, complex cross-application tasks. For example:

  • How can an automated system understand the content of a web page and extract the key information the user needs?
  • How can it adapt to the variety of GUI interfaces across different devices and operating systems?
  • How can it maintain contextual coherence and consistency across multi-step tasks?

Large models: an intelligent GUI interaction engine


Figure 1: Conceptual demonstration of GUI agents.

Microsoft's review points out that large language models (LLMs) play a key role in solving the above problems, with advantages in three main areas:

1. Natural language understanding and task planning

Large models such as the GPT series have excellent natural language understanding and generation capabilities. They can parse simple, intuitive user instructions such as "Open a file, extract key information, and send it to a colleague" into a series of actionable steps. Through chain-of-thought reasoning and task decomposition, agents can progressively complete highly complex processes.

2. Visual understanding and environment perception

With the introduction of multimodal techniques, visual language models (VLMs) can process both text and visual information. By analyzing GUI screenshots or UI structure trees, agents can understand the layout and meaning of interface elements (buttons, menus, text boxes). This gives the agent human-like visual understanding, enabling it to operate precisely in dynamic interfaces, for example, automatically locating the search bar on a web page and entering keywords, or finding a specific button in a desktop application to perform a copy-and-paste operation.
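As a concrete illustration of this multimodal perception step, the minimal sketch below packages a GUI screenshot together with a text instruction and sends both to a vision-capable model, assuming an OpenAI-style chat-completions API; the model name and the `describe_screenshot` helper are illustrative and not taken from the paper.

```python
import base64

from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()  # API key is read from the environment


def describe_screenshot(instruction: str, screenshot_path: str) -> str:
    """Send a GUI screenshot plus a user instruction to a multimodal model."""
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```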

3. Dynamic execution and adaptive capability

Compared to traditional scripting methods, GUI agents built on large models can respond to real-time feedback and adjust their strategies dynamically. When the interface state changes or an error message appears, the agent can try new paths and plans instead of relying on a fixed script flow.


Figure 2: The development and main work of GUI agents.

With the support of large models, GUI agents bring a qualitative improvement to human-computer interaction. Users only need to give natural language instructions, and the agent can complete goals that would otherwise require tedious clicking and complex operations. This not only reduces the user's operational and learning costs, but also reduces dependence on specific software APIs and improves system versatility. As shown in Figure 2, since 2023 research on GUI agents driven by large models has appeared in rapid succession and gradually become a frontier hotspot.

The core architecture of GUI agents

Microsoft's review states that a large model-driven GUI agent typically includes the following key components, as shown in Figure 3:

图片

Figure 3: GUI agent basic architecture.

1. Operating environment awareness

Input data include GUI screenshots, UI structure trees, element attributes (type, label, location), and window hierarchy information. Through tools such as Windows UI Automation and the Android Accessibility API, agents can capture interface information effectively.
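The sketch below shows one plausible way to represent this captured environment information in code; the class and field names are illustrative assumptions, not the paper's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UIElement:
    """One control captured from the UI structure tree."""
    control_type: str                        # e.g. "Button", "Edit", "MenuItem"
    label: str                               # visible text or accessibility name
    bounding_box: tuple[int, int, int, int]  # (left, top, width, height) in pixels
    is_enabled: bool = True
    automation_id: Optional[str] = None      # platform-specific identifier, if any


@dataclass
class Observation:
    """A single snapshot of the operating environment handed to the agent."""
    screenshot_path: str                                # captured screen image
    elements: list[UIElement] = field(default_factory=list)
    focused_window: str = ""                            # title of the foreground window
```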

2. Prompt Engineering

The agent combines the user instruction with the current GUI state, builds a prompt, and uses a large language model to generate a plan for the next actions. For example, "user command + screenshot + UI element attributes" is processed by the LLM, and the agent outputs clear operation steps (click, input, drag, and so on).
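A minimal sketch of this prompt assembly is shown below. It reuses the illustrative Observation type from the perception sketch above; the prompt wording and the JSON reply format are assumptions for illustration.

```python
def build_prompt(user_request: str, observation: Observation, history: list[str]) -> str:
    """Combine the user instruction, the current GUI state, and prior steps
    into a single prompt asking the LLM for the next action."""
    element_lines = [
        f"[{i}] {e.control_type} '{e.label}' at {e.bounding_box}"
        for i, e in enumerate(observation.elements)
    ]
    return (
        "You are a GUI agent. Decide the next single action.\n"
        f"User request: {user_request}\n"
        f"Foreground window: {observation.focused_window}\n"
        "Visible controls:\n" + "\n".join(element_lines) + "\n"
        f"Previous steps: {history}\n"
        'Reply with JSON: {"action": "click|type|scroll", "target_index": 0, "text": ""}'
    )
```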

3. Model reasoning

After the constructed prompt is fed into the LLM, the model predicts the next actions to execute and plans subsequent steps.

4. Operation execution

Based on the high-level commands output by the LLM, the agent performs the actual operations, such as mouse clicks, keyboard input, or touches, to complete tasks in web pages, mobile applications, or desktop systems.
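The sketch below shows one way such a high-level command could be parsed and grounded into concrete mouse and keyboard events on a desktop, using the pyautogui library; the JSON action schema matches the illustrative prompt above and is an assumption, not the paper's protocol.

```python
import json

import pyautogui  # assumes pyautogui is installed; desktop platforms only


def execute(llm_reply: str, observation: Observation) -> None:
    """Parse the model's JSON reply and translate it into a concrete operation."""
    action = json.loads(llm_reply)
    if action["action"] == "click":
        target = observation.elements[action["target_index"]]
        left, top, width, height = target.bounding_box
        pyautogui.click(left + width // 2, top + height // 2)  # click element center
    elif action["action"] == "type":
        pyautogui.write(action["text"], interval=0.02)         # simulate typing
    elif action["action"] == "scroll":
        pyautogui.scroll(action.get("amount", -500))           # negative scrolls down
```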

5. Memory mechanism

To cope with complex multi-step tasks, GUI agents incorporate short-term memory (STM) and long-term memory (LTM) mechanisms that track task progress and past operations, ensuring contextual consistency and coherence.
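A minimal sketch of such a memory split is shown below; the structure and field names are illustrative, assuming a bounded buffer for the current task and an append-only store for completed trajectories.

```python
from collections import deque


class AgentMemory:
    """Illustrative short-term / long-term memory for a GUI agent."""

    def __init__(self, stm_size: int = 10):
        self.short_term = deque(maxlen=stm_size)  # recent steps of the current task
        self.long_term = []                       # compact traces of finished tasks

    def record_step(self, observation_summary: str, action: dict, result: str) -> None:
        """Remember one executed step for the task in progress."""
        self.short_term.append(
            {"observation": observation_summary, "action": action, "result": result}
        )

    def consolidate(self, task: str, succeeded: bool) -> None:
        """When a task ends, move its trajectory into long-term memory."""
        self.long_term.append(
            {"task": task, "trajectory": list(self.short_term), "success": succeeded}
        )
        self.short_term.clear()
```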

In addition, more advanced techniques, such as computer-vision-based GUI parsing, multi-agent collaboration, self-reflection and self-evolution, and reinforcement learning, are being explored. These techniques will make GUI agents increasingly powerful and complete. Microsoft's review covers these frontier directions in detail.

GUI agent frameworks, data, models, and evaluation: a comprehensive overview and practical guide

Microsoft's review systematically summarizes the field's evolution, covering framework design, data acquisition, model optimization, and performance evaluation, providing researchers and developers with a complete guiding framework.

1. Framework design: multi-platform adaptation and cross-field expansion

According to application scenario and platform characteristics, current GUI agent frameworks can be divided into:

  • Web platform agents: such as WebAgent and SeeAct, which use the HTML DOM or visual features to perform multi-step operations such as web navigation, data extraction, and form filling.
  • Mobile platform agents: such as AppAgent and AutoDroid, which obtain the UI hierarchy through the iOS and Android accessibility APIs and can handle complex UI layouts and varied gesture operations on mobile devices.
  • Desktop platform agents: such as Microsoft's UFO, which analyze the GUI hierarchy tree and API calls of Windows and macOS to simulate keyboard and mouse operations and execute tasks across applications.
  • Cross-platform agents: such as AGUVI, general frameworks that adapt to multiple devices and systems, laying the groundwork for cross-platform automation. This type of agent has stronger generalization ability and can migrate freely between platforms.

The proposal and validation of these frameworks make it possible to deploy GUI agents in a wide range of application scenarios and lay a solid foundation for cross-platform automation.

2. Data acquisition: Construction of high-quality training data

Efficient and accurate GUI operation cannot be achieved without rich, real data support, including:

  • GUI environment data: screenshots, UI element attributes (type, label, location), window hierarchy information, and so on, providing the agent with a basis of visual and structural information.
  • Operation data: real user interaction records, such as clicks, inputs, and gestures, providing samples from which the model learns human operation patterns.


Figure 4: GUI agent data acquisition process.

These data provide the basis for training and testing, as well as a solid foundation for standardized evaluation in the field. Figure 4 shows the data acquisition process for training GUI agents.
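As a rough illustration of what one recorded interaction step might look like, the sketch below pairs the captured GUI environment data with the logged operation and appends it to a JSON Lines file; all field names and paths are illustrative assumptions rather than the paper's schema.

```python
import json

# One interaction step pairs the GUI environment data with the recorded operation.
step = {
    "screenshot": "screens/0001.png",
    "ui_tree": "trees/0001.json",      # serialized UI structure tree
    "focused_window": "Settings",
    "action": {"type": "click", "target": "Display", "position": [412, 288]},
    "timestamp": "2024-11-01T10:32:05Z",
}

# Appending each step to a JSONL file yields a trajectory dataset that can be
# used both for model training and for standardized evaluation.
with open("trajectories.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(step, ensure_ascii=False) + "\n")
```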

3. Large Action Model (LAM): core optimization of task execution

The review proposes the concept of the Large Action Model (LAM), which is fine-tuned on top of an LLM to address the core challenges of GUI agent task execution:

  • Efficient inference: fine-tuned on large volumes of operation data, a LAM quickly generates precise operation instructions and reduces inference latency.
  • Precise execution: highly generalizable and adaptable to GUI environments on different platforms.
  • Multi-step task planning: supports decomposing complex tasks and executing them dynamically, completing multiple operations in sequence without a predefined script flow.


Figure 5: Fine-tuning the "Big Action Model" for GUI agents.

As shown in Figure 5, fine-tuning a LAM in a real environment significantly improves the agent's execution efficiency and adaptability.
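The sketch below shows one plausible way to convert a recorded trajectory step into a supervised fine-tuning example for an action model; the chat-style format and field names are assumptions for illustration, not the paper's actual training recipe.

```python
import json


def to_training_example(user_request: str, step: dict) -> dict:
    """Turn one recorded step into a prompt/response pair for supervised
    fine-tuning of an action model."""
    prompt = (
        f"User request: {user_request}\n"
        f"Window: {step['focused_window']}\n"
        f"UI tree file: {step['ui_tree']}"
    )
    completion = json.dumps(step["action"])
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": completion},
    ]}


# Example using the illustrative step record from the data-acquisition sketch
example = to_training_example(
    "Increase the display font size",
    {"focused_window": "Settings",
     "ui_tree": "trees/0001.json",
     "action": {"type": "click", "target": "Display", "position": [412, 288]}},
)
```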

4. Evaluation methods and benchmarks: Evaluate the performance of GUI agents


Figure 6: Evaluation process of GUI agent.

Evaluation is an important means of measuring an agent's capabilities. As shown in Figure 6, an agent's abilities can be assessed from many angles by observing the trajectories and logs of its task execution. The main evaluation metrics include:

  • Task completion rate: whether user instructions are executed accurately and the specified task is completed.
  • Execution efficiency: the time and number of steps required to complete a task, especially on resource-constrained hardware.
  • Completion rate under specific rules: the agent's ability to complete tasks while following specific rules and policies provided by the user.
  • Risk ratio: the agent's ability to identify and handle risks during execution.

A series of standardized benchmarks have emerged in the field, providing an objective basis and platform for evaluating and comparing the performance of GUI agents.
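To make these metrics concrete, the sketch below aggregates per-task execution logs into the four measures listed above; the log field names ("completed", "steps", "violated_rule", "risky_action") are illustrative assumptions, not a benchmark's real format.

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate task-level execution logs into summary evaluation metrics."""
    total = len(runs)
    return {
        "task_completion_rate": sum(1 for r in runs if r["completed"]) / total,
        "avg_steps": sum(r["steps"] for r in runs) / total,
        "rule_following_rate": sum(1 for r in runs if not r["violated_rule"]) / total,
        "risk_ratio": sum(1 for r in runs if r["risky_action"]) / total,
    }


# Example with two logged runs
print(summarize_runs([
    {"completed": True,  "steps": 6, "violated_rule": False, "risky_action": False},
    {"completed": False, "steps": 9, "violated_rule": True,  "risky_action": True},
]))
```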

Practical applications of GUI agents: from efficient testing to intelligent assistants

1. Software testing: from tedious scripts to natural language-driven intelligent exploration

Traditional software GUI testing often relies on lengthy scripting and repeated manual verification, which is time-consuming and prone to missing key scenarios. With large language model (LLM)-powered GUI agents, testing is undergoing a revolution. These agents no longer simply replay fixed scripts: they can generate test cases directly from natural language descriptions, "self-explore" interface elements, and respond dynamically to changes in the user interface. Research on tools such as GPTDroid, VisionDroid, and AUITestAgent has shown that agents can efficiently catch potential defects and trace complex interaction paths without the deep involvement of professional software engineers, enabling a fully automated testing pipeline from input generation and bug reproduction to functional verification.

Take font-size debugging as an example: given the single sentence "Please test the process of changing the font size in system settings," the GUI agent can autonomously navigate the interface, simulate user clicks and swipes, and accurately confirm whether the font adjustment takes effect in the resulting interface. Such natural language-driven testing not only improves test coverage and efficiency, but also makes it easy for non-technical personnel to participate in quality assurance. This means faster iteration of software products and frees development and QA teams to focus on innovation and optimization.

2. Intelligent Assistant: From passive response to multi-platform, multi-step all-round executive

Virtual assistants are no longer limited to setting simple alarms or checking the weather. When LLM-powered GUI agents become the "brains" of virtual assistants, we get a true generalist, one that can automate tasks ranging from document editing and spreadsheet analysis to complex mobile operations, guided by natural language commands, across desktops, phones, web browsers, and enterprise applications.

These agents not only respond to commands but also understand user needs from context and adapt flexibly to different interface elements. For example, they can autonomously find hidden feature entries in mobile apps and show new users how to take screenshots, or collate cross-platform data in an office environment to generate reports automatically. Users no longer need to memorize cumbersome operation steps or wrestle with complex processes; they simply describe the goal in natural language, and the agent quickly parses the context, locates the interface components, and carries out the instructions. Through continuous learning and optimization, these intelligent assistants also become increasingly attuned to their users, effectively improving productivity and satisfaction.

In summary, the GUI agent is no longer just a "tool" in real applications, but more like a 24/7 "digital assistant" and "quality expert". In the field of testing, they guarantee software quality and significantly reduce labor and time costs. In daily and business operations, they become cross-platform multi-functional helpers, allowing users to easily interact with the digital world in a more intuitive and human way. In the future, with the continuous iteration and upgrading of technology, these agents will continue to expand the application boundaries and inject new vitality into the digital transformation of all walks of life.

Technological challenges and future prospects

Despite the promise of GUI agents, Microsoft's review also makes clear where the current challenges lie:

  • Privacy and security: agents need to access user interface content, so data security and privacy protection need to be strengthened.
  • Inference latency and hardware constraints: large model inference is expensive and requires a balance between performance and real-time responsiveness.
  • Safety and trust: ensuring that agents perform tasks reliably and avoid misoperations and security risks.
  • Human-agent collaboration and interaction strategy: balancing decision-making and execution between the user and the agent in complex tasks.
  • Personalization and customization: how agents learn user preferences and habits to meet user needs more precisely.
  • Ethics and norms: ensuring that agent decision-making is transparent, fair, and accountable.
  • Generalization: adapting to different devices, operating systems, and complex, non-standard interface elements remains a challenge.

Looking to the future, as large language models and multimodal technologies continue to evolve, GUI agents will be deployed in more areas, bringing profound changes to productivity and workflows.

Conclusion: Towards a new era of intelligent interaction

The rise of large models opens up a whole new space for GUI automation. When GUI agents no longer rely solely on rigid scripts and rules, but rely on natural language and visual understanding to make decisions and perform actions, human-computer interaction changes in a qualitative way. This not only simplifies user operations, but also provides strong support for application scenarios such as intelligent assistants and automated testing.

With the continuous iteration of technology and the maturity of ecology, GUI agents are expected to become a key tool in daily work and life, making complex operations more intelligent and efficient, and ultimately leading human-computer interaction to a new era of intelligence.
