OpenAI announces New Year goals: agents may set off a "hundred-model war"

#News ·2025-01-03

On January 1, 2025, OpenAI CEO Altman announced the company's New Year goals, covering AGI, agents, 4o upgrades, models with better memory, longer context, and so on.

As for agents, rumors circulating online suggest that OpenAI may release an agent named "Operator" in January that can directly control a computer.

In fact, back in October 2024, Anthropic had already released a computer-operating agent built on its multimodal large model Claude 3.5 Sonnet.

Claude can perceive and interact with the computer interface, converting a user's instruction (e.g., "Fill out this form using data from my computer and online") into computer commands (e.g., check a spreadsheet, move the cursor to open a web browser, and so on).

Building such agents relies on at least three technical capabilities of the large model.

First, understanding user intent. For example, when the user says "open the browser and search for the latest progress in AI", the large model needs to parse out two subtasks: "open the browser" and "search for the latest progress in AI".

Second, task planning and execution, which requires decomposing a complex task into a series of executable subtasks. For example, "Send an email" is broken down into "open email app", "click Compose button", "enter recipient", "enter content", "click Send", and so on.

Third, visual understanding (which requires a multimodal large model). For example, "open URL" requires locating the browser's address bar on screen and typing the URL into it.
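The first two capabilities can be sketched in a few lines of Python. This is only a toy illustration, not how Claude or any other product works: the `PLAN_LIBRARY` table, the `Step` class, and the `parse_intent` and `plan` functions are all hypothetical names, and the hand-written lookup table stands in for what a real agent would ask the large model to generate.

```python
from dataclasses import dataclass

# Hypothetical, hand-written plan library. A real agent would ask the
# large model to produce these steps instead of looking them up.
PLAN_LIBRARY = {
    "send an email": [
        "open email app",
        "click compose button",
        "enter recipient",
        "enter content",
        "click send",
    ],
    "open the browser": ["locate browser icon", "click browser icon"],
}


@dataclass
class Step:
    subtask: str  # the subtask this step belongs to
    action: str   # one executable UI action


def parse_intent(instruction: str) -> list[str]:
    """Toy intent parsing: split the instruction into subtasks on ' and '."""
    parts = instruction.lower().split(" and ")
    return [p.strip() for p in parts if p.strip()]


def plan(instruction: str) -> list[Step]:
    """Expand each subtask into executable actions."""
    steps = []
    for subtask in parse_intent(instruction):
        # Subtasks missing from the library pass through as a single action.
        actions = PLAN_LIBRARY.get(subtask, [subtask])
        steps.extend(Step(subtask, action) for action in actions)
    return steps


if __name__ == "__main__":
    for step in plan("Open the browser and search for the latest progress in AI"):
        print(f"[{step.subtask}] {step.action}")
```

Splitting on the word "and" is obviously brittle; it is here only to make the two stages (parse, then plan) visible as separate functions.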

At almost the same time, Microsoft open-sourced OmniParser, a screen parsing tool built to work with the GPT-4V vision model. It converts user interface (UI) screenshots into structured elements, helping the AI accurately understand what is on screen and generate operating instructions.
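To make "structured elements" concrete, here is a minimal sketch of how an agent might go from a parsed screen to a click instruction. The `UIElement` fields and both helper functions are assumptions made for illustration; they are not OmniParser's actual output format or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIElement:
    # Hypothetical schema for one parsed on-screen element.
    label: str                      # text description of the element
    kind: str                       # e.g. "button", "textbox"
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def find_element(elements: list[UIElement], query: str) -> Optional[UIElement]:
    """Return the first element whose label contains the query, ignoring case."""
    query = query.lower()
    for element in elements:
        if query in element.label.lower():
            return element
    return None


def click_instruction(element: UIElement) -> dict:
    """Build a click action aimed at the center of the element's bounding box."""
    left, top, right, bottom = element.box
    return {"action": "click", "x": (left + right) // 2, "y": (top + bottom) // 2}


if __name__ == "__main__":
    screen = [
        UIElement("Address bar", "textbox", (100, 40, 900, 80)),
        UIElement("Send", "button", (10, 500, 90, 540)),
    ]
    target = find_element(screen, "address bar")
    if target is not None:
        print(click_instruction(target))
```

Once the screenshot has been reduced to labeled boxes like these, the model no longer has to guess pixel positions; it only has to pick the right element.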

Of course, China's domestic large models are not to be outdone. Zhipu AI's mobile agent AutoGLM simulates human operation in response to voice commands, handling tasks such as e-commerce shopping, ordering takeout, and replying on WeChat.

Here's a comparison of the three tools:

[Image: comparison of the three tools]

However, at this stage such agents still have flaws, and their capabilities remain limited.

But I think that's fine. We don't expect them to handle large or complex tasks; at this stage it is enough if they can take over single, repetitive chores. Um... automatic ticket grabbing?

Oh, and a few days ago Google also released a similar agent: a browser agent built on its new multimodal large model, Gemini 2.0, that can use the browser automatically to get work done.

Agents have been a hot topic for the past year, and in October OpenAI also open-sourced Swarm, a Python framework for multi-agent development.

Now that agents are listed as a New Year goal, it looks like OpenAI is planning something big. The major model vendors will probably compete fiercely in this field too.

As for us, we will keep watching from the sidelines.


Quantum (Shanghai) Artificial Intelligence Technology Co., Ltd. ICP:沪ICP备2025113240号-1
