OpenAI Announces Its New Year Goals: Agents May Usher In a "Hundred-Model War" - News - Artificial Intelligence Global Cooperation Alliance

#News ·2025-01-03

On January 1, 2025, OpenAI CEO Sam Altman announced the company's New Year goals, which cover AGI, agents, an upgraded GPT-4o, better memory, longer context, and more.

As for agents, some netizens have leaked that OpenAI may release an agent named "Operator" in January, which would be able to control a computer directly.

In fact, back in October 2024, Anthropic had already released an agent that can operate a computer, built on its multimodal large model Claude 3.5 Sonnet.

Claude can perceive and interact with the computer interface, converting the user's instructions (e.g., "Fill out this form using my computer and online data") into computer commands (e.g., check a spreadsheet, move the cursor to open the web browser, and so on).

Building such agents relies on at least three technical capabilities of the large model.

First, understanding user intent. For example, when the user says "open the browser and search the latest progress of AI", the large model needs to parse out two subtasks: "open the browser" and "search the latest progress of AI".

Second, task planning and execution, which requires decomposing a complex task into a series of executable subtasks. For example, "Send an email" is broken down into "Open email app", "Click Compose button", "Enter recipient", "Enter content", "Click Send", and so on.

Third, visual understanding (a multimodal large model). For example, "Open URL" requires locating the browser's address bar on the screen and entering the URL there.
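
The first two capabilities can be illustrated with a minimal Python sketch. It is only a stand-in: a real agent would ask the large model to do the parsing and planning, whereas here a simple split on " and " and a hand-written lookup table (`PLANS`) play that role. The names `parse_intent`, `plan`, `build_plan`, and `PLANS` are hypothetical and do not come from any of the products mentioned.

```python
from typing import Dict, List

# Hypothetical plan library. In a real agent the large model would
# generate these steps instead of reading them from a table.
PLANS: Dict[str, List[str]] = {
    "send an email": [
        "Open email app",
        "Click Compose button",
        "Enter recipient",
        "Enter content",
        "Click Send",
    ],
}


def parse_intent(instruction: str) -> List[str]:
    """Split a compound instruction into subtasks.

    Assumption: subtasks are joined by the word "and". This is a crude
    placeholder for model-based intent understanding.
    """
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def plan(subtask: str) -> List[str]:
    """Expand one subtask into executable steps.

    Subtasks without an entry in PLANS are passed through as a single step.
    """
    return PLANS.get(subtask.lower(), [subtask])


def build_plan(instruction: str) -> List[str]:
    """Parse the instruction, then flatten the steps of every subtask."""
    steps: List[str] = []
    for subtask in parse_intent(instruction):
        steps.extend(plan(subtask))
    return steps


if __name__ == "__main__":
    print(parse_intent("open the browser and search the latest progress of AI"))
    print(build_plan("send an email and open the browser"))
```

Keeping parsing and planning as separate functions mirrors the split described above: either one could be swapped for a model call without touching the other.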

At almost the same time, Microsoft open-sourced OmniParser, a screen parsing tool for vision models such as GPT-4V. It converts user interface (UI) screenshots into structured elements, helping the AI understand the screen content accurately and generate operation instructions.
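
To make "structured elements" concrete, here is a small Python sketch of how an agent might consume them. The `UIElement` schema, the normalized bounding boxes, and the function names are assumptions made for illustration; they are not OmniParser's actual output format or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UIElement:
    """One parsed screen element (hypothetical schema)."""
    kind: str                                  # e.g. "button", "input", "text"
    text: str                                  # visible label or description
    bbox: Tuple[float, float, float, float]    # normalized (x_min, y_min, x_max, y_max)
    interactable: bool


def find_element(elements: List[UIElement], text: str) -> Optional[UIElement]:
    """Return the first interactable element whose label contains `text`."""
    needle = text.lower()
    for element in elements:
        if element.interactable and needle in element.text.lower():
            return element
    return None


def click_point(element: UIElement, width: int, height: int) -> Tuple[int, int]:
    """Convert the center of a normalized bounding box to pixel coordinates."""
    x_min, y_min, x_max, y_max = element.bbox
    return (
        round((x_min + x_max) / 2 * width),
        round((y_min + y_max) / 2 * height),
    )


def click_instruction(
    elements: List[UIElement], text: str, width: int, height: int
) -> Optional[dict]:
    """Build a click instruction for the named element, or None if it is absent."""
    element = find_element(elements, text)
    if element is None:
        return None
    x, y = click_point(element, width, height)
    return {"action": "click", "x": x, "y": y}


if __name__ == "__main__":
    screen = [
        UIElement("text", "Welcome", (0.0, 0.0, 1.0, 0.1), False),
        UIElement("input", "Address bar", (0.25, 0.125, 0.75, 0.25), True),
    ]
    print(click_instruction(screen, "address bar", 1000, 800))
```

Once the screen is reduced to a list like this, a step such as "Open URL" becomes a lookup plus a coordinate calculation rather than a raw pixel problem.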

Of course, China's large models are not to be outdone. Zhipu's mobile agent AutoGLM simulates human operation in response to voice commands, handling tasks such as e-commerce shopping, ordering takeout, and replying on WeChat.

[Figure: comparison of the three tools]

However, at this stage such agents still have flaws, and their capabilities are far from perfect.

But I think that's fine. We don't expect them to handle large or complex tasks; for now it is enough if they can take over single, repetitive jobs for us. Um... automatic ticket grabbing?

Oh, and a few days ago Google also released a similar agent: a browser agent built on its new multimodal large model, Gemini 2.0, that can carry out work in the browser automatically.

Agents have been a hot topic for the past year, and in October 2024 OpenAI also open-sourced Swarm, a Python framework for multi-agent development.

This time agents are listed as a New Year goal, so it looks like OpenAI is planning something big. The major model vendors will most likely compete fiercely in this field as well.

For now, we will wait and see.
