#News ·2025-01-03
On January 1, 2025, OpenAI CEO Sam Altman announced the company's New Year goals, which include AGI, agents, an upgrade to 4o, better memory, longer context, and more.
As for agents, rumors circulating online suggest that OpenAI may release an agent named "Operator" in January that can control a computer directly.
In fact, back in October 2024, Anthropic had already released an agent that can operate a computer, built on its multimodal large model Claude 3.5 Sonnet.
Claude can perceive and interact with the computer interface, converting a user's instruction (e.g., "Fill out this form using my computer and online data") into computer commands (e.g., check a spreadsheet, move the cursor to open the web browser, and so on).
Building such an agent relies on at least three technical capabilities of the large model.
First, understanding user intent. For example, when the user says "open the browser and search the latest progress of AI", the model needs to parse out two subtasks: "open the browser" and "search the latest progress of AI".
Second, task planning and execution, which means decomposing a complex task into a series of executable subtasks. For example, "Send an email" is broken down into "Open email app", "Click Compose button", "Enter recipient", "Enter content", "Click Send", and so on.
Third, visual understanding (a multimodal large model). For example, "Open URL" requires locating the browser's address bar on screen and typing the URL into it.
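The first two capabilities can be illustrated with a toy sketch. It is not how Claude or any real agent works: a real system would ask the large model to do the parsing and planning, whereas here a string split and a lookup table stand in for it. The names `Step`, `PLANS`, `split_intent`, and `plan` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str   # a verb such as "open", "click" or "type"
    target: str   # human-readable name of the thing being acted on


# Hypothetical plan table standing in for the model's task planner.
PLANS = {
    "send an email": [
        Step("open", "email app"),
        Step("click", "Compose button"),
        Step("type", "recipient"),
        Step("type", "content"),
        Step("click", "Send"),
    ],
}


def split_intent(instruction: str) -> list[str]:
    """Toy intent parser: split a request into subtasks on the word 'and'."""
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def plan(subtask: str) -> list[Step]:
    """Toy planner: use a known plan if one exists, else emit a single step."""
    known = PLANS.get(subtask.lower())
    if known is not None:
        return list(known)
    # Fallback: treat the first word as the verb and the rest as the target.
    verb, _, rest = subtask.partition(" ")
    return [Step(verb.lower(), rest)]


if __name__ == "__main__":
    request = "open the browser and search the latest progress of AI"
    for sub in split_intent(request):
        print(sub, "->", plan(sub))
```

Splitting on "and" breaks as soon as a request is phrased differently, which is exactly why intent understanding is listed as a capability of the model rather than something to hard-code.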
At almost the same time, Microsoft open-sourced OmniParser, a screen parsing tool built to work with the GPT-4V large vision model. It converts user interface (UI) screenshots into structured elements, helping the AI understand what is on screen accurately and generate operating instructions.
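To make "structured elements" concrete, here is a minimal sketch of how an agent might go from a parsed screen to an "Open URL" action sequence. The `UIElement` schema, the sample `SCREEN` data, and the action dictionaries are assumptions made up for this example; they are not OmniParser's actual output format or any vendor's action API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIElement:
    label: str                        # text or description of the element
    kind: str                         # e.g. "button", "input"
    bbox: tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


# Made-up parse result for a browser window (hypothetical data).
SCREEN = [
    UIElement("Back", "button", (10, 40, 40, 70)),
    UIElement("Reload", "button", (50, 40, 80, 70)),
    UIElement("Address bar", "input", (100, 40, 900, 70)),
]


def find_element(elements: list[UIElement], query: str) -> Optional[UIElement]:
    """Return the first element whose label contains the query, ignoring case."""
    q = query.lower()
    for el in elements:
        if q in el.label.lower():
            return el
    return None


def click_point(el: UIElement) -> tuple[int, int]:
    """Return the centre of the element's bounding box as a click target."""
    left, top, right, bottom = el.bbox
    return ((left + right) // 2, (top + bottom) // 2)


def open_url_actions(elements: list[UIElement], url: str) -> list[dict]:
    """Build the click / type / Enter sequence for opening a URL."""
    bar = find_element(elements, "address bar")
    if bar is None:
        raise LookupError("no address bar found on screen")
    x, y = click_point(bar)
    return [
        {"action": "click", "x": x, "y": y},
        {"action": "type", "text": url},
        {"action": "key", "key": "Enter"},
    ]


if __name__ == "__main__":
    for action in open_url_actions(SCREEN, "https://example.com"):
        print(action)
```

Once the screen is a list of labeled boxes, locating the address bar becomes a lookup instead of a pixel-level vision problem, which is the benefit of a parsing step of this kind.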
Of course, Chinese large models are not to be outdone. Zhipu's mobile AutoGLM uses voice commands to simulate human operation, handling tasks such as e-commerce shopping, ordering takeout, and replying on WeChat.
Here's a comparison of the three tools:
[Image: comparison of the three tools]
At this stage, however, such agents still have flaws and their capabilities are far from perfect.
But I think that's okay. We don't expect them to handle large or complex tasks yet; for now it is enough if they can take over single, repetitive jobs for us. Automatic ticket grabbing, say?
Oh, and a few days ago Google also released a similar agent: a browser agent based on its new multimodal large model Gemini 2.0 that can use the browser on its own to get work done.
Agents have been a hot topic for the past year, and in October OpenAI also open-sourced Swarm, a multi-agent development framework for Python.
Now that agents are listed as a New Year's goal, it looks like OpenAI is planning something big. The other major model vendors will most likely compete fiercely in this field as well.
As for us, we will keep watching from the sidelines for now.