News · 2025-01-03
There has been a lot of discussion in the AI community lately about whether scaling laws are hitting a wall. One reason they might be is that AI is running out of high-quality data: one study predicts that if LLMs continue on their current trajectory, the stock of human-generated data will be used up around 2028.
Image source: "Will we run out of data? Limits of LLM scaling based on human-generated data"
A study by Carnegie Mellon University (CMU) and Google DeepMind, titled "VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought", found that this shortage of high-quality data may be addressed by using low-quality data and feedback. The proposed ICAL method lets LLMs and VLMs build effective prompts from suboptimal demonstrations and human feedback, improving decision making and reducing reliance on expert demonstrations. The paper is a NeurIPS 2024 Spotlight, and the project code has been released.
Humans are remarkably good at learning from few examples: by combining observed behavior with an internal world model, they can quickly generalize from a single task demonstration to related situations. They can distinguish the factors that do and do not matter for success and anticipate likely failures. Through repeated practice and feedback, humans quickly find the right abstractions for imitating a task and adapting it to new situations. This process supports continual improvement and the transfer of knowledge across a variety of tasks and environments.
Recent research has explored using large language models (LLMs) and vision-language models (VLMs) to extract high-level insights from trajectories and experience. The models generate these insights through introspection, and attaching them to prompts can improve performance by exploiting the models' strong in-context learning abilities.
Existing approaches typically focus on verbalized task reward signals, store human corrections after failures, rely on domain experts to manually write or hand-pick examples (without introspection), or use language to shape policies and rewards. The problem is that these methods are usually text-only, include no visual cues or demonstrations, or introspect only after failure, even though failure is just one of several ways in which humans and machines can consolidate experience and extract insights.
The approach taken by the CMU and DeepMind research team is as follows: given suboptimal demonstrations and natural-language feedback from humans, a VLM learns to solve new tasks by learning in-context experience abstractions. The method is named In-Context Abstraction Learning (ICAL).
By prompting the VLM, ICAL enables it to create multimodal abstractions for unfamiliar domains.
Previous work usually only stores and retrieves successful action plans or trajectories. ICAL is different: its emphasis is on learning abstractions that actually capture the dynamics and critical knowledge of the task, as shown in Figure 1.
Specifically, ICAL handles four types of cognitive abstraction.
Given optimal or suboptimal demonstrations, ICAL prompts a VLM to turn them into optimized trajectories while creating the associated language and visual abstractions. These abstractions are then refined by executing the trajectories in the environment, guided by natural-language feedback from humans.
Each step of the abstraction-generation process draws on previously derived abstractions, allowing the model to improve not only its execution but also its ability to form abstractions.
Figure 2 gives an overview of ICAL.
Each iteration begins with a noisy trajectory, which ICAL abstracts in two stages: an abstraction phase, in which the VLM revises the trajectory and generates language and visual abstractions, and a human-in-the-loop phase, in which the revised trajectory is executed in the environment and refined using natural-language feedback.
Once a trajectory has been executed successfully, it is archived in a growing library of examples. These examples serve as in-context references for the agent when it encounters previously unseen instructions and environments, both during learning and at inference time.
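The loop below is a minimal sketch of this process under stated assumptions: the helpers `query_vlm`, `env.execute`, `memory.retrieve`/`memory.add`, and `get_human_feedback`, as well as the `ICALExample` structure, are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Minimal sketch of an ICAL-style learning loop. All names here are
# illustrative placeholders, not the paper's actual code.
from dataclasses import dataclass


@dataclass
class ICALExample:
    """One stored example: a revised trajectory plus its abstractions."""
    instruction: str
    trajectory: list[str]            # revised action sequence
    language_abstractions: str       # causal rules, subgoals, noted pitfalls
    visual_abstractions: list[str]   # pointers to task-relevant frames/regions


def ical_learning_loop(noisy_demos, env, memory, query_vlm, get_human_feedback,
                       max_revisions=3):
    """Turn noisy demonstrations into abstracted in-context examples."""
    for demo in noisy_demos:
        # Abstraction phase: prompt the VLM, conditioned on previously learned
        # examples, to correct the noisy trajectory and write abstractions.
        example = query_vlm(demo, context=memory.retrieve(demo.instruction))

        # Human-in-the-loop phase: execute in the environment and refine the
        # trajectory and abstractions using natural-language feedback.
        for _ in range(max_revisions):
            success, rollout = env.execute(example.trajectory)
            if success:
                memory.add(example)   # archive in the growing example library
                break
            feedback = get_human_feedback(rollout)
            example = query_vlm(demo,
                                context=memory.retrieve(demo.instruction),
                                feedback=feedback,
                                previous=example)
    return memory
```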
Overall, the learned abstractions summarize key information about action sequences, state transitions, rules, and points of attention, expressed explicitly in free-form natural language and visual representations.
Once ICAL examples have been learned, they can be deployed on new tasks and in new environments via retrieval-augmented generation.
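As a rough illustration of that deployment step, the sketch below retrieves the stored examples most similar to a new instruction and packs them into a prompt. The embedding function, the dot-product similarity, and the prompt layout are assumptions made for illustration; the paper's actual retrieval setup may differ.

```python
# Hedged sketch of retrieval-augmented prompting with learned ICAL examples.
# `embed` is an assumed text-embedding function returning unit-norm vectors;
# the prompt format is illustrative only.
import numpy as np


def retrieve_examples(instruction, memory, embed, k=5):
    """Return the k stored examples most similar to the new instruction."""
    query = embed(instruction)
    scores = [float(np.dot(query, embed(ex.instruction))) for ex in memory]
    ranked = sorted(range(len(memory)), key=scores.__getitem__, reverse=True)
    return [memory[i] for i in ranked[:k]]


def build_prompt(instruction, examples):
    """Prepend retrieved trajectories and abstractions as in-context examples."""
    blocks = [
        f"Instruction: {ex.instruction}\n"
        f"Abstractions: {ex.language_abstractions}\n"
        f"Actions: {', '.join(ex.trajectory)}"
        for ex in examples
    ]
    return "\n\n".join(blocks) + f"\n\nNew instruction: {instruction}\nActions:"
```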
The researchers tested ICAL's task-planning ability on TEACh and VisualWebArena, and its action-anticipation ability on the Ego4D benchmark. TEACh targets dialogue-based instruction following in household environments, VisualWebArena consists of multimodal autonomous web tasks, and Ego4D is used for video action anticipation.
First, the team found that ICAL outperforms fixed demonstrations on instruction-following tasks in household environments.
Table 1 presents results on the previously unseen TEACh validation set, which evaluates performance on new instructions, houses, and objects.
As shown in Figure 4, ICAL's revision of the noisy trajectories leads to higher success rates on the training tasks: compared with imitating the raw trajectories, success rates improve by 42% for kinesthetic demonstrations and 86% for visual demonstrations. This shows that ICAL not only adds useful abstractions but also corrects errors in passive video demonstrations, improving success rates in the original demonstration environments.
As shown in Table 1, ICAL examples used in context outperform the unprocessed demonstrations on unseen tasks, improving success rates by 17.9% over raw demonstrations with predicted actions and by 8.6% over demonstrations with ground-truth action annotations. This highlights the effectiveness of the proposed abstraction process at improving example quality for in-context learning, in contrast to the previously dominant practice of saving and retrieving successful action plans or trajectories without abstraction.
ICAL also achieves state-of-the-art performance on visual web tasks. On VisualWebArena, the new agent outperforms the GPT-4 + Set-of-Marks baseline, improving from 14.3% to 22.7% with GPT-4V and from 18.9% to 23.4% with GPT-4o.
In the Ego4D setting, ICAL outperforms few-shot GPT-4V with chain-of-thought prompting, reducing noun and action edit distances by 6.4 and 1.7 respectively, and comes close to the fully supervised approach while using 639x less in-domain training data.
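For readers unfamiliar with the metric, edit distance here counts the minimum number of insertions, deletions, and substitutions needed to turn a predicted sequence of action labels into the ground-truth sequence, so lower is better. The snippet below is a generic Levenshtein computation given only for illustration; the Ego4D benchmark's exact evaluation protocol may differ in its details.

```python
# Illustrative Levenshtein edit distance between a predicted and a
# ground-truth sequence of action labels.
def edit_distance(pred, gold):
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]


# Example: two wrong labels out of four predicted steps -> distance 2.
print(edit_distance(["take", "cut", "place", "wash"],
                    ["take", "slice", "place", "rinse"]))
```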
Overall, the new approach significantly reduces reliance on expert examples and consistently outperforms in-context learning from action plans and trajectories that lack such abstractions.
In addition, ICAL's performance improves significantly as the number of examples grows, suggesting that the new method also scales well.