News · 2025-01-07
Industry experts predicted that 2024 would be a milestone year for generative AI: practical applications would keep emerging, technological advances would keep lowering the barriers to entry, and general AI would come within reach.
So have these predictions come true?
Partially. As 2024 draws to a close, some of these predictions have already come true. Others, especially general AI, will need more time to incubate.
Here's what famed futurist and investor Tomasz Tunguz had to say about the data engineering and AI space at the end of 2024, along with a few of my own predictions.
The data engineering trends of 2025 are upon us.
Three years into our AI dystopia, companies are starting to create value in some of the areas we expected - though not in all of them. Tomasz believes current AI falls into three broad categories: copilots, search, and reasoning.
While AI "copilots" and search capabilities have had some success (especially the former), the development of inference models does not seem to have kept pace. There's an obvious reason for this, Tomasz points out.
Model accuracy.
Tomasz explains that current models struggle to break a task down into discrete steps unless they have seen the same pattern many times - and that isn't true of most of the work these models are asked to do.
"At the moment, if a large model is asked to produce a financial plan and analysis chart, it can do it. But if there is any material change - for example, we move from billing per software to billing per usage - the model gets overwhelmed."
So the current situation is that AI copilots and partially accurate search results have the upper hand.
The value of a new tool depends on the strength of the processes that underpin it.
As the "modern data stack" evolves over the years, data teams sometimes find themselves in a perpetual state of being on the run. They focus too much on what the platform can do and lose sight of the more critical question of how to use those capabilities efficiently.
However, as the corporate world moves toward production-ready AI - systems ready for real business environments that deliver value to the enterprise or its users - the question of how to put these new tools to work becomes urgent.
Take data quality as an example. In 2024 the state of the data feeding AI became increasingly prominent, and data quality moved into the spotlight. With production-ready AI on the horizon, enterprise data leaders don't have time to pick and choose from the data quality menu - a bit of dbt testing here, a point solution there. They are already on the hook to deliver business value and urgently need trusted solutions they can bring online immediately and deploy effectively.
The reality is that even if you have the most cutting-edge data quality platform on the market - the most advanced automation, the best copilots, the best integrations - if you can't get it up and running in your business quickly, all you have is a line item on your budget and a new tab on your desktop.
Over the next year, I expect data teams to move toward proven end-to-end solutions rather than fragmented toolsets, so they can focus on the more critical challenges of data quality ownership, incident management, and long-term domain enablement.
Solutions that meet these core needs will stand out from the AI noise and come out on top.
As with all data products, the value of GenAI comes in both cost reduction and revenue generation.
On the revenue side, think of technologies like AI SDRs, enrichment tools, or recommendation systems. Tomasz points out that while these tools can widen the sales pipeline, the quality of that pipeline may not be ideal. So if AI can't directly grow revenue, it should aim to cut costs - and that is where this emerging technology has already made a difference.
"In reality, not many companies have shut down operations as a result. Its main function is to reduce costs. Klarna, for example, cut two-thirds of its workforce. Microsoft and ServiceNow have improved engineering efficiency by 50 to 75 percent."
Tomasz believes that AI applications have the potential to achieve cost reductions if they meet one of three criteria.
One example Tomasz cites of AI effectively generating new revenue is EvenUp - a legal tech company that automates demand letters. Companies like EvenUp, offering templated but highly customized services, may see outsized benefits from today's AI.
Compared to last year's rush to propose "AI strategies," today's leaders seem to have a more cautious attitude toward AI technology.
"There was a wave last year of people trying to put out all kinds of software just to see it. Their boards are all asking about their AI strategy. But now, a lot of those early attempts have given up."
Some companies did not see value in their initial attempts, while others feel overwhelmed by how fast the underlying technology is evolving. Tomasz noted that this is one of the biggest challenges of investing in AI companies: it's not that the technology has no value in theory, but that companies have not yet figured out how to use it effectively in practice.
Tomasz believes that the next phase of AI adoption will be different from the first wave because leaders will be more clear about their needs and how to meet them.
Like the final dress rehearsal before the curtain goes up, the teams already know what they are looking for, they have solved many of the issues related to legal and procurement - especially those related to data loss and data protection - and they are just waiting for the right opportunity to come along.
What are the challenges ahead? "How do you discover and realize value faster?"
The open source vs. managed debate is a well-worn topic, but it gets even more complicated when it comes to AI.
At the enterprise level, it's not just about control or interoperability, although those factors do exist, but more critical are operating costs.
Tomasz believes that the largest B2C businesses are likely to go straight to off-the-shelf models, while B2B businesses are more inclined to develop their own proprietary models or adopt open source models.
"In the B2B space, you're seeing smaller models overall, and more open source models. That's because it's much cheaper to run a small open source model."
But the advantage of small models isn't just cost - they can also perform better. Large models like Google's are built to handle almost any scenario; users can ask them about nearly anything, so they have to be trained on a vast corpus to give relevant answers about water polo, Chinese history, or French toast alike.
However, the more topics a model is trained on, the more easily it conflates different concepts - and the more errors creep into its output over time.
"You can take a model like llama 2 that has 8 billion parameters and use 10,000 support tickets. "10,000 support tickets" are 10,000 support tickets, questions or requests recorded by a business during customer service or technical support. Each ticket may contain information such as problems encountered by the customer, solutions, communication records, etc.) Fine-tune it and it performs significantly better, "explains Tomasz.
In addition, ChatGPT and other hosted solutions frequently face legal challenges because their creators may not have legally obtained the data used to train their models.
In many cases, this accusation is not unfounded.
Beyond cost and performance, this issue could have an impact on the long-term adoption of proprietary models - especially in highly regulated industries - but the exact extent of its impact remains uncertain.
Of course, proprietary models aren't sitting still, and Sam Altman certainly isn't giving up on them.
Proprietary models are already stimulating demand by slashing prices. Models like ChatGPT have already reduced prices by about 50% and are expected to drop another 50% in the next 6 months. This cost reduction could be a key boost for B2C businesses competing in the AI arms race.
When scaling data pipeline production, data teams typically face two major challenges: analysts' lack of technical experience and data engineers' limited time.
This seems to be a problem that AI can solve.
As we look ahead to how data teams are likely to evolve, I see two key trends likely to drive the convergence of engineering and analytics responsibilities in 2025: rising demand for data, and pipeline automation advancing to meet it.
The logic is simple - as demand grows, data pipeline automation will naturally evolve to meet it. As automation advances, the barriers to creating and managing these data pipelines will be lowered. The skills gap will narrow and the ability to create new value will increase.
The move to self-serve AI-driven data pipeline management means that the most tedious parts of everyone's job will be replaced by automation - and their ability to create and demonstrate new value in the process will be enhanced. That sounds like a good future.
You may have seen the picture of a snake swallowing an elephant. If you look closely, it bears a striking resemblance to contemporary AI developments.
There are roughly 21-25 trillion tokens (words) of usable text on the Internet, and today's AI models have already consumed essentially all of it. For AI to keep improving, it needs to be trained on a much larger corpus: the more data, the richer the context of the output and the higher the accuracy.
So what do AI researchers do when they run out of training data?
They make their own data.
As training data becomes increasingly scarce, companies like OpenAI believe synthetic data will be an important part of how models are trained in the future. Over the past two years, an entire industry has grown up around this vision - including companies like Tonic, which generates synthetic structured data, and Gretel, which creates compliant synthetic data for regulated industries like finance and healthcare.
But is synthetic data a long-term solution? Probably not.
Synthetic data works by having models create artificial datasets that mimic what you would find in the wild, then training on that new data. At small scale, this makes sense. But as the saying goes, you can have too much of a good thing...
Think of it as contextual malnutrition. Just like food, if fresh organic data is the most nutritious diet for model training, then data derived from an existing dataset is inherently less "nutritious" than the original.
A little artificial seasoning is fine - but if a model is fed synthetic training data for long enough without fresh "natural" data [1], it will eventually fail (or at the very least, its performance will degrade noticeably).
It's not a question of "if" it will happen, it's a question of "when."
According to Tomasz, we are still far from model collapse. But as AI research keeps pushing models to their limits, it's not hard to imagine that AI will eventually hit a functional plateau - perhaps sooner than we expect.
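The degradation dynamic is easy to demonstrate with a toy experiment - a deliberately simplified sketch, not a claim about any real LLM, with the vocabulary and sample sizes chosen purely for illustration: estimate a token distribution from a sample, generate the next "training set" from that estimate, and repeat.

```python
# Toy illustration of model collapse: fit a token distribution to a sample,
# sample the next "training set" from that fit, and repeat. Rare tokens that
# miss one generation's sample can never come back, so the vocabulary the
# model "knows" only shrinks.
import numpy as np

rng = np.random.default_rng(42)

vocab_size = 1_000
sample_size = 2_000

# Generation 0: "real" web text with a long-tailed (Zipf-like) token distribution.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(8):
    distinct = np.count_nonzero(probs)
    print(f"generation {generation}: {distinct} token types still represented")
    # "Train" on a finite sample: the next model only knows the empirical frequencies.
    sample = rng.choice(vocab_size, size=sample_size, p=probs)
    counts = np.bincount(sample, minlength=vocab_size)
    probs = counts / counts.sum()
```

Real model collapse is subtler, but the mechanism is the same: each generation can only reproduce what survived the previous sample, so the tails of the original distribution quietly disappear.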
The idea of using unstructured data in production is not new - but in the age of artificial intelligence, unstructured data has taken on a whole new role.
According to an IDC report, only about half of enterprise unstructured data is currently analyzed [2].
That's all about to change.
When it comes to generative AI, the success of a business depends largely on the vast amount of unstructured data used for training, fine-tuning, and augmentation. As more organizations look to apply AI to enterprise use cases, the enthusiasm for unstructured data, and the emerging "unstructured data stack [3]," will continue to grow.
Some teams are even exploring how LLMs (large language models) can be used to add structure to unstructured data [4], extending its usefulness to other training and analysis use cases.
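As a rough sketch of what "adding structure" can look like in practice: prompt an LLM to pull typed fields out of free text and validate the result. The call_llm function below is a hypothetical placeholder for whatever model endpoint a team actually uses, and the field names are illustrative.

```python
# Sketch: turn an unstructured support email into a structured record by
# asking an LLM for JSON and validating what comes back.
import json
from dataclasses import dataclass

@dataclass
class TicketRecord:
    product: str
    severity: str      # e.g. "low" | "medium" | "high"
    summary: str

PROMPT_TEMPLATE = """Extract the following fields from the message below and
reply with JSON only, using exactly these keys: product, severity, summary.

Message:
{message}
"""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in whichever hosted or open-source model
    # your stack uses (the point is the pattern, not the provider).
    raise NotImplementedError

def structure_message(message: str) -> TicketRecord:
    raw = call_llm(PROMPT_TEMPLATE.format(message=message))
    fields = json.loads(raw)  # fails loudly if the model didn't return JSON
    return TicketRecord(product=str(fields["product"]),
                        severity=str(fields["severity"]),
                        summary=str(fields["summary"]))
```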
Identifying the unstructured first-party data that exists in the enterprise and how to activate it for stakeholders is a whole new opportunity for data leaders looking to demonstrate the business value of their data platform (and hopefully snag some extra budget for priority initiatives in the process).
If 2024 was the year of exploring the potential of unstructured data, 2025 will be about realizing its value. The question is: which tools will stand out?
If you've been around venture capital lately, you've probably heard two buzzwords a lot: "copilot," which is an AI for a single task (like "fix my bad code"), and "agents," which are workflows that gather information and use it to perform multistep tasks (like "blog about my bad code and post it to my WordPress").
In 2024, AI copilots certainly achieved a lot (just ask GitHub, Snowflake, or Microsoft's paperclip team), but how have AI agents fared?
While agentic AI is an exciting prospect for customer support teams, it looks like it can only go so far in the near term. These early AI agents mark an important step forward, but the accuracy of their workflows still falls short.
For AI today, 75-90% accuracy on a single step is state of the art - most models perform at roughly the level of a high school student. And because agent workflows chain steps together, the errors compound: if each of three steps is 75-90% accurate, the end-to-end accuracy can drop to around 50%.
We can train elephants to draw with better accuracy than that.
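The arithmetic behind that ~50% figure is simple compounding of per-step accuracy; a quick sketch (the step counts and accuracies are illustrative):

```python
# Error compounding in a multi-step agent workflow: end-to-end accuracy is
# roughly the product of per-step accuracies (assuming independent steps).
for per_step in (0.75, 0.80, 0.90):
    for steps in (1, 3, 5):
        print(f"{per_step:.0%} per step, {steps} steps -> {per_step ** steps:.0%} end-to-end")
```

At 80% per step, three chained steps already land near 50% end-to-end - the gap Tomasz is describing.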
If most AI agents were put into production at their current level of performance, they would fail to deliver value to the enterprise and could well do harm. According to Tomasz, that is the problem to solve first.
Everyone wants to talk about AI agents, but so far hardly anyone has had success with them outside of a demo. People in Silicon Valley may love to discuss agents, but that talk has not yet translated into real-world performance.
"At a dinner with many leaders in the AI field, I asked how many were satisfied with the quality of the output, and no one responded. We do have a serious quality challenge in terms of ensuring consistent output."
Every year, Monte Carlo surveys [5] data experts on the true state of data quality. This year, we focused on the reach of AI, and the signals were clear.
Data quality risks are evolving, but data quality management has not kept pace.
"We've observed teams building vector databases or embedded models at scale, using SQLLite at scale, totaling 100 million small databases. They started with architectural design at the CDN layer to run these small models. The iPhone will also be equipped with machine learning models. We expect to see a significant increase in the total number of data pipelines, but smaller amounts of data processed by each pipeline."
Fine-tuned models will lead to a dramatic increase in the number of data pipelines inside the enterprise. And data quality risk scales with the number and complexity of those pipelines: the more pipelines there are (and the more complex they are), the more opportunities there are for something to break, and the less likely a problem is to be caught in time.
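To ground the "many small databases" idea from Tomasz's quote, here is a minimal sketch of a tiny local vector store: embeddings kept in SQLite and searched with cosine similarity. The table name and embedding dimension are illustrative assumptions, and the random vectors stand in for a real embedding model.

```python
# Minimal local "vector database": store embeddings as blobs in SQLite and
# do a brute-force cosine-similarity search in NumPy. Fine at small scale,
# which is exactly the "lots of tiny databases" pattern described above.
import sqlite3
import numpy as np

DIM = 8  # illustrative embedding dimension; real models use hundreds or more

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

def add_doc(doc_id: int, text: str, embedding: np.ndarray) -> None:
    conn.execute("INSERT INTO docs VALUES (?, ?, ?)",
                 (doc_id, text, embedding.astype(np.float32).tobytes()))

def search(query: np.ndarray, top_k: int = 3):
    rows = conn.execute("SELECT id, text, embedding FROM docs").fetchall()
    embs = np.stack([np.frombuffer(r[2], dtype=np.float32) for r in rows])
    sims = embs @ query / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query))
    best = np.argsort(-sims)[:top_k]
    return [(rows[i][1], float(sims[i])) for i in best]

# Stand-in for a real embedding model: random but fixed vectors.
rng = np.random.default_rng(0)
for i, text in enumerate(["reset password", "billing question", "API timeout"]):
    add_doc(i, text, rng.normal(size=DIM))

print(search(rng.normal(size=DIM)))
```

Spinning up one of these stores is trivial; keeping data quality in check across thousands of them is the real challenge.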