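From LLMs transforming the modern data stack to data observability for vector databases, here are my predictions for the top data engineering trends of 2024.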

Image courtesy of The Everett Collection on Shutterstock.

“The data and AI space moves fast. If you don’t stop and look around once in a while, you just might miss it.”

2023 was the year of GenAI. And 2024 is shaping up to be…another year of GenAI.

But where 2023 saw teams scrambling to name drop, 2024 will see teams prioritizing real business problems for their AI models. And with renewed focus comes new priorities.

When it comes to the future of data, a rising tide lifts all ships. And GenAI will continue to rise in 2024, raising the standards — and priorities — of the data industry right along with it.

Here are my top 10 predictions for what’s next for data and AI teams — and how your team can stay one step ahead.

1. LLMs will transform the stack

This one was a given.

It’s no exaggeration to say that large language models (LLMs) have transformed the face of technology over the last 12 months. From companies with legitimate use cases to fly-by-night teams with technology on the hunt for a problem, everyone and their data steward is trying to use generative AI (GenAI) in one fashion or another.

LLMs are set to continue that transformation into 2024 and beyond, from driving increased demand for data and necessitating new architectures like vector databases (a.k.a. the “AI stack”) to changing the way we manipulate and use the data for our end users.

Automated data analysis and activation will become an expected tool in every product and at every level of the data stack. The question is: how do we make sure these new products are providing real value in 2024 and not just a little new flash for the PR credit?

2. Data teams will look like software teams

The most sophisticated data teams are viewing their data assets as bona fide data products, complete with product requirements, documentation, sprints, and even SLAs for end users.

So, as organizations begin mapping more and more value to their defined data products, more and more data teams will start looking — and being managed — like the critical product teams that they are.

3. And software teams will become data practitioners

When engineers try to build data products or GenAI without thinking about the data, it doesn’t end well. Just ask United Healthcare.

As AI continues to eat the world, engineering and data will become one and the same. No major software development will enter the market without an eye toward AI, and no major AI will enter the market without some level of real enterprise data powering it.

That means that as engineers seek to elevate new AI products, they’ll need to develop an eye toward the data — and how to work with it — in order to build models that add new and continued value.

4. RAG will be all the RAGe

After a series of high-profile GenAI failures, the need for clean, reliable, and curated context data to augment AI products has become increasingly obvious.

As the AI field continues to develop and blind spots in general LLM training become painfully apparent, teams with proprietary data will turn to RAG (retrieval-augmented generation) and fine-tuning en masse to augment their enterprise AI products and deliver a demonstrable value moat for their stakeholders.

RAG is still relatively new on the scene (it was first introduced by Meta AI in 2020), and organizations have yet to develop experience or best practices around RAG — but they’re coming.
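To make the idea concrete, here is a minimal sketch of the retrieve-then-prompt pattern behind RAG. Everything in it is an illustrative assumption rather than anything the article specifies: the sample documents, the function names, and the bag-of-words similarity, which stands in for the learned embeddings and vector database a real system would likely use. No model is called; the sketch stops at building the augmented prompt.

```python
import math
import re
from collections import Counter

# Hypothetical proprietary documents standing in for curated context data.
DOCUMENTS = [
    "Refunds are processed within 5 business days of the return being received.",
    "Enterprise plans include a 99.9% uptime SLA and 24/7 support.",
    "Data exports are available in CSV and Parquet formats.",
]


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words counts (a toy stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    query = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(query, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Augment the question with retrieved context before it goes to a model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("How long do refunds take?", DOCUMENTS))
```

The point of the sketch is that answer quality is bounded by what `retrieve` returns, which is why clean, curated context data matters so much here.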

5. Teams will operationalize enterprise-ready AI products

The data engineering trend that keeps on trending — data products. And make no mistake, AI is a data product.

If 2023 was the year of AI, 2024 will be the year of operationalizing AI products. Whether out of need or coercion, data teams across industries will embrace enterprise-ready AI products. The question is — will they really be enterprise ready?

Gone are (hopefully) the days of creating random chat features just to say you’re integrating AI when the board asks. In 2024, teams are likely to become more sophisticated about how they develop AI products, leveraging better training practices to create value and identifying problems to solve instead of pumping out technology that creates new ones.

6. Data observability will support AI and vector databases

In the 2023 CDO Insights survey from Amazon Web Services (AWS), respondents were asked what their organization’s biggest challenge was in realizing the potential of generative AI.

The most common answer? Data quality.

Generative AI is, at its core, a data product. And like any data product, it doesn’t function without reliable data. But at the scale of LLMs, manual monitoring can’t provide the comprehensive and efficient quality coverage required to make any AI reliable.

To truly be successful, data teams need a living, breathing data observability strategy tailored to AI stacks, one that empowers them to detect, resolve, and prevent data downtime consistently within a growing and dynamic environment. To be a contender in the AI reliability battle in 2024, those solutions need to prioritize resolution, pipeline efficiency, and the streaming and vector infrastructures that support AI.
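As a rough illustration of what automated monitoring replaces, the sketch below shows two of the simplest checks such a strategy might start from: a freshness check and a volume check. The function names, thresholds, and sample numbers are all hypothetical, and a production observability tool would cover far more (schema changes, lineage, distribution drift) than this.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def is_stale(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness check: has the table gone longer than max_age without a load?"""
    return now - last_loaded_at > max_age


def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Volume check: is today's row count far outside the recent pattern?

    Uses a simple z-score against past daily row counts. The threshold of 3
    is an assumption for illustration, not a recommended setting.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


if __name__ == "__main__":
    daily_rows = [1000, 1020, 980, 1010, 990]  # hypothetical row counts
    print(volume_anomaly(daily_rows, today=400))   # sharp drop in volume
    print(volume_anomaly(daily_rows, today=1005))  # within the normal range
```

Checks like these are cheap to run per table, but writing and tuning them by hand for every table, stream, and vector index is the part that does not scale.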

7. Big data will get small

Thirty years ago, a personal computer was a novelty. Now, with modern MacBooks boasting the same computational power as the AWS servers Snowflake launched their MVP warehouse on in 2012, hardware is blurring the lines between commercial and enterprise solutions.

Since most workloads are small, data teams will begin to use in-process and in-memory databases to analyze and move datasets.

Particularly for teams that need to scale quickly, these solutions are fast to get started and can rise to enterprise-level functionality with commercial cloud offerings.
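For a sense of how little setup an in-process database needs, here is a minimal sketch using Python’s built-in `sqlite3` module with an in-memory database. The article does not name a specific engine; `sqlite3` is used here only because it ships with Python, and the table, column names, and rows are made up for illustration.

```python
import sqlite3


def load_orders() -> sqlite3.Connection:
    """Create an in-memory database that lives inside this Python process."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("emea", 120.0), ("emea", 80.0), ("amer", 200.0), ("apac", 50.0)],
    )
    return conn


def revenue_by_region(conn: sqlite3.Connection) -> dict[str, float]:
    """Aggregate with plain SQL; no server, cluster, or network hop involved."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(revenue_by_region(load_orders()))
```

The database runs inside the application itself, so there is nothing to provision before the first query.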

8. Right-sizing will take priority

Today’s data leaders are faced with an impossible task. Use more data, create more impact, leverage more AI — but lower those cloud costs.

As Harvard Business Review puts it, chief data and AI officers are set up to fail. As of Q1 2023, IDC reports that cloud infrastructure spending rose to $21.5 billion. According to McKinsey, many companies are seeing cloud spend grow up to 30% each year.

Low-impact approaches like metadata monitoring and tools that allow teams to see and right-size utilization will be invaluable in 2024.
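To put the numbers above in perspective, the sketch below compounds the 30% annual growth figure over a few years and flags resources whose usage is a small fraction of what is provisioned. The resource names, usage figures, and the 30% utilization threshold are hypothetical; in practice the inputs would come from your warehouse or cloud provider’s usage metadata.

```python
def projected_spend(current: float, annual_growth: float, years: int) -> float:
    """Compound annual growth: spend after `years` at a fixed growth rate."""
    return current * (1 + annual_growth) ** years


def underutilized(
    resources: dict[str, tuple[float, float]], threshold: float = 0.3
) -> list[str]:
    """Return names of resources using less than `threshold` of provisioned capacity.

    `resources` maps a name to (used, provisioned) in the same unit,
    for example compute hours per week.
    """
    return sorted(
        name
        for name, (used, provisioned) in resources.items()
        if provisioned > 0 and used / provisioned < threshold
    )


if __name__ == "__main__":
    # At 30% annual growth, spend more than doubles in three years.
    print(round(projected_spend(100.0, 0.30, 3), 1))

    # Hypothetical warehouses: (hours used, hours provisioned).
    usage = {"etl_wh": (10, 100), "bi_wh": (70, 100), "adhoc_wh": (5, 40)}
    print(underutilized(usage))
```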

9. The Iceberg will rise (Apache Iceberg)

Apache Iceberg is an open source data lakehouse table format developed by the data engineering team at Netflix to provide a faster and easier way to process large datasets at scale. It’s designed to be easily queryable with SQL even for large analytic tables with petabytes of data.

Where modern data warehouses and lakehouses offer both compute and storage, Iceberg focuses on providing cost-effective, structured storage that can be accessed by the many different engines that may be leveraged across your organization at the same time, like Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Impala.

Recently, Databricks announced that Delta table metadata will also be compatible with the Iceberg format, and Snowflake has also been moving aggressively to integrate with Iceberg. As the lakehouse becomes a de facto solution for many organizations, Apache Iceberg, along with its alternatives, is likely to continue to grow in popularity as well.

10. Return to office for…someone

RTO — everyone’s least favorite initialism. Or possibly their favorite! Honestly, I can’t keep up at this point. While teams appear to be divided on the issue, more and more teams are being called back to their cubicle/open floor plan/flexible working environments for at least a couple days per week.

According to a September 2023 report by Resume Builder, 90% of companies plan to enforce return-to-office policies by the end of 2024 — nearly four years after that fateful spring in 2020.

In fact, several powerful CEOs — including Amazon’s Andy Jassy, OpenAI’s Sam Altman, and Google’s Sundar Pichai — have already enacted return-to-office policies over the past several months. And there do appear to be at least some benefits to working in an office (at least part-time) versus exclusively from home.

Find yourself in the stay-at-home-forever camp? It appears the answer, as is always the case in data, is to deliver more value. Despite recent economic headwinds and their impact on the job market, data and AI teams are in high demand. And employers will often do what it takes to get them and keep them. While some companies are mandating that all employees return to the office regardless of role, other companies like Salesforce are requesting that non-remote engineers go in much less, for a total of 10 days per quarter.