
At Spotify, data problems used to follow a specific pattern. You'd look for the relevant dashboard, and there wasn't one. You'd message the corresponding data expert on Slack and wait until they had time to help. But with thousands of teams moving fast, the demand for data insights had quietly outpaced what any individual expert could handle alone.
To solve this problem, we started developing an AI data assistant. But Spotify has over 70,000 datasets, amounting to petabytes of data, and no single individual can claim knowledge of all of it. Just putting all the schemas into an LLM doesn’t work at this scale.
For one, context windows are limited. Even a million tokens cannot accommodate a whole data warehouse. Second, schemas do not convey all the information. Knowing that a column has the INT64 type says nothing about the fact that values below 100 are legacy test data, how those rows differ from actual data, or which definition of “active user” applies. Give a model that many tables, and it will confidently select the wrong one.
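A quick back-of-the-envelope calculation makes the first point concrete. It uses only the two figures quoted above; splitting the window evenly across datasets is a simplifying assumption made for illustration.

```python
# How many tokens each dataset would get if the whole warehouse
# had to share a single context window (even split is an assumption).
CONTEXT_WINDOW_TOKENS = 1_000_000  # "a million tokens", from the text
DATASET_COUNT = 70_000             # "over 70,000 datasets", from the text

tokens_per_dataset = CONTEXT_WINDOW_TOKENS / DATASET_COUNT
print(f"{tokens_per_dataset:.1f} tokens per dataset")
```

At roughly 14 tokens per dataset, there is barely room for a table name and a few column names, let alone profiling, definitions, or examples.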
We needed something in between. A layer that captures what actually matters about a slice of the warehouse, owned by people who own and understand the domain.
Our data agent
Spotify’s data assistant was built to solve this problem. Ask a question in plain English and get reliable data within seconds. Since August 2025 it has been used by over 2,100 Spotifiers in 13,000+ conversations and 60,000+ messages, drawing on 177 clusters that cover advertising, podcasts, music, audiobooks, finance, creator tools, and more than a dozen other domains. More than a quarter of these users had never written SQL before.
When a question comes in, the agent picks the appropriate context, writes the SQL query, runs it against our warehouse, and returns the answer alongside the query and its sources. It follows a ReAct [1] loop, reasoning and acting in steps, adjusting based on what each tool call returns. You can read how the result was produced, not just what it was.
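The loop described above can be sketched in a few lines. This is a minimal illustration, not the agent's actual implementation: `scripted_model` is a hypothetical stand-in for the LLM, an in-memory SQLite database stands in for the warehouse, and the table `daily_streams` and all other names are invented for the example.

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str
    action: str    # "run_sql" or "finish"
    argument: str  # SQL text, or the final answer text

@dataclass
class Answer:
    text: str
    sql: str
    sources: list[str]
    trace: list[str] = field(default_factory=list)

def scripted_model(question, observations):
    """Hypothetical stand-in for the LLM: picks the next step from what it has seen."""
    if not observations:
        return Step("I need stream counts per country.", "run_sql",
                    "SELECT country, SUM(streams) FROM daily_streams "
                    "GROUP BY country ORDER BY country")
    return Step("I have the rows and can answer.", "finish",
                f"Streams by country: {observations[-1]}")

def react(question, conn, sources, model=scripted_model, max_steps=5):
    """Alternate reasoning and tool calls until the model decides to finish."""
    observations, trace, last_sql = [], [], ""
    for _ in range(max_steps):
        step = model(question, observations)
        trace.append(f"Thought: {step.thought}")
        if step.action == "finish":
            return Answer(step.argument, last_sql, sources, trace)
        try:
            observation = conn.execute(step.argument).fetchall()
            last_sql = step.argument
        except sqlite3.Error as exc:
            # Errors are fed back as observations so the next step can adjust.
            observation = f"error: {exc}"
        observations.append(observation)
        trace.append(f"Action: run_sql -> Observation: {observation}")
    raise RuntimeError("no answer within max_steps")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_streams (country TEXT, streams INTEGER)")
conn.executemany("INSERT INTO daily_streams VALUES (?, ?)",
                 [("SE", 10), ("US", 30), ("SE", 5)])
answer = react("How many streams per country?", conn, sources=["daily_streams"])
print(answer.text)
print(answer.sql)
```

Returning the SQL, the sources, and the step-by-step trace together with the answer is what lets a reader check how the result was produced.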
We built it into the surfaces where people already work: a Slack bot for quick questions while chatting in a thread, an MCP server for IDEs and AI tools, and a dedicated web UI for interactive exploration. When no knowledge base covers the topic, the data agent tells you so. That transparency is part of what makes its answers reliable.
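One way to picture the "no coverage" behavior is a router that declines instead of guessing. The sketch below is an assumption for illustration only: the article does not say how the agent matches a question to a knowledge base, so a simple keyword overlap is swapped in, and the cluster names and keywords are invented.

```python
import re

def pick_cluster(question, cluster_keywords, min_overlap=1):
    """Return the best-matching cluster name, or None when nothing covers the question."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    best_name, best_score = None, 0
    for name, keywords in cluster_keywords.items():
        score = len(words & keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None

# Hypothetical clusters and keywords.
clusters = {
    "podcasts": {"podcast", "episode", "show"},
    "ads": {"ad", "campaign", "impression"},
}

covered = pick_cluster("How many podcast episode starts last week?", clusters)
uncovered = pick_cluster("What is the office wifi password?", clusters)
print(covered)    # a cluster name
print(uncovered)  # None: the agent should say it has no coverage
```

The design point is the `None` branch: an explicit "not covered" result is cheaper than a confident answer built on the wrong context.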
But the interesting part isn't the model. It's how we make sure the answers are trustworthy. That comes down to context and ownership.
The cluster model
At Spotify, we call data domains clusters. A domain can be tied to an initiative, an organization, or an ad hoc interest. This flexibility lets any insights team build a cluster around its topics, while also telling the team if the domain is already covered. Each cluster is owned by a named team of domain experts and consists of three components:
Datasets: the data warehouse tables that are relevant, with full schema and profiling. We capture column cardinality, samples of common values, and partition structure. When the model generates a WHERE clause, it helps to know that `country` has values like 'US', 'GB', 'SE' rather than guessing.
Pairs: vetted question-and-SQL examples. This is the few-shot mechanism powering the data agent. A domain expert writes or approves each pair, picking examples that teach the patterns they'd want a colleague to follow. They teach the LLM how to query the data and its semantics.
Docs: additional business context. This could be terminology, gotchas, definitions that vary by team, which columns to use and which to avoid.
Curation is owned by the data experts: the data scientists and analytics engineers who know how the data is modeled and how to query it efficiently. They decide how to split their domain into clusters, which tables to include, and which examples are important.
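The three components can be pictured as a small data structure that is rendered into the context the model sees. This is a sketch under assumptions: every class, field, and function name below is hypothetical, and the prompt layout is only one plausible way to combine schema profiling, docs, and few-shot pairs.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    cardinality: int
    sample_values: list[str] = field(default_factory=list)

@dataclass
class Dataset:
    table: str
    partition_column: str
    columns: list[Column]

@dataclass
class Pair:
    question: str
    sql: str
    approved_by: str  # the domain expert who wrote or approved the pair

@dataclass
class Cluster:
    name: str
    owner_team: str
    datasets: list[Dataset]
    pairs: list[Pair]
    docs: list[str]

def build_prompt(cluster, question):
    """Render schema profiling, docs, and vetted pairs as few-shot context."""
    lines = [f"Cluster: {cluster.name} (owned by {cluster.owner_team})", "Tables:"]
    for ds in cluster.datasets:
        lines.append(f"- {ds.table} (partitioned by {ds.partition_column})")
        for col in ds.columns:
            samples = ", ".join(repr(v) for v in col.sample_values)
            lines.append(f"    {col.name} {col.dtype}, "
                         f"~{col.cardinality} distinct, e.g. {samples}")
    lines.append("Notes:")
    lines += [f"- {doc}" for doc in cluster.docs]
    lines.append("Examples:")
    for pair in cluster.pairs:
        lines += [f"Q: {pair.question}", f"SQL: {pair.sql}"]
    lines += [f"Q: {question}", "SQL:"]
    return "\n".join(lines)

# Invented example data.
cluster = Cluster(
    name="streams-by-market",
    owner_team="market-insights",
    datasets=[Dataset("daily_streams", "date", [
        Column("country", "STRING", 180, ["US", "GB", "SE"]),
        Column("streams", "INT64", 1_000_000),
    ])],
    pairs=[Pair("Total streams in Sweden yesterday?",
                "SELECT SUM(streams) FROM daily_streams WHERE country = 'SE'",
                approved_by="data-scientist-a")],
    docs=["Use country codes, not country names."],
)
prompt = build_prompt(cluster, "Total streams in the US yesterday?")
print(prompt)
```

Because the sample values for `country` appear in the prompt, the model can write `country = 'US'` instead of guessing at the encoding.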
Human Judgement
The obvious shortcut was to skip the curator. Our data warehouse holds the complete query history of every data expert who has ever used it. From there, generating question-SQL pairs is straightforward: take a query, ask an LLM to infer the question it was written for, and use those pairs to teach the model how to generate the SQL. These are real queries that people actually wrote to answer real questions: domain knowledge turned into data. It looks like a way to scale.
But the issue here is trust. At Spotify's size, an overconfident wrong guess can sway a decision in the wrong direction. We wanted the examples that influence the assistant’s behavior to be reviewed and marked as canonical by those familiar with the data.
So we tried it out. During our curation phase, we took actual queries that data scientists had run against each domain in our data warehouse, paired them with the inferred questions, and asked the cluster curators to pick which ones were good examples.
They accepted only 12.5% of the proposed pairs.
The other 87.5% were ad-hoc exploration, debugging sessions, one-off answers no one would ask again, queries that used the wrong table, or queries that were technically correct but taught the wrong pattern. Query history is rich. Most of it is noise. And the signal doesn't label itself.
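The review step can be sketched as a queue of candidate pairs that stay out of the few-shot set until a curator accepts them. All names are hypothetical, and the eight candidates are invented purely to reproduce the 12.5% ratio (one in eight); the rejection reasons are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CandidatePair:
    question: str                    # inferred by an LLM from the query
    sql: str                         # taken from query history
    accepted: Optional[bool] = None  # None until a curator reviews it
    reason: str = ""

def review(candidate, accepted, reason=""):
    """Record a curator's decision on one candidate."""
    candidate.accepted = accepted
    candidate.reason = reason
    return candidate

def approved_pairs(candidates):
    """Only explicitly accepted candidates may be used as examples."""
    return [c for c in candidates if c.accepted is True]

def acceptance_rate(candidates):
    reviewed = [c for c in candidates if c.accepted is not None]
    return len(approved_pairs(reviewed)) / len(reviewed) if reviewed else 0.0

candidates = [CandidatePair(f"question {i}", f"SELECT {i}") for i in range(8)]
rejections = ["ad-hoc exploration", "debugging session", "one-off answer",
              "wrong table", "teaches the wrong pattern",
              "ad-hoc exploration", "debugging session"]
review(candidates[0], True, "canonical pattern")
for candidate, reason in zip(candidates[1:], rejections):
    review(candidate, False, reason)

print(f"{acceptance_rate(candidates):.1%} accepted")
```

Defaulting `accepted` to `None` rather than `True` encodes the policy: nothing mined from history reaches the model without a human decision.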
That's why every example runs through an expert. The model reasons over context. It doesn't decide what's true about the data; the experts do. This isn’t about replacing the people who know best how to work with our data. It’s about giving them more leverage and shipping their expertise in a more scalable way.
Keeping clusters healthy
Data changes, business logic shifts, and context that was accurate last month can be wrong today. Schemas evolve, columns get renamed, tables get deprecated and replaced. Vedder needs that information current, without requiring constant manual attention.
That’s why each cluster has a health score made up of signals we calculate and monitor continuously. How healthy is the underlying data used in the cluster? How many of its curated pairs are still valid after recent schema changes? If a column gets renamed, the pairs referencing it degrade immediately. How well does the context cover the questions people are actually asking? How reproducible was the generated SQL? And a handful of others. If any of these degrade, the cluster’s health score reflects that and actions are suggested.
Data experts see the score and the underlying signals on their cluster dashboard, and use them to decide where to spend curation time.
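As an illustration of how such signals might combine, the sketch below checks curated pairs against the current schema and folds four signals into one score. The signal names, the equal weights, the weighted average, and the idea of storing each pair's referenced columns as a set are all assumptions; the article does not specify how the real score is computed.

```python
def valid_pair_fraction(pair_columns, current_columns):
    """Share of pairs whose referenced columns all still exist in the schema."""
    if not pair_columns:
        return 1.0
    valid = sum(1 for cols in pair_columns if cols <= current_columns)
    return valid / len(pair_columns)

def health_score(signals, weights):
    """Weighted average of signals, each expected in the range 0.0 to 1.0."""
    total_weight = sum(weights.values())
    return sum(signals[name] * w for name, w in weights.items()) / total_weight

# Columns referenced by four curated pairs (hypothetical).
pair_columns = [
    {"country", "streams"},
    {"user_id"},
    {"country"},
    {"streams", "user_id"},
]
# Schema after user_id was renamed to listener_id.
current_columns = {"country", "streams", "listener_id"}

pair_validity = valid_pair_fraction(pair_columns, current_columns)
signals = {
    "data_health": 1.0,
    "pair_validity": pair_validity,
    "coverage": 0.75,
    "reproducibility": 0.75,
}
weights = {name: 1.0 for name in signals}  # equal weights, an assumption
score = health_score(signals, weights)
print(pair_validity, score)
```

A single rename invalidates half of the pairs here, which is the kind of drop a curator would want surfaced on the dashboard along with the affected examples.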
Closing the loop
Every conversation with Vedder becomes a data point that feeds back into the system. Vedder logs every conversation and query, and the questions, answers, generated SQL and user feedback are shown to cluster owners.
This is how we scale the knowledge of a data scientist. Every question-SQL pair they approve and every doc they clarify helps the next user get more accurate insights. The answers are only as trustworthy as the context behind them, and that context needs tending.
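A minimal sketch of that feedback loop: each logged conversation carries the question, the generated SQL, the answer, and the user's feedback, and negatively rated ones are grouped per cluster for the owners to review. The record fields and the `"up"`/`"down"` feedback values are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ConversationRecord:
    cluster: str
    question: str
    generated_sql: str
    answer: str
    feedback: str  # "up", "down", or "" when the user gave none

def review_queue(records):
    """Group negatively rated conversations by the cluster that served them."""
    queue = defaultdict(list)
    for record in records:
        if record.feedback == "down":
            queue[record.cluster].append(record)
    return dict(queue)

# Invented log entries.
log = [
    ConversationRecord("podcasts", "Top shows last week?", "SELECT ...", "...", "up"),
    ConversationRecord("podcasts", "Episode completion rate?", "SELECT ...", "...", "down"),
    ConversationRecord("ads", "Impressions by campaign?", "SELECT ...", "...", ""),
]
queue = review_queue(log)
for cluster_name, records in queue.items():
    print(cluster_name, [r.question for r in records])
```

A question that lands in this queue is a natural candidate for a new vetted pair or a clarified doc.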
Beyond Spotify
Spotify has a strong data foundation with well-maintained datasets, a data catalog, and data scientists who care about their domains. That made Vedder possible, but the architecture isn't Spotify-specific.
The core idea remains valid: the people who best understand a data domain are the best ones to curate the context the model sees. Humans and LLMs alike can only understand raw schemas to a certain extent; curated context is what enables insights at scale. The role of our data experts becomes more strategic. They spend less time answering one-off questions and more time shaping the knowledge layer that answers thousands.
Context curation is the foundation. But what if the knowledge lies outside the schema? What if it exists in documentation and definitions of processes within the organization? These are some of the questions we are exploring next.
Citations
[1] https://arxiv.org/abs/2210.03629

Facts Only

Spotify developed an AI data assistant to address data access challenges.
The assistant has been used since August 2025 by over 2,100 employees.
It has handled 13,000+ conversations and 60,000+ messages across 177 data clusters.
Clusters cover domains like advertising, podcasts, music, and finances.
The assistant generates SQL queries, runs them, and returns answers with sources.
It integrates with Slack, IDEs, and a web UI.
Data domains are called "clusters," owned by named teams of domain experts.
Each cluster includes datasets, vetted question-SQL pairs, and business context.
Only 12.5% of automatically generated question-SQL pairs were approved by experts.
Clusters have health scores based on data validity, schema changes, and query reproducibility.
User interactions are logged and fed back to cluster owners for refinement.
The system is designed to scale expert knowledge, not replace it.

Executive Summary

Spotify developed an AI data assistant to address the growing demand for data insights across thousands of teams, where traditional methods like dashboards and Slack consultations with experts were insufficient. The assistant, operational since August 2025, has been used by over 2,100 employees in more than 13,000 conversations, handling queries across 177 data clusters spanning advertising, podcasts, music, and other domains. It operates by selecting relevant context, generating SQL queries, and returning answers with transparency about sources and reasoning.
The system relies on a "cluster model," where data domains are curated by domain experts who define datasets, vetted question-SQL pairs, and additional business context. This ensures the AI's responses are trustworthy and aligned with expert knowledge. Human curation is critical—only 12.5% of automatically generated question-SQL pairs from query history were deemed suitable by experts, highlighting the need for human oversight. Clusters are continuously monitored for health, with metrics tracking data validity, schema changes, and query reproducibility. User interactions with the assistant feed back into the system, allowing experts to refine context and improve accuracy over time.
The approach emphasizes scaling expert knowledge rather than replacing it, with data scientists spending less time on ad-hoc questions and more on shaping the knowledge layer. While Spotify's infrastructure enabled this solution, the core principle—leveraging domain experts to curate context for AI—is broadly applicable. Future challenges include integrating knowledge from documentation and organizational processes beyond raw schemas.

Full Take

This case study from Spotify offers a compelling example of how AI can augment rather than replace human expertise, particularly in complex data environments. The strongest aspect of the narrative is its emphasis on human curation as the foundation of trustworthiness. By requiring domain experts to vet question-SQL pairs and define context, Spotify avoids the pitfalls of over-reliance on raw schemas or unfiltered query histories. The 12.5% acceptance rate of automatically generated pairs underscores a critical insight: most real-world data interactions are noisy, ad-hoc, or context-dependent, making them poor training material without expert oversight. This aligns with broader patterns in AI deployment, where the "last mile" of reliability often depends on human judgment.
The architecture also reflects a broader shift in how organizations might scale knowledge. Rather than treating AI as a standalone solution, Spotify positions it as a force multiplier for experts, freeing them from repetitive queries to focus on higher-value curation. The health scoring system for clusters introduces a dynamic feedback loop, ensuring context remains current as data and business logic evolve. This addresses a common failure mode in AI systems: degradation over time due to outdated or misaligned training data.
However, the narrative assumes a level of organizational maturity that may not exist elsewhere. Spotify's well-maintained data foundation, catalog, and engaged data scientists are prerequisites for this model. Organizations without such infrastructure might struggle to replicate it, raising questions about scalability beyond tech-forward companies. Additionally, the reliance on expert curation could create bottlenecks if domain knowledge is siloed or if experts are overburdened with maintenance tasks.
The broader implication is that AI's role in data-driven decision-making may hinge less on model sophistication and more on the quality of the context layer. This challenges the prevailing focus on model size or architectural innovations, suggesting that the real bottleneck is often the "human in the loop" curation process. Future work might explore how to lower the barrier to curation, perhaps through collaborative tools or incentives for domain experts.
**Bridge questions:**
How might smaller organizations with fewer data experts adapt this model without overwhelming their teams?
What mechanisms could ensure that curation remains a strategic priority rather than an afterthought as systems scale?
How does this approach balance the tension between standardization (for AI reliability) and the inherent messiness of real-world data?
