Tech Explained: AI’s data supply chain concentrates wealth and threatens long-term sustainability. Here’s a simplified explanation of the latest research and what it means for users.
The global rush to secure data for artificial intelligence (AI) is reshaping the digital economy, but a new academic analysis warns that the current system is structurally unstable and economically unfair. As AI firms race to license, scrape, and monetize vast volumes of text, images, audio, and code, the people and communities that generate this data are being systematically excluded from the value they create. The imbalance, researchers argue, is no longer a side effect of rapid innovation but a defining flaw that threatens the long-term health of the AI ecosystem.
The new position paper, titled A Sustainable AI Economy Needs Data Deals That Work for Generators and presented in the NeurIPS 2025 Position Paper Track, analyzes the economics of modern AI data pipelines and concludes that most current data deals concentrate wealth in the hands of a few large platforms while leaving data generators with little compensation, limited visibility, and no bargaining power.
How the AI data economy became extractive
What was once an academic research pipeline has evolved into a global market where data is treated as a strategic asset. Major AI developers report billions of dollars in revenue, and licensing deals for news archives, image libraries, academic content, and user-generated platforms have become routine. Yet the paper finds that the financial structure of these deals overwhelmingly favors data aggregators and model developers.
Of the 73 deals examined, the majority disclose no revenue-sharing arrangements with individual creators. Where revenue figures are available, the total disclosed value exceeds $677 million, yet documented payouts to original data contributors amount to a negligible fraction of that sum. In many cases, platforms hosting user-generated content license entire datasets to AI firms under broad terms, while the individuals who produced the content receive no direct payment at all.
According to the authors, this outcome is driven by three interlocking structural failures. The first is the loss of provenance. Once data is copied, bundled, and reused, information about who created it, under what consent, and with which license is often stripped away. This makes it difficult or impossible to trace how specific contributions influence trained models or downstream products, cutting off any path for attribution or compensation.
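To make the idea concrete, consider what a provenance record would have to carry for attribution to survive the pipeline. The Python sketch below is purely illustrative; the paper does not prescribe a schema, and every field name and identifier here is hypothetical. It shows the kind of metadata that is typically stripped away when data is copied and rebundled.

```python
# Illustrative only: a minimal provenance record of the sort that gets lost
# when data is scraped, bundled, and resold. All fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    creator_id: str        # who produced the item
    source_url: str        # where it was first published
    license: str           # e.g. "CC-BY-4.0" or a bespoke platform license
    consent_scope: str     # what uses the creator actually agreed to
    transformations: list = field(default_factory=list)  # processing history

record = ProvenanceRecord(
    creator_id="user-4821",                     # hypothetical identifiers
    source_url="https://example.com/post/99",
    license="CC-BY-4.0",
    consent_scope="research-and-training",
)
record.transformations.append("deduplicated")   # each pipeline stage appends its step
```

Once records like this are discarded, there is no technical path back from a trained model to the people whose work shaped it.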
The second failure is asymmetric bargaining power. Individual creators are fragmented and lack negotiating leverage, while large platforms and AI firms operate at massive scale. Licensing decisions are typically made between corporate entities, not between developers and the people whose work fills the datasets. Standardized terms of service, often written years before generative AI became commercially dominant, grant platforms sweeping reuse rights that creators cannot realistically contest.
The third issue is inefficient price discovery. Most data deals rely on flat, one-time payments or opaque lump sums that fail to reflect how data value changes over time. A dataset that marginally improves a model today may become far more valuable after fine-tuning, retraining, or deployment in new products. Yet current contracts rarely account for this dynamic value, locking creators out of future gains.
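A back-of-the-envelope comparison shows why this matters. The figures below are invented for illustration; the paper specifies no royalty rate or revenue numbers. The point is structural: a flat buyout freezes the creator's payout at today's estimate, while a usage-indexed share tracks value as it grows.

```python
# Hypothetical numbers, purely to illustrate the pricing gap described above.
flat_buyout = 100_000.0  # one-time lump sum negotiated today

# Assume the dataset's attributed share of downstream revenue grows as the
# model is fine-tuned and deployed in new products (years 1-3, invented).
annual_attributed_revenue = [100_000, 500_000, 2_000_000]
royalty_rate = 0.10  # assumed dynamic revenue-share rate

dynamic_total = sum(year * royalty_rate for year in annual_attributed_revenue)
print(f"flat buyout:   ${flat_buyout:>10,.0f}")   # $100,000, fixed forever
print(f"dynamic share: ${dynamic_total:>10,.0f}")  # $260,000, and still growing
```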
These mechanisms create a pipeline where economic value flows in one direction only. Data generators supply the raw material, but aggregators and model monetizers capture nearly all of the returns. The study argues that this is not simply a fairness issue but a systemic risk.
Why the current model threatens AI itself
The authors warn that the existing data economy could undermine the future of artificial intelligence. Modern machine learning systems depend on large, diverse, and continually refreshed datasets. If creators are excluded from the value chain, incentives to produce and share high-quality data weaken over time.
The paper highlights several risks that follow from this dynamic. First, data supply may shrink or become less representative as creators opt out, restrict access, or demand stricter controls. Second, markets may become more concentrated, with a small number of data holders and AI firms controlling access to critical resources. Third, legal and regulatory uncertainty increases as disputes over consent, copyright, and misuse escalate.
The study points to ongoing litigation and regulatory scrutiny as early signs of stress in the system. Lawsuits involving news publishers, image libraries, and code repositories reflect growing resistance to opaque data practices. At the same time, regulators in multiple jurisdictions are signaling that existing data protection and competition frameworks may not be sufficient to address the realities of generative AI.
The authors argue that simply relying on courts or regulation to fix the problem is unlikely to succeed on its own. Legal action is slow, expensive, and reactive. Blanket regulatory controls risk stifling innovation or favoring large incumbents who can absorb compliance costs. What is missing, they contend, is a technical and economic infrastructure that allows data to be exchanged in a way that is transparent, flexible, and fair by design.
Importantly, the study notes that the current system may also harm AI developers themselves. When deals are negotiated through intermediaries, developers lack clear insight into data quality, provenance, and long-term availability. This can expose firms to legal risk and limit their ability to optimize models for specific tasks. In an industry where marginal performance gains matter, inefficient data markets become a competitive disadvantage.
The paper frames the problem as a feedback loop. Missing provenance weakens bargaining power. Weak bargaining leads to one-time buyouts. Buyouts remove incentives to maintain provenance or invest in data quality. Breaking that cycle, the authors argue, requires rethinking how data markets are structured from the ground up.
A proposed framework for fairer data markets
To address these challenges, the study proposes the Equitable Data-Value Exchange framework, known as EDVEX. Rather than a single platform or regulation, EDVEX is presented as a modular blueprint for a new kind of data economy that aligns incentives across creators, aggregators, and AI developers.
EDVEX is based on three pillars. The first is task-based data matching. Instead of acquiring large datasets based on brand or scale alone, developers would identify data sources based on how well they improve performance for a specific task. Small-scale evaluations could estimate the marginal utility of different datasets, allowing developers to assemble task-optimized data bundles rather than relying on blunt, all-purpose licenses.
The second pillar is auditable lineage tracking. EDVEX envisions systems that automatically record where data comes from and how it is used throughout the machine learning pipeline. This includes tracking transformations, training steps, and downstream applications. Such lineage records would make it possible to audit usage, enforce consent, and allocate revenue based on actual contribution, without requiring creators to reveal sensitive details.
The third pillar is utility-driven valuation. Instead of fixed prices or opaque negotiations, data would be priced according to its measured impact on model performance. Revenue could be shared dynamically among contributors, potentially using established methods from cooperative game theory to estimate each participant’s marginal contribution. This approach aims to tie compensation directly to value creation rather than bargaining power.
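The best-known such method is the Shapley value, which averages each contributor's marginal contribution over every possible ordering of the group. The paper invokes the family of methods without fixing one; the Python below is a standard Monte Carlo approximation (exact computation is exponential in the number of contributors), with an invented toy utility function standing in for real model evaluations.

```python
import random

def shapley_estimate(contributors, utility, n_permutations=2000):
    """Monte Carlo estimate of each contributor's Shapley value."""
    values = {c: 0.0 for c in contributors}
    for _ in range(n_permutations):
        order = random.sample(contributors, len(contributors))
        coalition = set()
        prev = utility(frozenset())
        for c in order:
            coalition.add(c)
            current = utility(frozenset(coalition))
            values[c] += current - prev  # marginal contribution in this ordering
            prev = current
    return {c: total / n_permutations for c, total in values.items()}

# Invented toy utility: "a" is worth 3 alone, "b" worth 1, together 5.
def toy_utility(coalition):
    table = {frozenset(): 0.0, frozenset({"a"}): 3.0,
             frozenset({"b"}): 1.0, frozenset({"a", "b"}): 5.0}
    return table[coalition]

shares = shapley_estimate(["a", "b"], toy_utility)
# Converges to roughly {"a": 3.5, "b": 1.5}: all 5.0 units of utility are
# split by measured contribution, not by bargaining power.
```

In the toy example, the two contributors split the full value of the combined dataset in proportion to what each actually adds, which is exactly the property the framework is after.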
The authors point out that EDVEX is not a call to abandon existing markets or services. Data-for-service models, synthetic data generation, and large-scale licensing agreements can still play a role. However, the framework seeks to make the value of data explicit and contestable, giving creators a clearer stake in how their work is used beyond the immediate service they receive.
Importantly, the paper does not claim that EDVEX is ready for deployment. It identifies numerous open research problems, from scaling lineage tracking to millions of contributors, to preventing manipulation of valuation metrics, to avoiding price collapse for highly substitutable data. The authors frame these challenges as opportunities for the machine learning research community to engage with economic and social dimensions of AI, not just technical performance.
The study also positions EDVEX as complementary to emerging regulation. By embedding transparency and accountability into the data pipeline itself, such systems could make compliance with data protection and fairness rules easier to implement and verify. Rather than treating regulation as an external constraint, EDVEX treats it as a design requirement.
