Johanna Cabildo: Big Tech’s Data Addiction Is Breaking AI


In Brief
Big Tech’s reliance on synthetic data is degrading AI quality, entrenching bias, and centralizing control, while the real solution lies in rebuilding a fair, transparent, and human-centered data ecosystem.

Meta’s Llama 4 launched to high expectations. Instead, it disappointed: compared with its predecessor, it delivered weaker reasoning, more hallucinations, and diminished overall performance. According to Johanna Cabildo, CEO of D-GN, the reason wasn’t a lack of compute or innovation; it was data.
Having exhausted the internet’s supply of clean, diverse, high-quality text, Meta turned to synthetic data: AI-generated content used to train newer AI. This creates a feedback loop in which models learn from their own outputs, losing accuracy and depth with each cycle, a failure mode researchers call model collapse.
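The dynamic can be made concrete with a toy simulation. In the sketch below, which is purely illustrative (a Gaussian fit stands in for a language model), each generation is trained only on the previous generation’s synthetic samples:

```python
# Purely illustrative toy of the self-training loop: each "generation" fits
# a Gaussian to the previous generation's output, then samples fresh
# synthetic data from that fit. The heavy tails of the original data vanish
# after the first refit, and sampling noise makes the fitted parameters
# drift with every cycle: a miniature analogue of model collapse.
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_t(df=3, size=500)  # heavy-tailed stand-in for "real" data

for gen in range(11):
    mu, sigma = data.mean(), data.std(ddof=1)  # "train": fit a Gaussian
    print(f"gen {gen:2d}: sigma={sigma:.2f}  max|x|={np.abs(data).max():.1f}")
    data = rng.normal(mu, sigma, size=500)     # "generate": synthetic data only
```

Real model collapse is more gradual and far higher-dimensional, but the direction is the same: rare, informative cases are the first to disappear.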
Other major players—OpenAI, Google, Anthropic—face the same dilemma. The age of abundant, real-world training data has ended. What’s left is synthetic filler. As a result, progress is stalling, and the illusion of advancement is masking a quiet decline.
Who Owns the Data?
The 2024 Stanford AI Index reported that eight companies now control 89% of global AI training data and infrastructure. This isn’t just about market power. It affects what knowledge is embedded in AI and whose perspectives are excluded.
Models trained on biased or narrow datasets can reinforce real-world harm. AI tools built on American healthcare records misdiagnose patients in other countries. Hiring systems penalize applicants with non-Western names. Facial recognition is less accurate on darker skin, particularly for women. Content filters flag minority dialects as offensive or irrelevant, effectively silencing them.
As models lean more heavily on synthetic data, these errors worsen. Researchers warn of recursive loops that produce “polished nonsense”: text that sounds correct but contains fabricated facts. In early 2025, the Columbia Journalism Review found that Google Gemini provided fully accurate citations only 10% of the time. The more these systems train on their own flawed outputs, the faster they decay.
Locked In, Locked Out
AI companies built their models on a foundation of publicly available knowledge: books, Wikipedia, forums, and news articles. Now the same firms are walling off their models and monetizing access.
In late 2023, The New York Times sued OpenAI and Microsoft over unauthorized use of its content. Meanwhile, Reddit and Stack Overflow entered exclusive licensing deals, giving OpenAI access to user-generated content previously open to all.
The strategy is clear: harvest free public knowledge, monetize it, and lock it behind APIs. The same companies that benefited from open ecosystems now restrict access while promoting synthetic data as a sustainable alternative, despite mounting evidence that it degrades model performance. AI can’t evolve by learning from itself. There’s no insight in a mirror.
A Different Path
Fixing AI’s data crisis doesn’t require more compute or bigger models—it requires a shift in how data is collected, valued, and governed.
Web3 technologies offer one possible way forward. Blockchain can track where data comes from, and tokenized systems can compensate the people who contribute their knowledge. Projects like Morpheus Labs have used these tools to improve Swahili-language AI performance by 30%, simply by incentivizing community input.
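As a concrete, entirely hypothetical sketch of the provenance idea (not the protocol of Morpheus Labs or any particular chain): each contribution can be recorded as a content hash chained to the previous record, so anyone assembling a training set can verify who supplied what, and in what order.

```python
# Hypothetical sketch of hash-chained provenance records. Real systems would
# anchor these on a blockchain and attach token rewards; the core idea is the
# same: tamper-evident attribution for every training-data contribution.
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    contributor: str   # contributor identifier (hypothetical)
    content_hash: str  # SHA-256 of the contributed data
    prev_hash: str     # digest of the previous record, forming a chain
    timestamp: float

def digest(record: ProvenanceRecord) -> str:
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append(chain: list, contributor: str, data: bytes) -> None:
    prev = digest(chain[-1]) if chain else "0" * 64
    chain.append(ProvenanceRecord(
        contributor=contributor,
        content_hash=hashlib.sha256(data).hexdigest(),
        prev_hash=prev,
        timestamp=time.time(),
    ))

chain = []
append(chain, "alice", b"Swahili corpus, batch 1")
append(chain, "bob", b"Swahili corpus, batch 2")

# Tampering with any record changes its digest and breaks every link after it.
assert chain[1].prev_hash == digest(chain[0])
```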
Privacy-preserving tools like zero-knowledge proofs add another layer of trust. They make it possible to use sensitive information, such as medical records, in training pipelines without exposing the underlying private data, so models can learn ethically without sacrificing performance.
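The core “prove without revealing” idea fits in a few lines. Below is one round of the classic Schnorr identification protocol, with deliberately tiny, insecure parameters chosen for readability: the prover convinces the verifier it knows a secret x without the verifier ever learning x.

```python
# One round of the classic Schnorr identification protocol: a simple,
# honest-verifier zero-knowledge proof. The prover shows it knows x with
# y = g^x (mod p) without revealing x. Parameters here are tiny and
# insecure, chosen only so the arithmetic is easy to follow.
import secrets

p, q, g = 2039, 1019, 4   # toy group: p = 2q + 1, g generates the order-q subgroup

x = secrets.randbelow(q)  # prover's secret (stand-in for private knowledge)
y = pow(g, x, p)          # public value derived from the secret

r = secrets.randbelow(q)  # prover: fresh random nonce
t = pow(g, r, p)          # prover -> verifier: commitment
c = secrets.randbelow(q)  # verifier -> prover: random challenge
s = (r + c * x) % q       # prover -> verifier: response

# Verifier checks g^s == t * y^c (mod p). A passing check proves knowledge
# of x, yet the transcript (t, c, s) reveals nothing about x itself.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("verified: prover knows x, verifier never saw it")
```

Production privacy-preserving ML relies on far heavier machinery (zk-SNARKs, secure aggregation, and the like), but the convince-without-disclosing principle is the same.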
These ideas aren’t speculative. Startups are already using decentralized tools to build culturally accurate, privacy-respecting AI systems around the world.
Reclaiming the Future
AI is shaping the systems that shape society—education, medicine, work, and communication. The central question is no longer whether AI will dominate, but who controls what it becomes.
As the AI industry confronts the limitations of synthetic data and monopolized infrastructure, platforms like D-GN offer a clear path forward: one where AI is trained by people, for people, and in service of a more just and intelligent future.
Will we allow a handful of companies to recycle their own outputs, degrade model quality, and entrench bias? Or will we invest in building a new kind of data ecosystem—one that values transparency, fairness, and shared ownership?
The problem is not that machines don’t have enough data. The problem is that the data they’re using is increasingly synthetic, narrow, and controlled. The solution is to return power to the people who create meaningful content—and reward them for it. Better AI starts with better data. And better data starts with us.
About The Author
Victoria is a writer covering a range of technology topics, including Web3, AI, and cryptocurrencies. Her extensive experience allows her to write insightful articles for a wide audience.