Earlier this year, I dove into research on crypto and AI and had the idea of building a decentralized Scale AI. I talked with early Scale and MTurk employees and ~50 AI companies that need data labeling, then wrote the following memo. Below I lay out the idea's challenges and potential solution transparently, what I found along the way, and why this idea will and will not work.
Enjoy!
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
The advancement of AI models is constrained by limited access to high-quality labeled training data. Current labeling platforms have centralized workforces with limited reach and opaque processes.
Our solution is a web3 protocol that connects companies directly with a decentralized, global crowd of data labelers. Workers are rewarded for their contributions through crypto token payments.
Labelers earn tokens for completing labeling tasks accurately and quickly. They can also earn rewards for training new labelers and building the community. Experienced labelers are motivated to maintain quality standards.
Companies purchase the ecosystem’s governance token to access the decentralized labeling workforce and compensate them in tokens. This creates sustainable token demand and liquidity. By cryptoeconomically coordinating crowds at scale, we transform data labeling into an open, efficient protocol — unlocking new sources of training data to advance AI.
For labelers, we democratize access to work and fair pay. For companies, we reduce costs and friction in sourcing high-quality training data. The result is a win-win decentralized platform, governed and improved by community participants. Our vision is an AI ecosystem fueled by abundant, affordable labeled data, accessible to all, and advanced collaboratively for the benefit of society.
- Leveraging blockchain technology to connect data labelers directly with companies needing training data labeled could cut out intermediaries and reduce costs.
- Integrating crypto wallets gives labelers an easy way to receive micropayments for each data sample labeled. Once a labeler onboards, we immediately generate an embedded wallet for them.
- Making it a protocol instead of a centralized platform enables anyone to build on top of it.
- Allowing labelers to vote on label schema, accuracy, etc. creates a decentralized, crowd-sourced data labeling ecosystem.
- It is important to protect the integrity of labels when opening labeling up to the crowd. Mechanisms to ensure quality could include reputation systems, consensus protocols, and staking.
- Companies get access to a large diverse labeler pool and lower costs. Labelers get paid for contributions.
- We could explore tokens/rewards for participation, and create a data marketplace for buying/selling labeled datasets.
- Key challenges: onboarding labelers, ensuring data quality, handling abuse, building reputation systems, etc.
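To make the quality mechanisms in the list above more concrete, here is a minimal sketch of how reputation, consensus, and staking could fit together. It is not a spec of the protocol: the `Labeler` fields, the reward amount, the slash rate, and the reputation step are all hypothetical values I picked for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Labeler:
    """Hypothetical labeler record; field names and defaults are illustrative."""
    name: str
    reputation: float = 1.0  # voting weight, grows when the labeler matches consensus
    stake: float = 10.0      # tokens locked as a quality bond


def weighted_consensus(votes, labelers):
    """Return (winning label, share of total reputation weight behind it).

    Ties go to whichever tied label was voted for first.
    """
    totals = defaultdict(float)
    for name, label in votes.items():
        totals[label] += labelers[name].reputation
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


def settle(votes, labelers, reward=1.0, slash_rate=0.1, rep_step=0.1):
    """Pay labelers who matched consensus; slash the stake of those who did not."""
    winner, confidence = weighted_consensus(votes, labelers)
    payouts = {}
    for name, label in votes.items():
        worker = labelers[name]
        if label == winner:
            payouts[name] = reward
            worker.reputation += rep_step
        else:
            payouts[name] = 0.0
            worker.stake -= worker.stake * slash_rate
            worker.reputation = max(0.1, worker.reputation - rep_step)
    return winner, confidence, payouts


# Example: three labelers vote on one sample; "a" has built up more reputation.
labelers = {
    "a": Labeler("a", reputation=2.0),
    "b": Labeler("b"),
    "c": Labeler("c"),
}
votes = {"a": "cat", "b": "dog", "c": "cat"}
winner, confidence, payouts = settle(votes, labelers)
```

One caveat with this kind of design: paying for agreement rewards matching the majority, not necessarily being right, so a real system would likely also need gold-standard samples with known answers mixed into the task stream.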
Our decentralized platform seamlessly connects companies to crowds to label AI training data, with privacy ensured by zero-knowledge-proof cryptography. Embedded crypto wallets and reputation systems incentivize high-quality affordable labeling. This unlocks the crowd’s potential to generate diverse training data, fueling AI advancement. Our platform provides economic opportunities for contributors while accelerating innovation — catalyzing a collaborative AI flywheel.
The advent of LLMs has created a much larger demand for labeled datasets, especially custom datasets (healthcare, finance, military/industrial). The cost of this from centralized vendors is high because their workforces are inelastic (unable to increase or decrease without hiring recruiters or firing people), and they possess a professional managerial class that increases customer costs.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
After talking with early Scale AI people and comparing other data labeling solutions, I think it’s very challenging to maintain quality while decentralizing the network. Scale succeeded because the early team was super strong, and Alex (CEO) was excellent at closing 8-figure enterprise contracts. It was a hard-core, operationally heavy business, given the effort that goes into training the labelers and managing the marketplace of workers.
The only way the idea of Paymo could work is for someone to figure out how to break tasks down efficiently, so that labelers only need to answer binary questions to micro-label data rather than intensively label very specific data.
Imagine a future where people download Paymo as an app, go through a few exercises of labeling tasks, and start labeling data by swiping left and right. Labeling tasks will be broken down to what people can do on the subway. It does sound sexy.
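Here is a rough sketch of what that task breakdown might look like in code: one multi-class labeling task becomes a set of yes/no questions, and the left/right swipes are aggregated back into labels. The function names, the question wording, and the `min_votes` and `threshold` values are my own assumptions for illustration, not part of any existing product.

```python
from collections import Counter


def to_binary_questions(item_id, candidate_labels):
    """Turn one multi-class task into one yes/no question per candidate label."""
    return [
        (item_id, label, f"Does this show a {label}?")
        for label in candidate_labels
    ]


def aggregate_swipes(swipes, min_votes=3, threshold=0.5):
    """Combine swipes into accepted labels.

    swipes: list of (item_id, label, answer), where answer=True is a
    swipe right ("yes") and answer=False is a swipe left ("no").
    Returns {item_id: [accepted labels]} for questions that received at
    least min_votes answers and whose yes-share exceeds threshold.
    """
    yes = Counter()
    total = Counter()
    for item_id, label, answer in swipes:
        total[(item_id, label)] += 1
        if answer:
            yes[(item_id, label)] += 1

    accepted = {}
    for (item_id, label), n in total.items():
        if n >= min_votes and yes[(item_id, label)] / n > threshold:
            accepted.setdefault(item_id, []).append(label)
    return accepted


# Example: one image, two candidate labels, three swipes per question.
questions = to_binary_questions("img1", ["cat", "dog"])
swipes = [
    ("img1", "cat", True), ("img1", "cat", True), ("img1", "cat", True),
    ("img1", "dog", False), ("img1", "dog", True), ("img1", "dog", False),
]
accepted = aggregate_swipes(swipes)
```

The trade-off is volume: a task with many candidate labels turns into many binary questions, each needing several swipes, so this only pays off if individual swipes are very cheap.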
While an intriguing concept, operational complexities around maintaining data quality at scale and onboarding sufficient labelers pose challenges. For this model to work, tasks must be simplified into micro-labeling binary decisions. Decentralized crowdsourcing could become viable if the labeling process could be gamified into easy left/right swiping tasks that anyone could do casually. However, centralized managed solutions still appear superior for intensive data labeling of complex datasets. Though decentralized protocols promise to disrupt industries, operational realities constrain their applicability for rigorous data labeling.