Community-curated catalog

DataLicenses.org

Live and emerging initiatives that shape AI data access, licensing, and enforcement. Use the glossary for vocabulary and the method and contribute guide for how entries are reviewed and how to add one.

Each initiative appears once under a primary approach type, even when it also uses secondary mechanisms.

Initiatives: 46
Live now: 32
WIP tracked: 14
Approach types: 8
What these labels mean
Initiative: One tracked effort in the catalog, such as a standard, product, registry, protocol, marketplace, or organizational program.
Live: Publicly available now, with enough concrete evidence that someone could inspect or use it today.
WIP: Work in progress that matters to track, but is still emerging, incomplete, or not yet clearly deployable.
Approach type: The main mechanism an initiative uses, like preference signals, formal licenses, collectives, marketplaces, tollgates, blocking, infrastructure, or certification.

Primary approach type

Preference signal

Signals that express whether AI systems may crawl, train on, or reuse content, usually through metadata, headers, or other machine-readable notices.

Initiative Website Latest update Approach type

The Internet Engineering Task Force (IETF) is working on a standardized preference signal for AI agents and crawlers: "building blocks that allow for the expression of preferences about how content is collected and processed for Artificial Intelligence (AI) model development, deployment, and use."

datatracker.ietf.org/wg/aipref/about Nov 30, 2025 Vocabulary draft updated Preference signal
TDM·AI WIP

Asset-level protocol for binding machine-readable TDM and AI-training preferences to digital works.

Data types
Multimodal
Uses
IETF AI Preferences (AIPref)
tdmai.org Nov 03, 2025 Usage vocabulary updated Preference signal Also uses New infrastructure Pipeline: Collect -> Train -> Fine-tune
TK Labels Live

Local Contexts labels that let Indigenous communities express culturally specific conditions for access and reuse of knowledge and data.

localcontexts.org/labels/traditional-knowledge-labels Sep 17, 2025 Guide to using TK Labels and Notices updated Preference signal Pipeline: Collect -> Retrieve -> Train

robots.txt can be used to express AI crawler access preferences; see OpenAI's bot documentation for an example. In addition, the X-Robots-Tag HTTP response header lets servers send crawler directives with individual responses.

Data types
Web content
platform.openai.com/docs/bots Sep 16, 2025 OpenAI bots documentation Preference signal
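
As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` to check whether a crawler may fetch a URL. The robots.txt content, the example.com URL, and the `ExampleSearchBot` name are made up for this example; `GPTBot` is the user agent token that OpenAI documents for its crawler. It only shows how a cooperating crawler could interpret the file, since honoring robots.txt is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow GPTBot everywhere, allow all other agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_allowed = may_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = may_fetch(ROBOTS_TXT, "ExampleSearchBot", "https://example.com/article")
print(gptbot_allowed, other_allowed)
```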

Registry and opt-out workflow for marking works that should not be used in future AI training datasets.

Data types
Images
haveibeentrained.com Sep 14, 2025 Face Reveal launched for Have I Been Trained Preference signal Also uses New infrastructure Pipeline: Collect -> Train

Content Credentials-based preference system for signaling that generative AI should not train on or use a creator's files.

helpx.adobe.com/creative-cloud/apps/adobe-content-authenticity/generative-ai-training-preferences.html Sep 01, 2025 Generative AI training and usage preference documentation Preference signal Pipeline: Train -> Generate

A proposed convention for AI-specific crawler directives via an `ai.txt` file.

Data types
Web content
site.spawning.ai/spawning-ai-txt Aug 27, 2025 Improved crawler-control post published Preference signal

Proposed signals for communicating reuse preferences to AI and web agents.

github.com/creativecommons/cc-signals Aug 26, 2025 Response to feedback Preference signal

Proposed AT Protocol mechanism for users to declare preferences about reuse of their data, such as for generative-AI training.

Uses
IETF AI Preferences (AIPref)
demo.user-intents.org Mar 07, 2025 Proposal discussion opened Preference signal Pipeline: Collect -> Train -> Retrieve

Embedded image and video metadata fields for expressing whether assets may be used in data-mining and generative-AI training datasets.

Data types
Images, Video
pluscoalition.org/about-ai-and-ml-image-rights-standards Feb 19, 2025 IPTC and PLUS explain metadata stance on GenAI training Preference signal Pipeline: Collect -> Train -> Fine-tune

W3C Community Group specification for expressing text and data mining rights reservations via a well-known JSON file, HTTP headers, or HTML metadata, designed for EU DSM Directive compliance.

Data types
Web content
w3.org/community/tdmrep Aug 08, 2024 Version 3 final report listed Preference signal Also uses Formal license
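
A minimal sketch of how a crawler might read a TDMRep-style well-known JSON file. The field names (`location`, `tdm-reservation`, `tdm-policy`) follow the TDMRep report as we understand it; the sample rules, the example.com policy URL, and the simplified prefix matching with first-match-wins are assumptions for illustration, not the protocol's full matching rules.

```python
import json
from typing import Optional

# Hypothetical file content; a real site would serve its own rules.
TDMREP_JSON = """
[
  {"location": "/articles/", "tdm-reservation": 1,
   "tdm-policy": "https://example.com/policies/tdm.json"},
  {"location": "/press/", "tdm-reservation": 0}
]
"""


def tdm_reserved(rules: list, path: str) -> Optional[bool]:
    """Return True/False if a rule covers path, or None if no rule matches.

    Simplification (assumption): plain prefix match, first matching rule wins.
    """
    for rule in rules:
        if path.startswith(rule["location"]):
            return rule.get("tdm-reservation") == 1
    return None


rules = json.loads(TDMREP_JSON)
print(tdm_reserved(rules, "/articles/2024/story"))
print(tdm_reserved(rules, "/press/kit"))
print(tdm_reserved(rules, "/about"))
```

Returning `None` for unmatched paths keeps "no statement found" distinct from an explicit "not reserved", which a crawler may want to treat differently.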
NoML WIP

Proposal to add a `noml` directive so content can stay searchable but not be used for machine learning.

Data types
Web content
noml.info May 28, 2024 Project featured in scholarly article on search and AI opt-out Preference signal Pipeline: Collect -> Train

Platform-level HTML and HTTP directives that tell external AI dataset builders and model developers not to use an artist's work unless the artist opts in.

deviantart.com/team/journal/UPDATE-All-Deviations-Are-OptedOut-of-AI-Datasets-934500371 Apr 16, 2024 NoAI labels added to DeviantArt Studio Preference signal Pipeline: Collect -> Train
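
The directive-based approaches in this section (the proposed `noml` value and DeviantArt's NoAI labels) could be checked by a crawler roughly as follows. The token names `noai` and `noimageai` are assumed from DeviantArt's announcement, the function name is hypothetical, and the parsing is deliberately simplified: it treats the header or meta value as a plain comma-separated list and ignores user-agent-scoped values.

```python
# Directive tokens treated as AI opt-outs in this sketch (assumed names).
AI_OPT_OUT_DIRECTIVES = {"noai", "noimageai", "noml"}


def ai_opt_out_directives(value: str) -> set[str]:
    """Return the AI opt-out tokens present in an X-Robots-Tag or
    <meta name="robots"> content value, compared case-insensitively."""
    tokens = {token.strip().lower() for token in value.split(",")}
    return tokens & AI_OPT_OUT_DIRECTIVES


found = ai_opt_out_directives("noindex, NoAI, noimageai")
print(sorted(found))
```

A cooperating crawler would skip the asset for training purposes whenever the returned set is non-empty.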

Primary approach type

Formal license

Formal legal terms or license language that grant, restrict, or condition AI-related reuse of content, datasets, or model inputs.

Initiative Website Latest update Approach type

A machine-readable licensing schema for clearly signaling reuse permissions and conditions (including payment or use restriction).

Data types
Web content
Signals
Users 1,500+ organizations
Data billions of web pages
rslstandard.org Dec 09, 2025 Technical standards released Formal license Also uses Preference signal

Machine-readable licensing layer that lets websites declare AI usage terms and pricing.

Data types
Web content
copyright.sh Oct 27, 2025 WordPress plugin launched Formal license

Research-backed proposal for modular standard data licenses tailored to AI data sharing.

mlcommons.org/2025/03/unlocking-data-collab Mar 16, 2025 Research findings published Formal license Pipeline: Collect -> Train -> Fine-tune -> Retrieve

Primary approach type

Licensing collective

Shared bargaining, aggregation, or rights-management structures that let many publishers or creators negotiate AI access together.

Initiative Website Latest update Approach type

A 50/50 revenue-share platform connecting publishers with AI companies, with 700+ publishers signed up including major news outlets.

Data types
Text
Signals
Users 700+ publishers
prorata.ai Sep 04, 2025 Gist Answers launched Licensing collective Also uses Marketplace / Tollgate

Coalition pursuing licensing, compensation, and enforcement for publisher content used by AI systems.

Data types
Text
publishersrights.org Jan 20, 2025 Coalition overview published Licensing collective Pipeline: Train -> Retrieve

Trade alliance of dataset licensors pushing for legal clarity, ethical sourcing, and scalable licensing markets for AI training data.

Signals
Users 12 announced members
thedpa.ai Dec 09, 2024 DPA welcomed five new members Licensing collective

Primary approach type

Marketplace

Commercial platforms or brokers that package, list, or sell access to datasets, content libraries, or licensing opportunities for AI use.

Initiative Website Latest update Approach type

Paid marketplace routing premium publisher content into Microsoft Copilot, MSN, and Discover experiences.

Data types
Text
Signals
Users 7 launch publisher partners
about.ads.microsoft.com/en/blog/post/february-2026/introducing-the-microsoft-publisher-content-marketplace Feb 05, 2026 Publisher Content Marketplace announced with seven launch publishers Marketplace Pipeline: Retrieve -> Generate
Defined.ai Live

Marketplace for ethically sourced, annotated datasets used to train and fine-tune AI systems.

Data types
Multimodal
Signals
Payments several partners generate $1M+/year
defined.ai Jan 26, 2026 2025 marketplace growth review published Marketplace Pipeline: Train -> Fine-tune

AI data marketplace for licensed multimedia datasets, now being integrated into Cloudflare's AI crawl and content-access stack.

humannative.ai Jan 14, 2026 Cloudflare acquisition and integration announcement Marketplace Also uses New infrastructure Pipeline: Train
Protege Live

AI training data platform for compliant exchange of proprietary, real-world datasets across sectors.

Data types
Multimodal
Signals
Users hundreds of organizations
withprotege.ai Jan 06, 2026 Series A extension cites hundreds of data partners and cross-vertical growth Marketplace Also uses New infrastructure Pipeline: Train -> Fine-tune

Contributor compensation and licensed visual training-data program tied to Bria's commercially safe generative AI stack.

Data types
Images
Signals
Users 30+ data partners
bria.ai/artist-program Sep 17, 2025 Platform release highlights rights-clear models Marketplace Pipeline: Train
Dappier Live

Rights-cleared content marketplace and monetization layer for RAG, assistants, and other AI applications.

Data types
Text
dappier.com/marketplace Aug 17, 2025 Licensing program launch announced Marketplace Also uses Tollgate Pipeline: Retrieve -> Generate

Opt-in music dataset licensing program that pays rights holders for AI training use of tracks and catalogs.

Data types
Music
Signals
Data 14M+ tracks
sourceaudio.com/why-music-ai-dataset-partnerships-matter Jun 04, 2025 SourceAudio outlines AI dataset licensing program scale Marketplace Pipeline: Train -> Fine-tune
Credtent Live

Independent creative registry for opting out of AI use, licensing content, and certifying human-created work.

Data types
Multimodal
Signals
Users thousands of creators
credtent.org Apr 01, 2025 Ethical AI licensing marketplace described Marketplace Also uses Certification / New infrastructure Pipeline: Train

Rights-cleared book licensing platform for AI training, reference, and transformative use.

Data types
Text
Signals
Users 100+ bestselling authors
createdbyhumans.ai Mar 25, 2025 Mission article published Marketplace Pipeline: Train -> Retrieve

Licensed-content marketplace for the faith ecosystem, seeded with a pooled guarantee for AI assistants and search experiences.

Signals
Payments $5M pooled guarantee
gloo.com/resources/news/gloo-launches-first-ai-licensing-offering-for-faith-ecosystem Feb 19, 2025 Gloo launches AI Licensing with pooled guarantee Marketplace Pipeline: Retrieve -> Generate
vAIsual Live

Marketplace for rights-managed visual and biometric datasets tailored to AI training and evaluation.

Data types
Images
Signals
Data 600,000+ high-quality images
vaisual.com Aug 13, 2024 Diversity image database for better AI described Marketplace Pipeline: Train -> Fine-tune

Music licensing platform for rights-cleared AI training data and audio assets.

Data types
Music
Signals
Data 79M songs / 330,000 hours of music
gcx.co Mar 20, 2024 GCX launched as rights-cleared music dataset for AI Marketplace Pipeline: Train -> Fine-tune

Primary approach type

Tollgate

Access layers that require payment, metering, or authenticated entry before content can be fetched, queried, or reused for AI workflows.

Initiative Website Latest update Approach type
TollBit Live

Lets publishers add subdomains that make content accessible to AI systems, with controls for blocking and monetization.

Data types
Text
Signals
Users 4,000+ premium publishers
tollbit.com Dec 15, 2025 Imperva partnership announced Tollgate Also uses Marketplace

A set of tools to block or charge for scraping, including an AI Audit dashboard, managed robots.txt, and a pay-per-crawl marketplace.

Data types
Web content
Signals
Users 1M+ customers enabled AI bot blocking
Data 1B+ HTTP 402 responses/day
blog.cloudflare.com/control-content-use-for-ai-training Nov 17, 2025 Content Signals launched Tollgate Also uses Technical blocking / Preference signal
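
Pay-per-crawl tollgates of this kind answer unpaid requests with HTTP 402 (Payment Required). The sketch below shows one way a crawler could react to such a response under a per-page budget. The `x-example-price` header, the `CrawlDecision` type, and the budget logic are hypothetical placeholders, not Cloudflare's actual interface.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

PRICE_HEADER = "x-example-price"  # hypothetical header name, not a real API


@dataclass
class CrawlDecision:
    action: str                    # "use", "pay", or "skip"
    price: Optional[float] = None  # quoted price, when one could be parsed


def decide(status: int, headers: Mapping[str, str], max_price: float) -> CrawlDecision:
    """Map an HTTP response to a crawl decision under a per-page budget."""
    if status == 200:
        return CrawlDecision("use")
    if status == 402:
        try:
            price = float(headers.get(PRICE_HEADER, ""))
        except ValueError:
            return CrawlDecision("skip")  # no usable price was quoted
        if price <= max_price:
            return CrawlDecision("pay", price)
        return CrawlDecision("skip", price)  # quoted price exceeds the budget
    return CrawlDecision("skip")  # any other status: do not reuse the content


print(decide(402, {"x-example-price": "0.005"}, max_price=0.01))
```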
Humpback WIP

LLM paywall and analytics layer for publishers that want to control or monetize AI bot access.

Data types
Web content
hback.xyz No evidence links yet Tollgate Also uses Technical blocking

Primary approach type

Technical blocking

Technical controls that deny, rate-limit, or otherwise constrain crawling, downloading, or automated collection unless a requester meets specific conditions.

Initiative Website Latest update Approach type

A simple anti-scraping tool intended to protect datasets from basic crawlers/scrapers.

Uses
Cloudflare AI Crawl Control
github.com/Responsible-Dataset-Sharing/easy-dataset-share Jan 08, 2026 easy-dataset-share paper published Technical blocking Pipeline: Collect

Hub feature that requires users to request access and share identity details before downloading a dataset.

huggingface.co/docs/hub/datasets-gated Sep 29, 2022 Gated datasets launch explained Technical blocking Also uses New infrastructure Pipeline: Collect -> Train -> Fine-tune

Primary approach type

New infrastructure

New registries, protocols, hosting patterns, or coordination layers that make governed data access, compliance, or contribution easier to operate.

Initiative Website Latest update Approach type

Working group standardizing cryptographic authentication for bots and AI agents on the web.

Data types
Web content
datatracker.ietf.org/wg/webbotauth/about Jan 17, 2026 Use cases draft published New infrastructure Pipeline: Collect -> Retrieve

Enterprise-grade APIs and structured dumps for Wikipedia and sister projects, designed for large-scale reuse in AI, search, and knowledge graphs.

Data types
Text, Structured data
Signals
Users 10+ announced partners
Data 920+ datasets / 300M+ unique project pages
enterprise.wikimedia.com Jan 14, 2026 New enterprise partners announced New infrastructure Pipeline: Retrieve -> Train

Proposal for a commons-based infrastructure for large-scale access to digitized European books with conditional commercial access.

Data types
Text
openfuture.eu/publication/outline-for-a-european-books-data-commons Nov 19, 2025 Outline paper published New infrastructure Pipeline: Collect -> Train
SyftBox Live

Open-source protocol for privacy-preserving AI and analytics across distributed datasets without centralizing the underlying data.

openmined.org/syftbox Nov 11, 2025 syft-flwr release demonstrates active federated learning workflows on SyftBox New infrastructure Pipeline: Train -> Retrieve
CommonsDB Live

Registry for public-domain and openly licensed works using verifiable rights declarations and content-derived identifiers.

Signals
Data 200,000+ declarations
commonsdb.org Oct 30, 2025 Explorer launched New infrastructure Pipeline: Collect -> Train -> Retrieve

Attribution-based control, an approach from OpenMined.

openmined.org/attribution-based-control Oct 05, 2025 OpenMined explains attribution-based control New infrastructure
FlexOlmo WIP

Distributed language-model training approach that lets data owners contribute experts without sharing raw data or giving up opt-out control.

allenai.org/blog/flexolmo Jul 08, 2025 Ai2 introduces FlexOlmo and invites organizations with sensitive data to participate New infrastructure Pipeline: Train

Participatory governance framework for communities to define conditions for data reuse, including AI training.

blog.thegovlab.org/reimagining-data-governance-for-ai-operationalizing-social-licensing-for-data-reuse May 12, 2025 Operationalization report released New infrastructure Pipeline: Collect -> Train -> Fine-tune -> Retrieve

Python package and API helpers for checking whether works are opted out before model training.

github.com/Spawning-Inc/datadiligence Oct 08, 2024 PyPI release 0.1.7 published New infrastructure Pipeline: Collect -> Train -> Fine-tune

Primary approach type

Certification

Third-party review, badges, or verification programs that signal whether a model, company, or dataset follows stated sourcing or licensing requirements.

Initiative Website Latest update Approach type

Certification program that certifies models as fairly trained.

Signals
Users 16+ announced certified entities
fairlytrained.org Jul 31, 2024 Individual model certification updated Certification