Key terms and concepts used throughout the data licensing landscape.

ai.txt
A proposed standard (similar to robots.txt) specifically for communicating AI-related permissions. Allows site owners to express preferences about AI training, retrieval, and generation uses of their content.
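ai.txt has no single settled syntax yet; the hypothetical file below, modeled on robots.txt and served from the site root, sketches the kind of directives such a file might carry (paths and rules are illustrative only):

    # Hypothetical ai.txt served from https://example.com/ai.txt
    User-Agent: *
    Disallow: /        # no AI use of site content by default
    Allow: /press/     # except material in the press section
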
The documented history of a piece of data: where it came from, how it was processed, and what permissions apply. Critical for compliance and attribution in AI systems.
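As a rough illustration, a provenance record could be modeled as in the Python sketch below; ProvenanceRecord and its fields are examples, not a standard schema:

    # Illustrative provenance record; field names are examples, not a standard schema.
    from dataclasses import dataclass, field

    @dataclass
    class ProvenanceRecord:
        source_url: str      # where the data came from
        collected_at: str    # when it was gathered (ISO 8601)
        license: str         # license or terms that apply
        processing_steps: list[str] = field(default_factory=list)  # how it was processed
        permitted_uses: list[str] = field(default_factory=list)    # e.g. "train", "retrieve"

    record = ProvenanceRecord(
        source_url="https://example.com/article",
        collected_at="2024-01-15T00:00:00Z",
        license="CC-BY-4.0",
        processing_steps=["html-to-text", "deduplication"],
        permitted_uses=["retrieve"],
    )
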
A certification program that verifies AI models were trained with proper consent and licensing. Provides third-party validation that a model meets ethical training criteria.
A proposed Internet Engineering Task Force standard for AI preference signaling. Aims to create a standardized, extensible format for expressing AI-related permissions across the internet.
An organization that manages rights on behalf of multiple creators, negotiating licenses and distributing payments. Examples include ASCAP for music; AI-focused licensing collectives are also emerging.
A consent model where explicit permission is required before content can be used. More protective than opt-out but harder to scale. Some licensing collectives operate on an opt-in basis.
A mechanism allowing content creators to indicate they do not want their work used for AI training. May be technical (robots.txt) or legal (GDPR rights). Effectiveness depends on crawler compliance.
A phase in the AI data lifecycle. Common stages include: Collect/Scrape (gathering data), Train (initial model training), Fine-tune (specialization), Retrieve (RAG lookups), and Generate (output creation).
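The stages listed above could be captured as a simple enumeration, for instance in Python (illustrative only, not a defined standard):

    # The lifecycle stages named above, as a simple enumeration (illustrative only).
    from enum import Enum

    class LifecycleStage(Enum):
        COLLECT = "collect"      # gathering or scraping data
        TRAIN = "train"          # initial model training
        FINE_TUNE = "fine-tune"  # specialization on narrower data
        RETRIEVE = "retrieve"    # RAG lookups at inference time
        GENERATE = "generate"    # output creation
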
A machine-readable hint that tells AI agents or web crawlers how content may or may not be used. Examples include robots.txt directives, ai.txt files, and HTTP headers. These rely on voluntary compliance by the crawler.
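For example, some publishers attach a "noai" directive to the X-Robots-Tag response header; this is a community convention rather than a formal standard, and honoring it is up to the crawler. The Python sketch below simply checks whether a URL advertises it:

    # Check a response for a "noai" directive in the X-Robots-Tag header.
    # "noai" is a community convention, not a formal standard.
    import urllib.request

    def signals_no_ai(url: str) -> bool:
        with urllib.request.urlopen(url) as resp:
            tag = resp.headers.get("X-Robots-Tag", "")
        return "noai" in tag.lower()
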
A technique where AI models retrieve relevant documents at inference time to augment their responses. Raises licensing questions distinct from training, since content is accessed dynamically at inference time rather than being incorporated into model weights.
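A minimal sketch of the retrieve-then-generate flow, assuming naive keyword-overlap retrieval; call_language_model is a stand-in so the example runs, not a real API:

    # Minimal retrieve-then-generate sketch. Retrieval here is naive keyword
    # overlap; call_language_model is a placeholder so the example runs.

    def call_language_model(prompt: str) -> str:
        return f"[model output for prompt]\n{prompt}"

    def retrieve(query: str, documents: list[str]) -> str:
        """Return the document sharing the most words with the query."""
        q_words = set(query.lower().split())
        return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

    def answer(query: str, documents: list[str]) -> str:
        context = retrieve(query, documents)   # content fetched at inference time
        prompt = f"Context: {context}\n\nQuestion: {query}"
        return call_language_model(prompt)
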
A text file placed at the root of a website that instructs web crawlers which pages they may or may not access. Originally designed for search engines, now being extended for AI training crawlers with user-agent specific rules.
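For example, a robots.txt that blocks two documented AI training crawlers (GPTBot and CCBot) while leaving the site open to other agents; compliance remains voluntary:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: *
    Allow: /
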
A machine-readable licensing format that allows content creators to specify terms for AI use of their work. Designed to be easily parsed by automated systems for compliance checking.
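No specific format is named here, so the JSON below is purely hypothetical; it only illustrates the kind of structure and fields such a format might expose:

    {
      "content": "https://example.com/articles/",
      "ai_training": "conditional",
      "conditions": ["attribution", "payment"],
      "license_url": "https://example.com/ai-license",
      "contact": "licensing@example.com"
    }
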
The automated extraction of content from websites. AI companies scrape the web to collect training data. May violate terms of service or copyright depending on jurisdiction and use.
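A minimal sketch of a scraper that consults robots.txt before fetching, using Python's standard library; ExampleAIBot is a made-up user agent, and passing the robots.txt check does not by itself settle copyright or terms-of-service questions:

    # Scraper that checks robots.txt before fetching, using the standard library.
    import urllib.robotparser
    import urllib.request
    from urllib.parse import urljoin, urlparse

    def polite_fetch(url: str, user_agent: str = "ExampleAIBot") -> str | None:
        parts = urlparse(url)
        robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
        rp = urllib.robotparser.RobotFileParser(robots_url)
        rp.read()
        if not rp.can_fetch(user_agent, url):
            return None  # the site disallows this path for our user agent
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")
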
The initiative can be used today. Evidence of adoption varies, from "new" (limited evidence) to "strong evidence" (broad, documented adoption).
Work in Progress. The initiative is under active development and details may change. Not yet ready for production use.
Text and Data Mining Reservation Protocol. A W3C community standard for expressing rights reservations about text and data mining, including AI training. Uses HTTP headers or HTML meta tags.
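In the HTML form, the reservation is expressed with meta tags roughly like the ones below (the policy URL is illustrative); the same tdm-reservation and tdm-policy values can also be sent as HTTP response headers:

    <meta name="tdm-reservation" content="1">
    <meta name="tdm-policy" content="https://example.com/tdm-policy.json">
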
Measures that detect, limit, or block unwanted automated access to content. Includes rate limiting, CAPTCHAs, bot detection, and fingerprinting. Unlike preference signals, these are enforced rather than advisory.
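As one illustration, a token-bucket rate limiter caps how many requests each client may make per second; the thresholds and client identifiers in the Python sketch below are arbitrary examples:

    # Illustrative token-bucket rate limiter; thresholds are arbitrary examples.
    import time
    from collections import defaultdict

    class RateLimiter:
        def __init__(self, rate: float = 1.0, burst: int = 5):
            self.rate = rate    # tokens added per second
            self.burst = burst  # maximum bucket size
            self.tokens = defaultdict(lambda: float(burst))
            self.last_seen = defaultdict(time.monotonic)

        def allow(self, client_id: str) -> bool:
            now = time.monotonic()
            elapsed = now - self.last_seen[client_id]
            self.last_seen[client_id] = now
            self.tokens[client_id] = min(self.burst, self.tokens[client_id] + elapsed * self.rate)
            if self.tokens[client_id] >= 1.0:
                self.tokens[client_id] -= 1.0
                return True   # request allowed
            return False      # over the limit: block or throttle
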
A technical mechanism that requires payment or authentication, or applies rate limits, before granting automated access to content. Allows monetization of bot traffic while maintaining control over access.
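A minimal sketch of the idea, assuming a hypothetical token check and an example list of crawler user agents; listed AI crawlers receive HTTP 402 (Payment Required) unless they present a valid token:

    # Sketch of an access gate: listed AI crawlers get HTTP 402 (Payment Required)
    # unless they present a valid token. Bot list and token check are placeholders.
    PAID_CRAWLERS = {"GPTBot", "CCBot"}   # example user-agent tokens

    def gate(user_agent: str, access_token: str | None) -> int:
        """Return the HTTP status code the server should respond with."""
        if any(bot.lower() in user_agent.lower() for bot in PAID_CRAWLERS):
            if access_token == "example-paid-token":   # placeholder verification
                return 200   # paid or authenticated bot traffic is allowed
            return 402       # Payment Required
        return 200           # ordinary visitors pass through
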
The collection of text, images, code, or other content used to train machine learning models. The provenance, licensing, and consent around training data are central issues in AI governance.