Key terms and concepts used throughout the data licensing landscape.

ai.txt
A proposed standard (similar to robots.txt) specifically for communicating AI-related permissions. Allows site owners to express preferences about AI training, retrieval, and generation uses of their content.
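Proposals generally reuse robots.txt conventions; the sketch below is purely illustrative, since no single ai.txt syntax has been finalized:

```
# Illustrative only: ai.txt is a proposal and its syntax is not finalized.
User-Agent: *
Disallow: /           # no AI use of site content by default
Allow: /press/        # except the press section
```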
A programmatic interface for requesting data or services. In this catalog, APIs can become tollgates, licensed delivery channels, or enforcement points for AI-relevant data access.
A preference or rights signal attached to an individual file or work rather than to an entire website or domain. These signals are useful when content moves between platforms or needs permissions that travel with the asset itself.
Credit connecting content back to its source, author, or steward. It can involve citations, links, provenance metadata, or contractual reporting requirements.
Methods for verifying the identity of an automated client and conveying who operates it. In AI access control, stronger bot identity can make rate limits, differentiated permissions, and enforcement more reliable.
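One simple sketch of the idea: the bot operator signs each request with a secret shared with the site, and the site verifies the signature before applying bot-specific permissions. This is an assumption for illustration, not any deployed scheme; emerging proposals favor public-key HTTP message signatures over shared secrets.

```python
# Illustrative shared-secret signing; real bot-auth proposals differ.
import hashlib
import hmac

SECRET = b"illustrative-shared-secret"  # known to both bot operator and site

def sign(method: str, path: str) -> str:
    """Bot side: sign the request line with the shared secret."""
    return hmac.new(SECRET, f"{method} {path}".encode(), hashlib.sha256).hexdigest()

def verify(method: str, path: str, signature: str) -> bool:
    """Site side: recompute and compare in constant time."""
    expected = sign(method, path)
    return hmac.compare_digest(expected, signature)

sig = sign("GET", "/articles/42")
print(verify("GET", "/articles/42", sig))   # signature matches this request
print(verify("GET", "/private/data", sig))  # signature does not transfer
```

With a verified identity in hand, the site can apply the differentiated rate limits or permissions the definition describes.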
A third-party review or badge signaling that a model, dataset, or vendor meets a stated standard, such as consented training-data sourcing.
A platform that helps rights holders or data owners offer licensed access to content or datasets for AI use. These systems often handle discovery, permissions, pricing, delivery, and reporting for training or retrieval use cases.
The documented history of a piece of data: where it came from, how it was processed, and what permissions apply. Critical for compliance and attribution in AI systems.
A certification program that verifies AI models were trained with proper consent and licensing. Provides third-party validation that a model meets ethical training criteria.
A proposed Internet Engineering Task Force standard for AI preference signaling. Aims to create a standardized, extensible format for expressing AI-related permissions across the internet.
An organization that manages rights on behalf of multiple creators, negotiating licenses and distributing payments. Examples include ASCAP for music and newer publisher or creator collectives negotiating AI-related uses.
A consent model where explicit permission is required before content can be used. More protective than opt-out but harder to scale. Some licensing collectives operate on an opt-in basis.
A mechanism allowing content creators to indicate they do not want their work used for AI training. May be technical (robots.txt) or legal (GDPR rights). Effectiveness depends on crawler compliance.
A phase in the AI data lifecycle. Common stages include collection or scraping, training, fine-tuning, retrieval, and user-facing output generation.
A machine-readable hint that tells AI agents or web crawlers how content may or may not be used. Examples include robots.txt directives, ai.txt files, and HTTP headers. These rely on voluntary compliance by the crawler.
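Because these signals are advisory, it is the crawler that must read and honor them. A minimal sketch of a compliant check using Python's standard-library robots.txt parser (the `ExampleAIBot` user agent is hypothetical):

```python
# Checking a preference signal before crawling; compliance is voluntary,
# so this check only matters if the crawler chooses to run it.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: ExampleAIBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("ExampleAIBot", "https://example.com/private/data.html"))
print(parser.can_fetch("ExampleAIBot", "https://example.com/public/page.html"))
```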
Ways of training, evaluating, or collaborating on AI systems without freely exposing the underlying raw data. In this catalog, it points to approaches where data holders keep more custody and control while still enabling model development.
A technique where AI models retrieve relevant documents at inference time to augment their responses. Raises different licensing questions than training since content is accessed dynamically.
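A toy illustration of the mechanism, using an invented two-document corpus and naive keyword-overlap ranking (real systems use vector embeddings): the licensing point is that content is fetched and reused at inference time rather than baked in during training.

```python
# Toy RAG sketch: rank documents by keyword overlap, splice the top
# match into the prompt. Corpus and query are invented for illustration.

def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Rank documents by shared query words; return the top-k texts."""
    q = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [text for _, text in scored[:k]]

corpus = {
    "doc1": "robots.txt tells crawlers which pages they may access",
    "doc2": "tollgates charge for automated access to content",
}

query = "which pages may crawlers access"
context = retrieve(query, corpus)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```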
A shared list or lookup service that records rights, preferences, or status information about works. In this space, registries can help others discover opt-outs, verify rights claims, or check reuse conditions across many assets.
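A toy lookup sketch, assuming works are keyed by a content hash so that a status check can travel with the asset itself; the entries and statuses here are invented:

```python
# Toy rights registry: content hash -> recorded preference.
import hashlib

REGISTRY: dict[str, str] = {}

def register(content: bytes, status: str) -> str:
    """Record a status for a work, keyed by its SHA-256 hash."""
    key = hashlib.sha256(content).hexdigest()
    REGISTRY[key] = status
    return key

def lookup(content: bytes) -> str:
    """Hash the asset in hand and ask the registry for its status."""
    return REGISTRY.get(hashlib.sha256(content).hexdigest(), "unknown")

register(b"example artwork bytes", "opted-out")
print(lookup(b"example artwork bytes"))  # recorded status
print(lookup(b"some other asset"))       # not in the registry
```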
The person or organization that controls the relevant rights in a work or dataset, and can authorize, refuse, or condition reuse. Rights holders may be creators, publishers, labels, archives, or other stewards depending on the context.
A text file placed at the root of a website that instructs web crawlers which pages they may or may not access. Originally designed for search engines, now being extended for AI training crawlers with user-agent specific rules.
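For example, a site might block known AI training crawlers while leaving other access open (GPTBot, Google-Extended, and CCBot are real AI-related user agents; the rules themselves are illustrative):

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```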
A machine-readable licensing format that allows content creators to specify terms for AI use of their work. Designed to be easily parsed by automated systems for compliance checking.
The automated extraction of content from websites. AI companies scrape the web to collect training data. May violate terms of service or copyright depending on jurisdiction and use.
The initiative is publicly live or deployable today, even if adoption is still limited or emerging.
Text and Data Mining Reservation Protocol. A W3C Community Group effort for expressing TDM policies, typically via a machine-readable `/.well-known/tdmrep.json` file designed to support EU DSM Directive opt-outs.
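The file maps URL patterns to reservation flags, where `"tdm-reservation": 1` reserves TDM rights; a sketch based on the published format (paths and the policy URL are illustrative):

```
[
  {
    "location": "/articles/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  },
  {
    "location": "/press-releases/*",
    "tdm-reservation": 0
  }
]
```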
Measures that detect, limit, or block unwanted automated access to content. Includes rate limiting, CAPTCHAs, bot detection, and fingerprinting. Unlike preference signals, these are enforced rather than advisory.
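Rate limiting is the simplest of these to sketch. Below is a minimal token-bucket limiter, a common building block: each client gets a refilling budget of requests, and traffic beyond that budget is rejected. The capacity and refill numbers are illustrative.

```python
# Minimal token-bucket rate limiter; an enforced control, not an
# advisory signal. Parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget: block or challenge this client

bucket = TokenBucket(capacity=2, refill_per_sec=0.1)
print([bucket.allow() for _ in range(3)])  # third call exceeds the burst budget
```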
A technical mechanism that requires payment, authentication, or rate limiting for automated access to content. Allows monetization of bot traffic while maintaining control over access.
The collection of text, images, code, or other content used to train machine learning models. The provenance, licensing, and consent around training data is a central issue in AI governance.
The World Wide Web Consortium, a standards body that incubates web specifications and community group proposals relevant to machine-readable signals and protocols.
An HTTP response header that lets a server send crawler instructions along with a page or file response. It is useful when a site needs per-URL or per-file rules instead of relying only on a site-wide `robots.txt` file.
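For example, a response serving a PDF could carry crawler rules that a robots meta tag cannot (the user-agent-scoped form and `noindex`/`noarchive` are established directives; whether any given AI crawler honors a rule scoped to its name depends on that crawler):

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive
X-Robots-Tag: GPTBot: noindex
```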