The New York Times has moved against AI companies, updating its terms of service on August 3 to prohibit the scraping of its content for training AI or machine learning models. The prohibition covers a broad spectrum of content, including text, images, audio and video clips, designs, and metadata.
The revised terms also bar website crawlers, the tools that index web pages for search results, from using that content to train large language models (LLMs) or other AI tools. Violations could draw penalties, though the terms don't spell out what those would be. The New York Times declined to comment beyond its terms of service.
Katie Gardner, a partner at Gunderson Dettmer, observed, "While restrictions on data scraping are commonplace in terms of service, it's rare to see a direct mention of AI training."
AI models such as the one behind ChatGPT depend on vast amounts of content, including journalistic articles, to produce their output. That worries publishers with subscription models: an AI system can reproduce and redistribute their reporting without attribution, undercutting both revenue and reader trust.
It's hard for publishers to tell whether a given crawler is indexing their pages to improve search visibility or harvesting them to train AI. As Digiday has reported, some are exploring ways to block these crawlers outright. Meanwhile, data gathered by crawlers like Common Crawl has been used by giants like OpenAI, Meta, and Google to train AI models.
OpenAI introduced GPTBot this week, a web crawler that collects data to improve its AI models and that publishers can manage, blocking or limiting its access through their sites' robots.txt files, as in the sketch below. Major players like Bing and Google haven't introduced comparable controls yet.
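In practice, that control is a few lines of robots.txt served from the site root. Here is a minimal sketch, assuming a publisher wants to shut out AI crawlers while leaving ordinary search indexing untouched (GPTBot is OpenAI's documented user-agent token; CCBot is the token Common Crawl's crawler identifies itself with):

    # robots.txt, served at the site root (e.g. https://example.com/robots.txt)

    # Shut OpenAI's GPTBot out of the entire site
    User-agent: GPTBot
    Disallow: /

    # Shut out Common Crawl's CCBot, whose archives feed many training datasets
    User-agent: CCBot
    Disallow: /

    # Every other crawler, including search engines, remains unrestricted
    User-agent: *
    Disallow:

Compliance is voluntary, of course: robots.txt is a convention that well-behaved crawlers honor, not an enforcement mechanism, which is part of why the Times is leaning on its terms of service as well.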
An analysis of Google's C4 dataset by The Washington Post found that content from prominent sites, including The New York Times, had been used to train LLMs.
Chris Pedigo from Digital Content Next notes that other publishers are now revisiting their terms of service in light of these developments.
Towards Licensing Agreements
How AI companies will respond to these updated terms remains to be seen. Facing potential legal exposure, though, AI firms are in talks with top-tier publishers over licensing agreements, along the lines of the deal between OpenAI and The Associated Press.
Monetary compensation isn't the only aim. Publishers are also pushing for attribution of their content and for processes within AI firms that keep that content accurate.
Pedigo emphasizes the importance of quality, stating, "For any licensing agreements, publishers want their information to maintain a certain brand standard."