Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
Robots.txt for the win.
"Major tech companies already have all of the data," she said. "Changing the license on the data doesn't retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers."
"...that permission."
AI and tech companies in general have been gaslighting everyone for years now, skipping right past the question of whether using publicly available information for training is copyright infringement. This is not a settled question, legally, and their continued efforts to portray it as such are almost certainly intentional and orchestrated.
Mr. Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data. Some sites might object to A.I. giants using their data to train chatbots for a profit, but might be willing to let a nonprofit or educational institution use the same data, he said. Right now, there's no good way for them to distinguish between those uses, or block one while allowing the other.
Yes, yes, and yes. Let's add more granular control to the Robots Exclusion Protocol, somewhere between blocking specific bots (which already exists) and blocking specific content (which also exists). Something like the ability to exclude bots crawling for a certain purpose (training an AI model vs. updating a search index), or bots owned or operated by a certain type of entity (commercial vs. non-profit, or even big tech vs. small shop). Implementing any of these on a technical level would require bot operators to accurately disclose information about their bot, its purpose, and the entity behind it. That seems like the province of Congress and a bit of a mountain to climb. But figuring all of this out would certainly empower content creators.
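For anyone curious what the two existing levers look like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The bot names (`ExampleTrainingBot`, `ExampleSearchBot`), the paths, and the `example.com` domain are all made up for illustration; real crawlers publish their own user-agent strings, and nothing here forces a bot to actually obey the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. Bot names and paths are illustrative assumptions,
# not real crawler identifiers.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot: str, path: str) -> bool:
    """Return True if the rules above permit `bot` to fetch `path`."""
    return parser.can_fetch(bot, "https://example.com" + path)


# Lever 1: block a specific bot from everything.
print(allowed("ExampleTrainingBot", "/posts/1"))

# Lever 2: block specific content for every other bot.
print(allowed("ExampleSearchBot", "/posts/1"))
print(allowed("ExampleSearchBot", "/drafts/2"))
```

Notice what is missing: the rules key off a self-reported bot name and a URL path, and that's it. There is no field for "why are you crawling" or "who do you work for," which is exactly the gap described above.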
posted by matt in Saturday, July 20, 2024