Social media company Reddit (RDDT.N) said on Tuesday it will update a web standard the site uses to block automated data scraping, following recent reports that some AI startups have been bypassing the rule to collect content for their systems.
The move comes amid growing pressure from publishers, who accuse AI companies of using their content to produce AI-generated summaries without credit or permission.
Reddit will update the Robots Exclusion Protocol, commonly known as “robots.txt,” a widely used standard that tells automated crawlers which parts of a website they are allowed to access. It will also continue rate-limiting to cap the number of requests coming from any single entity, and it will block unknown bots and crawlers from scraping (collecting and storing raw data) on its platform.
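For illustration, a minimal robots.txt sketch shows the kind of rules the standard supports; the user-agent names and paths below are hypothetical examples, not Reddit’s actual configuration:

```
# Hypothetical robots.txt sketch (example names, not Reddit's real rules)

User-agent: ExampleAIBot   # a specific crawler, identified by its user-agent string
Disallow: /                # this bot may not crawl any part of the site

User-agent: *              # all other crawlers
Disallow: /private/        # keep them out of this path
Allow: /public/            # but allow this one
Crawl-delay: 10            # minimum seconds between requests; honored by some crawlers,
                           # not part of the core standard
```

Compliance with robots.txt is voluntary on the crawler’s side, which is why Reddit pairs it with rate-limiting and outright blocking of unknown bots.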
The robots.txt file has only recently become a prominent tool for publishers seeking to stop tech companies from scraping their content for free to train AI models and generate summaries in response to search queries. Last week, content licensing startup TollBit told publishers that several AI firms were circumventing the standard to scrape their sites. That came a day after a Wired investigation found that AI search startup Perplexity had likely bypassed efforts to block its web crawler via robots.txt.
Last week, business media publisher Forbes accused Perplexity of plagiarizing its investigative stories for use in generative AI systems without giving proper credit. Reddit said researchers and organizations such as the Internet Archive will continue to have access to its content for non-commercial purposes.