WIRED's Observations of Perplexity: Web Crawling, the Common Crawl Controversy, and the Robots Exclusion Protocol
A WIRED analysis and one carried out by developer Robb Knight suggest that Perplexity works partly by apparently ignoring a widely accepted web standard, the Robots Exclusion Protocol, to surreptitiously scrape areas of websites that operators do not want accessed by bots, despite claiming that it won’t. WIRED observed a machine tied to Perplexity (more specifically, one on an Amazon server and almost certainly operated by Perplexity) doing this on wired.com and across other Condé Nast publications.
Kate Knibbs, senior writer at WIRED, joins the show to talk about web crawling and the controversy over Common Crawl. We also talk with Randall Lane of Forbes about how Perplexity AI repurposed a Forbes article and presented it as its own story, without first asking permission or properly citing the source.
The Perplexity chatbot itself is more specific. Prompted to describe what Perplexity is, it provides text that says, “Perplexity Artificial Intelligence is a search engine with features similar to traditional search engines.” By its own description, the service answers user queries in a matter of seconds, drawing on information from recent articles and the web each day.
Web crawling has been a way to find information on the internet for decades. It has primarily been used by search engines like Google and nonprofits like the Internet Archive and Common Crawl to catalog the contents of the open internet and make it searchable. Until recently, the practice was rarely seen as controversial, because websites depended on it as a way for people to find their content. But crawling tech is now also being used to feed AI systems that can absorb whole articles.
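For readers unfamiliar with how the Robots Exclusion Protocol works in practice, here is a minimal Python sketch of the check a well-behaved crawler is expected to run before fetching a page. It uses the standard library's `urllib.robotparser`. The bot name `ExampleBot`, the sample rules, and the example.com URLs are all hypothetical and chosen for illustration. Nothing here describes how Perplexity, WIRED, or any real site is actually configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and keeps
# every other bot out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for agent in ("ExampleBot", "OtherBot"):
        for url in ("https://example.com/story", "https://example.com/private/page"):
            print(agent, url, is_allowed(rules, agent, url))
```

The protocol is voluntary: the file only states the operator's wishes, and nothing in it technically stops a crawler that skips this check. That is why the behavior described in this episode is controversial rather than impossible.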
Recommendations: Hacks on Max, Randall Lane, Michael Calore, and Gadget Lab
Randall recommends his new horse racing league, the National Thoroughbred League. Kate recommends a book by Andrew Boryga. Lauren recommends the show Hacks on Max.
Randall Lane can be found on social media, as can Kate Knibbs. Lauren Goode is @LaurenGoode. Michael Calore is @snackfight. Bling the main hotline at @GadgetLab. The show is produced by Boone Ashworth. Our theme music is by Solar Keys.
You can always listen to this week’s podcast through the audio player on this page, but if you want to subscribe for free to get every episode, here’s how:
If you’re on an iPhone or iPad, open the app called Podcasts, or just tap this link. You can also download an app like Overcast or Pocket Casts, and search for Gadget Lab. If you use Android, you can find us in the Google Podcasts app just by tapping here. We’re on Spotify too. And in case you really need it, here’s the RSS feed.