Learning to Crawl
We had internet access at my house before there was ever a World Wide Web. Back then, the internet was like a bunch of neighborhoods with no roads between them, each server its own social world with unique norms and shibboleths.
When the World Wide Web launched, a new piece of software called a “browser” made it possible to visit any of the thousands—soon tens of thousands—of new websites. There was just one problem: you had to know where to find them.
Within a few years of the Web’s launch, half a dozen teams had converged on a similar solution: the web crawler. The crawler exploited the Web’s defining feature, the hyperlink pointing to another URL. Starting with a group of known websites, the crawler automatically followed all those hyperlinks to new web pages and websites—then followed any hyperlinks there to still more pages, and so on.
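That loop (fetch a page, harvest its links, queue anything you haven’t seen before) is simple enough to sketch in a few lines. Here is a minimal, illustrative crawler in Python; the libraries, the seed list, and the shortcuts are my own choices, not anything the 1990s crawlers actually used.

```python
# A minimal breadth-first web crawler: start from seed URLs, follow links outward.
# Illustrative sketch only -- real crawlers add politeness (robots.txt, rate limits),
# deduplication, and persistent storage.
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    queue = deque(seeds)          # frontier of URLs waiting to be fetched
    seen = set(seeds)             # every URL we've already queued
    index = {}                    # url -> page text, a toy "index"

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                       # dead link; skip it

        soup = BeautifulSoup(response.text, "html.parser")
        index[url] = soup.get_text(" ", strip=True)

        # Follow every hyperlink to pages we haven't seen yet.
        for anchor in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, anchor["href"]))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

    return index
```

Everything the real crawlers added beyond this was engineering around scale: politeness rules, parallel fetching, and storage for an ever-growing index.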
Crawlers made search engines possible. Soon, everyone had a favorite: Lycos, Yahoo!, AltaVista.
In the earliest days of the Web, what made a search engine great was the size of its index: the single biggest contributor to how satisfied searchers were with the results.
This was an almost unfathomably sparse web by the standards of 2024, but it looked vast to the dial-up users of the mid-1990s: a month after its 1994 launch, Lycos had indexed 390,000 web pages—enough to look at a new page every minute for nine months (and roughly one web page for every 14,000 people on the planet).
Tipping the Scales
In order to understand what happened next, it’s important to think for just a moment about the scale of information humans had previously had access to.
Estimates vary, but scholars believe the Great Library of Alexandria held somewhere between 100,000 and 500,000 scrolls in the third century BC.
The Library of Congress, the largest library in the modern world, contains about 31 million books (and another 61 million manuscripts).
In other words, when search engines started, they were working with quantities of information comparable to what humans had previously managed to index and catalogue by hand.
There was a good reason for this: putting your content on the web in 1996 required significant human effort.
You—or someone you knew, at any rate—needed to pay $100 to Network Solutions to register a domain name for your website. You had to find a service to host your site. And of course, you had to hand-code it in HTML.
This was not easy to do, and most websites looked terrible, but they tended to contain content that was at least somewhat valuable to someone—after all, if it wasn’t useful, why would someone have spent time making it available on the Web?
In this way, website publishing was similar to book publishing: even the most inept examples of the form had required significant effort to produce.
By the year 2000, the scales had tipped: blogging platforms and simplified web page hosting meant that new web pages could be generated with minimal effort.
The web exploded. By the end of 2000, there were close to 10 million websites—some with many thousands or tens of thousands of individual pages. Hundreds of millions of people could access the Web, and all of them seemed to have something to say.
Dutifully—and, unbeknownst to them, already doomed—Lycos and AltaVista and Yahoo! continued to crawl the chaos.
Let Me Google That For You
As the new century dawned, a second-generation competitor, Google, outcompeted everything that had come before (so thoroughly that its name became the generic verb for conducting a web search).
By Q4 2002, Google was the plurality leader in web search. People switched from Yahoo! and AltaVista and Lycos and Ask Jeeves for one big reason: Google worked brilliantly. As an early adopter of Google search, I can vouch that when Google started, its results were miles better than its competitors’.
Google changed the game by assessing the quality of each of its indexed pages in a way no one else did: by looking at how many other pages linked back to them, and at how authoritative those linking pages were in turn. A website with “high authority” was one a lot of people linked to. The project began life as “BackRub”; the algorithm became known as PageRank.
This simple insight generated uniformly good results at first.
But BackRubs quickly became a matter of quid pro quo, and now it’s time to invoke Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.
Until Google entered the fray, site authority—the number of people linking back (via a “backlink”) to a particular website—was generated organically, by people who simply thought that URL held interesting, informative, or funny content. You’d typically only link to something of value, so the more people had linked to something, the more value it presumably had.
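A toy version of the idea fits in a few lines. What follows is the textbook power-iteration form of PageRank, not Google’s production system; the little link graph and the damping factor are invented for illustration.

```python
# Toy PageRank by power iteration: a page's authority is the sum of the
# authority of the pages linking to it, divided among their outgoing links.
# Illustrative only; the graph and damping factor are made up.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:               # dangling page: share its rank with everyone
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

web = {
    "home.example": ["blog.example", "docs.example"],
    "blog.example": ["home.example"],
    "docs.example": ["home.example", "blog.example"],
    "spam.example": ["home.example"],      # links out, but nobody links to it
}
print(pagerank(web))   # spam.example ends up with the lowest score
```

The crucial property is that authority flows from links other people chose to make, which is exactly what made links worth buying once real money was on the table.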
But this was now the 21st century. The internet wasn’t just for nerdy hobbyists. It was, increasingly, a place to conduct commerce and advertise wares. As soon as anyone thought to study how people actually used search engines, they quickly discovered that most people wouldn’t look beyond the first few results.
Suddenly, the difference between appearing as the first result for a search and the tenth could be millions of dollars. A new, nefarious science emerged: search engine optimization, or SEO.
So Google wanted backlinks? No problem. Enter link farms and link-spam comments. The false mustache of paid and spammy backlinks disguised garbage websites and unknown newcomers, rendering them indistinguishable from trusted sources of authority in Google’s algorithm and allowing them into the high society of front-page results.
Of course, this made the internet much worse than it had been.
Search results filled up with low-quality spam content. Comment sections for blogs and forums, even those with low traffic, became unreadable without combination hall monitors/janitors (so-called “moderators”) to enforce the rules and take out the trash.
Although Google’s utility declined, no one else had comparable search results. That gave Google time to come up with a new approach.
The approach they chose would put them into the top five companies by market cap in the world—and destroy anything that resembled the original internet.
The Forever Click War
For approximately 15 years (2003 to 2018, give or take), Google fought back a tidal wave of garbage with moderate success. Updates with a range of quirky and cute names—Boston, Cassandra, Dominic, Esmeralda, Fritz, Florida, Austin, Brandy, Allegra, Bourbon, Gilligan, Jagger, Panda, Penguin—used tailor-made rules to filter out spammy websites.
Lower the rankings of URLs whose keyword density is suspiciously high, and you get rid of a raft of low-effort garbage sites written just to farm clicks. Compile lists of known bad backlinkers, and punish the sites that use them.
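The precise signals and thresholds in each update were never published, but the flavor of rules like these is easy to imagine. The sketch below is a hypothetical filter of my own invention, not a reconstruction of any actual Google update; the threshold and the blocklist are made up.

```python
# Hypothetical spam filter in the spirit of the early updates: demote pages
# whose keyword density is suspiciously high or whose backlinks come from
# known link farms. Every number and domain here is invented for illustration.
KNOWN_LINK_FARMS = {"cheap-links.example", "rank-booster.example"}
MAX_KEYWORD_DENSITY = 0.08   # flag pages where one keyword is >8% of all words

def spam_penalty(page_words, target_keyword, backlink_domains):
    penalty = 0.0

    # Keyword stuffing: the same phrase repeated far more often than natural prose allows.
    density = page_words.count(target_keyword) / max(len(page_words), 1)
    if density > MAX_KEYWORD_DENSITY:
        penalty += 0.5

    # Paid/spam backlinks: authority bought from known link farms.
    farm_links = sum(1 for d in backlink_domains if d in KNOWN_LINK_FARMS)
    if farm_links:
        penalty += 0.2 * farm_links

    return penalty   # subtracted from the page's ranking score
```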
Sure, some legitimate websites were punished with poor rankings and some of those companies closed. All wars have collateral damage.
The bigger problem was that with ever-growing amounts of cash on the line, the spammers and vendors weren’t going to give up on their dreams of being the #1 search result. Google would put out a new update, and SEO “gurus” would figure out the next metric that might allow you to climb the rankings regardless of quality.
In a never-ending arms race to keep up with “the algorithm,” companies needed new weapons with which to fight the increasingly expensive click war.
A Battlefield Owned By an Arms Dealer
Fortunately for site owners, there was—even before Google’s very first named algorithm update—a way to bypass all those search ranking problems.
You could just pay Google to put you in front.
In late 2000, Google launched the easiest, fastest way to advertise on the internet: Google AdWords. In 2002, they added a key new feature: auctions, in which coveted top-of-screen real estate went to the highest bidders.
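The auction format AdWords became known for is a generalized second-price auction: you win a slot by outbidding the advertiser below you, but you pay roughly that advertiser’s bid rather than your own. Below is a stripped-down sketch of that mechanism; it ignores the quality scores and per-click accounting of the real system, and the bidders and bids are invented.

```python
# A stripped-down generalized second-price auction for ad slots.
# Real ad auctions also weight bids by quality/click-through estimates;
# the advertisers and bids here are invented for illustration.
def allocate_slots(bids, num_slots):
    """bids: {advertiser: bid}. Returns [(advertiser, price_paid_per_click)]."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for i, (advertiser, _bid) in enumerate(ranked[:num_slots]):
        # Each winner pays (just above) the next-highest bid, not their own.
        next_bid = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        results.append((advertiser, round(next_bid + 0.01, 2)))
    return results

print(allocate_slots({"acme": 2.50, "globex": 1.75, "initech": 0.90}, num_slots=2))
# [('acme', 1.76), ('globex', 0.91)]
```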
It was hard to keep a website on top organically. What worked well in one update would destroy your search rankings in the next. It was no longer enough to simply write web pages people would want to read about topics they cared about. You needed professionals to keep you in the rankings—consultancies or full-time employees dedicated to keeping up with the newest wrinkle in the search ranking algorithm.
Google’s much-vaunted “don’t be evil” motto masked an obviously evil strategy. It had played both sides from the beginning, driving up the prices of weapons by constantly changing the terrain of the battlefield to render the previous arms obsolete.
Ads started out clearly marked, and grew less so.
Google grew into a behemoth that no longer cared about any previous commitments to an open web, or freedom of information.
After establishing sufficient market dominance to no longer need brilliant PR, they shuttered projects ruthlessly in the pursuit of profit—no surprise for a corporation, but Google had once built its reputation on being explicitly more moral than its competitors.
Getting Worse, But Not For the Reason You Think
Around 2019, discussion around Google search started to change.
“Are Google results getting worse?”
The question had been asked before, of course. Typically there was a bit of a low period for search result quality just after an update, while Google’s search engine fixers fine-tuned the algorithm in response to various outcries.
In the past, these callouts had been short-lived.
This time, things were different.
Now, it seemed like search results were getting notably worse over the medium and long term, and even when dedicated people made concerted efforts to improve the search ranking of a high-quality site, it would be drowned out by low-quality garbage.
The problem predates content generated by LLMs, but “AI” didn’t make things any easier. By 2024, researchers were publishing actual studies of Google’s search quality decline, and court filings in 2023—quoted in this Wired article, now available only in archived form after Wired removed it under immense pressure from Google—show that the frustrating, byzantine nature of modern search is (at least to some degree) intentional.
Users have become increasingly frustrated with searches, but Google is now the only real game in town. The only meaningful competition, Bing (whose index also powers DuckDuckGo and several other “independent” search engines), is an also-ran with mediocre results, hampered by the fact that everyone is constantly changing their sites to chase the next Google update.
But … is that the real story?
Or did something much more structural happen—something that makes it actually impossible for Google to perform the job it once did?
The Labor Theory of Search Value
Remember that back in the Wild West Web, it took quite a lot of human effort to put new content online. Today, creating vast amounts of new garbage content has never been cheaper or easier, with only the most trivial effort required.
This dramatic change (probably a reduction in effort of 1000:1 or more!) is not a fact we think very much about. It’s the water we’re swimming in.
But what if that simple, obvious scaling problem holds the key to all of it—why the internet is getting worse, why search doesn’t seem to work, why no competitors seem to be able to get a toehold even as Google blows through decades of accumulated trust and goodwill?
Today, Google’s index contains over 400 billion documents.
Uncountable billions are auto-generated garbage. Billions of others contain real writing, imagery, audio, or video that some human being (or many human beings) labored on.
Some fraction of these results are useful. It’s increasingly obvious that a larger and larger percentage of them are not. LLMs have thrown the problem into sharper relief than ever—especially because, despite products claiming to offer “AI detection,” it is mathematically impossible to reliably detect LLM writing algorithmically.
This is no longer a problem on the scale of the Library of Alexandria (like Lycos faced). It’s not a problem on the scale of the Library of Congress (like Google faced in its earliest years). It is a problem on a scale humans have never contended with.
If the Internet were a gold mine, with the gold representing real, valuable results for users, it has spent years filling up with fool’s gold—results that look like the real thing at a surface level but are actually worthless on closer inspection. In some ways, the fool’s gold is worse than filling the mine with dirt: at least dirt can be sifted out easily, without requiring close inspection.
The Decline and Fall of the Google Empire
A gold mine may not be the best metaphor for what’s actually happening to Google.
A better metaphor is metastatic cancer.
Cancer starts when a cell, led astray by a copying error during division, begins to multiply out of control. When the cells of the first tumor make their way into other parts of the body, they produce tumors wherever they land. Without intervention, the body succumbs to the growths.
Low-quality, low-effort spam content is, by definition, easier to reproduce in vast quantities than high-quality, high-effort content.
When Google identified sources of rapid, unchecked growth of low-quality content, it called for a course of chemotherapy in the form of algorithm updates.
As time passed, the only tumors to survive and continue to grow were those that were resistant to the previous courses of chemotherapy—the most aggressive, the best at mimicking ordinary cells.
Whatever works, spreads.
And so the World Wide Web we see today is cancerous, mutated, barely recognizable to the people who were there from the start.
Google is a Stage IV cancer patient, laughing because he knows he’s the majority stockholder in the company that makes his chemo drugs. The cancer is definitely going to kill him eventually—but in the meantime, he’ll treat it like a chronic disease and rake in the money while he can.
The End of the “Search Everything” Era
There is a way to cure the cancer. I do mean actually cure it, rather than just treating a few tumors for a few more years while the symptoms gradually get worse.
Since the first web crawlers were invented, every search engine has aspired to catalogue and index the entire World Wide Web.
That made a lot of sense in an era when each web page required human effort. It is no longer clear that “search everything” is a desirable or practical goal for a search engine today.
Within a few years, it will be obviously stupid.
If the Web is 50% some-effort content and 50% total garbage, it would probably still make sense to search all of it. What about when 99% of it is garbage generated by LLMs that will literally say anything for a high search ranking, regardless of truth value?
What about when 99.9999% is? I’m not talking about just low-effort, stupid content. I’m talking about content that required zero human effort to make, that was made in batches of tens of thousands and never so much as looked at by a human eye.
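To make that concrete using the 400-billion-document index size quoted earlier, here is a back-of-the-envelope calculation (nothing more than arithmetic) of how much human-made content would survive at each garbage ratio.

```python
# Back-of-the-envelope: human-made pages left in a 400-billion-document index
# at various garbage ratios. Purely illustrative arithmetic.
INDEX_SIZE = 400_000_000_000

for garbage in (0.50, 0.99, 0.999999):
    human_made = INDEX_SIZE * (1 - garbage)
    print(f"{garbage:.4%} garbage -> {human_made:>18,.0f} human-made pages")
```

At the last ratio, the entire human-made portion of the index would be roughly the size of Lycos’s 1994 catalogue.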
Even if you can scrape it all and index it and rank it with the most complicated algorithm ever seen…why?
Why waste vast resources on looking at dead sites and cancerous sites when the real stuff is increasingly rare?
Google, Senile and Senescent
Perhaps Google’s endgame is for the paid results to become better than the “organic” ones. Call it the ultimate triumph of the market’s invisible hand, holding auctions for who will become the visible source of truth for seekers.
Or perhaps they really think their efforts to incorporate AI into search will bear fruit. I think it’s ill-advised FOMO that dilutes whatever brand promise they have left when the new AI search tells you to eat rocks or drink bleach.
I don’t think scraping the internet to find answers using AI is a good idea for all the reasons above: if most of the internet becomes AI-generated, no-human-ever-touched-it content and the search engine scrapes it, what are the answers going to be like?
It’s a problem that gets worse, not better, with more time and training data.
The stupid AI answers you’ve seen Google playing whack-a-mole with are the cancer spreading to the brain, and it is inoperable.
Any new competitor in the search industry—which is to say, the “getting people to the information they want to see on the World Wide Web” industry—will need to take a different approach.
The Museum Campus
Instead of a wide-open, “search everything” model, new search engines will work by having human curators add known-good domains and web pages, and by having individual content creators submit their work for human review and a human decision about inclusion.
In a good form, perhaps we could imagine search of the future resembling a museum campus—made of many diverse collections and subject to heavy curation.
Yes, curated search engines of this type would each have their own criteria for inclusion. It’s unlikely a single behemoth would satisfy everyone. Yes, this would lead to Balkanization, with people increasingly ensconced in small information bubbles, and many would never leave the search engine where their worldview was reinforced.
When you submitted to such a search engine, you could certify that you hadn’t used certain scammy techniques. You could sign off that you accepted you’d be removed if you suddenly started generating garbage content, or used LLMs without fact-checking outputs. Maybe some search engines only take your consumer products into their curated collections if you have a specific BBB rating (or whatever replaces it in the future). Maybe others only accept your blog articles if you have a Ph.D. in a hard science.
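What might the machinery behind such an engine look like? The sketch below is one hypothetical shape for it: an allowlist of reviewed pages, a submission queue, and search restricted to what a human has approved. Every name, field, and criterion here is invented.

```python
# Hypothetical curated search index: nothing is searchable until a human
# reviewer has approved the submission. Names, fields, and criteria are invented.
from dataclasses import dataclass, field

@dataclass
class Submission:
    url: str
    author: str
    attestations: set    # e.g. {"no_paid_backlinks", "llm_output_fact_checked"}

@dataclass
class CuratedIndex:
    required_attestations: set
    approved: dict = field(default_factory=dict)   # url -> page text
    pending: list = field(default_factory=list)

    def submit(self, submission, page_text):
        self.pending.append((submission, page_text))

    def review(self, reviewer_accepts):
        """A human works through the queue; the machine only enforces attestations."""
        still_pending = []
        for submission, page_text in self.pending:
            if (submission.attestations >= self.required_attestations
                    and reviewer_accepts(submission)):
                self.approved[submission.url] = page_text
            else:
                still_pending.append((submission, page_text))
        self.pending = still_pending

    def search(self, query):
        # Only human-approved pages are ever searched.
        terms = query.lower().split()
        return [url for url, text in self.approved.items()
                if all(t in text.lower() for t in terms)]
```

The point is not the data structure; it’s that the expensive step, a human deciding whether something belongs in the collection, is back in the loop.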
I’m not saying it’s a perfect idea. I’m not even saying it’s a good idea.
I am saying that within a decade, unless it gets significantly harder to mass-produce content with little to no human input, it’s the only strategy left that can ensure any human user of the internet can ever again find what they are really looking for.
Our Great Library of Alexandria has been forced to accept into the stacks the output of monkeys on typewriters, until the real books are rare (and getting rarer).
The only thing that can save us is curation.
This is the dawning of the age of librarians.