Posted by Tom Foremski - July 16, 2010
When I was in Brazil recently, I met with Berthier Ribeiro-Neto, head of engineering at Google Brazil. During our conversation I mentioned an idea I had about making the Google index into an open database that anyone could access; I said that this could dramatically speed up the Internet.
He said it was a good idea and that I "should write a position paper" on this subject.
(As a further thought, maybe it could also serve to take away some of the heat Google is feeling lately, in terms of its index rankings potentially favoring its own business interests.)
Here is my logic:
My server logs show that 20 different robots visit my site; one of the most frequent is the Googlebot. Each of these robots is trying to create an index of my site.
Each of these robots takes up a considerable amount of my resources. For June, the Googlebot ate up 4.9 gigabytes of bandwidth, Yahoo used 4.8 gigabytes, while an unknown robot used 11.27 gigabytes of bandwidth. Together, they used up 45% of my bandwidth just to create an index of my site.
These robots are all seeking the same information, and together they use nearly half of my bandwidth, slowing the site for all my readers. The same is true for tens of millions of web sites.
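The kind of per-robot bandwidth accounting described above can be sketched in a few lines. This is a minimal illustration, assuming the common Apache "combined" log format; the bot patterns and sample log lines are invented for the example, not taken from my actual logs.

```python
import re
from collections import defaultdict

# Illustrative crawler signatures (matched against the User-Agent field).
BOT_PATTERNS = {
    "Googlebot": re.compile(r"Googlebot", re.I),
    "Yahoo Slurp": re.compile(r"Slurp", re.I),
    "Bingbot/msnbot": re.compile(r"msnbot|bingbot", re.I),
}

# Matches the tail of a "combined" log line: status, bytes sent,
# referrer, then the quoted user-agent string.
LOG_RE = re.compile(r'" (\d{3}) (\d+|-) "[^"]*" "(?P<ua>[^"]*)"')

def bot_bandwidth(lines):
    """Tally bytes served to each known crawler versus everything else."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue  # skip lines that don't parse
        size = 0 if m.group(2) == "-" else int(m.group(2))
        ua = m.group("ua")
        for name, pattern in BOT_PATTERNS.items():
            if pattern.search(ua):
                totals[name] += size
                break
        else:
            totals["humans/other"] += size
    return dict(totals)

# Hypothetical sample lines in combined log format.
sample = [
    '66.249.1.1 - - [16/Jul/2010:00:00:01 -0700] "GET /a HTTP/1.1" 200 5000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '72.30.1.1 - - [16/Jul/2010:00:00:02 -0700] "GET /b HTTP/1.1" 200 3000 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '10.0.0.1 - - [16/Jul/2010:00:00:03 -0700] "GET /c HTTP/1.1" 200 2000 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]
print(bot_bandwidth(sample))
# → {'Googlebot': 5000, 'Yahoo Slurp': 3000, 'humans/other': 2000}
```

Summed over a month, these per-crawler totals are exactly the gigabyte figures quoted above.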
What if there was a single index that anyone could access?
You would get an immediate speed increase across the Internet with no additional investment in infrastructure.
Google and others could perform their own analysis of the index using their secret algorithms. After all, the value is not in the index itself; it is in the analysis of that index.
Mr. Ribeiro-Neto said, "That's a good idea. You probably wouldn't even need to spider the web sites."
Each web site could update the central index automatically each time something changed. This would result in a massive savings in bandwidth used by dozens of robots scouring the Internet for new information.
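The push model described above might look roughly like this: instead of waiting to be crawled, a site assembles a small notification listing what changed and sends it to the shared index. Everything here is a hypothetical sketch — no such central index exists, and the payload shape is my own invention for illustration.

```python
import json
from datetime import datetime, timezone

def build_change_notification(site, changed_urls):
    """Build the JSON body a site might POST to a central index
    each time its content changes (hypothetical payload shape)."""
    return json.dumps({
        "site": site,
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "changed": sorted(changed_urls),
    })

# Example: one changed page on a site (illustrative URLs).
payload = build_change_notification(
    "https://www.siliconvalleywatcher.com",
    {"https://www.siliconvalleywatcher.com/2010/07/open-index.php"},
)
print(payload)
```

One small notification per change replaces dozens of robots re-crawling the same pages, which is where the bandwidth savings would come from.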
What if Google opened up its index to the world as a goodwill gesture because it has the best index? It could still maintain the privacy of its algorithm but everyone would have the same information on which to perform their analysis.
It would show that there was nothing unusual or unethical in how Google collects information for its index. This might relieve some of the pressure it has come under this week to reveal more about how its search service is presented.
Also, Google's founders were once strong advocates of running the search index as a non-profit.
From page 39 of "Inside Larry and Sergey's Brain" by Richard Brandt (referral link):
Andrei Broder, who led the team that created the AltaVista search engine, the best of its time, talks about meeting Larry and Sergey. When the discussion turned to the topic of making money from the technology, Broder found that Page had a profound difference of philosophy on the subject. "It was a very funny thing about Larry," Broder recalls. "He was very adamant about search engines not being owned by commercial entities. He said it should all be done by a nonprofit. I guess Larry has changed his mind about that."
Brian Lent, now CEO at Medio Systems:
The problem with the Google search engine at the time, Lent recalls, is that Larry and Sergey didn't want to commercialize it, and Lent was anxious to become an entrepreneur. Their mantra at the time was more socialistic than entrepreneurial. "Originally, 'Don't be evil' was 'Don't go commercial,'" says Lent.
- - -
Don MacAskill, CEO of SmugMug writes:
... I would estimate close to 50% of our web server CPU resources (and related data access layers) go to serving crawler robots. Stop and think about that for a minute. SmugMug is a Top 300 website with tens of millions of visitors, more than half a billion page views, and billions of HTTP / AJAX requests (we're very dynamic) each month. As measured by both Google and Alexa, we're extremely fast (faster than 84% of sites) despite being very media heavy. We invest heavily in performance.
And maybe 50% of that is wasted on crawler robots. We have billions of 'unique' URLs since we have galleries, timelines, keywords, feeds, etc. Tons of ways to slice and dice our data. Every second of every day, we're being crawled by Google, Yahoo, Microsoft, etc. And those are the well-behaved robots. The startups who think nothing of just hammering us with crazy requests all day long are even worse. And if you think about it, the robots are much harder to optimize for - they're crawling the long tail, which totally annihilates your caching layers. Humans are much easier to predict and optimize for.
Worst part about the whole thing, though? We're serving the exact same data to Google. And to Yahoo. And to Microsoft. And to Billy Bob's Startup. You get the idea. For every new crawler, our costs go up.
We spend significant effort attempting to serve the robots quickly and well, but the duplicated effort is getting pretty insane. I wouldn't be surprised if that was part of the reason Facebook revised their robots.txt policy, and I wouldn't be surprised to see us do something similar in the near future, which would allow us to devote our resources to the crawlers that really matter.
Anyway, if a vote were held to decide whether the world needs an open-to-all index, rather than all this duplicated crawling, I'd vote YES! And SmugMug would get even faster than it is today.
- The NYTimes: The Google Algorithm
- FT.com / Comment / Opinion - Do not neutralize the web's endless search (Subscription required.)