The lie of distribution--search engines return very little value to news/blog sites yet hog bandwidth and increase server loads

By Tom Foremski - October 26, 2005

. . .or is it just me?
By Tom Foremski, Silicon Valley Watcher.com

Search-and-scrape sites such as Google, Yahoo, MSN, and oodles of others claim they bring traffic to web sites. And they do--but at what cost? It was a question I asked myself following a chat with Jim Buckmaster, CEO of craigslist, and the company's recent complaint that Oodle was scraping its listings far too aggressively and slowing down the entire system.

I took a look at my server stats from mid-October.

The search-and-scrapers sucked out one-third of my bandwidth and provided just 3.7 percent of the traffic!


awstats3.png


awstats1.png

Microsoft is by far the most egregious of the lot. Over the past three weeks it sucked out 4.6GB, or 18 percent of my bandwidth, and returned...275 page views--0.0007 percent of the total!!
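
As a rough check on the Microsoft figures above, the arithmetic can be sketched in Python. The only inputs are the numbers quoted in the post (4.6GB, 18 percent, 275 page views). The variable names are illustrative, and the implied total is an estimate derived from those rounded figures, not a number read from the server stats.

```python
# Figures quoted in the post for Microsoft's crawler over about three weeks.
msn_bandwidth_gb = 4.6      # bandwidth consumed by the crawler
msn_bandwidth_share = 0.18  # 18 percent of all bandwidth served
msn_page_views = 275        # page views referred back to the site

# Implied total bandwidth for the period (an estimate from rounded inputs).
implied_total_gb = msn_bandwidth_gb / msn_bandwidth_share

# Page views returned for each gigabyte the crawler consumed.
views_per_gb = msn_page_views / msn_bandwidth_gb

print(f"Implied total bandwidth: {implied_total_gb:.1f} GB")
print(f"Page views returned per GB crawled: {views_per_gb:.1f}")
```

On these inputs the site served roughly 25 to 26GB in the period, and the crawler sent back about 60 page views for every gigabyte it pulled.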

awstats2.png

It is news/blog/original content sites that are being targeted by these over-zealous bots because they provide fresh content, and without fresh content the search-and-scrapers have nothing. And the more often you post fresh content, the more attention and visits you get from the bot army.

And let's not talk about the masses of cached pages out there that are hit and viewed but do not show up in the server stats--yet are counted and monetized by the search-and-scrapers.

It is inevitable that content owners will increasingly choose to glue down their content. You can only get it here and you have to come here to get it--will be the mantra.

Why do you think Yahoo is scrambling as quickly as it can into producing original content? ;-) It knows the writing is on the wall.

Tell me what your AWstats are showing...



October 26, 2005 | Category: MediaWatch

Comments (7)

bhiv:

I think the problem has less to do with the "search-and-scrape" sites and more to do with the netiquette of bots/crawlers.



A couple of months ago several people I know were hit by bandwidth overages, or were shut down, when Yahoo started aggressively searching for images. How much of it was Yahoo's fault and how much of it was the webmaster's fault for having multiple high-resolution photos that were open to be crawled? I would wager a little bit of both.



Also, I don't think the number of links from search engines really paints the entire picture. It doesn't count the people who found articles on the search engine and then posted them to their blog, or sent them to a friend. Granted, this may end up being a marginal difference, but my point is that measuring a search engine's effectiveness (or return) through one vector can be inaccurate.



There was a deal (of sorts) struck in 1996 that the robots.txt protocol would serve as the guidelines a web site operator could impose on well-behaved spiders/crawlers/bots. This puts the onus on the webmaster to create one of these files and on the programmer of the crawler to obey it.



While it is not an official standard, all major search engines obey this rule. MSNbot and Inktomi Slurp (Yahoo) will even obey the Crawl-delay directive, so you can slow them down if they are using too much bandwidth.



If you are unhappy about how MSNBot has been indexing this site, you should consider creating your own robots.txt file.
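
A minimal sketch of what such a robots.txt might look like, checked with Python's standard urllib.robotparser module. The delay values and the /images/ path are assumptions made up for illustration, not settings taken from this site. The user-agent tokens msnbot and Slurp are the names those crawlers identify themselves with.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; the delays (in seconds)
# and the /images/ path are assumptions, not this site's real settings.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 120

User-agent: Slurp
Crawl-delay: 60
Disallow: /images/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what each crawler is told to do.
for agent in ("msnbot", "Slurp", "SomeOtherBot"):
    delay = parser.crawl_delay(agent)
    allowed = parser.can_fetch(agent, "/images/photo.jpg")
    print(f"{agent}: crawl delay = {delay}, may fetch images = {allowed}")
```

Crawl-delay is an extension that only some crawlers honor, so a file like this slows down the bots that respect it and does nothing about the ones that don't.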


Tony:

I think this analysis is very misleading. When analyzing bandwidth usage it must be done against the backdrop of peak load. You don't truly pay for the bits, you pay for the size of the pipe that you need during peak load.



When a crawler hits your site at 1am, even if it uses a lot of bandwidth, the economic cost is zero. It doesn't cost you a dime and it doesn't affect your users.


To the above comment, it's clear that crawlers need to be well behaved. They shouldn't be hitting your site at noon, or at least not doing major crawling at that time. Having said that, most of the crawlers I see hitting my site are well behaved.
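
The peak-load argument in this comment can be sketched with a toy model in Python. The hourly traffic figures and the crawl size are invented purely for illustration, and the model assumes, as the comment does, that cost depends only on the busiest hour.

```python
# Hypothetical reader traffic in megabytes per hour (index = hour of day).
# These numbers are made up to illustrate the argument, not measured.
reader_mb = [5, 3, 2, 2, 3, 6, 15, 30, 45, 55, 60, 70,
             80, 75, 65, 60, 55, 50, 40, 30, 20, 15, 10, 7]

def peak_hourly_mb(hourly_mb):
    """Return the busiest hour's traffic, which is what sizes the pipe."""
    return max(hourly_mb)

def with_crawl(hourly_mb, hour, crawl_mb):
    """Return a copy of the day's traffic with a crawl added at one hour."""
    day = list(hourly_mb)
    day[hour] += crawl_mb
    return day

crawl_mb = 40  # assumed size of one crawl
night = with_crawl(reader_mb, hour=1, crawl_mb=crawl_mb)   # 1am crawl
noon = with_crawl(reader_mb, hour=12, crawl_mb=crawl_mb)   # noon crawl

print("Peak without crawler:", peak_hourly_mb(reader_mb))
print("Peak with 1am crawl:", peak_hourly_mb(night))
print("Peak with noon crawl:", peak_hourly_mb(noon))
```

In this toy model the 1am crawl leaves the peak unchanged while the same crawl at noon raises it. A host that bills by total bytes transferred rather than by pipe size would charge for both crawls equally.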


You are right, I should control the robots, but, even with the fudged up AW stats, there is a new paradigm, or new rules. The new rules say that search sites are not going to get you much traffic because news/blog sites become part of people's daily rituals of information ingestion. You don't go to search engines for that; it becomes a habit of your daily routine to read news/blogs. That is a powerful model and a barrier to others...


bhiv:

I suppose if your content is time-sensitive then there is less of a benefit for you from search engines. But that doesn't mean the search engine plays no role in driving traffic: once your article has lapsed from the blogosphere's consciousness, someone doing research for a new story may still find it through search, find it helpful, and link to it.


Blog sites and social bookmarking sites will never replace search sites. It isn't new rules so much as an addendum to the old rules.


Tom, your analysis is terribly flawed and you have the economics all wrong.

I posted more here:
http://brontemedia.com/2005/10/28/you-say-toe-mato-i-say-tomar-to/

But simple economics: CPMs go up and bandwidth costs go down, which means the gap will widen even further.

It is also hard to be sympathetic when you actively market an RSS feed and do nothing about the robots.txt file.


all new channels start from 0. some of them work hard to grow.

as long as some of the leading channels block distribution they make it easier for new competitors to steal their market position.

just look at how well AOL's walled garden approach worked. also note that not everyone who publishes information publishes a channel on one specific topic and puts as much effort into that one channel as you do...for them search is much more important than it is for you, but without search you would lose market share to others.


Tom, now do this. Over time, measure how many people who initially found you through a search have come back directly to you. You'll find they make up much more than 3 percent of your traffic.

This is what I call the search gap. Once someone has found you, they often don't search again for you but come direct.

So, search engines may generate only 3 percent of your overall traffic, but that 3 percent will lead to many more visits over time.
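
The measurement suggested here can be sketched in Python. The visit log below is entirely hypothetical (invented visitor IDs and referrer labels), and it assumes returning visitors can be recognized, for example by a cookie. The point is only to show the two different shares being compared.

```python
# Hypothetical visit log in time order: (visitor_id, referrer_type).
# "search" = arrived from a search engine; "direct" = typed the address
# or used a bookmark or feed. All entries are invented for illustration.
visits = [
    ("a", "search"), ("b", "direct"), ("a", "direct"), ("c", "direct"),
    ("a", "direct"), ("d", "search"), ("b", "direct"), ("d", "direct"),
    ("c", "direct"), ("a", "direct"),
]

def search_shares(log):
    """Return two fractions of all visits:
    (visits that arrived via search,
     visits by people whose first visit arrived via search)."""
    first_referrer = {}
    for visitor, referrer in log:
        # Record only the referrer of each visitor's first visit.
        first_referrer.setdefault(visitor, referrer)
    via_search = sum(1 for _, ref in log if ref == "search")
    found_by_search = sum(1 for v, _ in log if first_referrer[v] == "search")
    return via_search / len(log), found_by_search / len(log)

referral_share, discovered_share = search_shares(visits)
print(f"Visits arriving via search: {referral_share:.0%}")
print(f"Visits by people who first found the site via search: {discovered_share:.0%}")
```

In this made-up log only 2 of 10 visits arrive from search, but 6 of 10 are by visitors who originally found the site that way. That difference is the gap this comment describes.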

As for the writing being on the wall, search engines have operated this way for 10 years, and it is the incredibly rare site owner who feels they eat up too much bandwidth for the traffic they deliver. But if you want to test it, just put up a robots.txt file banning the spiders from coming to your site. Bandwidth will drop -- and your traffic probably will too, as your rankings disappear.

