Opening a blog is a good opportunity to discover (and rediscover) how the main search engines gather their information nowadays: long gone are the days when AltaVista was the only major search engine.
It is also a good way to try and find out some of the reasons why a blog is a bit peculiar compared to a generic web site.
In particular, most blogs or news sites rely on technologies called RSS (for Really Simple Syndication) or Atom to provide a standard format for automated engines to know about available articles or posts; they’re called syndication formats. Most blog-specialized engines are based on the use of syndication formats; that’s the main point that makes them different from generic search engines. However, RSS was invented for news sites years before the word blog came into wide use; hence the difference between a blog and a web site is not as clear-cut as it may seem.
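To make the idea of a syndication format more concrete, here is a minimal sketch of what an automated engine might do with an RSS 2.0 feed: parse it and list the posts it announces. The sample feed, its titles and its URLs are invented for illustration; only the element names (`channel`, `item`, `title`, `link`, `pubDate`) come from RSS 2.0 itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS 2.0 feed; the blog name and URLs are made up.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://blog.example.org/</link>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/2007/01/06/first-post/</link>
      <pubDate>Sat, 06 Jan 2007 16:08:51 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def list_posts(feed_xml):
    """Return one dict per <item> found in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })
    return posts


if __name__ == "__main__":
    for post in list_posts(SAMPLE_FEED):
        print(post["published"], post["title"], post["link"])
```

An engine that reads the feed this way never has to guess which pages of the site are articles: the feed says so explicitly.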
A web server’s access log is the best place to start when trying to understand who does what with your site. It lets you know a few basic things about each and every access; of particular interest we have:
- the accessing IP address;
- the accessed page on the site;
- the date and time;
- the type of browser, which can contain a lot of information by itself.
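As an illustration, the fields listed above can be pulled out of a log line with a few lines of Python. This is only a sketch: it assumes the server writes the common Apache "combined" log format, and the sample line (IP address, size, exact user agent string) is made up rather than copied from my logs.

```python
import re
from datetime import datetime

# Assumption: Apache "combined" log format. Other servers or custom
# formats would need a different pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log line, for illustration only.
SAMPLE_LINE = (
    '192.0.2.10 - - [06/Jan/2007:17:10:14 +0100] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Return the interesting fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Turn the bracketed timestamp into a timezone-aware datetime.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


if __name__ == "__main__":
    entry = parse_line(SAMPLE_LINE)
    print(entry["ip"], entry["time"].isoformat(), entry["path"], entry["agent"])
```

The "browser" field is the user agent string; for robots it usually carries the robot's name, its version and often a URL describing it.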
The blog software I use, WordPress, notifies a service called ping-o-matic of new posts. Ping-o-matic is an update service: when notified, it in turn lets a lot of engines know that a new post is available on your blog, without them having to find out the hard way, i.e. by polling your site. It relieves the search engines and it relieves your site. You can also try ping-o-matic by hand by filling in the form on their home page. On that page you can see some of the services that ping-o-matic notifies; 22 services are listed at the time of this writing. A funny one, if you’re bored, is weblogs.com, whose home page simply provides a scrolling list of new blog posts as they are published.
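For the curious, update services of this kind are typically notified through an XML-RPC call named `weblogUpdates.ping`, which takes the blog name and its URL. The sketch below only builds the request body so that it can be inspected without touching the network; the blog name and URL are placeholders, and the endpoint mentioned in the comment is my assumption about where ping-o-matic listens, not something taken from WordPress.

```python
import xmlrpc.client


def build_ping_request(blog_name, blog_url):
    """Build the XML-RPC body of a weblogUpdates.ping call (nothing is sent)."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


if __name__ == "__main__":
    # Placeholder values for illustration.
    print(build_ping_request("Example blog", "http://blog.example.org/"))
    # Actually sending the ping would look roughly like this
    # (assumed endpoint, left commented out on purpose):
    # server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
    # server.weblogUpdates.ping("Example blog", "http://blog.example.org/")
```

The request is tiny: the update service only learns that something changed, and each engine then fetches the feed or the page itself.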
My first post on this blog is dated 17:08:51 CET on 6 January 2007. This means my blog notified ping-o-matic at that time, give or take a few seconds. In turn, interested search engines schedule an access to the new page. It’s interesting to have a look at who gets where and in what order.
So, on to the engine rush. Beware (and sorry) that the following looks awfully like a catalog.
The first search engine to come up is YahooFeedSeeker/2.0, at 17:08:58. Wow: barely 7 seconds after the page was posted. It doesn’t access the page quite directly: being a nice robot that follows the robots exclusion convention, it first looks for a robots.txt file describing search engine access restrictions on this site. There’s no such file here, meaning that there are no restrictions, at least at the moment. It also provides us (hidden in the logged browser name) with a reference to a nice page of hints regarding RSS feeds. Most robots provide a similar informative page, allowing web site administrators to better know what the engines are doing. YahooFeedSeeker is not a generic page-indexing robot: it’s only looking for RSS feeds.
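Here is a small sketch of the check a polite robot performs, using Python's standard `urllib.robotparser`. The rules are passed in as text instead of being downloaded; an empty rule set stands in for a site with no robots.txt, and the `/wp-admin/` rule and example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Tell whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# No rules at all: everything is allowed.
NO_RULES = ""

# Hypothetical rules keeping every robot out of one directory.
BLOCK_ADMIN = "User-agent: *\nDisallow: /wp-admin/\n"

if __name__ == "__main__":
    print(allowed(NO_RULES, "YahooFeedSeeker", "http://blog.example.org/feed/"))
    print(allowed(BLOCK_ADMIN, "Googlebot", "http://blog.example.org/wp-admin/"))
    print(allowed(BLOCK_ADMIN, "Googlebot", "http://blog.example.org/"))
```

Note that robots.txt is purely advisory: nothing prevents a robot from ignoring it, which is why checking the access log remains useful.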
Next, at 17:10:14, the site is accessed by Googlebot/2.1, Google’s main robot. It provides us with an informative FAQ about its whereabouts. For the moment it only looks at the home page; it will index the whole site much later in the evening.
At 17:16:43, Technoratibot/0.7 in turn visits the home page. No nice informative URL here. The robot works for Technorati, a search engine specializing in blogs. The robot accesses the home page, then the RSS feed; not very surprising from a blog search engine. What may be more surprising is that it never looks for a robots.txt; the reason is that it only comes when a new post appears on the blog, i.e. as a result of human action; it learns of that through ping-o-matic.
At 17:19:11 comes Sphere Scout&v4.0 (beta). Sphere is another blog-specialized search engine. It looks for robots.txt, then the home page, then the RSS feed… three times in a row: that’s probably not really optimal.
Moreoverbot/5.00 greets us at 17:47:35. Another blog indexing engine looking for RSS feeds, whose results are apparently not publicly accessible.
At 17:52:05 comes Bloglines/3.1. Bloglines is a popular RSS aggregator. It comes after someone (myself, actually) added this blog to his personal feed list, rather than being notified by ping-o-matic. Bloglines has a nice feature: it gives the number of subscribers to this particular RSS feed as part of its logged browser description. Bloglines comes back every 30 minutes and doesn’t fetch the RSS feed unless it has been updated since the last access.
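A standard way for an aggregator to avoid downloading an unchanged feed is the HTTP conditional GET: the client sends an `If-Modified-Since` header and the server answers `304 Not Modified` with no body when nothing has changed. I am assuming, without having checked, that this is the mechanism behind the behaviour described above. The sketch below shows the server-side decision only; the function name and the dates are mine.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def feed_response_status(feed_last_modified, if_modified_since_header):
    """Return 304 if the feed is unchanged since the client's copy, else 200.

    feed_last_modified must be a timezone-aware datetime.
    """
    if if_modified_since_header is None:
        return 200  # first visit: no cached copy on the client side
    try:
        since = parsedate_to_datetime(if_modified_since_header)
    except (TypeError, ValueError):
        return 200  # unreadable header: fall back to a full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if feed_last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


if __name__ == "__main__":
    modified = datetime(2007, 1, 6, 16, 8, 51, tzinfo=timezone.utc)
    header = format_datetime(modified, usegmt=True)
    print(header, feed_response_status(modified, header))
```

With a poll every 30 minutes, most requests would end in a 304, which costs the site next to nothing.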
Later comes Relevant Noise, a business-intelligence blog-oriented search engine.
Feedfetcher-Google, an RSS aggregator for Google Reader, is very similar to Bloglines. It comes only after being invited by a subscriber; it subsequently comes back every hour or so to check for RSS updates.
At 19:54:51 Yahoo! Slurp, Yahoo’s main generic web indexer, visits the whole site.
Hours later, msnbot-media/1.0 shows up. It is the indexer for MSN Search. It slowly fetches pages one after another; as of now I’m not even sure it has finished. Maybe that explains why MSN Search is so far behind Google in terms of indexed pages.
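To put the rush in perspective, the delay between the post (17:08:51) and each timestamped visit quoted above can be computed in a few lines. The times are the ones given in this post; robots mentioned without a timestamp are left out.

```python
from datetime import datetime

POSTED = "17:08:51"

# (robot, time of first visit), as quoted in the post.
VISITS = [
    ("YahooFeedSeeker/2.0", "17:08:58"),
    ("Googlebot/2.1", "17:10:14"),
    ("Technoratibot/0.7", "17:16:43"),
    ("Sphere Scout", "17:19:11"),
    ("Moreoverbot/5.00", "17:47:35"),
    ("Bloglines/3.1", "17:52:05"),
    ("Yahoo! Slurp", "19:54:51"),
]


def delays(posted, visits):
    """Return (robot, seconds elapsed since the post) for each visit."""
    start = datetime.strptime(posted, "%H:%M:%S")
    result = []
    for name, when in visits:
        elapsed = datetime.strptime(when, "%H:%M:%S") - start
        result.append((name, int(elapsed.total_seconds())))
    return result


if __name__ == "__main__":
    for name, seconds in delays(POSTED, VISITS):
        minutes, rest = divmod(seconds, 60)
        print(f"{name:22s} +{minutes:3d} min {rest:02d} s")
```

The feed-oriented robots all show up within the first hour, while the generic indexers take their time to crawl the whole site.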
Other lesser-known, smaller engines I found in the logs:
This list is obviously not complete: some engines may not even know about this site yet; some engines may be taking their time; new engines appear every day… let’s just leave it at that for the moment!
You should install the “sitemap” extension, which lets Google (and possibly others) know how your site is organized.
Sam: done, it’s located at http://www.arnebrachhold.de/2005/06/05/google-sitemaps-generator-v2-final.
Apparently this format is supported by Google, Yahoo and MSN Search.
Thanks for the advice.