paul perry

Personal Search Crawlers

No search engine farm will ever be able to index the entire web, let alone cache it as Google does. The web grows too fast for search engines to keep up, and the most interesting content is often the newest, which is often not indexed yet.

"Traditional search engines like Altavista or Excite don't even bother to fully index such sites both because of the magnitude of the information store and of the rate of change of the information stored within the pages. As such search engines seem to index the portion of the web site they find most relevant or useful to the community as a whole." [*]

Current web search engines actually find excellent starting points, but the user is left with the mind-numbing task of clicking onward from there. Even if the information they are looking for is a mere 19 clicks away, they're never going to get there.

While it would be nice to have a Big Fat Webserver (BFW) at the ready to answer all my query needs, it's not going to happen, because BFWs have as their foundation a crawler that suffers from the same problem: performance degrades as the network grows.

Re/searchers have evolved their own strategies for their specific tasks. They have selected the key sites that pertain to their job/task, and they start crawling from there. They exploit the structure of these sites, be they topic specific search sites or industry specific content sites.

Turns out that these strategies can be successful.

So researchers have started down various other paths:

  • personal search engines
  • directed search engines
  • semantics based search engines
  • classifier engines

Personal crawlers learn the topic of interest from example URLs, and then selectively seek out pages and sites relevant to that topic. Owing to this specificity, a focused crawler running on modest desktop hardware can return more relevant and fresher results than a universal search engine.
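To make "learning the topic from example URLs" a bit more concrete, here is a minimal sketch. It assumes the example pages have already been fetched and reduced to plain text, and it uses a plain term-count profile with cosine similarity as the relevance measure. That is my stand-in for illustration, not the method of any particular crawler. The names `build_profile` and `relevance` and the sample texts are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_profile(example_texts):
    """Sum term counts over the example pages to form a topic profile."""
    profile = Counter()
    for text in example_texts:
        profile.update(tokenize(text))
    return profile

def relevance(profile, text):
    """Cosine similarity between the topic profile and a candidate page."""
    page = Counter(tokenize(text))
    dot = sum(profile[t] * page[t] for t in page)
    norm = (math.sqrt(sum(v * v for v in profile.values()))
            * math.sqrt(sum(v * v for v in page.values())))
    return dot / norm if norm else 0.0

# Hypothetical texts standing in for pages fetched from the example URLs
examples = [
    "focused crawler topic classifier web pages",
    "web crawler scheduler link graph topic",
]
profile = build_profile(examples)
on_topic = relevance(profile, "a crawler that follows links on a topic")
off_topic = relevance(profile, "recipes for sourdough bread and pastry")
```

A page that shares vocabulary with the examples scores above zero, while one with no shared terms scores exactly zero. A real personal crawler would want something less naive (stop words, term weighting), but the shape is the same.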

Focused crawlers use a hypertext classifier that can learn from examples and identify topics of web pages and sites; a scheduler that identifies strategic nodes in the crawl graph; and a crawler controlled by the recommendations of the classifier and the scheduler. Results showed that pages found after just an hour of crawling were significantly superior to those from a universal search engine.
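Here is a rough sketch of how a classifier and a scheduler might steer a crawl loop. Everything in it is an assumption for illustration: the "web" is a small in-memory dictionary instead of real HTTP fetches, the classifier is a keyword ratio rather than a trained hypertext classifier, and the scheduler is a simple best-first priority queue in which links inherit the score of the page that points to them.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links)
WEB = {
    "a": ("crawler topic overview", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("focused crawler classifier", ["e"]),
    "d": ("more recipes", []),
    "e": ("crawler scheduler design", []),
}
TOPIC_TERMS = {"crawler", "classifier", "scheduler", "topic"}

def classify(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0

def focused_crawl(seeds, max_pages, threshold=0.5):
    """Best-first crawl: visit the most promising frontier URL first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]  # a real crawler would fetch over HTTP here
        visited += 1
        score = classify(text)
        if score >= threshold:
            relevant.append(url)
        # Scheduler: links inherit the score of the page that cites them.
        for link in links:
            if link not in seen and link in WEB:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant

result = focused_crawl(["a"], max_pages=4)
```

With a budget of four pages, the crawl starting at "a" returns `['a', 'c', 'e']`: it spends one visit on the off-topic page "b", but never follows "b" to "d", because links from on-topic pages outrank it in the queue.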

The intuition is that collaborative human efforts to organize, annotate and share search results can also be a source of better metadata about the web and a better starting point for specific knowledge communities (lawyers, analysts, researchers, librarians, etc.).

The application of machine learning seems to have taken root in the search space and is making improvements. Check WhizbangLabs (FlipDog) and TopicalNet (Links2Go).

But here I think that a little human involvement will again produce vastly improved results. If Open Directory, or DMOZ, could be assembled with some ease, then a whole new version of that effort is likely to take place. But instead of categorizing at the page level, people will do this against particular published XML Schemas. Various industries could start dedicating part of their staff to binding terms to schema entities, and with sufficient critical mass a whole new level of granularity of semantic meaning will be available to search.
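As a toy illustration of what "binding terms to schema entities" could look like as data, here is a sketch. The schema names, element names and terms are all hypothetical, and the plain dictionary store is only an assumption about how such bindings might be kept.

```python
from collections import defaultdict

# Hypothetical bindings contributed by editors: (schema, element) -> terms
bindings = defaultdict(set)

def bind(schema, element, term):
    """Record that a human editor bound a term to a schema element."""
    bindings[(schema, element)].add(term.lower())

def entities_for(term):
    """Return every schema element a term has been bound to, sorted."""
    term = term.lower()
    return sorted(key for key, terms in bindings.items() if term in terms)

# Made-up schema and element names, for illustration only
bind("purchase-order", "shipTo", "delivery address")
bind("purchase-order", "shipTo", "ship-to")
bind("invoice", "billTo", "billing address")

matches = entities_for("Delivery Address")
```

A search for "delivery address" can then be narrowed to documents that carry a `shipTo` element of the purchase-order schema, which is a finer grain than page-level categories.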

Providers of different types of services can chain together to perform more sophisticated services, and publish data posting standards to enhance query capabilities.

Key Papers

Other interesting references

search engine evaluation

Crawlers

search engine implementations

custom search engines

Classifiers

Domain Specific Search

User search strategies

Other sources

The rest