Personal Search Crawlers
No search engine farm will ever be able to index the entire web,
let alone cache it, as Google offers. The sheer growth of the web
makes it very hard for search engines to keep up, and the most
interesting content is often the newest, which has often not been
indexed yet.
"Traditional search engines like Altavista or Excite don't even
bother to fully index such sites both because of the magnitude of the
information store and of the rate of change of the information stored
within the pages. As such search engines seem to index the portion of
the web site they find most relevant or useful to the community as a
whole." [*]
Current web search engines actually find excellent starting points,
but the user is left with the mind-numbing task of clicking away, and
while the information they are looking for is a mere 19 clicks
away, they're never going to get there.
While it would be nice to have a Big Fat Webserver (BFW) at the
ready to answer all my query needs, it's not going to happen, because
BFWs have as their foundation a crawler that suffers from the same
problem of degraded performance with network growth.
Re/searchers have evolved their own strategies for their specific
tasks. They have selected the key sites that pertain to their
job/task, and they start crawling from there. They exploit the
structure of these sites, be they topic specific search sites or
industry specific content sites.
Turns out that these strategies can be successful.
So researchers have started down various other paths:
- personal search engines
- directed search engines
- semantics based search engines
- classifier engines
Personal crawlers learn the topic of interest from example
URLs, and then selectively seek out pages and sites relevant to that
topic. Owing to this specificity, a focused crawler running on modest
desktop hardware can retrieve more relevant and fresher results than
a universal search engine.
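As a rough illustration of "learning the topic from example URLs", here
is a minimal sketch that builds a bag-of-words profile from example page
texts and scores new pages by cosine similarity. The function names, the
sample texts and the choice of cosine similarity are all assumptions made
for illustration; none of the systems referenced on this page is claimed
to work this way.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profile(example_texts):
    """Sum term counts over the example pages to form a topic profile."""
    profile = Counter()
    for text in example_texts:
        profile.update(tokenize(text))
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(profile, page_text):
    """Score a candidate page against the learned topic profile."""
    return cosine(profile, Counter(tokenize(page_text)))


# Hypothetical texts standing in for pages fetched from the example URLs.
examples = [
    "focused crawler follows links on a topic",
    "web crawler and search engine index pages",
]
profile = build_profile(examples)
on_topic = relevance(profile, "a focused crawler ranks links by topic")
off_topic = relevance(profile, "recipes for baking sourdough bread")
```

A real personal crawler would fetch the example URLs, strip markup and
probably weight terms (for instance by inverse document frequency); the
sketch only shows the shape of the idea.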
Focused crawlers use a hypertext classifier that can learn from
examples and identify topics of web pages and sites; a scheduler that
identifies strategic nodes in the crawl graph; and a crawler controlled
by the recommendations of the classifier and the scheduler. Results
showed that pages found after just an hour of crawling were
significantly superior to those from a universal search engine.
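The interplay of classifier, scheduler and crawler can be sketched as a
best-first crawl. This is a toy under stated assumptions, not the
algorithm from the focused-crawling papers: the "web" is a hypothetical
in-memory dictionary, the classifier is a simple keyword ratio, and the
scheduler is reduced to a priority queue in which each link inherits the
relevance of the page it was found on.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "seed": ("crawler search index", ["a", "b"]),
    "a": ("focused crawler classifier", ["c"]),
    "b": ("cooking recipes bread", ["d"]),
    "c": ("topic crawler scheduler", []),
    "d": ("gardening tips", []),
}

# Hypothetical topic vocabulary used by the toy classifier.
TOPIC_TERMS = {"crawler", "classifier", "scheduler", "search", "index", "topic"}


def classify(text):
    """Toy classifier: fraction of words on the page that are topic terms."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w in TOPIC_TERMS for w in words) / len(words)


def focused_crawl(seeds, fetch, max_pages):
    """Best-first crawl: a link's priority is the score of its parent page."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = classify(text)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


order = [url for url, _ in focused_crawl(["seed"], WEB.__getitem__, max_pages=4)]
```

With a budget of four pages, the link found on the off-topic page "b" is
never fetched, while the link found on the on-topic page "a" is. Replacing
`WEB.__getitem__` with a function that downloads and parses real pages
would be the obvious next step, subject to the usual politeness rules.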
The intuition is that collaborative human efforts to organize,
annotate and share search results can also be a source of better
metadata about the web and a better starting point for specific
knowledge communities (lawyers, analysts, researchers, librarians,
etc.).
The application of machine learning seems to have taken root in the
search space and is producing improvements. See WhizbangLabs
(FlipDog) and TopicalNet (Links2Go).
But here I think that a little human involvement will again produce
vastly improved results. If the Open Directory (DMOZ) could be
assembled with some ease, then a whole new version of that effort is
likely to take place. But instead of categorizing at the page level,
people will categorize against particular published XML Schemas.
Various industries could start dedicating part of their staff to
binding terms to schema entities, and with sufficient critical mass a
whole new level of granularity of semantic meaning will be available
to search.
Providers of different types of services can chain together to
perform more sophisticated services, and publish data posting
standards to enhance query capabilities.
Key Papers
- Sergey Brin and Lawrence Page,
The Anatomy of a Large-Scale Hypertextual Web Search Engine,
Stanford. [The Google search engine]
- Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar
Rajagopalan, David Gibson, and Jon Kleinberg,
Resource Compilation by Analyzing Hyperlink Structure and Associated
Text, Special Issue of the Seventh World Wide Web
Conference, 30(1-7), April 1998.
- S. Chakrabarti, B. Dom, S. Ravi Kumar, P. Raghavan, S. Rajagopalan,
A. Tomkins, D. Gibson, J. Kleinberg. Mining the Web's Link
Structure. IEEE Computer, 32(8):60-67, August 1999.
- S. Chakrabarti, M. van den Berg, B. Dom. Focussed Crawling: A New
Approach to Topic Specific Resource Discovery. Proceedings of the
Eighth World Wide Web Conference, pages 545-562, 1999.
- S. Chakrabarti, B. Dom, S. Ravi Kumar, P. Raghavan,
S. Rajagopalan, A. Tomkins, J. M. Kleinberg, and D. Gibson. Hypersearching
the Web. Scientific American, June 1999.
- S. Chakrabarti, D. Gibson, K. McCurley.
Surfing the Web Backwards. In WWW8, 1999.
- Charu C. Aggarwal, Fatima Al-Garawi, and Philip S. Yu, Intelligent
Crawling on the World Wide Web with Arbitrary Predicates, WWW10,
May 1-5, 2001, Hong Kong.
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Efficient
Crawling Through URL Ordering.
- Diligenti M., Coetzee F.M., Lawrence S., Giles C. L. and Gori M.,
"Focused Crawling Using Context Graphs", Proc. Very Large
Databases 2000, Cairo, Egypt.
- Jenny Edwards, Kevin McCurley, John Tomlin,
An Adaptive Model for Optimizing Performance of an Incremental Web
Crawler, WWW10, May 1-5, 2001, Hong Kong.
- Marc Najork, Janet L. Wiener, "Breadth-first
search crawling yields high-quality pages". WWW10, May 1-5, 2001,
Hong Kong.
- Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Namprempre,
Peter Szilagyi, Andrzej Duda, and David K. Gifford, HyPursuit:
A Hierarchical Network Search Engine that Exploits Content-Link
Hypertext Clustering, HyperText 1996.
Other interesting references
search engine evaluation
- Steve Lawrence and C. Lee Giles,
Searching the World Wide Web, Science, Volume 280, Number 5360,
pp. 98-100, 1998.
[The coverage of any one engine is significantly
limited: No single engine indexes more than about one-third of the
"indexable Web", and combining the results of the six engines yields
about 3.5 times as many documents on average as compared with the
results from only one engine.]
- Heting Chu, Marilyn Rosenthal,
Search Engines for the World Wide Web: A Comparative Study and
Evaluation Methodology, ASIS 1996 Conference.
- H. Vernon Leighton, Dr. Jaideep Srivastava,
Precision among World Wide Web Search Services (Search Engines): Alta
Vista, Excite, Hotbot, Infoseek, Lycos.
- David Hawking, Nick Craswell, Paul Thistlewaite, Results and
Challenges in Web Search Evaluation, WWW8.
- Glen Pringle, Lloyd Allison and David L. Dowe,
What is a tall poppy among web pages? WWW7.
- Danny Sullivan, SearchEngineWatch Reviews and reports,
or Aaron Schatz, the Sultan of Search's Lycos 50.
- Search Tool Comparisons
Crawlers
search engine implementations
custom search engines
Classifiers
- Chandra Chekuri, Michael H. Goldwasser, Prabhakar Raghavan, Eli
Upfal, Web Search Using Automatic Classification.
- Richard Fox, Internet Classification Search Engine.
- A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig,
Syntactic Clustering of the Web, WWW6, 1997.
- TopicalNet: in the absence of a usable classification system (like
the Dublin Core), TopicalNet (formerly known as TopicalStorm) has
created a classification technology based on machine learning. The
insight seems to be that the text in a link can be a key determinant
of the classification of the content it points to.
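To make the anchor-text insight concrete, here is a minimal sketch that
extracts link text with Python's standard `html.parser` and labels each
link target by keyword overlap. The topic vocabularies, the sample HTML
and the overlap rule are assumptions for illustration only; this is not
a description of TopicalNet's technology, which is based on machine
learning rather than fixed keyword lists.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


# Hypothetical topic vocabularies.
TOPICS = {
    "jobs": {"jobs", "careers", "hiring"},
    "search": {"search", "crawler", "index"},
}


def classify_anchor(text):
    """Return the topic sharing the most words with the anchor text, or None."""
    words = set(text.lower().split())
    best = max(TOPICS, key=lambda topic: len(TOPICS[topic] & words))
    return best if TOPICS[best] & words else None


html = (
    '<p><a href="/careers.html">Careers and jobs</a> | '
    '<a href="/spider.html">Our web crawler</a> | '
    '<a href="/x.html">Click here</a></p>'
)
collector = AnchorCollector()
collector.feed(html)
collector.close()
labels = {href: classify_anchor(text) for href, text in collector.links}
```

Note that "Click here" yields no label: anchor text only helps when the
author of the linking page wrote something descriptive.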
Domain Specific Search
User search strategies
Other sources
- Vladimir Meñkov, David J. Neu, Qin Shi, AntWorld: A
Collaborative Web Search Tool, DCW 2000.
- Ziv Bar-Yossef, Alex Berg, Steve Chien, Jittat Fakcharoenphol,
and Dror Weitz,
Approximating Aggregate Queries about Web Pages via Random Walks,
Proceedings of the 26th International Conference on Very Large
Databases (VLDB), 2000, pages 535-544.
- The Diameter of the World Wide Web, Nature, Vol. 401, 9 Sept. 1999.
- Haining Zhang, Ye Lu,
Permeate, The Next Generation of Search Engine.
- Distributed WWW Search
Using Semantic Routers based on Harvest.
- Krishna Bharat, George A. Mihaila, Hilltop: A Search
Engine based on Expert Documents.
- GHuRU - Search Engine Interaction
- Michael Chen, Marti Hearst, Jason Hong, James Lin, Cha-Cha: A System
for Organizing Intranet Search Results, UC Berkeley. USITS 1999.
[Cha-Cha organizes web search results to reflect
the underlying structure of the intranet by combining a
hierarchical outline of the root pages with the search results.]
- Agustin Schapira, Collaboratively Searching the Web, UMass.
- Simple Web Indexing for Humans - Enhanced: SWISH-Enhanced is a
fast, powerful, flexible, free, and easy to use system for indexing
collections of Web pages or other text files.
- Hypersearching the Web, Scientific American, 1999.
Chakrabarti et al., Clever Project, IBM Research.
- Advanced Search Facility ("tools
for gathering and organizing information within and among information
communities.")
- AltaVista Refine feature: LiveTopics, Refine
- The Anti-Filter, Carson Reynolds, MIT Media Lab.
- David Lowe, Improving Web Search
Relevance: Using Navigational Structures to Provide a Search
Context, in Australian World Wide Web Conference 2000.
The rest