An Interim Update on WWW Search Engines for Chemistry

Part of The Alchemist's Lair Web Site
Maintained by Harry E. Pence, Professor of Chemistry, SUNY Oneonta, for the use of his students. Any opinions are totally coincidental and have no official endorsement, including from the people who sign my paychecks. Comments and suggestions are welcome (pencehe@oneonta.edu).

Last Revised Nov. 27, 2000


(written for the fall issue of the Computers in Chemical Education Newsletter)
Harry E. Pence
Chemistry Department
SUNY Oneonta
Oneonta, NY 13820
pencehe@oneonta.edu

An earlier article in this series, "Evaluating Search Engines for Chemistry," compared the search engines available at that time and suggested that AltaVista and Northern Light were most useful for serious scientific searches. To a large extent, this recommendation was based on selecting the search engines with the largest indexes. Chemists generally search for material that goes beyond the common terms that are the focus of most web searches, and so a larger index improves the chance that unusual material will be available. Since that article was written, there have been some significant developments that should be considered when selecting a search engine. This presentation will be an interim treatment, focusing primarily on Google, a relatively new engine that has a very large index and is also unusually good at returning relevant web pages. See The Alchemist's Lair next semester for a broader reassessment of all of the major search engines.

Recently, Google has claimed that it has an effective index as large as the estimated number of indexable pages on the entire web. This is the first time that any search engine has claimed to cover the entire WWW. Even though this claim may not be completely accurate, it does represent very good news for web users who search for specialized information. Google claims that it links to over 602 million pages, and also includes 648 million URLs that it says are "partially linked."

The race for the title of the largest WWW search engine has been heating up for the past year. According to a November press release, the leaders are now FAST, which powers alltheweb, and Google. Each claims to have an index of over 575 million pages, moving them ahead of former leaders like AltaVista, Inktomi, and Northern Light, at least for the moment. In addition, Google claims that it can access a large number of pages that aren't included in the index, raising the total number of pages claimed to well over one billion. This matches the estimated total number of web pages that can be accessed by search engines, and no engine has previously claimed to be able to search so much of the web.

There are at least two studies that indicate that the claims of a very large index are reflected in the success of WWW searches. In July 2000, Danny Sullivan, one of the leading search engine experts and editor of the SearchEngineWatch.com site, compared the major search engines using a set of what he called obscure terms, that is, terms where no engine produces more than 100 hits. This approach not only gives a better measure of general search engine performance, but also may be more appropriate for ranking engines for chemistry searches. He found that Google was clearly the best, but that FAST also did quite well, although at that time it claimed a somewhat smaller index than Google. On the other hand, AltaVista, which claimed an index roughly the same size as that of FAST, was beaten not only by Google and FAST, but also by Northern Light, HotBot, and iWon. The last two of these are both based on the Inktomi engine. Summarizing this and several similar tests, Sullivan concludes, "The main disappointment is AltaVista, which doesn't seem to rank where one would expect it to be."

Greg R. Notess, another well-known search engine guru, has also reported that Google has a very large index. His study, from October 19, 2000, surveyed the five largest search engines using 25 search terms and verified the actual number of hits. He found that FAST was best, with Google a close second. The other main engines followed in the order Northern Light, iWon, and AltaVista. The two studies agree, at least, that FAST and Google have the largest indexes, having moved past some of the previous leaders in this category.

Of course, a large index is not the only factor that is important when ranking search engines. In addition to one of the largest indexes, Google uses a search technology called PageRank™, which is unusually successful at ensuring that the most useful results appear first on the list returned by the engine. This method estimates the relevance of web sites by analyzing the link structure of the Internet itself. Like many other engines, Google evaluates the likely usefulness of a given page by measuring how many pages link to it, but Google goes a step further by recognizing that not all links are equally important. Google identifies certain pages as being frequently linked to and gives links from these pages more weight. This method doesn't just count the number of links; it factors in the apparent prestige of the pages that are doing the linking.
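
The idea that a link from a heavily linked page counts for more than a link from an obscure one can be illustrated with a short program. What follows is only a minimal sketch of the textbook iterative form of a PageRank-style calculation, not Google's actual implementation. The function name, the damping value of 0.85, the number of iterations, and the tiny example web are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a rank for every page in a small link graph.

    `links` maps each page name to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its own rank among the pages it links to,
                # so a link from a highly ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web: pages "a" and "b" each have exactly one incoming link,
# but the link to "a" comes from "hub", which three other pages link to.
example_links = {
    "hub": ["a"],
    "p1": ["hub"],
    "p2": ["hub"],
    "p3": ["hub"],
    "obscure": ["b"],
}

ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this made-up example, a simple count of incoming links would treat "a" and "b" as equals, while the weighted calculation places "a" above "b" because its single link comes from a page that is itself frequently linked to.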

One of the frustrations of searching is that a query for acid rain will, on most search engines, return pages that contain only the term acid or only the term rain. Google limits the results to pages that contain all of the search terms, either in the page text or in the link anchors pointing to the page. It also gives greater weight to pages where all of the search terms are relatively close together, which increases the chance that the pages reported will be relevant to the original query and reduces the number of irrelevant sites returned. Google has one further feature that can be very important: it caches many web pages to provide a backup if the original page is not available. By no means does this completely eliminate the dreaded "404 Not Found" error message, but it may prove useful in some cases. (For more information about PageRank™, see the paper entitled "The Anatomy of a Large-Scale Hypertextual Web Search Engine.")
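
The two behaviors described for the acid rain query, requiring every search term and favoring pages where the terms sit close together, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of how Google actually scores pages: it looks only at page text and ignores link anchors, it orders pages purely by the shortest run of words that contains every term, and the function names and sample pages are invented for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_span(tokens, terms):
    """Return the length of the shortest run of words containing every term.

    Returns None if any term is missing from the page.
    """
    needed = set(terms)
    if not needed.issubset(tokens):
        return None
    best = None
    last_seen = {}  # most recent position of each search term
    for position, token in enumerate(tokens):
        if token in needed:
            last_seen[token] = position
            if len(last_seen) == len(needed):
                # Shortest window ending here that still holds every term.
                span = position - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def search(pages, query):
    """Return names of pages containing all query terms, closest terms first."""
    terms = tokenize(query)
    hits = []
    for name, text in pages.items():
        span = min_span(tokenize(text), terms)
        if span is not None:
            hits.append((span, name))
    hits.sort()
    return [name for span, name in hits]


# Hypothetical pages for illustration only.
example_pages = {
    "acid_rain": "Acid rain damages lakes and forests.",
    "far_apart": "Sulfuric acid is a strong acid. Separately, the forecast calls for heavy rain.",
    "acid_only": "Hydrochloric acid is a strong acid.",
    "rain_only": "Rain is expected over the weekend.",
}

print(search(example_pages, "acid rain"))
```

With these sample pages, the two pages that mention only acid or only rain are dropped, and the page where the two words are adjacent is listed ahead of the page where they are several words apart.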

Google is an extremely powerful search engine that has an unusually good ranking method for determining the order in which the sites are listed. Until a specific comparison is made next semester, it is not justified to say that Google should be the major search engine for chemists, but any serious searcher should consider including this engine in his or her suite of search methods.

