Article ID: iaor20135206
Volume: 24
Issue: 3
Start Page Number: 802
End Page Number: 821
Publication Date: Sep 2013
Journal: Information Systems Research
Authors: Pant Gautam, Srinivasan Padmini
Keywords: internet, information theory, social, networks
Topical locality on the Web is the notion that pages tend to link to other topically similar pages and that this similarity decays rapidly with link distance. This property supports meaningful Web browsing and searching by information consumers. It also allows topical Web crawlers, programs that fetch pages by following hyperlinks, to harvest topical subsets of the Web for applications such as vertical search and business intelligence. We show that the Web exhibits another property that we call ‘status locality.’ It is based on the notion that pages tend to link to other pages of similar status (importance) and that this status similarity also decays rapidly with link distance. Like topical locality, status locality may be exploited by Web crawlers. Collections built by such crawlers include pages that are both topically relevant and important. This capability is crucial because large numbers of Web pages address even niche topics. The challenge in exploiting status locality while crawling is that page importance (or