Status Locality on the Web: Implications for Building Focused Collections


Article ID:	iaor20135206
Volume:	24
Issue:	3
Start Page Number:	802
End Page Number:	821
Publication Date:	Sep 2013
Journal:	Information Systems Research
Authors:	Pant Gautam, Srinivasan Padmini
Keywords:	internet, information theory, social, networks

Abstract:

Topical locality on the Web is the notion that pages tend to link to other topically similar pages and that such similarity decays rapidly with link distance. This supports meaningful Web browsing and searching by information consumers. It also allows topical Web crawlers, programs that fetch pages by following hyperlinks, to harvest topical subsets of the Web for applications such as those in vertical search and business intelligence. We show that the Web exhibits another property that we call ‘status locality.’ It is based on the notion that pages tend to link to other pages of similar status (importance) and that this status similarity also decays rapidly with link distance. Analogous to topical locality, status locality may also be exploited by Web crawlers. Collections built by such crawlers include pages that are both topically relevant and also important. This capability is crucial because of the large numbers of Web pages addressing even niche topics. The challenge in exploiting status locality while crawling is that page importance (or status) is typically recognized through global measures computed by processing link data from billions of pages. In contrast, topical Web crawlers depend on local information based on previously downloaded pages. We solve this problem by using methods developed previously that utilize local characteristics of pages to estimate their global status. This leads to the design of new crawlers, specifically of utility‐biased crawlers guided by a Cobb‐Douglas utility function. Our crawler experiments show that status and topicality of Web collections present a trade‐off. An adaptive version of our utility‐biased crawler dynamically modifies output elasticities of topicality and status to create Web collections that maintain high average topicality. This can be done while simultaneously achieving significantly higher average status as compared to several benchmarks including a state‐of‐the‐art topical crawler.
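
The abstract does not give the exact form of the utility function, so the following is only a minimal sketch of how a crawl frontier might be ordered by a generic Cobb-Douglas utility, u = topicality^alpha * status^beta. The names `cobb_douglas` and `Frontier`, the score range (0, 1], the example URLs, and the elasticity values are all assumptions made for illustration; they are not taken from the article, and the adaptive adjustment of the elasticities is not shown.

```python
import heapq


def cobb_douglas(topicality, status, alpha=0.5, beta=0.5):
    """Generic Cobb-Douglas utility: topicality**alpha * status**beta.

    alpha and beta play the role of output elasticities (assumed values).
    Both scores are assumed to lie in (0, 1].
    """
    return (topicality ** alpha) * (status ** beta)


class Frontier:
    """Hypothetical crawl frontier that pops the highest-utility URL first."""

    def __init__(self, alpha=0.5, beta=0.5):
        self.alpha = alpha
        self.beta = beta
        self._heap = []
        self._counter = 0  # tie-breaker so equal utilities pop in insertion order

    def push(self, url, topicality, status):
        utility = cobb_douglas(topicality, status, self.alpha, self.beta)
        # heapq is a min-heap, so the utility is negated to pop the largest first.
        heapq.heappush(self._heap, (-utility, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Illustrative scores only; a real crawler would estimate both from local page features.
frontier = Frontier(alpha=0.7, beta=0.3)
frontier.push("http://example.org/a", topicality=0.9, status=0.1)
frontier.push("http://example.org/b", topicality=0.6, status=0.6)
frontier.push("http://example.org/c", topicality=0.2, status=0.9)

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

Because the utility is multiplicative, a page that scores very low on either dimension is pushed down the queue even when it scores well on the other, which is one way to picture the trade-off between topicality and status that the abstract reports.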
