Crawler Using Inverted WAH Bitmap Index and Searching User Defined Document Fields

International Journal of P2P Network Trends and Technology (IJPTT)          
 
© 2012 by IJPTT Journal
Volume-2 Issue-3                           
Year of Publication : 2012
Authors : Mr. Sanjay Kumar Singh, Prof. Sonu Agrawal

Citation

Mr. Sanjay Kumar Singh, Prof. Sonu Agrawal. "Crawler Using Inverted WAH Bitmap Index and Searching User Defined Document Fields". International Journal of P2P Network Trends and Technology (IJPTT), V2(3):1-4, May - Jun 2012, ISSN: 2249-2615, www.ijpttjournal.org. Published by Seventh Sense Research Group.

Abstract

Crawler is a web crawler that searches for and retrieves pages from the World Wide Web that relate to a specific topic. It relies on specific algorithms to select web pages relevant to a predefined set of topics. Its main features are a user interest specification module, which mediates between users and search engines to identify the target examples and keywords that together specify the topic of interest, and a URL ordering strategy that combines features of several previous approaches and achieves significant improvement. It also provides a graphical user interface with which users can evaluate and visualize the crawling results, and these results can serve as feedback to reconfigure the crawler. Such a web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. The crawler retrieves the web pages at the URLs in its queue, parses the HTML files, and adds newly found URLs to the queue. The user then provides feedback, which helps to induce the baseline classifier progressively using active learning techniques. Once the classifier is in place, the crawler can start its task of resource discovery.
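
The fetch, parse, and enqueue loop with URL ordering described above can be illustrated with a minimal best-first sketch. This is not the authors' implementation: the abstract does not specify the ordering strategy or the classifier, so the keyword-overlap score `relevance`, the rule that a link inherits its parent page's score, and the `fetch` callback (here a lookup in a small in-memory dictionary instead of real HTTP requests) are all assumptions made for illustration.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def relevance(text, keywords):
    """Hypothetical score: fraction of topic keywords found in the text."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def crawl(seeds, fetch, keywords, max_pages=10):
    """Best-first crawl; each link is queued with its parent page's score."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)  # assumed to return None when a page is unavailable
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        score = relevance(" ".join(parser.text), keywords)
        visited.append((url, score))
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Illustrative in-memory "web"; the URLs and page contents are made up.
PAGES = {
    "http://example.test/": '<a href="/a">a</a> <a href="/b">b</a> crawler',
    "http://example.test/a": 'crawler index <a href="/c">c</a>',
    "http://example.test/b": 'cooking recipes <a href="/d">d</a>',
    "http://example.test/c": "bitmap index",
    "http://example.test/d": "more recipes",
}

result = crawl(["http://example.test/"], PAGES.get, ["crawler", "index"])
for url, score in result:
    print(url, score)
```

In this toy run the page `/c`, which is linked from the highly relevant page `/a`, is visited before the off-topic page `/b`, even though `/b` was discovered earlier. A real crawler would also need politeness delays, robots.txt handling, and error recovery, which the sketch omits.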

Keywords

Crawler, keyword extraction, classifier, URL, WAH Bitmap