About this thesis
- Title: Effective Web Crawling
- Author: Carlos Castillo
- Advisor: Ricardo Baeza-Yates
- Committee:
- Mauricio Marín (U. Magallanes – Chile)
- Alistair Moffat (U. of Melbourne – Australia)
- Gonzalo Navarro (U. Chile)
- Nivio Ziviani (U. Fed. Minas Gerais – Brazil)
- Date: November 29, 2004
Download the full thesis
Ph.D. Thesis: EFFECTIVE WEB CRAWLING by Carlos Castillo [4 MB, 180 pages]
Download the thesis by chapters
Thesis divided by chapters. Main chapters are marked in bold.
- Front matter: cover, indexes, etc.
- Introduction
- Related work, bibliographic review of Web crawling: state of the art in Web crawling, survey.
- A new crawling model and architecture: framework and classification of Web crawlers.
- Scheduling algorithms for effective Web crawling: long-term and short-term scheduling.
- Crawling the infinite Web, when to stop crawling: dynamic Web sites can be unbounded.
- Cooperation schemes for Web servers: to improve their representation in the search engine.
- Crawler implementation: algorithms and data structures.
- Web characterization: study of the Chilean Web.
- Conclusions
- Appendix: Practical Web crawling problems, Web crawling in practice: practical issues and caveats of Web crawling.
The abstract was published in ACM SIGIR Forum: “Effective Web Crawling (Doctoral Abstract)”. ACM SIGIR Forum, Vol. 39, No. 1, pp. 55-56, June 2005. [acm]
This thesis is part of the WIRE project for developing an open-source Web Information Retrieval Environment.
Acknowledgments
This thesis would not have been possible without enough O.P.M.: during the thesis I was supported mainly by grant P01-029F of the Millennium Scientific Initiative, Mideplan, Chile. I also received financial support from the Faculty of Engineering and the Computer Science Department of the University of Chile, among other sources.
What you are is a consequence of whom you interact with, but just saying “thanks everyone for everything” would be wasting this opportunity. I have been very lucky to interact with really great people, even if sometimes I am prepared to understand just a small fraction of what they have to teach me. I am sincerely grateful for the support given by my advisor Ricardo Baeza-Yates during this thesis. The comments received from the committee members Gonzalo Navarro, Alistair Moffat, Nivio Ziviani and Mauricio Marín during the review process were also very helpful and detailed. While writing the thesis, I also received data, comments and advice from Efthimis Efthimiadis, Marina Buzzi, Patrizia Andrónico, Massimo Santini, Andrea Rodriguez and Luc Devroye. I also thank Susana Docmac and everybody at Newtenberg.
This thesis is just a step on a very long road. I want to thank the professors I met during graduate studies: Vicente López, Claudio Gutierrez and José Pino; also, I was lucky to have really inspiring professors during my undergraduate studies: Martin Matamala, Marcos Kiwi, Patricio Poblete, Patricio Felmer and José Flores. There were some teachers in high school and grade school who trusted me and helped me get the most out of what I was given. During high school: Domingo Almendras, Belfor Aguayo, and in grade school: Manuel Guiñez, Ivonne Saintard and especially Carmen Tapia.
I would say at the end that I owe everything to my parents, but that would imply that they also owe everything to their parents and so on, creating an infinite recursion that is outside the context of this work. Therefore, I thank Myriam and Juan Carlos for being with me even from before the beginning, and sometimes giving everything they have and more. I am also thankful for the support of all the members of my family, especially Mercedes Pincheira.
Finally, my beloved wife Fabiola was exactly 10,000 days old on the day I defended my dissertation, and I need no calculation to say that she has given me the best part of those days – thank you.