{"id":60,"date":"2004-11-29T07:00:00","date_gmt":"2004-11-29T07:00:00","guid":{"rendered":"https:\/\/chato.cl\/science\/crawling_thesis\/"},"modified":"2024-09-15T13:29:07","modified_gmt":"2024-09-15T13:29:07","slug":"crawling_thesis","status":"publish","type":"page","link":"https:\/\/chato.cl\/science\/crawling_thesis\/","title":{"rendered":"Web Crawling (2004) &#8211; PhD Thesis"},"content":{"rendered":"\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<h2>About this thesis<\/h2>\n<ul>\n<li>Title: Effective Web Crawling<\/li>\n<li>Author: Carlos Castillo<\/li>\n<li>Advisor: Ricardo Baeza-Yates<\/li>\n<li>Committee: <br \/>\n<ul>\n<li><a href=\"http:\/\/www.dcc.uchile.cl\/~mmarin\/\">Mauricio Mar\u00edn<\/a> (U. Magallanes &#8211; Chile)<\/li>\n<li><a href=\"http:\/\/www.cs.mu.oz.au\/~alistair\/\">Alistair Moffat<\/a> (U. of Melbourne &#8211; Australia)<\/li>\n<li><a href=\"http:\/\/www.dcc.uchile.cl\/%7Egnavarro\/\">Gonzalo Navarro<\/a> (U. Chile)<\/li>\n<li><a href=\"http:\/\/www.dcc.ufmg.br\/~nivio\/\">Nivio Ziviani<\/a> (U. Fed. Minas Gerais &#8211; Brazil)<\/li>\n<\/ul>\n<\/li>\n<li>Date: November 29, 2004<\/li>\n<\/ul>\n<h2>Download the full thesis<\/h2>\n<p class=\"download\">Ph.D. Thesis: <a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/effective_web_crawling.pdf\">EFFECTIVE WEB CRAWLING<\/a> by Carlos Castillo [4 MB, 180 pages]<\/p>\n<h2>Download the thesis by chapters<\/h2>\n<p>Thesis divided by chapters. 
Main chapters are marked in bold.<\/p>\n<ul>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/frontmatter.pdf\">Front matter<\/a>: cover, indexes, etc.<\/li>\n<\/ul>\n<ul>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/introduction.pdf\">Introduction<\/a><\/li>\n<li><a style=\"font-weight: bold;\" href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/relatedwork.pdf\">Related work, bibliographic review of Web crawling<\/a>: state of the art in Web crawling, survey.<\/li>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/newmodel.pdf\">A new crawling model and architecture<\/a>: framework and classification of Web crawlers.<\/li>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/scheduling.pdf\"><strong>Scheduling algorithms for effective Web crawling<\/strong>:<\/a> long-term and short-term scheduling.<\/li>\n<li><a style=\"font-weight: bold;\" href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/infiniweb.pdf\">Crawling the infinite Web, when to stop crawling<\/a>: dynamic Web sites can be unbounded.<\/li>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/cooperation.pdf\">Cooperation schemes for Web servers<\/a>: to improve their representation in the search engine.<\/li>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/implementation.pdf\">Crawler implementation<\/a>: algorithms and data structures.<\/li>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/charac.pdf\">Web characterization<\/a>: study of the Chilean Web.<\/li>\n<li><a href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/conclusions.pdf\">Conclusions<\/a><\/li>\n<\/ul>\n<ul>\n<li><a style=\"font-weight: bold;\" href=\"http:\/\/www.chato.cl\/papers\/crawling_thesis\/practical.pdf\">Appendix: Practical Web crawling problems, Web crawling in practice<\/a>: practical issues and caveats of Web crawling.<\/li>\n<\/ul>\n<p><span style=\"font-size: medium;\">Abstract was published by ACM SIGIR Forum: 
&#8220;<a href=\"http:\/\/sigir.org\/files\/forum\/2005J\/castillo_sigirforum_2005j.pdf\">Effective Web Crawling (Doctoral Abstract)<\/a>&#8221;. ACM SIGIR Forum, Vol. 39, No. 1, pp. 55-56. June 2005. [<a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1067287\">acm<\/a>]<\/span><\/p>\n<p><span style=\"font-size: medium;\">This thesis is part of the <a href=\"http:\/\/www.cwr.cl\/projects\/WIRE\/\">WIRE project<\/a> for developing an open-source Web Information Retrieval Environment.<\/span><\/p>\n<h2>Acknowledgments<\/h2>\n<blockquote>\n<p>This thesis would not have been possible without enough O.P.M.: during the thesis I received mostly the financial support of grant P01-029F of the Millennium Scientific Initiative, Mideplan, Chile. I also received financial support from the Faculty of Engineering and the Computer Science Department of the University of Chile, among other sources.<\/p>\n<p>What you are is a consequence of whom you interact with, but just saying &#8220;thanks everyone for everything&#8221; would be wasting this opportunity. I have been very lucky to interact with really great people, even if sometimes I am prepared to understand just a small fraction of what they have to teach me. I am sincerely grateful for the support given by my advisor Ricardo Baeza-Yates during this thesis. The comments received from the committee members Gonzalo Navarro, Alistair Moffat, Nivio Ziviani and Mauricio Mar\u00edn during the review process were also very helpful and detailed. For writing the thesis, I also received data, comments and advice from Efthimis Efthimiadis, Marina Buzzi, Patrizia Andr\u00f3nico, Massimo Santini, Andrea Rodriguez and Luc Devroye. I also thank Susana Docmac and everybody at Newtenberg.<\/p>\n<p>This thesis is just a step on a very long road. 
I want to thank the professors I met during graduate studies: Vicente L\u00f3pez, Claudio Gutierrez and Jos\u00e9 Pino; also, I was lucky to have really inspiring professors during the undergraduate studies: Martin Matamala, Marcos Kiwi, Patricio Poblete, Patricio Felmer and Jos\u00e9 Flores. There were some teachers in high and grade school that trusted in me and helped me get the most out of what I was given. During high school: Domingo Almendras, Belfor Aguayo, and in grade school: Manuel Gui\u00f1ez, Ivonne Saintard and especially Carmen Tapia.<\/p>\n<p>I would say at the end that I owe everything to my parents, but that would imply that they also owe everything to their parents and so on, creating an infinite recursion that is outside the context of this work. Therefore, I thank Myriam and Juan Carlos for being with me even from before the beginning, and sometimes giving everything they have and more. I am also thankful for the support of all the members of my family, especially Mercedes Pincheira.<\/p>\n<p>Finally, my beloved wife Fabiola was exactly 10,000 days old on the day I gave my dissertation, and I need no calculation to say that she has given me the best part of those days \u2013 thank you.<\/p>\n<\/blockquote>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>About this thesis Title: Effective Web Crawling Author: Carlos Castillo Advisor: Ricardo Baeza-Yates Committee: Mauricio Mar\u00edn (U. Magallanes &#8211; Chile) Alistair Moffat (U. of Melbourne &#8211; Australia) Gonzalo Navarro (U. Chile) Nivio Ziviani (U. Fed. Minas Gerais &#8211; Brazil) Date: November 29, 2004 Download the full thesis Ph.D. 
Thesis: EFFECTIVE WEB CRAWLING by Carlos Castillo [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"nf_dc_page":"","footnotes":""},"class_list":["post-60","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/pages\/60","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/comments?post=60"}],"version-history":[{"count":3,"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/pages\/60\/revisions"}],"predecessor-version":[{"id":139,"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/pages\/60\/revisions\/139"}],"wp:attachment":[{"href":"https:\/\/chato.cl\/science\/wp-json\/wp\/v2\/media?parent=60"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}