Posted to user@nutch.apache.org by Sergey A Volkov <se...@gmail.com> on 2011/11/16 01:51:20 UTC
Nutch project and my Ph.D. thesis.
Hi!
I am a postgraduate student at Saint Petersburg State University. I have
worked with Nutch for about 3 years and based my graduate work on it, but
now I don't know what to do for my Ph.D. work. (Nobody in my department,
System Programming, deals with web crawling.)
I hope someone knows of open problems in web crawling whose solutions could
help both the Nutch project and my future Ph.D. thesis. Any ideas?
Thanks,
Sergey.
Re: Nutch project and my Ph.D. thesis.
Posted by Sergey A Volkov <se...@gmail.com>.
Thank you for your reply!
It looks like I should read this book first.
I'll come back with my thoughts after that =)
Sergey.
On Wed 16 Nov 2011 04:11:54 PM MSK, Markus Jelsma wrote:
> Hi Sergey,
>
> The most profound problems or interesting topics we've encountered are:
> - dealing with dynamic URLs such as calendars, also known as spider traps;
> - detecting duplicates based on sub-domain: many sites answer on www, ww, wwww
> or anything else, and you have to deal with it;
> - normalizing URLs, which is highly important because it already prevents a
> lot of duplicates;
> - various kinds of link analysis;
> - detecting spam (link spam, content spam, various techniques);
> - general crawler ethics;
> - dynamic politeness: large sites can be crawled more intensively than small
> sites;
> - deep and/or shallow crawling: is coverage or freshness more important?
>
> For me, Bing Liu's excellent book on Web data mining [1] gives a lot of
> insights. The best thing is that the author provides a generous list of
> references to highly interesting papers that you can then find online.
>
> In my opinion this book is mandatory for anyone serious about web crawling.
>
> [1]: http://www.cs.uic.edu/~liub/WebMiningBook.html
>
> Good luck!
> Markus
Re: Nutch project and my Ph.D. thesis.
Posted by Markus Jelsma <ma...@openindex.io>.
Hi Sergey,
The most profound problems or interesting topics we've encountered are:
- dealing with dynamic URLs such as calendars, also known as spider traps;
- detecting duplicates based on sub-domain: many sites answer on www, ww, wwww
or anything else, and you have to deal with it;
- normalizing URLs, which is highly important because it already prevents a
lot of duplicates;
- various kinds of link analysis;
- detecting spam (link spam, content spam, various techniques);
- general crawler ethics;
- dynamic politeness: large sites can be crawled more intensively than small
sites;
- deep and/or shallow crawling: is coverage or freshness more important?
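(Editorial illustration, not part of the original message.) The URL normalization and sub-domain duplicate items in the list above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not Nutch's own normalizer: the function name `normalize_url` and the rule that "ww", "www", "wwww" prefixes are aliases of the bare host are hypothetical choices made for illustration. A real crawler would have to verify that such hosts actually serve the same content.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for illustration: a leading "ww.", "www.", "wwww." ... label is
# an alias of the bare host. Real sites may not behave this way.
_WWW_VARIANT = re.compile(r"^w{2,}\.")

# Default ports that can be dropped without changing the resource.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` suitable for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # userinfo is dropped on purpose

    # Strip the www-like prefix only if a dotted host name remains,
    # so "www.com" is left alone. IPv6 literals are not handled here.
    stripped = _WWW_VARIANT.sub("", host)
    if "." in stripped:
        host = stripped

    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    if port is None or port == _DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"
    # Sort query parameters so that "?b=2&a=1" and "?a=1&b=2" compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is discarded: it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))


# Usage: three spellings of one page collapse into a single key.
seen = {
    normalize_url(u)
    for u in (
        "HTTP://WWW.Example.com:80/a?b=2&a=1#top",
        "http://ww.example.com/a?a=1&b=2",
        "http://wwww.example.com/a?a=1&b=2",
    )
}
```

Sorting query parameters and stripping www-like prefixes are both lossy heuristics: they reduce duplicates but can merge pages that are in fact distinct, so a crawler would normally combine them with content-based duplicate detection.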
For me, Bing Liu's excellent book on Web data mining [1] gives a lot of
insights. The best thing is that the author provides a generous list of
references to highly interesting papers that you can then find online.
In my opinion this book is mandatory for anyone serious about web crawling.
[1]: http://www.cs.uic.edu/~liub/WebMiningBook.html
Good luck!
Markus
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350