Posted to user@nutch.apache.org by Sergey A Volkov <se...@gmail.com> on 2011/11/16 01:51:20 UTC

Nutch project and my Ph.D. thesis.

Hi!

I am a postgraduate student at Saint Petersburg State University. I have
been working with Nutch for about 3 years and wrote my graduate work
based on it, and now I don't know what to do for my Ph.D. work. (Nobody in
my department (System Programming) deals with web crawling.)

I hope someone knows of problems in web crawling whose solutions could help
both the Nutch project and my future Ph.D. thesis. Any ideas?

Thanks,
  Sergey.

Re: Nutch project and my Ph.D. thesis.

Posted by Sergey A Volkov <se...@gmail.com>.

Thank you for your reply!

It looks like I should read this book first.
I'll come back with my thoughts after that =)

Sergey.


Re: Nutch project and my Ph.D. thesis.

Posted by Markus Jelsma <ma...@openindex.io>.

Hi Sergey,

The most difficult or interesting problems we've encountered are:
- dealing with dynamically generated URLs such as calendars, also known as
spider traps;
- detecting duplicates across sub-domains: many sites answer on www, ww, wwww
or anything else, and you have to deal with it;
- normalizing URLs, which is highly important as it already prevents a lot of
duplicates;
- various kinds of link analysis;
- detecting spam (link spam, content spam, various techniques);
- general crawler ethics;
- dynamic politeness: large sites can be crawled more intensively than small
sites;
- deep and/or shallow crawling: is coverage or freshness more important?
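
As a rough illustration of the URL normalization and www/ww/wwww points
above, here is a minimal Python sketch. It is not Nutch's own normalizer;
the function name `normalize_url`, the www-variant pattern and the list of
ignored query parameters are all assumptions made for the example, and a
real crawler would need per-site rules.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed rule: "ww.", "www." and "wwww." prefixes all name the same host.
WWW_VARIANT = re.compile(r"^w{2,4}\.")
# Hypothetical list of query parameters that do not change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of url so that trivial variants compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Lower-case the host and strip the www-like prefix.
    host = WWW_VARIANT.sub("", (parts.hostname or "").lower())
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Collapse repeated slashes and make sure the path is never empty.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Drop ignored parameters and sort the rest so their order is irrelevant.
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ))
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))
```

Two URLs that normalize to the same string can then be treated as one
entry in the crawl database. Stripping the www prefix blindly is only safe
when the variants really serve the same content, so in practice that rule
would have to be checked per site.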

For me, Bing Liu's excellent book on Web data mining [1] gives a lot of
insights. The best thing is that the author provides an extensive list of
references to highly interesting papers that you can then find online.

In my opinion this book is mandatory reading for anyone who is serious about
web crawling.

[1]: http://www.cs.uic.edu/~liub/WebMiningBook.html

Good luck!
Markus


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350