Posted to dev@nutch.apache.org by Jiaxin Ye <ji...@usc.edu> on 2015/02/16 08:34:31 UTC

Re:

Hi Swati,

I am also a student in Prof. Mattmann's class. I think politeness mostly
depends on the delay between successive requests to the same server. A
robots.txt file usually sets Crawl-Delay to 5 to 15 seconds. It's true that
setting fetcher.threads.per.queue to a value greater than 1 causes the
Crawl-Delay value from robots.txt to be ignored. In that case
fetcher.server.min.delay is used instead of fetcher.server.delay, so you can
set fetcher.server.min.delay to 5 to 15 seconds to space out successive
requests again.
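
For reference, here is a minimal sketch of how these settings could look in
conf/nutch-site.xml. The values are illustrative assumptions, not
recommendations. As far as I recall from the property descriptions in
nutch-default.xml, fetcher.server.delay only applies while
fetcher.threads.per.queue is 1. With more than one thread per queue,
fetcher.server.min.delay is the delay used between successive requests to the
same server, so please check the descriptions in your own nutch-default.xml.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- More than one thread per host queue: Crawl-Delay from robots.txt
       is no longer honored (illustrative value). -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>

  <!-- Delay in seconds between successive requests to the same server
       when fetcher.threads.per.queue is greater than 1 (illustrative). -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>

  <!-- Delay in seconds used when fetcher.threads.per.queue is 1; it can be
       overridden by a Crawl-Delay in robots.txt (illustrative). -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

With two threads and a 5 second minimum delay per request, one host would
still see more traffic than with a single thread at the same delay, so the
delay is the main knob for staying reasonably polite.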

I also think we should change the content of suffix_urlfilter, since our
task is to collect as much data as we can from the three websites.

Jiaxin

On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <sw...@usc.edu> wrote:

> Hi,
> We are working on a project under Professor Chris Mattmann as part of an
> Information Retrieval course.
> We are trying to edit different properties to change politeness and do URL
> filtering.
>
> We are trying more than 1 thread, which makes the crawl impolite, but we
> are not sure how impolite it should be made for better results.
> Also, URL filtering blocks almost all image, audio, and video formats in
> suffix_urlfilter.xml. Should that be tampered with or not?
>

Re:

Posted by Majisha Parambath <pa...@usc.edu>.
Hey Jiaxin,

My understanding is that the suffix_urlfilter does not come into the
picture unless it is listed in the plugin.includes property of the Nutch
configuration. By default only the regex_urlfilter is enabled in Nutch, so
we need to set the file types to skip or not skip in regex_urlfilter.txt.
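
To illustrate, a sketch of a plugin.includes override in conf/nutch-site.xml
that enables the suffix filter next to the regex filter. The plugin IDs
urlfilter-regex and urlfilter-suffix are the names I believe Nutch 1.x uses.
The rest of the value is an assumption modeled on a typical 1.x default and
should be copied from the nutch-default.xml of your own installation rather
than from here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed baseline value; only the urlfilter-(regex|suffix) part is
       the change being illustrated. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

If the suffix filter is left out of plugin.includes, edits to its
configuration file should have no effect on the crawl, and the suffix rules
in the regex filter file are the ones that decide which URLs are skipped.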

Please correct me if my understanding is wrong.

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

Re:

Posted by Jiaxin Ye <ji...@usc.edu>.
Indeed it's doubtful, but I don't think there is an exact value for
politeness. Interestingly, Nutch is described as "aggressively polite" here:
http://opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/ . So
maybe Nutch is polite anyway in the end. :D

On Mon, Feb 16, 2015 at 12:52 AM, Swati Kothari <sw...@usc.edu> wrote:

> Thanks Jiaxin. We are already trying to vary the parameters as you said,
> but what values would be appropriate for the properties that we are
> changing is still doubtful.

Re:

Posted by Swati Kothari <sw...@usc.edu>.
Thanks Jiaxin. We are already trying to vary the parameters as you said,
but what values would be appropriate for the properties that we are
changing is still doubtful.
