Posted to user@nutch.apache.org by Vladimir Loubenski <vl...@opentext.com> on 2016/11/24 20:10:02 UTC

Nutch 2.3.1 re-crawls unchanged web pages

Hi,
I am using Nutch 2.3.1.
I run the generate, fetch, parse and updateDB steps in a loop.
I noticed that during a re-crawl, even if a web page has not changed, Nutch does not detect this from the ETag, Last-Modified or signature fields, and it continues to run all of these steps for the unchanged page.
Is this expected behaviour?
Are there plans to fix it in future releases?

Regards,
Vladimir.

-----Original Message-----
From: Jim Lamb [mailto:jlamb@mail.com] 
Sent: November-22-16 6:22 AM
To: user@nutch.apache.org
Subject: Re: Automating Nutch 2.3.1 on Amazon EMR


Further to this, I have found that I can only submit a maximum of 256 steps to EMR. Some of our crawls take over 100 rounds, so defining an arbitrary number of (generate,fetch,parse,updatedb,index,solrdedup) rounds each with 6 steps isn't going to work either :-(

Has nobody automated this?

Thanks,

Jim
 

Sent: Thursday, November 17, 2016 at 11:30 AM
From: "Jim Lamb" <jl...@mail.com>
To: user@nutch.apache.org
Subject: Re: Automating Nutch 2.3.1 on Amazon EMR

Hi Sebastian,

Thanks for coming back to me.

> Adding
> set -x
> to bin/nutch and then running bin/crawl with a sample crawl which 
> includes all steps should log all commands with a full list of arguments.

Yes, that's a great idea. Thanks.

> But on EMR it should be possible to directly reference the Nutch job 
> file by a s3:// URL. (but haven't tried it this way)

Yes, that is possible. You add an S3 URL to the Jar= argument in your step definition of the create-cluster command.

> aws emr terminate-cluster ...

Ah, yes. I did wonder if the master instance had appropriate instance role privilege to do this. I'll try.

Unfortunately, it still doesn't solve the iteration issue. Short of defining many many repeated sets of steps, I don't see how I would get multiple rounds. What am I missing?

Thanks,

Jim

Re: Nutch 2.3.1 re-crawls unchanged web pages

Posted by Tom Chiverton <tc...@extravision.com>.

db.default.fetch.interval

db.fetch.schedule.adaptive.*

Tom
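
As a rough illustration of the settings named above, a conf/nutch-site.xml override might look like the fragment below. This is a sketch, not a tested configuration. The property names are written as I recall them from conf/nutch-default.xml, where the default interval appears as db.fetch.interval.default rather than db.default.fetch.interval. The values are illustrative only. Both names and values are assumptions to be checked against the nutch-default.xml shipped with your release.

```xml
<!-- conf/nutch-site.xml: overrides for conf/nutch-default.xml. -->
<!-- Names and values are assumptions; verify them against your release. -->
<configuration>

  <!-- Default re-fetch interval in seconds (2592000 s = 30 days). -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
  </property>

  <!-- Use the adaptive schedule instead of the fixed default one. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

  <!-- Factor by which the interval grows when a page is unchanged. -->
  <property>
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.4</value>
  </property>

  <!-- Factor by which the interval shrinks when a page has changed. -->
  <property>
    <name>db.fetch.schedule.adaptive.dec_rate</name>
    <value>0.2</value>
  </property>

  <!-- Lower and upper bounds for the adapted interval, in seconds. -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>60.0</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>31536000.0</value>
  </property>

</configuration>
```

Note that these settings only change how often a page is scheduled for re-fetching. With the adaptive schedule, an unchanged page is still re-fetched, only at a progressively longer interval.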


On 25/11/16 13:43, Vladimir Loubenski wrote:
> Thank you Tom,
> Which config XML variables control this?
>
> Thank you in advance,
> Vladimir.
>
>
> -----Original Message-----
> From: Tom Chiverton [mailto:tc@extravision.com]
> Sent: November-25-16 2:31 AM
> To: user@nutch.apache.org
> Subject: Re: Nutch 2.3.1 re-crawls unchanged web pages
>
> I understand it's expected. Especially if the page is in the list of seeds.
>
> You can control this by changing the relevant config XML variables.
>
> On 24 November 2016 20:10:02 GMT+00:00, Vladimir Loubenski <vl...@opentext.com> wrote:
>> Hi ,
>> I am using Nutch 2.3.1.
>> I run in loop generate, fetch, parse, updateDB steps.
>> I noted that during re-crawl even if a  web page doesn't change nutch
>> doesn't detect it  by value of  ETag, Last-Modified or signature fields
>> and continue process all these steps for unchanged web pages.
>> Is it expected behaviour?
>> Are there plans to fix it in future releases?
>>
>> Regards,
>> Vladimir.
>>
> ______________________________________________________________________
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
> ______________________________________________________________________
>

RE: Nutch 2.3.1 re-crawls unchanged web pages

Posted by Vladimir Loubenski <vl...@opentext.com>.

Thank you Tom,
Which config XML variables control this?

Thank you in advance,
Vladimir.


-----Original Message-----
From: Tom Chiverton [mailto:tc@extravision.com] 
Sent: November-25-16 2:31 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.3.1 re-crawls unchanged web pages

I understand it's expected. Especially if the page is in the list of seeds. 

You can control this by changing the relevant config XML variables. 

On 24 November 2016 20:10:02 GMT+00:00, Vladimir Loubenski <vl...@opentext.com> wrote:
>Hi ,
>I am using Nutch 2.3.1.
>I run in loop generate, fetch, parse, updateDB steps. 
>I noted that during re-crawl even if a  web page doesn't change nutch 
>doesn't detect it  by value of  ETag, Last-Modified or signature fields 
>and continue process all these steps for unchanged web pages.
> Is it expected behaviour?
>Are there plans to fix it in future releases?  
>
>Regards,
>Vladimir.
>

Re: Nutch 2.3.1 re-crawls unchanged web pages

Posted by Tom Chiverton <tc...@extravision.com>.

I understand it's expected. Especially if the page is in the list of seeds. 

You can control this by changing the relevant config XML variables. 

On 24 November 2016 20:10:02 GMT+00:00, Vladimir Loubenski <vl...@opentext.com> wrote:
>Hi ,
>I am using Nutch 2.3.1.
>I run in loop generate, fetch, parse, updateDB steps. 
>I noted that during re-crawl even if a  web page doesn't change nutch
>doesn't detect it  by value of  ETag, Last-Modified or signature fields
>and continue process all these steps for unchanged web pages.
> Is it expected behaviour?
>Are there plans to fix it in future releases?  
>
>Regards,
>Vladimir.
>
>-----Original Message-----
>From: Jim Lamb [mailto:jlamb@mail.com] 
>Sent: November-22-16 6:22 AM
>To: user@nutch.apache.org
>Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
>
>
>Further to this, I have found that I can only submit a maximum of 256
>steps to EMR. Some of our crawls take over 100 rounds, so defining an
>arbitrary number of (generate,fetch,parse,updatedb,index,solrdedup)
>rounds each with 6 steps isn't going to work either :-(
>
>Has nobody automated this?
>
>Thanks,
>
>Jim
>
>
>Sent: Thursday, November 17, 2016 at 11:30 AM
>From: "Jim Lamb" <jl...@mail.com>
>To: user@nutch.apache.org
>Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
>
>Hi Sebastian,
>
>Thanks for coming back to me.
>
>> Adding
>> set -x
>> to bin/nutch and then running bin/crawl with a sample crawl which 
>> includes all steps should log all commands with a full list of arguments.
>
>Yes, that's a great idea. Thanks.
>
>> But on EMR it should be possible to directly reference the Nutch job 
>> file by a s3:// URL. (but haven't tried it this way)
>
>Yes, that is possible. You add an S3 URL to the Jar= argument in your
>step definition of the create-cluster command.
>
>> aws emr terminate-cluster ...
>
>Ah, yes. I did wonder if the master instance had appropriate instance
>role privilege to do this. I'll try.
>
>Unfortunately, it still doesn't solve the iteration issue. Short of
>defining many many repeated sets of steps, I don't see how I would get
>multiple rounds. What am I missing?
>
>Thanks,
>
>Jim
>

-- 
Tom Chiverton 
Sent from my phone. Please excuse my brevity.