Posted to user@nutch.apache.org by Günter Ladwig <la...@searchhaus.net> on 2013/11/20 12:07:38 UTC

UpdateDbJob increases fetchtime of unfetched pages

Hi all,

I’m currently using Nutch 2.2.1 and noticed what seems to be a bug in the update step. Every time I run a crawl (using a modified bin/crawl script), the fetchtime is updated even for pages that were not fetched during the current crawl.

I found the related bug report NUTCH-1457 [1] through a previous post on this list [2].

For me this means that Nutch 2.2.1 is unusable. I want to run continuous crawls in order to keep a Solr index of a website up to date. This bug basically ensures that most pages will never be fetched again, as their fetchtime is increased on each updatedb.
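
The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Nutch code: `Page`, `run_cycles`, `make_pages` and the `bump_unfetched` switch are hypothetical names, one crawl cycle per day is assumed, and "increased" is modelled as adding the fetch interval to the fetchtime of every page that was not in the fetched batch.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass
class Page:
    url: str
    fetch_time: int  # earliest time (in seconds) at which the page is due


def make_pages():
    # Four hypothetical pages, all due immediately.
    return [Page(f"http://example.org/{name}", 0) for name in "abcd"]


def run_cycles(pages, cycles, top_n, interval, bump_unfetched):
    """Simulate daily generate/fetch/update rounds; return fetch counts per URL."""
    fetched = {p.url: 0 for p in pages}
    now = 0
    for _ in range(cycles):
        # "generate": pick at most top_n due pages, oldest fetchtime first
        due = sorted((p for p in pages if p.fetch_time <= now),
                     key=lambda p: p.fetch_time)
        batch = {p.url for p in due[:top_n]}
        # "fetch" + "updatedb"
        for p in pages:
            if p.url in batch:
                fetched[p.url] += 1
                p.fetch_time = now + interval
            elif bump_unfetched:
                # Assumed model of the reported behaviour: pages that were
                # NOT fetched in this cycle are pushed into the future too.
                p.fetch_time += interval
        now += DAY
    return fetched


if __name__ == "__main__":
    for bump in (True, False):
        counts = run_cycles(make_pages(), cycles=10, top_n=1,
                            interval=2 * DAY, bump_unfetched=bump)
        print("bump_unfetched =", bump, counts)
```

In this model, with `bump_unfetched=True` the fetchtimes grow by two days per cycle while the clock advances by one, so after the first cycle no page is ever due again. With `bump_unfetched=False` every page is fetched repeatedly.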

Is there a workaround? Does this problem appear in Nutch 1.7?

Cheers,
Günter

[1] https://issues.apache.org/jira/browse/NUTCH-1457
[2] http://lucene.472066.n3.nabble.com/updatedb-in-nutch-2-0-increases-fetch-time-of-all-pages-td4008429.html

Re: UpdateDbJob increases fetchtime of unfetched pages

Posted by Günter Ladwig <la...@searchhaus.net>.
Hi,

Thanks for the info. I tried the patch, but unfortunately it does not work correctly in my case.

I applied both NUTCH-1556-v3.patch and NUTCH-1556-batchId.patch (tell me if this is not correct) and added the -batchId option to my updatedb call. Now I have the following problem:

1. Inject root URL of website, e.g. http://example.org/
2. First crawl: page is fetched and fetchtime (according to readdb) is set to two days from now, as configured.
3. Second crawl: pages that are linked from http://example.org are fetched. These also contain backlinks to http://example.org/.
4. The status of http://example.org/ is now status_unfetched with a fetchtime that is no longer two days from now, but set to now. Also the distance is no longer 0, but 2. It seems that the original entry for that page is simply overwritten (maybe because it is not in the batch?).
5. Third crawl: http://example.org is fetched again, much too early.
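
The suspicion in step 4 (an existing entry being overwritten by a freshly discovered one) can be sketched as two merge strategies. This is a hypothetical illustration, not the logic of Nutch or of the NUTCH-1556 patch: `Row`, `merge_naive` and `merge_preserving` are made-up names, and keeping the smaller distance is an assumption made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional

DAY = 24 * 60 * 60  # seconds


@dataclass(frozen=True)
class Row:
    status: str
    fetch_time: int  # seconds; 0 stands for "now" in this example
    distance: int    # link distance from the injected root URL


def merge_naive(existing: Optional[Row], discovered: Row) -> Row:
    # Overwrites whatever is stored. This matches the symptom in step 4:
    # status, fetchtime and distance of the known page are all lost.
    return discovered


def merge_preserving(existing: Optional[Row], discovered: Row) -> Row:
    # A page seen for the first time is stored as discovered.
    if existing is None:
        return discovered
    # A known page keeps its status and fetchtime; only the distance is
    # updated, and only if the new path is shorter (assumption).
    return replace(existing,
                   distance=min(existing.distance, discovered.distance))


if __name__ == "__main__":
    # http://example.org/ after the first crawl: fetched, due in two days.
    stored = Row("status_fetched", 2 * DAY, 0)
    # The same URL rediscovered through a backlink in the second crawl.
    backlink = Row("status_unfetched", 0, 2)
    print(merge_naive(stored, backlink))
    print(merge_preserving(stored, backlink))
```

With the preserving merge, the root page would keep its two-day fetchtime and distance 0 after the second crawl, so it would not be refetched in the third.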

I’ll probably go with 1.7 for now, as I don’t really have time for much more experimentation. As far as I can tell, the configuration for 1.7 is basically the same, which saves some effort.

Cheers,
Günter

On 21.11.2013, at 09:30, Talat UYARER <ta...@agmlab.com> wrote:

> Hi Günter,
> 
> You are right: the UpdatedbJob of Nutch 2.2.1 doesn't accept a batchId. However, this problem is fixed by NUTCH-1556 [1]. If you apply that patch, you will not have the same problem.
> 
> We use 2.x in production and have not found any big issues. I think it is safe to use.
> 
> Thanks
> Talat
> 
> [1] https://issues.apache.org/jira/browse/NUTCH-1556


Re: UpdateDbJob increases fetchtime of unfetched pages

Posted by Talat UYARER <ta...@agmlab.com>.
In addition to my previous mail: if you use the -all parameter, you may have the same problem.

Talat



Re: UpdateDbJob increases fetchtime of unfetched pages

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Günter,

You are right: the UpdatedbJob of Nutch 2.2.1 doesn't accept a batchId.
However, this problem is fixed by NUTCH-1556 [1]. If you apply that patch,
you will not have the same problem.

We use 2.x in production and have not found any big issues. I think it is
safe to use.

Thanks
Talat

[1] https://issues.apache.org/jira/browse/NUTCH-1556



Re: UpdateDbJob increases fetchtime of unfetched pages

Posted by Julien Nioche <li...@gmail.com>.
Hi Gunter

Nutch 1.x is a lot more stable than 2.x which is very much work in
progress. This particular issue will hopefully be fixed in 2.x soon but you
won't have it in 1.x for sure.

Julien






-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble