Posted to user@nutch.apache.org by "yoursoft@freemail.hu" <yo...@freemail.hu> on 2005/03/31 11:29:16 UTC

Whole web crawling

Dear Nutch Users!

I have a question about continuous use of Nutch:
- When I refetch pages (after 30 days), I think that modified pages are 
put into new segments, while the old versions stay live in the old 
segments dir. So the db will grow larger and larger with every new page 
version? Is this a real problem, or do I only think it is?

Best Regards,
    Ferenc

problem with a dedup segment

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Users,

I have a problem deduplicating a segment.
I tried:
bin/nutch dedup segments dedup.txt
Error message:
Clearing old deletions in /segments/20050317092156/index
Exception in thread "main" java.io.IOException: Lock obtain timed out: 
Lock@/tmp/lucene-......-write.lock

How can I solve this problem?

Best Regards,
    Ferenc
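
A common cause of this "Lock obtain timed out" error is a stale Lucene 
write lock left in /tmp by a Nutch process that crashed or was killed. 
A minimal sketch of the usual remedy, assuming no other Nutch job is 
currently writing to the index (the hashed part of the lock file name 
will differ):

  # Make sure no other fetch/index/dedup process is still running first.
  ls /tmp/lucene-*-write.lock          # find the stale lock file(s)
  rm -f /tmp/lucene-*-write.lock       # remove them
  bin/nutch dedup segments dedup.txt   # then retry the dedup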

Re: [Nutch-general] Re: Whole web crawling

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Thanks for the help and the good answers.

Matthias Jaekle wrote:

>> In this case, we refetch everything monthly? Why is it not enough to 
>> refetch only the changed pages (check the last modified date and that 
>> the page is not a 404 error)?
>
> I think Nutch is not able to do this at the moment.
>
>> I can fetch topN 500,000 daily -> 500,000 * 30 = a db of only 15 
>> million pages?
>
> 15 million pages + the number of known links from these pages = the 
> number of urls in the db.
>
> Yes. If you want to keep more documents in your index, increase the 
> number of days between refetches.
>
>> Does dedup only remove entries from the segment index, or does it 
>> remove them from the segments too?
>
> I am not sure.
>
> Matthias
>
>


Re: [Nutch-general] Re: Whole web crawling

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
OK, thanks for the answers. A simpler question: how do you use Nutch 
with a live db (how do you update pages, delete old data, etc.)?

Andrzej Bialecki wrote:

> Matthias Jaekle wrote:
>
>>> In this case, we refetch everything monthly? Why is it not enough 
>>> to refetch only the changed pages (check the last modified date and 
>>> that the page is not a 404 error)?
>>
>>
>> I think Nutch is not able to do this at the moment.
>>
>>> I can fetch topN 500,000 daily -> 500,000 * 30 = a db of only 15 
>>> million pages?
>>
>>
>> 15 million pages + the number of known links from these pages = the 
>> number of urls in the db.
>>
>> Yes. If you want to keep more documents in your index, increase the 
>> number of days between refetches.
>>
>>> Does dedup only remove entries from the segment index, or does it 
>>> remove them from the segments too?
>>
>>
>> I am not sure.
>
>
> Dedup removes only index entries; duplicate content is left in the 
> segment data - it would be too costly to remove it. However, duplicate 
> content is removed if you run the SegmentMergeTool.
>
>


Re: [Nutch-general] Re: Whole web crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Matthias Jaekle wrote:
>> In this case, we refetch everything monthly? Why is it not enough to 
>> refetch only the changed pages (check the last modified date and that 
>> the page is not a 404 error)?
> 
> I think Nutch is not able to do this at the moment.
> 
>> I can fetch topN 500,000 daily -> 500,000 * 30 = a db of only 15 
>> million pages?
> 
> 15 million pages + the number of known links from these pages = the 
> number of urls in the db.
> 
> Yes. If you want to keep more documents in your index, increase the 
> number of days between refetches.
> 
>> Does dedup only remove entries from the segment index, or does it 
>> remove them from the segments too?
> 
> I am not sure.

Dedup removes only index entries; duplicate content is left in the 
segment data - it would be too costly to remove it. However, duplicate 
content is removed if you run the SegmentMergeTool.


-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
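
Since dedup leaves the duplicate content in the segment data, reclaiming 
that disk space means merging segments, as Andrzej notes above. A hedged 
sketch of invoking the tool; the class name assumes the Nutch 0.6-era 
net.nutch package layout, and the arguments shown are illustrative 
assumptions - run the class without arguments to print the usage string 
for your release:

  # Assumption: net.nutch.segment.SegmentMergeTool is the 0.6-era class
  # name; later releases use the org.apache.nutch namespace. The -dir
  # and -o arguments below are illustrative, not verified flags.
  bin/nutch net.nutch.segment.SegmentMergeTool \
      -dir segments -o segments-merged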


Re: [Nutch-general] Re: Whole web crawling

Posted by Matthias Jaekle <ja...@eventax.de>.
> In this case, we refetch everything monthly? Why is it not enough to 
> refetch only the changed pages (check the last modified date and that 
> the page is not a 404 error)?
I think Nutch is not able to do this at the moment.

> I can fetch topN 500,000 daily -> 500,000 * 30 = a db of only 15 
> million pages?
15 million pages + the number of known links from these pages = the 
number of urls in the db.

Yes. If you want to keep more documents in your index, increase the 
number of days between refetches.

> Does dedup only remove entries from the segment index, or does it 
> remove them from the segments too?
I am not sure.

Matthias
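
For reference, the daily topN cycle discussed here would look roughly 
like this with the 0.6-era whole-web commands from the Nutch tutorial; 
the db and segments directory names follow the tutorial layout and are 
assumptions about your setup:

  bin/nutch generate db segments -topN 500000   # fetchlist of the top 500,000 urls
  s=`ls -d segments/2* | tail -1`               # the segment just created
  bin/nutch fetch $s                            # fetch its pages
  bin/nutch updatedb db $s                      # fold new pages and links into the db
  bin/nutch index $s                            # index the new segment
  bin/nutch dedup segments dedup.txt            # drop duplicate index entries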

Re: [Nutch-general] Re: Whole web crawling

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Matthias,

In this case, we refetch everything monthly? Why is it not enough to 
refetch only the changed pages (check the last modified date and that 
the page is not a 404 error)? The current situation needs a lot of 
bandwidth for fetching.

I can fetch topN 500,000 daily -> 500,000 * 30 = a db of only 15 
million pages?

Does dedup only remove entries from the segment index, or does it 
remove them from the segments too?

Matthias Jaekle wrote:

>> I think Matthias's idea is the better way: dedup the segments. In 
>> the segments older than 30 days you can find the unchanged pages, 
>> the ones that do not exist in the new segments.
>
> No, pages get refetched after 30 days, so they will be in the new 
> segments. You could remove segments after 30 + x days.
>
> Matthias


Re: [Nutch-general] Re: Whole web crawling

Posted by Matthias Jaekle <ja...@eventax.de>.
> I think Matthias's idea is the better way: dedup the segments. In the 
> segments older than 30 days you can find the unchanged pages, the ones 
> that do not exist in the new segments.
No, pages get refetched after 30 days, so they will be in the new 
segments. You could remove segments after 30 + x days.

Matthias

-- 
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - The search engine for local information and events

Re: [Nutch-general] Re: Whole web crawling

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
I think Matthias's idea is the better way: dedup the segments. In the 
segments older than 30 days you can find the unchanged pages, the ones 
that do not exist in the new segments.

Stefan Groschupf wrote:

> Segment folders older than 30 days can be deleted.
>
> On 31.03.2005 at 11:29, yoursoft@freemail.hu wrote:
>
>> Dear Nutch Users!
>>
>> I have a question about continuous use of Nutch:
>> - When I refetch pages (after 30 days), I think that modified pages 
>> are put into new segments, while the old versions stay live in the 
>> old segments dir. So the db will grow larger and larger with every 
>> new page version? Is this a real problem, or do I only think it is?
>>
>> Best Regards,
>>    Ferenc
>

Re: Whole web crawling

Posted by Stefan Groschupf <sg...@media-style.com>.
Segment folders older than 30 days can be deleted.

On 31.03.2005 at 11:29, yoursoft@freemail.hu wrote:

> Dear Nutch Users!
>
> I have a question about continuous use of Nutch:
> - When I refetch pages (after 30 days), I think that modified pages 
> are put into new segments, while the old versions stay live in the 
> old segments dir. So the db will grow larger and larger with every 
> new page version? Is this a real problem, or do I only think it is?
>
> Best Regards,
>    Ferenc
>
>
-----------information technology-------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net
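
As a sketch of the housekeeping Stefan describes: segment directories 
are named by fetch timestamp, so once their pages have been refetched 
into newer segments, directories past the refetch interval can be 
removed. Assuming the tutorial's segments/ layout:

  # Delete segment directories untouched for more than 30 days. Run this
  # only after the newer segments have been fetched and indexed.
  find segments -maxdepth 1 -type d -name '2*' -mtime +30 \
      -exec rm -rf {} \;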


Re: [Nutch-general] Re: Whole web crawling

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Matthias!

Thanks for the fast and usable answer!

Matthias Jaekle wrote:

> Hi,
>
>> - When I refetch pages (after 30 days), I think that modified pages 
>> are put into new segments, while the old versions stay live in the 
>> old segments dir.
>
> Run dedup to remove dupes or merge segments to avoid this.
>
>
>> So the db will grow larger and larger with every new page version? 
>> Is this a real problem, or do I only think it is?
>
> The db grows with each new link you find. Each link is stored once.
>
> Matthias
>


Re: Whole web crawling

Posted by Matthias Jaekle <ja...@eventax.de>.
Hi,

> - When I refetch pages (after 30 days), I think that modified pages 
> are put into new segments, while the old versions stay live in the 
> old segments dir.
Run dedup to remove dupes or merge segments to avoid this.


> So the db will grow larger and larger with every new page version? 
> Is this a real problem, or do I only think it is?
The db grows with each new link you find. Each link is stored once.

Matthias

-- 
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - The search engine for local information and events