Posted to user@nutch.apache.org by Gus Bourg <gu...@bourg.net> on 2005/03/11 22:24:09 UTC

nutch analyze db 2 eats lots of disk space?

New user, sorry if this has already been discussed. I'm doing whole web 
indexing on a dual opteron with 250 gigs of space. My segment directory is 
about 7.5 gigs and my db directory is about 2.1 gigs. I'm running version 
0.6.

My problem is that when I go to run bin/nutch analyze db 2, it runs out of 
space. Is it normal for it to eat up 200+ gigs for 7.5 gigs' worth of 
segments?

I assume I need to run analyze frequently...?

I also don't really understand what analyze actually does. The way I think 
things work is like this (a loop sketch follows the list):

generate db segments -topN <x>
* Generates a fetchlist of unfetched or out-of-date URLs.

fetch <segment>
* Builds a content segment of the URLs in that fetchlist

updatedb db <segment>
* Tells the web db which URLs have been fetched so they don't get added 
to a new fetchlist (unless they're outdated)

analyze db 2
* Performs DB analysis. Two passes. I think this is the actual ranking 
mechanism - is that right?

merge index segments/*
* Merges indexed segments into a master index.
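
Put together, I think one whole-web round looks roughly like the sketch 
below. The -topN value and the way I pick out the newest segment directory 
are illustrative guesses on my part, and I've slotted in an index step 
before the merge since merge works on per-segment indexes:

#!/bin/sh
# One whole-web crawl round as I understand it (a sketch, not gospel).
bin/nutch generate db segments -topN 100000  # 1. fetchlist of due URLs
s=`ls -d segments/2* | tail -1`              # the segment just generated
bin/nutch fetch $s                           # 2. fetch that fetchlist
bin/nutch updatedb db $s                     # 3. mark fetched, add new links
bin/nutch analyze db 2                       # 4. two passes of link analysis
bin/nutch index $s                           # 5. index the fetched segment
bin/nutch merge index segments/*             # 6. merge into a master index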

Thanks,
Gus

Re: nutch analyze db 2 eats lots of disk space?

Posted by Gus Bourg <gu...@bourg.net>.
Thanks Stefan. I'll give 1.4 a try today. Another thing I noticed is that 
the slowdown (and the increased disk usage) seems to happen after 1 million 
records:
050321 143337 Pages consumed: 980000 (at index 980000)
050321 143337 Pages consumed: 985000 (at index 985000)
050321 143337 Pages consumed: 990000 (at index 990000)
050321 143338 Pages consumed: 995000 (at index 995000)
050321 143338 Pages consumed: 1000000 (at index 1000000)
050321 150728 Pages consumed: 1005000 (at index 1005000)
050321 155205 Pages consumed: 1010000 (at index 1010000)

Is there any logical reason why the sorting would be fast up to 1,000,000 
records and then slow way down afterwards?
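
If anyone wants to watch the same turnover happen, keeping an eye on the 
temp spill files while analyze runs should show it. A rough sketch, assuming 
the sorted.N files land in /tmp (they go to wherever the JVM's temp 
directory points on your box):

# every 60s: size of the spill files, plus free space on /tmp
watch -n 60 'du -sh /tmp/sorted.* ; df -h /tmp'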

Thanks,
Gus

On Mon, 21 Mar 2005, Stefan Groschupf wrote:

> Well, maybe Java 1.5 is the problem.
> But this is just a maybe, since Nutch is tested and developed against Java 
> 1.4.
>
> There is another thing that confuses me.
> When you do db analysis, what relation does it have to a segment?
> Can you post a web db somewhere, less than 20 MB zipped, so I can 
> reproduce the problem?
>
>
>
> On 21.03.2005, at 18:22, Gus Bourg wrote:
>
>> I actually tried the latest code, but couldn't get it to compile.
>> 
>> It's the tmpdir that fills up; the sorted.0 and sorted.1 files get really 
>> big (>100 gigs each).
>> 
>> I'm running on Linux 2.6.9 on an opteron box:
>> Linux jeb 2.6.9-gentoo-r14 #1 SMP Tue Mar 8 12:05:13 PST 2005 x86_64 AMD 
>> Opteron(tm) Processor 242 AuthenticAMD GNU/Linux
>> 
>> Java version:
>> java version "1.5.0_01"
>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08)
>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_01-b08, mixed mode)
>> 
>> For the fetchlist, I just grabbed one of the larger ones:
>> http://bourg.net/~gus/fetchlist.tar.gz
>> 
>> Thanks,
>> Gus
>> 
>> On Thu, 17 Mar 2005, Stefan Groschupf wrote:
>> 
>>> Can you please try this with the latest code in Subversion?
>>> I cannot reproduce this problem; I have used the 0.6 release many times 
>>> but never noticed such a problem.
>>> Please try the latest code and, if the problem still occurs, post the OS, 
>>> Java version, and a compressed fetchlist so people can reproduce the 
>>> problem.
>>> 
>>> 
>>> 
>>> On 16.03.2005, at 21:58, Gus Bourg wrote:
>>> 
>>>> No answer on this? :(
>>>> Gus
>>>> On Fri, 11 Mar 2005, Gus Bourg wrote:
>>>>> New user, sorry if this has already been discussed. I'm doing whole web 
>>>>> indexing on a dual opteron with 250 gigs of space. My segment directory 
>>>>> is about 7.5 gigs and my db directory is about 2.1 gigs. I'm running 
>>>>> version 0.6.
>>>>> My problem is that when I go to run bin/nutch analyze db 2, it runs out 
>>>>> of space. Is it normal for it to eat up 200+ gigs for 7.5 gigs' worth 
>>>>> of segments?
>>> ---------------------------------------------------------------
>>> company:		http://www.media-style.com
>>> forum:		http://www.text-mining.org
>>> blog:			http://www.find23.net
>>> 
>>> 
>>> 
>> 
>> 
> ---------------------------------------------------------------
> company:		http://www.media-style.com
> forum:		http://www.text-mining.org
> blog:			http://www.find23.net
>
>
>

Re: nutch analyze db 2 eats lots of disk space?

Posted by Stefan Groschupf <sg...@media-style.com>.
Well, maybe Java 1.5 is the problem.
But this is just a maybe, since Nutch is tested and developed against 
Java 1.4.

There is another thing that confuses me.
When you do db analysis, what relation does it have to a segment?
Can you post a web db somewhere, less than 20 MB zipped, so I can 
reproduce the problem?
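
Something simple like this would do (the file name is just an example):

tar czf webdb.tar.gz db/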



On 21.03.2005, at 18:22, Gus Bourg wrote:

> I actually tried the latest code, but couldn't get it to compile.
>
> It's the tmpdir that fills up; the sorted.0 and sorted.1 files get 
> really big (>100 gigs each).
>
> I'm running on Linux 2.6.9 on an opteron box:
> Linux jeb 2.6.9-gentoo-r14 #1 SMP Tue Mar 8 12:05:13 PST 2005 x86_64 
> AMD Opteron(tm) Processor 242 AuthenticAMD GNU/Linux
>
> Java version:
> java version "1.5.0_01"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08)
> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_01-b08, mixed mode)
>
> For the fetchlist, I just grabbed one of the larger ones:
> http://bourg.net/~gus/fetchlist.tar.gz
>
> Thanks,
> Gus
>
> On Thu, 17 Mar 2005, Stefan Groschupf wrote:
>
>> Can you please try this with the latest code in Subversion?
>> I cannot reproduce this problem; I have used the 0.6 release many times 
>> but never noticed such a problem.
>> Please try the latest code and, if the problem still occurs, post the 
>> OS, Java version, and a compressed fetchlist so people can reproduce 
>> the problem.
>>
>>
>>
>> On 16.03.2005, at 21:58, Gus Bourg wrote:
>>
>>> No answer on this? :(
>>> Gus
>>> On Fri, 11 Mar 2005, Gus Bourg wrote:
>>>> New user, sorry if this has already been discussed. I'm doing whole 
>>>> web indexing on a dual opteron with 250 gigs of space. My segment 
>>>> directory is about 7.5 gigs and my db directory is about 2.1 gigs. 
>>>> I'm running version 0.6.
>>>> My problem is that when I go to run bin/nutch analyze db 2, it runs 
>>>> out of space. Is it normal for it to eat up 200+ gigs for 7.5 
>>>> gigs' worth of segments?
>> ---------------------------------------------------------------
>> company:		http://www.media-style.com
>> forum:		http://www.text-mining.org
>> blog:			http://www.find23.net
>>
>>
>>
>
>
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net


Re: nutch analyze db 2 eats lots of disk space?

Posted by Gus Bourg <gu...@bourg.net>.
I actually tried the latest code, but couldn't get it to compile.

It's the tmpdir that fills up; the sorted.0 and sorted.1 files get really 
big (>100 gigs each).
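
In the meantime I may try pointing the JVM temp directory at the big 
partition so the spill files have room. This is only a guess on my part; 
java.io.tmpdir is the standard JVM property, but I'm not certain the 0.6 
bin/nutch script actually passes NUTCH_OPTS through to java:

# untested workaround sketch; /data/tmp is a made-up path, and bin/nutch
# honoring NUTCH_OPTS is an assumption on my part
mkdir -p /data/tmp
export NUTCH_OPTS="-Djava.io.tmpdir=/data/tmp"
bin/nutch analyze db 2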

I'm running on Linux 2.6.9 on an opteron box:
Linux jeb 2.6.9-gentoo-r14 #1 SMP Tue Mar 8 12:05:13 PST 2005 x86_64 AMD 
Opteron(tm) Processor 242 AuthenticAMD GNU/Linux

Java version:
java version "1.5.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_01-b08, mixed mode)

For the fetchlist, I just grabbed one of the larger ones:
http://bourg.net/~gus/fetchlist.tar.gz

Thanks,
Gus

On Thu, 17 Mar 2005, Stefan Groschupf wrote:

> Can you please try this with the latest code in Subversion?
> I cannot reproduce this problem; I have used the 0.6 release many times 
> but never noticed such a problem.
> Please try the latest code and, if the problem still occurs, post the OS, 
> Java version, and a compressed fetchlist so people can reproduce the 
> problem.
>
>
>
> On 16.03.2005, at 21:58, Gus Bourg wrote:
>
>> No answer on this? :(
>> 
>> Gus
>> 
>> On Fri, 11 Mar 2005, Gus Bourg wrote:
>> 
>>> New user, sorry if this has already been discussed. I'm doing whole web 
>>> indexing on a dual opteron with 250 gigs of space. My segment directory is 
>>> about 7.5 gigs and my db directory is about 2.1 gigs. I'm running version 
>>> 0.6.
>>> 
>>> My problem is that when I go to run bin/nutch analyze db 2, it runs out of 
>>> space. Is it normal for it to eat up 200+ gigs for 7.5 gigs' worth of 
>>> segments?
>> 
>> 
> ---------------------------------------------------------------
> company:		http://www.media-style.com
> forum:		http://www.text-mining.org
> blog:			http://www.find23.net
>
>
>

Re: nutch analyze db 2 eats lots of disk space?

Posted by Stefan Groschupf <sg...@media-style.com>.
Can you please try this with the latest code in Subversion?
I cannot reproduce this problem; I have used the 0.6 release many times 
but never noticed such a problem.
Please try the latest code and, if the problem still occurs, post the 
OS, Java version, and a compressed fetchlist so people can reproduce 
the problem.



On 16.03.2005, at 21:58, Gus Bourg wrote:

> No answer on this? :(
>
> Gus
>
> On Fri, 11 Mar 2005, Gus Bourg wrote:
>
>> New user, sorry if this has already been discussed. I'm doing whole 
>> web indexing on a dual opteron with 250 gigs of space. My segment 
>> directory is about 7.5 gigs and my db directory is about 2.1 gigs. 
>> I'm running version 0.6.
>>
>> My problem is that when I go to run bin/nutch analyze db 2, it runs 
>> out of space. Is it normal for it to eat up 200+ gigs for 7.5 gigs' 
>> worth of segments?
>
>
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net


Re: nutch analyze db 2 eats lots of disk space?

Posted by Gus Bourg <gu...@bourg.net>.
No answer on this? :(

Gus

On Fri, 11 Mar 2005, Gus Bourg wrote:

> New user, sorry if this has already been discussed. I'm doing whole web 
> indexing on a dual opteron with 250 gigs of space. My segment directory is 
> about 7.5 gigs and my db directory is about 2.1 gigs. I'm running version 
> 0.6.
>
> My problem is that when I go to run bin/nutch analyze db 2, it runs out of 
> space. Is it normal for it to eat up 200+ gigs for 7.5 gigs' worth of 
> segments?