Posted to user@nutch.apache.org by Briggs <ac...@gmail.com> on 2007/01/24 18:48:34 UTC

Merging large sets of segments, help.

Has anyone written an API that can merge thousands of segments?  The current
segment merge tool cannot handle this much data as there just isn't enough
RAM available on the box. So, I was wondering if there was a better,
incremental way to handle this.

Currently I have 1 segment for each domain that was crawled and I want to
merge them all into several large segments.  So, if anyone has any pointers
I would appreciate it.  Has anyone else attempted to keep segments at this
granularity?  This doesn't seem to work so well.
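
For reference, the layout and the failing step look roughly like this
(paths anonymized; the exact merge invocation may differ by Nutch
version):

    segments/example.com/    <- one segment per crawled domain
    segments/example.org/
    ...                      <- thousands more like these

    # single-shot merge of everything at once; this is what runs out of RAM
    bin/nutch mergesegs merged -dir segments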


<briggs />

"Concious decisions by concious minds are what make reality real"

Re: Merging large sets of segments, help.

Posted by Andrzej Bialecki <ab...@getopt.org>.
Briggs wrote:
> Cool, thanks for your responses!
>
> Next time I should probably mention that we are using 0.7.2.  Not
> quite sure if we can even think about moving to something 'more
> current', as I don't really know what the reasons would be.

Ahh ... that's a whole world of difference! 0.7.2 is very different from 
0.8 and later, and offers only limited scalability.

Still, this workaround should work ok ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging large sets of segments, help.

Posted by Briggs <ac...@gmail.com>.
Cool, thanks for your responses!

Next time I should probably mention that we are using 0.7.2.  Not
quite sure if we can even think about moving to something 'more
current', as I don't really know what the reasons would be.

<briggs />

> Most of this information is already available on the Nutch Wiki. All I
> can say is that there is certainly a limit to what you can do using the
> "local" mode - if you need to handle large numbers of pages you will
> need to migrate to the distributed setup.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


-- 
"Concious decisions by concious minds are what make reality real"

Re: Merging large sets of segments, help.

Posted by Andrzej Bialecki <ab...@getopt.org>.
Briggs wrote:
>> Are you running this in a distributed setup, or in "local" mode? Local
>> mode is not designed to cope with such large datasets, so it's likely
>> that you will be getting OOM errors during sorting ... I can only
>> recommend that you use a distributed setup with several machines, and
>> adjust RAM consumption with the number of reduce tasks.
>
> Currently we are running in local mode.  We do not have the setup for
> distributing. That is why I want to merge these segments.  Would that
> not help?  Instead of having potentially tens of thousands of
> segments, I want to create several large segments and index those.

Yes, it makes perfect sense, but you are probably hitting the limits of 
a single machine.

I suggest that you do the merging in several steps: by trial and
error, find the maximum number of segments that doesn't make
SegmentMerger explode, and do a first pass merging these small
segments into larger ones; then, in a second pass, merge those larger
ones into the really large ones.
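
A rough sketch of what I mean, as a shell loop.  The batch size and
paths are only examples (tune them), it assumes segment paths without
spaces, and it uses the 0.8-style SegmentMerger invocation, i.e.
"bin/nutch mergesegs output_dir (-dir segments | seg1 seg2 ...)":

    #!/bin/sh
    # Pass 1: merge the per-domain segments in batches small enough
    # that SegmentMerger survives; BATCH is found by trial and error.
    BATCH=50
    ls -d segments/* | xargs -n $BATCH | while read group; do
        # each run writes one new merged segment under merged_pass1/
        bin/nutch mergesegs merged_pass1 $group
    done

    # Pass 2: merge the intermediate segments into the final large ones.
    # (-slice NNNN can cut the output into several segments of NNNN
    # URLs each, if you want "several large segments" rather than one.)
    bin/nutch mergesegs merged_final -dir merged_pass1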


>
> Sorry for my ignorance, but I'm not really sure how to scale Nutch
> correctly.  Do you know of a document, or have some pointers, as to
> how segment/index data should be stored?

Most of this information is already available on the Nutch Wiki. All I 
can say is that there is certainly a limit to what you can do using the 
"local" mode - if you need to handle large numbers of pages you will 
need to migrate to the distributed setup.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging large sets of segments, help.

Posted by Briggs <ac...@gmail.com>.
> Are you running this in a distributed setup, or in "local" mode? Local
> mode is not designed to cope with such large datasets, so it's likely
> that you will be getting OOM errors during sorting ... I can only
> recommend that you use a distributed setup with several machines, and
> adjust RAM consumption with the number of reduce tasks.

Currently we are running in local mode.  We do not have the setup for
distributing. That is why I want to merge these segments.  Would that
not help?  Instead of having potentially tens of thousands of
segments, I want to create several large segments and index those.

Sorry for my ignorance, but I'm not really sure how to scale Nutch
correctly.  Do you know of a document, or have some pointers, as to
how segment/index data should be stored?

<briggs />

"Concious decisions by concious minds are what make reality real"

Re: Merging large sets of segments, help.

Posted by Andrzej Bialecki <ab...@getopt.org>.
Briggs wrote:
> Has anyone written an API that can merge thousands of segments?  The
> current segment merge tool cannot handle this much data as there just
> isn't enough RAM available on the box. So, I was wondering if there
> was a better, incremental way to handle this.
>
> Currently I have 1 segment for each domain that was crawled and I want
> to merge them all into several large segments.  So, if anyone has any
> pointers I would appreciate it.  Has anyone else attempted to keep
> segments at this granularity?  This doesn't seem to work so well.


Are you running this in a distributed setup, or in "local" mode? Local 
mode is not designed to cope with such large datasets, so it's likely 
that you will be getting OOM errors during sorting ... I can only 
recommend that you use a distributed setup with several machines, and 
adjust RAM consumption with the number of reduce tasks.
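
For example, in conf/hadoop-site.xml (the property name is from the
Hadoop configuration of that era, and the value is only illustrative;
what works depends on your data and hardware):

    <property>
      <name>mapred.reduce.tasks</name>
      <value>16</value>
      <description>More reduce tasks means each task sorts a smaller
      slice of the data, which lowers per-task RAM usage.</description>
    </property>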

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com