Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2012/01/12 17:41:35 UTC

Fetching large files

I'm using Nutch in distributed mode.  I'm crawling large files (a bunch of
videos), and when the fetcher map task merges its spill files in order to
send them to the reduce I'm getting an OOM exception.  It appears to be
because the merge is attempting to hold the data from all of the fetched
files in memory.
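
For what it's worth, the only knobs I've spotted so far are the per-task
heap and the map-side sort buffers.  A rough sketch of what I'm planning
to try in mapred-site.xml, assuming the stock Hadoop 1.x property names
and purely example values:

<!-- Give each map/reduce child JVM more headroom (example value). -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>

<!-- Map-side spill/merge buffers; defaults are 100 MB and 10 streams. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>io.sort.factor</name>
  <value>25</value>
</property>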

Has anyone else run into this problem?

Thanks.

Re: Fetching large files

Posted by Bai Shen <ba...@gmail.com>.
I could, but it's still possible that I'd run into this problem later.

The issue is that the map task merges the output from its spill files into
one file to send to the reduce job.  When it does this, it attempts to
bring the entire output into memory before writing out the single file.
As such, if you have more fetched content than heap allocated to the task,
you get an OutOfMemoryError.

And no, I'm not having any other problems parsing the segments.
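
In the meantime I may also cap how much of each file the fetcher keeps,
so that no single record can outgrow the heap.  A rough sketch for
nutch-site.xml, assuming the standard content-limit properties (the
10 MB values are just examples, and truncation obviously loses the tail
of each video, so it's only a stopgap):

<!-- Truncate fetched content beyond this many bytes (-1 = no limit). -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>10485760</value>
</property>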

On Thu, Jan 12, 2012 at 4:51 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Is it possible for you to fetch smaller segments, parse them, then merge
> incrementally rather than attempting to merge several larger segments at
> once?
>
> Are you getting any IO problems when parsing the segments? If so, this
> may be an early warning light to attack the problem from another angle.
>
> On Thu, Jan 12, 2012 at 4:41 PM, Bai Shen <ba...@gmail.com> wrote:
>
> > I'm using Nutch in distributed mode.  I'm crawling large files (a bunch
> > of videos), and when the fetcher map task merges its spill files in
> > order to send them to the reduce I'm getting an OOM exception.  It
> > appears to be because the merge is attempting to hold the data from all
> > of the fetched files in memory.
> >
> > Has anyone else run into this problem?
> >
> > Thanks.
> >
>
>
>
> --
> *Lewis*
>

Re: Fetching large files

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Is it possible for you to fetch smaller segments, parse them, then merge
incrementally rather than attempting to merge several larger segments at
once?

Are you getting any IO problems when parsing the segments? If so, this may
be an early warning light to attack the problem from another angle.
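
Something like this rough loop is what I mean (just a sketch; the paths
and the -topN value are placeholders, so adjust them to your setup):

# Generate a small segment (capped with -topN), fetch and parse it,
# then update the crawldb.  Repeat per batch.
bin/nutch generate crawl/crawldb crawl/segments -topN 500
SEGMENT=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT

# Later, merge the small segments incrementally into one:
bin/nutch mergesegs crawl/merged -dir crawl/segments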

On Thu, Jan 12, 2012 at 4:41 PM, Bai Shen <ba...@gmail.com> wrote:

> I'm using Nutch in distributed mode.  I'm crawling large files (a bunch of
> videos), and when the fetcher map task merges its spill files in order to
> send them to the reduce I'm getting an OOM exception.  It appears to be
> because the merge is attempting to hold the data from all of the fetched
> files in memory.
>
> Has anyone else run into this problem?
>
> Thanks.
>



-- 
*Lewis*