Posted to common-user@hadoop.apache.org by Kunsheng Chen <ke...@yahoo.com> on 2009/10/20 03:21:20 UTC

Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge

I guess this is exactly what the problem is!

Is there any way I could do the "grepping archives" part inside the MR program? Or is there some Hadoop command that could combine all the small files into one big file?



Thanks,

-Kun


--- On Mon, 10/19/09, Ashutosh Chauhan <as...@gmail.com> wrote:

> From: Ashutosh Chauhan <as...@gmail.com>
> Subject: Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
> To: common-user@hadoop.apache.org
> Date: Monday, October 19, 2009, 3:30 PM
> You might be hitting the "small-files" problem. This has been
> discussed multiple times on the list; grepping through the
> archives will help.
> Also http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
> 
> Ashutosh
> 
> On Sun, Oct 18, 2009 at 22:57, Kunsheng Chen <ke...@yahoo.com>
> wrote:
> 
> > I am running a Hadoop program to perform MapReduce work on the files
> > inside a folder.
> >
> > My program basically does Map and Reduce work: each line of every file
> > is a pair of strings, and the result is each string associated with its
> > number of occurrences across all files.
> >
> > The program works fine until the number of files grows to about 80,000;
> > then a 'cannot allocate memory' error occurs for some reason.
> >
> > Each file contains around 50 lines, but the total size of all files is
> > no more than 1.5 GB. There are 3 datanodes performing the calculation,
> > and each of them has more than 10 GB of disk space left.
> >
> > I am wondering whether this is normal for Hadoop because the data is
> > too large, or whether it might be my program's problem.
> >
> > It really shouldn't be, since Hadoop was developed for processing
> > large data sets.
> >
> > Any idea is well appreciated.
> 


Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
For searching (grepping) mailing list archives, I like MarkMail:
http://hadoop.markmail.org/ (try searching for "small files").

For concatenating files -- cat works, if you don't care about
provenance; as an alternative, you can also write a simple MR program
that creates a SequenceFile by reading in all the little files and
producing (filePath, fileContents) records.
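
As a rough, untested sketch (the class name and argument paths are just
placeholders, and this is the plain sequential version rather than the MR
one Kunsheng would eventually want), packing a directory of small files
into a single SequenceFile with the standard SequenceFile.Writer API would
look something like:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file directly under an input directory into one SequenceFile
// of (filePath, fileContents) records. Class name and paths are placeholders.
public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);    // e.g. the folder of small files
    Path outputFile = new Path(args[1]);  // e.g. one big .seq file

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outputFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) {
          continue;  // ignore subdirectories in this simple sketch
        }
        // Small files only: read the whole file into memory at once.
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        // Key = original file path, value = raw file bytes.
        writer.append(new Text(status.getPath().toString()),
                      new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Once the files are packed, the MR job can read the result back with
SequenceFileInputFormat; the key carries the original file path and the
value the raw contents, so nothing about the data is lost.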

The Cloudera post Ashutosh referred you to has a brief overview of all
the "standard" ideas.

-Dmitriy
