Posted to common-user@hadoop.apache.org by Boyu Zhang <bo...@gmail.com> on 2010/08/12 21:58:45 UTC

large parameter file, too many intermediate output

Dear All,

I am working on an algorithm that is kind of like "clustering" a data set.
The algorithm works like this:

The data set is broken into N (~40) chunks; each chunk contains 2,000 lines.
Each mapper retrieves a "parameter file" which contains 500 lines, and for
each line read from the data file, it compares that one line with all 500
lines from the parameter file and outputs 500 <key, value> pairs. The
reducer then reduces the values for each key.
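
A rough sketch of my mapper, to make the shape of the job concrete
(simplified; loadParameterFile() and compare() are stand-ins for my own
code):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CompareMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private List<String> parameterLines;  // the 500 parameter lines

  @Override
  protected void setup(Context context) throws IOException {
    // load the 500-line parameter file once per task
    // (stand-in helper; I read the file from the DistributedCache)
    parameterLines = loadParameterFile(context.getConfiguration());
  }

  @Override
  protected void map(LongWritable offset, Text dataLine, Context context)
      throws IOException, InterruptedException {
    // each data line is compared against every parameter line,
    // so every input line emits 500 <key, value> pairs
    for (String paramLine : parameterLines) {
      double score = compare(dataLine.toString(), paramLine);
      context.write(new Text(paramLine), new DoubleWritable(score));
    }
  }
}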

Now the problem is that each individual mapper outputs too much intermediate
data (2,000 * 500 pairs), and many of the pairs share the same key. Although
I am using a combiner, there is still too much intermediate output.

Is there anything I can do to reduce the output? Any suggestion is welcome!

Thanks,
Boyu

Re: large parameter file, too many intermediate output

Posted by Himanshu Vashishtha <va...@gmail.com>.
(+1) for the combiner. Recheck its implementation if it does not seem to be
working.
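
One thing worth checking: the combiner's input and output types must both
match the map output types, or the framework cannot apply it. If your
scores are simply summed, the combiner can be the same logic as the
reducer, along these lines (assuming Text keys and DoubleWritable values;
adjust to your actual types):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values,
      Context context) throws IOException, InterruptedException {
    // partially sum the scores for one key on the map side;
    // the reducer finishes the aggregation
    double sum = 0.0;
    for (DoubleWritable v : values) {
      sum += v.get();
    }
    context.write(key, new DoubleWritable(sum));
  }
}

Also remember that a combiner is only an optimization hint; Hadoop may run
it zero or more times, so it must be safe to apply repeatedly.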

On Thu, Aug 12, 2010 at 3:47 PM, Steve Lewis <lo...@gmail.com> wrote:

> I don't think of half a billion key-value pairs as that large a number,
> nor 20,000 per task; these are not atypical for Hadoop jobs, and many
> users would see them as small numbers. While you might use cleverness
> such as a combiner to reduce the output, I wonder whether that is even
> needed.
> What is your cluster size, and how fast does the job perform?
>
>
> On Thu, Aug 12, 2010 at 2:20 PM, Boyu Zhang <bo...@gmail.com> wrote:
>
> > Hi Steve,
> >
> >
> > On Thu, Aug 12, 2010 at 4:54 PM, Steve Lewis <lo...@gmail.com>
> > wrote:
> >
> > > I fail to see the value of the chunking; a simple TextInputFormat
> > > will give you one line at a time,
> > >
> > Sorry I did not make myself clear. When I said chunk, I just meant that
> > the data stored in HDFS is broken into chunks by the Hadoop runtime
> > system. I mentioned it only to emphasize that each map task is working
> > on its own data; it does not have a global view of the data. Sorry if
> > that is misleading. It is actually TextInputFormat I am using.
> >
> > > causing the mapper to emit 500 keys, which should be OK. What stops
> > > you from sending one line at a time to the mapper?
> > >
> >
> > Each mapper is dealing with one line and outputs 500 kv pairs. That is
> > a large amount of intermediate data because there are 1,000,000 lines
> > in total, and nearly 20,000 lines per map task.
> > Thanks for the attention!
> >
> > Boyu
> >
> > >
> > >
> > >
> > > On Thu, Aug 12, 2010 at 12:58 PM, Boyu Zhang <bo...@gmail.com>
> > > wrote:
> > >
> > > > Dear All,
> > > >
> > > > I am working on an algorithm that is kind of like "clustering" a
> > > > data set. The algorithm works like this:
> > > >
> > > > The data set is broken into N (~40) chunks; each chunk contains
> > > > 2,000 lines. Each mapper retrieves a "parameter file" which contains
> > > > 500 lines, and for each line read from the data file, it compares
> > > > that one line with all 500 lines from the parameter file and outputs
> > > > 500 <key, value> pairs. The reducer then reduces the values for each
> > > > key.
> > > >
> > > > Now the problem is that each individual mapper outputs too much
> > > > intermediate data (2,000 * 500 pairs), and many of the pairs share
> > > > the same key. Although I am using a combiner, there is still too
> > > > much intermediate output.
> > > >
> > > > Is there anything I can do to reduce the output? Any suggestion is
> > > > welcome!
> > > >
> > > > Thanks,
> > > > Boyu
> > > >
> > >
> > >
> > >
> > > --
> > > Steven M. Lewis PhD
> > > Institute for Systems Biology
> > > Seattle WA
> > >
> >
>
>
>
> --
> Steven M. Lewis PhD
> Institute for Systems Biology
> Seattle WA
>

Re: large parameter file, too many intermediate output

Posted by Boyu Zhang <bo...@gmail.com>.
Hi Harsh,

Thank you for the reply. I will try that, although right now the map tasks
are taking too much time: almost 20 min to finish all the map tasks (~90).
I don't know if compression will slow me down, but I will run a test and
see.
Thank you very much!

Boyu

On Thu, Aug 12, 2010 at 11:07 PM, Harsh J <qw...@gmail.com> wrote:

> Apart from the combiner suggestion, I'd also suggest always using
> intermediate map-output compression (with LZO, if possible).
> It saves you some IO.
>
> On Fri, Aug 13, 2010 at 3:24 AM, Boyu Zhang <bo...@gmail.com> wrote:
> > Hi Steve,
> >
> > Thanks for the reply!
> >
> > On Thu, Aug 12, 2010 at 5:47 PM, Steve Lewis <lo...@gmail.com>
> wrote:
> >
> >> I don't think of half a billion key-value pairs as that large a
> >> number, nor 20,000 per task; these are not atypical for Hadoop jobs,
> >> and many users would see them as small numbers. While you might use
> >> cleverness such as a combiner to reduce the output, I wonder whether
> >> that is even needed.
> >> What is your cluster size, and how fast does the job perform?
> >>
> >
> > I am using a combiner to compact the output a little before it gets
> > written to disk. My cluster has 48 cores (6 nodes * 8 cores/node), my
> > chunk size is 12 MB, there are 90 or so map tasks, and the job takes
> > about 30 min to run. I think that is very slow. Thanks for the
> > attention and interest!
> >
> > Boyu
> >
>
>
>
> --
> Harsh J
> www.harshj.com
>

Re: large parameter file, too many intermediate output

Posted by Harsh J <qw...@gmail.com>.
Apart from the combiner suggestion, I'd also suggest always using
intermediate map-output compression (with LZO, if possible).
It saves you some IO.
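
Something along these lines in the job configuration (0.20-era property
names; the LzoCodec class comes from the separately installed hadoop-lzo
libraries):

JobConf conf = new JobConf(MyJob.class);  // MyJob: your job class
// compress the intermediate map output before it is spilled and shuffled
conf.setBoolean("mapred.compress.map.output", true);
// without hadoop-lzo on every node, omit this to use the default codec
conf.set("mapred.map.output.compression.codec",
         "com.hadoop.compression.lzo.LzoCodec");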

On Fri, Aug 13, 2010 at 3:24 AM, Boyu Zhang <bo...@gmail.com> wrote:
> Hi Steve,
>
> Thanks for the reply!
>
> On Thu, Aug 12, 2010 at 5:47 PM, Steve Lewis <lo...@gmail.com> wrote:
>
>> I don't think of half a billion key-value pairs as that large a number,
>> nor 20,000 per task; these are not atypical for Hadoop jobs, and many
>> users would see them as small numbers. While you might use cleverness
>> such as a combiner to reduce the output, I wonder whether that is even
>> needed.
>> What is your cluster size, and how fast does the job perform?
>>
>
> I am using a combiner to compact the output a little before it gets
> written to disk. My cluster has 48 cores (6 nodes * 8 cores/node), my
> chunk size is 12 MB, there are 90 or so map tasks, and the job takes
> about 30 min to run. I think that is very slow. Thanks for the attention
> and interest!
>
> Boyu
>



-- 
Harsh J
www.harshj.com

Re: large parameter file, too many intermediate output

Posted by Boyu Zhang <bo...@gmail.com>.
Hi Steve,

Thanks for the reply!

On Thu, Aug 12, 2010 at 5:47 PM, Steve Lewis <lo...@gmail.com> wrote:

> I don't think of half a billion key-value pairs as that large a number,
> nor 20,000 per task; these are not atypical for Hadoop jobs, and many
> users would see them as small numbers. While you might use cleverness
> such as a combiner to reduce the output, I wonder whether that is even
> needed.
> What is your cluster size, and how fast does the job perform?
>

I am using a combiner to compact the output a little before it gets written
to disk. My cluster has 48 cores (6 nodes * 8 cores/node), my chunk size is
12 MB, there are 90 or so map tasks, and the job takes about 30 min to run.
I think that is very slow. Thanks for the attention and interest!

Boyu

Re: large parameter file, too many intermediate output

Posted by Steve Lewis <lo...@gmail.com>.
I don't think of half a billion key-value pairs as that large a number,
nor 20,000 per task; these are not atypical for Hadoop jobs, and many
users would see them as small numbers. While you might use cleverness such
as a combiner to reduce the output, I wonder whether that is even needed.
What is your cluster size, and how fast does the job perform?


On Thu, Aug 12, 2010 at 2:20 PM, Boyu Zhang <bo...@gmail.com> wrote:

> Hi Steve,
>
>
> On Thu, Aug 12, 2010 at 4:54 PM, Steve Lewis <lo...@gmail.com>
> wrote:
>
> > I fail to see the value of the chunking; a simple TextInputFormat will
> > give you one line at a time,
> >
> Sorry I did not make myself clear. When I said chunk, I just meant that
> the data stored in HDFS is broken into chunks by the Hadoop runtime
> system. I mentioned it only to emphasize that each map task is working on
> its own data; it does not have a global view of the data. Sorry if that
> is misleading. It is actually TextInputFormat I am using.
>
> > causing the mapper to emit 500 keys, which should be OK. What stops
> > you from sending one line at a time to the mapper?
> >
>
> Each mapper is dealing with one line and outputs 500 kv pairs. That is a
> large amount of intermediate data because there are 1,000,000 lines in
> total, and nearly 20,000 lines per map task.
> Thanks for the attention!
>
> Boyu
>
> >
> >
> >
> > On Thu, Aug 12, 2010 at 12:58 PM, Boyu Zhang <bo...@gmail.com>
> > wrote:
> >
> > > Dear All,
> > >
> > > I am working on an algorithm that is kind of like "clustering" a data
> > > set. The algorithm works like this:
> > >
> > > The data set is broken into N (~40) chunks; each chunk contains 2,000
> > > lines. Each mapper retrieves a "parameter file" which contains 500
> > > lines, and for each line read from the data file, it compares that one
> > > line with all 500 lines from the parameter file and outputs 500
> > > <key, value> pairs. The reducer then reduces the values for each key.
> > >
> > > Now the problem is that each individual mapper outputs too much
> > > intermediate data (2,000 * 500 pairs), and many of the pairs share the
> > > same key. Although I am using a combiner, there is still too much
> > > intermediate output.
> > >
> > > Is there anything I can do to reduce the output? Any suggestion is
> > > welcome!
> > >
> > > Thanks,
> > > Boyu
> > >
> >
> >
> >
> > --
> > Steven M. Lewis PhD
> > Institute for Systems Biology
> > Seattle WA
> >
>



-- 
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA

Re: large parameter file, too many intermediate output

Posted by Boyu Zhang <bo...@gmail.com>.
Hi Steve,


On Thu, Aug 12, 2010 at 4:54 PM, Steve Lewis <lo...@gmail.com> wrote:

> I fail to see the value of the chunking; a simple TextInputFormat will
> give you one line at a time,
>
Sorry I did not make myself clear. When I said chunk, I just meant that the
data stored in HDFS is broken into chunks by the Hadoop runtime system. I
mentioned it only to emphasize that each map task is working on its own
data; it does not have a global view of the data. Sorry if that is
misleading. It is actually TextInputFormat I am using.

> causing the mapper to emit 500 keys, which should be OK. What stops you
> from sending one line at a time to the mapper?
>

Each mapper is dealing with one line and outputs 500 kv pairs. That is a
large amount of intermediate data because there are 1,000,000 lines in
total, and nearly 20,000 lines per map task.
Thanks for the attention!

Boyu

>
>
>
> On Thu, Aug 12, 2010 at 12:58 PM, Boyu Zhang <bo...@gmail.com>
> wrote:
>
> > Dear All,
> >
> > I am working on an algorithm that is kind of like "clustering" a data
> > set. The algorithm works like this:
> >
> > The data set is broken into N (~40) chunks; each chunk contains 2,000
> > lines. Each mapper retrieves a "parameter file" which contains 500
> > lines, and for each line read from the data file, it compares that one
> > line with all 500 lines from the parameter file and outputs 500
> > <key, value> pairs. The reducer then reduces the values for each key.
> >
> > Now the problem is that each individual mapper outputs too much
> > intermediate data (2,000 * 500 pairs), and many of the pairs share the
> > same key. Although I am using a combiner, there is still too much
> > intermediate output.
> >
> > Is there anything I can do to reduce the output? Any suggestion is
> > welcome!
> >
> > Thanks,
> > Boyu
> >
>
>
>
> --
> Steven M. Lewis PhD
> Institute for Systems Biology
> Seattle WA
>

Re: large parameter file, too many intermediate output

Posted by Boyu Zhang <bo...@gmail.com>.
Hi Himanshu,

Thanks for the reply!

On Thu, Aug 12, 2010 at 5:08 PM, Himanshu Vashishtha <
vashishtha.h@gmail.com> wrote:

> It seems each input line is generating 500 kv pairs (which is kind of
> exploding the data), so chunking/not chunking will not make a difference.
>
> I am wondering about the approach itself: what sort of properties file is
> it? Are you sure you want every line to emit 500 kv pairs?

Each line of the data file looks like this:
2.3 4.5 9.9 x y z ......
They are 3-dimensional coordinates of many atoms. Each line of the
parameter file (properties file) looks exactly like the data file; its
lines are 3-dimensional coordinates of atoms too.

The functionality I want is like this: I want to select 1 line from the
parameter file. To select that line, I compare each line in the parameter
file with all lines in the data file and give the parameter line an
accumulative score, then choose the line with the highest score.

Say I have 500 lines in the parameter file and 100,000 lines in the data
file. That is 500 * 100,000 comparisons. When implementing it in MapReduce,
I go the other way around: for each line in the data file, I compare it
with the parameter file and output the scores, and I let the framework
group all the values with the same key and perform the reduce on all the
values.
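
In code, the reduce side is essentially just a per-key sum (a simplified
sketch of what I do; picking the parameter line with the highest total
score happens in a quick post-processing step after the job):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ScoreSumReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  @Override
  protected void reduce(Text paramLine, Iterable<DoubleWritable> scores,
      Context context) throws IOException, InterruptedException {
    // accumulate the score of this parameter line over all data lines
    double total = 0.0;
    for (DoubleWritable s : scores) {
      total += s.get();
    }
    context.write(paramLine, new DoubleWritable(total));
  }
}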

What bothers me is that the map outputs too much intermediate data; each
line shoots out 500 kv pairs, which is kind of too many. I am wondering if
there is a way to work around this, but thank you very much for the
interest and attention!

Boyu

Re: large parameter file, too many intermediate output

Posted by Himanshu Vashishtha <va...@gmail.com>.
It seems each input line is generating 500 kv pairs (which is kind of
exploding the data), so chunking/not chunking will not make a difference.

I am wondering about the approach itself: what sort of properties file is
it? Are you sure you want every line to emit 500 kv pairs?


On Thu, Aug 12, 2010 at 2:54 PM, Steve Lewis <lo...@gmail.com> wrote:

> I fail to see the value of the chunking; a simple TextInputFormat will
> give you one line at a time, causing the mapper to emit 500 keys, which
> should be OK. What stops you from sending one line at a time to the
> mapper?
>
>
>
> On Thu, Aug 12, 2010 at 12:58 PM, Boyu Zhang <bo...@gmail.com>
> wrote:
>
> > Dear All,
> >
> > I am working on an algorithm that is kind of like "clustering" a data
> > set. The algorithm works like this:
> >
> > The data set is broken into N (~40) chunks; each chunk contains 2,000
> > lines. Each mapper retrieves a "parameter file" which contains 500
> > lines, and for each line read from the data file, it compares that one
> > line with all 500 lines from the parameter file and outputs 500
> > <key, value> pairs. The reducer then reduces the values for each key.
> >
> > Now the problem is that each individual mapper outputs too much
> > intermediate data (2,000 * 500 pairs), and many of the pairs share the
> > same key. Although I am using a combiner, there is still too much
> > intermediate output.
> >
> > Is there anything I can do to reduce the output? Any suggestion is
> > welcome!
> >
> > Thanks,
> > Boyu
> >
>
>
>
> --
> Steven M. Lewis PhD
> Institute for Systems Biology
> Seattle WA
>

Re: large parameter file, too many intermediate output

Posted by Steve Lewis <lo...@gmail.com>.
I fail to see the value of the chunking; a simple TextInputFormat will give
you one line at a time, causing the mapper to emit 500 keys, which should
be OK. What stops you from sending one line at a time to the mapper?
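
For what it's worth, here is roughly what the driver could look like (a
sketch with made-up class names, using the 0.20 org.apache.hadoop.mapreduce
API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClusterScoreJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(ClusterScoreJob.class);
    // TextInputFormat hands the mapper one line at a time
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(CompareMapper.class);      // made-up names
    job.setCombinerClass(ScoreSumReducer.class);  // combiner == reducer here
    job.setReducerClass(ScoreSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}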



On Thu, Aug 12, 2010 at 12:58 PM, Boyu Zhang <bo...@gmail.com> wrote:

> Dear All,
>
> I am working on an algorithm that is kind of like "clustering" a data
> set. The algorithm works like this:
>
> The data set is broken into N (~40) chunks; each chunk contains 2,000
> lines. Each mapper retrieves a "parameter file" which contains 500 lines,
> and for each line read from the data file, it compares that one line with
> all 500 lines from the parameter file and outputs 500 <key, value> pairs.
> The reducer then reduces the values for each key.
>
> Now the problem is that each individual mapper outputs too much
> intermediate data (2,000 * 500 pairs), and many of the pairs share the
> same key. Although I am using a combiner, there is still too much
> intermediate output.
>
> Is there anything I can do to reduce the output? Any suggestion is
> welcome!
>
> Thanks,
> Boyu
>



-- 
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA