Posted to mapreduce-user@hadoop.apache.org by Something Something <ma...@gmail.com> on 2013/07/31 07:26:30 UTC

Merging files

Hello,

One of our Pig scripts creates over 500 small part files.  To conserve HDFS
namespace, we need to cut down the number of files, so instead of saving 500
small files we want to merge them into 50.  We tried the following:

1)  When we set the PARALLEL number to 50, the Pig script takes a long time -
for obvious reasons.
2)  If we use Hadoop Streaming, it puts some garbage values into the key
field.
3)  We wrote our own MapReduce program that reads these 500 small part
files and uses 50 reducers.  The mappers simply emit each line, and the
reducers loop through the values and write them out.  We set
job.setOutputKeyClass(NullWritable.class) so that the key is not written to
the output file (a rough sketch of this job follows below).  This performs
better than Pig: the mappers run very fast and the reducers take some time
to complete, but the approach seems to be working well.
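
For reference, a minimal sketch of the kind of job described in 3): the
mapper scatters lines across the reducers with a random bucket key.  Class
names and the use of the new mapreduce API are assumptions for
illustration, not the exact code in question:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

  // Scatter each input line across the reducers using a random bucket key.
  public static class ScatterMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable bucket = new IntWritable();
    private final Random random = new Random();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      bucket.set(random.nextInt(context.getNumReduceTasks()));
      context.write(bucket, line);
    }
  }

  // Write every line back out; TextOutputFormat omits NullWritable keys,
  // so only the original lines appear in the 50 output part files.
  public static class PassThroughReducer
      extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<Text> lines,
        Context context) throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge small part files");
    job.setJarByClass(MergeSmallFiles.class);
    job.setMapperClass(ScatterMapper.class);
    job.setReducerClass(PassThroughReducer.class);
    job.setNumReduceTasks(50);                    // one output file per reducer
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);    // key is not written to output
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The random key only serves to spread the lines evenly over the 50 reducers;
the order of lines in the output is not preserved.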

Is there a better way to do this?  What strategies can you think of to
speed up the reducers?

Any help in this regard will be greatly appreciated.  Thanks.

Re: Merging files

Posted by Hailey Charlie <ha...@gmail.com>.
How big are your 50 files?  How long are the reducers taking?

- HC

On Jul 30, 2013, at 10:26 PM, Something Something <ma...@gmail.com> wrote:

> Hello,
> 
> One of our pig scripts creates over 500 small part files.  To save on
> namespace, we need to cut down the # of files, so instead of saving 500
> small files we need to merge them into 50.  We tried the following:
> 
> 1)  When we set parallel number to 50, the Pig script takes a long time -
> for obvious reasons.
> 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> field.
> 3)  We wrote our own Map Reducer program that reads these 500 small part
> files & uses 50 reducers.  Basically, the Mappers simply write the line &
> reducers loop thru values & write them out.  We set
> job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> the output file.  This is performing better than Pig.  Actually Mappers run
> very fast, but Reducers take some time to complete, but this approach seems
> to be working well.
> 
> Is there a better way to do this?  What strategy can you think of to
> increase speed of reducers.
> 
> Any help in this regard will be greatly appreciated.  Thanks.


Re: Merging files

Posted by "j.barrett Strausser" <j....@gmail.com>.
Can't you solve for the --max-file-blocks option, given that you know the
sizes of the input files and the desired number of output files?


On Wed, Jul 31, 2013 at 12:21 PM, Something Something <
mailinglists19@gmail.com> wrote:

> Thanks, John.  But I don't see an option to specify the # of output files.
>  How does Crush decide how many files to create?  Is it only based on file
> sizes?
>
> On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meagher@gmail.com
> >wrote:
>
> > Here's a great tool for handling exactly that case:
> > https://github.com/edwardcapriolo/filecrush
> >
> > On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> > <ma...@gmail.com> wrote:
> > > Each bz2 file after merging is about 50Megs.  The reducers take about 9
> > > minutes.
> > >
> > > Note:  'getmerge' is not an option.  There isn't enough disk space to
> do
> > a
> > > getmerge on the local production box.  Plus we need a scalable solution
> > as
> > > these files will get a lot bigger soon.
> > >
> > > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com>
> wrote:
> > >
> > >> How big are your 50 files?  How long are the reducers taking?
> > >>
> > >> On Jul 30, 2013, at 10:26 PM, Something Something <
> > >> mailinglists19@gmail.com> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > One of our pig scripts creates over 500 small part files.  To save
> on
> > >> > namespace, we need to cut down the # of files, so instead of saving
> > 500
> > >> > small files we need to merge them into 50.  We tried the following:
> > >> >
> > >> > 1)  When we set parallel number to 50, the Pig script takes a long
> > time -
> > >> > for obvious reasons.
> > >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the
> > key
> > >> > field.
> > >> > 3)  We wrote our own Map Reducer program that reads these 500 small
> > part
> > >> > files & uses 50 reducers.  Basically, the Mappers simply write the
> > line &
> > >> > reducers loop thru values & write them out.  We set
> > >> > job.setOutputKeyClass(NullWritable.class) so that the key is not
> > written
> > >> to
> > >> > the output file.  This is performing better than Pig.  Actually
> > Mappers
> > >> run
> > >> > very fast, but Reducers take some time to complete, but this
> approach
> > >> seems
> > >> > to be working well.
> > >> >
> > >> > Is there a better way to do this?  What strategy can you think of to
> > >> > increase speed of reducers.
> > >> >
> > >> > Any help in this regard will be greatly appreciated.  Thanks.
> > >>
> > >>
> >
>



-- 


https://github.com/bearrito
@deepbearrito

Re: Merging files

Posted by "j.barrett Strausser" <j....@gmail.com>.
That is what I was suggesting, yes.




On Wed, Jul 31, 2013 at 4:39 PM, Something Something <
mailinglists19@gmail.com> wrote:

> So you are saying, we will first do a 'hadoop count' to get the total # of
> bytes for all files.  Let's say that comes to:  1538684305
>
> Default Block Size is:  128 MB (134217728 bytes)
>
> So, total # of blocks needed:  1538684305 / 134217728 ≈ 12
>
> Max file blocks = 12 blocks / 50 (# of output files) < 1, so each output
> file fits well within a single block
>
> Does this calculation look right?
>
> On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <john.meagher@gmail.com
> >wrote:
>
> > It is file size based, not file count based.  For fewer files up the
> > max-file-blocks setting.
> >
> > On Wed, Jul 31, 2013 at 12:21 PM, Something Something
> > <ma...@gmail.com> wrote:
> > > Thanks, John.  But I don't see an option to specify the # of output
> > files.
> > >  How does Crush decide how many files to create?  Is it only based on
> > file
> > > sizes?
> > >
> > > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meagher@gmail.com
> > >wrote:
> > >
> > >> Here's a great tool for handling exactly that case:
> > >> https://github.com/edwardcapriolo/filecrush
> > >>
> > >> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> > >> <ma...@gmail.com> wrote:
> > >> > Each bz2 file after merging is about 50Megs.  The reducers take
> about
> > 9
> > >> > minutes.
> > >> >
> > >> > Note:  'getmerge' is not an option.  There isn't enough disk space
> to
> > do
> > >> a
> > >> > getmerge on the local production box.  Plus we need a scalable
> > solution
> > >> as
> > >> > these files will get a lot bigger soon.
> > >> >
> > >> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com>
> > wrote:
> > >> >
> > >> >> How big are your 50 files?  How long are the reducers taking?
> > >> >>
> > >> >> On Jul 30, 2013, at 10:26 PM, Something Something <
> > >> >> mailinglists19@gmail.com> wrote:
> > >> >>
> > >> >> > Hello,
> > >> >> >
> > >> >> > One of our pig scripts creates over 500 small part files.  To
> save
> > on
> > >> >> > namespace, we need to cut down the # of files, so instead of
> saving
> > >> 500
> > >> >> > small files we need to merge them into 50.  We tried the
> following:
> > >> >> >
> > >> >> > 1)  When we set parallel number to 50, the Pig script takes a
> long
> > >> time -
> > >> >> > for obvious reasons.
> > >> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into
> > the
> > >> key
> > >> >> > field.
> > >> >> > 3)  We wrote our own Map Reducer program that reads these 500
> small
> > >> part
> > >> >> > files & uses 50 reducers.  Basically, the Mappers simply write
> the
> > >> line &
> > >> >> > reducers loop thru values & write them out.  We set
> > >> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not
> > >> written
> > >> >> to
> > >> >> > the output file.  This is performing better than Pig.  Actually
> > >> Mappers
> > >> >> run
> > >> >> > very fast, but Reducers take some time to complete, but this
> > approach
> > >> >> seems
> > >> >> > to be working well.
> > >> >> >
> > >> >> > Is there a better way to do this?  What strategy can you think of
> > to
> > >> >> > increase speed of reducers.
> > >> >> >
> > >> >> > Any help in this regard will be greatly appreciated.  Thanks.
> > >> >>
> > >> >>
> > >>
> >
>



-- 


https://github.com/bearrito
@deepbearrito

Re: Merging files

Posted by Something Something <ma...@gmail.com>.
So you are saying, we will first do a 'hadoop count' to get the total # of
bytes for all files.  Let's say that comes to:  1538684305

Default Block Size is:  128 MB (134217728 bytes)

So, total # of blocks needed:  1538684305 / 134217728 ≈ 12

Max file blocks = 12 blocks / 50 (# of output files) < 1, so each output
file fits well within a single block

Does this calculation look right?
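
For reference, the byte total above can be read from 'hadoop fs -count',
whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME.
The path and counts below are placeholders that simply mirror the numbers
in this thread:

  $ hadoop fs -count /user/example/pig-output
             1          500          1538684305 /user/example/pig-output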

On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <jo...@gmail.com>wrote:

> It is file size based, not file count based.  For fewer files up the
> max-file-blocks setting.
>
> On Wed, Jul 31, 2013 at 12:21 PM, Something Something
> <ma...@gmail.com> wrote:
> > Thanks, John.  But I don't see an option to specify the # of output
> files.
> >  How does Crush decide how many files to create?  Is it only based on
> file
> > sizes?
> >
> > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meagher@gmail.com
> >wrote:
> >
> >> Here's a great tool for handling exactly that case:
> >> https://github.com/edwardcapriolo/filecrush
> >>
> >> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> >> <ma...@gmail.com> wrote:
> >> > Each bz2 file after merging is about 50Megs.  The reducers take about
> 9
> >> > minutes.
> >> >
> >> > Note:  'getmerge' is not an option.  There isn't enough disk space to
> do
> >> a
> >> > getmerge on the local production box.  Plus we need a scalable
> solution
> >> as
> >> > these files will get a lot bigger soon.
> >> >
> >> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com>
> wrote:
> >> >
> >> >> How big are your 50 files?  How long are the reducers taking?
> >> >>
> >> >> On Jul 30, 2013, at 10:26 PM, Something Something <
> >> >> mailinglists19@gmail.com> wrote:
> >> >>
> >> >> > Hello,
> >> >> >
> >> >> > One of our pig scripts creates over 500 small part files.  To save
> on
> >> >> > namespace, we need to cut down the # of files, so instead of saving
> >> 500
> >> >> > small files we need to merge them into 50.  We tried the following:
> >> >> >
> >> >> > 1)  When we set parallel number to 50, the Pig script takes a long
> >> time -
> >> >> > for obvious reasons.
> >> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into
> the
> >> key
> >> >> > field.
> >> >> > 3)  We wrote our own Map Reducer program that reads these 500 small
> >> part
> >> >> > files & uses 50 reducers.  Basically, the Mappers simply write the
> >> line &
> >> >> > reducers loop thru values & write them out.  We set
> >> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not
> >> written
> >> >> to
> >> >> > the output file.  This is performing better than Pig.  Actually
> >> Mappers
> >> >> run
> >> >> > very fast, but Reducers take some time to complete, but this
> approach
> >> >> seems
> >> >> > to be working well.
> >> >> >
> >> >> > Is there a better way to do this?  What strategy can you think of
> to
> >> >> > increase speed of reducers.
> >> >> >
> >> >> > Any help in this regard will be greatly appreciated.  Thanks.
> >> >>
> >> >>
> >>
>

Re: Merging files

Posted by John Meagher <jo...@gmail.com>.
It is file-size based, not file-count based.  For fewer output files,
increase the max-file-blocks setting.

On Wed, Jul 31, 2013 at 12:21 PM, Something Something
<ma...@gmail.com> wrote:
> Thanks, John.  But I don't see an option to specify the # of output files.
>  How does Crush decide how many files to create?  Is it only based on file
> sizes?
>
> On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <jo...@gmail.com>wrote:
>
>> Here's a great tool for handling exactly that case:
>> https://github.com/edwardcapriolo/filecrush
>>
>> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
>> <ma...@gmail.com> wrote:
>> > Each bz2 file after merging is about 50Megs.  The reducers take about 9
>> > minutes.
>> >
>> > Note:  'getmerge' is not an option.  There isn't enough disk space to do
>> a
>> > getmerge on the local production box.  Plus we need a scalable solution
>> as
>> > these files will get a lot bigger soon.
>> >
>> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com> wrote:
>> >
>> >> How big are your 50 files?  How long are the reducers taking?
>> >>
>> >> On Jul 30, 2013, at 10:26 PM, Something Something <
>> >> mailinglists19@gmail.com> wrote:
>> >>
>> >> > Hello,
>> >> >
>> >> > One of our pig scripts creates over 500 small part files.  To save on
>> >> > namespace, we need to cut down the # of files, so instead of saving
>> 500
>> >> > small files we need to merge them into 50.  We tried the following:
>> >> >
>> >> > 1)  When we set parallel number to 50, the Pig script takes a long
>> time -
>> >> > for obvious reasons.
>> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the
>> key
>> >> > field.
>> >> > 3)  We wrote our own Map Reducer program that reads these 500 small
>> part
>> >> > files & uses 50 reducers.  Basically, the Mappers simply write the
>> line &
>> >> > reducers loop thru values & write them out.  We set
>> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not
>> written
>> >> to
>> >> > the output file.  This is performing better than Pig.  Actually
>> Mappers
>> >> run
>> >> > very fast, but Reducers take some time to complete, but this approach
>> >> seems
>> >> > to be working well.
>> >> >
>> >> > Is there a better way to do this?  What strategy can you think of to
>> >> > increase speed of reducers.
>> >> >
>> >> > Any help in this regard will be greatly appreciated.  Thanks.
>> >>
>> >>
>>

Re: Merging files

Posted by Something Something <ma...@gmail.com>.
Thanks, John.  But I don't see an option to specify the # of output files.
How does Crush decide how many files to create?  Is it only based on file
sizes?

On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <jo...@gmail.com>wrote:

> Here's a great tool for handling exactly that case:
> https://github.com/edwardcapriolo/filecrush
>
> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> <ma...@gmail.com> wrote:
> > Each bz2 file after merging is about 50Megs.  The reducers take about 9
> > minutes.
> >
> > Note:  'getmerge' is not an option.  There isn't enough disk space to do
> a
> > getmerge on the local production box.  Plus we need a scalable solution
> as
> > these files will get a lot bigger soon.
> >
> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com> wrote:
> >
> >> How big are your 50 files?  How long are the reducers taking?
> >>
> >> On Jul 30, 2013, at 10:26 PM, Something Something <
> >> mailinglists19@gmail.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > One of our pig scripts creates over 500 small part files.  To save on
> >> > namespace, we need to cut down the # of files, so instead of saving
> 500
> >> > small files we need to merge them into 50.  We tried the following:
> >> >
> >> > 1)  When we set parallel number to 50, the Pig script takes a long
> time -
> >> > for obvious reasons.
> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the
> key
> >> > field.
> >> > 3)  We wrote our own Map Reducer program that reads these 500 small
> part
> >> > files & uses 50 reducers.  Basically, the Mappers simply write the
> line &
> >> > reducers loop thru values & write them out.  We set
> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not
> written
> >> to
> >> > the output file.  This is performing better than Pig.  Actually
> Mappers
> >> run
> >> > very fast, but Reducers take some time to complete, but this approach
> >> seems
> >> > to be working well.
> >> >
> >> > Is there a better way to do this?  What strategy can you think of to
> >> > increase speed of reducers.
> >> >
> >> > Any help in this regard will be greatly appreciated.  Thanks.
> >>
> >>
>

Re: Merging files

Posted by John Meagher <jo...@gmail.com>.
Here's a great tool for handling exactly that case:
https://github.com/edwardcapriolo/filecrush
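
For anyone trying it, the invocation is roughly the following.  The jar
name, main class, and trailing timestamp argument are taken from my reading
of the project README and may differ by version; the paths are
placeholders, and the --max-file-blocks value shown is just an example (see
the discussion elsewhere in this thread on how to pick it):

  hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush \
      --max-file-blocks 8 \
      /user/example/pig-output /user/example/pig-output-crushed \
      20130731000000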

On Wed, Jul 31, 2013 at 2:40 AM, Something Something
<ma...@gmail.com> wrote:
> Each bz2 file after merging is about 50Megs.  The reducers take about 9
> minutes.
>
> Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> getmerge on the local production box.  Plus we need a scalable solution as
> these files will get a lot bigger soon.
>
> On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com> wrote:
>
>> How big are your 50 files?  How long are the reducers taking?
>>
>> On Jul 30, 2013, at 10:26 PM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > One of our pig scripts creates over 500 small part files.  To save on
>> > namespace, we need to cut down the # of files, so instead of saving 500
>> > small files we need to merge them into 50.  We tried the following:
>> >
>> > 1)  When we set parallel number to 50, the Pig script takes a long time -
>> > for obvious reasons.
>> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
>> > field.
>> > 3)  We wrote our own Map Reducer program that reads these 500 small part
>> > files & uses 50 reducers.  Basically, the Mappers simply write the line &
>> > reducers loop thru values & write them out.  We set
>> > job.setOutputKeyClass(NullWritable.class) so that the key is not written
>> to
>> > the output file.  This is performing better than Pig.  Actually Mappers
>> run
>> > very fast, but Reducers take some time to complete, but this approach
>> seems
>> > to be working well.
>> >
>> > Is there a better way to do this?  What strategy can you think of to
>> > increase speed of reducers.
>> >
>> > Any help in this regard will be greatly appreciated.  Thanks.
>>
>>

Re: Merging files

Posted by Something Something <ma...@gmail.com>.
Each bz2 file after merging is about 50 MB.  The reducers take about 9
minutes.

Note:  'hadoop fs -getmerge' is not an option.  There isn't enough disk
space to do a getmerge on the local production box, and we need a scalable
solution because these files will get a lot bigger soon.

On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <be...@gmail.com> wrote:

> How big are your 50 files?  How long are the reducers taking?
>
> On Jul 30, 2013, at 10:26 PM, Something Something <
> mailinglists19@gmail.com> wrote:
>
> > Hello,
> >
> > One of our pig scripts creates over 500 small part files.  To save on
> > namespace, we need to cut down the # of files, so instead of saving 500
> > small files we need to merge them into 50.  We tried the following:
> >
> > 1)  When we set parallel number to 50, the Pig script takes a long time -
> > for obvious reasons.
> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> > field.
> > 3)  We wrote our own Map Reducer program that reads these 500 small part
> > files & uses 50 reducers.  Basically, the Mappers simply write the line &
> > reducers loop thru values & write them out.  We set
> > job.setOutputKeyClass(NullWritable.class) so that the key is not written
> to
> > the output file.  This is performing better than Pig.  Actually Mappers
> run
> > very fast, but Reducers take some time to complete, but this approach
> seems
> > to be working well.
> >
> > Is there a better way to do this?  What strategy can you think of to
> > increase speed of reducers.
> >
> > Any help in this regard will be greatly appreciated.  Thanks.
>
>

Re: Merging files

Posted by Ben Juhn <be...@gmail.com>.
How big are your 50 files?  How long are the reducers taking?

On Jul 30, 2013, at 10:26 PM, Something Something <ma...@gmail.com> wrote:

> Hello,
> 
> One of our pig scripts creates over 500 small part files.  To save on
> namespace, we need to cut down the # of files, so instead of saving 500
> small files we need to merge them into 50.  We tried the following:
> 
> 1)  When we set parallel number to 50, the Pig script takes a long time -
> for obvious reasons.
> 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> field.
> 3)  We wrote our own Map Reducer program that reads these 500 small part
> files & uses 50 reducers.  Basically, the Mappers simply write the line &
> reducers loop thru values & write them out.  We set
> job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> the output file.  This is performing better than Pig.  Actually Mappers run
> very fast, but Reducers take some time to complete, but this approach seems
> to be working well.
> 
> Is there a better way to do this?  What strategy can you think of to
> increase speed of reducers.
> 
> Any help in this regard will be greatly appreciated.  Thanks.