Posted to common-user@hadoop.apache.org by Mapred Learn <ma...@gmail.com> on 2011/12/21 02:45:31 UTC

Re: How to create Output files of about fixed size

Hi Shevek/others,

I tried this.

The first job created about 78 files of 15 MB each.

I tried a second map-only job with IdentityMapper and
-Dmapred.min.split.size=1073741824, but it did not produce output files of
1 GB each; the output was the same as above, i.e. 78 files of 15 MB each.

Is there a way to combine the above files into files of about 1 GB each?

Thanks,
-JJ

On Fri, Oct 28, 2011 at 9:53 AM, Shevek <sh...@karmasphere.com> wrote:

> If you run it as a pure map job, it will do it per split. If you run it as a
> single reducer job, it will do it overall. However, one starts to suspect
> that by the time you've paid that extra cost, you might as well reconsider
> your downstream process and the reason for this subdivision.
>
> S.
>
> On 27 October 2011 23:07, Mapred Learn <ma...@gmail.com> wrote:
>
> > Hi Shevek,
> > Thanks for the explanation!
> >
> > Can you point me to some documentation for specifying the size in an
> > output format?
> >
> > If I specify a size of 200 MB, would the rollover at 200 MB happen per
> > split or overall?
> > I mean, would I end up with a 200 MB and a 50 MB file from the 1st mapper,
> > then, say, a 200 MB and a 10 MB file from the 2nd mapper, and so on? Or
> > will I get 200 MB files only?
> >
> >
> >
> > On Wed, Oct 26, 2011 at 10:48 AM, Shevek <sh...@karmasphere.com> wrote:
> >
> > > You can control the input to a computer program, but not (arbitrarily)
> > > how much output it generates. The only way to generate output files of a
> > > fixed size is to write a custom output format which shifts to a new
> > > filename every time that size is exceeded, but you will still get some
> > > small bits left over. The plumbing in this is pretty ugly, and I would
> > > not recommend it casually.
> > >
> > > You may be able to write a second map-only job which reprocesses the
> > > output from the first job in chunks of X bytes, and just writes them out.
> > > Use an IdentityMapper and set the split size. I have not tried this at
> > > home.
> > >
> > > S.
> > >
> > > On 26 October 2011 07:03, Mapred Learn <ma...@gmail.com> wrote:
> > >
> > > > Hi,
> > > > I am trying to create output files of fixed size by using:
> > > > -Dmapred.max.split.size=6442450812 (about 6 GB)
> > > >
> > > > But the problem is that the input data size and metadata vary, and I
> > > > have to adjust the above value manually to achieve a fixed size.
> > > >
> > > > Is there a way I can programmatically determine a split size that
> > > > would yield fixed-size output files, e.g. 200 MB each?
> > > >
> > > > Thanks,
> > > > JJ
> > > >
> > >
> >
>
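
Editor's note: the size-based rollover Shevek describes above can be pictured
with a small sketch. This is plain Python, not a Hadoop OutputFormat; the
names RollingWriter, write_all and demo are hypothetical, and bytes stand in
for megabytes so the numbers match JJ's 200 MB example. It assumes the writer
only switches files between records.

```python
import os
import tempfile


class RollingWriter:
    """Write records to part files, starting a new file once a size limit is reached."""

    def __init__(self, directory, prefix, limit_bytes):
        self.directory = directory
        self.prefix = prefix
        self.limit_bytes = limit_bytes
        self.index = 0
        self.written = 0
        self.handle = None
        self.paths = []

    def _open_next(self):
        # Each rollover gets a fresh, numbered file name.
        path = os.path.join(self.directory, "%s-%05d" % (self.prefix, self.index))
        self.handle = open(path, "wb")
        self.paths.append(path)
        self.index += 1
        self.written = 0

    def write(self, record):
        if self.handle is None:
            self._open_next()
        self.handle.write(record)
        self.written += len(record)
        # Roll over only between records, so a file can overshoot the
        # limit by at most one record.
        if self.written >= self.limit_bytes:
            self.handle.close()
            self.handle = None

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None


def write_all(directory, prefix, records, limit_bytes):
    """Write every record through one RollingWriter and return the file sizes."""
    writer = RollingWriter(directory, prefix, limit_bytes)
    for record in records:
        writer.write(record)
    writer.close()
    return [os.path.getsize(path) for path in writer.paths]


def demo():
    record = b"x" * 10
    limit = 200
    with tempfile.TemporaryDirectory() as tmp:
        # Map-only job: each mapper rolls its own output and leaves its own remainder.
        mapper_1 = write_all(tmp, "m1", [record] * 25, limit)
        mapper_2 = write_all(tmp, "m2", [record] * 21, limit)
        # Single reducer: one writer sees all records, so one remainder overall.
        reducer = write_all(tmp, "r", [record] * 46, limit)
    return mapper_1, mapper_2, reducer
```

Under these assumptions the two "mappers" produce files of 200 and 50, and 200
and 10 units, while the single "reducer" produces 200, 200 and 60: a leftover
file per writer either way, which is the "small bits left over" caveat.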

Re: How to create Output files of about fixed size

Posted by Mapred Learn <ma...@gmail.com>.
Hi Bejoy,
This is what I tried initially, but in this case just running the job over
5 GB of input takes more than an hour, as the RecordReader is LineRecordReader
and the offset is around 64 MB. It is making performance really bad.

Thanks,
Anurag Tangri



On Wed, Dec 21, 2011 at 12:13 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi JJ,
> If you use the default TextInputFormat, it won't do the job, as it
> generates at least one split for each file. So in your case there would be
> a minimum of 78 splits, since there are that many input files, hence 78
> mappers and the same 78 output files. You need to use
> CombineFileInputFormat to combine multiple files into a single split. You
> also need to set mapred.max.split.size to the required output file size.
>
> So in short, if you require 1 GB output files, your aggregation map-only
> job should contain the following arguments:
> -D mapred.input.format.class=org.apache. .... .CombineFileInputFormat
> -D mapred.max.split.size=1073741824
> -D mapred.reduce.tasks=0
>
> Hope it helps!..
>
> Regards
> Bejoy.K.S

Re: How to create Output files of about fixed size

Posted by Bejoy Ks <be...@gmail.com>.
Hi JJ,
If you use the default TextInputFormat, it won't do the job, as it
generates at least one split for each file. So in your case there would be
a minimum of 78 splits, since there are that many input files, hence 78
mappers and the same 78 output files. You need to use
CombineFileInputFormat to combine multiple files into a single split. You
also need to set mapred.max.split.size to the required output file size.
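
Editor's note: the per-file behaviour described above can be modelled roughly
as follows. This is a simplified Python sketch, not Hadoop code. The split
size formula max(min, min(max, block)), the 1.1 slop factor and the 64 MB
block size are assumptions modelled on how FileInputFormat is usually
described; split_size and count_splits are hypothetical names.

```python
MB = 1024 * 1024


def split_size(min_size, max_size, block_size):
    # Assumed formula: the minimum wins over the block size, the maximum caps it.
    return max(min_size, min(max_size, block_size))


def count_splits(file_sizes, min_size, max_size, block_size, slop=1.1):
    """Count splits when every file is split on its own and never merged."""
    size = split_size(min_size, max_size, block_size)
    total = 0
    for remaining in file_sizes:
        # Carve off full-size splits while clearly more than one split remains.
        while remaining / size > slop:
            total += 1
            remaining -= size
        # Whatever is left of this file becomes its own, possibly small, split.
        if remaining > 0:
            total += 1
    return total


files = [15 * MB] * 78   # the 78 files of 15 MB from the thread
no_max = 2 ** 63 - 1     # stand-in for "no maximum configured"
block = 64 * MB          # assumed HDFS block size

default_splits = count_splits(files, 1, no_max, block)
big_min_splits = count_splits(files, 1073741824, no_max, block)
```

In this model both default_splits and big_min_splits come out as 78: raising
the minimum split size makes the target split larger, but a split still cannot
reach past the end of its 15 MB file, so the number of mappers is unchanged.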

So in short, if you require 1 GB output files, your aggregation map-only
job should contain the following arguments:
-D mapred.input.format.class=org.apache. .... .CombineFileInputFormat
-D mapred.max.split.size=1073741824
-D mapred.reduce.tasks=0
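
Editor's note: to see what combining files into splits would mean for the
78 files of 15 MB in this thread, here is a simplified Python sketch. It is
not the CombineFileInputFormat algorithm, which also takes node and rack
locality into account and may place split boundaries differently; pack_files
is a hypothetical helper that packs whole files greedily without exceeding
the maximum split size.

```python
MB = 1024 * 1024


def pack_files(file_sizes, max_split_size):
    """Greedily group whole files into splits of at most max_split_size bytes."""
    splits = []
    current = []
    current_size = 0
    for size in file_sizes:
        # Close the current split if this file would push it past the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current = []
            current_size = 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


files = [15 * MB] * 78                   # the 78 files of 15 MB from the thread
splits = pack_files(files, 1073741824)   # same value as mapred.max.split.size above
files_per_split = [len(split) for split in splits]
mb_per_split = [sum(split) // MB for split in splits]
```

Under these assumptions the 78 files fall into two splits, one of 68 files
(1020 MB) and one of 10 files (150 MB), so a map-only job would write two
output files. With only about 1.1 GB of input in total, a second, smaller
file is unavoidable.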

Hope it helps!..

Regards
Bejoy.K.S
