Posted to user@pig.apache.org by Zach Bailey <za...@dataclip.com> on 2010/12/21 23:52:22 UTC

Controlling resulting file size?

 Does anyone know of an existing StoreFunc that lets me specify a maximum output file size? Or would I need to write a custom StoreFunc to do this?


I am running into a problem on Amazon's EMR where the files the reducers are writing are too large to be uploaded to S3 (5GB limit per file) and I need to figure out a way to get the output file sizes down into a reasonable range.


The other way would be to fire up more machines, which would provide more reducers, meaning the data is split into more files, yielding smaller files. But I want the resulting files to be capped at some reasonable size (50-100 MB) so they are easy to pull down, inspect, and test with.


Any ideas?
-Zach



Re: Controlling resulting file size?

Posted by Andrew Hitchcock <ad...@gmail.com>.
Zach, as a follow-up, you can now use multipart upload to create files
larger than 5 GB using EMR. You have to enable this explicitly,
however. The documentation for the feature and how to enable it can
be found here:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?UsingEMR_Config.html#Config_Multipart

Regards,
Andrew

On Tue, Dec 21, 2010 at 5:10 PM, Andrew Hitchcock <ad...@gmail.com> wrote:
> Zach,
>
> I work on the Elastic MapReduce team. We are planning to launch
> support for multipart upload into Amazon S3 in early January. This
> will enable you to write files into Amazon S3 from your reducer that
> are up to 5 TB in size.
>
> In the meantime, Dmitriy's advice should work. Increase the number of
> reducers and each reducer will process and write less data. This will
> work unless you have a very uneven data distribution.
>
> Regards,
> Andrew
>
> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <za...@dataclip.com> wrote:
>>  Does anyone know of any existing StoreFunc to specify a maximum output file size? Or would I need to write a custom StoreFunc to do this?
>>
>>
>> I am running into a problem on Amazon's EMR where the files the reducers are writing are too large to be uploaded to S3 (5GB limit per file) and I need to figure out a way to get the output file sizes down into a reasonable range.
>>
>>
>> The other way would be to fire up more machines, which would provide more reducers, meaning the data is split into more files, yielding smaller files. But I want the resulting files to be split on some reasonable file size (50 - 100MB) so they are friendly for pulling down, inspecting, and testing with.
>>
>>
>> Any ideas?
>> -Zach
>>
>>
>>
>

Re: Controlling resulting file size?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yes, IIRC the mapred.tasktracker.* settings are TaskTracker-specific, not
job-specific, and require the TaskTracker to be started with the desired values.

-D

On Tue, Dec 21, 2010 at 7:03 PM, Zach Bailey <za...@dataclip.com> wrote:

> Ah, good point. This is what the pig PARALLEL keyword does IIRC...
>
> However, for the mapred.tasktracker.reduce.tasks.maximum variable that will
> require a bootstrap script parameter, right?
>
> -Zach
>
> On Tue, Dec 21, 2010 at 9:56 PM, Andrew Hitchcock <adpowers@gmail.com> wrote:
>
> > What you could do is increase the number of reducers your job runs and
> > at the same time decrease the number of reducers that each machine
> > runs concurrently. The settings for that are:
> >
> > mapred.reduce.tasks (increase this one)
> > mapred.tasktracker.reduce.tasks.maximum (decrease this one)
> >
> > Andrew
> >
> > On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <za...@dataclip.com>
> > wrote:
> > > Thank you very much both Dmitriy and Andrew.
> > >
> > > Unfortunately I'm stuck in a bit of a bind. Specifying additional
> > reducers
> > > is a problem because the workload I have is very reduce heavy. So
> > > unfortunately I was running into memory problems exactly as described
> in
> > > this thread:
> > >
> > > https://forums.aws.amazon.com/thread.jspa?threadID=49024
> > >
> > > I ended up having to bump my EMR slave instances up to m2.xlarge
> > instances
> > > to handle the memory pressure.
> > >
> > > Since this is running on EMR I can of course opt to throw more machines
> > at
> > > the whole thing. Correct me if I'm wrong but that will hopefully solve
> > both
> > > problems at the same time, although it doesn't get my output files down
> > to
> > > the ideal size I was hoping for. The task was running on 4x m2.xlarge
> > > instances (after failing on 4x m1.large and 8x c1.medium), I think the
> > next
> > > run I'll try doubling up to 8x m1.large and hopefully that will be
> enough
> > > reduce slots to keep the file size down and avoid the memory pressure
> > > problem.
> > >
> > > -Zach
> > >
> > > On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <adpowers@gmail.com> wrote:
> > >
> > >> Zach,
> > >>
> > >> I work on the Elastic MapReduce team. We are planning to launch
> > >> support for multipart upload into Amazon S3 in early January. This
> > >> will enable you to write files into Amazon S3 from your reducer that
> > >> are up to 5 TB in size.
> > >>
> > >> In the meantime, Dmitriy's advice should work. Increase the number of
> > >> reducers and each reducer will process and write less data. This will
> > >> work unless you have a very uneven data distribution.
> > >>
> > >> Regards,
> > >> Andrew
> > >>
> > >> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <
> zach.bailey@dataclip.com>
> > >> wrote:
> > >> >  Does anyone know of any existing StoreFunc to specify a maximum
> > output
> > >> file size? Or would I need to write a custom StoreFunc to do this?
> > >> >
> > >> >
> > >> > I am running into a problem on Amazon's EMR where the files the
> > reducers
> > >> are writing are too large to be uploaded to S3 (5GB limit per file)
> and
> > I
> > >> need to figure out a way to get the output file sizes down into a
> > reasonable
> > >> range.
> > >> >
> > >> >
> > >> > The other way would be to fire up more machines, which would provide
> > more
> > >> reducers, meaning the data is split into more files, yielding smaller
> > files.
> > >> But I want the resulting files to be split on some reasonable file
> size
> > (50
> > >> - 100MB) so they are friendly for pulling down, inspecting, and
> testing
> > >> with.
> > >> >
> > >> >
> > >> > Any ideas?
> > >> > -Zach
> > >> >
> > >> >
> > >> >
> > >>
> > >
> >
>

Re: Controlling resulting file size?

Posted by Zach Bailey <za...@dataclip.com>.
Ah, good point. This is what the Pig PARALLEL keyword does, IIRC...

However, the mapred.tasktracker.reduce.tasks.maximum setting will
require a bootstrap script parameter, right?

-Zach

On Tue, Dec 21, 2010 at 9:56 PM, Andrew Hitchcock <ad...@gmail.com> wrote:

> What you could do is increase the number of reducers your job runs and
> at the same time decrease the number of reducers that each machine
> runs concurrently. The settings for that are:
>
> mapred.reduce.tasks (increase this one)
> mapred.tasktracker.reduce.tasks.maximum (decrease this one)
>
> Andrew
>
> On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <za...@dataclip.com>
> wrote:
> > Thank you very much both Dmitriy and Andrew.
> >
> > Unfortunately I'm stuck in a bit of a bind. Specifying additional
> reducers
> > is a problem because the workload I have is very reduce heavy. So
> > unfortunately I was running into memory problems exactly as described in
> > this thread:
> >
> > https://forums.aws.amazon.com/thread.jspa?threadID=49024
> >
> > I ended up having to bump my EMR slave instances up to m2.xlarge
> instances
> > to handle the memory pressure.
> >
> > Since this is running on EMR I can of course opt to throw more machines
> at
> > the whole thing. Correct me if I'm wrong but that will hopefully solve
> both
> > problems at the same time, although it doesn't get my output files down
> to
> > the ideal size I was hoping for. The task was running on 4x m2.xlarge
> > instances (after failing on 4x m1.large and 8x c1.medium), I think the
> next
> > run I'll try doubling up to 8x m1.large and hopefully that will be enough
> > reduce slots to keep the file size down and avoid the memory pressure
> > problem.
> >
> > -Zach
> >
> > On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <adpowers@gmail.com> wrote:
> >
> >> Zach,
> >>
> >> I work on the Elastic MapReduce team. We are planning to launch
> >> support for multipart upload into Amazon S3 in early January. This
> >> will enable you to write files into Amazon S3 from your reducer that
> >> are up to 5 TB in size.
> >>
> >> In the meantime, Dmitriy's advice should work. Increase the number of
> >> reducers and each reducer will process and write less data. This will
> >> work unless you have a very uneven data distribution.
> >>
> >> Regards,
> >> Andrew
> >>
> >> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <za...@dataclip.com>
> >> wrote:
> >> >  Does anyone know of any existing StoreFunc to specify a maximum
> output
> >> file size? Or would I need to write a custom StoreFunc to do this?
> >> >
> >> >
> >> > I am running into a problem on Amazon's EMR where the files the
> reducers
> >> are writing are too large to be uploaded to S3 (5GB limit per file) and
> I
> >> need to figure out a way to get the output file sizes down into a
> reasonable
> >> range.
> >> >
> >> >
> >> > The other way would be to fire up more machines, which would provide
> more
> >> reducers, meaning the data is split into more files, yielding smaller
> files.
> >> But I want the resulting files to be split on some reasonable file size
> (50
> >> - 100MB) so they are friendly for pulling down, inspecting, and testing
> >> with.
> >> >
> >> >
> >> > Any ideas?
> >> > -Zach
> >> >
> >> >
> >> >
> >>
> >
>

Re: Controlling resulting file size?

Posted by Andrew Hitchcock <ad...@gmail.com>.
What you could do is increase the number of reducers your job runs and
at the same time decrease the number of reducers that each machine
runs concurrently. The settings for that are:

mapred.reduce.tasks (increase this one)
mapred.tasktracker.reduce.tasks.maximum (decrease this one)

Andrew
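
For anyone driving this from their own Hadoop job rather than a Pig script, here
is a minimal sketch of where those two knobs live. The class name and the value
64 below are made up for illustration; in Pig itself the reducer count is what
the PARALLEL keyword controls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MoreReducersExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Per-job knob: ask for more reducers so each one writes a smaller file.
        conf.setInt("mapred.reduce.tasks", 64);

        Job job = new Job(conf, "more-reducers-example");
        job.setNumReduceTasks(64); // the typed way to request the same thing

        // mapred.tasktracker.reduce.tasks.maximum is deliberately not set here:
        // each TaskTracker reads it at startup, so it belongs in the cluster
        // (bootstrap) configuration, not in the job conf -- see Dmitriy's note
        // elsewhere in this thread.
        System.out.println("Reducers requested: "
                + job.getConfiguration().getInt("mapred.reduce.tasks", -1));
    }
}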

On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <za...@dataclip.com> wrote:
> Thank you very much both Dmitriy and Andrew.
>
> Unfortunately I'm stuck in a bit of a bind. Specifying additional reducers
> is a problem because the workload I have is very reduce heavy. So
> unfortunately I was running into memory problems exactly as described in
> this thread:
>
> https://forums.aws.amazon.com/thread.jspa?threadID=49024
>
> I ended up having to bump my EMR slave instances up to m2.xlarge instances
> to handle the memory pressure.
>
> Since this is running on EMR I can of course opt to throw more machines at
> the whole thing. Correct me if I'm wrong but that will hopefully solve both
> problems at the same time, although it doesn't get my output files down to
> the ideal size I was hoping for. The task was running on 4x m2.xlarge
> instances (after failing on 4x m1.large and 8x c1.medium), I think the next
> run I'll try doubling up to 8x m1.large and hopefully that will be enough
> reduce slots to keep the file size down and avoid the memory pressure
> problem.
>
> -Zach
>
> On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <ad...@gmail.com> wrote:
>
>> Zach,
>>
>> I work on the Elastic MapReduce team. We are planning to launch
>> support for multipart upload into Amazon S3 in early January. This
>> will enable you to write files into Amazon S3 from your reducer that
>> are up to 5 TB in size.
>>
>> In the meantime, Dmitriy's advice should work. Increase the number of
>> reducers and each reducer will process and write less data. This will
>> work unless you have a very uneven data distribution.
>>
>> Regards,
>> Andrew
>>
>> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <za...@dataclip.com>
>> wrote:
>> >  Does anyone know of any existing StoreFunc to specify a maximum output
>> file size? Or would I need to write a custom StoreFunc to do this?
>> >
>> >
>> > I am running into a problem on Amazon's EMR where the files the reducers
>> are writing are too large to be uploaded to S3 (5GB limit per file) and I
>> need to figure out a way to get the output file sizes down into a reasonable
>> range.
>> >
>> >
>> > The other way would be to fire up more machines, which would provide more
>> reducers, meaning the data is split into more files, yielding smaller files.
>> But I want the resulting files to be split on some reasonable file size (50
>> - 100MB) so they are friendly for pulling down, inspecting, and testing
>> with.
>> >
>> >
>> > Any ideas?
>> > -Zach
>> >
>> >
>> >
>>
>

Re: Controlling resulting file size?

Posted by Zach Bailey <za...@dataclip.com>.
Thank you both very much, Dmitriy and Andrew.

Unfortunately I'm stuck in a bit of a bind. Specifying additional reducers
is a problem because my workload is very reduce-heavy, so I was running
into memory problems exactly as described in this thread:

https://forums.aws.amazon.com/thread.jspa?threadID=49024

I ended up having to bump my EMR slave instances up to m2.xlarge instances
to handle the memory pressure.

Since this is running on EMR I can of course opt to throw more machines at
the whole thing. Correct me if I'm wrong, but that will hopefully solve both
problems at the same time, although it doesn't get my output files down to
the ideal size I was hoping for. The job was running on 4x m2.xlarge
instances (after failing on 4x m1.large and 8x c1.medium). For the next run
I think I'll try doubling up to 8x m1.large, and hopefully that will provide
enough reduce slots to keep the file sizes down and avoid the memory
pressure problem.

-Zach

On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <ad...@gmail.com> wrote:

> Zach,
>
> I work on the Elastic MapReduce team. We are planning to launch
> support for multipart upload into Amazon S3 in early January. This
> will enable you to write files into Amazon S3 from your reducer that
> are up to 5 TB in size.
>
> In the meantime, Dmitriy's advice should work. Increase the number of
> reducers and each reducer will process and write less data. This will
> work unless you have a very uneven data distribution.
>
> Regards,
> Andrew
>
> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <za...@dataclip.com>
> wrote:
> >  Does anyone know of any existing StoreFunc to specify a maximum output
> file size? Or would I need to write a custom StoreFunc to do this?
> >
> >
> > I am running into a problem on Amazon's EMR where the files the reducers
> are writing are too large to be uploaded to S3 (5GB limit per file) and I
> need to figure out a way to get the output file sizes down into a reasonable
> range.
> >
> >
> > The other way would be to fire up more machines, which would provide more
> reducers, meaning the data is split into more files, yielding smaller files.
> But I want the resulting files to be split on some reasonable file size (50
> - 100MB) so they are friendly for pulling down, inspecting, and testing
> with.
> >
> >
> > Any ideas?
> > -Zach
> >
> >
> >
>

Re: Controlling resulting file size?

Posted by Andrew Hitchcock <ad...@gmail.com>.
Zach,

I work on the Elastic MapReduce team. We are planning to launch
support for multipart upload into Amazon S3 in early January. This
will enable you to write files into Amazon S3 from your reducer that
are up to 5 TB in size.

In the meantime, Dmitriy's advice should work. Increase the number of
reducers and each reducer will process and write less data. This will
work unless you have a very uneven data distribution.

Regards,
Andrew

On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <za...@dataclip.com> wrote:
>  Does anyone know of any existing StoreFunc to specify a maximum output file size? Or would I need to write a custom StoreFunc to do this?
>
>
> I am running into a problem on Amazon's EMR where the files the reducers are writing are too large to be uploaded to S3 (5GB limit per file) and I need to figure out a way to get the output file sizes down into a reasonable range.
>
>
> The other way would be to fire up more machines, which would provide more reducers, meaning the data is split into more files, yielding smaller files. But I want the resulting files to be split on some reasonable file size (50 - 100MB) so they are friendly for pulling down, inspecting, and testing with.
>
>
> Any ideas?
> -Zach
>
>
>

Re: Controlling resulting file size?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I don't know of anything that would give you this out of the box; you'd have
to write your own OutputFormat + StoreFunc.
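
A rough, untested sketch of that rolling idea, for anyone who goes down that
road. Everything here (the class name, the 100 MB cap, the Text-only value
type) is hypothetical, and a real solution would still need a StoreFunc that
returns this OutputFormat from getOutputFormat() and serializes each Tuple to
a Text line in putNext().

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical OutputFormat that closes the current part file and opens a new
// one once roughly MAX_BYTES_PER_FILE bytes have been written.
public class SizeCappedTextOutputFormat extends FileOutputFormat<NullWritable, Text> {

    // ~100 MB per part file; a real version would read this from the job conf
    private static final long MAX_BYTES_PER_FILE = 100L * 1024 * 1024;
    private static final byte[] NEWLINE = "\n".getBytes();

    @Override
    public RecordWriter<NullWritable, Text> getRecordWriter(final TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new RecordWriter<NullWritable, Text>() {
            private FSDataOutputStream out;
            private long bytesWritten = 0;
            private int fileSeq = 0;

            @Override
            public void write(NullWritable key, Text value) throws IOException {
                // start a new file on the first record and whenever the cap is hit
                if (out == null || bytesWritten >= MAX_BYTES_PER_FILE) {
                    roll();
                }
                out.write(value.getBytes(), 0, value.getLength());
                out.write(NEWLINE);
                bytesWritten += value.getLength() + NEWLINE.length;
            }

            // close the current part and open another with a numeric suffix
            private void roll() throws IOException {
                if (out != null) {
                    out.close();
                }
                Path part = getDefaultWorkFile(context, "-" + fileSeq++);
                FileSystem fs = part.getFileSystem(context.getConfiguration());
                out = fs.create(part, false);
                bytesWritten = 0;
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                if (out != null) {
                    out.close();
                }
            }
        };
    }
}

Using getDefaultWorkFile() keeps every rolled part inside the task's temporary
work directory, so the normal output committer should still promote all of the
parts when the task succeeds.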

As far as firing up more machines goes -- you don't really need to; you can
just increase the parallelism of your job. If you ask for more reducers than
you have reduce slots in the cluster, they will get scheduled in waves
instead of all at the same time, but they will all get through.

D

On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <za...@dataclip.com> wrote:

>  Does anyone know of any existing StoreFunc to specify a maximum output
> file size? Or would I need to write a custom StoreFunc to do this?
>
>
> I am running into a problem on Amazon's EMR where the files the reducers
> are writing are too large to be uploaded to S3 (5GB limit per file) and I
> need to figure out a way to get the output file sizes down into a reasonable
> range.
>
>
> The other way would be to fire up more machines, which would provide more
> reducers, meaning the data is split into more files, yielding smaller files.
> But I want the resulting files to be split on some reasonable file size (50
> - 100MB) so they are friendly for pulling down, inspecting, and testing
> with.
>
>
> Any ideas?
> -Zach
>
>
>