Posted to user@pig.apache.org by Cheolsoo Park <pi...@gmail.com> on 2013/12/02 05:31:38 UTC

Re: Changing pig.maxCombinedSplitSize dynamically in single run

Unfortunately, no. The settings are script-wide. Can you add an order-by
before storing your output and set its parallel to a smaller number? That
will force a reduce phase and combine small files. Of course, it will add
extra MR jobs.


On Sat, Nov 30, 2013 at 9:20 AM, Something Something <
mailinglists19@gmail.com> wrote:

> Is there a way in Pig to change this configuration
> (pig.maxCombinedSplitSize) at different steps inside the *same* Pig script?
>
> For example, when I am LOADing the data I want this value to be low so that
> we use the block size effectively & many mappers get triggered. (Otherwise,
> the job takes too long).
>
> But later when I SPLIT my output, I want split size to be large so we don't
> create 4000 small output files.  (SPLIT is a mapper only task).
>
> Is there a way to accomplish this?
>

Re: Changing pig.maxCombinedSplitSize dynamically in single run

Posted by Something Something <ma...@gmail.com>.
Adding ORDER BY is what I have done: basically, ordering by the same field
that I am splitting by. This field has the same value on all rows, so
essentially there's nothing to order! But this feels kludgy, which is why I
asked. Thanks.
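The kludge described above might look like this (all names are hypothetical; 'region' stands in for the field that holds one identical value on every row of a branch):

```pig
-- Route rows into branches by region.
SPLIT events INTO us IF region == 'us', eu IF region == 'eu';

-- 'region' is constant within this branch, so the ORDER BY does no
-- real sorting; it exists only to force a reduce phase that merges
-- the many small map outputs into PARALLEL (here 5) files.
us_merged = ORDER us BY region PARALLEL 5;
STORE us_merged INTO '/out/us';
```

Since the sort key is constant, the extra job does little work beyond shuffling rows to the reducers, but it is still an extra job per branch, which is what makes the approach feel kludgy.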

