Posted to user@pig.apache.org by Thomas Kappler <tk...@googlemail.com> on 2011/06/16 09:38:56 UTC

MultiStorage for many key values

Hi all,

piggybank.storage.MultiStorage allows storing the Pig output into
different directories, taken from a given field in a relation, so that
the output is partitioned by the unique values of that field.

This is just what I need for my use-case. However, I have about 50,000
unique values in the partitioning field. It seems that MultiStorage
will run one reducer per unique value, i.e., per output directory.
Obviously, this takes a long time.

Is there a better way of doing it?

I could group by the partitioning field and write a post-processing
script to go through the Pig output and write each line to a different
file. It would be simple, but I'd prefer to do it all in Pig for
consistency.
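
For illustration, a minimal Python sketch of such a post-processing
step. It assumes tab-separated output whose lines are already grouped by
the partitioning field (here the first column) and key values that are
safe to use as file names; the function name and file layout are
hypothetical, not part of Pig or piggybank.

```python
import itertools
import os


def split_grouped_lines(lines, out_dir, field_index=0, sep="\t"):
    """Write each line to out_dir/<key>, where key is one field of the line.

    Because the input is assumed to be grouped by key, only one output
    file is open at a time, even with tens of thousands of keys.
    """
    os.makedirs(out_dir, exist_ok=True)

    def key_of(line):
        return line.rstrip("\n").split(sep)[field_index]

    non_empty = (line for line in lines if line.strip())
    written = []
    for key, group in itertools.groupby(non_empty, key=key_of):
        # Append mode keeps the result correct if a key reappears later.
        with open(os.path.join(out_dir, key), "a") as out:
            for line in group:
                out.write(line if line.endswith("\n") else line + "\n")
        written.append(key)
    return written
```

Called on an open file of Pig output, e.g.
split_grouped_lines(open("part-r-00000"), "by_key"), this produces one
file per distinct key under by_key/.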

Thanks,
Thomas

Re: MultiStorage for many key values

Posted by Xiaomeng Wan <sh...@gmail.com>.
We used to take the first character of the partition field and run
MultiStorage on that.
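
The idea can be sketched in Python as follows: derive a coarse key from
the first character of the partition field, so the number of output
directories is bounded by the number of distinct first characters rather
than the number of distinct values. The helper name and the "_" fallback
for empty values are assumptions made for this illustration.

```python
from collections import Counter


def bucket_key(value, default="_"):
    """Return the first character of value, or default if value is empty."""
    return value[0] if value else default


# Hypothetical partition-field values, for illustration only.
values = ["alice", "adam", "bob", "carol", "chris", ""]

# Number of records that would land in each coarse output directory.
bucket_sizes = Counter(bucket_key(v) for v in values)
```

Here six values collapse into four buckets ("a", "b", "c", "_"). The
trade-off is that each directory then mixes several original keys.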

Shawn

On Fri, Jun 17, 2011 at 4:18 AM, Thomas Kappler <tk...@googlemail.com> wrote:
> On Thu, Jun 16, 2011 at 20:00, Daniel Dai <ji...@yahoo-inc.com> wrote:
>> Try custom partitioner:
>> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby
>
> AFAIK "partition by" maps to the Hadoop partitioning, which is about
> what keys go to which reducer, which is a different problem.
>
> Hadoop In Action chapter 7.2 addresses partitioning into multiple
> output files, and highlights this difference. The book shows a custom
> implementation of MultipleOutputFormat as a solution.
>
> Thomas

Re: MultiStorage for many key values

Posted by Thomas Kappler <tk...@googlemail.com>.
On Thu, Jun 16, 2011 at 20:00, Daniel Dai <ji...@yahoo-inc.com> wrote:
> Try custom partitioner:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby

AFAIK "partition by" maps to the Hadoop partitioning, which is about
what keys go to which reducer, which is a different problem.

Hadoop In Action chapter 7.2 addresses partitioning into multiple
output files, and highlights this difference. The book shows a custom
implementation of MultipleOutputFormat as a solution.
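
To make the distinction concrete, a small Python sketch of the two
roles. This is a conceptual illustration, not Hadoop or Pig code: the
function names and the path layout are hypothetical, and crc32 stands in
for whatever hash a real partitioner would use.

```python
import zlib


def reducer_for(key, num_reducers):
    """Partitioner role: decide which reducer receives a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def output_path_for(key, base_dir):
    """Multiple-output role: decide which file a record is written to."""
    return "%s/%s/part-00000" % (base_dir, key)


keys = ["k%d" % i for i in range(1000)]

# Many keys share a reducer, yet every key still has its own output path.
reducers_used = {reducer_for(k, 4) for k in keys}
paths = {output_path_for(k, "/out") for k in keys}
```

With four reducers, reducers_used contains at most four values while
paths contains one entry per key, which is why changing the partitioner
alone does not change how many output files are produced.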

Thomas



Re: MultiStorage for many key values

Posted by Jameson Li <ho...@gmail.com>.
I have the same question as Thomas Kappler.
It would be helpful if someone could explain in more detail the
'custom partitioner' that Daniel Dai mentioned.
I think the documentation at 'piglatin_ref2.html#partitionby' is too brief.


2011/6/17 Daniel Dai <ji...@yahoo-inc.com>

> Try custom partitioner:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby
>
> Daniel

Re: MultiStorage for many key values

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Try custom partitioner: 
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby

Daniel
