Posted to user@pig.apache.org by Aniket Mokashi <am...@andrew.cmu.edu> on 2011/04/01 01:02:03 UTC

Re: Custom Storage Functions - MultiStorage

In my opinion, MultiStorage should work just fine if you have a small
number of buckets (0-100+; I'm not sure about the limit, but definitely
not 512), even if you have a large number of records in one bucket.
But I think this method is error-prone when tasks fail. I think a more
scalable way is to generate files with tagged names and then move them
into one directory.
If you take a bag of grouped tuples and change your partitioner so that
more than one reducer writes into one directory, it should work too. But
this is only useful if your bucket sizes are uniformly distributed (and
again, there is a limit on the number of buckets).
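
A minimal sketch of the "tagged names, then move" idea, in Python against
the local filesystem as a stand-in for HDFS. The naming scheme
`<bucket>__<part>`, the function name `collect_tagged_files`, and the
`sep` parameter are assumptions made for illustration; nothing here is
part of MultiStorage or Pig itself.

```python
import shutil
from pathlib import Path


def collect_tagged_files(staging_dir, final_dir, sep="__"):
    """Move files named '<bucket><sep><part>' to final_dir/<bucket>/<part>.

    The tagging scheme is hypothetical: each task is assumed to have
    written its output into one flat staging directory, prefixing the
    file name with the bucket key.
    """
    staging = Path(staging_dir)
    final = Path(final_dir)
    moved = {}
    for src in sorted(staging.iterdir()):
        if not src.is_file() or sep not in src.name:
            continue  # ignore anything that does not follow the scheme
        bucket, part = src.name.split(sep, 1)
        dest_dir = final / bucket
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest_dir / part))
        moved.setdefault(bucket, []).append(part)
    return moved
```

Because each task only ever has its own output files open and the
grouping into directories happens in a separate step afterwards, a failed
task can be rerun without leaving half-written files in the final
directories. On a real cluster the move would presumably be a filesystem
rename rather than a local `shutil.move`.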

~Aniket
On Thu, March 31, 2011 5:17 pm, Dmitriy Ryaboy wrote:
> I think the problem there is # of unique keys -- one winds up creating
> way too many filehandles all at the same time. I may be misunderstanding
> the nature of the bug. If I do understand it correctly, it's endemic to the
> whole concept of MultiStorage; creating 7K files * # reducers sounds like
> a really bad thing to do; if you are running into the problem, you
> probably shouldn't be using MultiStorage.
>
>
> Or am I misreading what's happening?
>
>
> D
>
>
> On Thu, Mar 31, 2011 at 9:12 AM, Jonathan Holloway <
> jonathan.holloway@gmail.com> wrote:
>
>> Hi all,
>>
>>
>> I'm working with some data at the moment, for which I needed to
>> generate multiple reports for a given set of data grouped by name. I
>> wasn't initially sure how to do this. I came across MultiStorage
>> in Pig contrib, but I'm a little worried about the 7k limit there at
>> the moment due to a bug:
>>
>> https://issues.apache.org/jira/browse/PIG-1547
>>
>>
>> Does anybody know what the issue here is? I can take a look at this if
>> necessary, if someone can point me in the right direction in terms of
>> fixing it. I've currently hacked MultiStorage to take a bag and the
>> contained tuples and spit out the tuples with a tab delimiter between
>> them. Is this the best way to go?
>>
>> Just looking for some feedback.
>>
>>
>> Cheers,
>> Jon.
>>
>>
>