You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by hc busy <hc...@gmail.com> on 2010/06/01 22:44:31 UTC
Re: does EvalFunc generate the entire bag always ?
well, see that's the thing, the 'sort A by $0' is already nlg(n)
ahh, I see, my own example suffers from this problem.
I guess I'm wondering how 'limit' works in conjunction with UDF's... A
practical application escapes me right now, But if I do
C = foreach B{
C1 = MyUdf(B.bag_on_b);
C2 = limit C1 5;
}
does it know to push limit in this case?
On Thu, May 27, 2010 at 2:32 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
> The default case is that a UDFs that take bags (such as COUNT, etc.) are
> handed the entire bag at once. In the case where all UDFs in a foreach
> implement the algebraic interface and the expression itself is algebraic
> than the combiner will be used, thus significantly limiting the size of the
> bag handed to the UDF. The accumulator does hand records to the UDF a few
> thousand at a time. Currently it has no way to turn off the flow of
> records.
>
> What you want might be accomplished by the LIMIT operator, which can be
> used inside a nested foreach. Something like:
>
> C = foreach B {
> C1 = sort A by $0;
> C2 = limit 5 C1;
> generate myUDF(C2);
> }
>
> Alan.
>
>
> On May 26, 2010, at 11:59 AM, hc busy wrote:
>
> Hey, guys, how are Bags passed to EvalFunc stored?
>>
>> I was looking at the Accumulator interface and it says that the reason why
>> this needed for COUNT and SUM is because EvalFunc always gives you the
>> entire bag when the EvalFunc is run on a bag.
>>
>> I always thought if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the code
>> inside that does
>>
>>
>> for(Tuple entry:inputDataBag){
>> .... stuff
>> }
>>
>>
>> was an actual iterator that iterated on the bag sequentially without
>> necessarily having the entire bag in memory all at once. ?? Because it's
>> an
>> iterator, so there's no way to do anything other than to stream through
>> it.
>>
>> I'm looking at this because Accumulator has no way of telling Pig "I've
>> seen
>> enough" It streams through the entire bag no matter what happens. (like,
>> hypothetically speaking, if I was writing "5th item of a sorted bag" udf),
>> after I see 5th of a 5 million entry bag, I want to stop executing if
>> possible.
>>
>> Is there a easy way to make this happen?
>>
>
>
Re: does EvalFunc generate the entire bag always ?
Posted by Alan Gates <ga...@yahoo-inc.com>.
I don't think it pushes limit yet in this case.
Alan.
On Jun 1, 2010, at 1:44 PM, hc busy wrote:
> well, see that's the thing, the 'sort A by $0' is already nlg(n)
>
> ahh, I see, my own example suffers from this problem.
>
> I guess I'm wondering how 'limit' works in conjunction with UDF's... A
> practical application escapes me right now, But if I do
>
> C = foreach B{
> C1 = MyUdf(B.bag_on_b);
> C2 = limit C1 5;
> }
>
> does it know to push limit in this case?
>
>
> On Thu, May 27, 2010 at 2:32 PM, Alan Gates <ga...@yahoo-inc.com>
> wrote:
>
>> The default case is that a UDFs that take bags (such as COUNT,
>> etc.) are
>> handed the entire bag at once. In the case where all UDFs in a
>> foreach
>> implement the algebraic interface and the expression itself is
>> algebraic
>> than the combiner will be used, thus significantly limiting the
>> size of the
>> bag handed to the UDF. The accumulator does hand records to the
>> UDF a few
>> thousand at a time. Currently it has no way to turn off the flow of
>> records.
>>
>> What you want might be accomplished by the LIMIT operator, which
>> can be
>> used inside a nested foreach. Something like:
>>
>> C = foreach B {
>> C1 = sort A by $0;
>> C2 = limit 5 C1;
>> generate myUDF(C2);
>> }
>>
>> Alan.
>>
>>
>> On May 26, 2010, at 11:59 AM, hc busy wrote:
>>
>> Hey, guys, how are Bags passed to EvalFunc stored?
>>>
>>> I was looking at the Accumulator interface and it says that the
>>> reason why
>>> this needed for COUNT and SUM is because EvalFunc always gives you
>>> the
>>> entire bag when the EvalFunc is run on a bag.
>>>
>>> I always thought if I did COUNT(TABLE) or SUM(TABLE.FIELD), and
>>> the code
>>> inside that does
>>>
>>>
>>> for(Tuple entry:inputDataBag){
>>> .... stuff
>>> }
>>>
>>>
>>> was an actual iterator that iterated on the bag sequentially without
>>> necessarily having the entire bag in memory all at once. ??
>>> Because it's
>>> an
>>> iterator, so there's no way to do anything other than to stream
>>> through
>>> it.
>>>
>>> I'm looking at this because Accumulator has no way of telling Pig
>>> "I've
>>> seen
>>> enough" It streams through the entire bag no matter what happens.
>>> (like,
>>> hypothetically speaking, if I was writing "5th item of a sorted
>>> bag" udf),
>>> after I see 5th of a 5 million entry bag, I want to stop executing
>>> if
>>> possible.
>>>
>>> Is there a easy way to make this happen?
>>>
>>
>>