You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/02/03 21:30:02 UTC

the nuts and bolts on how algebraic UDFs are invoked? // would this UDF be efficient?

Howdy, I am curious about how algebraic UDFs are invoked. I know about how
to write them, but in the case where your initial step is to create a costly
datastructure, how can you ensure that this is created as few times as
possible?

What I have in mind is this: I have a huge set of data, and a big one. I
essentially want to join the two. I want to create a balanced binary search
tree of the big set of data, and then do the join against that... I imagine
that you could construct the search tree in every initial algebraic call,
and on the intermediate and final you simply will be passed the constructed
search tress. HOWEVER, if the initial call will be on EVERY row of data, it
might be faster to make it an accumulator? Is there is a difference in this
case?

Would this sort of strategyfor  creating a "join" via UDF be efficient? Is
there a more efficient way I want to avoid implementing something on the
lower pig level if possible (in large part because I have little experience
with parsers etc), but am not against that if I have to (in a way, this
could be a trial run for possible trying to implement some sort of band
join).

I'd appreciate your thoughts
Jon

Re: the nuts and bolts on how algebraic UDFs are invoked? // wouldthis UDF be efficient?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Right.. hence construct in constructor.
UDFs are not static functions, they are processor units. You can build
them as needed. Construction happens once (ish). exec() happens many
times.

D

On Thu, Feb 3, 2011 at 3:54 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> Ok so for context, currently I have a cross of a set size m and of size n. M >> n but both are quite large. N can live in memory, I believe, but m cannot. Currently we use a cross, so runtime mn (of course, done in parallel, but total work done).
>
> Now, if I made the binary search tree, that takes nlogn to make, but then the line to line is logn. So total runtime is (m+n)logn. Since m >>n, though, ithis is better. However, if you have to remake the datastructure on every call, then the runtime is mnlogn -- worse! This is why I need a way to construct the searchg tree as few times as possible.
>
> Sent via BlackBerry
>
> -----Original Message-----
> From: Dmitriy Ryaboy <dv...@gmail.com>
> Date: Thu, 3 Feb 2011 15:39:08
> To: <us...@pig.apache.org>
> Reply-To: user@pig.apache.org
> Subject: Re: the nuts and bolts on how algebraic UDFs are invoked? // would
>  this UDF be efficient?
>
> Construct in the constructors for each of initial, intermediate, final.
>
> initial.exec is called on every tuple, intermediate.exec is called on
> "0 or more" collected bags of results from initial, and final on
> collected bags of results that might be outputs of intermediate, or
> might come straight from initial.
>
> D
>
> On Thu, Feb 3, 2011 at 12:30 PM, Jonathan Coveney <jc...@gmail.com> wrote:
>> Howdy, I am curious about how algebraic UDFs are invoked. I know about how
>> to write them, but in the case where your initial step is to create a costly
>> datastructure, how can you ensure that this is created as few times as
>> possible?
>>
>> What I have in mind is this: I have a huge set of data, and a big one. I
>> essentially want to join the two. I want to create a balanced binary search
>> tree of the big set of data, and then do the join against that... I imagine
>> that you could construct the search tree in every initial algebraic call,
>> and on the intermediate and final you simply will be passed the constructed
>> search tress. HOWEVER, if the initial call will be on EVERY row of data, it
>> might be faster to make it an accumulator? Is there is a difference in this
>> case?
>>
>> Would this sort of strategyfor  creating a "join" via UDF be efficient? Is
>> there a more efficient way I want to avoid implementing something on the
>> lower pig level if possible (in large part because I have little experience
>> with parsers etc), but am not against that if I have to (in a way, this
>> could be a trial run for possible trying to implement some sort of band
>> join).
>>
>> I'd appreciate your thoughts
>> Jon
>>
>

Re: the nuts and bolts on how algebraic UDFs are invoked? // wouldthis UDF be efficient?

Posted by Jonathan Coveney <jc...@gmail.com>.

Ok so for context, currently I have a cross of a set size m and of size n. M >> n but both are quite large. N can live in memory, I believe, but m cannot. Currently we use a cross, so runtime mn (of course, done in parallel, but total work done).

Now, if I made the binary search tree, that takes nlogn to make, but then the line to line is logn. So total runtime is (m+n)logn. Since m >>n, though, ithis is better. However, if you have to remake the datastructure on every call, then the runtime is mnlogn -- worse! This is why I need a way to construct the searchg tree as few times as possible.

Sent via BlackBerry

-----Original Message-----
From: Dmitriy Ryaboy <dv...@gmail.com>
Date: Thu, 3 Feb 2011 15:39:08 
To: <us...@pig.apache.org>
Reply-To: user@pig.apache.org
Subject: Re: the nuts and bolts on how algebraic UDFs are invoked? // would
 this UDF be efficient?

Construct in the constructors for each of initial, intermediate, final.

initial.exec is called on every tuple, intermediate.exec is called on
"0 or more" collected bags of results from initial, and final on
collected bags of results that might be outputs of intermediate, or
might come straight from initial.

D

On Thu, Feb 3, 2011 at 12:30 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> Howdy, I am curious about how algebraic UDFs are invoked. I know about how
> to write them, but in the case where your initial step is to create a costly
> datastructure, how can you ensure that this is created as few times as
> possible?
>
> What I have in mind is this: I have a huge set of data, and a big one. I
> essentially want to join the two. I want to create a balanced binary search
> tree of the big set of data, and then do the join against that... I imagine
> that you could construct the search tree in every initial algebraic call,
> and on the intermediate and final you simply will be passed the constructed
> search tress. HOWEVER, if the initial call will be on EVERY row of data, it
> might be faster to make it an accumulator? Is there is a difference in this
> case?
>
> Would this sort of strategyfor  creating a "join" via UDF be efficient? Is
> there a more efficient way I want to avoid implementing something on the
> lower pig level if possible (in large part because I have little experience
> with parsers etc), but am not against that if I have to (in a way, this
> could be a trial run for possible trying to implement some sort of band
> join).
>
> I'd appreciate your thoughts
> Jon
>

Re: the nuts and bolts on how algebraic UDFs are invoked? // would this UDF be efficient?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Construct in the constructors for each of initial, intermediate, final.

initial.exec is called on every tuple, intermediate.exec is called on
"0 or more" collected bags of results from initial, and final on
collected bags of results that might be outputs of intermediate, or
might come straight from initial.

D

On Thu, Feb 3, 2011 at 12:30 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> Howdy, I am curious about how algebraic UDFs are invoked. I know about how
> to write them, but in the case where your initial step is to create a costly
> datastructure, how can you ensure that this is created as few times as
> possible?
>
> What I have in mind is this: I have a huge set of data, and a big one. I
> essentially want to join the two. I want to create a balanced binary search
> tree of the big set of data, and then do the join against that... I imagine
> that you could construct the search tree in every initial algebraic call,
> and on the intermediate and final you simply will be passed the constructed
> search tress. HOWEVER, if the initial call will be on EVERY row of data, it
> might be faster to make it an accumulator? Is there is a difference in this
> case?
>
> Would this sort of strategyfor  creating a "join" via UDF be efficient? Is
> there a more efficient way I want to avoid implementing something on the
> lower pig level if possible (in large part because I have little experience
> with parsers etc), but am not against that if I have to (in a way, this
> could be a trial run for possible trying to implement some sort of band
> join).
>
> I'd appreciate your thoughts
> Jon
>