Posted to user@pig.apache.org by Joshua Penton <Jo...@geocent.com> on 2012/10/17 00:05:36 UTC

Group Data By UDF Result?

Greetings.

I currently have two sets of data, let's call them QUERY and TARGETS. What I am currently trying to do is the following:

1. For each row in QUERY extract a 'query' property
2. For each extracted 'query', locate all TARGETS rows whose 'value' property "matches" the 'query' property.

Note: Deciding whether two values "match" requires a custom UDF (essentially a SQL LIKE-style comparison), so there doesn't appear to be built-in Pig functionality to perform this comparison.

I have tried multiple methods, including a FOREACH with a FILTER command, convoluted COGROUPing, and countless others, to no avail. The only method I have found that works is to compute a full CROSS between QUERY and TARGETS and then perform the FILTER on the result. However, the execution time of this single task is on the order of 30 minutes and would only grow much worse once operational data is introduced.

So, am I missing something obvious or is there some standard method to implement this functionality?

(Please be kind, for as embarrassingly long as I have been on the internet I have never before submitted information to a mailing list.)

Re: Group Data By UDF Result?

Posted by Russell Jurney <ru...@gmail.com>.

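The 'enormous intermediate data' way:

-- Extract the query from each row (my_row and extract_query are placeholders).
queries = foreach my_row generate id, extract_query(field1) as query;
-- Pair every query with every target row.
target_queries = cross queries, target;
-- After a cross, fields are disambiguated as relation::field.
result = filter target_queries by my_condition(queries::query, target::value);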

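The 'looping over smaller chunks in RAM in a UDF, if your data partitions' way:

queries = foreach my_row generate id, extract_query(field1) as query;
-- some_key is a placeholder for a field that partitions both relations.
by_key = group queries by some_key;
also_by_key = group target by some_key;
crossed_groups = cross by_key, also_by_key;
-- looping_udf is a placeholder UDF that loops over the grouped bags; 'fields' stands for its arguments.
result = filter crossed_groups by looping_udf(fields);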

Russell Jurney http://datasyndrome.com

Re: Group Data By UDF Result?

Posted by Jonathan Coveney <jc...@gmail.com>.

Howdy Joshua. This question comes up a fair amount, in various forms, and
here is the answer: unless you can figure out a way to reduce this to an
equi-join, then it is going to be tough.

Why is that? Because of how joining in map-reduce land works. The way
joining generally works is by hashing the join key in each relation and
sending equal hash values to the same reducer. Can you see why doing more
complicated equality operations is tough?
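
To make the hashing point concrete, here is a minimal Python sketch of a reduce-side join. It is an illustration only, not Pig and not Hadoop's actual implementation: NUM_REDUCERS, reducer_for, shuffle, reduce_side_join and the sample rows are all assumed names and data. Each record is routed by a hash of its join key, and a reducer only ever compares the records that were routed to it.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 4  # assumed number of reducers, for illustration only


def reducer_for(key: str) -> int:
    """Pick a reducer by hashing the join key (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def shuffle(rows):
    """Route each (key, payload) row to the reducer its key hashes to."""
    buckets = defaultdict(list)
    for key, payload in rows:
        buckets[reducer_for(key)].append((key, payload))
    return buckets


def reduce_side_join(left, right):
    """Join two relations; each reducer sees only the rows routed to it."""
    left_buckets, right_buckets = shuffle(left), shuffle(right)
    joined = []
    for reducer in range(NUM_REDUCERS):
        for lkey, lval in left_buckets.get(reducer, []):
            for rkey, rval in right_buckets.get(reducer, []):
                if lkey == rkey:  # only exact key equality is detected
                    joined.append((lkey, lval, rval))
    return joined


# Hypothetical sample data: (join key, row id)
queries = [("apple", "q1"), ("app", "q2")]
targets = [("apple", "t1"), ("applesauce", "t2")]

matches = reduce_side_join(queries, targets)
```

Equal keys always hash to the same reducer, so "apple" finds "apple". Nothing routes "app" and "applesauce" to the same place, so a LIKE-style match between them is never even attempted unless every pair is compared, which is what CROSS does.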

What is the algorithm around the equality?

Essentially, for a join to work, you need to be able to find a function
such that
f1(x,y) = true iff f2(x)=f2(y)

f1 is your current function.

Make sense?
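
As a small illustration of that condition, here is a Python sketch in which the pairwise test f1 is assumed to be "equal ignoring case and surrounding whitespace". That rule is a stand-in chosen for the example, not the matching logic from the question, and join_on_key and the sample data are likewise hypothetical. For this f1, normalising each value gives an f2 with exactly the stated property, so the match becomes a plain equi-join on f2.

```python
from collections import defaultdict


def f1(x: str, y: str) -> bool:
    """Hypothetical pairwise match: equal ignoring case and outer whitespace."""
    return x.strip().lower() == y.strip().lower()


def f2(x: str) -> str:
    """Per-record key chosen so that f1(x, y) is true iff f2(x) == f2(y)."""
    return x.strip().lower()


def join_on_key(queries, targets):
    """Equi-join on f2: index targets by key, then look up each query's key."""
    index = defaultdict(list)
    for value in targets:
        index[f2(value)].append(value)
    return [(q, t) for q in queries for t in index.get(f2(q), [])]


# Hypothetical sample data
queries = ["Apple ", "pear"]
targets = ["apple", "APPLE", "Pear", "plum"]

pairs = join_on_key(queries, targets)

# The all-pairs comparison (what CROSS + FILTER computes) finds the same matches.
brute_force = [(q, t) for q in queries for t in targets if f1(q, t)]
```

Whether the LIKE-style UDF from the question admits such a key depends on its matching rules, which the thread does not spell out.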
