Posted to user@pig.apache.org by Michael Lok <fu...@gmail.com> on 2012/01/19 09:54:03 UTC

Comparing each row with the same resultset

Hi folks,

I've got one resultset for which I need to run a comparison against all the
rows within the same resultset.  For example:

R1
R2
R3
R4
R5

Take R1: I'll need to compare R1 with all rows from R2-R5.  The
comparison will be written in a UDF.  Here's what I have so far:

============================================
RAW = load 'raw_data.txt' using PigStorage(',');

RAW_2 = foreach RAW generate *;

PROCESSED = foreach RAW {
    /* perform comparison here */
};
============================================

I'm stuck at the filtering inside the nested block.  How should I go
about comparing the rows there?

Any help is greatly appreciated.


Thanks!

Re: Comparing each row with the same resultset

Posted by Alan Gates <ga...@hortonworks.com>.
I would just use the HDFS interfaces directly; this is much easier.  For an example of a UDF that opens an HDFS file, take a look at https://github.com/alanfgates/programmingpig/blob/master/udfs/java/com/acme/marketing/MetroResolver.java

Alan.
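
For reference, the core of that pattern is just the Hadoop FileSystem API. A minimal sketch, assuming a plain-text side file and using made-up class and method names (this is not the MetroResolver code itself):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SideFileReader {
    // Open a file that lives in HDFS and return its first line.
    public static String readFirstLine(String hdfsPath) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(hdfsPath))));
        try {
            return in.readLine();
        } finally {
            in.close();
        }
    }
}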

On Jan 20, 2012, at 12:28 AM, Michael Lok wrote:

> Hi Alan,
> 
> Quick question.  Do I use HDataStorage and HFile to read files from
> HDFS within the UDF?
> 
> Thanks.
> 


Re: Comparing each row with the same resultset

Posted by Michael Lok <fu...@gmail.com>.
Hi Alan,

Quick question.  Do I use HDataStorage and HFile to read files from
HDFS within the UDF?

Thanks.

On Fri, Jan 20, 2012 at 10:12 AM, Michael Lok <fu...@gmail.com> wrote:
> Hi Alan,
>
> Missed your suggestion earlier :)  With a sample size of just 30k
> records, performing a cross join totally killed the disk space I
> have :(
>
> Will try your suggestion next.
>
> Thanks!
>

Re: Comparing each row with the same resultset

Posted by Michael Lok <fu...@gmail.com>.
Hi Alan,

Missed your suggestion earlier :)  With a sample size of just 30k
records, performing a cross join totally killed the disk space I
have :(

Will try your suggestion next.

Thanks!

On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates <ga...@hortonworks.com> wrote:
>
> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>
>> Hi Dmitriy,
>>
>> Am I correct to say that all rows in "results" are inside a bag when
>> passed into the UDF?
>
> Yes.  The other issue you'll face here is that if you have more than one map task, each map task will be comparing against a different first record, which probably isn't what you want.
>
> The best way to do this would probably be to write a UDF that opens the file directly in HDFS and reads the first record.  It can then compare each input record against the first record without needing to hold all of the records in memory and with every map seeing the same first record.
>
> So your script would look like:
>
> A = load 'file';
> B = foreach A generate yourudf('file', *);
> ...
>
> Ideally the UDF should store the side file in the distributed cache to avoid too many maps opening the file at once, but you can add that once you get the base feature working.
>
> Alan.

Re: Comparing each row with the same resultset

Posted by Alan Gates <ga...@hortonworks.com>.
On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:

> Hi Dmitriy,
> 
> Am I correct to say that all rows in "results" are inside a bag when
> passed into the UDF?

Yes.  The other issue you'll face here is that if you have more than one map task, each map task will be comparing against a different first record, which probably isn't what you want.

The best way to do this would probably be to write a UDF that opens the file directly in HDFS and reads the first record.  It can then compare each input record against the first record without needing to hold all of the records in memory and with every map seeing the same first record.  

So your script would look like:

A = load 'file';
B = foreach A generate yourudf('file', *);
...

Ideally the UDF should store the side file in the distributed cache to avoid too many maps opening the file at once, but you can add that once you get the base feature working.

Alan.
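
To make that concrete, a minimal sketch of what such a UDF might look like, assuming the side file is plain text; the class name and the equality check at the end are placeholders for the real comparison logic:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Invoked as: B = foreach A generate CompareToFirst('file', *);
public class CompareToFirst extends EvalFunc<Boolean> {
    private String firstRecord;   // read once per task, then reused for every input tuple

    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        if (firstRecord == null) {
            // Open the file directly in HDFS and keep only its first record,
            // so every map sees the same first record without holding the
            // whole file in memory.  (The distributed cache could replace this.)
            String file = (String) input.get(0);
            FileSystem fs = FileSystem.get(new Configuration());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(file))));
            try {
                firstRecord = in.readLine();
            } finally {
                in.close();
            }
        }
        // Placeholder: compare the current row (fields 1..n of 'input')
        // against firstRecord with whatever logic you actually need.
        return firstRecord != null && firstRecord.equals(String.valueOf(input.get(1)));
    }
}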




Re: Comparing each row with the same resultset

Posted by Michael <fu...@gmail.com>.
Hi Scott,

I think the cross join approach will work. But I don't think my HDFS storage has sufficient space to handle the result of the join, as my data size is already 12M rows.

Probably have to process the records in chunks. 


Thanks 



On Jan 20, 2012, at 2:36, Scott Carey <sc...@richrelevance.com> wrote:

> If your goal is to compare all rows with all other rows, you can do a
> distributed CROSS self-join.
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#CROSS
> 
> Something like 
> 
> exploded = CROSS data, data;
> 
> which will produce n^2 rows, where n is the number of rows in the alias
> 'data'.
> 
> Then you would have each row paired with each other row in your result.
> 
> I haven't tried this myself on a larger dataset -- the n^2 data explosion
> is something to be wary of.
> 

Re: Comparing each row with the same resultset

Posted by Scott Carey <sc...@richrelevance.com>.
If your goal is to compare all rows with all other rows, you can do a
distributed CROSS self-join.
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#CROSS

Something like 

exploded = CROSS data, data;

which will produce n^2 rows, where n is the number of rows in the alias
'data'.

Then you would have each row paired with each other row in your result.

I haven't tried this myself on a larger dataset -- the n^2 data explosion
is something to be wary of.
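
A rough sketch of how that might be wired together, with made-up field names and a hypothetical MyCompareUdf (which would need to be REGISTERed); the FILTER drops the pairs where a row is matched against itself:

data     = LOAD 'raw_data.txt' USING PigStorage(',') AS (id, payload);
-- Make a second copy with renamed fields so the two sides of the cross
-- are easy to tell apart.
data2    = FOREACH data GENERATE id AS id2, payload AS payload2;
exploded = CROSS data, data2;                     -- n^2 rows, as noted above
pairs    = FILTER exploded BY id != id2;          -- drop the row-vs-itself pairs
compared = FOREACH pairs GENERATE id, id2, MyCompareUdf(payload, payload2);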

On 1/19/12 5:57 AM, "Michael Lok" <fu...@gmail.com> wrote:

>Hi Dmitriy,
>
>Am I correct to say that all rows in "results" are inside a bag when
>passed into the UDF?
>


Re: Comparing each row with the same resultset

Posted by Michael Lok <fu...@gmail.com>.
Hi Dmitriy,

Am I correct to say that all rows in "results" are inside a bag when
passed into the UDF?

On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> results = foreach (group raw all) generate MyUdf(raw)
>
> input to the udf will be a tuple with a single field. This field will be a
> bag of tuples. Each of those tuples is one of your raw rows.
>
> Note that this forces everything into memory and isn't scalable...

Re: Comparing each row with the same resultset

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
results = foreach (group raw all) generate MyUdf(raw)

input to the udf will be a tuple with a single field. This field will be a
bag of tuples. Each of those tuples is one of your raw rows.

Note that this forces everything into memory and isn't scalable...
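
For illustration, a minimal sketch of what MyUdf could look like on the Java side. Everything here is a placeholder; it just pairs the first tuple in the bag with every other tuple (R1 against R2..Rn) and emits a dummy comparison result:

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Invoked as: results = foreach (group raw all) generate MyUdf(raw);
public class MyUdf extends EvalFunc<DataBag> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag rows = (DataBag) input.get(0);    // the single field: a bag of all the raw rows
        DataBag out = bagFactory.newDefaultBag();

        Iterator<Tuple> it = rows.iterator();
        if (!it.hasNext()) {
            return out;
        }
        Tuple first = it.next();                  // e.g. R1
        while (it.hasNext()) {
            Tuple other = it.next();              // R2 .. Rn
            Tuple result = tupleFactory.newTuple(3);
            result.set(0, first);
            result.set(1, other);
            result.set(2, first.equals(other));   // placeholder: real comparison goes here
            out.add(result);
        }
        return out;
    }
}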



On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <fu...@gmail.com> wrote:

> Hi folks,
>
> I've got one resultset for which I need to run a comparison against all the
> rows within the same resultset.  For example:
>
> R1
> R2
> R3
> R4
> R5
>
> Take R1: I'll need to compare R1 with all rows from R2-R5.  The
> comparison will be written in a UDF.  Here's what I have so far:
>
> ============================================
> RAW = load 'raw_data.txt' using PigStorage(',');
>
> RAW_2 = foreach RAW generate *;
>
> PROCESSED = foreach RAW {
>    /* perform comparison here */
> };
> ============================================
>
> I'm stuck at the filtering inside the nested block.  How should I go
> about comparing the rows there?
>
> Any help is greatly appreciated.
>
>
> Thanks!
>