You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Lakshminarayana Motamarri <na...@gmail.com> on 2011/07/15 10:23:09 UTC

JOIN or COGROUP? - need to join 1 column from first file which should lie in between 2 columns from second file.

Hi all

I have 2 CSV files a shown below:

*File 1:                     File2:
col1  col2             col1    col2   col3   col4
1234    2                1000   1999
2222    3                2000   2999
3333    5                3000   3999
4444    6                4000   4999*

Now I need to JOIN these 2 files in such a way that:

File1-col1 should lie in between File2-col1 and File2-col2

Can I use JOIN / COGROUP or any other existing operators?

or shud I build a new UDF?

thanks
Narayan.

Re: JOIN or COGROUP? - need to join 1 column from first file which should lie in between 2 columns from second file.

Posted by Xiaomeng Wan <sh...@gmail.com>.
you can always use a cross followed by filter

x = cross file1, file2;
y = filter x by file1::col1>file2::col1 and file1::col1<file2::col2;

Shawn

On Fri, Jul 15, 2011 at 2:12 PM, Lakshminarayana Motamarri
<na...@gmail.com> wrote:
> Hi Jacob,
>
> thanks for the response and suggestion. But as u mentioned the range is NOT
> small and NON linear...
>
>              file 2:
>              col1               col2            50331648    67108863
> 67108864 67109119  67109120 67109631  67109632 67110399  67110400
> 67110655
> ....
>
> so as mentioned, the range is not small and there could be more than one
> match from file 1, which like in between these...
>
> suggestions? ideas? please...
>
> thanks
> Narayan.
>
> On Fri, Jul 15, 2011 at 5:23 AM, Jacob Perkins <ja...@gmail.com>wrote:
>
>> Assuming col1 is numeric, as you've indicated, couldn't you simply
>> generate a new column in file 1 by rounding to the nearest 1000? Then
>> file 1 would look like:
>>
>> *File 1:
>> col1  col2 join_key
>> 1234  2    1000
>> 2222  3    2000
>> 3333  5    3000
>> 4444  6    4000
>>
>> Then you could just join by the new key from file 1 and col2 from file
>> 2.
>>
>> This works even if your ranges are smaller, just round to whatever makes
>> sense. Eg, nearest 10. What this does not work for is if your ranges are
>> variable. Are your ranges variable? :)
>>
>> --jacob
>> @thedatachef
>>
>> On Fri, 2011-07-15 at 01:23 -0700, Lakshminarayana Motamarri wrote:
>> > Hi all
>> >
>> > I have 2 CSV files a shown below:
>> >
>> > *File 1:                     File2:
>> > col1  col2             col1    col2   col3   col4
>> > 1234    2                1000   1999
>> > 2222    3                2000   2999
>> > 3333    5                3000   3999
>> > 4444    6                4000   4999*
>> >
>> > Now I need to JOIN these 2 files in such a way that:
>> >
>> > File1-col1 should lie in between File2-col1 and File2-col2
>> >
>> > Can I use JOIN / COGROUP or any other existing operators?
>> >
>> > or shud I build a new UDF?
>> >
>> > thanks
>> > Narayan.
>>
>>
>>
>
>
> --
> Thanking you,
> ---
> Thanks & Regards,
> Mr. Lakshminarayana. Motamarri, {Call-out Name: Narayan}
> Graduate Student (M.S in Computer Science),
> Arizona State University,
> School of Computing and Informatics,
> Mobile: (480) 280 0179.
> http://www.public.asu.edu/~lmotamar/
>

Re: JOIN or COGROUP? - need to join 1 column from first file which should lie in between 2 columns from second file.

Posted by Lakshminarayana Motamarri <na...@gmail.com>.
Hi Jacob,

thanks for the response and suggestion. But as u mentioned the range is NOT
small and NON linear...

              file 2:
              col1               col2            50331648    67108863
67108864 67109119  67109120 67109631  67109632 67110399  67110400
67110655
....

so as mentioned, the range is not small and there could be more than one
match from file 1, which like in between these...

suggestions? ideas? please...

thanks
Narayan.

On Fri, Jul 15, 2011 at 5:23 AM, Jacob Perkins <ja...@gmail.com>wrote:

> Assuming col1 is numeric, as you've indicated, couldn't you simply
> generate a new column in file 1 by rounding to the nearest 1000? Then
> file 1 would look like:
>
> *File 1:
> col1  col2 join_key
> 1234  2    1000
> 2222  3    2000
> 3333  5    3000
> 4444  6    4000
>
> Then you could just join by the new key from file 1 and col2 from file
> 2.
>
> This works even if your ranges are smaller, just round to whatever makes
> sense. Eg, nearest 10. What this does not work for is if your ranges are
> variable. Are your ranges variable? :)
>
> --jacob
> @thedatachef
>
> On Fri, 2011-07-15 at 01:23 -0700, Lakshminarayana Motamarri wrote:
> > Hi all
> >
> > I have 2 CSV files a shown below:
> >
> > *File 1:                     File2:
> > col1  col2             col1    col2   col3   col4
> > 1234    2                1000   1999
> > 2222    3                2000   2999
> > 3333    5                3000   3999
> > 4444    6                4000   4999*
> >
> > Now I need to JOIN these 2 files in such a way that:
> >
> > File1-col1 should lie in between File2-col1 and File2-col2
> >
> > Can I use JOIN / COGROUP or any other existing operators?
> >
> > or shud I build a new UDF?
> >
> > thanks
> > Narayan.
>
>
>


-- 
Thanking you,
---
Thanks & Regards,
Mr. Lakshminarayana. Motamarri, {Call-out Name: Narayan}
Graduate Student (M.S in Computer Science),
Arizona State University,
School of Computing and Informatics,
Mobile: (480) 280 0179.
http://www.public.asu.edu/~lmotamar/

Re: JOIN or COGROUP? - need to join 1 column from first file which should lie in between 2 columns from second file.

Posted by Jacob Perkins <ja...@gmail.com>.
Assuming col1 is numeric, as you've indicated, couldn't you simply
generate a new column in file 1 by rounding to the nearest 1000? Then
file 1 would look like:

*File 1:
col1  col2 join_key
1234  2    1000
2222  3    2000
3333  5    3000
4444  6    4000

Then you could just join by the new key from file 1 and col2 from file
2.

This works even if your ranges are smaller, just round to whatever makes
sense. Eg, nearest 10. What this does not work for is if your ranges are
variable. Are your ranges variable? :)

--jacob
@thedatachef

On Fri, 2011-07-15 at 01:23 -0700, Lakshminarayana Motamarri wrote:
> Hi all
> 
> I have 2 CSV files a shown below:
> 
> *File 1:                     File2:
> col1  col2             col1    col2   col3   col4
> 1234    2                1000   1999
> 2222    3                2000   2999
> 3333    5                3000   3999
> 4444    6                4000   4999*
> 
> Now I need to JOIN these 2 files in such a way that:
> 
> File1-col1 should lie in between File2-col1 and File2-col2
> 
> Can I use JOIN / COGROUP or any other existing operators?
> 
> or shud I build a new UDF?
> 
> thanks
> Narayan.