You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Roger Hoover <ro...@gmail.com> on 2014/04/15 02:07:05 UTC

How to cogroup/join pair RDDs with different key types?

Hi,

I'm trying to figure out how to join two RDDs with different key types and
appreciate any suggestions.

Say I have two RDDS:
    ipToUrl of type (IP, String)
    ipRangeToZip of type (IPRange, String)

How can I join/cogroup these two RDDs together to produce a new RDD of type
(IP, (String, String)) where IP is the key and the values are the urls and
zipcodes?

Say I have a method on the IPRange class called matches(ip: IP), I want the
joined records to match when ipRange.matches(ip).

Thanks,

Roger

Re: How to cogroup/join pair RDDs with different key types?

Posted by Roger Hoover <ro...@gmail.com>.
Thanks for following up.  I hope to get some free time this afternoon to
get it working.  Will let you know.


On Wed, Apr 16, 2014 at 12:43 PM, Andrew Ash <an...@andrewash.com> wrote:

> Glad to hear you're making progress!  Do you have a working version of the
> join?  Is there anything else you need help with?
>
>
> On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover <ro...@gmail.com>wrote:
>
>> Ah, in case this helps others, looks like RDD.zipPartitions will
>> accomplish step 4.
>>
>>
>> On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <ro...@gmail.com>wrote:
>>
>>> Andrew,
>>>
>>> Thank you very much for your feedback.  Unfortunately, the ranges are
>>> not of predictable size but you gave me an idea of how to handle it.
>>>  Here's what I'm thinking:
>>>
>>> 1. Choose number of partitions, n, over IP space
>>> 2. Preprocess the IPRanges, splitting any of them that cross partition
>>> boundaries
>>> 3. Partition ipToUrl and the new ipRangeToZip according to the
>>> partitioning scheme from step 1
>>> 4. Join matching partitions of these two RDDs
>>>
>>> I still don't know how to do step 4 though.  I see that RDDs have a
>>> mapPartitions() operation to let you do whatever you want with a partition.
>>>  What I need is a way to get my hands on two partitions at once, each from
>>> different RDDs.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <an...@andrewash.com>wrote:
>>>
>>>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>>>> 192.168.0.0/16 or 10.0.0.0/8?
>>>>
>>>> If the range is always an even subnet mask and not split across
>>>> subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and
>>>> then joining the two RDDs.  The expansion would be at most 32x if all your
>>>> ranges can be expressed in CIDR notation, and in practice would be much
>>>> smaller than that (typically you don't need things bigger than a /8 and
>>>> often not smaller than a /24)
>>>>
>>>> Hopefully you can use your knowledge of the ip ranges to make this
>>>> feasible.
>>>>
>>>> Otherwise, you could additionally flatmap the ipRangeToZip out to a
>>>> list of CIDR notations and do the join then, but you're starting to have
>>>> the cartesian product work against you on scale at that point.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <ro...@gmail.com>wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to figure out how to join two RDDs with different key types
>>>>> and appreciate any suggestions.
>>>>>
>>>>> Say I have two RDDS:
>>>>>     ipToUrl of type (IP, String)
>>>>>     ipRangeToZip of type (IPRange, String)
>>>>>
>>>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>>>> type (IP, (String, String)) where IP is the key and the values are the urls
>>>>> and zipcodes?
>>>>>
>>>>> Say I have a method on the IPRange class called matches(ip: IP), I
>>>>> want the joined records to match when ipRange.matches(ip).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Roger
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to cogroup/join pair RDDs with different key types?

Posted by Andrew Ash <an...@andrewash.com>.
Glad to hear you're making progress!  Do you have a working version of the
join?  Is there anything else you need help with?


On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover <ro...@gmail.com>wrote:

> Ah, in case this helps others, looks like RDD.zipPartitions will
> accomplish step 4.
>
>
> On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <ro...@gmail.com>wrote:
>
>> Andrew,
>>
>> Thank you very much for your feedback.  Unfortunately, the ranges are not
>> of predictable size but you gave me an idea of how to handle it.  Here's
>> what I'm thinking:
>>
>> 1. Choose number of partitions, n, over IP space
>> 2. Preprocess the IPRanges, splitting any of them that cross partition
>> boundaries
>> 3. Partition ipToUrl and the new ipRangeToZip according to the
>> partitioning scheme from step 1
>> 4. Join matching partitions of these two RDDs
>>
>> I still don't know how to do step 4 though.  I see that RDDs have a
>> mapPartitions() operation to let you do whatever you want with a partition.
>>  What I need is a way to get my hands on two partitions at once, each from
>> different RDDs.
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Roger
>>
>>
>> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <an...@andrewash.com> wrote:
>>
>>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>>> 192.168.0.0/16 or 10.0.0.0/8?
>>>
>>> If the range is always an even subnet mask and not split across subnets,
>>> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
>>> joining the two RDDs.  The expansion would be at most 32x if all your
>>> ranges can be expressed in CIDR notation, and in practice would be much
>>> smaller than that (typically you don't need things bigger than a /8 and
>>> often not smaller than a /24)
>>>
>>> Hopefully you can use your knowledge of the ip ranges to make this
>>> feasible.
>>>
>>> Otherwise, you could additionally flatmap the ipRangeToZip out to a list
>>> of CIDR notations and do the join then, but you're starting to have the
>>> cartesian product work against you on scale at that point.
>>>
>>> Andrew
>>>
>>>
>>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <ro...@gmail.com>wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to figure out how to join two RDDs with different key types
>>>> and appreciate any suggestions.
>>>>
>>>> Say I have two RDDS:
>>>>     ipToUrl of type (IP, String)
>>>>     ipRangeToZip of type (IPRange, String)
>>>>
>>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>>> type (IP, (String, String)) where IP is the key and the values are the urls
>>>> and zipcodes?
>>>>
>>>> Say I have a method on the IPRange class called matches(ip: IP), I want
>>>> the joined records to match when ipRange.matches(ip).
>>>>
>>>> Thanks,
>>>>
>>>> Roger
>>>>
>>>>
>>>
>>
>

Re: How to cogroup/join pair RDDs with different key types?

Posted by Roger Hoover <ro...@gmail.com>.
Ah, in case this helps others, looks like RDD.zipPartitions will accomplish
step 4.


On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <ro...@gmail.com>wrote:

> Andrew,
>
> Thank you very much for your feedback.  Unfortunately, the ranges are not
> of predictable size but you gave me an idea of how to handle it.  Here's
> what I'm thinking:
>
> 1. Choose number of partitions, n, over IP space
> 2. Preprocess the IPRanges, splitting any of them that cross partition
> boundaries
> 3. Partition ipToUrl and the new ipRangeToZip according to the
> partitioning scheme from step 1
> 4. Join matching partitions of these two RDDs
>
> I still don't know how to do step 4 though.  I see that RDDs have a
> mapPartitions() operation to let you do whatever you want with a partition.
>  What I need is a way to get my hands on two partitions at once, each from
> different RDDs.
>
> Any ideas?
>
> Thanks,
>
> Roger
>
>
> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <an...@andrewash.com> wrote:
>
>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>> 192.168.0.0/16 or 10.0.0.0/8?
>>
>> If the range is always an even subnet mask and not split across subnets,
>> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
>> joining the two RDDs.  The expansion would be at most 32x if all your
>> ranges can be expressed in CIDR notation, and in practice would be much
>> smaller than that (typically you don't need things bigger than a /8 and
>> often not smaller than a /24)
>>
>> Hopefully you can use your knowledge of the ip ranges to make this
>> feasible.
>>
>> Otherwise, you could additionally flatmap the ipRangeToZip out to a list
>> of CIDR notations and do the join then, but you're starting to have the
>> cartesian product work against you on scale at that point.
>>
>> Andrew
>>
>>
>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <ro...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I'm trying to figure out how to join two RDDs with different key types
>>> and appreciate any suggestions.
>>>
>>> Say I have two RDDS:
>>>     ipToUrl of type (IP, String)
>>>     ipRangeToZip of type (IPRange, String)
>>>
>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>> type (IP, (String, String)) where IP is the key and the values are the urls
>>> and zipcodes?
>>>
>>> Say I have a method on the IPRange class called matches(ip: IP), I want
>>> the joined records to match when ipRange.matches(ip).
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>
>

Re: How to cogroup/join pair RDDs with different key types?

Posted by Roger Hoover <ro...@gmail.com>.
I'm thinking of creating a union type for the key so that IPRange and IP
types can be joined.


On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <ro...@gmail.com>wrote:

> Andrew,
>
> Thank you very much for your feedback.  Unfortunately, the ranges are not
> of predictable size but you gave me an idea of how to handle it.  Here's
> what I'm thinking:
>
> 1. Choose number of partitions, n, over IP space
> 2. Preprocess the IPRanges, splitting any of them that cross partition
> boundaries
> 3. Partition ipToUrl and the new ipRangeToZip according to the
> partitioning scheme from step 1
> 4. Join matching partitions of these two RDDs
>
> I still don't know how to do step 4 though.  I see that RDDs have a
> mapPartitions() operation to let you do whatever you want with a partition.
>  What I need is a way to get my hands on two partitions at once, each from
> different RDDs.
>
> Any ideas?
>
> Thanks,
>
> Roger
>
>
> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <an...@andrewash.com> wrote:
>
>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>> 192.168.0.0/16 or 10.0.0.0/8?
>>
>> If the range is always an even subnet mask and not split across subnets,
>> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
>> joining the two RDDs.  The expansion would be at most 32x if all your
>> ranges can be expressed in CIDR notation, and in practice would be much
>> smaller than that (typically you don't need things bigger than a /8 and
>> often not smaller than a /24)
>>
>> Hopefully you can use your knowledge of the ip ranges to make this
>> feasible.
>>
>> Otherwise, you could additionally flatmap the ipRangeToZip out to a list
>> of CIDR notations and do the join then, but you're starting to have the
>> cartesian product work against you on scale at that point.
>>
>> Andrew
>>
>>
>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <ro...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I'm trying to figure out how to join two RDDs with different key types
>>> and appreciate any suggestions.
>>>
>>> Say I have two RDDS:
>>>     ipToUrl of type (IP, String)
>>>     ipRangeToZip of type (IPRange, String)
>>>
>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>> type (IP, (String, String)) where IP is the key and the values are the urls
>>> and zipcodes?
>>>
>>> Say I have a method on the IPRange class called matches(ip: IP), I want
>>> the joined records to match when ipRange.matches(ip).
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>
>

Re: How to cogroup/join pair RDDs with different key types?

Posted by Roger Hoover <ro...@gmail.com>.
Andrew,

Thank you very much for your feedback.  Unfortunately, the ranges are not
of predictable size but you gave me an idea of how to handle it.  Here's
what I'm thinking:

1. Choose number of partitions, n, over IP space
2. Preprocess the IPRanges, splitting any of them that cross partition
boundaries
3. Partition ipToUrl and the new ipRangeToZip according to the partitioning
scheme from step 1
4. Join matching partitions of these two RDDs

I still don't know how to do step 4 though.  I see that RDDs have a
mapPartitions() operation to let you do whatever you want with a partition.
 What I need is a way to get my hands on two partitions at once, each from
different RDDs.

Any ideas?

Thanks,

Roger


On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <an...@andrewash.com> wrote:

> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
> 192.168.0.0/16 or 10.0.0.0/8?
>
> If the range is always an even subnet mask and not split across subnets,
> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
> joining the two RDDs.  The expansion would be at most 32x if all your
> ranges can be expressed in CIDR notation, and in practice would be much
> smaller than that (typically you don't need things bigger than a /8 and
> often not smaller than a /24)
>
> Hopefully you can use your knowledge of the ip ranges to make this
> feasible.
>
> Otherwise, you could additionally flatmap the ipRangeToZip out to a list
> of CIDR notations and do the join then, but you're starting to have the
> cartesian product work against you on scale at that point.
>
> Andrew
>
>
> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <ro...@gmail.com>wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to join two RDDs with different key types
>> and appreciate any suggestions.
>>
>> Say I have two RDDS:
>>     ipToUrl of type (IP, String)
>>     ipRangeToZip of type (IPRange, String)
>>
>> How can I join/cogroup these two RDDs together to produce a new RDD of
>> type (IP, (String, String)) where IP is the key and the values are the urls
>> and zipcodes?
>>
>> Say I have a method on the IPRange class called matches(ip: IP), I want
>> the joined records to match when ipRange.matches(ip).
>>
>> Thanks,
>>
>> Roger
>>
>>
>

Re: How to cogroup/join pair RDDs with different key types?

Posted by Andrew Ash <an...@andrewash.com>.
Are your IPRanges all on nice, even CIDR-format ranges?  E.g. 192.168.0.0/16or
10.0.0.0/8?

If the range is always an even subnet mask and not split across subnets,
I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
joining the two RDDs.  The expansion would be at most 32x if all your
ranges can be expressed in CIDR notation, and in practice would be much
smaller than that (typically you don't need things bigger than a /8 and
often not smaller than a /24)

Hopefully you can use your knowledge of the ip ranges to make this feasible.

Otherwise, you could additionally flatmap the ipRangeToZip out to a list of
CIDR notations and do the join then, but you're starting to have the
cartesian product work against you on scale at that point.

Andrew


On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <ro...@gmail.com>wrote:

> Hi,
>
> I'm trying to figure out how to join two RDDs with different key types and
> appreciate any suggestions.
>
> Say I have two RDDS:
>     ipToUrl of type (IP, String)
>     ipRangeToZip of type (IPRange, String)
>
> How can I join/cogroup these two RDDs together to produce a new RDD of
> type (IP, (String, String)) where IP is the key and the values are the urls
> and zipcodes?
>
> Say I have a method on the IPRange class called matches(ip: IP), I want
> the joined records to match when ipRange.matches(ip).
>
> Thanks,
>
> Roger
>
>