You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by John Berryman <jo...@eventbrite.com> on 2015/07/22 06:44:10 UTC

many-to-many join

Quick example problem that's stumping me:

* Users have 1 or more phone numbers and therefore one or more area codes.
* There are 100M users.
* States have one or more area codes.
* I would like to the states for the users (as indicated by phone area
code).

I was thinking about something like this:

If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
and area_code_state looks like (area_code,state) ex: (615, ["Tennessee"])
then we could do

states_and_users_mixed = area_code_user.join(area_code_state) \
    .reduceByKey(lambda a,b: a+b) \
    .values()

user_state_pairs = states_and_users_mixed.flatMap(
        emit_cartesian_prod_of_userids_and_states )
user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)

user_to_states.first(1)

>>> (1234567,["Tennessee","Tennessee","California"])

This would work, but the user_state_pairs is just a list of user_ids and
state names mixed together and emit_cartesian_prod_of_userids_and_states
has to correctly pair them. This is problematic because 1) it's weird and
sloppy and 2) there will be lots of users per state and having so many
users in a single row is going to make
emit_cartesian_prod_of_userids_and_states work extra hard to first locate
states and then emit all userid-state pairs.

How should I be doing this?

Thanks,
-John

Re: many-to-many join

Posted by ayan guha <gu...@gmail.com>.

Hi

RDD solution:
>>> u = [(615,1),(720,1),(615,2)]
>>> urdd=sc.parallelize(u,1)
>>> a1 = [(615,'T'),(720,'C')]
>>> ardd=sc.parallelize(a1,1)
>>> def addString(s1,s2):
...     return s1+','+s2
>>> j = urdd.join(ardd).map(lambda t:t[1]).reduceByKey(addString)
>>> print j.collect()
[(2, 'T'), (1, 'C,T')]

However, if you can assume <number of users>  is far far greater than
<number of distinct area codes>, you may think to broadcast variable in a
dict format and look up in the map. Like this

>>> u = [(1,615),(1,720),(2,615)]
>>> a = {615:'T',720:'C'}
>>> urdd=sc.parallelize(u)
>>> def usr_area_state(tup):
...     uid=tup[0]
...     aid=tup[1]
...     sid=bc.value[aid]
...     return uid,(sid,)
...
>>> bc=sc.broadcast(a)
>>> usrdd=urdd.map(usr_area_state)
>>> def addTuple(t1,t2):
...     return t1+t2
...
>>> out=usrdd.reduceByKey(addTuple)
>>> print out.collect()
[(1, ('T', 'C')), (2, ('T',))]

Best
Ayan

On Wed, Jul 22, 2015 at 5:14 PM, Sonal Goyal <so...@gmail.com> wrote:

> If I understand this correctly, you could join area_code_user and
> area_code_state and then flat map to get
> user, areacode, state. Then groupby/reduce by user.
>
> You can also try some join optimizations like partitioning on area code or
> broadcasting smaller table depending on size of area_code_state.
> On Jul 22, 2015 10:15 AM, "John Berryman" <jo...@eventbrite.com> wrote:
>
>> Quick example problem that's stumping me:
>>
>> * Users have 1 or more phone numbers and therefore one or more area codes.
>> * There are 100M users.
>> * States have one or more area codes.
>> * I would like to the states for the users (as indicated by phone area
>> code).
>>
>> I was thinking about something like this:
>>
>> If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
>> and area_code_state looks like (area_code,state) ex: (615, ["Tennessee"])
>> then we could do
>>
>> states_and_users_mixed = area_code_user.join(area_code_state) \
>>     .reduceByKey(lambda a,b: a+b) \
>>     .values()
>>
>> user_state_pairs = states_and_users_mixed.flatMap(
>>         emit_cartesian_prod_of_userids_and_states )
>> user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)
>>
>> user_to_states.first(1)
>>
>> >>> (1234567,["Tennessee","Tennessee","California"])
>>
>> This would work, but the user_state_pairs is just a list of user_ids and
>> state names mixed together and emit_cartesian_prod_of_userids_and_states
>> has to correctly pair them. This is problematic because 1) it's weird and
>> sloppy and 2) there will be lots of users per state and having so many
>> users in a single row is going to make
>> emit_cartesian_prod_of_userids_and_states work extra hard to first locate
>> states and then emit all userid-state pairs.
>>
>> How should I be doing this?
>>
>> Thanks,
>> -John
>>
>


-- 
Best Regards,
Ayan Guha

Re: many-to-many join

Posted by Sonal Goyal <so...@gmail.com>.

If I understand this correctly, you could join area_code_user and
area_code_state and then flat map to get
user, areacode, state. Then groupby/reduce by user.

You can also try some join optimizations like partitioning on area code or
broadcasting smaller table depending on size of area_code_state.
On Jul 22, 2015 10:15 AM, "John Berryman" <jo...@eventbrite.com> wrote:

> Quick example problem that's stumping me:
>
> * Users have 1 or more phone numbers and therefore one or more area codes.
> * There are 100M users.
> * States have one or more area codes.
> * I would like to the states for the users (as indicated by phone area
> code).
>
> I was thinking about something like this:
>
> If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
> and area_code_state looks like (area_code,state) ex: (615, ["Tennessee"])
> then we could do
>
> states_and_users_mixed = area_code_user.join(area_code_state) \
>     .reduceByKey(lambda a,b: a+b) \
>     .values()
>
> user_state_pairs = states_and_users_mixed.flatMap(
>         emit_cartesian_prod_of_userids_and_states )
> user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)
>
> user_to_states.first(1)
>
> >>> (1234567,["Tennessee","Tennessee","California"])
>
> This would work, but the user_state_pairs is just a list of user_ids and
> state names mixed together and emit_cartesian_prod_of_userids_and_states
> has to correctly pair them. This is problematic because 1) it's weird and
> sloppy and 2) there will be lots of users per state and having so many
> users in a single row is going to make
> emit_cartesian_prod_of_userids_and_states work extra hard to first locate
> states and then emit all userid-state pairs.
>
> How should I be doing this?
>
> Thanks,
> -John
>