Posted to user@crunch.apache.org by Bryan Baugher <bj...@gmail.com> on 2015/02/18 21:35:46 UTC

Joins and null values

Hi everyone,

The other day I ran into the issue mentioned here[1] about joining data
with null values. It took a while to figure out, until I broke down and
went to look at the docs to see if I was doing something obviously wrong. I
used null values because I basically want to join two PCollections.

Can Crunch either throw an exception or log errors if I do something like
this? Similarly, would it be possible to get convenience methods for doing
joins on PCollections?

[1] - http://crunch.apache.org/user-guide.html#joins

Re: Joins and null values

Posted by Bryan Baugher <bj...@gmail.com>.
Ahh yes, reading the whole doc would help. Thanks!


RE: Joins and null values

Posted by David Ortiz <do...@videologygroup.com>.
You most definitely want Set.difference(setA, setB).
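
For readers unfamiliar with the operation, here is a minimal plain-Java sketch of what a set difference computes. It works on in-memory lists, not on Crunch PCollections, and the class and method names are made up for illustration. The comment about the Crunch call assumes that "Set" above refers to org.apache.crunch.lib.Set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in; not part of Crunch.
final class SetDifferenceSketch {

    // Returns the distinct elements of a that do not occur in b,
    // in the order they first appear in a.
    static <T> List<T> difference(List<T> a, List<T> b) {
        Set<T> exclude = new HashSet<>(b);
        Set<T> kept = new LinkedHashSet<>();
        for (T item : a) {
            if (!exclude.contains(item)) {
                kept.add(item);
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        List<String> setA = Arrays.asList("a", "b", "c", "b");
        List<String> setB = Arrays.asList("b", "d");
        // Assumption: on two PCollections the corresponding Crunch call would be
        // org.apache.crunch.lib.Set.difference(setA, setB).
        System.out.println(difference(setA, setB)); // prints [a, c]
    }
}
```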



Re: Joins and null values

Posted by Bryan Baugher <bj...@gmail.com>.
Hmm, I'm trying to get the elements of set A which are not in set B.
Set#comm(..) could work but seems like the wrong choice. I'm currently
doing a left outer join and then filtering down to the results that have
only a left-side value. Does that seem like the best choice, or are there
more gems hidden in the Crunch library?
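
The "left outer join, then keep only the unmatched left side" approach described above can be sketched in plain Java as follows. This is an in-memory model under stated assumptions, not Crunch code: the right-hand table is a Map, a missing match is represented by null, and all names are hypothetical. The second call shows why null values on the right side are a problem for this approach: a stored null looks exactly like "no match".

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; none of these names come from Crunch.
final class LeftJoinFilterSketch {

    // Left outer join on key: every left key is emitted once, paired with the
    // right-hand value for that key, or with null when there is no match.
    static <K, V> List<Map.Entry<K, V>> leftOuterJoin(List<K> leftKeys, Map<K, V> right) {
        List<Map.Entry<K, V>> joined = new ArrayList<>();
        for (K key : leftKeys) {
            // Map.get returns null both for an absent key and for a stored null value.
            joined.add(new SimpleEntry<K, V>(key, right.get(key)));
        }
        return joined;
    }

    // Keeps the keys whose right-hand value is null, i.e. "in A but not in B".
    static <K, V> List<K> onlyLeft(List<Map.Entry<K, V>> joined) {
        List<K> result = new ArrayList<>();
        for (Map.Entry<K, V> pair : joined) {
            if (pair.getValue() == null) {
                result.add(pair.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "y", "z");

        // Non-null marker value on the right side: "y" is recognised as matched.
        Map<String, Boolean> b = new HashMap<>();
        b.put("y", Boolean.TRUE);
        System.out.println(onlyLeft(leftOuterJoin(a, b)));         // prints [x, z]

        // Null value on the right side: "y" is wrongly reported as unmatched.
        Map<String, Boolean> bWithNull = new HashMap<>();
        bWithNull.put("y", null);
        System.out.println(onlyLeft(leftOuterJoin(a, bWithNull))); // prints [x, y, z]
    }
}
```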


Re: Joins and null values

Posted by Josh Wills <jw...@cloudera.com>.
If I got that right, then I think org.apache.crunch.lib.Set does what you
want. Let me know.


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Joins and null values

Posted by Josh Wills <jw...@cloudera.com>.
Oh, I'm dumb -- you mean you want a left-join-like thing where you can
find all values in collection A that aren't in collection B, etc.?

J


Re: Joins and null values

Posted by Josh Wills <jw...@cloudera.com>.
Is that different from org.apache.crunch.lib.Cartesian.cross(PCollection<U> left,
PCollection<T> right, int parallelism) in some way?

J


Re: Joins and null values

Posted by Bryan Baugher <bj...@gmail.com>.
Maybe something like:

PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>

You could add separate methods for the different join strategies, or
perhaps take an enum?
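
To make the proposal concrete, here is a rough plain-Java sketch of an element-wise join selected by an enum. It is not Crunch's API: JoinKind and ElementJoinSketch are hypothetical names, lists stand in for PCollections, Map.Entry stands in for Pair, each element acts as its own join key, duplicates are collapsed, and null marks the missing side.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical enum for illustration; not Crunch's own join type.
enum JoinKind { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

// Hypothetical in-memory model of a join between two collections of elements.
final class ElementJoinSketch {

    private static <T> Map.Entry<T, T> pair(T left, T right) {
        return new SimpleEntry<T, T>(left, right);
    }

    // Joins two collections using each element as its own key.
    // A null on either side of a pair means "no match on that side".
    static <T> List<Map.Entry<T, T>> join(List<T> left, List<T> right, JoinKind kind) {
        Set<T> leftSet = new LinkedHashSet<>(left);
        Set<T> rightSet = new LinkedHashSet<>(right);
        List<Map.Entry<T, T>> out = new ArrayList<>();
        for (T item : leftSet) {
            if (rightSet.contains(item)) {
                out.add(pair(item, item));   // present on both sides
            } else if (kind == JoinKind.LEFT_OUTER || kind == JoinKind.FULL_OUTER) {
                out.add(pair(item, null));   // left only
            }
        }
        if (kind == JoinKind.RIGHT_OUTER || kind == JoinKind.FULL_OUTER) {
            for (T item : rightSet) {
                if (!leftSet.contains(item)) {
                    out.add(pair(null, item)); // right only
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b");
        List<String> right = Arrays.asList("a", "d");
        System.out.println(join(left, right, JoinKind.INNER));      // prints [a=a]
        System.out.println(join(left, right, JoinKind.LEFT_OUTER)); // prints [a=a, b=null]
        System.out.println(join(left, right, JoinKind.FULL_OUTER)); // prints [a=a, b=null, null=d]
    }
}
```

Because null stands for "missing" in the output pairs, a null element in either input would be ambiguous in this sketch too.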


Re: Joins and null values

Posted by Josh Wills <jw...@cloudera.com>.
Hey Bryan,

I like the idea of throwing exceptions when there are null values in one of
the collections in a join. Not sure if there are any other implications of
that I should think through first.

On the convenience methods for PCollection joins, what do you have in mind?

J
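
As an illustration of the fail-fast idea discussed above, here is a minimal plain-Java sketch that rejects a table containing null values before it is joined. It is only a model: a Map stands in for a PTable, the class and method names are hypothetical, and this is not how Crunch behaves today. In a real pipeline such a check would presumably have to run inside the job itself, since the values are not available while the pipeline is being planned.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Crunch.
final class NullCheckSketch {

    // Returns the table unchanged if every value is non-null,
    // otherwise throws with the offending key in the message.
    static <K, V> Map<K, V> requireNonNullValues(Map<K, V> table, String name) {
        for (Map.Entry<K, V> entry : table.entrySet()) {
            if (entry.getValue() == null) {
                throw new IllegalArgumentException(
                    "null value for key " + entry.getKey() + " in table '" + name + "'");
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> ok = new HashMap<>();
        ok.put("k1", "v1");
        requireNonNullValues(ok, "left"); // passes silently

        Map<String, String> bad = new HashMap<>();
        bad.put("k2", null);
        try {
            requireNonNullValues(bad, "right");
        } catch (IllegalArgumentException e) {
            // The join would be rejected here instead of silently producing odd results.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```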

