Posted to dev@crunch.apache.org by Mārtiņš Kalvāns <ma...@gmail.com> on 2014/07/30 18:35:26 UTC

Join.join(PTable, ?) return empty collection

Hi.

I stumbled on weird behaviour (bug?) when joining a PTable<?, Void> on the
left side with any other PTable: the resulting collection is empty.
The attached example code demonstrates the unexpected behaviour.
The code in question is in org.apache.crunch.lib.join.InnerJoinFn, line 59,
where it checks for a null reference on the left dataset (the same applies
to the other join fn implementations).
Can anyone comment on this?
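
For readers without the attachment, here is a minimal sketch of the effect in
plain Java collections. It is not the attached example and not Crunch's
actual InnerJoinFn source; the class and method names are hypothetical, and
the only thing it models is a join that skips null values on the left side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for an inner-join function; NOT Crunch's actual code.
final class NullSkippingJoinSketch {

    // Joins the left and right values that share a single key.
    // A null left value is treated as "no value", as in the null check described above.
    static <U, V> List<String> innerJoin(List<U> left, List<V> right) {
        List<U> leftValues = new ArrayList<>();
        for (U u : left) {
            if (u != null) { // null is read as "absent", so the value is dropped
                leftValues.add(u);
            }
        }
        List<String> out = new ArrayList<>();
        for (V v : right) {
            for (U u : leftValues) {
                out.add("(" + u + ", " + v + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Void has no instances, so every value of a PTable<?, Void> is null.
        List<Void> voidValues = Arrays.asList((Void) null, (Void) null);
        List<String> right = Arrays.asList("a", "b");

        System.out.println(innerJoin(voidValues, right).size()); // prints 0
        System.out.println(innerJoin(Arrays.asList("x"), right)); // prints [(x, a), (x, b)]
    }
}
```

With Void on the left, every left value is skipped, so nothing is ever
paired with the right side and the result is empty.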


--
Mārtiņš Kalvāns

Re: Join.join(PTable, ?) return empty collection

Posted by Josh Wills <jo...@gmail.com>.
Posted a doc fix for this in CRUNCH-453, along with a few other updates to
the user guide.


On Fri, Aug 1, 2014 at 4:32 AM, Mārtiņš Kalvāns <ma...@gmail.com>
wrote:

> Yes, I think at least documenting this as a known issue would help.
> Thanks!

Re: Join.join(PTable, ?) return empty collection

Posted by Mārtiņš Kalvāns <ma...@gmail.com>.
Yes, I think at least documenting this as a known issue would help.
Thanks!


2014-07-31 17:09 GMT+02:00 Josh Wills <jw...@cloudera.com>:

> Understood. Anything I can do to help? Docfix, at least?

Re: Join.join(PTable, ?) return empty collection

Posted by Josh Wills <jw...@cloudera.com>.
Understood. Anything I can do to help? Docfix, at least?


On Thu, Jul 31, 2014 at 1:08 AM, Mārtiņš Kalvāns <ma...@gmail.com>
wrote:

> It is almost always avoidable. The problem is that the Crunch user base in
> our company is growing, and many of those users are "not so technical", so
> they cannot quickly and effectively catch problems like this and find
> workarounds. :(



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Join.join(PTable, ?) return empty collection

Posted by Mārtiņš Kalvāns <ma...@gmail.com>.
It is almost always avoidable. The problem is that the Crunch user base in
our company is growing, and many of those users are "not so technical", so
they cannot quickly and effectively catch problems like this and find
workarounds. :(


--
Mārtiņš


2014-07-30 18:45 GMT+02:00 Josh Wills <jw...@cloudera.com>:

> My hypothesis is that we re-use null in joins to indicate the absence of a
> value, so if the value of an entry is null, we assume it's non-existent.
> I'm assuming there isn't an easy way to switch the Void out for a non-null
> but ignored value?

Re: Join.join(PTable, ?) return empty collection

Posted by Josh Wills <jw...@cloudera.com>.
My hypothesis is that we re-use null in joins to indicate the absence of a
value, so if the value of an entry is null, we assume it's non-existent.
I'm assuming there isn't an easy way to switch the Void out for a non-null
but ignored value?
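
As a rough illustration of what swapping the Void out for a non-null but
ignored value could mean, here is a sketch in plain Java collections. It
does not use the Crunch API, and a Map is only a simplification of a PTable;
PLACEHOLDER and withPlaceholder are hypothetical names chosen for this
example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-collections sketch; PLACEHOLDER and withPlaceholder are hypothetical, not Crunch API.
final class PlaceholderValueSketch {

    // Non-null marker that downstream code simply ignores.
    static final Boolean PLACEHOLDER = Boolean.TRUE;

    // Replaces every (null) Void value with the placeholder, keeping keys and their order.
    static <K> Map<K, Boolean> withPlaceholder(Map<K, Void> keysOnly) {
        Map<K, Boolean> out = new LinkedHashMap<>();
        for (K key : keysOnly.keySet()) {
            out.put(key, PLACEHOLDER);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Void> keysOnly = new LinkedHashMap<>();
        keysOnly.put("k1", null);
        keysOnly.put("k2", null);

        System.out.println(withPlaceholder(keysOnly)); // prints {k1=true, k2=true}
    }
}
```

Once the values are non-null, a join that reads null as "absent" no longer
drops them, and the placeholder can be discarded after the join.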

J


On Wed, Jul 30, 2014 at 9:35 AM, Mārtiņš Kalvāns <ma...@gmail.com>
wrote:

> Hi.
>
> I stumbled on weird behaviour (bug?) when joining a PTable<?, Void> on the
> left side with any other PTable: the resulting collection is empty.
> The attached example code demonstrates the unexpected behaviour.
> The code in question is in org.apache.crunch.lib.join.InnerJoinFn, line 59,
> where it checks for a null reference on the left dataset (the same applies
> to the other join fn implementations).
> Can anyone comment on this?
>
>
> --
> Mārtiņš Kalvāns
>


