Posted to dev@spark.apache.org by Richard Marscher <rm...@localytics.com> on 2016/06/20 15:21:27 UTC

Inconsistent joinWith behavior?

I know outer join was recently changed to preserve actual nulls through the
join in https://github.com/apache/spark/pull/13425. However, I am seeing what
seems like inconsistent behavior depending on how the join result is used.
In one case the default datatype values are still used instead of nulls,
whereas in the other case the nulls are passed through. I have a small
Databricks notebook showing the case against the 2.0 preview:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/4268263383756277/673639177603143/latest.html

-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>

Re: Inconsistent joinWith behavior?

Posted by Richard Marscher <rm...@localytics.com>.
Hi,

Thanks for the response. I have created a JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-16076



Re: Inconsistent joinWith behavior?

Posted by Yin Huai <yh...@databricks.com>.
Hello Richard,

It looks like the Dataset is a Dataset[(Int, Int)]. I guess that in the case
of "ds.joinWith(other, expr, Outer).map({ case (t, u) => (Option(t),
Option(u)) })", we are trying to use null to create an "(Int, Int)" and
it somehow ends up as a Tuple2 holding default values.
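
A minimal sketch of why the two outcomes look so different after the map
step, written in plain Scala with no Spark dependency. The object name, the
wrap helper, and the sample values are hypothetical; the (0, 0) tuple only
stands in for the "default values" case described above and is not taken
from the notebook.

```scala
// Hypothetical stand-in for one row of an outer joinWith result.
object OuterJoinNullSketch {
  type Row = (Int, Int)

  // Mirrors the map step quoted above: wrap each side of the pair in Option.
  def wrap(pair: (Row, Row)): (Option[Row], Option[Row]) = pair match {
    case (t, u) => (Option(t), Option(u))
  }

  def main(args: Array[String]): Unit = {
    val matched: Row = (1, 2)

    // Assumed "nulls preserved" case: the unmatched side is a real null.
    val withNull: (Row, Row) = (matched, null)
    // Assumed "default values" case: the unmatched side is a tuple of Int defaults.
    val withDefaults: (Row, Row) = (matched, (0, 0))

    println(wrap(withNull))     // (Some((1,2)),None)
    println(wrap(withDefaults)) // (Some((1,2)),Some((0,0)))
  }
}
```

Option(x) only yields None when x is null, so once the missing side has
been replaced by a tuple of defaults, the map step can no longer tell an
unmatched row apart from a real (0, 0) row.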

Can you create a JIRA ticket? We will investigate the issue.

Thanks!

Yin
