Posted to dev@spark.apache.org by Andres Perez <an...@tresata.com> on 2016/05/19 15:48:04 UTC
right outer joins on Datasets
Hi all, I'm getting some odd behavior when using the joinWith functionality
for Datasets. Here is a small test case:
import org.apache.spark.sql.functions
import spark.implicits._  // implicits from the active SparkSession

val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()

val joined = left.toDF("k", "v").as[(String, Int)].alias("left")
  .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
    functions.col("left.k") === functions.col("right.k"), "right_outer")
  .as[((String, Int), (String, String))]
  .map { case ((k, v), (_, u)) => (k, (v, u)) }
  .as[(String, (Int, String))]
I would expect the result of this right-join to be:
(a,(1,x))
(a,(2,x))
(b,(3,y))
(d,(null,z))
but instead I'm getting:
(a,(1,x))
(a,(2,x))
(b,(3,y))
(null,(-1,z))
Note that the key for the final tuple is null instead of "d". (Also, is
there a reason the value for the left side of the last tuple is -1 and not
null?)
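The semantics I'm expecting, sketched with plain Scala collections (no Spark involved; Option stands in for the nullable left side):

```scala
// Right outer join on plain Scala lists: every row of `right` survives,
// matched keys pair with each left value, and unmatched rows keep their
// key while carrying None where the left value is missing.
val left  = List(("a", 1), ("a", 2), ("b", 3), ("c", 4))
val right = List(("a", "x"), ("b", "y"), ("d", "z"))

val leftByKey: Map[String, List[(String, Int)]] = left.groupBy(_._1)

val joined: List[(String, (Option[Int], String))] =
  right.flatMap { case (k, u) =>
    leftByKey.get(k) match {
      case Some(matches) => matches.map { case (_, v) => (k, (Some(v), u)) }
      case None          => List((k, (None, u)))
    }
  }

joined.foreach(println)
// (a,(Some(1),x))
// (a,(Some(2),x))
// (b,(Some(3),y))
// (d,(None,z))
```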
-Andy
Re: right outer joins on Datasets
Posted by Zhan <zh...@gmail.com>.
The first item as a whole should be null; please refer to the JIRA.
Sent from my iPhone
> On May 24, 2016, at 7:31 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> Got it, but I assume that's an internal implementation detail, and it should show null, not -1?
>
>> On Tue, May 24, 2016 at 3:10 AM, Zhan Zhang <zh...@gmail.com> wrote:
>> The reason for "-1" is that the default value for Integer is -1 if the value
>> is null
>>
>> def defaultValue(jt: String): String = jt match {
>> ...
>> case JAVA_INT => "-1"
>> ...
>> }
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/right-outer-joins-on-Datasets-tp17542p17651.html
>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>
Re: right outer joins on Datasets
Posted by Koert Kuipers <ko...@tresata.com>.
Got it, but I assume that's an internal implementation detail, and it should
show null, not -1?
On Tue, May 24, 2016 at 3:10 AM, Zhan Zhang <zh...@gmail.com> wrote:
> The reason for "-1" is that the default value for Integer is -1 if the
> value
> is null
>
> def defaultValue(jt: String): String = jt match {
> ...
> case JAVA_INT => "-1"
> ...
> }
Re: right outer joins on Datasets
Posted by Zhan Zhang <zh...@gmail.com>.
The reason for "-1" is that Spark's generated code uses -1 as the default
value for a Java int when the actual value is null:
def defaultValue(jt: String): String = jt match {
...
case JAVA_INT => "-1"
...
}
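For what it's worth, a sentinel has to exist because a JVM primitive int cannot hold null at all; only the boxed form can. A plain-JVM sketch (nothing Spark-specific, just illustrating why some default value must be picked for the primitive slot):

```scala
// A primitive Int has no null representation; unboxing a null
// java.lang.Integer in Scala quietly yields 0 rather than throwing.
// Spark's generated code tracks nullness in a separate flag and fills
// the primitive slot with a sentinel (-1 for JAVA_INT), which is what
// leaks through when the null flag is not consulted.
val boxed: java.lang.Integer = null
val unboxed: Int = boxed.asInstanceOf[Int]  // null unboxes to 0 in Scala
println(unboxed)  // 0
```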
Re: right outer joins on Datasets
Posted by Reynold Xin <rx...@databricks.com>.
I filed https://issues.apache.org/jira/browse/SPARK-15441
On Thu, May 19, 2016 at 8:48 AM, Andres Perez <an...@tresata.com> wrote:
> Hi all, I'm getting some odd behavior when using the joinWith
> functionality for Datasets. Here is a small test case:
>
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
>
> val joined = left.toDF("k", "v").as[(String, Int)].alias("left")
> .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
> functions.col("left.k") === functions.col("right.k"), "right_outer")
> .as[((String, Int), (String, String))]
> .map { case ((k, v), (_, u)) => (k, (v, u)) }.as[(String, (Int, String))]
>
> I would expect the result of this right-join to be:
>
> (a,(1,x))
> (a,(2,x))
> (b,(3,y))
> (d,(null,z))
>
> but instead I'm getting:
>
> (a,(1,x))
> (a,(2,x))
> (b,(3,y))
> (null,(-1,z))
>
> Note that the key for the final tuple is null instead of "d". (Also, is
> there a reason the value for the left-side of the last tuple is -1 and not
> null?)
>
> -Andy
>