Posted to user@spark.apache.org by Richard Marscher <rm...@localytics.com> on 2016/06/01 16:58:08 UTC

Dataset Outer Join vs RDD Outer Join

Hi,

I've been working on transitioning from RDD to Datasets in our codebase in
anticipation of being able to leverage features of 2.0.

I'm having a lot of difficulty with the impedance mismatches between how
outer joins worked with RDD versus Dataset. The Dataset joins feel like a
big step backwards, IMO. With RDD, leftOuterJoin would give you Option types
for the results from the right side of the join. This follows idiomatic
Scala, avoiding nulls, and was easy to work with.

Now with Dataset there is only joinWith, where you specify the join type,
but it loses all the semantics for identifying missing data from outer joins.
I can write some enriched methods on Dataset with an implicit class to
abstract the messiness away if Dataset nulled out all mismatching data from
an outer join; however, the problem goes even further in that the values
aren't always null. Integer, for example, defaults to -1 instead of null. That
makes it completely ambiguous which data in the join was actually there and
which was populated via this atypical default.

Are there additional options available to work around this issue? I can
convert to RDD and back to Dataset but that's less than ideal.
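
To make the contrast concrete, here is a rough sketch of the two behaviors
and of the RDD round trip mentioned above. It assumes a Spark 2.0-style
SparkSession named spark and toy data; all names here are illustrative, not
our actual code.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("outer-join-sketch").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b"))
val right = Seq((1, "x"))

// RDD: the right side of a leftOuterJoin is an Option, so a missing match is an explicit None
val rddJoined = spark.sparkContext.parallelize(left)
  .leftOuterJoin(spark.sparkContext.parallelize(right))
// => (1,(a,Some(x))), (2,(b,None))

// Dataset: joinWith returns the raw pair; an unmatched right side comes back with
// nulled-out or defaulted values (e.g. -1 for an Integer column), not an Option
val leftDs  = left.toDS().as("l")
val rightDs = right.toDS().as("r")
val dsJoined = leftDs.joinWith(rightDs, $"l._1" === $"r._1", "left_outer")

// The workaround mentioned above: drop to RDD, join there, and keep the Options
val viaRdd = leftDs.rdd.keyBy(_._1).leftOuterJoin(rightDs.rdd.keyBy(_._1)).values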

Thanks,
-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>

Re: Dataset Outer Join vs RDD Outer Join

Posted by Richard Marscher <rm...@localytics.com>.
For anyone following along: the chain went private for a bit, but there were
still issues with the bytecode generation in the 2.0-preview, so this JIRA
was created: https://issues.apache.org/jira/browse/SPARK-15786

On Mon, Jun 6, 2016 at 1:11 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> That kind of stuff is likely fixed in 2.0.  If you can get a reproduction
> working there it would be very helpful if you could open a JIRA.
>
> On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarscher@localytics.com
> > wrote:
>
>> A quick unit test attempt didn't get far replacing map with as[], I'm
>> only working against 1.6.1 at the moment though, I was going to try 2.0 but
>> I'm having a hard time building a working spark-sql jar from source, the
>> only ones I've managed to make are intended for the full assembly fat jar.
>>
>>
>> Example of the error from calling joinWith as left_outer and then
>> .as[(Option[T], U]) where T and U are Int and Int.
>>
>> [info] newinstance(class scala.Tuple2,decodeusingserializer(input[0,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class
>> scala.Tuple2),None)
>> [info] :- decodeusingserializer(input[0,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true)
>> [info] :  +- input[0, StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))]
>> [info] +- decodeusingserializer(input[1,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true)
>> [info]    +- input[1, StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))]
>>
>> Cause: java.util.concurrent.ExecutionException: java.lang.Exception:
>> failed to compile: org.codehaus.commons.compiler.CompileException: File
>> 'generated.java', Line 32, Column 60: No applicable constructor/method
>> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
>> candidates are: "public static java.nio.ByteBuffer
>> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
>> java.nio.ByteBuffer.wrap(byte[], int, int)"
>>
>> The generated code is passing InternalRow objects into the ByteBuffer
>>
>> Starting from two Datasets of types Dataset[(Int, Int)] with expression
>> $"left._1" === $"right._1". I'll have to spend some time getting a better
>> understanding of this analysis phase, but hopefully I can come up with
>> something.
>>
>> On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> Option should place nicely with encoders, but its always possible there
>>> are bugs.  I think those function signatures are slightly more expensive
>>> (one extra object allocation) and its not as java friendly so we probably
>>> don't want them to be the default.
>>>
>>> That said, I would like to enable that kind of sugar while still taking
>>> advantage of all the optimizations going on under the covers.  Can you get
>>> it to work if you use `as[...]` instead of `map`?
>>>
>>> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <
>>> rmarscher@localytics.com> wrote:
>>>
>>>> Ah thanks, I missed seeing the PR for
>>>> https://issues.apache.org/jira/browse/SPARK-15441. If the rows became
>>>> null objects then I can implement methods that will map those back to
>>>> results that align closer to the RDD interface.
>>>>
>>>> As a follow on, I'm curious about thoughts regarding enriching the
>>>> Dataset join interface versus a package or users sugaring for themselves. I
>>>> haven't considered the implications of what the optimizations datasets,
>>>> tungsten, and/or bytecode gen can do now regarding joins so I may be
>>>> missing a critical benefit there around say avoiding Options in favor of
>>>> nulls. If nothing else, I guess Option doesn't have a first class Encoder
>>>> or DataType yet and maybe for good reasons.
>>>>
>>>> I did find the RDD join interface elegant, though. In the ideal world
>>>> an API comparable the following would be nice:
>>>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>>>
>>>>
>>>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> Thanks for the feedback.  I think this will address at least some of
>>>>> the problems you are describing:
>>>>> https://github.com/apache/spark/pull/13425
>>>>>
>>>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>>>>> rmarscher@localytics.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've been working on transitioning from RDD to Datasets in our
>>>>>> codebase in anticipation of being able to leverage features of 2.0.
>>>>>>
>>>>>> I'm having a lot of difficulties with the impedance mismatches
>>>>>> between how outer joins worked with RDD versus Dataset. The Dataset joins
>>>>>> feel like a big step backwards IMO. With RDD, leftOuterJoin would give you
>>>>>> Option types of the results from the right side of the join. This follows
>>>>>> idiomatic Scala avoiding nulls and was easy to work with.
>>>>>>
>>>>>> Now with Dataset there is only joinWith where you specify the join
>>>>>> type, but it lost all the semantics of identifying missing data from outer
>>>>>> joins. I can write some enriched methods on Dataset with an implicit class
>>>>>> to abstract messiness away if Dataset nulled out all mismatching data from
>>>>>> an outer join, however the problem goes even further in that the values
>>>>>> aren't always null. Integer, for example, defaults to -1 instead of null.
>>>>>> Now it's completely ambiguous what data in the join was actually there
>>>>>> versus populated via this atypical semantic.
>>>>>>
>>>>>> Are there additional options available to work around this issue? I
>>>>>> can convert to RDD and back to Dataset but that's less than ideal.
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> *Richard Marscher*
>>>>>> Senior Software Engineer
>>>>>> Localytics
>>>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>>>> <http://localytics.com/blog> | Twitter
>>>>>> <http://twitter.com/localytics> | Facebook
>>>>>> <http://facebook.com/localytics> | LinkedIn
>>>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Richard Marscher*
>>>> Senior Software Engineer
>>>> Localytics
>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>>  | Facebook <http://facebook.com/localytics> | LinkedIn
>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>>>
>>>
>>>
>>
>>
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>> Localytics.com <http://localytics.com/> | Our Blog
>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
>> Facebook <http://facebook.com/localytics> | LinkedIn
>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>
>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>

Re: Dataset Outer Join vs RDD Outer Join

Posted by Michael Armbrust <mi...@databricks.com>.
That kind of stuff is likely fixed in 2.0.  If you can get a reproduction
working there, it would be very helpful if you could open a JIRA.

On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rm...@localytics.com>
wrote:

> A quick unit test attempt didn't get far replacing map with as[], I'm only
> working against 1.6.1 at the moment though, I was going to try 2.0 but I'm
> having a hard time building a working spark-sql jar from source, the only
> ones I've managed to make are intended for the full assembly fat jar.
>
>
> Example of the error from calling joinWith as left_outer and then
> .as[(Option[T], U]) where T and U are Int and Int.
>
> [info] newinstance(class scala.Tuple2,decodeusingserializer(input[0,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class
> scala.Tuple2),None)
> [info] :- decodeusingserializer(input[0,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true)
> [info] :  +- input[0, StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))]
> [info] +- decodeusingserializer(input[1,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true)
> [info]    +- input[1, StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))]
>
> Cause: java.util.concurrent.ExecutionException: java.lang.Exception:
> failed to compile: org.codehaus.commons.compiler.CompileException: File
> 'generated.java', Line 32, Column 60: No applicable constructor/method
> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
> candidates are: "public static java.nio.ByteBuffer
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
> java.nio.ByteBuffer.wrap(byte[], int, int)"
>
> The generated code is passing InternalRow objects into the ByteBuffer
>
> Starting from two Datasets of types Dataset[(Int, Int)] with expression
> $"left._1" === $"right._1". I'll have to spend some time getting a better
> understanding of this analysis phase, but hopefully I can come up with
> something.
>
> On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Option should place nicely with encoders, but its always possible there
>> are bugs.  I think those function signatures are slightly more expensive
>> (one extra object allocation) and its not as java friendly so we probably
>> don't want them to be the default.
>>
>> That said, I would like to enable that kind of sugar while still taking
>> advantage of all the optimizations going on under the covers.  Can you get
>> it to work if you use `as[...]` instead of `map`?
>>
>> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <
>> rmarscher@localytics.com> wrote:
>>
>>> Ah thanks, I missed seeing the PR for
>>> https://issues.apache.org/jira/browse/SPARK-15441. If the rows became
>>> null objects then I can implement methods that will map those back to
>>> results that align closer to the RDD interface.
>>>
>>> As a follow on, I'm curious about thoughts regarding enriching the
>>> Dataset join interface versus a package or users sugaring for themselves. I
>>> haven't considered the implications of what the optimizations datasets,
>>> tungsten, and/or bytecode gen can do now regarding joins so I may be
>>> missing a critical benefit there around say avoiding Options in favor of
>>> nulls. If nothing else, I guess Option doesn't have a first class Encoder
>>> or DataType yet and maybe for good reasons.
>>>
>>> I did find the RDD join interface elegant, though. In the ideal world an
>>> API comparable the following would be nice:
>>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>>
>>>
>>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <michael@databricks.com
>>> > wrote:
>>>
>>>> Thanks for the feedback.  I think this will address at least some of
>>>> the problems you are describing:
>>>> https://github.com/apache/spark/pull/13425
>>>>
>>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>>>> rmarscher@localytics.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've been working on transitioning from RDD to Datasets in our
>>>>> codebase in anticipation of being able to leverage features of 2.0.
>>>>>
>>>>> I'm having a lot of difficulties with the impedance mismatches between
>>>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel like
>>>>> a big step backwards IMO. With RDD, leftOuterJoin would give you Option
>>>>> types of the results from the right side of the join. This follows
>>>>> idiomatic Scala avoiding nulls and was easy to work with.
>>>>>
>>>>> Now with Dataset there is only joinWith where you specify the join
>>>>> type, but it lost all the semantics of identifying missing data from outer
>>>>> joins. I can write some enriched methods on Dataset with an implicit class
>>>>> to abstract messiness away if Dataset nulled out all mismatching data from
>>>>> an outer join, however the problem goes even further in that the values
>>>>> aren't always null. Integer, for example, defaults to -1 instead of null.
>>>>> Now it's completely ambiguous what data in the join was actually there
>>>>> versus populated via this atypical semantic.
>>>>>
>>>>> Are there additional options available to work around this issue? I
>>>>> can convert to RDD and back to Dataset but that's less than ideal.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> *Richard Marscher*
>>>>> Senior Software Engineer
>>>>> Localytics
>>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>>>  | Facebook <http://facebook.com/localytics> | LinkedIn
>>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Richard Marscher*
>>> Senior Software Engineer
>>> Localytics
>>> Localytics.com <http://localytics.com/> | Our Blog
>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>  | Facebook <http://facebook.com/localytics> | LinkedIn
>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>>
>>
>>
>
>
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>

Re: Dataset Outer Join vs RDD Outer Join

Posted by Richard Marscher <rm...@localytics.com>.
A quick unit test attempt replacing map with as[] didn't get far. I'm only
working against 1.6.1 at the moment, though; I was going to try 2.0, but I'm
having a hard time building a working spark-sql jar from source, and the only
ones I've managed to make are intended for the full assembly fat jar.


Example of the error from calling joinWith with left_outer and then
.as[(Option[T], U)], where T and U are both Int.

[info] newinstance(class scala.Tuple2,decodeusingserializer(input[0,
StructType(StructField(_1,IntegerType,true),
StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1,
StructType(StructField(_1,IntegerType,true),
StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class
scala.Tuple2),None)
[info] :- decodeusingserializer(input[0,
StructType(StructField(_1,IntegerType,true),
StructField(_2,IntegerType,true))],scala.Option,true)
[info] :  +- input[0, StructType(StructField(_1,IntegerType,true),
StructField(_2,IntegerType,true))]
[info] +- decodeusingserializer(input[1,
StructType(StructField(_1,IntegerType,true),
StructField(_2,IntegerType,true))],scala.Option,true)
[info]    +- input[1, StructType(StructField(_1,IntegerType,true),
StructField(_2,IntegerType,true))]

Cause: java.util.concurrent.ExecutionException: java.lang.Exception: failed
to compile: org.codehaus.commons.compiler.CompileException: File
'generated.java', Line 32, Column 60: No applicable constructor/method
found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
candidates are: "public static java.nio.ByteBuffer
java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
java.nio.ByteBuffer.wrap(byte[], int, int)"

The generated code is passing InternalRow objects into the ByteBuffer.

I'm starting from two Datasets of type Dataset[(Int, Int)] with the join
expression $"left._1" === $"right._1". I'll have to spend some time getting a
better understanding of this analysis phase, but hopefully I can come up with
something.
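
For reference, the reproduction is roughly the following sketch; the names are
made up, it assumes a SQLContext/SparkSession named spark with implicits
imported, and the Option placement in the as[...] is my guess from the analyzed
plan above. Against 1.6.1, the final as[...] step is where the failure shows up.

val left  = Seq((1, 10), (2, 20)).toDS().as("left")
val right = Seq((1, 100)).toDS().as("right")

val joined = left.joinWith(right, $"left._1" === $"right._1", "left_outer")

// re-reading the joined rows as Options is what hits the DecodeUsingSerializer /
// ByteBuffer.wrap codegen path shown in the error above
val failing = joined.as[(Option[(Int, Int)], Option[(Int, Int)])]
failing.collect()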

On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Option should place nicely with encoders, but its always possible there
> are bugs.  I think those function signatures are slightly more expensive
> (one extra object allocation) and its not as java friendly so we probably
> don't want them to be the default.
>
> That said, I would like to enable that kind of sugar while still taking
> advantage of all the optimizations going on under the covers.  Can you get
> it to work if you use `as[...]` instead of `map`?
>
> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <
> rmarscher@localytics.com> wrote:
>
>> Ah thanks, I missed seeing the PR for
>> https://issues.apache.org/jira/browse/SPARK-15441. If the rows became
>> null objects then I can implement methods that will map those back to
>> results that align closer to the RDD interface.
>>
>> As a follow on, I'm curious about thoughts regarding enriching the
>> Dataset join interface versus a package or users sugaring for themselves. I
>> haven't considered the implications of what the optimizations datasets,
>> tungsten, and/or bytecode gen can do now regarding joins so I may be
>> missing a critical benefit there around say avoiding Options in favor of
>> nulls. If nothing else, I guess Option doesn't have a first class Encoder
>> or DataType yet and maybe for good reasons.
>>
>> I did find the RDD join interface elegant, though. In the ideal world an
>> API comparable the following would be nice:
>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>
>>
>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> Thanks for the feedback.  I think this will address at least some of the
>>> problems you are describing: https://github.com/apache/spark/pull/13425
>>>
>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>>> rmarscher@localytics.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been working on transitioning from RDD to Datasets in our codebase
>>>> in anticipation of being able to leverage features of 2.0.
>>>>
>>>> I'm having a lot of difficulties with the impedance mismatches between
>>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel like
>>>> a big step backwards IMO. With RDD, leftOuterJoin would give you Option
>>>> types of the results from the right side of the join. This follows
>>>> idiomatic Scala avoiding nulls and was easy to work with.
>>>>
>>>> Now with Dataset there is only joinWith where you specify the join
>>>> type, but it lost all the semantics of identifying missing data from outer
>>>> joins. I can write some enriched methods on Dataset with an implicit class
>>>> to abstract messiness away if Dataset nulled out all mismatching data from
>>>> an outer join, however the problem goes even further in that the values
>>>> aren't always null. Integer, for example, defaults to -1 instead of null.
>>>> Now it's completely ambiguous what data in the join was actually there
>>>> versus populated via this atypical semantic.
>>>>
>>>> Are there additional options available to work around this issue? I can
>>>> convert to RDD and back to Dataset but that's less than ideal.
>>>>
>>>> Thanks,
>>>> --
>>>> *Richard Marscher*
>>>> Senior Software Engineer
>>>> Localytics
>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>>  | Facebook <http://facebook.com/localytics> | LinkedIn
>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>>>
>>>
>>>
>>
>>
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>> Localytics.com <http://localytics.com/> | Our Blog
>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
>> Facebook <http://facebook.com/localytics> | LinkedIn
>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>
>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>

Re: Dataset Outer Join vs RDD Outer Join

Posted by Michael Armbrust <mi...@databricks.com>.
Option should play nicely with encoders, but it's always possible there are
bugs.  I think those function signatures are slightly more expensive (one
extra object allocation) and it's not as Java-friendly, so we probably don't
want them to be the default.

That said, I would like to enable that kind of sugar while still taking
advantage of all the optimizations going on under the covers.  Can you get
it to work if you use `as[...]` instead of `map`?
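
Roughly, the difference I have in mind is the following; leftDs and rightDs
are placeholder Dataset[(Int, Int)]s and this assumes implicits are in scope,
so treat it as a sketch rather than tested code.

val joined = leftDs.as("l").joinWith(rightDs.as("r"), $"l._1" === $"r._1", "left_outer")

// map: wrap the right side in Option on the JVM side; one extra object allocation
// per row, and the lambda is opaque to the optimizer
val viaMap = joined.map { case (l, r) => (l, Option(r)) }

// as[...]: ask the encoder to produce the Option directly, keeping the whole query planned
val viaAs = joined.as[((Int, Int), Option[(Int, Int)])]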

On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rm...@localytics.com>
wrote:

> Ah thanks, I missed seeing the PR for
> https://issues.apache.org/jira/browse/SPARK-15441. If the rows became
> null objects then I can implement methods that will map those back to
> results that align closer to the RDD interface.
>
> As a follow on, I'm curious about thoughts regarding enriching the Dataset
> join interface versus a package or users sugaring for themselves. I haven't
> considered the implications of what the optimizations datasets, tungsten,
> and/or bytecode gen can do now regarding joins so I may be missing a
> critical benefit there around say avoiding Options in favor of nulls. If
> nothing else, I guess Option doesn't have a first class Encoder or DataType
> yet and maybe for good reasons.
>
> I did find the RDD join interface elegant, though. In the ideal world an
> API comparable the following would be nice:
> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>
>
> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Thanks for the feedback.  I think this will address at least some of the
>> problems you are describing: https://github.com/apache/spark/pull/13425
>>
>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>> rmarscher@localytics.com> wrote:
>>
>>> Hi,
>>>
>>> I've been working on transitioning from RDD to Datasets in our codebase
>>> in anticipation of being able to leverage features of 2.0.
>>>
>>> I'm having a lot of difficulties with the impedance mismatches between
>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel like
>>> a big step backwards IMO. With RDD, leftOuterJoin would give you Option
>>> types of the results from the right side of the join. This follows
>>> idiomatic Scala avoiding nulls and was easy to work with.
>>>
>>> Now with Dataset there is only joinWith where you specify the join type,
>>> but it lost all the semantics of identifying missing data from outer joins.
>>> I can write some enriched methods on Dataset with an implicit class to
>>> abstract messiness away if Dataset nulled out all mismatching data from an
>>> outer join, however the problem goes even further in that the values aren't
>>> always null. Integer, for example, defaults to -1 instead of null. Now it's
>>> completely ambiguous what data in the join was actually there versus
>>> populated via this atypical semantic.
>>>
>>> Are there additional options available to work around this issue? I can
>>> convert to RDD and back to Dataset but that's less than ideal.
>>>
>>> Thanks,
>>> --
>>> *Richard Marscher*
>>> Senior Software Engineer
>>> Localytics
>>> Localytics.com <http://localytics.com/> | Our Blog
>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>  | Facebook <http://facebook.com/localytics> | LinkedIn
>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>>
>>
>>
>
>
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>

Re: Dataset Outer Join vs RDD Outer Join

Posted by Richard Marscher <rm...@localytics.com>.
Ah thanks, I missed seeing the PR for
https://issues.apache.org/jira/browse/SPARK-15441. If the rows become null
objects, then I can implement methods that map those back to results that
align more closely with the RDD interface.

As a follow-on, I'm curious about thoughts on enriching the Dataset join
interface versus leaving it to a package or to users sugaring it for
themselves. I haven't considered the implications of what the Dataset
optimizations, Tungsten, and/or bytecode generation can do now regarding
joins, so I may be missing a critical benefit there around, say, avoiding
Options in favor of nulls. If nothing else, I guess Option doesn't have a
first-class Encoder or DataType yet, and maybe for good reasons.

I did find the RDD join interface elegant, though. In an ideal world, an
API comparable to the following would be nice:
https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
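
For illustration only (this is a rough sketch of the idea, not the gist itself,
and it assumes the 2.0 behavior where unmatched rows come back as null objects),
the kind of sugar I mean would look something like:

import org.apache.spark.sql.{Column, Dataset, Encoder}

object DatasetJoinSyntax {
  implicit class RichJoins[L](val left: Dataset[L]) extends AnyVal {
    // left outer join that surfaces unmatched right-hand rows as None, RDD-style
    def leftOuterJoinWith[R](right: Dataset[R], condition: Column)(
        implicit enc: Encoder[(L, Option[R])]): Dataset[(L, Option[R])] =
      left.joinWith(right, condition, "left_outer")
        .map { case (l, r) => (l, Option(r)) }
  }
}

// usage: import DatasetJoinSyntax._
//        leftDs.as("l").leftOuterJoinWith(rightDs.as("r"), $"l._1" === $"r._1")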


On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Thanks for the feedback.  I think this will address at least some of the
> problems you are describing: https://github.com/apache/spark/pull/13425
>
> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarscher@localytics.com
> > wrote:
>
>> Hi,
>>
>> I've been working on transitioning from RDD to Datasets in our codebase
>> in anticipation of being able to leverage features of 2.0.
>>
>> I'm having a lot of difficulties with the impedance mismatches between
>> how outer joins worked with RDD versus Dataset. The Dataset joins feel like
>> a big step backwards IMO. With RDD, leftOuterJoin would give you Option
>> types of the results from the right side of the join. This follows
>> idiomatic Scala avoiding nulls and was easy to work with.
>>
>> Now with Dataset there is only joinWith where you specify the join type,
>> but it lost all the semantics of identifying missing data from outer joins.
>> I can write some enriched methods on Dataset with an implicit class to
>> abstract messiness away if Dataset nulled out all mismatching data from an
>> outer join, however the problem goes even further in that the values aren't
>> always null. Integer, for example, defaults to -1 instead of null. Now it's
>> completely ambiguous what data in the join was actually there versus
>> populated via this atypical semantic.
>>
>> Are there additional options available to work around this issue? I can
>> convert to RDD and back to Dataset but that's less than ideal.
>>
>> Thanks,
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>> Localytics.com <http://localytics.com/> | Our Blog
>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
>> Facebook <http://facebook.com/localytics> | LinkedIn
>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>
>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>

Re: Dataset Outer Join vs RDD Outer Join

Posted by Michael Armbrust <mi...@databricks.com>.
Thanks for the feedback.  I think this will address at least some of the
problems you are describing: https://github.com/apache/spark/pull/13425

On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rm...@localytics.com>
wrote:

> Hi,
>
> I've been working on transitioning from RDD to Datasets in our codebase in
> anticipation of being able to leverage features of 2.0.
>
> I'm having a lot of difficulties with the impedance mismatches between how
> outer joins worked with RDD versus Dataset. The Dataset joins feel like a
> big step backwards IMO. With RDD, leftOuterJoin would give you Option types
> of the results from the right side of the join. This follows idiomatic
> Scala avoiding nulls and was easy to work with.
>
> Now with Dataset there is only joinWith where you specify the join type,
> but it lost all the semantics of identifying missing data from outer joins.
> I can write some enriched methods on Dataset with an implicit class to
> abstract messiness away if Dataset nulled out all mismatching data from an
> outer join, however the problem goes even further in that the values aren't
> always null. Integer, for example, defaults to -1 instead of null. Now it's
> completely ambiguous what data in the join was actually there versus
> populated via this atypical semantic.
>
> Are there additional options available to work around this issue? I can
> convert to RDD and back to Dataset but that's less than ideal.
>
> Thanks,
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>