Posted to dev@spark.apache.org by Josh Rosen <ro...@gmail.com> on 2013/08/08 21:07:04 UTC

scala.Option vs Guava Optional in Spark Java APIs

I've noticed that Spark's Java API is inconsistent in how it represents
optional values. Some methods use scala.Option instances, while others use
Guava's Optional:

scala.Option is used by methods like JavaSparkContext.getSparkHome(),
and the *outerJoin methods return a JavaPairRDD[K, (V, Option[W])].

Guava Optional is used in methods like Java*RDD.getCheckpointFile() and
JavaPairDStream.updateStateByKey() function arguments.

I'd like to remove this inconsistency and settle on a single class for
representing optional values in the Java API.

Both APIs are similar, but the Guava API seems nicer for Java users.  For
example, scala.Option.getOrElse(default) takes its default as a by-name
parameter, which Java callers see as a scala.Function0 argument, so it
isn't really usable from Java.

http://www.scala-lang.org/api/current/index.html#scala.Option
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/Optional.html
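
To make the ergonomic difference concrete, here is a minimal sketch. It
does not use the real classes: ScalaStyleOption, GuavaStyleOptional and
Function0 are simplified, hypothetical stand-ins that only mirror the
shape of scala.Option.getOrElse and Guava's Optional.or(T), so the example
runs without either library on the classpath. The "/opt/spark" default is
a placeholder, not a real Spark setting.

```java
public class OptionalErgonomics {
    // Hypothetical stand-in for scala.Function0, which is how a Scala
    // by-name parameter appears to Java callers.
    interface Function0<T> {
        T apply();
    }

    // Hypothetical stand-in mirroring the shape of scala.Option.getOrElse
    // as seen from Java. A null value means "empty" in this sketch.
    static final class ScalaStyleOption<T> {
        private final T value;

        ScalaStyleOption(T value) {
            this.value = value;
        }

        T getOrElse(Function0<T> fallback) {
            return value != null ? value : fallback.apply();
        }
    }

    // Hypothetical stand-in mirroring the shape of Guava's Optional.or(T).
    static final class GuavaStyleOptional<T> {
        private final T value;

        GuavaStyleOptional(T value) {
            this.value = value;
        }

        T or(T fallback) {
            return value != null ? value : fallback;
        }
    }

    static String sparkHomeScalaStyle(ScalaStyleOption<String> home) {
        // Without lambdas, the caller has to write an anonymous class
        // just to supply a constant default.
        return home.getOrElse(new Function0<String>() {
            @Override
            public String apply() {
                return "/opt/spark"; // placeholder default
            }
        });
    }

    static String sparkHomeGuavaStyle(GuavaStyleOptional<String> home) {
        // The Guava-shaped call takes the default value directly.
        return home.or("/opt/spark"); // placeholder default
    }

    public static void main(String[] args) {
        System.out.println(sparkHomeScalaStyle(new ScalaStyleOption<String>(null)));
        System.out.println(sparkHomeGuavaStyle(new GuavaStyleOptional<String>(null)));
    }
}
```

Both helpers return the same value; the difference is only in how much
the Java caller has to write to provide a default.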

If we switch to exclusively using Guava Optional, we'd have to convert join
results before turning them into JavaRDDs so that we have JavaPairRDD[K,
(V, Optional[W])].  I don't anticipate this being a large performance issue.
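
As a rough illustration of that per-record conversion, the sketch below
maps a list of Option-like values to Optional-like values. The two nested
classes are simplified, hypothetical stand-ins (only the method names
isDefined/get and of/absent/isPresent/get mirror the real scala.Option and
Guava Optional), and toOptional and convertAll are made-up helper names,
not existing Spark methods. In this sketch each record costs one branch
and one small allocation.

```java
import java.util.ArrayList;
import java.util.List;

public class JoinResultConversion {
    // Hypothetical stand-in for scala.Option; null means "empty" here.
    static final class ScalaStyleOption<T> {
        private final T value;

        ScalaStyleOption(T value) {
            this.value = value;
        }

        boolean isDefined() {
            return value != null;
        }

        T get() {
            return value;
        }
    }

    // Hypothetical stand-in for Guava's Optional.
    static final class GuavaStyleOptional<T> {
        private final T value;

        private GuavaStyleOptional(T value) {
            this.value = value;
        }

        static <T> GuavaStyleOptional<T> of(T value) {
            return new GuavaStyleOptional<T>(value);
        }

        static <T> GuavaStyleOptional<T> absent() {
            return new GuavaStyleOptional<T>(null);
        }

        boolean isPresent() {
            return value != null;
        }

        T get() {
            return value;
        }
    }

    // Hypothetical helper: converts a single optional value.
    static <T> GuavaStyleOptional<T> toOptional(ScalaStyleOption<T> option) {
        if (option.isDefined()) {
            return GuavaStyleOptional.of(option.get());
        }
        return GuavaStyleOptional.absent();
    }

    // Hypothetical helper: converts the optional side of each join result.
    static <T> List<GuavaStyleOptional<T>> convertAll(List<ScalaStyleOption<T>> options) {
        List<GuavaStyleOptional<T>> converted =
            new ArrayList<GuavaStyleOptional<T>>(options.size());
        for (ScalaStyleOption<T> option : options) {
            converted.add(toOptional(option));
        }
        return converted;
    }

    public static void main(String[] args) {
        List<ScalaStyleOption<String>> joined = new ArrayList<ScalaStyleOption<String>>();
        joined.add(new ScalaStyleOption<String>("matched"));
        joined.add(new ScalaStyleOption<String>(null));
        for (GuavaStyleOptional<String> result : convertAll(joined)) {
            System.out.println(result.isPresent());
        }
    }
}
```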

This would be a backwards-incompatible API change and 0.8 seems like the
easiest time to make it.  I'd appreciate any thoughts on whether I should
use Guava Optional everywhere.

Thanks,
Josh

Re: scala.Option vs Guava Optional in Spark Java APIs

Posted by Patrick Wendell <pw...@gmail.com>.

For the streaming stuff, I'm fairly sure I used Guava (or I at least *want*
it to be Guava) so I'm personally in full support of Guava for this.

- Patrick

