Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/10/14 01:01:05 UTC

[jira] [Comment Edited] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

    [ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955868#comment-14955868 ] 

Reynold Xin edited comment on SPARK-9999 at 10/13/15 11:00 PM:
---------------------------------------------------------------

[~sandyr] Your concern is absolutely valid, but I don't think your EncodedRDD proposal works. For one, the map function (and every other function that returns a type different from the RDD's own T) will break. For another, the whole concept of PairRDDFunctions should go away with this new API.
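
To make the first point concrete, here is a sketch of the signature clash (Encoder and EncodedRDD are hypothetical stand-ins, not actual classes):

{code:scala}
import scala.reflect.ClassTag

// Hypothetical stand-ins to illustrate the problem.
trait Encoder[T]

abstract class RDD[T: ClassTag] {
  // Today's map: the result type only needs a ClassTag.
  def map[U: ClassTag](f: T => U): RDD[U]
}

abstract class EncodedRDD[T] {
  // An encoder-backed RDD has to demand an Encoder for the result type,
  // so map (and flatMap, keyBy, ...) changes signature and breaks every
  // existing call site that produces a new element type.
  def map[U: Encoder](f: T => U): EncodedRDD[U]
}
{code}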

As I said, it's actually my preference to just use the RDD API. But if you take a look at what's needed here, it'd break too many functions. So we have the following choices:

1. Don't create a new API, and break the RDD API. People then can't update to newer versions of Spark unless they rewrite their apps. We did this with the SchemaRDD -> DataFrame change, which went well -- but SchemaRDD wasn't really an advertised API back then.

2. Create a new API, and keep the RDD API intact. People can update to new versions of Spark, but they can't take full advantage of all the Tungsten/DataFrame work immediately unless they rewrite their apps. Maybe we can later implement the RDD API on top of the new API in some cases, so legacy apps can still benefit where possible (e.g. by inferring encoders from classtags), as sketched below.
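
A rough sketch of that classtag-based fallback (all names here are hypothetical):

{code:scala}
import scala.reflect.ClassTag

trait Encoder[T]
final case class GenericEncoder[T](ct: ClassTag[T]) extends Encoder[T]

object Encoder {
  // Fallback derivation: when legacy RDD code only carries a ClassTag,
  // wrap it in a generic serializer-backed encoder. Such apps would run
  // on the new engine, just without the columnar/codegen benefits.
  implicit def fromClassTag[T](implicit ct: ClassTag[T]): Encoder[T] =
    GenericEncoder(ct)
}
{code}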

Also, the RDD API as I see it today is actually a pretty good way for developers to provide data (i.e. it's what data sources use). If we break it, we'd still need to come up with a new data input API.
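
For reference, this is roughly how a data source hands data to Spark SQL today, via an RDD[Row] (ExampleRelation is hypothetical; member bodies elided):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// The existing data source API: a relation declares its schema and
// produces its data as a plain RDD[Row].
class ExampleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType = ???     // the columns this source exposes
  override def buildScan(): RDD[Row] = ???  // the rows, as an RDD
}
{code}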





> RDD-like API on top of Catalyst/DataFrame
> -----------------------------------------
>
>                 Key: SPARK-9999
>                 URL: https://issues.apache.org/jira/browse/SPARK-9999
>             Project: Spark
>          Issue Type: Story
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Michael Armbrust
>
> The RDD API is very flexible, but as a result its execution is harder to optimize in some cases. The DataFrame API, on the other hand, is much easier to optimize, but lacks some of the nice perks of the RDD API (e.g. UDFs are harder to use, and it lacks strong typing in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or better than working with RDDs.  Encoders should be as fast as or faster than Kryo and Java serialization, and unnecessary conversions should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and the functions that operate on them should provide compile-time safety where possible.  When converting from data whose schema is not known at compile time (for example, data read from an external source such as JSON), the conversion function should fail fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be provided for a variety of object models: primitive types, case classes, tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in both Scala and Java.  Where possible, shared types like Array will be used in the API.  Where not possible, overloaded functions should be provided for both languages.  Scala concepts, such as ClassTags, should not be required in the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to transition seamlessly between Datasets and DataFrames without writing conversion boilerplate.  When names in the input schema line up with fields in the given class, no extra mapping should be necessary.  Libraries like MLlib should not need to provide different interfaces for accepting DataFrames and Datasets as input.  (A short usage sketch follows this list.)
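> A minimal usage sketch of these requirements (the .as[T] conversion, the implicits import, and the file name are assumptions based on this proposal, not a finalized API):
> {code:scala}
> import org.apache.spark.sql.SQLContext
>
> case class Person(name: String, age: Long)
>
> val sqlContext: SQLContext = ???   // from an existing SparkContext
> import sqlContext.implicits._      // default encoders for case classes, tuples, etc.
>
> // The schema is only known at runtime here (JSON), so the conversion
> // should fail fast if the fields don't line up with Person.
> val people = sqlContext.read.json("people.json").as[Person]
>
> // Typed, compile-checked transformation on domain objects.
> val names = people.map(_.name)
> {code}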
> For a detailed outline of the complete proposed API: [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]


