Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/02/26 00:23:33 UTC

[discuss] DataFrame vs Dataset in Spark 2.0

When we first introduced Dataset in 1.6 as an experimental API, we wanted
to merge Dataset/DataFrame but couldn't, because we didn't want to break the
pre-existing DataFrame API (e.g. the map function should return a Dataset
rather than an RDD). In Spark 2.0, one of the main API changes is to merge
DataFrame and Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
ways to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of
are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what
libraries or applications need to do, and we won't see type mismatches
(e.g. a function expects DataFrame, but the user is passing in Dataset[Row]);
see the sketch below
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary
compatibility for Scala/Java
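
For concreteness, here is a minimal sketch of what Option 1 amounts to. The
placement of the alias in the org.apache.spark.sql package object, and the
helper functions, are my own illustration, not a decided design; the snippet
is spark-shell style:

  // Essentially the whole of Option 1: a single compile-time alias, e.g.
  //   type DataFrame = Dataset[Row]
  // Every DataFrame method is then simply the corresponding Dataset[Row]
  // method; nothing needs to be forwarded or overridden.

  import org.apache.spark.sql.{DataFrame, Dataset, Row}

  // Under the alias these two signatures are identical to the compiler,
  // so callers can pass either "kind" of value to either function.
  def schemaOf(df: DataFrame): String = df.schema.treeString
  def schemaOfRows(ds: Dataset[Row]): String = ds.schema.treeString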


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in
Java, and binary compatibility for Scala/Java
- A lot more code (1000+ lines)
- Less clean, and can be confusing when users pass a Dataset[Row] into
a function that expects a DataFrame (see the sketch below)
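
And, for contrast, a toy sketch of the shape Option 2 takes. These are
stand-in classes I made up (not Spark's real Dataset), just to show where the
extra code and the confusion come from:

  // Toy stand-ins; NOT Spark's actual classes.
  type MiniRow = Map[String, Any]

  class MiniDataset[T](val rows: Seq[T]) {
    def filter(p: T => Boolean): MiniDataset[T] = new MiniDataset(rows.filter(p))
  }

  // Option 2 pattern: the concrete subclass has to override every
  // transformation just to narrow the return type back to the subclass;
  // that is roughly where the 1000+ extra lines go.
  class MiniDataFrame(rs: Seq[MiniRow]) extends MiniDataset[MiniRow](rs) {
    override def filter(p: MiniRow => Boolean): MiniDataFrame =
      new MiniDataFrame(rows.filter(p))
  }

  // And the confusion: a value typed as MiniDataset[MiniRow] is not accepted
  // where a MiniDataFrame is expected.
  def describe(df: MiniDataFrame): Unit = println(df.rows.length)

  val ds: MiniDataset[MiniRow] = new MiniDataset[MiniRow](Seq(Map("id" -> 1)))
  // describe(ds)   // does not compile: MiniDataset[MiniRow] is not a MiniDataFrame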


The concerns are mostly with Scala/Java. For Python, it is very easy to
maintain source compatibility for both options (there is no concept of binary
compatibility), and for R, we are only supporting the DataFrame operations
anyway, because that's a more familiar interface for R users outside of Spark.

Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Chester Chen <ch...@alpinenow.com>.
vote for Option 1.
  1)  Since 2.0 is a major API release, we are expecting some API changes,
  2)  It helps long-term code base maintenance, with short-term pain on the
Java side
  3)  Not quite sure how much of the existing code base uses the Java
DataFrame APIs.





On Thu, Feb 25, 2016 at 3:23 PM, Reynold Xin <rx...@databricks.com> wrote:

> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>

Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Jakob Odersky <ja...@odersky.com>.
I would recommend (non-binding) option 1.

Apart from the API breakage I can see only advantages, and that sole
disadvantage is minimal for a few reasons:

1. the DataFrame API has been "Experimental" since its introduction,
so no stability was ever implied
2. considering that the change is for a major release, some
incompatibilities are to be expected
3. using type aliases may break code now, but it will remove the
possibility of library incompatibilities in the future (see Reynold's
second point "[...] and we won't see type mismatches (e.g. a function
expects DataFrame, but user is passing in Dataset[Row]" and the sketch
below)
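
To make point 3 concrete, a small hypothetical example (SomeLibrary and the
method names are made up; it assumes Option 1's alias is in place):

  import org.apache.spark.sql.{DataFrame, Dataset, Row}

  // A third-party library whose public API is written against DataFrame...
  object SomeLibrary {
    def cleanColumns(df: DataFrame): DataFrame = df.na.drop()
  }

  // ...and application code that only ever talks about Dataset[Row]. Under
  // the alias the two names denote the same type, so this compiles with no
  // conversions and no possibility of the two artifacts disagreeing.
  def pipeline(ds: Dataset[Row]): Dataset[Row] = SomeLibrary.cleanColumns(ds)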

On Fri, Feb 26, 2016 at 11:51 AM, Reynold Xin <rx...@databricks.com> wrote:
> That's actually not Row vs non-Row.
>
> It's just primitive vs non-primitive. Primitives get automatically
> flattened, to avoid having to type ._1 all the time.
>
> On Fri, Feb 26, 2016 at 2:06 AM, Sun, Rui <ru...@intel.com> wrote:
>>
>> Thanks for the explaination.
>>
>>
>>
>> What confusing me is the different internal semantic of Dataset on non-Row
>> type (primitive types for example) and Row type:
>>
>>
>>
>> Dataset[Int] is internally actually Dataset[Row(value:Int)]
>>
>>
>>
>> scala> val ds = sqlContext.createDataset(Seq(1,2,3))
>>
>> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
>>
>>
>>
>> scala> ds.schema.json
>>
>> res17: String =
>> {"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}
>>
>>
>>
>> But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].
>>
>>
>>
>> From: Reynold Xin [mailto:rxin@databricks.com]
>> Sent: Friday, February 26, 2016 3:55 PM
>> To: Sun, Rui <ru...@intel.com>
>> Cc: Koert Kuipers <ko...@tresata.com>; dev@spark.apache.org
>>
>>
>> Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0
>>
>>
>>
>> The join and joinWith are just two different join semantics, and is not
>> about Dataset vs DataFrame.
>>
>>
>>
>> join is the relational join, where fields are flattened; joinWith is more
>> like a tuple join, where the output has two fields that are nested.
>>
>>
>>
>> So you can do
>>
>>
>>
>> Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]
>>
>>
>> DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]
>>
>>
>>
>> Dataset[A] join Dataset[B] = Dataset[Row]
>>
>>
>>
>> DataFrame[A] join DataFrame[B] = Dataset[Row]
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui <ru...@intel.com> wrote:
>>
>> Vote for option 2.
>>
>> Source compatibility and binary compatibility are very important from
>> user’s perspective.
>>
>> It ‘s unfair for Java developers that they don’t have DataFrame
>> abstraction. As you said, sometimes it is more natural to think about
>> DataFrame.
>>
>>
>>
>> I am wondering if conceptually there is slight subtle difference between
>> DataFrame and Dataset[Row]? For example,
>>
>> Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
>>
>> So,
>>
>> Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]
>>
>>
>>
>> While
>>
>> DataFrame join DataFrame is still DataFrame of Row?
>>
>>
>>
>> From: Reynold Xin [mailto:rxin@databricks.com]
>> Sent: Friday, February 26, 2016 8:52 AM
>> To: Koert Kuipers <ko...@tresata.com>
>> Cc: dev@spark.apache.org
>> Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0
>>
>>
>>
>> Yes - and that's why source compatibility is broken.
>>
>>
>>
>> Note that it is not just a "convenience" thing. Conceptually DataFrame is
>> a Dataset[Row], and for some developers it is more natural to think about
>> "DataFrame" rather than "Dataset[Row]".
>>
>>
>>
>> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
>> too, and some methods would return DataFrame (e.g. sql method).
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> since a type alias is purely a convenience thing for the scala compiler,
>> does option 1 mean that the concept of DataFrame ceases to exist from a java
>> perspective, and they will have to refer to Dataset<Row>?
>>
>>
>>
>> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> When we first introduced Dataset in 1.6 as an experimental API, we wanted
>> to merge Dataset/DataFrame but couldn't because we didn't want to break the
>> pre-existing DataFrame API (e.g. map function should return Dataset, rather
>> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
>> and Dataset.
>>
>>
>>
>> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
>> ways to implement this:
>>
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>>
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>>
>>
>>
>>
>> I'm wondering what you think about this. The pros and cons I can think of
>> are:
>>
>>
>>
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>>
>>
>> + Cleaner conceptually, especially in Scala. It will be very clear what
>> libraries or applications need to do, and we won't see type mismatches (e.g.
>> a function expects DataFrame, but user is passing in Dataset[Row]
>>
>> + A lot less code
>>
>> - Breaks source compatibility for the DataFrame API in Java, and binary
>> compatibility for Scala/Java
>>
>>
>>
>>
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>>
>>
>> The pros/cons are basically the inverse of Option 1.
>>
>>
>>
>> + In most cases, can maintain source compatibility for the DataFrame API
>> in Java, and binary compatibility for Scala/Java
>>
>> - A lot more code (1000+ loc)
>>
>> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
>> into a function that expects a DataFrame
>>
>>
>>
>>
>>
>> The concerns are mostly with Scala/Java. For Python, it is very easy to
>> maintain source compatibility for both (there is no concept of binary
>> compatibility), and for R, we are only supporting the DataFrame operations
>> anyway because that's more familiar interface for R users outside of Spark.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
That's actually not Row vs non-Row.

It's just primitive vs non-primitive. Primitives get automatically
flattened, to avoid having to type ._1 all the time.
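
A quick illustration of that distinction (spark-shell style; it assumes
sqlContext.implicits._ is in scope, and the printed output is approximate):

  // A primitive element type is flattened into a single top-level column
  // (named "value") rather than wrapped in a one-field struct, so no ._1 is needed.
  scala> val ints = sqlContext.createDataset(Seq(1, 2, 3))
  ints: org.apache.spark.sql.Dataset[Int] = [value: int]

  // A non-primitive element type (a tuple here) gets one column per field.
  scala> val pairs = sqlContext.createDataset(Seq((1, "a"), (2, "b")))
  pairs: org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]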

On Fri, Feb 26, 2016 at 2:06 AM, Sun, Rui <ru...@intel.com> wrote:

> Thanks for the explaination.
>
>
>
> What confusing me is the different internal semantic of Dataset on non-Row
> type (primitive types for example) and Row type:
>
>
>
> Dataset[Int] is internally actually Dataset[Row(value:Int)]
>
>
>
> scala> val ds = sqlContext.createDataset(Seq(1,2,3))
>
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
>
>
>
> scala> ds.schema.json
>
> res17: String =
> {"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}
>
>
>
> But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].
>
>
>
> *From:* Reynold Xin [mailto:rxin@databricks.com]
> *Sent:* Friday, February 26, 2016 3:55 PM
> *To:* Sun, Rui <ru...@intel.com>
> *Cc:* Koert Kuipers <ko...@tresata.com>; dev@spark.apache.org
>
> *Subject:* Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
>
>
> The join and joinWith are just two different join semantics, and is not
> about Dataset vs DataFrame.
>
>
>
> join is the relational join, where fields are flattened; joinWith is more
> like a tuple join, where the output has two fields that are nested.
>
>
>
> So you can do
>
>
>
> Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]
>
>
> DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]
>
>
>
> Dataset[A] join Dataset[B] = Dataset[Row]
>
>
>
> DataFrame[A] join DataFrame[B] = Dataset[Row]
>
>
>
>
>
>
>
> On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui <ru...@intel.com> wrote:
>
> Vote for option 2.
>
> Source compatibility and binary compatibility are very important from
> user’s perspective.
>
> It ‘s unfair for Java developers that they don’t have DataFrame
> abstraction. As you said, sometimes it is more natural to think about
> DataFrame.
>
>
>
> I am wondering if conceptually there is slight subtle difference between
> DataFrame and Dataset[Row]? For example,
>
> Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
>
> So,
>
> Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]
>
>
>
> While
>
> DataFrame join DataFrame is still DataFrame of Row?
>
>
>
> *From:* Reynold Xin [mailto:rxin@databricks.com]
> *Sent:* Friday, February 26, 2016 8:52 AM
> *To:* Koert Kuipers <ko...@tresata.com>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
>
>
> Yes - and that's why source compatibility is broken.
>
>
>
> Note that it is not just a "convenience" thing. Conceptually DataFrame is
> a Dataset[Row], and for some developers it is more natural to think about
> "DataFrame" rather than "Dataset[Row]".
>
>
>
> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
> too, and some methods would return DataFrame (e.g. sql method).
>
>
>
>
>
>
>
> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> since a type alias is purely a convenience thing for the scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> java perspective, and they will have to refer to Dataset<Row>?
>
>
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
>
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
>
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
>
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
>
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
>
> + A lot less code
>
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
>
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
>
> The pros/cons are basically the inverse of Option 1.
>
>
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
>
> - A lot more code (1000+ loc)
>
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
>
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>
>
>
>
>
>
>
>
>

RE: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by "Sun, Rui" <ru...@intel.com>.
Thanks for the explanation.

What confuses me is the different internal semantics of Dataset for non-Row types (primitive types, for example) versus the Row type:

Dataset[Int] is internally actually Dataset[Row(value:Int)]

scala> val ds = sqlContext.createDataset(Seq(1,2,3))
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.schema.json
res17: String = {"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}

But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].
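
For comparison, a spark-shell style sketch of the Row side (output abbreviated;
the exact JSON may differ slightly by version):

  scala> val df = sqlContext.createDataFrame(Seq((1, "a"))).toDF("id", "name")
  df: org.apache.spark.sql.DataFrame = [id: int, name: string]

  scala> df.schema.json
  res1: String =
  {"type":"struct","fields":[{"name":"id","type":"integer",...},{"name":"name","type":"string",...}]}

  // The schema is the Row's own fields at the top level; there is no extra
  // wrapping "value" column as in the primitive case above.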

From: Reynold Xin [mailto:rxin@databricks.com]
Sent: Friday, February 26, 2016 3:55 PM
To: Sun, Rui <ru...@intel.com>
Cc: Koert Kuipers <ko...@tresata.com>; dev@spark.apache.org
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

The join and joinWith are just two different join semantics, and is not about Dataset vs DataFrame.

join is the relational join, where fields are flattened; joinWith is more like a tuple join, where the output has two fields that are nested.

So you can do

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]

DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]

Dataset[A] join Dataset[B] = Dataset[Row]

DataFrame[A] join DataFrame[B] = Dataset[Row]



On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui <ru...@intel.com>> wrote:
Vote for option 2.
Source compatibility and binary compatibility are very important from user’s perspective.
It ‘s unfair for Java developers that they don’t have DataFrame abstraction. As you said, sometimes it is more natural to think about DataFrame.

I am wondering if conceptually there is slight subtle difference between DataFrame and Dataset[Row]? For example,
Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
So,
Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]

While
DataFrame join DataFrame is still DataFrame of Row?

From: Reynold Xin [mailto:rxin@databricks.com<ma...@databricks.com>]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers <ko...@tresata.com>>
Cc: dev@spark.apache.org<ma...@spark.apache.org>
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a Dataset[Row], and for some developers it is more natural to think about "DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com>> wrote:
since a type alias is purely a convenience thing for the scala compiler, does option 1 mean that the concept of DataFrame ceases to exist from a java perspective, and they will have to refer to Dataset<Row>?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com>> wrote:
When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. map function should return Dataset, rather than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less cleaner, and can be confusing when users pass in a Dataset[Row] into a function that expects a DataFrame


The concerns are mostly with Scala/Java. For Python, it is very easy to maintain source compatibility for both (there is no concept of binary compatibility), and for R, we are only supporting the DataFrame operations anyway because that's more familiar interface for R users outside of Spark.






Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
join and joinWith are just two different join semantics; this is not
about Dataset vs DataFrame.

join is the relational join, where fields are flattened; joinWith is more
like a tuple join, where the output has two fields that are nested.

So you can do

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]

DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]

Dataset[A] join Dataset[B] = Dataset[Row]

DataFrame[A] join DataFrame[B] = Dataset[Row]
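
A hedged spark-shell sketch of the two behaviours (using the merged Spark
2.x-style API for illustration; the case classes, data, and printed output
are mine and only approximate):

  scala> import sqlContext.implicits._
  scala> case class A(id: Int, x: String)
  scala> case class B(id: Int, y: String)
  scala> val dsA = Seq(A(1, "a")).toDS()
  scala> val dsB = Seq(B(1, "b")).toDS()

  // joinWith: a tuple join; both sides are kept as nested structs.
  scala> val nested = dsA.joinWith(dsB, dsA("id") === dsB("id"))
  nested: org.apache.spark.sql.Dataset[(A, B)] = [_1: struct<id:int,x:string>, _2: struct<id:int,y:string>]

  // join: the relational join; the fields are flattened into a single row.
  scala> val flat = dsA.join(dsB, "id")
  flat: org.apache.spark.sql.DataFrame = [id: int, x: string, y: string]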



On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui <ru...@intel.com> wrote:

> Vote for option 2.
>
> Source compatibility and binary compatibility are very important from
> user’s perspective.
>
> It ‘s unfair for Java developers that they don’t have DataFrame
> abstraction. As you said, sometimes it is more natural to think about
> DataFrame.
>
>
>
> I am wondering if conceptually there is slight subtle difference between
> DataFrame and Dataset[Row]? For example,
>
> Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
>
> So,
>
> Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]
>
>
>
> While
>
> DataFrame join DataFrame is still DataFrame of Row?
>
>
>
> *From:* Reynold Xin [mailto:rxin@databricks.com]
> *Sent:* Friday, February 26, 2016 8:52 AM
> *To:* Koert Kuipers <ko...@tresata.com>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
>
>
> Yes - and that's why source compatibility is broken.
>
>
>
> Note that it is not just a "convenience" thing. Conceptually DataFrame is
> a Dataset[Row], and for some developers it is more natural to think about
> "DataFrame" rather than "Dataset[Row]".
>
>
>
> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
> too, and some methods would return DataFrame (e.g. sql method).
>
>
>
>
>
>
>
> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> since a type alias is purely a convenience thing for the scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> java perspective, and they will have to refer to Dataset<Row>?
>
>
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
>
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
>
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
>
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
>
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
>
> + A lot less code
>
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
>
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
>
> The pros/cons are basically the inverse of Option 1.
>
>
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
>
> - A lot more code (1000+ loc)
>
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
>
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>
>
>
>
>
>
>

RE: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by "Sun, Rui" <ru...@intel.com>.
Vote for option 2.
Source compatibility and binary compatibility are very important from user’s perspective.
It's unfair for Java developers that they don't have the DataFrame abstraction. As you said, sometimes it is more natural to think about DataFrame.

I am wondering if conceptually there is a subtle difference between DataFrame and Dataset[Row]? For example,
Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
So,
Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]

While
DataFrame join DataFrame is still DataFrame of Row?

From: Reynold Xin [mailto:rxin@databricks.com]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers <ko...@tresata.com>
Cc: dev@spark.apache.org
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a Dataset[Row], and for some developers it is more natural to think about "DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com>> wrote:
since a type alias is purely a convenience thing for the scala compiler, does option 1 mean that the concept of DataFrame ceases to exist from a java perspective, and they will have to refer to Dataset<Row>?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com>> wrote:
When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. map function should return Dataset, rather than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less cleaner, and can be confusing when users pass in a Dataset[Row] into a function that expects a DataFrame


The concerns are mostly with Scala/Java. For Python, it is very easy to maintain source compatibility for both (there is no concept of binary compatibility), and for R, we are only supporting the DataFrame operations anyway because that's more familiar interface for R users outside of Spark.





Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a
Dataset[Row], and for some developers it is more natural to think about
"DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row]
too, and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:

> since a type alias is purely a convenience thing for the scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> java perspective, and they will have to refer to Dataset<Row>?
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> When we first introduced Dataset in 1.6 as an experimental API, we wanted
>> to merge Dataset/DataFrame but couldn't because we didn't want to break the
>> pre-existing DataFrame API (e.g. map function should return Dataset, rather
>> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
>> and Dataset.
>>
>> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are
>> two ways to implement this:
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>>
>> I'm wondering what you think about this. The pros and cons I can think of
>> are:
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>> + Cleaner conceptually, especially in Scala. It will be very clear what
>> libraries or applications need to do, and we won't see type mismatches
>> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
>> + A lot less code
>> - Breaks source compatibility for the DataFrame API in Java, and binary
>> compatibility for Scala/Java
>>
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>> The pros/cons are basically the inverse of Option 1.
>>
>> + In most cases, can maintain source compatibility for the DataFrame API
>> in Java, and binary compatibility for Scala/Java
>> - A lot more code (1000+ loc)
>> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
>> into a function that expects a DataFrame
>>
>>
>> The concerns are mostly with Scala/Java. For Python, it is very easy to
>> maintain source compatibility for both (there is no concept of binary
>> compatibility), and for R, we are only supporting the DataFrame operations
>> anyway because that's more familiar interface for R users outside of Spark.
>>
>>
>>
>

Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
since a type alias is purely a convenience thing for the Scala compiler,
does option 1 mean that the concept of DataFrame ceases to exist from a
Java perspective, and Java users will have to refer to Dataset<Row>?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <rx...@databricks.com> wrote:

> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>

Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
It might make sense, but this option seems to carry all the cons of Option
2, and yet doesn't provide compatibility for Java?

On Thu, Feb 25, 2016 at 3:31 PM, Michael Malak <mi...@yahoo.com>
wrote:

> Would it make sense (in terms of feasibility, code organization, and
> politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra
> lines to a Java compatibility layer/class?
>
>
> ------------------------------
> *From:* Reynold Xin <rx...@databricks.com>
> *To:* "dev@spark.apache.org" <de...@spark.apache.org>
> *Sent:* Thursday, February 25, 2016 4:23 PM
> *Subject:* [discuss] DataFrame vs Dataset in Spark 2.0
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less cleaner, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's more familiar interface for R users outside of Spark.
>
>
>
>
>

Re: [discuss] DataFrame vs Dataset in Spark 2.0

Posted by Michael Malak <mi...@yahoo.com.INVALID>.
Would it make sense (in terms of feasibility, code organization, and politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines into a Java compatibility layer/class?

From: Reynold Xin <rx...@databricks.com>
To: "dev@spark.apache.org" <de...@spark.apache.org>
Sent: Thursday, February 25, 2016 4:23 PM
Subject: [discuss] DataFrame vs Dataset in Spark 2.0

When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. map function should return Dataset, rather than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and Dataset.
Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways to implement this:
Option 1. Make DataFrame a type alias for Dataset[Row]
Option 2. DataFrame as a concrete class that extends Dataset[Row]

I'm wondering what you think about this. The pros and cons I can think of are:

Option 1. Make DataFrame a type alias for Dataset[Row]
+ Cleaner conceptually, especially in Scala. It will be very clear what libraries or applications need to do, and we won't see type mismatches (e.g. a function expects DataFrame, but user is passing in Dataset[Row]
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java

Option 2. DataFrame as a concrete class that extends Dataset[Row]
The pros/cons are basically the inverse of Option 1.
+ In most cases, can maintain source compatibility for the DataFrame API in Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less cleaner, and can be confusing when users pass in a Dataset[Row] into a function that expects a DataFrame

The concerns are mostly with Scala/Java. For Python, it is very easy to maintain source compatibility for both (there is no concept of binary compatibility), and for R, we are only supporting the DataFrame operations anyway because that's more familiar interface for R users outside of Spark.