You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Cody Koeninger <co...@koeninger.org> on 2016/06/16 17:53:47 UTC

Structured streaming use of DataFrame vs Datasource

Is there a principled reason why sql.streaming.* and
sql.execution.streaming.* are making extensive use of DataFrame
instead of Datasource?

Or is that just a holdover from code written before the move / type alias?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Structured streaming use of DataFrame vs Datasource

Posted by Tathagata Das <ta...@gmail.com>.

Its not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that the catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compilation-time type checks of the user program.

On Thu, Jun 16, 2016 at 2:05 PM, Cody Koeninger <co...@koeninger.org> wrote:

> I'm clear on what a type alias is.  My question is more that moving
> from e.g. Dataset[T] to Dataset[Row] involves throwing away
> information.  Reading through code that uses the Dataframe alias, it's
> a little hard for me to know when that's intentional or not.
>
>
> On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
> <ta...@gmail.com> wrote:
> > There are different ways to view this. If its confusing to think that
> Source
> > API returning DataFrames, its equivalent to thinking that you are
> returning
> > a Dataset[Row], and DataFrame is just a shorthand. And
> > DataFrame/Datasetp[Row] is to Dataset[String] is what java Array[Object]
> is
> > to Array[String]. DataFrame is more general in a way, as every Dataset
> can
> > be boiled down to a DataFrame. So to keep the Source APIs general (and
> also
> > source-compatible with 1.x), they return DataFrame.
> >
> > On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> Is this really an internal / external distinction?
> >>
> >> For a concrete example, Source.getBatch seems to be a public
> >> interface, but returns DataFrame.
> >>
> >> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
> >> <ta...@gmail.com> wrote:
> >> > DataFrame is a type alias of Dataset[Row], so externally it seems like
> >> > Dataset is the main type and DataFrame is a derivative type.
> >> > However, internally, since everything is processed as Rows, everything
> >> > uses
> >> > DataFrames, Type classes used in a Dataset is internally converted to
> >> > rows
> >> > for processing. . Therefore internally DataFrame is like "main" type
> >> > that is
> >> > used.
> >> >
> >> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <co...@koeninger.org>
> >> > wrote:
> >> >>
> >> >> Sorry, meant DataFrame vs Dataset
> >> >>
> >> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <cody@koeninger.org
> >
> >> >> wrote:
> >> >> > Is there a principled reason why sql.streaming.* and
> >> >> > sql.execution.streaming.* are making extensive use of DataFrame
> >> >> > instead of Datasource?
> >> >> >
> >> >> > Or is that just a holdover from code written before the move / type
> >> >> > alias?
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> For additional commands, e-mail: dev-help@spark.apache.org
> >> >>
> >> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: Structured streaming use of DataFrame vs Datasource

Posted by Cody Koeninger <co...@koeninger.org>.

I'm clear on what a type alias is.  My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information.  Reading through code that uses the Dataframe alias, it's
a little hard for me to know when that's intentional or not.


On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
<ta...@gmail.com> wrote:
> There are different ways to view this. If its confusing to think that Source
> API returning DataFrames, its equivalent to thinking that you are returning
> a Dataset[Row], and DataFrame is just a shorthand. And
> DataFrame/Datasetp[Row] is to Dataset[String] is what java Array[Object] is
> to Array[String]. DataFrame is more general in a way, as every Dataset can
> be boiled down to a DataFrame. So to keep the Source APIs general (and also
> source-compatible with 1.x), they return DataFrame.
>
> On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Is this really an internal / external distinction?
>>
>> For a concrete example, Source.getBatch seems to be a public
>> interface, but returns DataFrame.
>>
>> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
>> <ta...@gmail.com> wrote:
>> > DataFrame is a type alias of Dataset[Row], so externally it seems like
>> > Dataset is the main type and DataFrame is a derivative type.
>> > However, internally, since everything is processed as Rows, everything
>> > uses
>> > DataFrames, Type classes used in a Dataset is internally converted to
>> > rows
>> > for processing. . Therefore internally DataFrame is like "main" type
>> > that is
>> > used.
>> >
>> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> Sorry, meant DataFrame vs Dataset
>> >>
>> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <co...@koeninger.org>
>> >> wrote:
>> >> > Is there a principled reason why sql.streaming.* and
>> >> > sql.execution.streaming.* are making extensive use of DataFrame
>> >> > instead of Datasource?
>> >> >
>> >> > Or is that just a holdover from code written before the move / type
>> >> > alias?
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: dev-help@spark.apache.org
>> >>
>> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Structured streaming use of DataFrame vs Datasource

Posted by Tathagata Das <ta...@gmail.com>.

There are different ways to view this. If its confusing to think that
Source API returning DataFrames, its equivalent to thinking that you are
returning a Dataset[Row], and DataFrame is just a shorthand.
And DataFrame/Datasetp[Row] is to Dataset[String] is what java
Array[Object] is to Array[String]. DataFrame is more general in a way, as
every Dataset can be boiled down to a DataFrame. So to keep the Source APIs
general (and also source-compatible with 1.x), they return DataFrame.

On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Is this really an internal / external distinction?
>
> For a concrete example, Source.getBatch seems to be a public
> interface, but returns DataFrame.
>
> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
> <ta...@gmail.com> wrote:
> > DataFrame is a type alias of Dataset[Row], so externally it seems like
> > Dataset is the main type and DataFrame is a derivative type.
> > However, internally, since everything is processed as Rows, everything
> uses
> > DataFrames, Type classes used in a Dataset is internally converted to
> rows
> > for processing. . Therefore internally DataFrame is like "main" type
> that is
> > used.
> >
> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> Sorry, meant DataFrame vs Dataset
> >>
> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <co...@koeninger.org>
> >> wrote:
> >> > Is there a principled reason why sql.streaming.* and
> >> > sql.execution.streaming.* are making extensive use of DataFrame
> >> > instead of Datasource?
> >> >
> >> > Or is that just a holdover from code written before the move / type
> >> > alias?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >
>

Re: Structured streaming use of DataFrame vs Datasource

Posted by Cody Koeninger <co...@koeninger.org>.

Is this really an internal / external distinction?

For a concrete example, Source.getBatch seems to be a public
interface, but returns DataFrame.

On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
<ta...@gmail.com> wrote:
> DataFrame is a type alias of Dataset[Row], so externally it seems like
> Dataset is the main type and DataFrame is a derivative type.
> However, internally, since everything is processed as Rows, everything uses
> DataFrames, Type classes used in a Dataset is internally converted to rows
> for processing. . Therefore internally DataFrame is like "main" type that is
> used.
>
> On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Sorry, meant DataFrame vs Dataset
>>
>> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> > Is there a principled reason why sql.streaming.* and
>> > sql.execution.streaming.* are making extensive use of DataFrame
>> > instead of Datasource?
>> >
>> > Or is that just a holdover from code written before the move / type
>> > alias?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Structured streaming use of DataFrame vs Datasource

Posted by Tathagata Das <ta...@gmail.com>.

DataFrame is a type alias of Dataset[Row], so externally it seems like
Dataset is the main type and DataFrame is a derivative type.
However, internally, since everything is processed as Rows, everything uses
DataFrames, Type classes used in a Dataset is internally converted to rows
for processing. . Therefore internally DataFrame is like "main" type that
is used.

On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <co...@koeninger.org> wrote:

> Sorry, meant DataFrame vs Dataset
>
> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> > Is there a principled reason why sql.streaming.* and
> > sql.execution.streaming.* are making extensive use of DataFrame
> > instead of Datasource?
> >
> > Or is that just a holdover from code written before the move / type
> alias?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: Structured streaming use of DataFrame vs Datasource

Posted by Cody Koeninger <co...@koeninger.org>.

Sorry, meant DataFrame vs Dataset

On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <co...@koeninger.org> wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just a holdover from code written before the move / type alias?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org