Posted to user@spark.apache.org by Alex Nastetsky <al...@verve.com> on 2018/01/10 00:45:39 UTC

Dataset API inconsistencies

I am finding the Dataset API very cumbersome to use, which is
unfortunate, as I was looking forward to the type safety after coming
from a DataFrame codebase.

This link summarizes my troubles: http://loicdescotte.github.io/posts/spark2-datasets-type-safety/

The problem is having to constantly switch back and forth between
typed and untyped semantics, which really kills productivity. In
contrast, the RDD API is consistently typed and the DataFrame API is
consistently untyped, so I don't have to stop and think about which
one to use for each operation.
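
For example, a minimal sketch of the kind of switching I mean (the
Person case class and the data are made up):

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] =
      Seq(Person("alice", 34), Person("bob", 7)).toDS()

    // typed: the compiler checks the lambda against Person
    val adults: Dataset[Person] = people.filter(_.age >= 18)

    // untyped: groupBy takes a column name string, and count() hands
    // back a DataFrame -- the Person type is gone
    val counts: DataFrame = adults.groupBy("name").count()

    // to get back to typed land I have to re-assert the types myself
    val typedAgain: Dataset[(String, Long)] = counts.as[(String, Long)]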

I gave the Frameless framework (mentioned in the link) a shot, but
eventually started running into oddities, and the lack of
documentation and community support made me unwilling to sink too much
time into it.

At this point I'm considering just sticking with DataFrames, as I
don't really consider Datasets to be usable. Has anyone had a similar
experience, or had better luck?

Alex.

Re: Dataset API inconsistencies

Posted by Michael Armbrust <mi...@databricks.com>.
I wrote Datasets, and I'll say I only use them when I really need to
(i.e. when it would be very cumbersome to express what I am trying to
do relationally). Dataset operations are almost always going to be
slower than their DataFrame equivalents, since they usually require
materializing objects (whereas DataFrame operations usually generate
code that operates directly on binary encoded data).
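
You can see the difference in the physical plans. A quick sketch (the
Person class and the data here are just illustrative):

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("a", 30), Person("b", 20)).toDS()

    // typed map: the plan contains DeserializeToObject /
    // SerializeFromObject, i.e. each row is materialized as a Person
    // object before the lambda runs
    ds.map(p => p.age + 1).explain()

    // the relational equivalent: generated code operates directly on
    // the binary row format, with no object materialization
    ds.select($"age" + 1).explain()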

We certainly could flesh out the API further (e.g. an orderBy that
takes a lambda), but so far I have not seen a lot of demand for this,
and it would be strictly slower than the DataFrame version. I worry
this wouldn't actually be beneficial to users, as it would give them a
choice that looks the same but has non-obvious performance
implications. If I'm in the minority with this opinion, though, we
should do it.
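
To make the trade-off concrete (continuing with the ds above; the
lambda variant in the comments is hypothetical and does not exist in
Spark today):

    // what exists today: type-preserving (returns Dataset[Person]),
    // but the sort key is an untyped Column the compiler can't check
    val sorted = ds.orderBy($"age".desc)

    // the hypothetical lambda version would look something like
    //   ds.orderBy(_.age)
    // but evaluating the lambda requires materializing a Person per
    // row, so it would always be slower than the Column version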

Regarding the concerns about type safety, I haven't really found it to
be a major issue. Even though you don't have type safety from the
Scala compiler, the Spark SQL analyzer checks your query before any
execution begins. This opinion is perhaps biased by the fact that I do
a lot of Spark SQL programming in notebooks, where the difference
between "compile-time" and "runtime" is pretty minimal.
