Posted to user@spark.apache.org by Soheil Pourbafrani <so...@gmail.com> on 2018/10/28 14:50:04 UTC

Processing Flexibility Between RDD and Dataframe API

Hi,
There are some functions, like map, flatMap, and reduce, that form the basic
data processing operations in big data (and in Apache Spark). But Spark, in its
newer versions, introduces the high-level Dataframe API and recommends using
it, even though the Dataframe API has no such functions and offers only
built-in functions and UDFs. That is quite inflexible (at least to me), and at
many points I have to convert Dataframes to RDDs and back. My questions are:
Is the RDD API going to become obsolete, and if so, what is the correct roadmap
for processing with Apache Spark, given that the Dataframe API does not support
functions like map and reduce? How do UDFs process the data: are they applied
to every row, like map functions? Does converting a Dataframe to an RDD come
with a significant cost?
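
To make the question concrete, here is a minimal sketch (column names and data
are made up for illustration) of the same per-row transformation written once
as a Dataframe UDF and once as an RDD map, including the conversion between
the two:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, udf}

  val spark = SparkSession.builder.appName("udf-vs-rdd").getOrCreate()
  import spark.implicits._

  // Hypothetical input: a Dataframe with a single string column "word".
  val df = Seq("spark", "rdd", "dataframe").toDF("word")

  // Dataframe style: the UDF is applied to the "word" column of every row.
  val upper = udf((s: String) => s.toUpperCase)
  val viaUdf = df.withColumn("upper", upper(col("word")))

  // RDD style: drop to the underlying RDD[Row], map over it, come back.
  val viaRdd = df.rdd
    .map(row => row.getString(0).toUpperCase)
    .toDF("upper")
  // The conversion is not free: rows are deserialized out of Spark SQL's
  // optimized internal representation into plain JVM objects.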

Re: Processing Flexibility Between RDD and Dataframe API

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I would recommend reading the book by Matei Zaharia. One of the main
differentiating factors between Spark 1.x and subsequent releases has been
optimization, and hence Dataframes. RDDs are in no way going away, because
Dataframes are built on top of RDDs. Using RDDs is still allowed, and is
recommended, in scenarios where greater flexibility is required, and those
scenarios are explicitly and clearly documented.

But if someone tells me they will not use Dataframes at all, it means they will
eventually end up delivering solutions to companies that are suboptimal,
expensive to run, difficult to maintain, flaky while scaling, and that
introduce resource dependencies. I see no reason why anyone would do that.
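
For what it's worth, a small sketch (the numbers are arbitrary) of both points:
a Dataframe still exposes the RDD it is built on, and the Catalyst optimizer is
what the Dataframe API adds on top of it:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("df-on-rdd").getOrCreate()

  val df = spark.range(1000).selectExpr("id", "id % 10 AS bucket")

  // Every Dataframe is backed by an RDD; you can always drop down to it.
  val underlying = df.rdd   // RDD[org.apache.spark.sql.Row]

  // explain() prints the physical plan Catalyst produced; the optimizer
  // rewrites the query before it is executed as RDD operations.
  df.groupBy("bucket").count().explain()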


Regards,
Gourav Sengupta


Re: Processing Flexibility Between RDD and Dataframe API

Posted by Jungtaek Lim <ka...@gmail.com>.
Just my 2 cents as one of the contributors: while SQL semantics can express
many of the use cases data scientists encounter, I agree that end users who are
more familiar with code than with SQL can find it inflexible.

But countless efforts have gone into Spark SQL (and Catalyst), so it is clear
that Spark SQL and Structured Streaming are the way to go if your workload fits
them; on the other hand, if it doesn't, just keep using the RDD API. RDDs are
still what underlies Spark SQL, so I don't expect them to be deprecated unless
Spark renews its underlying architecture.
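
As one illustration of the "keep using RDD where it fits better" point, here is
a hedged sketch (the per-partition resource is just a placeholder) of
partition-level logic that has no direct built-in Dataframe equivalent but is
straightforward on the underlying RDD:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("rdd-fallback").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

  // Per-partition setup (e.g. one expensive client per partition) is
  // natural with mapPartitions on the RDD.
  val enriched = df.rdd.mapPartitions { rows =>
    val client = s"client-${java.util.UUID.randomUUID()}"  // placeholder resource
    rows.map(r => (r.getString(0), r.getInt(1), client))
  }

  // Once the custom step is done, return to the relational API.
  val backToDf = enriched.toDF("key", "value", "client")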

-Jungtaek Lim


Re: Processing Flexibility Between RDD and Dataframe API

Posted by Adrienne Kole <ad...@gmail.com>.
Thanks for bringing this issue to the mailing list.
In addition, I would ask the same questions about the DStreams and Structured
Streaming APIs.
Structured Streaming is high level, and it is difficult to express all business
logic in it, although Databricks is pushing it and recommending its use.
Moreover, there is some ongoing work on continuous processing.
So what is Spark's future vision: support all of these, or concentrate on one,
given that all of these paradigms have separate processing semantics?
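
To make the "separate processing semantics" point concrete, here is a minimal
sketch (source and sink are just the built-in test ones) of the same streaming
query under the default micro-batch trigger and under the experimental
continuous trigger added in Spark 2.3:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder.appName("triggers").getOrCreate()

  val input = spark.readStream.format("rate").load()

  // Default Structured Streaming execution: micro-batches.
  val microBatch = input.writeStream
    .format("console")
    .trigger(Trigger.ProcessingTime("5 seconds"))
    .start()

  // Continuous processing (experimental): different execution semantics,
  // and only a restricted set of operations is supported.
  val continuous = input.writeStream
    .format("console")
    .trigger(Trigger.Continuous("1 second"))
    .start()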


Cheers,
Adrienne
