Posted to user@spark.apache.org by "Tracewski, Lukasz " <lu...@credit-suisse.com> on 2015/10/08 16:40:58 UTC

Dataframes - sole data structure for parallel computations?

Hi,

Many people interpret this slide from Databricks
https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png
as an indication that the DataFrame API is going to be the main processing unit of Spark and the sole access point to MLlib, Streaming and such. Is that true? My impression was that DataFrames are an additional abstraction layer, with some promising optimisations coming from the Tungsten project, but that's all. RDDs are here to stay. They are the natural choice for, e.g., processing images.

Here is one article that advertises DataFrames as the "sole data structure for parallel computations":
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ (paragraph 4)

Cheers,
Lucas




=============================================================================== 
Please access the attached hyperlink for an important electronic communications disclaimer: 
http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html 
===============================================================================

Re: Dataframes - sole data structure for parallel computations?

Posted by Jerry Lam <ch...@gmail.com>.

I just read the article by ogirardot, but I don’t agree.
It is like saying the pandas DataFrame is the sole data structure for analyzing data in Python. Can a pandas DataFrame replace a NumPy array? For some computations the answer is simply no, for efficiency reasons.

Unless there is a computer science breakthrough in data structures (i.e. a data structure for everything), the "sole data structure" claim can only be treated as a joke. Just in case people get upset: I AM JOKING :)
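
To make the pandas/NumPy comparison concrete, here is a minimal sketch. It is not from the article or this thread; the array contents and the column names "a" to "d" are made up for illustration. It computes the same matrix product once on a raw NumPy array and once on a DataFrame holding the same numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a small 2-D numeric block (think pixel intensities).
values = np.arange(12, dtype=np.float64).reshape(3, 4)

# The same numbers wrapped in a DataFrame, which adds row and column labels.
frame = pd.DataFrame(values, columns=["a", "b", "c", "d"])

# Plain array: the matrix product works directly on the numeric buffer.
gram_np = values.T @ values

# DataFrame: the same product, but the labels are aligned before multiplying.
gram_pd = frame.T.dot(frame)

# Both routes give the same numbers; the DataFrame result also carries labels.
same = np.allclose(gram_np, gram_pd.to_numpy())
print(same)
```

The results match, so the difference is not about what can be expressed. The labelled structure does extra bookkeeping that purely numeric code does not need, which is why an array can be the better fit for some computations.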

> On Oct 8, 2015, at 1:56 PM, Michael Armbrust <mi...@databricks.com> wrote:
> 
> Don't worry, the ability to work with domain objects and lambda functions is not going to go away.  However, we are looking at ways to leverage Tungsten's improved performance when processing structured data.
> 
> More details can be found here:
> https://issues.apache.org/jira/browse/SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
> 

Re: Dataframes - sole data structure for parallel computations?

Posted by Michael Armbrust <mi...@databricks.com>.

Don't worry, the ability to work with domain objects and lambda functions
is not going to go away.  However, we are looking at ways to leverage
Tungsten's improved performance when processing structured data.

More details can be found here:
https://issues.apache.org/jira/browse/SPARK-9999
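
The contrast between "domain objects and lambda functions" and "structured data" can be sketched in plain Python. This is an analogy only, not Spark's API and not the design tracked in the JIRA above; the `Image` class, the `select` helper and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
import operator


@dataclass
class Image:
    path: str
    width: int
    height: int


images = [
    Image("a.png", 640, 480),
    Image("b.png", 1920, 1080),
    Image("c.png", 3840, 2160),
]

# Style 1: domain objects plus a lambda. An engine can only call the
# function; it cannot see which fields the function reads.
wide_objects = [img.path for img in filter(lambda img: img.width > 1000, images)]

# Style 2: structured columns plus a declarative predicate. The predicate
# is plain data, so an engine can tell that only "width" is needed.
columns = {
    "path": [img.path for img in images],
    "width": [img.width for img in images],
    "height": [img.height for img in images],
}
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def select(cols, project, where):
    """Return the values of column `project` for rows matching `where`."""
    name, op, literal = where
    keep = [OPS[op](value, literal) for value in cols[name]]
    return [value for value, ok in zip(cols[project], keep) if ok]


wide_columns = select(columns, "path", ("width", ">", 1000))
print(wide_objects, wide_columns)
```

Both styles return the same paths. The second gives the engine a description of the work rather than an opaque function, which is the kind of information an optimizer for structured data can exploit, while the first remains the flexible option for arbitrary objects.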
