Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/01/26 23:18:34 UTC

renaming SchemaRDD -> DataFrame

Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline API. We also expect more and
more users to be programming directly against the SchemaRDD API rather than the
core RDD API. SchemaRDD, through its less commonly used DSL originally
designed for writing test cases, has always had a data-frame-like API. In
1.3, we are redesigning the API to make it usable for end users.


There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
though it would contain some RDD functions like map, flatMap, etc), and
calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
DataFrame.rdd will return the underlying RDD for all RDD methods.


My understanding is that very few users program directly against the
SchemaRDD API at the moment, because it is not well documented. However,
to maintain backward compatibility, we can create a type alias for DataFrame
that is still named SchemaRDD. This will maintain source compatibility for
Scala. That said, we will have to update all existing materials to use
DataFrame rather than SchemaRDD.
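
For Scala users, a rough sketch of what the alias plus the .rdd escape hatch
could look like (illustrative only; it assumes the 1.3 API layout, and the
object and file names here are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

object RenameSketch {
  // The compatibility alias itself would live in the org.apache.spark.sql
  // package object, roughly:  type SchemaRDD = DataFrame
  def example(sqlContext: SQLContext): Unit = {
    // Methods such as parquetFile would now return DataFrame; old code that
    // spells the type SchemaRDD keeps compiling, because the alias is just
    // another name for the same type.
    val df: DataFrame = sqlContext.parquetFile("people.parquet")

    // DataFrame is no longer an RDD, but .rdd exposes the underlying
    // RDD[Row] for map/flatMap and the rest of the core RDD API.
    val rows: RDD[Row] = df.rdd
    println(rows.count())
  }
}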

Re: renaming SchemaRDD -> DataFrame

Posted by Michael Malak <mi...@yahoo.com.INVALID>.
I personally have no preference between DataFrame and DataTable, but only wish to lay out the history and etymology, simply because I'm into that sort of thing.

"Frame" comes from Marvin Minsky's 1970's AI construct: "slots" and the data that go in them. The S programming language (precursor to R) adopted this terminology in 1991. R of course became popular with the rise of Data Science around 2012.
http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22&cmpt=q&tz=

"DataFrame" would carry the implication that it comes along with its own metadata, whereas "DataTable" might carry the implication that metadata is stored in a central metadata repository.

"DataFrame" is thus technically more correct for SchemaRDD, but is a less familiar (and thus less accessible) term for those not immersed in data science or AI and thus may have narrower appeal.


----- Original Message -----
From: Evan R. Sparks <ev...@gmail.com>
To: Matei Zaharia <ma...@gmail.com>
Cc: Koert Kuipers <ko...@tresata.com>; Michael Malak <mi...@yahoo.com>; Patrick Wendell <pw...@gmail.com>; Reynold Xin <rx...@databricks.com>; "dev@spark.apache.org" <de...@spark.apache.org>
Sent: Tuesday, January 27, 2015 9:55 AM
Subject: Re: renaming SchemaRDD -> DataFrame

I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a "Frame" is.



On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> (Actually when we designed Spark SQL we thought of giving it another name,
> like Spark Schema, but we decided to stick with SQL since that was the most
> obvious use case to many users.)
>
> Matei
>
> > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
> >
> > While it might be possible to move this concept to Spark Core long-term,
> supporting structured data efficiently does require quite a bit of the
> infrastructure in Spark SQL, such as query planning and columnar storage.
> The intent of Spark SQL though is to be more than a SQL server -- it's
> meant to be a library for manipulating structured data. Since this is
> possible to build over the core API, it's pretty natural to organize it
> that way, same as Spark Streaming is a library.
> >
> > Matei
> >
> >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> "The context is that SchemaRDD is becoming a common data format used for
> >> bringing data into Spark from external systems, and used for various
> >> components of Spark, e.g. MLlib's new pipeline API."
> >>
> >> i agree. this to me also implies it belongs in spark core, not sql
> >>
> >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> michaelmalak@yahoo.com.invalid> wrote:
> >>
> >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
> >>> Spark Meetup YouTube contained a wealth of background information on this
> >>> idea (mostly from Patrick and Reynold :-).
> >>>
> >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>
> >>> ________________________________
> >>> From: Patrick Wendell <pw...@gmail.com>
> >>> To: Reynold Xin <rx...@databricks.com>
> >>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >>> Sent: Monday, January 26, 2015 4:01 PM
> >>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>
> >>>
> >>> One thing potentially not clear from this e-mail, there will be a 1:1
> >>> correspondence where you can get an RDD to/from a DataFrame.
> >>>
> >>>
> >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> >>>> Hi,
> >>>>
> >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> >>>> get the community's opinion.
> >>>>
> >>>> The context is that SchemaRDD is becoming a common data format used for
> >>>> bringing data into Spark from external systems, and used for various
> >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> >>>> more users to be programming directly against the SchemaRDD API rather than
> >>>> the core RDD API. SchemaRDD, through its less commonly used DSL originally
> >>>> designed for writing test cases, has always had a data-frame-like API. In
> >>>> 1.3, we are redesigning the API to make it usable for end users.
> >>>>
> >>>>
> >>>> There are two motivations for the renaming:
> >>>>
> >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>>
> >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> >>>> though it would contain some RDD functions like map, flatMap, etc), and
> >>>> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> >>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> >>>>
> >>>>
> >>>> My understanding is that very few users program directly against the
> >>>> SchemaRDD API at the moment, because it is not well documented. However,
> >>>> to maintain backward compatibility, we can create a type alias for DataFrame
> >>>> that is still named SchemaRDD. This will maintain source compatibility for
> >>>> Scala. That said, we will have to update all existing materials to use
> >>>> DataFrame rather than SchemaRDD.

Re: renaming SchemaRDD -> DataFrame

Posted by "Evan R. Sparks" <ev...@gmail.com>.
I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a "Frame" is.




Re: renaming SchemaRDD -> DataFrame

Posted by vha14 <vh...@msn.com>.
Matei wrote (Jan 26, 2015; 5:31pm): "The intent of Spark SQL though is to be
more than a SQL server -- it's meant to be a library for manipulating
structured data."

I think this is an important but nuanced point. There are engineers who, for
various reasons, associate the term "SQL" with business-analyst,
non-engineering scenarios. If Matei or someone from the Spark team could
clarify this misunderstanding with respect to the potential of Spark SQL, it
would be very useful. (Potential places: Quora, StackOverflow,
the Databricks.com blog.)

Thanks,
Vu Ha 
CTO, Semantic Scholar
http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work





Re: renaming SchemaRDD -> DataFrame

Posted by Matei Zaharia <ma...@gmail.com>.
You're not really supposed to subclass DataFrame; instead, you can make one from an RDD of Rows and a schema (e.g. with SQLContext.applySchema). Actually, the Spark SQL data source API supports that too (org.apache.spark.sql.sources). Think of DataFrame as a container for structured data, not as a class that all data sources will have to implement. If you want to do something fancy like compute the Rows dynamically, your RDD can implement its own compute() method to do that.

Matei
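
A rough sketch of the applySchema route described above (the field names and
sample rows are made up, and the package locations assume the 1.3-era layout
mentioned elsewhere in this thread):

import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ApplySchemaSketch {
  def example(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)

    // A plain RDD of Rows; this could come from any loader (csv, a custom
    // data source, an RDD that computes its Rows dynamically, ...).
    val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))

    // The schema carries the field names and runtime types.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))

    // No subclassing of DataFrame needed: pair the rows with the schema.
    val people = sqlContext.applySchema(rows, schema)
    people.printSchema()
  }
}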

> On Feb 10, 2015, at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> so i understand the success of spark.sql. besides the fact that anything
> with the words SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for a
> generic RDD that holds record-like objects, with field names and runtime
> types. after all that is a successful generic abstraction used in many
> structured data tools.
> 
> but to me that abstraction is as simple as:
> 
> trait SchemaRDD extends RDD[Row] {
>  def schema: StructType
> }
> 
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of columns).
> so that could be DataFrame.
> 
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
> 
> what i do not understand is why it has to be much more complex than this. but
> if i look at DataFrame it has so much additional stuff, that has (in my
> eyes) nothing to do with generic structured data analysis.
> 
> for example to implement DataFrame i need to implement about 40 additional
> methods!? and for some the SQLness is obviously leaking into the
> abstraction. for example why would i care about:
>  def registerTempTable(tableName: String): Unit
> 
> 
> best, koert
> 
> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <ve...@gmail.com> wrote:
> 
>> It is true that you can persist SchemaRdds / DataFrames to disk via
>> Parquet, but a lot of time is lost to inefficiencies.   The in-memory
>> columnar cached representation is completely different from the
>> Parquet file format, and I believe there has to be a translation into
>> a Row (because ultimately Spark SQL traverses Row's -- even the
>> InMemoryColumnarTableScan has to then convert the columns into Rows
>> for row-based processing).   On the other hand, traditional data
>> frames process in a columnar fashion.   Columnar storage is good, but
>> nowhere near as good as columnar processing.
>> 
>> Another issue, which I don't know if it is solved yet, but it is
>> difficult for Tachyon to efficiently cache Parquet files without
>> understanding the file format itself.
>> 
>> I gave a talk at last year's Spark Summit on this topic.
>> 
>> I'm working on efforts to change this, however.  Shoot me an email at
>> velvia at gmail if you're interested in joining forces.
>> 
>> On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <li...@gmail.com> wrote:
>>> Yes, when a DataFrame is cached in memory, it's stored in an efficient
>>> columnar format. And you can also easily persist it on disk using
>> Parquet,
>>> which is also columnar.
>>> 
>>> Cheng
>>> 
>>> 
>>> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>>>> 
>>>> to me the word DataFrame does come with certain expectations. one of them
>>>> is that the data is stored columnar. in R data.frame internally uses a list
>>>> of sequences i think, but since lists can have labels its more like a
>>>> SortedMap[String, Array[_]]. this makes certain operations very cheap (such
>>>> as adding a column).
>>>>
>>>> in Spark the closest thing would be a data structure where per Partition
>>>> the data is also stored columnar. does spark SQL already use something like
>>>> that? Evan mentioned "Spark SQL columnar compression", which sounds like
>>>> it. where can i find that?
>>>>
>>>> thanks
>>>>
>>>> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <ve...@gmail.com> wrote:
>>>>
>>>>> +1.... having proper NA support is much cleaner than using null, at
>>>>> least the Java null.
>>>>>
>>>>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:
>>>>>>
>>>>>> You've got to be a little bit careful here. "NA" in systems like R or pandas
>>>>>> may have special meaning that is distinct from "null".
>>>>>>
>>>>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>>>>>
>>>>>>> Isn't that just "null" in SQL?
>>>>>>>
>>>>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I believe that most DataFrame implementations out there, like Pandas,
>>>>>>>> supports the idea of missing values / NA, and some support the idea of
>>>>>>>> Not Meaningful as well.
>>>>>>>>
>>>>>>>> Does Row support anything like that?  That is important for certain
>>>>>>>> applications.  I thought that Row worked by being a mutable object,
>>>>>>>> but haven't looked into the details in a while.
>>>>>>>>
>>>>>>>> -Evan
>>>>>>>>
>>>>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>> It shouldn't change the data source api at all because data sources create
>>>>>>>>> RDD[Row], and that gets converted into a DataFrame automatically (previously
>>>>>>>>> to SchemaRDD).
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>>>>>>>
>>>>>>>>> One thing that will break the data source API in 1.3 is the location of
>>>>>>>>> types. Types were previously defined in sql.catalyst.types, and now moved to
>>>>>>>>> sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs
>>>>>>>>> have first class classes/objects defined in sql directly.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.github@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hey guys,
>>>>>>>>>>
>>>>>>>>>> How does this impact the data sources API?  I was planning on using
>>>>>>>>>> this for a project.
>>>>>>>>>>
>>>>>>>>>> +1 that many things from spark-sql / DataFrame is universally
>>>>>>>>>> desirable and useful.
>>>>>>>>>>
>>>>>>>>>> By the way, one thing that prevents the columnar compression stuff in
>>>>>>>>>> Spark SQL from being more useful is, at least from previous talks with
>>>>>>>>>> Reynold and Michael et al., that the format was not designed for
>>>>>>>>>> persistence.
>>>>>>>>>>
>>>>>>>>>> I have a new project that aims to change that.  It is a
>>>>>>>>>> zero-serialisation, high performance binary vector library, designed
>>>>>>>>>> from the outset to be a persistent storage friendly.  May be one day
>>>>>>>>>> it can replace the Spark SQL columnar compression.
>>>>>>>>>>
>>>>>>>>>> Michael told me this would be a lot of work, and recreates parts of
>>>>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more details.
>>>>>>>>>>
>>>>>>>>>> -Evan
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Alright I have merged the patch (
>>>>>>>>>>> https://github.com/apache/spark/pull/4173
>>>>>>>>>>> ) since I don't see any strong opinions against it (as a matter of fact
>>>>>>>>>>> most were for it). We can still change it if somebody lays out a strong
>>>>>>>>>>> argument.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <ma...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The type alias means your methods can specify either type and they will
>>>>>>>>>>>> work. It's just another name for the same type. But Scaladocs and such will
>>>>>>>>>>>> show DataFrame as the type.
>>>>>>>>>>>>
>>>>>>>>>>>> Matei
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semighini@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reynold,
>>>>>>>>>>>>> But with type alias we will have the same problem, right?
>>>>>>>>>>>>> If the methods doesn't receive schemardd anymore, we will have to change
>>>>>>>>>>>>> our code to migrade from schema to dataframe. Unless we have an implicit
>>>>>>>>>>>>> conversion between DataFrame and SchemaRDD
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dirceu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That is not possible because one cannot overload return types.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to return some type,
>>>>>>>>>>>>>> and that type cannot be both SchemaRDD and DataFrame.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
>>>>>>>>>>>>>> break source compatibility for Scala.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semighini@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>>>>>>>>>>>>>>> the release 1.5(+/- 1)  for example, and the new code been added to
>>>>>>>>>>>>>>> DataFrame?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this, we don't impact in existing code for the next few releases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I want to address the issue that Matei raised about the heavy lifting
>>>>>>>>>>>>>>>> required for a full SQL support. It is amazing that even after 30 years of
>>>>>>>>>>>>>>>> research there is not a single good open source columnar database like
>>>>>>>>>>>>>>>> Vertica. There is a column store option in MySQL, but it is not nearly as
>>>>>>>>>>>>>>>> sophisticated as Vertica or MonetDB. But there's a true need for such a
>>>>>>>>>>>>>>>> system. I wonder why so and it's high time to change that.
>>>>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the former
>>>>>>>>>>>>>>>>> slightly better because it's more descriptive.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it would
>>>>>>>>>>>>>>>>> be more clear from a user-facing perspective to at least choose a package
>>>>>>>>>>>>>>>>> name for it that omits "sql".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark Schema module for Spark
>>>>>>>>>>>>>>>>> SQL to rely on, but I imagine that might be too large a change at this
>>>>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Sandy

Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
It's a good point. I will update the documentation to say that this is not
meant to be subclassed externally.


On Tue, Feb 10, 2015 at 12:10 PM, Koert Kuipers <ko...@tresata.com> wrote:

> thanks matei its good to know i can create them like that
>
> reynold, yeah somehow the words sql gets me going :) sorry...
> yeah agreed that you need new transformations to preserve the schema info.
> i misunderstood and thought i had to implement the bunch but that is
> clearly not necessary as matei indicated.
>
> allright i am clearly being slow/dense here, but now it makes sense to
> me....
>
>
>
>
>
>
> On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> > Koert,
> >
> > Don't get too hung up on the name SQL. This is exactly what you want: a
> > collection with record-like objects with field names and runtime types.
> >
> > Almost all of the 40 methods are transformations for structured data, such
> > as aggregation on a field, or filtering on a field. If all you have is the
> > old RDD style map/flatMap, then any transformation would lose the schema
> > information, making the extra schema information useless.
> >
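
A small sketch of that contrast, with made-up column names and assuming the
1.3-style DataFrame API:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

object SchemaPreservingSketch {
  def example(df: DataFrame): Unit = {
    // Structured transformations keep the schema attached to their result.
    val adults = df.filter(df("age") > 21)          // still a DataFrame
    val byCity = adults.groupBy("city").count()     // aggregation on a field
    byCity.show()

    // Dropping to the plain RDD API loses the field names and types.
    val ages: RDD[Int] = df.rdd.map(_.getInt(0))    // positional access only
    println(ages.take(5).mkString(", "))
  }
}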
> > On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com>
> wrote:
> >
> >> so i understand the success or spark.sql. besides the fact that anything
> >> with the words SQL in its name will have thousands of developers running
> >> towards it because of the familiarity, there is also a genuine need for
> a
> >> generic RDD that holds record-like objects, with field names and runtime
> >> types. after all that is a successfull generic abstraction used in many
> >> structured data tools.
> >>
> >> but to me that abstraction is as simple as:
> >>
> >> trait SchemaRDD extends RDD[Row] {
> >>   def schema: StructType
> >> }
> >>
> >> and perhaps another abstraction to indicate it intends to be column
> >> oriented (with a few methods to efficiently extract a subset of
> columns).
> >> so that could be DataFrame.
> >>
> >> such simple contracts would allow many people to write loaders for this
> >> (say from csv) and whatnot.
> >>
> >> what i do not understand why it has to be much more complex than this.
> but
> >> if i look at DataFrame it has so much additional stuff, that has (in my
> >> eyes) nothing to do with generic structured data analysis.
> >>
> >> for example to implement DataFrame i need to implement about 40
> additional
> >> methods!? and for some the SQLness is obviously leaking into the
> >> abstraction. for example why would i care about:
> >>   def registerTempTable(tableName: String): Unit
> >>
> >>
> >> best, koert
> >>
> >> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <ve...@gmail.com>
> >> wrote:
> >>
> >> > It is true that you can persist SchemaRdds / DataFrames to disk via
> >> > Parquet, but a lot of time and inefficiencies is lost.   The in-memory
> >> > columnar cached representation is completely different from the
> >> > Parquet file format, and I believe there has to be a translation into
> >> > a Row (because ultimately Spark SQL traverses Row's -- even the
> >> > InMemoryColumnarTableScan has to then convert the columns into Rows
> >> > for row-based processing).   On the other hand, traditional data
> >> > frames process in a columnar fashion.   Columnar storage is good, but
> >> > nowhere near as good as columnar processing.
> >> >
> >> > Another issue, which I don't know if it is solved yet, but it is
> >> > difficult for Tachyon to efficiently cache Parquet files without
> >> > understanding the file format itself.
> >> >
> >> > I gave a talk at last year's Spark Summit on this topic.
> >> >
> >> > I'm working on efforts to change this, however.  Shoot me an email at
> >> > velvia at gmail if you're interested in joining forces.
> >> >
> >> > On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <li...@gmail.com>
> >> wrote:
> >> > > Yes, when a DataFrame is cached in memory, it's stored in an
> efficient
> >> > > columnar format. And you can also easily persist it on disk using
> >> > Parquet,
> >> > > which is also columnar.
> >> > >
> >> > > Cheng
> >> > >
> >> > >
> >> > > On 1/29/15 1:24 PM, Koert Kuipers wrote:
> >> > >>
> >> > >> to me the word DataFrame does come with certain expectations. one
> of
> >> > them
> >> > >> is that the data is stored columnar. in R data.frame internally
> uses
> >> a
> >> > >> list
> >> > >> of sequences i think, but since lists can have labels its more
> like a
> >> > >> SortedMap[String, Array[_]]. this makes certain operations very
> cheap
> >> > >> (such
> >> > >> as adding a column).
> >> > >>
> >> > >> in Spark the closest thing would be a data structure where per
> >> Partition
> >> > >> the data is also stored columnar. does spark SQL already use
> >> something
> >> > >> like
> >> > >> that? Evan mentioned "Spark SQL columnar compression", which sounds
> >> like
> >> > >> it. where can i find that?
> >> > >>
> >> > >> thanks
> >> > >>
> >> > >> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <
> velvia.github@gmail.com>
> >> > >> wrote:
> >> > >>
> >> > >>> +1.... having proper NA support is much cleaner than using null,
> at
> >> > >>> least the Java null.
> >> > >>>
> >> > >>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <
> >> evan.sparks@gmail.com
> >> > >
> >> > >>> wrote:
> >> > >>>>
> >> > >>>> You've got to be a little bit careful here. "NA" in systems like
> R
> >> or
> >> > >>>
> >> > >>> pandas
> >> > >>>>
> >> > >>>> may have special meaning that is distinct from "null".
> >> > >>>>
> >> > >>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
> >> > >>>>
> >> > >>>>
> >> > >>>>
> >> > >>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <
> rxin@databricks.com>
> >> > >>>
> >> > >>> wrote:
> >> > >>>>>
> >> > >>>>> Isn't that just "null" in SQL?
> >> > >>>>>
> >> > >>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <
> >> velvia.github@gmail.com>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>>> I believe that most DataFrame implementations out there, like
> >> > Pandas,
> >> > >>>>>> supports the idea of missing values / NA, and some support the
> >> idea
> >> > of
> >> > >>>>>> Not Meaningful as well.
> >> > >>>>>>
> >> > >>>>>> Does Row support anything like that?  That is important for
> >> certain
> >> > >>>>>> applications.  I thought that Row worked by being a mutable
> >> object,
> >> > >>>>>> but haven't looked into the details in a while.
> >> > >>>>>>
> >> > >>>>>> -Evan
> >> > >>>>>>
> >> > >>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <
> >> rxin@databricks.com>
> >> > >>>>>> wrote:
> >> > >>>>>>>
> >> > >>>>>>> It shouldn't change the data source api at all because data
> >> sources
> >> > >>>>>>
> >> > >>>>>> create
> >> > >>>>>>>
> >> > >>>>>>> RDD[Row], and that gets converted into a DataFrame
> automatically
> >> > >>>>>>
> >> > >>>>>> (previously
> >> > >>>>>>>
> >> > >>>>>>> to SchemaRDD).
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>
> >> > >>>
> >> > >>>
> >> >
> >>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >> > >>>>>>>
> >> > >>>>>>> One thing that will break the data source API in 1.3 is the
> >> > location
> >> > >>>>>>> of
> >> > >>>>>>> types. Types were previously defined in sql.catalyst.types,
> and
> >> now
> >> > >>>>>>
> >> > >>>>>> moved to
> >> > >>>>>>>
> >> > >>>>>>> sql.types. After 1.3, sql.catalyst is hidden from users, and
> all
> >> > >>>>>>> public
> >> > >>>>>>
> >> > >>>>>> APIs
> >> > >>>>>>>
> >> > >>>>>>> have first class classes/objects defined in sql directly.
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <
> >> > velvia.github@gmail.com
> >> > >>>>>>
> >> > >>>>>> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>> Hey guys,
> >> > >>>>>>>>
> >> > >>>>>>>> How does this impact the data sources API?  I was planning on
> >> > using
> >> > >>>>>>>> this for a project.
> >> > >>>>>>>>
> >> > >>>>>>>> +1 that many things from spark-sql / DataFrame is universally
> >> > >>>>>>>> desirable and useful.
> >> > >>>>>>>>
> >> > >>>>>>>> By the way, one thing that prevents the columnar compression
> >> stuff
> >> > >>>
> >> > >>> in
> >> > >>>>>>>>
> >> > >>>>>>>> Spark SQL from being more useful is, at least from previous
> >> talks
> >> > >>>>>>>> with
> >> > >>>>>>>> Reynold and Michael et al., that the format was not designed
> >> for
> >> > >>>>>>>> persistence.
> >> > >>>>>>>>
> >> > >>>>>>>> I have a new project that aims to change that.  It is a
> >> > >>>>>>>> zero-serialisation, high performance binary vector library,
> >> > >>>
> >> > >>> designed
> >> > >>>>>>>>
> >> > >>>>>>>> from the outset to be a persistent storage friendly.  May be
> >> one
> >> > >>>
> >> > >>> day
> >> > >>>>>>>>
> >> > >>>>>>>> it can replace the Spark SQL columnar compression.
> >> > >>>>>>>>
> >> > >>>>>>>> Michael told me this would be a lot of work, and recreates
> >> parts
> >> > of
> >> > >>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more
> >> > >>>
> >> > >>> details.
> >> > >>>>>>>>
> >> > >>>>>>>> -Evan
> >> > >>>>>>>>
> >> > >>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <
> >> rxin@databricks.com
> >> > >
> >> > >>>>>>
> >> > >>>>>> wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>> Alright I have merged the patch (
> >> > >>>>>>>>> https://github.com/apache/spark/pull/4173
> >> > >>>>>>>>> ) since I don't see any strong opinions against it (as a
> >> matter
> >> > >>>
> >> > >>> of
> >> > >>>>>>
> >> > >>>>>> fact
> >> > >>>>>>>>>
> >> > >>>>>>>>> most were for it). We can still change it if somebody lays
> >> out a
> >> > >>>>>>
> >> > >>>>>> strong
> >> > >>>>>>>>>
> >> > >>>>>>>>> argument.
> >> > >>>>>>>>>
> >> > >>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> > >>>>>>>>> <ma...@gmail.com>
> >> > >>>>>>>>> wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> The type alias means your methods can specify either type
> and
> >> > >>>
> >> > >>> they
> >> > >>>>>>
> >> > >>>>>> will
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> work. It's just another name for the same type. But
> Scaladocs
> >> > >>>
> >> > >>> and
> >> > >>>>>>
> >> > >>>>>> such
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> will
> >> > >>>>>>>>>> show DataFrame as the type.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Matei
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> dirceu.semighini@gmail.com> wrote:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Reynold,
> >> > >>>>>>>>>>> But with type alias we will have the same problem, right?
> >> > >>>>>>>>>>> If the methods doesn't receive schemardd anymore, we will
> >> have
> >> > >>>>>>>>>>> to
> >> > >>>>>>>>>>> change
> >> > >>>>>>>>>>> our code to migrade from schema to dataframe. Unless we
> have
> >> > >>>
> >> > >>> an
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> implicit
> >> > >>>>>>>>>>> conversion between DataFrame and SchemaRDD
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <
> rxin@databricks.com
> >> >:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> Dirceu,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> That is not possible because one cannot overload return
> >> > >>>
> >> > >>> types.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to
> >> > >>>
> >> > >>> return
> >> > >>>>>>
> >> > >>>>>> some
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> type,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> and that type cannot be both SchemaRDD and DataFrame.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called
> >> > >>>>>>>>>>>> SchemaRDD
> >> > >>>>>>>>>>>> to
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> not
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> break source compatibility for Scala.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> > >>>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and
> >> be
> >> > >>>>>>
> >> > >>>>>> removed
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> in
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> the
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> release 1.5(+/- 1)  for example, and the new code been
> >> added
> >> > >>>>>>>>>>>>> to
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> DataFrame?
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> With this, we don't impact in existing code for the next
> >> few
> >> > >>>>>>>>>>>>> releases.
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
> >> > >>>>>>>>>>>>> <ku...@gmail.com>:
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> I want to address the issue that Matei raised about the
> >> > >>>
> >> > >>> heavy
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> lifting
> >> > >>>>>>>>>>>>>> required for a full SQL support. It is amazing that
> even
> >> > >>>>>>>>>>>>>> after
> >> > >>>>>>
> >> > >>>>>> 30
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> years
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> of
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> research there is not a single good open source
> columnar
> >> > >>>>>>
> >> > >>>>>> database
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> like
> >> > >>>>>>>>>>>>>> Vertica. There is a column store option in MySQL, but
> it
> >> is
> >> > >>>>>>>>>>>>>> not
> >> > >>>>>>>>>>>>>> nearly
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> as
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> sophisticated as Vertica or MonetDB. But there's a true
> >> > >>>
> >> > >>> need
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> for
> >> > >>>>>>>>>>>>>> such
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> a
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> system. I wonder why so and it's high time to change
> >> that.
> >> > >>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza"
> >> > >>>>>>>>>>>>>> <sa...@cloudera.com>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> wrote:
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though
> I
> >> > >>>
> >> > >>> like
> >> > >>>>>>
> >> > >>>>>> the
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> former
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> slightly better because it's more descriptive.
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> Even if SchemaRDD's needs to rely on Spark SQL under
> the
> >> > >>>>>>
> >> > >>>>>> covers,
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> it
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> would
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> be more clear from a user-facing perspective to at
> least
> >> > >>>>>>
> >> > >>>>>> choose a
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> package
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> name for it that omits "sql".
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark
> >> Schema
> >> > >>>>>>
> >> > >>>>>> module
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> for
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> Spark
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> SQL to rely on, but I imagine that might be too large
> a
> >> > >>>>>>>>>>>>>>> change
> >> > >>>>>>
> >> > >>>>>> at
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> this
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> point?
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> -Sandy

Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
thanks matei, it's good to know i can create them like that

reynold, yeah somehow the word sql gets me going :) sorry...
yeah, agreed that you need new transformations to preserve the schema info.
i misunderstood and thought i had to implement the whole bunch myself, but
that is clearly not necessary, as matei indicated.

alright, i am clearly being slow/dense here, but now it makes sense to me.
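
For concreteness, here is a minimal sketch of that construction, assuming the
Spark 1.3-style SQLContext.createDataFrame API; the column names and sample
rows are invented purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("df-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // a plain RDD of Rows, e.g. produced by some custom loader
    val rows = sc.parallelize(Seq(Row("alice", 1), Row("bob", 2)))

    // the schema carrying the field names and runtime types of those Rows
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("visits", IntegerType, nullable = false)))

    // pairing the two yields a DataFrame (the old SchemaRDD)
    val df = sqlContext.createDataFrame(rows, schema)
    df.printSchema()

    sc.stop()
  }
}

Nothing beyond the Rows and the StructType has to be written by hand; the
schema travels with the data from the moment the two are paired up.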






On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin <rx...@databricks.com> wrote:

> Koert,
>
> Don't get too hung up on the name SQL. This is exactly what you want: a
> collection of record-like objects with field names and runtime types.
>
> Almost all of the 40 methods are transformations for structured data, such
> as aggregation on a field, or filtering on a field. If all you have is the
> old RDD style map/flatMap, then any transformation would lose the schema
> information, making the extra schema information useless.
>
>
>
>
> On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> so i understand the success of spark.sql. besides the fact that anything
>> with the words SQL in its name will have thousands of developers running
>> towards it because of the familiarity, there is also a genuine need for a
>> generic RDD that holds record-like objects, with field names and runtime
>> types. after all that is a successful generic abstraction used in many
>> structured data tools.
>>
>> but to me that abstraction is as simple as:
>>
>> trait SchemaRDD extends RDD[Row] {
>>   def schema: StructType
>> }
>>
>> and perhaps another abstraction to indicate it intends to be column
>> oriented (with a few methods to efficiently extract a subset of columns).
>> so that could be DataFrame.
>>
>> such simple contracts would allow many people to write loaders for this
>> (say from csv) and whatnot.
>>
>> what i do not understand is why it has to be much more complex than this. but
>> if i look at DataFrame it has so much additional stuff, that has (in my
>> eyes) nothing to do with generic structured data analysis.
>>
>> for example to implement DataFrame i need to implement about 40 additional
>> methods!? and for some the SQLness is obviously leaking into the
>> abstraction. for example why would i care about:
>>   def registerTempTable(tableName: String): Unit
>>
>>
>> best, koert
>>

Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
Koert,

Don't get too hung up on the name SQL. This is exactly what you want: a
collection of record-like objects with field names and runtime types.

Almost all of the 40 methods are transformations for structured data, such
as aggregation on a field, or filtering on a field. If all you have is the
old RDD style map/flatMap, then any transformation would lose the schema
information, making the extra schema information useless.
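
To make that concrete, a small sketch assuming the 1.3-style DataFrame API
and a hypothetical people DataFrame with name, dept and age columns:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

// hypothetical input: a DataFrame with columns (name: String, dept: String, age: Int)
def schemaAwareVsPlainRdd(people: DataFrame): Unit = {
  // field-level transformations keep the schema attached to the result,
  // so column names and types are still known downstream
  val byDept = people
    .filter(people("age") > 21)
    .groupBy("dept")
    .agg(avg(people("age")))
  byDept.printSchema()

  // dropping to the plain RDD API throws that information away: the result
  // is just an RDD of values with no field names or types attached
  val names = people.rdd.map(_.getString(0))
  names.take(2).foreach(println)
}

The grouped and filtered result still reports its column names and types,
while the map over the underlying RDD hands back plain values with no schema
attached.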




On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:

> so i understand the success of spark.sql. besides the fact that anything
> with the words SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for a
> generic RDD that holds record-like objects, with field names and runtime
> types. after all that is a successful generic abstraction used in many
> structured data tools.
>
> but to me that abstraction is as simple as:
>
> trait SchemaRDD extends RDD[Row] {
>   def schema: StructType
> }
>
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of columns).
> so that could be DataFrame.
>
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
>
> what i do not understand is why it has to be much more complex than this. but
> if i look at DataFrame it has so much additional stuff, that has (in my
> eyes) nothing to do with generic structured data analysis.
>
> for example to implement DataFrame i need to implement about 40 additional
> methods!? and for some the SQLness is obviously leaking into the
> abstraction. for example why would i care about:
>   def registerTempTable(tableName: String): Unit
>
>
> best, koert
>
> > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > >>>
> > >>>
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > For additional commands, e-mail: dev-help@spark.apache.org
> > >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
so i understand the success of spark.sql. besides the fact that anything
with the words SQL in its name will have thousands of developers running
towards it because of the familiarity, there is also a genuine need for a
generic RDD that holds record-like objects, with field names and runtime
types. after all, that is a successful generic abstraction used in many
structured data tools.

but to me that abstraction is as simple as:

trait SchemaRDD extends RDD[Row] {
  def schema: StructType
}

and perhaps another abstraction to indicate that it is meant to be
column-oriented (with a few methods to efficiently extract a subset of
columns). so that could be DataFrame.
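
just to illustrate, a rough sketch of that second contract, building on the
trait above (nothing that exists in spark -- the name and the method are
made up):

trait DataFrame extends SchemaRDD {
  // hand back only the named columns, ideally without materialising the
  // others; hypothetical signature, purely to show the shape of the contract
  def selectColumns(names: String*): DataFrame
}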

such simple contracts would allow many people to write loaders for this
(say from csv) and whatnot.
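
for example (again just a sketch: everything parsed as strings, and assuming
the 1.3 package layout where Row and the type classes live under
org.apache.spark.sql):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical csv loader targeting the minimal contract: all it needs to
// produce is rows plus a schema, nothing sql-specific
def loadCsv(sc: SparkContext, path: String, fieldNames: Seq[String]): (RDD[Row], StructType) = {
  val schema = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))
  val rows = sc.textFile(path).map(line => Row(line.split(",", -1): _*))
  (rows, schema)
}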

what i do not understand is why it has to be much more complex than this.
but if i look at DataFrame it has so much additional stuff that has (in my
eyes) nothing to do with generic structured data analysis.

for example, to implement DataFrame i need to implement about 40 additional
methods!? and for some of them the SQL-ness is obviously leaking into the
abstraction. for example, why would i care about:
  def registerTempTable(tableName: String): Unit


best, koert

On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <ve...@gmail.com> wrote:

> It is true that you can persist SchemaRDDs / DataFrames to disk via
> Parquet, but a lot of time and efficiency is lost.   The in-memory
> columnar cached representation is completely different from the
> Parquet file format, and I believe there has to be a translation into
> a Row (because ultimately Spark SQL traverses Rows -- even the
> InMemoryColumnarTableScan has to then convert the columns into Rows
> for row-based processing).   On the other hand, traditional data
> frames process in a columnar fashion.   Columnar storage is good, but
> nowhere near as good as columnar processing.
>
> Another issue, which I don't know whether it has been solved yet: it is
> difficult for Tachyon to efficiently cache Parquet files without
> understanding the file format itself.
>
> I gave a talk at last year's Spark Summit on this topic.
>
> I'm working on efforts to change this, however.  Shoot me an email at
> velvia at gmail if you're interested in joining forces.
>
> [older quoted messages and mailing list footers trimmed]

Re: renaming SchemaRDD -> DataFrame

Posted by Evan Chan <ve...@gmail.com>.
It is true that you can persist SchemaRDDs / DataFrames to disk via
Parquet, but a lot of time and efficiency is lost.   The in-memory
columnar cached representation is completely different from the
Parquet file format, and I believe there has to be a translation into
a Row (because ultimately Spark SQL traverses Rows -- even the
InMemoryColumnarTableScan has to then convert the columns into Rows
for row-based processing).   On the other hand, traditional data
frames process in a columnar fashion.   Columnar storage is good, but
nowhere near as good as columnar processing.
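
To make the storage-vs-processing distinction concrete, here is a toy
sketch in plain Scala (nothing to do with actual Spark internals -- the
data and layout are made up):

// the same two-column data laid out two ways
val rows: Seq[Array[Any]] = Seq(Array("a", 1.0), Array("b", 2.0), Array("c", 3.0))
// row-at-a-time: touch and unbox every row object just to sum one field
val rowWiseSum = rows.map(_(1).asInstanceOf[Double]).sum

// column-at-a-time: the same field stored as a single primitive array
val priceColumn: Array[Double] = Array(1.0, 2.0, 3.0)
val columnWiseSum = priceColumn.sum   // tight loop over primitives, no per-row objects

Columnar storage gets you the second layout on disk or in memory; columnar
processing keeps you in that second loop instead of rebuilding Row objects.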

Another issue, which I don't know whether it has been solved yet: it is
difficult for Tachyon to efficiently cache Parquet files without
understanding the file format itself.

I gave a talk at last year's Spark Summit on this topic.

I'm working on efforts to change this, however.  Shoot me an email at
velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <li...@gmail.com> wrote:
> Yes, when a DataFrame is cached in memory, it's stored in an efficient
> columnar format. And you can also easily persist it on disk using Parquet,
> which is also columnar.
>
> Cheng
>
>
> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>>
>> to me the word DataFrame does come with certain expectations. one of them
>> is that the data is stored columnar. in R data.frame internally uses a
>> list of sequences i think, but since lists can have labels it's more like
>> a SortedMap[String, Array[_]]. this makes certain operations very cheap
>> (such as adding a column).
>>
>> in Spark the closest thing would be a data structure where per Partition
>> the data is also stored columnar. does spark SQL already use something
>> like that? Evan mentioned "Spark SQL columnar compression", which sounds
>> like it. where can i find that?
>>
>> thanks
>>
>> [older quoted messages and mailing list footers trimmed]


Re: renaming SchemaRDD -> DataFrame

Posted by Cheng Lian <li...@gmail.com>.
Forgot to mention that you can find it here 
<https://github.com/apache/spark/blob/f9e569452e2f0ae69037644170d8aa79ac6b4ccf/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala>.
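
For reference, exercising the in-memory columnar cache and the on-disk
Parquet path mentioned below looks roughly like this with the 1.3-style
API -- typed from memory as a sketch rather than gospel, and the paths
are made up:

// in spark-shell, sc and sqlContext are already provided
val df = sqlContext.parquetFile("/tmp/events.parquet")   // columnar on disk
df.cache()                                               // marks it for the in-memory columnar cache
df.count()                                               // materializes the cached columnar batches
df.saveAsParquetFile("/tmp/events-copy.parquet")         // writes it back out as Parquet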

On 1/29/15 1:59 PM, Cheng Lian wrote:

> Yes, when a DataFrame is cached in memory, it's stored in an efficient 
> columnar format. And you can also easily persist it on disk using 
> Parquet, which is also columnar.
>
> Cheng
>
> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>> to me the word DataFrame does come with certain expectations. one of them
>> is that the data is stored columnar. in R data.frame internally uses a
>> list of sequences i think, but since lists can have labels it's more like
>> a SortedMap[String, Array[_]]. this makes certain operations very cheap
>> (such as adding a column).
>>
>> in Spark the closest thing would be a data structure where per Partition
>> the data is also stored columnar. does spark SQL already use something
>> like that? Evan mentioned "Spark SQL columnar compression", which sounds
>> like it. where can i find that?
>>
>> thanks
>>
>> [older quoted messages trimmed]
>>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
>>>>>>>>>>>>>>>>>>>> RDD
>>>>>>>>>>>>> anymore
>>>>>>>>>>>>>>>> (even
>>>>>>>>>>>>>>>>>>>> though it would contain some RDD functions like map,
>>>>>>>>>>>>>>>>>>>> flatMap,
>>>>>>>>>>>>>> etc),
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> calling it Schema*RDD* while it is not an RDD is
>>> highly
>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>>>>>>> Instead.
>>>>>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
>>>>>>>>>>>>>>>>>>>> RDD
>>>>>>>>>>>>> methods.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My understanding is that very few users program
>>>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>> against
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> SchemaRDD API at the moment, because they are not
>>> well
>>>>>>>>>>>>> documented.
>>>>>>>>>>>>>>>>>>> However,
>>>>>>>>>>>>>>>>>>>> oo maintain backward compatibility, we can create a
>>>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>>>> alias
>>>>>>>>>>>>>>>> DataFrame
>>>>>>>>>>>>>>>>>>>> that is still named SchemaRDD. This will maintain
>>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> Scala. That said, we will have to update all existing
>>>>>>>>>>>>> materials to
>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>> DataFrame rather than SchemaRDD.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> --------------------------------------------------------------------- 
>>>>>>
>>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>>>>>>>>>> dev-help@spark.apache.org
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> --------------------------------------------------------------------- 
>>>>>>
>>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>>>>>>>>>> dev-help@spark.apache.org
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>> --------------------------------------------------------------------- 
>>>>>>
>>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>> dev-help@spark.apache.org
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Cheng Lian <li...@gmail.com>.
Yes, when a DataFrame is cached in memory, it's stored in an efficient 
columnar format. And you can also easily persist it on disk using 
Parquet, which is also columnar.
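
For example, a minimal sketch (assuming a 1.3-style SQLContext/DataFrame
API and a hypothetical "people.parquet" input; exact method names may
differ slightly between releases):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)                // sc: an existing SparkContext
    val df = sqlContext.parquetFile("people.parquet")  // load a DataFrame

    df.cache()    // cached data is held in the in-memory columnar format
    df.count()    // the first action materializes the cache

    df.saveAsParquetFile("people_copy.parquet")        // columnar copy on disk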

Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:
> to me the word DataFrame does come with certain expectations. one of them
> is that the data is stored columnar. in R data.frame internally uses a list
> of sequences i think, but since lists can have labels it's more like a
> SortedMap[String, Array[_]]. this makes certain operations very cheap (such
> as adding a column).
>
> in Spark the closest thing would be a data structure where per Partition
> the data is also stored columnar. does spark SQL already use something like
> that? Evan mentioned "Spark SQL columnar compression", which sounds like
> it. where can i find that?
>
> thanks
>
> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <ve...@gmail.com> wrote:
>
>> +1.... having proper NA support is much cleaner than using null, at
>> least the Java null.
>>
>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <ev...@gmail.com>
>> wrote:
>>> You've got to be a little bit careful here. "NA" in systems like R or
>> pandas
>>> may have special meaning that is distinct from "null".
>>>
>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>>>
>>>
>>>
>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>>> Isn't that just "null" in SQL?
>>>>
>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com>
>>>> wrote:
>>>>
>>>>> I believe that most DataFrame implementations out there, like Pandas,
>>>>> supports the idea of missing values / NA, and some support the idea of
>>>>> Not Meaningful as well.
>>>>>
>>>>> Does Row support anything like that?  That is important for certain
>>>>> applications.  I thought that Row worked by being a mutable object,
>>>>> but haven't looked into the details in a while.
>>>>>
>>>>> -Evan
>>>>>
>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>> It shouldn't change the data source api at all because data sources
>>>>> create
>>>>>> RDD[Row], and that gets converted into a DataFrame automatically
>>>>> (previously
>>>>>> to SchemaRDD).
>>>>>>
>>>>>>
>>>>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>>>> One thing that will break the data source API in 1.3 is the location
>>>>>> of
>>>>>> types. Types were previously defined in sql.catalyst.types, and now
>>>>> moved to
>>>>>> sql.types. After 1.3, sql.catalyst is hidden from users, and all
>>>>>> public
>>>>> APIs
>>>>>> have first class classes/objects defined in sql directly.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.github@gmail.com
>>>>> wrote:
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> How does this impact the data sources API?  I was planning on using
>>>>>>> this for a project.
>>>>>>>
>>>>>>> +1 that many things from spark-sql / DataFrame is universally
>>>>>>> desirable and useful.
>>>>>>>
>>>>>>> By the way, one thing that prevents the columnar compression stuff
>> in
>>>>>>> Spark SQL from being more useful is, at least from previous talks
>>>>>>> with
>>>>>>> Reynold and Michael et al., that the format was not designed for
>>>>>>> persistence.
>>>>>>>
>>>>>>> I have a new project that aims to change that.  It is a
>>>>>>> zero-serialisation, high performance binary vector library,
>> designed
>>>>>>> from the outset to be a persistent storage friendly.  May be one
>> day
>>>>>>> it can replace the Spark SQL columnar compression.
>>>>>>>
>>>>>>> Michael told me this would be a lot of work, and recreates parts of
>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more
>> details.
>>>>>>> -Evan
>>>>>>>
>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>>>> Alright I have merged the patch (
>>>>>>>> https://github.com/apache/spark/pull/4173
>>>>>>>> ) since I don't see any strong opinions against it (as a matter
>> of
>>>>> fact
>>>>>>>> most were for it). We can still change it if somebody lays out a
>>>>> strong
>>>>>>>> argument.
>>>>>>>>
>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>>>>>>>> <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The type alias means your methods can specify either type and
>> they
>>>>> will
>>>>>>>>> work. It's just another name for the same type. But Scaladocs
>> and
>>>>> such
>>>>>>>>> will
>>>>>>>>> show DataFrame as the type.
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>>>>>>>>> dirceu.semighini@gmail.com> wrote:
>>>>>>>>>> Reynold,
>>>>>>>>>> But with type alias we will have the same problem, right?
>>>>>>>>>> If the methods doesn't receive schemardd anymore, we will have
>>>>>>>>>> to
>>>>>>>>>> change
>>>>>>>>>> our code to migrade from schema to dataframe. Unless we have
>> an
>>>>>>>>>> implicit
>>>>>>>>>> conversion between DataFrame and SchemaRDD
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
>>>>>>>>>>
>>>>>>>>>>> Dirceu,
>>>>>>>>>>>
>>>>>>>>>>> That is not possible because one cannot overload return
>> types.
>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to
>> return
>>>>> some
>>>>>>>>> type,
>>>>>>>>>>> and that type cannot be both SchemaRDD and DataFrame.
>>>>>>>>>>>
>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called
>>>>>>>>>>> SchemaRDD
>>>>>>>>>>> to
>>>>>>>>> not
>>>>>>>>>>> break source compatibility for Scala.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and be
>>>>> removed
>>>>>>>>>>>> in
>>>>>>>>> the
>>>>>>>>>>>> release 1.5(+/- 1)  for example, and the new code been added
>>>>>>>>>>>> to
>>>>>>>>> DataFrame?
>>>>>>>>>>>> With this, we don't impact in existing code for the next few
>>>>>>>>>>>> releases.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
>>>>>>>>>>>> <ku...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I want to address the issue that Matei raised about the
>> heavy
>>>>>>>>>>>>> lifting
>>>>>>>>>>>>> required for a full SQL support. It is amazing that even
>>>>>>>>>>>>> after
>>>>> 30
>>>>>>>>> years
>>>>>>>>>>>> of
>>>>>>>>>>>>> research there is not a single good open source columnar
>>>>> database
>>>>>>>>>>>>> like
>>>>>>>>>>>>> Vertica. There is a column store option in MySQL, but it is
>>>>>>>>>>>>> not
>>>>>>>>>>>>> nearly
>>>>>>>>>>>> as
>>>>>>>>>>>>> sophisticated as Vertica or MonetDB. But there's a true
>> need
>>>>>>>>>>>>> for
>>>>>>>>>>>>> such
>>>>>>>>> a
>>>>>>>>>>>>> system. I wonder why so and it's high time to change that.
>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza"
>>>>>>>>>>>>> <sa...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I
>> like
>>>>> the
>>>>>>>>>>>> former
>>>>>>>>>>>>>> slightly better because it's more descriptive.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
>>>>> covers,
>>>>>>>>>>>>>> it
>>>>>>>>>>>> would
>>>>>>>>>>>>>> be more clear from a user-facing perspective to at least
>>>>> choose a
>>>>>>>>>>>> package
>>>>>>>>>>>>>> name for it that omits "sql".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark Schema
>>>>> module
>>>>>>>>>>>>>> for
>>>>>>>>>>>>> Spark
>>>>>>>>>>>>>> SQL to rely on, but I imagine that might be too large a
>>>>>>>>>>>>>> change
>>>>> at
>>>>>>>>> this
>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Sandy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>>>>>>>>>>>> matei.zaharia@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (Actually when we designed Spark SQL we thought of giving
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> another
>>>>>>>>>>>>>> name,
>>>>>>>>>>>>>>> like Spark Schema, but we decided to stick with SQL since
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> was
>>>>>>>>>>>> the
>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>> obvious use case to many users.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>>>>>>>>>>>> matei.zaharia@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> While it might be possible to move this concept to Spark
>>>>>>>>>>>>>>>> Core
>>>>>>>>>>>>>> long-term,
>>>>>>>>>>>>>>> supporting structured data efficiently does require
>> quite a
>>>>> bit
>>>>>>>>>>>>>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> infrastructure in Spark SQL, such as query planning and
>>>>> columnar
>>>>>>>>>>>>> storage.
>>>>>>>>>>>>>>> The intent of Spark SQL though is to be more than a SQL
>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>> --
>>>>>>>>>>>> it's
>>>>>>>>>>>>>>> meant to be a library for manipulating structured data.
>>>>>>>>>>>>>>> Since
>>>>>>>>>>>>>>> this
>>>>>>>>>>>> is
>>>>>>>>>>>>>>> possible to build over the core API, it's pretty natural
>> to
>>>>>>>>>>>> organize it
>>>>>>>>>>>>>>> that way, same as Spark Streaming is a library.
>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
>>>>> koert@tresata.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common
>> data
>>>>>>>>>>>>>>>>> format
>>>>>>>>>>>> used
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> bringing data into Spark from external systems, and
>> used
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>> various
>>>>>>>>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> i agree. this to me also implies it belongs in spark
>>>>>>>>>>>>>>>>> core,
>>>>> not
>>>>>>>>>>>> sql
>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>>>>>>>>>>>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And in the off chance that anyone hasn't seen it yet,
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> Jan.
>>>>>>>>>>>> 13
>>>>>>>>>>>>> Bay
>>>>>>>>>>>>>>> Area
>>>>>>>>>>>>>>>>>> Spark Meetup YouTube contained a wealth of background
>>>>>>>>>>>> information
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> idea (mostly from Patrick and Reynold :-).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
>>>>>>>>>>>>>>>>>> To: Reynold Xin <rx...@databricks.com>
>>>>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail,
>> there
>>>>> will
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>> a
>>>>>>>>>>>>> 1:1
>>>>>>>>>>>>>>>>>> correspondence where you can get an RDD to/from a
>>>>> DataFrame.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>>>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
>>>>>>>>>>>>>>>>>>> 1.3,
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> wanted
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> get the community's opinion.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common
>> data
>>>>>>>>>>>>>>>>>>> format
>>>>>>>>>>>>> used
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> bringing data into Spark from external systems, and
>>>>>>>>>>>>>>>>>>> used
>>>>> for
>>>>>>>>>>>>> various
>>>>>>>>>>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API.
>> We
>>>>> also
>>>>>>>>>>>> expect
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> more users to be programming directly against
>> SchemaRDD
>>>>> API
>>>>>>>>>>>> rather
>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> core RDD API. SchemaRDD, through its less commonly
>> used
>>>>> DSL
>>>>>>>>>>>>>> originally
>>>>>>>>>>>>>>>>>>> designed for writing test cases, always has the
>>>>>>>>>>>>>>>>>>> data-frame
>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>> API.
>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>> 1.3, we are redesigning the API to make the API
>> usable
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> end
>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name
>> than
>>>>>>>>>>>> SchemaRDD.
>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
>>>>>>>>>>>>>>>>>>> RDD
>>>>>>>>>>>> anymore
>>>>>>>>>>>>>>> (even
>>>>>>>>>>>>>>>>>>> though it would contain some RDD functions like map,
>>>>>>>>>>>>>>>>>>> flatMap,
>>>>>>>>>>>>> etc),
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> calling it Schema*RDD* while it is not an RDD is
>> highly
>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>>>>>> Instead.
>>>>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
>>>>>>>>>>>>>>>>>>> RDD
>>>>>>>>>>>> methods.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My understanding is that very few users program
>>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>> against
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> SchemaRDD API at the moment, because they are not
>> well
>>>>>>>>>>>> documented.
>>>>>>>>>>>>>>>>>> However,
>>>>>>>>>>>>>>>>>>> oo maintain backward compatibility, we can create a
>>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>>> alias
>>>>>>>>>>>>>>> DataFrame
>>>>>>>>>>>>>>>>>>> that is still named SchemaRDD. This will maintain
>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> Scala. That said, we will have to update all existing
>>>>>>>>>>>> materials to
>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>> DataFrame rather than SchemaRDD.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>> dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>>>>>>>>> dev-help@spark.apache.org
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>> dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>>>>>>>>> dev-help@spark.apache.org
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>> For additional commands, e-mail:
>> dev-help@spark.apache.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
to me the word DataFrame does come with certain expectations. one of them
is that the data is stored columnar. in R data.frame internally uses a list
of sequences i think, but since lists can have labels it's more like a
SortedMap[String, Array[_]]. this makes certain operations very cheap (such
as adding a column).
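
a toy sketch of that layout in plain Scala (nothing Spark-specific), just to
illustrate why adding a column is cheap under this representation:

    import scala.collection.immutable.SortedMap

    val frame: SortedMap[String, Array[Any]] = SortedMap(
      "age"  -> Array[Any](29, 41),
      "name" -> Array[Any]("alice", "bob"))

    // adding a column only inserts one more (name -> array) entry;
    // the existing column arrays stay untouched
    val withCity = frame + ("city" -> Array[Any]("nyc", "sf"))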

in Spark the closest thing would be a data structure where per Partition
the data is also stored columnar. does spark SQL already use something like
that? Evan mentioned "Spark SQL columnar compression", which sounds like
it. where can i find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <ve...@gmail.com> wrote:

> +1.... having proper NA support is much cleaner than using null, at
> least the Java null.
>
> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <ev...@gmail.com>
> wrote:
> > You've got to be a little bit careful here. "NA" in systems like R or
> pandas
> > may have special meaning that is distinct from "null".
> >
> > See, e.g. http://www.r-bloggers.com/r-na-vs-null/
> >
> >
> >
> > On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>
> >> Isn't that just "null" in SQL?
> >>
> >> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com>
> >> wrote:
> >>
> >> > I believe that most DataFrame implementations out there, like Pandas,
> >> > supports the idea of missing values / NA, and some support the idea of
> >> > Not Meaningful as well.
> >> >
> >> > Does Row support anything like that?  That is important for certain
> >> > applications.  I thought that Row worked by being a mutable object,
> >> > but haven't looked into the details in a while.
> >> >
> >> > -Evan
> >> >
> >> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com>
> >> > wrote:
> >> > > It shouldn't change the data source api at all because data sources
> >> > create
> >> > > RDD[Row], and that gets converted into a DataFrame automatically
> >> > (previously
> >> > > to SchemaRDD).
> >> > >
> >> > >
> >> >
> >> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >> > >
> >> > > One thing that will break the data source API in 1.3 is the location
> >> > > of
> >> > > types. Types were previously defined in sql.catalyst.types, and now
> >> > moved to
> >> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all
> >> > > public
> >> > APIs
> >> > > have first class classes/objects defined in sql directly.
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.github@gmail.com
> >
> >> > wrote:
> >> > >>
> >> > >> Hey guys,
> >> > >>
> >> > >> How does this impact the data sources API?  I was planning on using
> >> > >> this for a project.
> >> > >>
> >> > >> +1 that many things from spark-sql / DataFrame is universally
> >> > >> desirable and useful.
> >> > >>
> >> > >> By the way, one thing that prevents the columnar compression stuff
> in
> >> > >> Spark SQL from being more useful is, at least from previous talks
> >> > >> with
> >> > >> Reynold and Michael et al., that the format was not designed for
> >> > >> persistence.
> >> > >>
> >> > >> I have a new project that aims to change that.  It is a
> >> > >> zero-serialisation, high performance binary vector library,
> designed
> >> > >> from the outset to be a persistent storage friendly.  May be one
> day
> >> > >> it can replace the Spark SQL columnar compression.
> >> > >>
> >> > >> Michael told me this would be a lot of work, and recreates parts of
> >> > >> Parquet, but I think it's worth it.  LMK if you'd like more
> details.
> >> > >>
> >> > >> -Evan
> >> > >>
> >> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
> >> > wrote:
> >> > >> > Alright I have merged the patch (
> >> > >> > https://github.com/apache/spark/pull/4173
> >> > >> > ) since I don't see any strong opinions against it (as a matter
> of
> >> > fact
> >> > >> > most were for it). We can still change it if somebody lays out a
> >> > strong
> >> > >> > argument.
> >> > >> >
> >> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> > >> > <ma...@gmail.com>
> >> > >> > wrote:
> >> > >> >
> >> > >> >> The type alias means your methods can specify either type and
> they
> >> > will
> >> > >> >> work. It's just another name for the same type. But Scaladocs
> and
> >> > such
> >> > >> >> will
> >> > >> >> show DataFrame as the type.
> >> > >> >>
> >> > >> >> Matei
> >> > >> >>
> >> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> > >> >> dirceu.semighini@gmail.com> wrote:
> >> > >> >> >
> >> > >> >> > Reynold,
> >> > >> >> > But with type alias we will have the same problem, right?
> >> > >> >> > If the methods doesn't receive schemardd anymore, we will have
> >> > >> >> > to
> >> > >> >> > change
> >> > >> >> > our code to migrade from schema to dataframe. Unless we have
> an
> >> > >> >> > implicit
> >> > >> >> > conversion between DataFrame and SchemaRDD
> >> > >> >> >
> >> > >> >> >
> >> > >> >> >
> >> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> >> > >> >> >
> >> > >> >> >> Dirceu,
> >> > >> >> >>
> >> > >> >> >> That is not possible because one cannot overload return
> types.
> >> > >> >> >>
> >> > >> >> >> SQLContext.parquetFile (and many other methods) needs to
> return
> >> > some
> >> > >> >> type,
> >> > >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> > >> >> >>
> >> > >> >> >> In 1.3, we will create a type alias for DataFrame called
> >> > >> >> >> SchemaRDD
> >> > >> >> >> to
> >> > >> >> not
> >> > >> >> >> break source compatibility for Scala.
> >> > >> >> >>
> >> > >> >> >>
> >> > >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> > >> >> >> dirceu.semighini@gmail.com> wrote:
> >> > >> >> >>
> >> > >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> >> > removed
> >> > >> >> >>> in
> >> > >> >> the
> >> > >> >> >>> release 1.5(+/- 1)  for example, and the new code been added
> >> > >> >> >>> to
> >> > >> >> DataFrame?
> >> > >> >> >>> With this, we don't impact in existing code for the next few
> >> > >> >> >>> releases.
> >> > >> >> >>>
> >> > >> >> >>>
> >> > >> >> >>>
> >> > >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
> >> > >> >> >>> <ku...@gmail.com>:
> >> > >> >> >>>
> >> > >> >> >>>> I want to address the issue that Matei raised about the
> heavy
> >> > >> >> >>>> lifting
> >> > >> >> >>>> required for a full SQL support. It is amazing that even
> >> > >> >> >>>> after
> >> > 30
> >> > >> >> years
> >> > >> >> >>> of
> >> > >> >> >>>> research there is not a single good open source columnar
> >> > database
> >> > >> >> >>>> like
> >> > >> >> >>>> Vertica. There is a column store option in MySQL, but it is
> >> > >> >> >>>> not
> >> > >> >> >>>> nearly
> >> > >> >> >>> as
> >> > >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true
> need
> >> > >> >> >>>> for
> >> > >> >> >>>> such
> >> > >> >> a
> >> > >> >> >>>> system. I wonder why so and it's high time to change that.
> >> > >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza"
> >> > >> >> >>>> <sa...@cloudera.com>
> >> > >> >> wrote:
> >> > >> >> >>>>
> >> > >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I
> like
> >> > the
> >> > >> >> >>> former
> >> > >> >> >>>>> slightly better because it's more descriptive.
> >> > >> >> >>>>>
> >> > >> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
> >> > covers,
> >> > >> >> >>>>> it
> >> > >> >> >>> would
> >> > >> >> >>>>> be more clear from a user-facing perspective to at least
> >> > choose a
> >> > >> >> >>> package
> >> > >> >> >>>>> name for it that omits "sql".
> >> > >> >> >>>>>
> >> > >> >> >>>>> I would also be in favor of adding a separate Spark Schema
> >> > module
> >> > >> >> >>>>> for
> >> > >> >> >>>> Spark
> >> > >> >> >>>>> SQL to rely on, but I imagine that might be too large a
> >> > >> >> >>>>> change
> >> > at
> >> > >> >> this
> >> > >> >> >>>>> point?
> >> > >> >> >>>>>
> >> > >> >> >>>>> -Sandy
> >> > >> >> >>>>>
> >> > >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >> > >> >> >>> matei.zaharia@gmail.com>
> >> > >> >> >>>>> wrote:
> >> > >> >> >>>>>
> >> > >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving
> >> > >> >> >>>>>> it
> >> > >> >> >>>>>> another
> >> > >> >> >>>>> name,
> >> > >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since
> >> > >> >> >>>>>> that
> >> > >> >> >>>>>> was
> >> > >> >> >>> the
> >> > >> >> >>>>> most
> >> > >> >> >>>>>> obvious use case to many users.)
> >> > >> >> >>>>>>
> >> > >> >> >>>>>> Matei
> >> > >> >> >>>>>>
> >> > >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >> > >> >> >>> matei.zaharia@gmail.com>
> >> > >> >> >>>>>> wrote:
> >> > >> >> >>>>>>>
> >> > >> >> >>>>>>> While it might be possible to move this concept to Spark
> >> > >> >> >>>>>>> Core
> >> > >> >> >>>>> long-term,
> >> > >> >> >>>>>> supporting structured data efficiently does require
> quite a
> >> > bit
> >> > >> >> >>>>>> of
> >> > >> >> >>> the
> >> > >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
> >> > columnar
> >> > >> >> >>>> storage.
> >> > >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL
> >> > >> >> >>>>>> server
> >> > >> >> >>>>>> --
> >> > >> >> >>> it's
> >> > >> >> >>>>>> meant to be a library for manipulating structured data.
> >> > >> >> >>>>>> Since
> >> > >> >> >>>>>> this
> >> > >> >> >>> is
> >> > >> >> >>>>>> possible to build over the core API, it's pretty natural
> to
> >> > >> >> >>> organize it
> >> > >> >> >>>>>> that way, same as Spark Streaming is a library.
> >> > >> >> >>>>>>>
> >> > >> >> >>>>>>> Matei
> >> > >> >> >>>>>>>
> >> > >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
> >> > koert@tresata.com>
> >> > >> >> >>>> wrote:
> >> > >> >> >>>>>>>>
> >> > >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common
> data
> >> > >> >> >>>>>>>> format
> >> > >> >> >>> used
> >> > >> >> >>>>> for
> >> > >> >> >>>>>>>> bringing data into Spark from external systems, and
> used
> >> > >> >> >>>>>>>> for
> >> > >> >> >>> various
> >> > >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> >> > >> >> >>>>>>>>
> >> > >> >> >>>>>>>> i agree. this to me also implies it belongs in spark
> >> > >> >> >>>>>>>> core,
> >> > not
> >> > >> >> >>> sql
> >> > >> >> >>>>>>>>
> >> > >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> > >> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> >> > >> >> >>>>>>>>
> >> > >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet,
> >> > >> >> >>>>>>>>> the
> >> > >> >> >>>>>>>>> Jan.
> >> > >> >> >>> 13
> >> > >> >> >>>> Bay
> >> > >> >> >>>>>> Area
> >> > >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> >> > >> >> >>> information
> >> > >> >> >>>> on
> >> > >> >> >>>>>> this
> >> > >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>> ________________________________
> >> > >> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
> >> > >> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
> >> > >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >> > >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >> > >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>> One thing potentially not clear from this e-mail,
> there
> >> > will
> >> > >> >> >>>>>>>>> be
> >> > >> >> >>> a
> >> > >> >> >>>> 1:1
> >> > >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
> >> > DataFrame.
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >> > >> >> >>> rxin@databricks.com>
> >> > >> >> >>>>>> wrote:
> >> > >> >> >>>>>>>>>> Hi,
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
> >> > >> >> >>>>>>>>>> 1.3,
> >> > >> >> >>>>>>>>>> and
> >> > >> >> >>>>> wanted
> >> > >> >> >>>>>> to
> >> > >> >> >>>>>>>>>> get the community's opinion.
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common
> data
> >> > >> >> >>>>>>>>>> format
> >> > >> >> >>>> used
> >> > >> >> >>>>>> for
> >> > >> >> >>>>>>>>>> bringing data into Spark from external systems, and
> >> > >> >> >>>>>>>>>> used
> >> > for
> >> > >> >> >>>> various
> >> > >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API.
> We
> >> > also
> >> > >> >> >>> expect
> >> > >> >> >>>>>> more
> >> > >> >> >>>>>>>>> and
> >> > >> >> >>>>>>>>>> more users to be programming directly against
> SchemaRDD
> >> > API
> >> > >> >> >>> rather
> >> > >> >> >>>>>> than
> >> > >> >> >>>>>>>>> the
> >> > >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly
> used
> >> > DSL
> >> > >> >> >>>>> originally
> >> > >> >> >>>>>>>>>> designed for writing test cases, always has the
> >> > >> >> >>>>>>>>>> data-frame
> >> > >> >> >>>>>>>>>> like
> >> > >> >> >>>> API.
> >> > >> >> >>>>>> In
> >> > >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API
> usable
> >> > >> >> >>>>>>>>>> for
> >> > >> >> >>>>>>>>>> end
> >> > >> >> >>>>> users.
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>> There are two motivations for the renaming:
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name
> than
> >> > >> >> >>> SchemaRDD.
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
> >> > >> >> >>>>>>>>>> RDD
> >> > >> >> >>> anymore
> >> > >> >> >>>>>> (even
> >> > >> >> >>>>>>>>>> though it would contain some RDD functions like map,
> >> > >> >> >>>>>>>>>> flatMap,
> >> > >> >> >>>> etc),
> >> > >> >> >>>>>> and
> >> > >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is
> highly
> >> > >> >> >>> confusing.
> >> > >> >> >>>>>>>>> Instead.
> >> > >> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
> >> > >> >> >>>>>>>>>> RDD
> >> > >> >> >>> methods.
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>>
> >> > >> >> >>>>>>>>>> My understanding is that very few users program
> >> > >> >> >>>>>>>>>> directly
> >> > >> >> >>> against
> >> > >> >> >>>> the
> >> > >> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not
> well
> >> > >> >> >>> documented.
> >> > >> >> >>>>>>>>> However,
> >> > >> >> >>>>>>>>>> oo maintain backward compatibility, we can create a
> >> > >> >> >>>>>>>>>> type
> >> > >> >> >>>>>>>>>> alias
> >> > >> >> >>>>>> DataFrame
> >> > >> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain
> >> > >> >> >>>>>>>>>> source
> >> > >> >> >>>>> compatibility
> >> > >> >> >>>>>>>>> for
> >> > >> >> >>>>>>>>>> Scala. That said, we will have to update all existing
> >> > >> >> >>> materials to
> >> > >> >> >>>>> use
> >> > >> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>
> >> > >> >> >>>>
> >> > ---------------------------------------------------------------------
> >> > >> >> >>>>>>>>> To unsubscribe, e-mail:
> dev-unsubscribe@spark.apache.org
> >> > >> >> >>>>>>>>> For additional commands, e-mail:
> >> > >> >> >>>>>>>>> dev-help@spark.apache.org
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>
> >> > >> >> >>>>
> >> > ---------------------------------------------------------------------
> >> > >> >> >>>>>>>>> To unsubscribe, e-mail:
> dev-unsubscribe@spark.apache.org
> >> > >> >> >>>>>>>>> For additional commands, e-mail:
> >> > >> >> >>>>>>>>> dev-help@spark.apache.org
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>>>
> >> > >> >> >>>>>>>
> >> > >> >> >>>>>>
> >> > >> >> >>>>>>
> >> > >> >> >>>>>>
> >> > >> >> >>>
> >> > >> >> >>>
> >> > ---------------------------------------------------------------------
> >> > >> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> > >> >> >>>>>> For additional commands, e-mail:
> dev-help@spark.apache.org
> >> > >> >> >>>>>>
> >> > >> >> >>>>>>
> >> > >> >> >>>>>
> >> > >> >> >>>>
> >> > >> >> >>>
> >> > >> >> >>
> >> > >> >> >>
> >> > >> >>
> >> > >> >>
> >> > >> >>
> >> > >> >>
> ---------------------------------------------------------------------
> >> > >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> > >> >> For additional commands, e-mail: dev-help@spark.apache.org
> >> > >> >>
> >> > >> >>
> >> > >
> >> > >
> >> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Evan Chan <ve...@gmail.com>.
+1.... having proper NA support is much cleaner than using null, at
least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <ev...@gmail.com> wrote:
> You've got to be a little bit careful here. "NA" in systems like R or pandas
> may have special meaning that is distinct from "null".
>
> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>
>
>
> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> Isn't that just "null" in SQL?
>>
>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com>
>> wrote:
>>
>> > I believe that most DataFrame implementations out there, like Pandas,
>> > supports the idea of missing values / NA, and some support the idea of
>> > Not Meaningful as well.
>> >
>> > Does Row support anything like that?  That is important for certain
>> > applications.  I thought that Row worked by being a mutable object,
>> > but haven't looked into the details in a while.
>> >
>> > -Evan
>> >
>> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com>
>> > wrote:
>> > > It shouldn't change the data source api at all because data sources
>> > create
>> > > RDD[Row], and that gets converted into a DataFrame automatically
>> > (previously
>> > > to SchemaRDD).
>> > >
>> > >
>> >
>> > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>> > >
>> > > One thing that will break the data source API in 1.3 is the location
>> > > of
>> > > types. Types were previously defined in sql.catalyst.types, and now
>> > moved to
>> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all
>> > > public
>> > APIs
>> > > have first class classes/objects defined in sql directly.
>> > >
>> > >
>> > >
>> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com>
>> > wrote:
>> > >>
>> > >> Hey guys,
>> > >>
>> > >> How does this impact the data sources API?  I was planning on using
>> > >> this for a project.
>> > >>
>> > >> +1 that many things from spark-sql / DataFrame is universally
>> > >> desirable and useful.
>> > >>
>> > >> By the way, one thing that prevents the columnar compression stuff in
>> > >> Spark SQL from being more useful is, at least from previous talks
>> > >> with
>> > >> Reynold and Michael et al., that the format was not designed for
>> > >> persistence.
>> > >>
>> > >> I have a new project that aims to change that.  It is a
>> > >> zero-serialisation, high performance binary vector library, designed
>> > >> from the outset to be a persistent storage friendly.  May be one day
>> > >> it can replace the Spark SQL columnar compression.
>> > >>
>> > >> Michael told me this would be a lot of work, and recreates parts of
>> > >> Parquet, but I think it's worth it.  LMK if you'd like more details.
>> > >>
>> > >> -Evan
>> > >>
>> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
>> > wrote:
>> > >> > Alright I have merged the patch (
>> > >> > https://github.com/apache/spark/pull/4173
>> > >> > ) since I don't see any strong opinions against it (as a matter of
>> > fact
>> > >> > most were for it). We can still change it if somebody lays out a
>> > strong
>> > >> > argument.
>> > >> >
>> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > >> > <ma...@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> >> The type alias means your methods can specify either type and they
>> > will
>> > >> >> work. It's just another name for the same type. But Scaladocs and
>> > such
>> > >> >> will
>> > >> >> show DataFrame as the type.
>> > >> >>
>> > >> >> Matei
>> > >> >>
>> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> > >> >> dirceu.semighini@gmail.com> wrote:
>> > >> >> >
>> > >> >> > Reynold,
>> > >> >> > But with type alias we will have the same problem, right?
>> > >> >> > If the methods doesn't receive schemardd anymore, we will have
>> > >> >> > to
>> > >> >> > change
>> > >> >> > our code to migrade from schema to dataframe. Unless we have an
>> > >> >> > implicit
>> > >> >> > conversion between DataFrame and SchemaRDD
>> > >> >> >
>> > >> >> >
>> > >> >> >
>> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
>> > >> >> >
>> > >> >> >> Dirceu,
>> > >> >> >>
>> > >> >> >> That is not possible because one cannot overload return types.
>> > >> >> >>
>> > >> >> >> SQLContext.parquetFile (and many other methods) needs to return
>> > some
>> > >> >> type,
>> > >> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> > >> >> >>
>> > >> >> >> In 1.3, we will create a type alias for DataFrame called
>> > >> >> >> SchemaRDD
>> > >> >> >> to
>> > >> >> not
>> > >> >> >> break source compatibility for Scala.
>> > >> >> >>
>> > >> >> >>
>> > >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> > >> >> >> dirceu.semighini@gmail.com> wrote:
>> > >> >> >>
>> > >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
>> > removed
>> > >> >> >>> in
>> > >> >> the
>> > >> >> >>> release 1.5(+/- 1)  for example, and the new code been added
>> > >> >> >>> to
>> > >> >> DataFrame?
>> > >> >> >>> With this, we don't impact in existing code for the next few
>> > >> >> >>> releases.
>> > >> >> >>>
>> > >> >> >>>
>> > >> >> >>>
>> > >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
>> > >> >> >>> <ku...@gmail.com>:
>> > >> >> >>>
>> > >> >> >>>> I want to address the issue that Matei raised about the heavy
>> > >> >> >>>> lifting
>> > >> >> >>>> required for a full SQL support. It is amazing that even
>> > >> >> >>>> after
>> > 30
>> > >> >> years
>> > >> >> >>> of
>> > >> >> >>>> research there is not a single good open source columnar
>> > database
>> > >> >> >>>> like
>> > >> >> >>>> Vertica. There is a column store option in MySQL, but it is
>> > >> >> >>>> not
>> > >> >> >>>> nearly
>> > >> >> >>> as
>> > >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need
>> > >> >> >>>> for
>> > >> >> >>>> such
>> > >> >> a
>> > >> >> >>>> system. I wonder why so and it's high time to change that.
>> > >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza"
>> > >> >> >>>> <sa...@cloudera.com>
>> > >> >> wrote:
>> > >> >> >>>>
>> > >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
>> > the
>> > >> >> >>> former
>> > >> >> >>>>> slightly better because it's more descriptive.
>> > >> >> >>>>>
>> > >> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
>> > covers,
>> > >> >> >>>>> it
>> > >> >> >>> would
>> > >> >> >>>>> be more clear from a user-facing perspective to at least
>> > choose a
>> > >> >> >>> package
>> > >> >> >>>>> name for it that omits "sql".
>> > >> >> >>>>>
>> > >> >> >>>>> I would also be in favor of adding a separate Spark Schema
>> > module
>> > >> >> >>>>> for
>> > >> >> >>>> Spark
>> > >> >> >>>>> SQL to rely on, but I imagine that might be too large a
>> > >> >> >>>>> change
>> > at
>> > >> >> this
>> > >> >> >>>>> point?
>> > >> >> >>>>>
>> > >> >> >>>>> -Sandy
>> > >> >> >>>>>
>> > >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> > >> >> >>> matei.zaharia@gmail.com>
>> > >> >> >>>>> wrote:
>> > >> >> >>>>>
>> > >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving
>> > >> >> >>>>>> it
>> > >> >> >>>>>> another
>> > >> >> >>>>> name,
>> > >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since
>> > >> >> >>>>>> that
>> > >> >> >>>>>> was
>> > >> >> >>> the
>> > >> >> >>>>> most
>> > >> >> >>>>>> obvious use case to many users.)
>> > >> >> >>>>>>
>> > >> >> >>>>>> Matei
>> > >> >> >>>>>>
>> > >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> > >> >> >>> matei.zaharia@gmail.com>
>> > >> >> >>>>>> wrote:
>> > >> >> >>>>>>>
>> > >> >> >>>>>>> While it might be possible to move this concept to Spark
>> > >> >> >>>>>>> Core
>> > >> >> >>>>> long-term,
>> > >> >> >>>>>> supporting structured data efficiently does require quite a
>> > bit
>> > >> >> >>>>>> of
>> > >> >> >>> the
>> > >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
>> > columnar
>> > >> >> >>>> storage.
>> > >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL
>> > >> >> >>>>>> server
>> > >> >> >>>>>> --
>> > >> >> >>> it's
>> > >> >> >>>>>> meant to be a library for manipulating structured data.
>> > >> >> >>>>>> Since
>> > >> >> >>>>>> this
>> > >> >> >>> is
>> > >> >> >>>>>> possible to build over the core API, it's pretty natural to
>> > >> >> >>> organize it
>> > >> >> >>>>>> that way, same as Spark Streaming is a library.
>> > >> >> >>>>>>>
>> > >> >> >>>>>>> Matei
>> > >> >> >>>>>>>
>> > >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
>> > koert@tresata.com>
>> > >> >> >>>> wrote:
>> > >> >> >>>>>>>>
>> > >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
>> > >> >> >>>>>>>> format
>> > >> >> >>> used
>> > >> >> >>>>> for
>> > >> >> >>>>>>>> bringing data into Spark from external systems, and used
>> > >> >> >>>>>>>> for
>> > >> >> >>> various
>> > >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>> > >> >> >>>>>>>>
>> > >> >> >>>>>>>> i agree. this to me also implies it belongs in spark
>> > >> >> >>>>>>>> core,
>> > not
>> > >> >> >>> sql
>> > >> >> >>>>>>>>
>> > >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> > >> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>> > >> >> >>>>>>>>
>> > >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet,
>> > >> >> >>>>>>>>> the
>> > >> >> >>>>>>>>> Jan.
>> > >> >> >>> 13
>> > >> >> >>>> Bay
>> > >> >> >>>>>> Area
>> > >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
>> > >> >> >>> information
>> > >> >> >>>> on
>> > >> >> >>>>>> this
>> > >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>> ________________________________
>> > >> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
>> > >> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
>> > >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> > >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>> > >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>> One thing potentially not clear from this e-mail, there
>> > will
>> > >> >> >>>>>>>>> be
>> > >> >> >>> a
>> > >> >> >>>> 1:1
>> > >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
>> > DataFrame.
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>> > >> >> >>> rxin@databricks.com>
>> > >> >> >>>>>> wrote:
>> > >> >> >>>>>>>>>> Hi,
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
>> > >> >> >>>>>>>>>> 1.3,
>> > >> >> >>>>>>>>>> and
>> > >> >> >>>>> wanted
>> > >> >> >>>>>> to
>> > >> >> >>>>>>>>>> get the community's opinion.
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
>> > >> >> >>>>>>>>>> format
>> > >> >> >>>> used
>> > >> >> >>>>>> for
>> > >> >> >>>>>>>>>> bringing data into Spark from external systems, and
>> > >> >> >>>>>>>>>> used
>> > for
>> > >> >> >>>> various
>> > >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We
>> > also
>> > >> >> >>> expect
>> > >> >> >>>>>> more
>> > >> >> >>>>>>>>> and
>> > >> >> >>>>>>>>>> more users to be programming directly against SchemaRDD
>> > API
>> > >> >> >>> rather
>> > >> >> >>>>>> than
>> > >> >> >>>>>>>>> the
>> > >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used
>> > DSL
>> > >> >> >>>>> originally
>> > >> >> >>>>>>>>>> designed for writing test cases, always has the
>> > >> >> >>>>>>>>>> data-frame
>> > >> >> >>>>>>>>>> like
>> > >> >> >>>> API.
>> > >> >> >>>>>> In
>> > >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable
>> > >> >> >>>>>>>>>> for
>> > >> >> >>>>>>>>>> end
>> > >> >> >>>>> users.
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>> There are two motivations for the renaming:
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>> > >> >> >>> SchemaRDD.
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
>> > >> >> >>>>>>>>>> RDD
>> > >> >> >>> anymore
>> > >> >> >>>>>> (even
>> > >> >> >>>>>>>>>> though it would contain some RDD functions like map,
>> > >> >> >>>>>>>>>> flatMap,
>> > >> >> >>>> etc),
>> > >> >> >>>>>> and
>> > >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
>> > >> >> >>> confusing.
>> > >> >> >>>>>>>>> Instead.
>> > >> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
>> > >> >> >>>>>>>>>> RDD
>> > >> >> >>> methods.
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>>
>> > >> >> >>>>>>>>>> My understanding is that very few users program
>> > >> >> >>>>>>>>>> directly
>> > >> >> >>> against
>> > >> >> >>>> the
>> > >> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
>> > >> >> >>> documented.
>> > >> >> >>>>>>>>> However,
>> > >> >> >>>>>>>>>> oo maintain backward compatibility, we can create a
>> > >> >> >>>>>>>>>> type
>> > >> >> >>>>>>>>>> alias
>> > >> >> >>>>>> DataFrame
>> > >> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain
>> > >> >> >>>>>>>>>> source
>> > >> >> >>>>> compatibility
>> > >> >> >>>>>>>>> for
>> > >> >> >>>>>>>>>> Scala. That said, we will have to update all existing
>> > >> >> >>> materials to
>> > >> >> >>>>> use
>> > >> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>>
>> > >> >> >>>>
>> > >> >> >>>>
>> > ---------------------------------------------------------------------
>> > >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > >> >> >>>>>>>>> For additional commands, e-mail:
>> > >> >> >>>>>>>>> dev-help@spark.apache.org
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>>
>> > >> >> >>>>
>> > >> >> >>>>
>> > ---------------------------------------------------------------------
>> > >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > >> >> >>>>>>>>> For additional commands, e-mail:
>> > >> >> >>>>>>>>> dev-help@spark.apache.org
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>>>
>> > >> >> >>>>>>>
>> > >> >> >>>>>>
>> > >> >> >>>>>>
>> > >> >> >>>>>>
>> > >> >> >>>
>> > >> >> >>>
>> > ---------------------------------------------------------------------
>> > >> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > >> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> > >> >> >>>>>>
>> > >> >> >>>>>>
>> > >> >> >>>>>
>> > >> >> >>>>
>> > >> >> >>>
>> > >> >> >>
>> > >> >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >> ---------------------------------------------------------------------
>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > >> >> For additional commands, e-mail: dev-help@spark.apache.org
>> > >> >>
>> > >> >>
>> > >
>> > >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by "Evan R. Sparks" <ev...@gmail.com>.
You've got to be a little bit careful here. "NA" in systems like R or
pandas may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/
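
A tiny Scala sketch of the distinction (using Option as a stand-in for an
explicit NA-style marker; this is an illustration, not the Row API):

    // SQL-style: a missing value is just null inside the row
    val row: Seq[Any] = Seq("alice", null)

    // NA-style: missing-ness is explicit in the type, so callers must handle it
    val ages: Seq[Option[Int]] = Seq(Some(29), None)
    val filled = ages.map(_.getOrElse(-1))   // -1 is an arbitrary sentinel here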



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com> wrote:

> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com>
> wrote:
>
> > I believe that most DataFrame implementations out there, like Pandas,
> > supports the idea of missing values / NA, and some support the idea of
> > Not Meaningful as well.
> >
> > Does Row support anything like that?  That is important for certain
> > applications.  I thought that Row worked by being a mutable object,
> > but haven't looked into the details in a while.
> >
> > -Evan
> >
> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> > > It shouldn't change the data source api at all because data sources
> > create
> > > RDD[Row], and that gets converted into a DataFrame automatically
> > (previously
> > > to SchemaRDD).
> > >
> > >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> > >
> > > One thing that will break the data source API in 1.3 is the location of
> > > types. Types were previously defined in sql.catalyst.types, and now
> > moved to
> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
> > APIs
> > > have first class classes/objects defined in sql directly.
> > >
> > >
> > >
> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com>
> > wrote:
> > >>
> > >> Hey guys,
> > >>
> > >> How does this impact the data sources API?  I was planning on using
> > >> this for a project.
> > >>
> > >> +1 that many things from spark-sql / DataFrame is universally
> > >> desirable and useful.
> > >>
> > >> By the way, one thing that prevents the columnar compression stuff in
> > >> Spark SQL from being more useful is, at least from previous talks with
> > >> Reynold and Michael et al., that the format was not designed for
> > >> persistence.
> > >>
> > >> I have a new project that aims to change that.  It is a
> > >> zero-serialisation, high performance binary vector library, designed
> > >> from the outset to be a persistent storage friendly.  May be one day
> > >> it can replace the Spark SQL columnar compression.
> > >>
> > >> Michael told me this would be a lot of work, and recreates parts of
> > >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> > >>
> > >> -Evan
> > >>
> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
> > wrote:
> > >> > Alright I have merged the patch (
> > >> > https://github.com/apache/spark/pull/4173
> > >> > ) since I don't see any strong opinions against it (as a matter of
> > fact
> > >> > most were for it). We can still change it if somebody lays out a
> > strong
> > >> > argument.
> > >> >
> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> > >> > <ma...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> The type alias means your methods can specify either type and they
> > will
> > >> >> work. It's just another name for the same type. But Scaladocs and
> > such
> > >> >> will
> > >> >> show DataFrame as the type.
> > >> >>
> > >> >> Matei
> > >> >>
> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> > >> >> dirceu.semighini@gmail.com> wrote:
> > >> >> >
> > >> >> > Reynold,
> > >> >> > But with type alias we will have the same problem, right?
> > >> >> > If the methods doesn't receive schemardd anymore, we will have to
> > >> >> > change
> > >> >> > our code to migrade from schema to dataframe. Unless we have an
> > >> >> > implicit
> > >> >> > conversion between DataFrame and SchemaRDD
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> > >> >> >
> > >> >> >> Dirceu,
> > >> >> >>
> > >> >> >> That is not possible because one cannot overload return types.
> > >> >> >>
> > >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> > some
> > >> >> type,
> > >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> > >> >> >>
> > >> >> >> In 1.3, we will create a type alias for DataFrame called
> SchemaRDD
> > >> >> >> to
> > >> >> not
> > >> >> >> break source compatibility for Scala.
> > >> >> >>
> > >> >> >>
> > >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> > >> >> >> dirceu.semighini@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> > removed
> > >> >> >>> in
> > >> >> the
> > >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> > >> >> DataFrame?
> > >> >> >>> With this, we don't impact in existing code for the next few
> > >> >> >>> releases.
> > >> >> >>>
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.datta@gmail.com
> >:
> > >> >> >>>
> > >> >> >>>> I want to address the issue that Matei raised about the heavy
> > >> >> >>>> lifting
> > >> >> >>>> required for a full SQL support. It is amazing that even after
> > 30
> > >> >> years
> > >> >> >>> of
> > >> >> >>>> research there is not a single good open source columnar
> > database
> > >> >> >>>> like
> > >> >> >>>> Vertica. There is a column store option in MySQL, but it is
> not
> > >> >> >>>> nearly
> > >> >> >>> as
> > >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need
> for
> > >> >> >>>> such
> > >> >> a
> > >> >> >>>> system. I wonder why so and it's high time to change that.
> > >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <
> sandy.ryza@cloudera.com>
> > >> >> wrote:
> > >> >> >>>>
> > >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
> > the
> > >> >> >>> former
> > >> >> >>>>> slightly better because it's more descriptive.
> > >> >> >>>>>
> > >> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
> > covers,
> > >> >> >>>>> it
> > >> >> >>> would
> > >> >> >>>>> be more clear from a user-facing perspective to at least
> > choose a
> > >> >> >>> package
> > >> >> >>>>> name for it that omits "sql".
> > >> >> >>>>>
> > >> >> >>>>> I would also be in favor of adding a separate Spark Schema
> > module
> > >> >> >>>>> for
> > >> >> >>>> Spark
> > >> >> >>>>> SQL to rely on, but I imagine that might be too large a
> change
> > at
> > >> >> this
> > >> >> >>>>> point?
> > >> >> >>>>>
> > >> >> >>>>> -Sandy
> > >> >> >>>>>
> > >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> > >> >> >>> matei.zaharia@gmail.com>
> > >> >> >>>>> wrote:
> > >> >> >>>>>
> > >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving it
> > >> >> >>>>>> another
> > >> >> >>>>> name,
> > >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since
> that
> > >> >> >>>>>> was
> > >> >> >>> the
> > >> >> >>>>> most
> > >> >> >>>>>> obvious use case to many users.)
> > >> >> >>>>>>
> > >> >> >>>>>> Matei
> > >> >> >>>>>>
> > >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> > >> >> >>> matei.zaharia@gmail.com>
> > >> >> >>>>>> wrote:
> > >> >> >>>>>>>
> > >> >> >>>>>>> While it might be possible to move this concept to Spark
> Core
> > >> >> >>>>> long-term,
> > >> >> >>>>>> supporting structured data efficiently does require quite a
> > bit
> > >> >> >>>>>> of
> > >> >> >>> the
> > >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
> > columnar
> > >> >> >>>> storage.
> > >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL
> server
> > >> >> >>>>>> --
> > >> >> >>> it's
> > >> >> >>>>>> meant to be a library for manipulating structured data.
> Since
> > >> >> >>>>>> this
> > >> >> >>> is
> > >> >> >>>>>> possible to build over the core API, it's pretty natural to
> > >> >> >>> organize it
> > >> >> >>>>>> that way, same as Spark Streaming is a library.
> > >> >> >>>>>>>
> > >> >> >>>>>>> Matei
> > >> >> >>>>>>>
> > >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
> > koert@tresata.com>
> > >> >> >>>> wrote:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
> > >> >> >>>>>>>> format
> > >> >> >>> used
> > >> >> >>>>> for
> > >> >> >>>>>>>> bringing data into Spark from external systems, and used
> for
> > >> >> >>> various
> > >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> i agree. this to me also implies it belongs in spark core,
> > not
> > >> >> >>> sql
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > >> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the
> > >> >> >>>>>>>>> Jan.
> > >> >> >>> 13
> > >> >> >>>> Bay
> > >> >> >>>>>> Area
> > >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> > >> >> >>> information
> > >> >> >>>> on
> > >> >> >>>>>> this
> > >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> ________________________________
> > >> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
> > >> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
> > >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> > >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> > >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> One thing potentially not clear from this e-mail, there
> > will
> > >> >> >>>>>>>>> be
> > >> >> >>> a
> > >> >> >>>> 1:1
> > >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
> > DataFrame.
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> > >> >> >>> rxin@databricks.com>
> > >> >> >>>>>> wrote:
> > >> >> >>>>>>>>>> Hi,
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
> 1.3,
> > >> >> >>>>>>>>>> and
> > >> >> >>>>> wanted
> > >> >> >>>>>> to
> > >> >> >>>>>>>>>> get the community's opinion.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
> > >> >> >>>>>>>>>> format
> > >> >> >>>> used
> > >> >> >>>>>> for
> > >> >> >>>>>>>>>> bringing data into Spark from external systems, and used
> > for
> > >> >> >>>> various
> > >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We
> > also
> > >> >> >>> expect
> > >> >> >>>>>> more
> > >> >> >>>>>>>>> and
> > >> >> >>>>>>>>>> more users to be programming directly against SchemaRDD
> > API
> > >> >> >>> rather
> > >> >> >>>>>> than
> > >> >> >>>>>>>>> the
> > >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used
> > DSL
> > >> >> >>>>> originally
> > >> >> >>>>>>>>>> designed for writing test cases, always has the
> data-frame
> > >> >> >>>>>>>>>> like
> > >> >> >>>> API.
> > >> >> >>>>>> In
> > >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable
> for
> > >> >> >>>>>>>>>> end
> > >> >> >>>>> users.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> There are two motivations for the renaming:
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> > >> >> >>> SchemaRDD.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
> RDD
> > >> >> >>> anymore
> > >> >> >>>>>> (even
> > >> >> >>>>>>>>>> though it would contain some RDD functions like map,
> > >> >> >>>>>>>>>> flatMap,
> > >> >> >>>> etc),
> > >> >> >>>>>> and
> > >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
> > >> >> >>> confusing.
> > >> >> >>>>>>>>> Instead.
> > >> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
> > >> >> >>> methods.
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>>
> > >> >> >>>>>>>>>> My understanding is that very few users program directly
> > >> >> >>> against
> > >> >> >>>> the
> > >> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
> > >> >> >>> documented.
> > >> >> >>>>>>>>> However,
> > >> >> >>>>>>>>>> oo maintain backward compatibility, we can create a type
> > >> >> >>>>>>>>>> alias
> > >> >> >>>>>> DataFrame
> > >> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
> > >> >> >>>>> compatibility
> > >> >> >>>>>>>>> for
> > >> >> >>>>>>>>>> Scala. That said, we will have to update all existing
> > >> >> >>> materials to
> > >> >> >>>>> use
> > >> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>
> > >> >> >>>>
> > >> >> >>>>
> > ---------------------------------------------------------------------
> > >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >> >> >>>>>>>>> For additional commands, e-mail:
> dev-help@spark.apache.org
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>
> > >> >> >>>>
> > >> >> >>>>
> > ---------------------------------------------------------------------
> > >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >> >> >>>>>>>>> For additional commands, e-mail:
> dev-help@spark.apache.org
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>>>
> > >> >> >>>>>>>
> > >> >> >>>>>>
> > >> >> >>>>>>
> > >> >> >>>>>>
> > >> >> >>>
> > >> >> >>>
> > ---------------------------------------------------------------------
> > >> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> > >> >> >>>>>>
> > >> >> >>>>>>
> > >> >> >>>>>
> > >> >> >>>>
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >>
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >> >> For additional commands, e-mail: dev-help@spark.apache.org
> > >> >>
> > >> >>
> > >
> > >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Michael Armbrust <mi...@databricks.com>.
In particular, the performance tricks are in SpecificMutableRow.
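
(For anyone curious about the shape of those tricks, here is a heavily
simplified sketch of the general idea only -- the names below are made up and
the real catalyst class is more involved: primitives live in specialized
mutable slots with a separate null flag, so reading an Int column never
allocates a boxed Integer.)

    // Simplified sketch of the idea only; not the actual SpecificMutableRow code.
    final class MutableIntSlot {
      var isNull: Boolean = true
      var value: Int = 0
    }

    final class SketchMutableRow(arity: Int) {
      private val slots = Array.fill(arity)(new MutableIntSlot)
      def setInt(i: Int, v: Int): Unit = { slots(i).value = v; slots(i).isNull = false }
      def setNullAt(i: Int): Unit      = { slots(i).isNull = true }
      def isNullAt(i: Int): Boolean    = slots(i).isNull
      def getInt(i: Int): Int          = slots(i).value // callers check isNullAt first -- no boxing
    }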

On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan <ve...@gmail.com> wrote:

> Yeah, it's "null".   I was worried you couldn't represent it in Row
> because of primitive types like Int (unless you box the Int, which
> would be a performance hit).  Anyways, I'll take another look at the
> Row API again  :-p
>
> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com> wrote:
> > Isn't that just "null" in SQL?
> >
> > On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com>
> wrote:
> >>
> >> I believe that most DataFrame implementations out there, like Pandas,
> >> supports the idea of missing values / NA, and some support the idea of
> >> Not Meaningful as well.
> >>
> >> Does Row support anything like that?  That is important for certain
> >> applications.  I thought that Row worked by being a mutable object,
> >> but haven't looked into the details in a while.
> >>
> >> -Evan
> >>
> >> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >> > It shouldn't change the data source api at all because data sources
> >> > create
> >> > RDD[Row], and that gets converted into a DataFrame automatically
> >> > (previously
> >> > to SchemaRDD).
> >> >
> >> >
> >> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >> >
> >> > One thing that will break the data source API in 1.3 is the location
> of
> >> > types. Types were previously defined in sql.catalyst.types, and now
> >> > moved to
> >> > sql.types. After 1.3, sql.catalyst is hidden from users, and all
> public
> >> > APIs
> >> > have first class classes/objects defined in sql directly.
> >> >
> >> >
> >> >
> >> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Hey guys,
> >> >>
> >> >> How does this impact the data sources API?  I was planning on using
> >> >> this for a project.
> >> >>
> >> >> +1 that many things from spark-sql / DataFrame is universally
> >> >> desirable and useful.
> >> >>
> >> >> By the way, one thing that prevents the columnar compression stuff in
> >> >> Spark SQL from being more useful is, at least from previous talks
> with
> >> >> Reynold and Michael et al., that the format was not designed for
> >> >> persistence.
> >> >>
> >> >> I have a new project that aims to change that.  It is a
> >> >> zero-serialisation, high performance binary vector library, designed
> >> >> from the outset to be a persistent storage friendly.  May be one day
> >> >> it can replace the Spark SQL columnar compression.
> >> >>
> >> >> Michael told me this would be a lot of work, and recreates parts of
> >> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >> >>
> >> >> -Evan
> >> >>
> >> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
> >> >> wrote:
> >> >> > Alright I have merged the patch (
> >> >> > https://github.com/apache/spark/pull/4173
> >> >> > ) since I don't see any strong opinions against it (as a matter of
> >> >> > fact
> >> >> > most were for it). We can still change it if somebody lays out a
> >> >> > strong
> >> >> > argument.
> >> >> >
> >> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> >> > <ma...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> The type alias means your methods can specify either type and they
> >> >> >> will
> >> >> >> work. It's just another name for the same type. But Scaladocs and
> >> >> >> such
> >> >> >> will
> >> >> >> show DataFrame as the type.
> >> >> >>
> >> >> >> Matei
> >> >> >>
> >> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> >> >> dirceu.semighini@gmail.com> wrote:
> >> >> >> >
> >> >> >> > Reynold,
> >> >> >> > But with type alias we will have the same problem, right?
> >> >> >> > If the methods doesn't receive schemardd anymore, we will have
> to
> >> >> >> > change
> >> >> >> > our code to migrade from schema to dataframe. Unless we have an
> >> >> >> > implicit
> >> >> >> > conversion between DataFrame and SchemaRDD
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> >> >> >> >
> >> >> >> >> Dirceu,
> >> >> >> >>
> >> >> >> >> That is not possible because one cannot overload return types.
> >> >> >> >>
> >> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> >> >> >> >> some
> >> >> >> type,
> >> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >> >> >>
> >> >> >> >> In 1.3, we will create a type alias for DataFrame called
> >> >> >> >> SchemaRDD
> >> >> >> >> to
> >> >> >> not
> >> >> >> >> break source compatibility for Scala.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> >> >> dirceu.semighini@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> >> >> >> >>> removed
> >> >> >> >>> in
> >> >> >> the
> >> >> >> >>> release 1.5(+/- 1)  for example, and the new code been added
> to
> >> >> >> DataFrame?
> >> >> >> >>> With this, we don't impact in existing code for the next few
> >> >> >> >>> releases.
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <
> kushal.datta@gmail.com>:
> >> >> >> >>>
> >> >> >> >>>> I want to address the issue that Matei raised about the heavy
> >> >> >> >>>> lifting
> >> >> >> >>>> required for a full SQL support. It is amazing that even
> after
> >> >> >> >>>> 30
> >> >> >> years
> >> >> >> >>> of
> >> >> >> >>>> research there is not a single good open source columnar
> >> >> >> >>>> database
> >> >> >> >>>> like
> >> >> >> >>>> Vertica. There is a column store option in MySQL, but it is
> not
> >> >> >> >>>> nearly
> >> >> >> >>> as
> >> >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need
> >> >> >> >>>> for
> >> >> >> >>>> such
> >> >> >> a
> >> >> >> >>>> system. I wonder why so and it's high time to change that.
> >> >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <
> sandy.ryza@cloudera.com>
> >> >> >> wrote:
> >> >> >> >>>>
> >> >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
> >> >> >> >>>>> the
> >> >> >> >>> former
> >> >> >> >>>>> slightly better because it's more descriptive.
> >> >> >> >>>>>
> >> >> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
> >> >> >> >>>>> covers,
> >> >> >> >>>>> it
> >> >> >> >>> would
> >> >> >> >>>>> be more clear from a user-facing perspective to at least
> >> >> >> >>>>> choose a
> >> >> >> >>> package
> >> >> >> >>>>> name for it that omits "sql".
> >> >> >> >>>>>
> >> >> >> >>>>> I would also be in favor of adding a separate Spark Schema
> >> >> >> >>>>> module
> >> >> >> >>>>> for
> >> >> >> >>>> Spark
> >> >> >> >>>>> SQL to rely on, but I imagine that might be too large a
> change
> >> >> >> >>>>> at
> >> >> >> this
> >> >> >> >>>>> point?
> >> >> >> >>>>>
> >> >> >> >>>>> -Sandy
> >> >> >> >>>>>
> >> >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >> >> >> >>> matei.zaharia@gmail.com>
> >> >> >> >>>>> wrote:
> >> >> >> >>>>>
> >> >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving
> it
> >> >> >> >>>>>> another
> >> >> >> >>>>> name,
> >> >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since
> >> >> >> >>>>>> that
> >> >> >> >>>>>> was
> >> >> >> >>> the
> >> >> >> >>>>> most
> >> >> >> >>>>>> obvious use case to many users.)
> >> >> >> >>>>>>
> >> >> >> >>>>>> Matei
> >> >> >> >>>>>>
> >> >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >> >> >> >>> matei.zaharia@gmail.com>
> >> >> >> >>>>>> wrote:
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> While it might be possible to move this concept to Spark
> >> >> >> >>>>>>> Core
> >> >> >> >>>>> long-term,
> >> >> >> >>>>>> supporting structured data efficiently does require quite a
> >> >> >> >>>>>> bit
> >> >> >> >>>>>> of
> >> >> >> >>> the
> >> >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
> >> >> >> >>>>>> columnar
> >> >> >> >>>> storage.
> >> >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL
> >> >> >> >>>>>> server
> >> >> >> >>>>>> --
> >> >> >> >>> it's
> >> >> >> >>>>>> meant to be a library for manipulating structured data.
> Since
> >> >> >> >>>>>> this
> >> >> >> >>> is
> >> >> >> >>>>>> possible to build over the core API, it's pretty natural to
> >> >> >> >>> organize it
> >> >> >> >>>>>> that way, same as Spark Streaming is a library.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Matei
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers
> >> >> >> >>>>>>>> <ko...@tresata.com>
> >> >> >> >>>> wrote:
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
> >> >> >> >>>>>>>> format
> >> >> >> >>> used
> >> >> >> >>>>> for
> >> >> >> >>>>>>>> bringing data into Spark from external systems, and used
> >> >> >> >>>>>>>> for
> >> >> >> >>> various
> >> >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> i agree. this to me also implies it belongs in spark
> core,
> >> >> >> >>>>>>>> not
> >> >> >> >>> sql
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> >> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet,
> the
> >> >> >> >>>>>>>>> Jan.
> >> >> >> >>> 13
> >> >> >> >>>> Bay
> >> >> >> >>>>>> Area
> >> >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> >> >> >> >>> information
> >> >> >> >>>> on
> >> >> >> >>>>>> this
> >> >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> ________________________________
> >> >> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
> >> >> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
> >> >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >> >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >> >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> One thing potentially not clear from this e-mail, there
> >> >> >> >>>>>>>>> will
> >> >> >> >>>>>>>>> be
> >> >> >> >>> a
> >> >> >> >>>> 1:1
> >> >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
> >> >> >> >>>>>>>>> DataFrame.
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >> >> >> >>> rxin@databricks.com>
> >> >> >> >>>>>> wrote:
> >> >> >> >>>>>>>>>> Hi,
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
> >> >> >> >>>>>>>>>> 1.3,
> >> >> >> >>>>>>>>>> and
> >> >> >> >>>>> wanted
> >> >> >> >>>>>> to
> >> >> >> >>>>>>>>>> get the community's opinion.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
> >> >> >> >>>>>>>>>> format
> >> >> >> >>>> used
> >> >> >> >>>>>> for
> >> >> >> >>>>>>>>>> bringing data into Spark from external systems, and
> used
> >> >> >> >>>>>>>>>> for
> >> >> >> >>>> various
> >> >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We
> >> >> >> >>>>>>>>>> also
> >> >> >> >>> expect
> >> >> >> >>>>>> more
> >> >> >> >>>>>>>>> and
> >> >> >> >>>>>>>>>> more users to be programming directly against SchemaRDD
> >> >> >> >>>>>>>>>> API
> >> >> >> >>> rather
> >> >> >> >>>>>> than
> >> >> >> >>>>>>>>> the
> >> >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used
> >> >> >> >>>>>>>>>> DSL
> >> >> >> >>>>> originally
> >> >> >> >>>>>>>>>> designed for writing test cases, always has the
> >> >> >> >>>>>>>>>> data-frame
> >> >> >> >>>>>>>>>> like
> >> >> >> >>>> API.
> >> >> >> >>>>>> In
> >> >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable
> >> >> >> >>>>>>>>>> for
> >> >> >> >>>>>>>>>> end
> >> >> >> >>>>> users.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> There are two motivations for the renaming:
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> >> >> >> >>> SchemaRDD.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
> RDD
> >> >> >> >>> anymore
> >> >> >> >>>>>> (even
> >> >> >> >>>>>>>>>> though it would contain some RDD functions like map,
> >> >> >> >>>>>>>>>> flatMap,
> >> >> >> >>>> etc),
> >> >> >> >>>>>> and
> >> >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
> >> >> >> >>> confusing.
> >> >> >> >>>>>>>>> Instead.
> >> >> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
> RDD
> >> >> >> >>> methods.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> My understanding is that very few users program
> directly
> >> >> >> >>> against
> >> >> >> >>>> the
> >> >> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
> >> >> >> >>> documented.
> >> >> >> >>>>>>>>> However,
> >> >> >> >>>>>>>>>> oo maintain backward compatibility, we can create a
> type
> >> >> >> >>>>>>>>>> alias
> >> >> >> >>>>>> DataFrame
> >> >> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain
> source
> >> >> >> >>>>> compatibility
> >> >> >> >>>>>>>>> for
> >> >> >> >>>>>>>>>> Scala. That said, we will have to update all existing
> >> >> >> >>> materials to
> >> >> >> >>>>> use
> >> >> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>
> ---------------------------------------------------------------------
> >> >> >> >>>>>>>>> To unsubscribe, e-mail:
> dev-unsubscribe@spark.apache.org
> >> >> >> >>>>>>>>> For additional commands, e-mail:
> dev-help@spark.apache.org
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>
> ---------------------------------------------------------------------
> >> >> >> >>>>>>>>> To unsubscribe, e-mail:
> dev-unsubscribe@spark.apache.org
> >> >> >> >>>>>>>>> For additional commands, e-mail:
> dev-help@spark.apache.org
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> >> For additional commands, e-mail: dev-help@spark.apache.org
> >> >> >>
> >> >> >>
> >> >
> >> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Evan Chan <ve...@gmail.com>.
Yeah, it's "null".   I was worried you couldn't represent it in Row
because of primitive types like Int (unless you box the Int, which
would be a performance hit).  Anyways, I'll take another look at the
Row API again  :-p
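
(For reference, a minimal sketch of how that reads against the 1.3-style Row
API -- hedging a bit on the exact method names, but as far as I can tell the
null check is separate from the primitive getter, so the unboxed path stays
available:)

    import org.apache.spark.sql.Row

    val r = Row(42, null)                                   // second column is "missing"
    val first: Int = if (r.isNullAt(0)) 0 else r.getInt(0)  // primitive getter, no boxing
    val secondIsMissing = r.isNullAt(1)                     // true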

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rx...@databricks.com> wrote:
> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com> wrote:
>>
>> I believe that most DataFrame implementations out there, like Pandas,
>> supports the idea of missing values / NA, and some support the idea of
>> Not Meaningful as well.
>>
>> Does Row support anything like that?  That is important for certain
>> applications.  I thought that Row worked by being a mutable object,
>> but haven't looked into the details in a while.
>>
>> -Evan
>>
>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>> > It shouldn't change the data source api at all because data sources
>> > create
>> > RDD[Row], and that gets converted into a DataFrame automatically
>> > (previously
>> > to SchemaRDD).
>> >
>> >
>> > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>> >
>> > One thing that will break the data source API in 1.3 is the location of
>> > types. Types were previously defined in sql.catalyst.types, and now
>> > moved to
>> > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
>> > APIs
>> > have first class classes/objects defined in sql directly.
>> >
>> >
>> >
>> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com>
>> > wrote:
>> >>
>> >> Hey guys,
>> >>
>> >> How does this impact the data sources API?  I was planning on using
>> >> this for a project.
>> >>
>> >> +1 that many things from spark-sql / DataFrame is universally
>> >> desirable and useful.
>> >>
>> >> By the way, one thing that prevents the columnar compression stuff in
>> >> Spark SQL from being more useful is, at least from previous talks with
>> >> Reynold and Michael et al., that the format was not designed for
>> >> persistence.
>> >>
>> >> I have a new project that aims to change that.  It is a
>> >> zero-serialisation, high performance binary vector library, designed
>> >> from the outset to be a persistent storage friendly.  May be one day
>> >> it can replace the Spark SQL columnar compression.
>> >>
>> >> Michael told me this would be a lot of work, and recreates parts of
>> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
>> >>
>> >> -Evan
>> >>
>> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
>> >> wrote:
>> >> > Alright I have merged the patch (
>> >> > https://github.com/apache/spark/pull/4173
>> >> > ) since I don't see any strong opinions against it (as a matter of
>> >> > fact
>> >> > most were for it). We can still change it if somebody lays out a
>> >> > strong
>> >> > argument.
>> >> >
>> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> >> > <ma...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> The type alias means your methods can specify either type and they
>> >> >> will
>> >> >> work. It's just another name for the same type. But Scaladocs and
>> >> >> such
>> >> >> will
>> >> >> show DataFrame as the type.
>> >> >>
>> >> >> Matei
>> >> >>
>> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> >> dirceu.semighini@gmail.com> wrote:
>> >> >> >
>> >> >> > Reynold,
>> >> >> > But with type alias we will have the same problem, right?
>> >> >> > If the methods doesn't receive schemardd anymore, we will have to
>> >> >> > change
>> >> >> > our code to migrade from schema to dataframe. Unless we have an
>> >> >> > implicit
>> >> >> > conversion between DataFrame and SchemaRDD
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
>> >> >> >
>> >> >> >> Dirceu,
>> >> >> >>
>> >> >> >> That is not possible because one cannot overload return types.
>> >> >> >>
>> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
>> >> >> >> some
>> >> >> type,
>> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> >> >> >>
>> >> >> >> In 1.3, we will create a type alias for DataFrame called
>> >> >> >> SchemaRDD
>> >> >> >> to
>> >> >> not
>> >> >> >> break source compatibility for Scala.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> >> >> dirceu.semighini@gmail.com> wrote:
>> >> >> >>
>> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
>> >> >> >>> removed
>> >> >> >>> in
>> >> >> the
>> >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> >> >> DataFrame?
>> >> >> >>> With this, we don't impact in existing code for the next few
>> >> >> >>> releases.
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>> >> >> >>>
>> >> >> >>>> I want to address the issue that Matei raised about the heavy
>> >> >> >>>> lifting
>> >> >> >>>> required for a full SQL support. It is amazing that even after
>> >> >> >>>> 30
>> >> >> years
>> >> >> >>> of
>> >> >> >>>> research there is not a single good open source columnar
>> >> >> >>>> database
>> >> >> >>>> like
>> >> >> >>>> Vertica. There is a column store option in MySQL, but it is not
>> >> >> >>>> nearly
>> >> >> >>> as
>> >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need
>> >> >> >>>> for
>> >> >> >>>> such
>> >> >> a
>> >> >> >>>> system. I wonder why so and it's high time to change that.
>> >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com>
>> >> >> wrote:
>> >> >> >>>>
>> >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
>> >> >> >>>>> the
>> >> >> >>> former
>> >> >> >>>>> slightly better because it's more descriptive.
>> >> >> >>>>>
>> >> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
>> >> >> >>>>> covers,
>> >> >> >>>>> it
>> >> >> >>> would
>> >> >> >>>>> be more clear from a user-facing perspective to at least
>> >> >> >>>>> choose a
>> >> >> >>> package
>> >> >> >>>>> name for it that omits "sql".
>> >> >> >>>>>
>> >> >> >>>>> I would also be in favor of adding a separate Spark Schema
>> >> >> >>>>> module
>> >> >> >>>>> for
>> >> >> >>>> Spark
>> >> >> >>>>> SQL to rely on, but I imagine that might be too large a change
>> >> >> >>>>> at
>> >> >> this
>> >> >> >>>>> point?
>> >> >> >>>>>
>> >> >> >>>>> -Sandy
>> >> >> >>>>>
>> >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >> >> >>> matei.zaharia@gmail.com>
>> >> >> >>>>> wrote:
>> >> >> >>>>>
>> >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving it
>> >> >> >>>>>> another
>> >> >> >>>>> name,
>> >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since
>> >> >> >>>>>> that
>> >> >> >>>>>> was
>> >> >> >>> the
>> >> >> >>>>> most
>> >> >> >>>>>> obvious use case to many users.)
>> >> >> >>>>>>
>> >> >> >>>>>> Matei
>> >> >> >>>>>>
>> >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >> >> >>> matei.zaharia@gmail.com>
>> >> >> >>>>>> wrote:
>> >> >> >>>>>>>
>> >> >> >>>>>>> While it might be possible to move this concept to Spark
>> >> >> >>>>>>> Core
>> >> >> >>>>> long-term,
>> >> >> >>>>>> supporting structured data efficiently does require quite a
>> >> >> >>>>>> bit
>> >> >> >>>>>> of
>> >> >> >>> the
>> >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
>> >> >> >>>>>> columnar
>> >> >> >>>> storage.
>> >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL
>> >> >> >>>>>> server
>> >> >> >>>>>> --
>> >> >> >>> it's
>> >> >> >>>>>> meant to be a library for manipulating structured data. Since
>> >> >> >>>>>> this
>> >> >> >>> is
>> >> >> >>>>>> possible to build over the core API, it's pretty natural to
>> >> >> >>> organize it
>> >> >> >>>>>> that way, same as Spark Streaming is a library.
>> >> >> >>>>>>>
>> >> >> >>>>>>> Matei
>> >> >> >>>>>>>
>> >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers
>> >> >> >>>>>>>> <ko...@tresata.com>
>> >> >> >>>> wrote:
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
>> >> >> >>>>>>>> format
>> >> >> >>> used
>> >> >> >>>>> for
>> >> >> >>>>>>>> bringing data into Spark from external systems, and used
>> >> >> >>>>>>>> for
>> >> >> >>> various
>> >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> i agree. this to me also implies it belongs in spark core,
>> >> >> >>>>>>>> not
>> >> >> >>> sql
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> >> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the
>> >> >> >>>>>>>>> Jan.
>> >> >> >>> 13
>> >> >> >>>> Bay
>> >> >> >>>>>> Area
>> >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
>> >> >> >>> information
>> >> >> >>>> on
>> >> >> >>>>>> this
>> >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> ________________________________
>> >> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
>> >> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
>> >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>> >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> One thing potentially not clear from this e-mail, there
>> >> >> >>>>>>>>> will
>> >> >> >>>>>>>>> be
>> >> >> >>> a
>> >> >> >>>> 1:1
>> >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
>> >> >> >>>>>>>>> DataFrame.
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>> >> >> >>> rxin@databricks.com>
>> >> >> >>>>>> wrote:
>> >> >> >>>>>>>>>> Hi,
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
>> >> >> >>>>>>>>>> 1.3,
>> >> >> >>>>>>>>>> and
>> >> >> >>>>> wanted
>> >> >> >>>>>> to
>> >> >> >>>>>>>>>> get the community's opinion.
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
>> >> >> >>>>>>>>>> format
>> >> >> >>>> used
>> >> >> >>>>>> for
>> >> >> >>>>>>>>>> bringing data into Spark from external systems, and used
>> >> >> >>>>>>>>>> for
>> >> >> >>>> various
>> >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We
>> >> >> >>>>>>>>>> also
>> >> >> >>> expect
>> >> >> >>>>>> more
>> >> >> >>>>>>>>> and
>> >> >> >>>>>>>>>> more users to be programming directly against SchemaRDD
>> >> >> >>>>>>>>>> API
>> >> >> >>> rather
>> >> >> >>>>>> than
>> >> >> >>>>>>>>> the
>> >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used
>> >> >> >>>>>>>>>> DSL
>> >> >> >>>>> originally
>> >> >> >>>>>>>>>> designed for writing test cases, always has the
>> >> >> >>>>>>>>>> data-frame
>> >> >> >>>>>>>>>> like
>> >> >> >>>> API.
>> >> >> >>>>>> In
>> >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable
>> >> >> >>>>>>>>>> for
>> >> >> >>>>>>>>>> end
>> >> >> >>>>> users.
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>> There are two motivations for the renaming:
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>> >> >> >>> SchemaRDD.
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>> >> >> >>> anymore
>> >> >> >>>>>> (even
>> >> >> >>>>>>>>>> though it would contain some RDD functions like map,
>> >> >> >>>>>>>>>> flatMap,
>> >> >> >>>> etc),
>> >> >> >>>>>> and
>> >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
>> >> >> >>> confusing.
>> >> >> >>>>>>>>> Instead.
>> >> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
>> >> >> >>> methods.
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>>
>> >> >> >>>>>>>>>> My understanding is that very few users program directly
>> >> >> >>> against
>> >> >> >>>> the
>> >> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
>> >> >> >>> documented.
>> >> >> >>>>>>>>> However,
>> >> >> >>>>>>>>>> oo maintain backward compatibility, we can create a type
>> >> >> >>>>>>>>>> alias
>> >> >> >>>>>> DataFrame
>> >> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
>> >> >> >>>>> compatibility
>> >> >> >>>>>>>>> for
>> >> >> >>>>>>>>>> Scala. That said, we will have to update all existing
>> >> >> >>> materials to
>> >> >> >>>>> use
>> >> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> ---------------------------------------------------------------------
>> >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> ---------------------------------------------------------------------
>> >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> ---------------------------------------------------------------------
>> >> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >>
>> >> >>
>> >
>> >
>
>



Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
Isn't that just "null" in SQL?
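
(A minimal sketch of what that looks like with the post-1.3 type names
mentioned earlier in the thread -- exact imports hedged, since
sql.catalyst.types just moved to sql.types: a missing value is simply a
nullable field in the schema and a null in the corresponding Row slot.)

    import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
    import org.apache.spark.sql.Row

    val schema = StructType(Seq(
      StructField("id",    IntegerType, nullable = false),
      StructField("score", IntegerType, nullable = true)    // "NA" just becomes null here
    ))
    val rows = Seq(Row(1, 10), Row(2, null))                // second score is missing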

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <ve...@gmail.com> wrote:

> I believe that most DataFrame implementations out there, like Pandas,
> supports the idea of missing values / NA, and some support the idea of
> Not Meaningful as well.
>
> Does Row support anything like that?  That is important for certain
> applications.  I thought that Row worked by being a mutable object,
> but haven't looked into the details in a while.
>
> -Evan
>
> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com> wrote:
> > It shouldn't change the data source api at all because data sources
> create
> > RDD[Row], and that gets converted into a DataFrame automatically
> (previously
> > to SchemaRDD).
> >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >
> > One thing that will break the data source API in 1.3 is the location of
> > types. Types were previously defined in sql.catalyst.types, and now
> moved to
> > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
> APIs
> > have first class classes/objects defined in sql directly.
> >
> >
> >
> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com>
> wrote:
> >>
> >> Hey guys,
> >>
> >> How does this impact the data sources API?  I was planning on using
> >> this for a project.
> >>
> >> +1 that many things from spark-sql / DataFrame is universally
> >> desirable and useful.
> >>
> >> By the way, one thing that prevents the columnar compression stuff in
> >> Spark SQL from being more useful is, at least from previous talks with
> >> Reynold and Michael et al., that the format was not designed for
> >> persistence.
> >>
> >> I have a new project that aims to change that.  It is a
> >> zero-serialisation, high performance binary vector library, designed
> >> from the outset to be a persistent storage friendly.  May be one day
> >> it can replace the Spark SQL columnar compression.
> >>
> >> Michael told me this would be a lot of work, and recreates parts of
> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >>
> >> -Evan
> >>
> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >> > Alright I have merged the patch (
> >> > https://github.com/apache/spark/pull/4173
> >> > ) since I don't see any strong opinions against it (as a matter of
> fact
> >> > most were for it). We can still change it if somebody lays out a
> strong
> >> > argument.
> >> >
> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> > <ma...@gmail.com>
> >> > wrote:
> >> >
> >> >> The type alias means your methods can specify either type and they
> will
> >> >> work. It's just another name for the same type. But Scaladocs and
> such
> >> >> will
> >> >> show DataFrame as the type.
> >> >>
> >> >> Matei
> >> >>
> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> >> dirceu.semighini@gmail.com> wrote:
> >> >> >
> >> >> > Reynold,
> >> >> > But with type alias we will have the same problem, right?
> >> >> > If the methods doesn't receive schemardd anymore, we will have to
> >> >> > change
> >> >> > our code to migrade from schema to dataframe. Unless we have an
> >> >> > implicit
> >> >> > conversion between DataFrame and SchemaRDD
> >> >> >
> >> >> >
> >> >> >
> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> >> >> >
> >> >> >> Dirceu,
> >> >> >>
> >> >> >> That is not possible because one cannot overload return types.
> >> >> >>
> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> some
> >> >> type,
> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >> >>
> >> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
> >> >> >> to
> >> >> not
> >> >> >> break source compatibility for Scala.
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> >> dirceu.semighini@gmail.com> wrote:
> >> >> >>
> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> removed
> >> >> >>> in
> >> >> the
> >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> >> >> DataFrame?
> >> >> >>> With this, we don't impact in existing code for the next few
> >> >> >>> releases.
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
> >> >> >>>
> >> >> >>>> I want to address the issue that Matei raised about the heavy
> >> >> >>>> lifting
> >> >> >>>> required for a full SQL support. It is amazing that even after
> 30
> >> >> years
> >> >> >>> of
> >> >> >>>> research there is not a single good open source columnar
> database
> >> >> >>>> like
> >> >> >>>> Vertica. There is a column store option in MySQL, but it is not
> >> >> >>>> nearly
> >> >> >>> as
> >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need for
> >> >> >>>> such
> >> >> a
> >> >> >>>> system. I wonder why so and it's high time to change that.
> >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com>
> >> >> wrote:
> >> >> >>>>
> >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
> the
> >> >> >>> former
> >> >> >>>>> slightly better because it's more descriptive.
> >> >> >>>>>
> >> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
> covers,
> >> >> >>>>> it
> >> >> >>> would
> >> >> >>>>> be more clear from a user-facing perspective to at least
> choose a
> >> >> >>> package
> >> >> >>>>> name for it that omits "sql".
> >> >> >>>>>
> >> >> >>>>> I would also be in favor of adding a separate Spark Schema
> module
> >> >> >>>>> for
> >> >> >>>> Spark
> >> >> >>>>> SQL to rely on, but I imagine that might be too large a change
> at
> >> >> this
> >> >> >>>>> point?
> >> >> >>>>>
> >> >> >>>>> -Sandy
> >> >> >>>>>
> >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >> >> >>> matei.zaharia@gmail.com>
> >> >> >>>>> wrote:
> >> >> >>>>>
> >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving it
> >> >> >>>>>> another
> >> >> >>>>> name,
> >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since that
> >> >> >>>>>> was
> >> >> >>> the
> >> >> >>>>> most
> >> >> >>>>>> obvious use case to many users.)
> >> >> >>>>>>
> >> >> >>>>>> Matei
> >> >> >>>>>>
> >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >> >> >>> matei.zaharia@gmail.com>
> >> >> >>>>>> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>> While it might be possible to move this concept to Spark Core
> >> >> >>>>> long-term,
> >> >> >>>>>> supporting structured data efficiently does require quite a
> bit
> >> >> >>>>>> of
> >> >> >>> the
> >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
> columnar
> >> >> >>>> storage.
> >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL server
> >> >> >>>>>> --
> >> >> >>> it's
> >> >> >>>>>> meant to be a library for manipulating structured data. Since
> >> >> >>>>>> this
> >> >> >>> is
> >> >> >>>>>> possible to build over the core API, it's pretty natural to
> >> >> >>> organize it
> >> >> >>>>>> that way, same as Spark Streaming is a library.
> >> >> >>>>>>>
> >> >> >>>>>>> Matei
> >> >> >>>>>>>
> >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
> koert@tresata.com>
> >> >> >>>> wrote:
> >> >> >>>>>>>>
> >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
> >> >> >>>>>>>> format
> >> >> >>> used
> >> >> >>>>> for
> >> >> >>>>>>>> bringing data into Spark from external systems, and used for
> >> >> >>> various
> >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> >> >> >>>>>>>>
> >> >> >>>>>>>> i agree. this to me also implies it belongs in spark core,
> not
> >> >> >>> sql
> >> >> >>>>>>>>
> >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> >> >> >>>>>>>>
> >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the
> >> >> >>>>>>>>> Jan.
> >> >> >>> 13
> >> >> >>>> Bay
> >> >> >>>>>> Area
> >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> >> >> >>> information
> >> >> >>>> on
> >> >> >>>>>> this
> >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> ________________________________
> >> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
> >> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
> >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> One thing potentially not clear from this e-mail, there
> will
> >> >> >>>>>>>>> be
> >> >> >>> a
> >> >> >>>> 1:1
> >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
> DataFrame.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >> >> >>> rxin@databricks.com>
> >> >> >>>>>> wrote:
> >> >> >>>>>>>>>> Hi,
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3,
> >> >> >>>>>>>>>> and
> >> >> >>>>> wanted
> >> >> >>>>>> to
> >> >> >>>>>>>>>> get the community's opinion.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
> >> >> >>>>>>>>>> format
> >> >> >>>> used
> >> >> >>>>>> for
> >> >> >>>>>>>>>> bringing data into Spark from external systems, and used
> for
> >> >> >>>> various
> >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We
> also
> >> >> >>> expect
> >> >> >>>>>> more
> >> >> >>>>>>>>> and
> >> >> >>>>>>>>>> more users to be programming directly against SchemaRDD
> API
> >> >> >>> rather
> >> >> >>>>>> than
> >> >> >>>>>>>>> the
> >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used
> DSL
> >> >> >>>>> originally
> >> >> >>>>>>>>>> designed for writing test cases, always has the data-frame
> >> >> >>>>>>>>>> like
> >> >> >>>> API.
> >> >> >>>>>> In
> >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable for
> >> >> >>>>>>>>>> end
> >> >> >>>>> users.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> There are two motivations for the renaming:
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> >> >> >>> SchemaRDD.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
> >> >> >>> anymore
> >> >> >>>>>> (even
> >> >> >>>>>>>>>> though it would contain some RDD functions like map,
> >> >> >>>>>>>>>> flatMap,
> >> >> >>>> etc),
> >> >> >>>>>> and
> >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
> >> >> >>> confusing.
> >> >> >>>>>>>>> Instead.
> >> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
> >> >> >>> methods.
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>>
> >> >> >>>>>>>>>> My understanding is that very few users program directly
> >> >> >>> against
> >> >> >>>> the
> >> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
> >> >> >>> documented.
> >> >> >>>>>>>>> However,
> >> >> >>>>>>>>>> oo maintain backward compatibility, we can create a type
> >> >> >>>>>>>>>> alias
> >> >> >>>>>> DataFrame
> >> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
> >> >> >>>>> compatibility
> >> >> >>>>>>>>> for
> >> >> >>>>>>>>>> Scala. That said, we will have to update all existing
> >> >> >>> materials to
> >> >> >>>>> use
> >> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>
> >> >> >>>>
> ---------------------------------------------------------------------
> >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>
> >> >> >>>>
> ---------------------------------------------------------------------
> >> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>
> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> For additional commands, e-mail: dev-help@spark.apache.org
> >> >>
> >> >>
> >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Evan Chan <ve...@gmail.com>.
I believe that most DataFrame implementations out there, like Pandas,
support the idea of missing values / NA, and some support the idea of
Not Meaningful as well.

Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.

-Evan
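
(Side note: a plain SQL null only gives you one flavour of "missing". If an
application genuinely needs to tell "missing" apart from "not meaningful",
one purely illustrative, non-Spark way to model that explicitly is:)

    // Purely illustrative -- not part of any Spark or pandas API.
    sealed trait Cell[+A]
    case class Value[A](a: A)     extends Cell[A]
    case object Missing           extends Cell[Nothing] // R/pandas-style NA
    case object NotMeaningful     extends Cell[Nothing] // e.g. "heart rate" for a building

    val heartRate: Cell[Int]      = NotMeaningful
    val temperature: Cell[Double] = Missing
    val humidity: Cell[Double]    = Value(0.42)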

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rx...@databricks.com> wrote:
> It shouldn't change the data source api at all because data sources create
> RDD[Row], and that gets converted into a DataFrame automatically (previously
> to SchemaRDD).
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
> One thing that will break the data source API in 1.3 is the location of
> types. Types were previously defined in sql.catalyst.types, and now moved to
> sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs
> have first class classes/objects defined in sql directly.
>
>
>
> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com> wrote:
>>
>> Hey guys,
>>
>> How does this impact the data sources API?  I was planning on using
>> this for a project.
>>
>> +1 that many things from spark-sql / DataFrame is universally
>> desirable and useful.
>>
>> By the way, one thing that prevents the columnar compression stuff in
>> Spark SQL from being more useful is, at least from previous talks with
>> Reynold and Michael et al., that the format was not designed for
>> persistence.
>>
>> I have a new project that aims to change that.  It is a
>> zero-serialisation, high performance binary vector library, designed
>> from the outset to be a persistent storage friendly.  May be one day
>> it can replace the Spark SQL columnar compression.
>>
>> Michael told me this would be a lot of work, and recreates parts of
>> Parquet, but I think it's worth it.  LMK if you'd like more details.
>>
>> -Evan
>>
>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com> wrote:
>> > Alright I have merged the patch (
>> > https://github.com/apache/spark/pull/4173
>> > ) since I don't see any strong opinions against it (as a matter of fact
>> > most were for it). We can still change it if somebody lays out a strong
>> > argument.
>> >
>> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > <ma...@gmail.com>
>> > wrote:
>> >
>> >> The type alias means your methods can specify either type and they will
>> >> work. It's just another name for the same type. But Scaladocs and such
>> >> will
>> >> show DataFrame as the type.
>> >>
>> >> Matei
>> >>
>> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> dirceu.semighini@gmail.com> wrote:
>> >> >
>> >> > Reynold,
>> >> > But with type alias we will have the same problem, right?
>> >> > If the methods don't receive schemardd anymore, we will have to
>> >> > change
>> >> > our code to migrate from schema to dataframe. Unless we have an
>> >> > implicit
>> >> > conversion between DataFrame and SchemaRDD
>> >> >
>> >> >
>> >> >
>> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
>> >> >
>> >> >> Dirceu,
>> >> >>
>> >> >> That is not possible because one cannot overload return types.
>> >> >>
>> >> >> SQLContext.parquetFile (and many other methods) needs to return some
>> >> type,
>> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> >> >>
>> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
>> >> >> to
>> >> not
>> >> >> break source compatibility for Scala.
>> >> >>
>> >> >>
>> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> >> dirceu.semighini@gmail.com> wrote:
>> >> >>
>> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed
>> >> >>> in
>> >> the
>> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> >> DataFrame?
>> >> >>> With this, we don't impact in existing code for the next few
>> >> >>> releases.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>> >> >>>
>> >> >>>> I want to address the issue that Matei raised about the heavy
>> >> >>>> lifting
>> >> >>>> required for a full SQL support. It is amazing that even after 30
>> >> years
>> >> >>> of
>> >> >>>> research there is not a single good open source columnar database
>> >> >>>> like
>> >> >>>> Vertica. There is a column store option in MySQL, but it is not
>> >> >>>> nearly
>> >> >>> as
>> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need for
>> >> >>>> such
>> >> a
>> >> >>>> system. I wonder why so and it's high time to change that.
>> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com>
>> >> wrote:
>> >> >>>>
>> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
>> >> >>> former
>> >> >>>>> slightly better because it's more descriptive.
>> >> >>>>>
>> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers,
>> >> >>>>> it
>> >> >>> would
>> >> >>>>> be more clear from a user-facing perspective to at least choose a
>> >> >>> package
>> >> >>>>> name for it that omits "sql".
>> >> >>>>>
>> >> >>>>> I would also be in favor of adding a separate Spark Schema module
>> >> >>>>> for
>> >> >>>> Spark
>> >> >>>>> SQL to rely on, but I imagine that might be too large a change at
>> >> this
>> >> >>>>> point?
>> >> >>>>>
>> >> >>>>> -Sandy
>> >> >>>>>
>> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >> >>> matei.zaharia@gmail.com>
>> >> >>>>> wrote:
>> >> >>>>>
>> >> >>>>>> (Actually when we designed Spark SQL we thought of giving it
>> >> >>>>>> another
>> >> >>>>> name,
>> >> >>>>>> like Spark Schema, but we decided to stick with SQL since that
>> >> >>>>>> was
>> >> >>> the
>> >> >>>>> most
>> >> >>>>>> obvious use case to many users.)
>> >> >>>>>>
>> >> >>>>>> Matei
>> >> >>>>>>
>> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >> >>> matei.zaharia@gmail.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> While it might be possible to move this concept to Spark Core
>> >> >>>>> long-term,
>> >> >>>>>> supporting structured data efficiently does require quite a bit
>> >> >>>>>> of
>> >> >>> the
>> >> >>>>>> infrastructure in Spark SQL, such as query planning and columnar
>> >> >>>> storage.
>> >> >>>>>> The intent of Spark SQL though is to be more than a SQL server
>> >> >>>>>> --
>> >> >>> it's
>> >> >>>>>> meant to be a library for manipulating structured data. Since
>> >> >>>>>> this
>> >> >>> is
>> >> >>>>>> possible to build over the core API, it's pretty natural to
>> >> >>> organize it
>> >> >>>>>> that way, same as Spark Streaming is a library.
>> >> >>>>>>>
>> >> >>>>>>> Matei
>> >> >>>>>>>
>> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
>> >> >>>> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
>> >> >>>>>>>> format
>> >> >>> used
>> >> >>>>> for
>> >> >>>>>>>> bringing data into Spark from external systems, and used for
>> >> >>> various
>> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>> >> >>>>>>>>
>> >> >>>>>>>> i agree. this to me also implies it belongs in spark core, not
>> >> >>> sql
>> >> >>>>>>>>
>> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the
>> >> >>>>>>>>> Jan.
>> >> >>> 13
>> >> >>>> Bay
>> >> >>>>>> Area
>> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
>> >> >>> information
>> >> >>>> on
>> >> >>>>>> this
>> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
>> >> >>>>>>>>>
>> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >> >>>>>>>>>
>> >> >>>>>>>>> ________________________________
>> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
>> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
>> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> One thing potentially not clear from this e-mail, there will
>> >> >>>>>>>>> be
>> >> >>> a
>> >> >>>> 1:1
>> >> >>>>>>>>> correspondence where you can get an RDD to/from a DataFrame.
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>> >> >>> rxin@databricks.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>>>> Hi,
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3,
>> >> >>>>>>>>>> and
>> >> >>>>> wanted
>> >> >>>>>> to
>> >> >>>>>>>>>> get the community's opinion.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
>> >> >>>>>>>>>> format
>> >> >>>> used
>> >> >>>>>> for
>> >> >>>>>>>>>> bringing data into Spark from external systems, and used for
>> >> >>>> various
>> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We also
>> >> >>> expect
>> >> >>>>>> more
>> >> >>>>>>>>> and
>> >> >>>>>>>>>> more users to be programming directly against SchemaRDD API
>> >> >>> rather
>> >> >>>>>> than
>> >> >>>>>>>>> the
>> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used DSL
>> >> >>>>> originally
>> >> >>>>>>>>>> designed for writing test cases, always has the data-frame
>> >> >>>>>>>>>> like
>> >> >>>> API.
>> >> >>>>>> In
>> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable for
>> >> >>>>>>>>>> end
>> >> >>>>> users.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> There are two motivations for the renaming:
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>> >> >>> SchemaRDD.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>> >> >>> anymore
>> >> >>>>>> (even
>> >> >>>>>>>>>> though it would contain some RDD functions like map,
>> >> >>>>>>>>>> flatMap,
>> >> >>>> etc),
>> >> >>>>>> and
>> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
>> >> >>> confusing.
>> >> >>>>>>>>> Instead.
>> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
>> >> >>> methods.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> My understanding is that very few users program directly
>> >> >>> against
>> >> >>>> the
>> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
>> >> >>> documented.
>> >> >>>>>>>>> However,
>> >> >>>>>>>>>> To maintain backward compatibility, we can create a type
>> >> >>>>>>>>>> alias
>> >> >>>>>> DataFrame
>> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
>> >> >>>>> compatibility
>> >> >>>>>>>>> for
>> >> >>>>>>>>>> Scala. That said, we will have to update all existing
>> >> >>> materials to
>> >> >>>>> use
>> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>
>> >> >>>> ---------------------------------------------------------------------
>> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>
>> >> >>>> ---------------------------------------------------------------------
>> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>
>> >> >>> ---------------------------------------------------------------------
>> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: dev-help@spark.apache.org
>> >>
>> >>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
It shouldn't change the data source api at all because data sources create
RDD[Row], and that gets converted into a DataFrame automatically
(previously to SchemaRDD).

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location of
types. Types were previously defined in sql.catalyst.types, and now moved
to sql.types. After 1.3, sql.catalyst is hidden from users, and all public
APIs have first class classes/objects defined in sql directly.
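
For illustration, a rough sketch of a minimal relation under that layout (class and column names are made up; per the above, the only change for an existing source should be importing the types from sql.types instead of sql.catalyst.types):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
// These previously lived in org.apache.spark.sql.catalyst.types; in 1.3 they are public under sql.types.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A toy relation: it only describes a schema and produces RDD[Row];
// Spark wraps the scan result in a DataFrame (previously SchemaRDD) for the caller.
class WordCountRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(
    StructField("word", StringType, nullable = false),
    StructField("count", IntegerType, nullable = false)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("spark", 3), Row("sql", 5)))
}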



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <ve...@gmail.com> wrote:

> Hey guys,
>
> How does this impact the data sources API?  I was planning on using
> this for a project.
>
> +1 that many things from spark-sql / DataFrame are universally
> desirable and useful.
>
> By the way, one thing that prevents the columnar compression stuff in
> Spark SQL from being more useful is, at least from previous talks with
> Reynold and Michael et al., that the format was not designed for
> persistence.
>
> I have a new project that aims to change that.  It is a
> zero-serialisation, high performance binary vector library, designed
> from the outset to be persistent-storage friendly.  Maybe one day
> it can replace the Spark SQL columnar compression.
>
> Michael told me this would be a lot of work, and recreates parts of
> Parquet, but I think it's worth it.  LMK if you'd like more details.
>
> -Evan
>
> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com> wrote:
> > Alright I have merged the patch (
> https://github.com/apache/spark/pull/4173
> > ) since I don't see any strong opinions against it (as a matter of fact
> > most were for it). We can still change it if somebody lays out a strong
> > argument.
> >
> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaharia@gmail.com
> >
> > wrote:
> >
> >> The type alias means your methods can specify either type and they will
> >> work. It's just another name for the same type. But Scaladocs and such
> will
> >> show DataFrame as the type.
> >>
> >> Matei
> >>
> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> dirceu.semighini@gmail.com> wrote:
> >> >
> >> > Reynold,
> >> > But with type alias we will have the same problem, right?
> >> > If the methods don't receive schemardd anymore, we will have to
> change
> >> > our code to migrate from schema to dataframe. Unless we have an
> implicit
> >> > conversion between DataFrame and SchemaRDD
> >> >
> >> >
> >> >
> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> >> >
> >> >> Dirceu,
> >> >>
> >> >> That is not possible because one cannot overload return types.
> >> >>
> >> >> SQLContext.parquetFile (and many other methods) needs to return some
> >> type,
> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >>
> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> >> not
> >> >> break source compatibility for Scala.
> >> >>
> >> >>
> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> dirceu.semighini@gmail.com> wrote:
> >> >>
> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed
> in
> >> the
> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> >> DataFrame?
> >> >>> With this, we don't impact in existing code for the next few
> releases.
> >> >>>
> >> >>>
> >> >>>
> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
> >> >>>
> >> >>>> I want to address the issue that Matei raised about the heavy
> lifting
> >> >>>> required for a full SQL support. It is amazing that even after 30
> >> years
> >> >>> of
> >> >>>> research there is not a single good open source columnar database
> like
> >> >>>> Vertica. There is a column store option in MySQL, but it is not
> nearly
> >> >>> as
> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need for
> such
> >> a
> >> >>>> system. I wonder why so and it's high time to change that.
> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com>
> >> wrote:
> >> >>>>
> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
> >> >>> former
> >> >>>>> slightly better because it's more descriptive.
> >> >>>>>
> >> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers,
> it
> >> >>> would
> >> >>>>> be more clear from a user-facing perspective to at least choose a
> >> >>> package
> >> >>>>> name for it that omits "sql".
> >> >>>>>
> >> >>>>> I would also be in favor of adding a separate Spark Schema module
> for
> >> >>>> Spark
> >> >>>>> SQL to rely on, but I imagine that might be too large a change at
> >> this
> >> >>>>> point?
> >> >>>>>
> >> >>>>> -Sandy
> >> >>>>>
> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >> >>> matei.zaharia@gmail.com>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> (Actually when we designed Spark SQL we thought of giving it
> another
> >> >>>>> name,
> >> >>>>>> like Spark Schema, but we decided to stick with SQL since that
> was
> >> >>> the
> >> >>>>> most
> >> >>>>>> obvious use case to many users.)
> >> >>>>>>
> >> >>>>>> Matei
> >> >>>>>>
> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >> >>> matei.zaharia@gmail.com>
> >> >>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>> While it might be possible to move this concept to Spark Core
> >> >>>>> long-term,
> >> >>>>>> supporting structured data efficiently does require quite a bit
> of
> >> >>> the
> >> >>>>>> infrastructure in Spark SQL, such as query planning and columnar
> >> >>>> storage.
> >> >>>>>> The intent of Spark SQL though is to be more than a SQL server --
> >> >>> it's
> >> >>>>>> meant to be a library for manipulating structured data. Since
> this
> >> >>> is
> >> >>>>>> possible to build over the core API, it's pretty natural to
> >> >>> organize it
> >> >>>>>> that way, same as Spark Streaming is a library.
> >> >>>>>>>
> >> >>>>>>> Matei
> >> >>>>>>>
> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
> >> >>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data format
> >> >>> used
> >> >>>>> for
> >> >>>>>>>> bringing data into Spark from external systems, and used for
> >> >>> various
> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> >> >>>>>>>>
> >> >>>>>>>> i agree. this to me also implies it belongs in spark core, not
> >> >>> sql
> >> >>>>>>>>
> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the Jan.
> >> >>> 13
> >> >>>> Bay
> >> >>>>>> Area
> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> >> >>> information
> >> >>>> on
> >> >>>>>> this
> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> >> >>>>>>>>>
> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >> >>>>>>>>>
> >> >>>>>>>>> ________________________________
> >> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
> >> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
> >> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> One thing potentially not clear from this e-mail, there will
> be
> >> >>> a
> >> >>>> 1:1
> >> >>>>>>>>> correspondence where you can get an RDD to/from a DataFrame.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >> >>> rxin@databricks.com>
> >> >>>>>> wrote:
> >> >>>>>>>>>> Hi,
> >> >>>>>>>>>>
> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3,
> and
> >> >>>>> wanted
> >> >>>>>> to
> >> >>>>>>>>>> get the community's opinion.
> >> >>>>>>>>>>
> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
> format
> >> >>>> used
> >> >>>>>> for
> >> >>>>>>>>>> bringing data into Spark from external systems, and used for
> >> >>>> various
> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We also
> >> >>> expect
> >> >>>>>> more
> >> >>>>>>>>> and
> >> >>>>>>>>>> more users to be programming directly against SchemaRDD API
> >> >>> rather
> >> >>>>>> than
> >> >>>>>>>>> the
> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used DSL
> >> >>>>> originally
> >> >>>>>>>>>> designed for writing test cases, always has the data-frame
> like
> >> >>>> API.
> >> >>>>>> In
> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable for
> end
> >> >>>>> users.
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> There are two motivations for the renaming:
> >> >>>>>>>>>>
> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> >> >>> SchemaRDD.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
> >> >>> anymore
> >> >>>>>> (even
> >> >>>>>>>>>> though it would contain some RDD functions like map, flatMap,
> >> >>>> etc),
> >> >>>>>> and
> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
> >> >>> confusing.
> >> >>>>>>>>> Instead.
> >> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
> >> >>> methods.
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> My understanding is that very few users program directly
> >> >>> against
> >> >>>> the
> >> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
> >> >>> documented.
> >> >>>>>>>>> However,
> >> >>>>>>>>>> To maintain backward compatibility, we can create a type
> alias
> >> >>>>>> DataFrame
> >> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
> >> >>>>> compatibility
> >> >>>>>>>>> for
> >> >>>>>>>>>> Scala. That said, we will have to update all existing
> >> >>> materials to
> >> >>>>> use
> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>
> ---------------------------------------------------------------------
> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>
> ---------------------------------------------------------------------
> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>
> ---------------------------------------------------------------------
> >> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Evan Chan <ve...@gmail.com>.
Hey guys,

How does this impact the data sources API?  I was planning on using
this for a project.

+1 that many things from spark-sql / DataFrame are universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff in
Spark SQL from being more useful is, at least from previous talks with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that.  It is a
zero-serialisation, high performance binary vector library, designed
from the outset to be persistent-storage friendly.  Maybe one day
it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of
Parquet, but I think it's worth it.  LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rx...@databricks.com> wrote:
> Alright I have merged the patch ( https://github.com/apache/spark/pull/4173
> ) since I don't see any strong opinions against it (as a matter of fact
> most were for it). We can still change it if somebody lays out a strong
> argument.
>
> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> The type alias means your methods can specify either type and they will
>> work. It's just another name for the same type. But Scaladocs and such will
>> show DataFrame as the type.
>>
>> Matei
>>
>> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>> >
>> > Reynold,
>> > But with type alias we will have the same problem, right?
>> > If the methods don't receive schemardd anymore, we will have to change
>> > our code to migrate from schema to dataframe. Unless we have an implicit
>> > conversion between DataFrame and SchemaRDD
>> >
>> >
>> >
>> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
>> >
>> >> Dirceu,
>> >>
>> >> That is not possible because one cannot overload return types.
>> >>
>> >> SQLContext.parquetFile (and many other methods) needs to return some
>> type,
>> >> and that type cannot be both SchemaRDD and DataFrame.
>> >>
>> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
>> not
>> >> break source compatibility for Scala.
>> >>
>> >>
>> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> dirceu.semighini@gmail.com> wrote:
>> >>
>> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>> the
>> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> DataFrame?
>> >>> With this, we don't impact in existing code for the next few releases.
>> >>>
>> >>>
>> >>>
>> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>> >>>
>> >>>> I want to address the issue that Matei raised about the heavy lifting
>> >>>> required for a full SQL support. It is amazing that even after 30
>> years
>> >>> of
>> >>>> research there is not a single good open source columnar database like
>> >>>> Vertica. There is a column store option in MySQL, but it is not nearly
>> >>> as
>> >>>> sophisticated as Vertica or MonetDB. But there's a true need for such
>> a
>> >>>> system. I wonder why so and it's high time to change that.
>> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com>
>> wrote:
>> >>>>
>> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
>> >>> former
>> >>>>> slightly better because it's more descriptive.
>> >>>>>
>> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
>> >>> would
>> >>>>> be more clear from a user-facing perspective to at least choose a
>> >>> package
>> >>>>> name for it that omits "sql".
>> >>>>>
>> >>>>> I would also be in favor of adding a separate Spark Schema module for
>> >>>> Spark
>> >>>>> SQL to rely on, but I imagine that might be too large a change at
>> this
>> >>>>> point?
>> >>>>>
>> >>>>> -Sandy
>> >>>>>
>> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >>> matei.zaharia@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> (Actually when we designed Spark SQL we thought of giving it another
>> >>>>> name,
>> >>>>>> like Spark Schema, but we decided to stick with SQL since that was
>> >>> the
>> >>>>> most
>> >>>>>> obvious use case to many users.)
>> >>>>>>
>> >>>>>> Matei
>> >>>>>>
>> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >>> matei.zaharia@gmail.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> While it might be possible to move this concept to Spark Core
>> >>>>> long-term,
>> >>>>>> supporting structured data efficiently does require quite a bit of
>> >>> the
>> >>>>>> infrastructure in Spark SQL, such as query planning and columnar
>> >>>> storage.
>> >>>>>> The intent of Spark SQL though is to be more than a SQL server --
>> >>> it's
>> >>>>>> meant to be a library for manipulating structured data. Since this
>> >>> is
>> >>>>>> possible to build over the core API, it's pretty natural to
>> >>> organize it
>> >>>>>> that way, same as Spark Streaming is a library.
>> >>>>>>>
>> >>>>>>> Matei
>> >>>>>>>
>> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
>> >>>> wrote:
>> >>>>>>>>
>> >>>>>>>> "The context is that SchemaRDD is becoming a common data format
>> >>> used
>> >>>>> for
>> >>>>>>>> bringing data into Spark from external systems, and used for
>> >>> various
>> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>> >>>>>>>>
>> >>>>>>>> i agree. this to me also implies it belongs in spark core, not
>> >>> sql
>> >>>>>>>>
>> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>> >>>>>>>>
>> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the Jan.
>> >>> 13
>> >>>> Bay
>> >>>>>> Area
>> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
>> >>> information
>> >>>> on
>> >>>>>> this
>> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
>> >>>>>>>>>
>> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >>>>>>>>>
>> >>>>>>>>> ________________________________
>> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
>> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
>> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> One thing potentially not clear from this e-mail, there will be
>> >>> a
>> >>>> 1:1
>> >>>>>>>>> correspondence where you can get an RDD to/from a DataFrame.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>> >>> rxin@databricks.com>
>> >>>>>> wrote:
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
>> >>>>> wanted
>> >>>>>> to
>> >>>>>>>>>> get the community's opinion.
>> >>>>>>>>>>
>> >>>>>>>>>> The context is that SchemaRDD is becoming a common data format
>> >>>> used
>> >>>>>> for
>> >>>>>>>>>> bringing data into Spark from external systems, and used for
>> >>>> various
>> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We also
>> >>> expect
>> >>>>>> more
>> >>>>>>>>> and
>> >>>>>>>>>> more users to be programming directly against SchemaRDD API
>> >>> rather
>> >>>>>> than
>> >>>>>>>>> the
>> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used DSL
>> >>>>> originally
>> >>>>>>>>>> designed for writing test cases, always has the data-frame like
>> >>>> API.
>> >>>>>> In
>> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable for end
>> >>>>> users.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> There are two motivations for the renaming:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>> >>> SchemaRDD.
>> >>>>>>>>>>
>> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>> >>> anymore
>> >>>>>> (even
>> >>>>>>>>>> though it would contain some RDD functions like map, flatMap,
>> >>>> etc),
>> >>>>>> and
>> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
>> >>> confusing.
>> >>>>>>>>> Instead.
>> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
>> >>> methods.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> My understanding is that very few users program directly
>> >>> against
>> >>>> the
>> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
>> >>> documented.
>> >>>>>>>>> However,
>> >>>>>>>>>> To maintain backward compatibility, we can create a type alias
>> >>>>>> DataFrame
>> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
>> >>>>> compatibility
>> >>>>>>>>> for
>> >>>>>>>>>> Scala. That said, we will have to update all existing
>> >>> materials to
>> >>>>> use
>> >>>>>>>>>> DataFrame rather than SchemaRDD.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
Alright I have merged the patch ( https://github.com/apache/spark/pull/4173
) since I don't see any strong opinions against it (as a matter of fact
most were for it). We can still change it if somebody lays out a strong
argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> The type alias means your methods can specify either type and they will
> work. It's just another name for the same type. But Scaladocs and such will
> show DataFrame as the type.
>
> Matei
>
> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
> >
> > Reynold,
> > But with type alias we will have the same problem, right?
> > If the methods don't receive schemardd anymore, we will have to change
> > our code to migrate from schema to dataframe. Unless we have an implicit
> > conversion between DataFrame and SchemaRDD
> >
> >
> >
> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> >
> >> Dirceu,
> >>
> >> That is not possible because one cannot overload return types.
> >>
> >> SQLContext.parquetFile (and many other methods) needs to return some
> type,
> >> and that type cannot be both SchemaRDD and DataFrame.
> >>
> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> not
> >> break source compatibility for Scala.
> >>
> >>
> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> dirceu.semighini@gmail.com> wrote:
> >>
> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
> the
> >>> release 1.5(+/- 1)  for example, and the new code been added to
> DataFrame?
> >>> With this, we don't impact in existing code for the next few releases.
> >>>
> >>>
> >>>
> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
> >>>
> >>>> I want to address the issue that Matei raised about the heavy lifting
> >>>> required for a full SQL support. It is amazing that even after 30
> years
> >>> of
> >>>> research there is not a single good open source columnar database like
> >>>> Vertica. There is a column store option in MySQL, but it is not nearly
> >>> as
> >>>> sophisticated as Vertica or MonetDB. But there's a true need for such
> a
> >>>> system. I wonder why so and it's high time to change that.
> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com>
> wrote:
> >>>>
> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
> >>> former
> >>>>> slightly better because it's more descriptive.
> >>>>>
> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
> >>> would
> >>>>> be more clear from a user-facing perspective to at least choose a
> >>> package
> >>>>> name for it that omits "sql".
> >>>>>
> >>>>> I would also be in favor of adding a separate Spark Schema module for
> >>>> Spark
> >>>>> SQL to rely on, but I imagine that might be too large a change at
> this
> >>>>> point?
> >>>>>
> >>>>> -Sandy
> >>>>>
> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >>> matei.zaharia@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> (Actually when we designed Spark SQL we thought of giving it another
> >>>>> name,
> >>>>>> like Spark Schema, but we decided to stick with SQL since that was
> >>> the
> >>>>> most
> >>>>>> obvious use case to many users.)
> >>>>>>
> >>>>>> Matei
> >>>>>>
> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >>> matei.zaharia@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> While it might be possible to move this concept to Spark Core
> >>>>> long-term,
> >>>>>> supporting structured data efficiently does require quite a bit of
> >>> the
> >>>>>> infrastructure in Spark SQL, such as query planning and columnar
> >>>> storage.
> >>>>>> The intent of Spark SQL though is to be more than a SQL server --
> >>> it's
> >>>>>> meant to be a library for manipulating structured data. Since this
> >>> is
> >>>>>> possible to build over the core API, it's pretty natural to
> >>> organize it
> >>>>>> that way, same as Spark Streaming is a library.
> >>>>>>>
> >>>>>>> Matei
> >>>>>>>
> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>> "The context is that SchemaRDD is becoming a common data format
> >>> used
> >>>>> for
> >>>>>>>> bringing data into Spark from external systems, and used for
> >>> various
> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> >>>>>>>>
> >>>>>>>> i agree. this to me also implies it belongs in spark core, not
> >>> sql
> >>>>>>>>
> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> And in the off chance that anyone hasn't seen it yet, the Jan.
> >>> 13
> >>>> Bay
> >>>>>> Area
> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> >>> information
> >>>> on
> >>>>>> this
> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> >>>>>>>>>
> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>>>>>>>
> >>>>>>>>> ________________________________
> >>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
> >>>>>>>>> To: Reynold Xin <rx...@databricks.com>
> >>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> One thing potentially not clear from this e-mail, there will be
> >>> a
> >>>> 1:1
> >>>>>>>>> correspondence where you can get an RDD to/from a DataFrame.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >>> rxin@databricks.com>
> >>>>>> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> >>>>> wanted
> >>>>>> to
> >>>>>>>>>> get the community's opinion.
> >>>>>>>>>>
> >>>>>>>>>> The context is that SchemaRDD is becoming a common data format
> >>>> used
> >>>>>> for
> >>>>>>>>>> bringing data into Spark from external systems, and used for
> >>>> various
> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We also
> >>> expect
> >>>>>> more
> >>>>>>>>> and
> >>>>>>>>>> more users to be programming directly against SchemaRDD API
> >>> rather
> >>>>>> than
> >>>>>>>>> the
> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used DSL
> >>>>> originally
> >>>>>>>>>> designed for writing test cases, always has the data-frame like
> >>>> API.
> >>>>>> In
> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable for end
> >>>>> users.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> There are two motivations for the renaming:
> >>>>>>>>>>
> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> >>> SchemaRDD.
> >>>>>>>>>>
> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
> >>> anymore
> >>>>>> (even
> >>>>>>>>>> though it would contain some RDD functions like map, flatMap,
> >>>> etc),
> >>>>>> and
> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
> >>> confusing.
> >>>>>>>>> Instead.
> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
> >>> methods.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> My understanding is that very few users program directly
> >>> against
> >>>> the
> >>>>>>>>>> SchemaRDD API at the moment, because they are not well
> >>> documented.
> >>>>>>>>> However,
> >>>>>>>>>> To maintain backward compatibility, we can create a type alias
> >>>>>> DataFrame
> >>>>>>>>>> that is still named SchemaRDD. This will maintain source
> >>>>> compatibility
> >>>>>>>>> for
> >>>>>>>>>> Scala. That said, we will have to update all existing
> >>> materials to
> >>>>> use
> >>>>>>>>>> DataFrame rather than SchemaRDD.
> >>>>>>>>>
> >>>>>>>>>
> >>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>>>>
> >>>>>>>>>
> >>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Matei Zaharia <ma...@gmail.com>.
The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.

Matei
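
A small sketch of what that looks like from user code, assuming the alias ends up as a plain "type SchemaRDD = DataFrame" (the method name and path below are made up):

import org.apache.spark.sql.{DataFrame, SQLContext, SchemaRDD}

object AliasSketch {
  // A method written against the old name keeps compiling, because SchemaRDD
  // is just another name for DataFrame after the rename...
  def countRows(data: SchemaRDD): Long = data.count()

  def demo(sqlContext: SQLContext): Long = {
    // ...and whatever now returns a DataFrame can be passed straight in.
    val df: DataFrame = sqlContext.parquetFile("/tmp/events.parquet")   // made-up path
    countRows(df)
  }
}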

> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <di...@gmail.com> wrote:
> 
> Reynold,
> But with type alias we will have the same problem, right?
> If the methods don't receive schemardd anymore, we will have to change
> our code to migrate from schema to dataframe. Unless we have an implicit
> conversion between DataFrame and SchemaRDD
> 
> 
> 
> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:
> 
>> Dirceu,
>> 
>> That is not possible because one cannot overload return types.
>> 
>> SQLContext.parquetFile (and many other methods) needs to return some type,
>> and that type cannot be both SchemaRDD and DataFrame.
>> 
>> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
>> break source compatibility for Scala.
>> 
>> 
>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>> 
>>> Can't the SchemaRDD remain the same, but deprecated, and be removed in the
>>> release 1.5(+/- 1)  for example, and the new code been added to DataFrame?
>>> With this, we don't impact in existing code for the next few releases.
>>> 
>>> 
>>> 
>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>>> 
>>>> I want to address the issue that Matei raised about the heavy lifting
>>>> required for a full SQL support. It is amazing that even after 30 years
>>> of
>>>> research there is not a single good open source columnar database like
>>>> Vertica. There is a column store option in MySQL, but it is not nearly
>>> as
>>>> sophisticated as Vertica or MonetDB. But there's a true need for such a
>>>> system. I wonder why so and it's high time to change that.
>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com> wrote:
>>>> 
>>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
>>> former
>>>>> slightly better because it's more descriptive.
>>>>> 
>>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
>>> would
>>>>> be more clear from a user-facing perspective to at least choose a
>>> package
>>>>> name for it that omits "sql".
>>>>> 
>>>>> I would also be in favor of adding a separate Spark Schema module for
>>>> Spark
>>>>> SQL to rely on, but I imagine that might be too large a change at this
>>>>> point?
>>>>> 
>>>>> -Sandy
>>>>> 
>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>>> matei.zaharia@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> (Actually when we designed Spark SQL we thought of giving it another
>>>>> name,
>>>>>> like Spark Schema, but we decided to stick with SQL since that was
>>> the
>>>>> most
>>>>>> obvious use case to many users.)
>>>>>> 
>>>>>> Matei
>>>>>> 
>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>>> matei.zaharia@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> While it might be possible to move this concept to Spark Core
>>>>> long-term,
>>>>>> supporting structured data efficiently does require quite a bit of
>>> the
>>>>>> infrastructure in Spark SQL, such as query planning and columnar
>>>> storage.
>>>>>> The intent of Spark SQL though is to be more than a SQL server --
>>> it's
>>>>>> meant to be a library for manipulating structured data. Since this
>>> is
>>>>>> possible to build over the core API, it's pretty natural to
>>> organize it
>>>>>> that way, same as Spark Streaming is a library.
>>>>>>> 
>>>>>>> Matei
>>>>>>> 
>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>>>>> 
>>>>>>>> "The context is that SchemaRDD is becoming a common data format
>>> used
>>>>> for
>>>>>>>> bringing data into Spark from external systems, and used for
>>> various
>>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>>>>>>>> 
>>>>>>>> i agree. this to me also implies it belongs in spark core, not
>>> sql
>>>>>>>> 
>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>>>>>>>> 
>>>>>>>>> And in the off chance that anyone hasn't seen it yet, the Jan.
>>> 13
>>>> Bay
>>>>>> Area
>>>>>>>>> Spark Meetup YouTube contained a wealth of background
>>> information
>>>> on
>>>>>> this
>>>>>>>>> idea (mostly from Patrick and Reynold :-).
>>>>>>>>> 
>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>> 
>>>>>>>>> ________________________________
>>>>>>>>> From: Patrick Wendell <pw...@gmail.com>
>>>>>>>>> To: Reynold Xin <rx...@databricks.com>
>>>>>>>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> One thing potentially not clear from this e-mail, there will be
>>> a
>>>> 1:1
>>>>>>>>> correspondence where you can get an RDD to/from a DataFrame.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>>> rxin@databricks.com>
>>>>>> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
>>>>> wanted
>>>>>> to
>>>>>>>>>> get the community's opinion.
>>>>>>>>>> 
>>>>>>>>>> The context is that SchemaRDD is becoming a common data format
>>>> used
>>>>>> for
>>>>>>>>>> bringing data into Spark from external systems, and used for
>>>> various
>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We also
>>> expect
>>>>>> more
>>>>>>>>> and
>>>>>>>>>> more users to be programming directly against SchemaRDD API
>>> rather
>>>>>> than
>>>>>>>>> the
>>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used DSL
>>>>> originally
>>>>>>>>>> designed for writing test cases, always has the data-frame like
>>>> API.
>>>>>> In
>>>>>>>>>> 1.3, we are redesigning the API to make the API usable for end
>>>>> users.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>> 
>>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>>> SchemaRDD.
>>>>>>>>>> 
>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>>> anymore
>>>>>> (even
>>>>>>>>>> though it would contain some RDD functions like map, flatMap,
>>>> etc),
>>>>>> and
>>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly
>>> confusing.
>>>>>>>>> Instead.
>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
>>> methods.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> My understanding is that very few users program directly
>>> against
>>>> the
>>>>>>>>>> SchemaRDD API at the moment, because they are not well
>>> documented.
>>>>>>>>> However,
>>>>>>>>>> To maintain backward compatibility, we can create a type alias
>>>>>> DataFrame
>>>>>>>>>> that is still named SchemaRDD. This will maintain source
>>>>> compatibility
>>>>>>>>> for
>>>>>>>>>> Scala. That said, we will have to update all existing
>>> materials to
>>>>> use
>>>>>>>>>> DataFrame rather than SchemaRDD.
>>>>>>>>> 
>>>>>>>>> 
>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Dirceu Semighini Filho <di...@gmail.com>.
Reynold,
But with type alias we will have the same problem, right?
If the methods don't receive schemardd anymore, we will have to change
our code to migrate from schema to dataframe. Unless we have an implicit
conversion between DataFrame and SchemaRDD
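
A hypothetical sketch of the kind of conversion being suggested here, with stand-in types rather than the real Spark classes (the type alias discussed in the reply below would make it unnecessary, since both names would refer to the same type):

import scala.language.implicitConversions

// Stand-in types, purely to illustrate the idea; these are not Spark classes.
class OldSchemaRDD
class NewDataFrame

object CompatConversions {
  // With a conversion like this in implicit scope, code expecting the old type
  // could keep accepting the new one without being rewritten.
  implicit def dataFrameToSchemaRDD(df: NewDataFrame): OldSchemaRDD = new OldSchemaRDD
}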



2015-01-27 17:18 GMT-02:00 Reynold Xin <rx...@databricks.com>:

> Dirceu,
>
> That is not possible because one cannot overload return types.
>
> SQLContext.parquetFile (and many other methods) needs to return some type,
> and that type cannot be both SchemaRDD and DataFrame.
>
> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
> break source compatibility for Scala.
>
>
> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> dirceu.semighini@gmail.com> wrote:
>
>> Can't the SchemaRDD remain the same, but deprecated, and be removed in the
>> release 1.5(+/- 1)  for example, and the new code been added to DataFrame?
>> With this, we don't impact in existing code for the next few releases.
>>
>>
>>
>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>>
>> > I want to address the issue that Matei raised about the heavy lifting
>> > required for a full SQL support. It is amazing that even after 30 years
>> of
>> > research there is not a single good open source columnar database like
>> > Vertica. There is a column store option in MySQL, but it is not nearly
>> as
>> > sophisticated as Vertica or MonetDB. But there's a true need for such a
>> > system. I wonder why so and it's high time to change that.
>> > On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com> wrote:
>> >
>> > > Both SchemaRDD and DataFrame sound fine to me, though I like the
>> former
>> > > slightly better because it's more descriptive.
>> > >
>> > > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
>> would
>> > > be more clear from a user-facing perspective to at least choose a
>> package
>> > > name for it that omits "sql".
>> > >
>> > > I would also be in favor of adding a separate Spark Schema module for
>> > Spark
>> > > SQL to rely on, but I imagine that might be too large a change at this
>> > > point?
>> > >
>> > > -Sandy
>> > >
>> > > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> matei.zaharia@gmail.com>
>> > > wrote:
>> > >
>> > > > (Actually when we designed Spark SQL we thought of giving it another
>> > > name,
>> > > > like Spark Schema, but we decided to stick with SQL since that was
>> the
>> > > most
>> > > > obvious use case to many users.)
>> > > >
>> > > > Matei
>> > > >
>> > > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> matei.zaharia@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > > While it might be possible to move this concept to Spark Core
>> > > long-term,
>> > > > supporting structured data efficiently does require quite a bit of
>> the
>> > > > infrastructure in Spark SQL, such as query planning and columnar
>> > storage.
>> > > > The intent of Spark SQL though is to be more than a SQL server --
>> it's
>> > > > meant to be a library for manipulating structured data. Since this
>> is
>> > > > possible to build over the core API, it's pretty natural to
>> organize it
>> > > > that way, same as Spark Streaming is a library.
>> > > > >
>> > > > > Matei
>> > > > >
>> > > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
>> > wrote:
>> > > > >>
>> > > > >> "The context is that SchemaRDD is becoming a common data format
>> used
>> > > for
>> > > > >> bringing data into Spark from external systems, and used for
>> various
>> > > > >> components of Spark, e.g. MLlib's new pipeline API."
>> > > > >>
>> > > > >> i agree. this to me also implies it belongs in spark core, not
>> sql
>> > > > >>
>> > > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> > > > >> michaelmalak@yahoo.com.invalid> wrote:
>> > > > >>
>> > > > >>> And in the off chance that anyone hasn't seen it yet, the Jan.
>> 13
>> > Bay
>> > > > Area
>> > > > >>> Spark Meetup YouTube contained a wealth of background
>> information
>> > on
>> > > > this
>> > > > >>> idea (mostly from Patrick and Reynold :-).
>> > > > >>>
>> > > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> > > > >>>
>> > > > >>> ________________________________
>> > > > >>> From: Patrick Wendell <pw...@gmail.com>
>> > > > >>> To: Reynold Xin <rx...@databricks.com>
>> > > > >>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> > > > >>> Sent: Monday, January 26, 2015 4:01 PM
>> > > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
>> > > > >>>
>> > > > >>>
>> > > > >>> One thing potentially not clear from this e-mail, there will be
>> a
>> > 1:1
>> > > > >>> correspondence where you can get an RDD to/from a DataFrame.
>> > > > >>>
>> > > > >>>
>> > > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>> rxin@databricks.com>
>> > > > wrote:
>> > > > >>>> Hi,
>> > > > >>>>
>> > > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
>> > > wanted
>> > > > to
>> > > > >>>> get the community's opinion.
>> > > > >>>>
>> > > > >>>> The context is that SchemaRDD is becoming a common data format
>> > used
>> > > > for
>> > > > >>>> bringing data into Spark from external systems, and used for
>> > various
>> > > > >>>> components of Spark, e.g. MLlib's new pipeline API. We also
>> expect
>> > > > more
>> > > > >>> and
>> > > > >>>> more users to be programming directly against SchemaRDD API
>> rather
>> > > > than
>> > > > >>> the
>> > > > >>>> core RDD API. SchemaRDD, through its less commonly used DSL
>> > > originally
>> > > > >>>> designed for writing test cases, always has the data-frame like
>> > API.
>> > > > In
>> > > > >>>> 1.3, we are redesigning the API to make the API usable for end
>> > > users.
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> There are two motivations for the renaming:
>> > > > >>>>
>> > > > >>>> 1. DataFrame seems to be a more self-evident name than
>> SchemaRDD.
>> > > > >>>>
>> > > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>> anymore
>> > > > (even
>> > > > >>>> though it would contain some RDD functions like map, flatMap,
>> > etc),
>> > > > and
>> > > > >>>> calling it Schema*RDD* while it is not an RDD is highly
>> confusing.
>> > > > >>> Instead.
>> > > > >>>> DataFrame.rdd will return the underlying RDD for all RDD
>> methods.
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> My understanding is that very few users program directly
>> against
>> > the
>> > > > >>>> SchemaRDD API at the moment, because they are not well
>> documented.
>> > > > >>> However,
>> > > > >>>> To maintain backward compatibility, we can create a type alias
>> > > > DataFrame
>> > > > >>>> that is still named SchemaRDD. This will maintain source
>> > > compatibility
>> > > > >>> for
>> > > > >>>> Scala. That said, we will have to update all existing
>> materials to
>> > > use
>> > > > >>>> DataFrame rather than SchemaRDD.
>> > > > >>>
>> > > > >>>
>> > ---------------------------------------------------------------------
>> > > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > > > >>> For additional commands, e-mail: dev-help@spark.apache.org
>> > > > >>>
>> > > > >>>
>> > ---------------------------------------------------------------------
>> > > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > > > >>> For additional commands, e-mail: dev-help@spark.apache.org
>> > > > >>>
>> > > > >>>
>> > > > >
>> > > >
>> > > >
>> > > >
>> ---------------------------------------------------------------------
>> > > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> > > > For additional commands, e-mail: dev-help@spark.apache.org
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
Dirceu,

That is not possible because one cannot overload return types.

SQLContext.parquetFile (and many other methods) needs to return some type,
and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
break source compatibility for Scala.
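
A tiny sketch of the restriction, and of how an alias sidesteps it, using made-up stand-in types (this is a general Scala rule, not anything specific to Spark):

class OldResult                          // stand-ins for SchemaRDD / DataFrame
class NewResult

class ContextA {
  def parquetFile(path: String): OldResult = new OldResult
  // def parquetFile(path: String): NewResult = new NewResult
  //   ^ would not compile: overloads cannot differ only in their return type
}

class ContextB {
  // One method, one type, two names: old call sites still see the name they expect.
  type OldName = NewResult
  def parquetFile(path: String): NewResult = new NewResult
  val asOld: OldName = parquetFile("/tmp/made-up.parquet")
}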


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
dirceu.semighini@gmail.com> wrote:

> Can't the SchemaRDD remain the same, but deprecated, and be removed in the
> release 1.5(+/- 1)  for example, and the new code been added to DataFrame?
> With this, we don't impact in existing code for the next few releases.
>
>
>
> 2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:
>
> > I want to address the issue that Matei raised about the heavy lifting
> > required for a full SQL support. It is amazing that even after 30 years
> of
> > research there is not a single good open source columnar database like
> > Vertica. There is a column store option in MySQL, but it is not nearly as
> > sophisticated as Vertica or MonetDB. But there's a true need for such a
> > system. I wonder why so and it's high time to change that.
> > On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com> wrote:
> >
> > > Both SchemaRDD and DataFrame sound fine to me, though I like the former
> > > slightly better because it's more descriptive.
> > >
> > > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
> would
> > > be more clear from a user-facing perspective to at least choose a
> package
> > > name for it that omits "sql".
> > >
> > > I would also be in favor of adding a separate Spark Schema module for
> > Spark
> > > SQL to rely on, but I imagine that might be too large a change at this
> > > point?
> > >
> > > -Sandy
> > >
> > > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> matei.zaharia@gmail.com>
> > > wrote:
> > >
> > > > (Actually when we designed Spark SQL we thought of giving it another
> > > name,
> > > > like Spark Schema, but we decided to stick with SQL since that was
> the
> > > most
> > > > obvious use case to many users.)
> > > >
> > > > Matei
> > > >
> > > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> matei.zaharia@gmail.com>
> > > > wrote:
> > > > >
> > > > > While it might be possible to move this concept to Spark Core
> > > long-term,
> > > > supporting structured data efficiently does require quite a bit of
> the
> > > > infrastructure in Spark SQL, such as query planning and columnar
> > storage.
> > > > The intent of Spark SQL though is to be more than a SQL server --
> it's
> > > > meant to be a library for manipulating structured data. Since this is
> > > > possible to build over the core API, it's pretty natural to organize
> it
> > > > that way, same as Spark Streaming is a library.
> > > > >
> > > > > Matei
> > > > >
> > > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
> > wrote:
> > > > >>
> > > > >> "The context is that SchemaRDD is becoming a common data format
> used
> > > for
> > > > >> bringing data into Spark from external systems, and used for
> various
> > > > >> components of Spark, e.g. MLlib's new pipeline API."
> > > > >>
> > > > >> i agree. this to me also implies it belongs in spark core, not sql
> > > > >>
> > > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > > >> michaelmalak@yahoo.com.invalid> wrote:
> > > > >>
> > > > >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13
> > Bay
> > > > Area
> > > > >>> Spark Meetup YouTube contained a wealth of background information
> > on
> > > > this
> > > > >>> idea (mostly from Patrick and Reynold :-).
> > > > >>>
> > > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > > > >>>
> > > > >>> ________________________________
> > > > >>> From: Patrick Wendell <pw...@gmail.com>
> > > > >>> To: Reynold Xin <rx...@databricks.com>
> > > > >>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> > > > >>> Sent: Monday, January 26, 2015 4:01 PM
> > > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > > > >>>
> > > > >>>
> > > > >>> One thing potentially not clear from this e-mail, there will be a
> > 1:1
> > > > >>> correspondence where you can get an RDD to/from a DataFrame.
> > > > >>>
> > > > >>>
> > > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> rxin@databricks.com>
> > > > wrote:
> > > > >>>> Hi,
> > > > >>>>
> > > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> > > wanted
> > > > to
> > > > >>>> get the community's opinion.
> > > > >>>>
> > > > >>>> The context is that SchemaRDD is becoming a common data format
> > used
> > > > for
> > > > >>>> bringing data into Spark from external systems, and used for
> > various
> > > > >>>> components of Spark, e.g. MLlib's new pipeline API. We also
> expect
> > > > more
> > > > >>> and
> > > > >>>> more users to be programming directly against SchemaRDD API
> rather
> > > > than
> > > > >>> the
> > > > >>>> core RDD API. SchemaRDD, through its less commonly used DSL
> > > originally
> > > > >>>> designed for writing test cases, always has the data-frame like
> > API.
> > > > In
> > > > >>>> 1.3, we are redesigning the API to make the API usable for end
> > > users.
> > > > >>>>
> > > > >>>>
> > > > >>>> There are two motivations for the renaming:
> > > > >>>>
> > > > >>>> 1. DataFrame seems to be a more self-evident name than
> SchemaRDD.
> > > > >>>>
> > > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
> anymore
> > > > (even
> > > > >>>> though it would contain some RDD functions like map, flatMap,
> > etc),
> > > > and
> > > > >>>> calling it Schema*RDD* while it is not an RDD is highly
> confusing.
> > > > >>> Instead,
> > > > >>>> DataFrame.rdd will return the underlying RDD for all RDD
> methods.
> > > > >>>>
> > > > >>>>
> > > > >>>> My understanding is that very few users program directly against
> > the
> > > > >>>> SchemaRDD API at the moment, because they are not well
> documented.
> > > > >>> However,
> > > > >>>> to maintain backward compatibility, we can create a type alias
> > > > DataFrame
> > > > >>>> that is still named SchemaRDD. This will maintain source
> > > compatibility
> > > > >>> for
> > > > >>>> Scala. That said, we will have to update all existing materials
> to
> > > use
> > > > >>>> DataFrame rather than SchemaRDD.
> > > > >>>
> > > > >>>
> > ---------------------------------------------------------------------
> > > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > > > >>>
> > > > >>>
> > ---------------------------------------------------------------------
> > > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > > > >>>
> > > > >>>
> > > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > > For additional commands, e-mail: dev-help@spark.apache.org
> > > >
> > > >
> > >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Dirceu Semighini Filho <di...@gmail.com>.
Can't the SchemaRDD remain the same, but deprecated, and be removed in the
release 1.5 (+/- 1), for example, and the new code be added to DataFrame?
With this, we wouldn't impact existing code for the next few releases.
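
A rough Scala illustration of that deprecation path (the version string and
names are hypothetical, only meant to show the mechanism):

    object DeprecationSketch {
      class DataFrame  // stands in for the new 1.3 API

      @deprecated("Use DataFrame; SchemaRDD may be removed in a later release", "1.3.0")
      type SchemaRDD = DataFrame

      // Existing code keeps compiling but now emits a deprecation warning,
      // giving users a few releases to migrate before the alias is dropped.
      val df: SchemaRDD = new DataFrame
    }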



2015-01-27 0:02 GMT-02:00 Kushal Datta <ku...@gmail.com>:

> I want to address the issue that Matei raised about the heavy lifting
> required for a full SQL support. It is amazing that even after 30 years of
> research there is not a single good open source columnar database like
> Vertica. There is a column store option in MySQL, but it is not nearly as
> sophisticated as Vertica or MonetDB. But there's a true need for such a
> system. I wonder why so and it's high time to change that.
> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com> wrote:
>
> > Both SchemaRDD and DataFrame sound fine to me, though I like the former
> > slightly better because it's more descriptive.
> >
> > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it would
> > be more clear from a user-facing perspective to at least choose a package
> > name for it that omits "sql".
> >
> > I would also be in favor of adding a separate Spark Schema module for
> Spark
> > SQL to rely on, but I imagine that might be too large a change at this
> > point?
> >
> > -Sandy
> >
> > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <ma...@gmail.com>
> > wrote:
> >
> > > (Actually when we designed Spark SQL we thought of giving it another
> > name,
> > > like Spark Schema, but we decided to stick with SQL since that was the
> > most
> > > obvious use case to many users.)
> > >
> > > Matei
> > >
> > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <ma...@gmail.com>
> > > wrote:
> > > >
> > > > While it might be possible to move this concept to Spark Core
> > long-term,
> > > supporting structured data efficiently does require quite a bit of the
> > > infrastructure in Spark SQL, such as query planning and columnar
> storage.
> > > The intent of Spark SQL though is to be more than a SQL server -- it's
> > > meant to be a library for manipulating structured data. Since this is
> > > possible to build over the core API, it's pretty natural to organize it
> > > that way, same as Spark Streaming is a library.
> > > >
> > > > Matei
> > > >
> > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
> wrote:
> > > >>
> > > >> "The context is that SchemaRDD is becoming a common data format used
> > for
> > > >> bringing data into Spark from external systems, and used for various
> > > >> components of Spark, e.g. MLlib's new pipeline API."
> > > >>
> > > >> i agree. this to me also implies it belongs in spark core, not sql
> > > >>
> > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > >> michaelmalak@yahoo.com.invalid> wrote:
> > > >>
> > > >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13
> Bay
> > > Area
> > > >>> Spark Meetup YouTube contained a wealth of background information
> on
> > > this
> > > >>> idea (mostly from Patrick and Reynold :-).
> > > >>>
> > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > > >>>
> > > >>> ________________________________
> > > >>> From: Patrick Wendell <pw...@gmail.com>
> > > >>> To: Reynold Xin <rx...@databricks.com>
> > > >>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> > > >>> Sent: Monday, January 26, 2015 4:01 PM
> > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > > >>>
> > > >>>
> > > >>> One thing potentially not clear from this e-mail, there will be a
> 1:1
> > > >>> correspondence where you can get an RDD to/from a DataFrame.
> > > >>>
> > > >>>
> > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
> > > wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> > wanted
> > > to
> > > >>>> get the community's opinion.
> > > >>>>
> > > >>>> The context is that SchemaRDD is becoming a common data format
> used
> > > for
> > > >>>> bringing data into Spark from external systems, and used for
> various
> > > >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect
> > > more
> > > >>> and
> > > >>>> more users to be programming directly against SchemaRDD API rather
> > > than
> > > >>> the
> > > >>>> core RDD API. SchemaRDD, through its less commonly used DSL
> > originally
> > > >>>> designed for writing test cases, always has the data-frame like
> API.
> > > In
> > > >>>> 1.3, we are redesigning the API to make the API usable for end
> > users.
> > > >>>>
> > > >>>>
> > > >>>> There are two motivations for the renaming:
> > > >>>>
> > > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > > >>>>
> > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> > > (even
> > > >>>> though it would contain some RDD functions like map, flatMap,
> etc),
> > > and
> > > >>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> > > >>> Instead,
> > > >>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> > > >>>>
> > > >>>>
> > > >>>> My understanding is that very few users program directly against
> the
> > > >>>> SchemaRDD API at the moment, because they are not well documented.
> > > >>> However,
> > > >>>> to maintain backward compatibility, we can create a type alias
> > > DataFrame
> > > >>>> that is still named SchemaRDD. This will maintain source
> > compatibility
> > > >>> for
> > > >>>> Scala. That said, we will have to update all existing materials to
> > use
> > > >>>> DataFrame rather than SchemaRDD.
> > > >>>
> > > >>>
> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > > >>>
> > > >>>
> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > > >>>
> > > >>>
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > For additional commands, e-mail: dev-help@spark.apache.org
> > >
> > >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Kushal Datta <ku...@gmail.com>.
I want to address the issue that Matei raised about the heavy lifting
required for full SQL support. It is amazing that even after 30 years of
research there is not a single good open source columnar database like
Vertica. There is a column store option in MySQL, but it is not nearly as
sophisticated as Vertica or MonetDB. But there's a true need for such a
system. I wonder why that is, and it's high time to change it.
On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sa...@cloudera.com> wrote:

> Both SchemaRDD and DataFrame sound fine to me, though I like the former
> slightly better because it's more descriptive.
>
> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it would
> be more clear from a user-facing perspective to at least choose a package
> name for it that omits "sql".
>
> I would also be in favor of adding a separate Spark Schema module for Spark
> SQL to rely on, but I imagine that might be too large a change at this
> point?
>
> -Sandy
>
> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
> > (Actually when we designed Spark SQL we thought of giving it another
> name,
> > like Spark Schema, but we decided to stick with SQL since that was the
> most
> > obvious use case to many users.)
> >
> > Matei
> >
> > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <ma...@gmail.com>
> > wrote:
> > >
> > > While it might be possible to move this concept to Spark Core
> long-term,
> > supporting structured data efficiently does require quite a bit of the
> > infrastructure in Spark SQL, such as query planning and columnar storage.
> > The intent of Spark SQL though is to be more than a SQL server -- it's
> > meant to be a library for manipulating structured data. Since this is
> > possible to build over the core API, it's pretty natural to organize it
> > that way, same as Spark Streaming is a library.
> > >
> > > Matei
> > >
> > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > >>
> > >> "The context is that SchemaRDD is becoming a common data format used
> for
> > >> bringing data into Spark from external systems, and used for various
> > >> components of Spark, e.g. MLlib's new pipeline API."
> > >>
> > >> i agree. this to me also implies it belongs in spark core, not sql
> > >>
> > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > >> michaelmalak@yahoo.com.invalid> wrote:
> > >>
> > >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> > Area
> > >>> Spark Meetup YouTube contained a wealth of background information on
> > this
> > >>> idea (mostly from Patrick and Reynold :-).
> > >>>
> > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > >>>
> > >>> ________________________________
> > >>> From: Patrick Wendell <pw...@gmail.com>
> > >>> To: Reynold Xin <rx...@databricks.com>
> > >>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> > >>> Sent: Monday, January 26, 2015 4:01 PM
> > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > >>>
> > >>>
> > >>> One thing potentially not clear from this e-mail, there will be a 1:1
> > >>> correspondence where you can get an RDD to/from a DataFrame.
> > >>>
> > >>>
> > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
> > wrote:
> > >>>> Hi,
> > >>>>
> > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> wanted
> > to
> > >>>> get the community's opinion.
> > >>>>
> > >>>> The context is that SchemaRDD is becoming a common data format used
> > for
> > >>>> bringing data into Spark from external systems, and used for various
> > >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect
> > more
> > >>> and
> > >>>> more users to be programming directly against SchemaRDD API rather
> > than
> > >>> the
> > >>>> core RDD API. SchemaRDD, through its less commonly used DSL
> originally
> > >>>> designed for writing test cases, always has the data-frame like API.
> > In
> > >>>> 1.3, we are redesigning the API to make the API usable for end
> users.
> > >>>>
> > >>>>
> > >>>> There are two motivations for the renaming:
> > >>>>
> > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > >>>>
> > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> > (even
> > >>>> though it would contain some RDD functions like map, flatMap, etc),
> > and
> > >>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> > >>> Instead,
> > >>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> > >>>>
> > >>>>
> > >>>> My understanding is that very few users program directly against the
> > >>>> SchemaRDD API at the moment, because they are not well documented.
> > >>> However,
> > >>>> to maintain backward compatibility, we can create a type alias
> > DataFrame
> > >>>> that is still named SchemaRDD. This will maintain source
> compatibility
> > >>> for
> > >>>> Scala. That said, we will have to update all existing materials to
> use
> > >>>> DataFrame rather than SchemaRDD.
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > >>>
> > >>>
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > For additional commands, e-mail: dev-help@spark.apache.org
> >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Sandy Ryza <sa...@cloudera.com>.
Both SchemaRDD and DataFrame sound fine to me, though I like the former
slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
be clearer from a user-facing perspective to at least choose a package
name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark
SQL to rely on, but I imagine that might be too large a change at this
point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> (Actually when we designed Spark SQL we thought of giving it another name,
> like Spark Schema, but we decided to stick with SQL since that was the most
> obvious use case to many users.)
>
> Matei
>
> > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
> >
> > While it might be possible to move this concept to Spark Core long-term,
> supporting structured data efficiently does require quite a bit of the
> infrastructure in Spark SQL, such as query planning and columnar storage.
> The intent of Spark SQL though is to be more than a SQL server -- it's
> meant to be a library for manipulating structured data. Since this is
> possible to build over the core API, it's pretty natural to organize it
> that way, same as Spark Streaming is a library.
> >
> > Matei
> >
> >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> "The context is that SchemaRDD is becoming a common data format used for
> >> bringing data into Spark from external systems, and used for various
> >> components of Spark, e.g. MLlib's new pipeline API."
> >>
> >> i agree. this to me also implies it belongs in spark core, not sql
> >>
> >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> michaelmalak@yahoo.com.invalid> wrote:
> >>
> >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> Area
> >>> Spark Meetup YouTube contained a wealth of background information on
> this
> >>> idea (mostly from Patrick and Reynold :-).
> >>>
> >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>
> >>> ________________________________
> >>> From: Patrick Wendell <pw...@gmail.com>
> >>> To: Reynold Xin <rx...@databricks.com>
> >>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >>> Sent: Monday, January 26, 2015 4:01 PM
> >>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>
> >>>
> >>> One thing potentially not clear from this e-mail, there will be a 1:1
> >>> correspondence where you can get an RDD to/from a DataFrame.
> >>>
> >>>
> >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>>> Hi,
> >>>>
> >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> to
> >>>> get the community's opinion.
> >>>>
> >>>> The context is that SchemaRDD is becoming a common data format used
> for
> >>>> bringing data into Spark from external systems, and used for various
> >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect
> more
> >>> and
> >>>> more users to be programming directly against SchemaRDD API rather
> than
> >>> the
> >>>> core RDD API. SchemaRDD, through its less commonly used DSL originally
> >>>> designed for writing test cases, always has the data-frame like API.
> In
> >>>> 1.3, we are redesigning the API to make the API usable for end users.
> >>>>
> >>>>
> >>>> There are two motivations for the renaming:
> >>>>
> >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>>
> >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> (even
> >>>> though it would contain some RDD functions like map, flatMap, etc),
> and
> >>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> >>> Instead,
> >>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> >>>>
> >>>>
> >>>> My understanding is that very few users program directly against the
> >>>> SchemaRDD API at the moment, because they are not well documented.
> >>> However,
> >>>> to maintain backward compatibility, we can create a type alias
> DataFrame
> >>>> that is still named SchemaRDD. This will maintain source compatibility
> >>> for
> >>>> Scala. That said, we will have to update all existing materials to use
> >>>> DataFrame rather than SchemaRDD.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>
> >>>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Matei Zaharia <ma...@gmail.com>.
(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei

> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <ma...@gmail.com> wrote:
> 
> While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.
> 
> Matei
> 
>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> 
>> "The context is that SchemaRDD is becoming a common data format used for
>> bringing data into Spark from external systems, and used for various
>> components of Spark, e.g. MLlib's new pipeline API."
>> 
>> i agree. this to me also implies it belongs in spark core, not sql
>> 
>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> michaelmalak@yahoo.com.invalid> wrote:
>> 
>>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
>>> Spark Meetup YouTube contained a wealth of background information on this
>>> idea (mostly from Patrick and Reynold :-).
>>> 
>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>> 
>>> ________________________________
>>> From: Patrick Wendell <pw...@gmail.com>
>>> To: Reynold Xin <rx...@databricks.com>
>>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>>> Sent: Monday, January 26, 2015 4:01 PM
>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>> 
>>> 
>>> One thing potentially not clear from this e-mail, there will be a 1:1
>>> correspondence where you can get an RDD to/from a DataFrame.
>>> 
>>> 
>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>> Hi,
>>>> 
>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>>>> get the community's opinion.
>>>> 
>>>> The context is that SchemaRDD is becoming a common data format used for
>>>> bringing data into Spark from external systems, and used for various
>>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
>>> and
>>>> more users to be programming directly against SchemaRDD API rather than
>>> the
>>>> core RDD API. SchemaRDD, through its less commonly used DSL originally
>>>> designed for writing test cases, always has the data-frame like API. In
>>>> 1.3, we are redesigning the API to make the API usable for end users.
>>>> 
>>>> 
>>>> There are two motivations for the renaming:
>>>> 
>>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>>> 
>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>>>> though it would contain some RDD functions like map, flatMap, etc), and
>>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
>>> Instead,
>>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
>>>> 
>>>> 
>>>> My understanding is that very few users program directly against the
>>>> SchemaRDD API at the moment, because they are not well documented.
>>> However,
>>>> to maintain backward compatibility, we can create a type alias DataFrame
>>>> that is still named SchemaRDD. This will maintain source compatibility
>>> for
>>>> Scala. That said, we will have to update all existing materials to use
>>>> DataFrame rather than SchemaRDD.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>> 
>>> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
that's great. guess i was looking at a somewhat stale master branch...

On Tue, Jan 27, 2015 at 2:19 PM, Reynold Xin <rx...@databricks.com> wrote:

> Koert,
>
> As Mark said, I have already refactored the API so that nothing in
> catalyst is exposed (and users won't need them anyway). Data types, Row
> interfaces are both outside catalyst package and in org.apache.spark.sql.
>
> On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> hey matei,
>> i think that stuff such as SchemaRDD, columnar storage and perhaps also
>> query planning can be re-used by many systems that do analysis on
>> structured data. i can imagine panda-like systems, but also datalog or
>> scalding-like (which we use at tresata and i might rebase on SchemaRDD at
>> some point). SchemaRDD should become the interface for all these. and
>> columnar storage abstractions should be re-used between all these.
>>
>> currently the sql tie in is way beyond just the (perhaps unfortunate)
>> naming convention. for example a core part of the SchemaRDD abstraction is
>> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
>> that wants to build on top of SchemaRDD to dig into catalyst, a SQL Parser
>> (if i understand it correctly, i have not used catalyst, but it looks
>> neat). i should not need to include a SQL parser just to use structured
>> data in say a panda-like framework.
>>
>> best, koert
>>
>>
>> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>>> While it might be possible to move this concept to Spark Core long-term,
>>> supporting structured data efficiently does require quite a bit of the
>>> infrastructure in Spark SQL, such as query planning and columnar storage.
>>> The intent of Spark SQL though is to be more than a SQL server -- it's
>>> meant to be a library for manipulating structured data. Since this is
>>> possible to build over the core API, it's pretty natural to organize it
>>> that way, same as Spark Streaming is a library.
>>>
>>> Matei
>>>
>>> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >
>>> > "The context is that SchemaRDD is becoming a common data format used
>>> for
>>> > bringing data into Spark from external systems, and used for various
>>> > components of Spark, e.g. MLlib's new pipeline API."
>>> >
>>> > i agree. this to me also implies it belongs in spark core, not sql
>>> >
>>> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>>> > michaelmalak@yahoo.com.invalid> wrote:
>>> >
>>> >> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
>>> Area
>>> >> Spark Meetup YouTube contained a wealth of background information on
>>> this
>>> >> idea (mostly from Patrick and Reynold :-).
>>> >>
>>> >> https://www.youtube.com/watch?v=YWppYPWznSQ
>>> >>
>>> >> ________________________________
>>> >> From: Patrick Wendell <pw...@gmail.com>
>>> >> To: Reynold Xin <rx...@databricks.com>
>>> >> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>>> >> Sent: Monday, January 26, 2015 4:01 PM
>>> >> Subject: Re: renaming SchemaRDD -> DataFrame
>>> >>
>>> >>
>>> >> One thing potentially not clear from this e-mail, there will be a 1:1
>>> >> correspondence where you can get an RDD to/from a DataFrame.
>>> >>
>>> >>
>>> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
>>> wanted to
>>> >>> get the community's opinion.
>>> >>>
>>> >>> The context is that SchemaRDD is becoming a common data format used
>>> for
>>> >>> bringing data into Spark from external systems, and used for various
>>> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect
>>> more
>>> >> and
>>> >>> more users to be programming directly against SchemaRDD API rather
>>> than
>>> >> the
>>> >>> core RDD API. SchemaRDD, through its less commonly used DSL
>>> originally
>>> >>> designed for writing test cases, always has the data-frame like API.
>>> In
>>> >>> 1.3, we are redesigning the API to make the API usable for end users.
>>> >>>
>>> >>>
>>> >>> There are two motivations for the renaming:
>>> >>>
>>> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>> >>>
>>> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
>>> (even
>>> >>> though it would contain some RDD functions like map, flatMap, etc),
>>> and
>>> >>> calling it Schema*RDD* while it is not an RDD is highly confusing.
>>> >> Instead,
>>> >>> DataFrame.rdd will return the underlying RDD for all RDD methods.
>>> >>>
>>> >>>
>>> >>> My understanding is that very few users program directly against the
>>> >>> SchemaRDD API at the moment, because they are not well documented.
>>> >> However,
>>> >>> to maintain backward compatibility, we can create a type alias
>>> DataFrame
>>> >>> that is still named SchemaRDD. This will maintain source
>>> compatibility
>>> >> for
>>> >>> Scala. That said, we will have to update all existing materials to
>>> use
>>> >>> DataFrame rather than SchemaRDD.
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> >> For additional commands, e-mail: dev-help@spark.apache.org
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> >> For additional commands, e-mail: dev-help@spark.apache.org
>>> >>
>>> >>
>>>
>>>
>>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Reynold Xin <rx...@databricks.com>.
Koert,

As Mark said, I have already refactored the API so that nothing in catalyst
is exposed (and users won't need it anyway). Data types and Row interfaces
are both outside the catalyst package, in org.apache.spark.sql.
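
Concretely, a downstream library should then only need the public import,
roughly like this (illustrative sketch; exact signatures on master may differ):

    // before the refactoring: import org.apache.spark.sql.catalyst.expressions.Row
    import org.apache.spark.sql.Row

    object RowUsageSketch {
      // downstream code can inspect rows without touching Catalyst internals
      def describe(r: Row): String = s"row with ${r.length} columns"
    }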

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers <ko...@tresata.com> wrote:

> hey matei,
> i think that stuff such as SchemaRDD, columnar storage and perhaps also
> query planning can be re-used by many systems that do analysis on
> structured data. i can imagine panda-like systems, but also datalog or
> scalding-like (which we use at tresata and i might rebase on SchemaRDD at
> some point). SchemaRDD should become the interface for all these. and
> columnar storage abstractions should be re-used between all these.
>
> currently the sql tie in is way beyond just the (perhaps unfortunate)
> naming convention. for example a core part of the SchemaRDD abstraction is
> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
> that wants to build on top of SchemaRDD to dig into catalyst, a SQL Parser
> (if i understand it correctly, i have not used catalyst, but it looks
> neat). i should not need to include a SQL parser just to use structured
> data in say a panda-like framework.
>
> best, koert
>
>
> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> While it might be possible to move this concept to Spark Core long-term,
>> supporting structured data efficiently does require quite a bit of the
>> infrastructure in Spark SQL, such as query planning and columnar storage.
>> The intent of Spark SQL though is to be more than a SQL server -- it's
>> meant to be a library for manipulating structured data. Since this is
>> possible to build over the core API, it's pretty natural to organize it
>> that way, same as Spark Streaming is a library.
>>
>> Matei
>>
>> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> >
>> > "The context is that SchemaRDD is becoming a common data format used for
>> > bringing data into Spark from external systems, and used for various
>> > components of Spark, e.g. MLlib's new pipeline API."
>> >
>> > i agree. this to me also implies it belongs in spark core, not sql
>> >
>> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> > michaelmalak@yahoo.com.invalid> wrote:
>> >
>> >> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
>> Area
>> >> Spark Meetup YouTube contained a wealth of background information on
>> this
>> >> idea (mostly from Patrick and Reynold :-).
>> >>
>> >> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >>
>> >> ________________________________
>> >> From: Patrick Wendell <pw...@gmail.com>
>> >> To: Reynold Xin <rx...@databricks.com>
>> >> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> >> Sent: Monday, January 26, 2015 4:01 PM
>> >> Subject: Re: renaming SchemaRDD -> DataFrame
>> >>
>> >>
>> >> One thing potentially not clear from this e-mail, there will be a 1:1
>> >> correspondence where you can get an RDD to/from a DataFrame.
>> >>
>> >>
>> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>> Hi,
>> >>>
>> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
>> to
>> >>> get the community's opinion.
>> >>>
>> >>> The context is that SchemaRDD is becoming a common data format used
>> for
>> >>> bringing data into Spark from external systems, and used for various
>> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect
>> more
>> >> and
>> >>> more users to be programming directly against SchemaRDD API rather
>> than
>> >> the
>> >>> core RDD API. SchemaRDD, through its less commonly used DSL originally
>> >>> designed for writing test cases, always has the data-frame like API.
>> In
>> >>> 1.3, we are redesigning the API to make the API usable for end users.
>> >>>
>> >>>
>> >>> There are two motivations for the renaming:
>> >>>
>> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>> >>>
>> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
>> (even
>> >>> though it would contain some RDD functions like map, flatMap, etc),
>> and
>> >>> calling it Schema*RDD* while it is not an RDD is highly confusing.
>> >> Instead,
>> >>> DataFrame.rdd will return the underlying RDD for all RDD methods.
>> >>>
>> >>>
>> >>> My understanding is that very few users program directly against the
>> >>> SchemaRDD API at the moment, because they are not well documented.
>> >> However,
>> >>> to maintain backward compatibility, we can create a type alias
>> DataFrame
>> >>> that is still named SchemaRDD. This will maintain source compatibility
>> >> for
>> >>> Scala. That said, we will have to update all existing materials to use
>> >>> DataFrame rather than SchemaRDD.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: dev-help@spark.apache.org
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: dev-help@spark.apache.org
>> >>
>> >>
>>
>>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Mark Hamstra <ma...@clearstorydata.com>.
In master, Reynold has already taken care of moving Row
into org.apache.spark.sql; so, even though the implementation of Row (and
GenericRow et al.) is in Catalyst (which is more optimizer than parser),
that needn't be of concern to users of the API in its most recent state.

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers <ko...@tresata.com> wrote:

> hey matei,
> i think that stuff such as SchemaRDD, columnar storage and perhaps also
> query planning can be re-used by many systems that do analysis on
> structured data. i can imagine panda-like systems, but also datalog or
> scalding-like (which we use at tresata and i might rebase on SchemaRDD at
> some point). SchemaRDD should become the interface for all these. and
> columnar storage abstractions should be re-used between all these.
>
> currently the sql tie in is way beyond just the (perhaps unfortunate)
> naming convention. for example a core part of the SchemaRDD abstraction is
> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
> that wants to build on top of SchemaRDD to dig into catalyst, a SQL Parser
> (if i understand it correctly, i have not used catalyst, but it looks
> neat). i should not need to include a SQL parser just to use structured
> data in say a panda-like framework.
>
> best, koert
>
>
> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
> > While it might be possible to move this concept to Spark Core long-term,
> > supporting structured data efficiently does require quite a bit of the
> > infrastructure in Spark SQL, such as query planning and columnar storage.
> > The intent of Spark SQL though is to be more than a SQL server -- it's
> > meant to be a library for manipulating structured data. Since this is
> > possible to build over the core API, it's pretty natural to organize it
> > that way, same as Spark Streaming is a library.
> >
> > Matei
> >
> > > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > >
> > > "The context is that SchemaRDD is becoming a common data format used
> for
> > > bringing data into Spark from external systems, and used for various
> > > components of Spark, e.g. MLlib's new pipeline API."
> > >
> > > i agree. this to me also implies it belongs in spark core, not sql
> > >
> > > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > michaelmalak@yahoo.com.invalid> wrote:
> > >
> > >> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> > Area
> > >> Spark Meetup YouTube contained a wealth of background information on
> > this
> > >> idea (mostly from Patrick and Reynold :-).
> > >>
> > >> https://www.youtube.com/watch?v=YWppYPWznSQ
> > >>
> > >> ________________________________
> > >> From: Patrick Wendell <pw...@gmail.com>
> > >> To: Reynold Xin <rx...@databricks.com>
> > >> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> > >> Sent: Monday, January 26, 2015 4:01 PM
> > >> Subject: Re: renaming SchemaRDD -> DataFrame
> > >>
> > >>
> > >> One thing potentially not clear from this e-mail, there will be a 1:1
> > >> correspondence where you can get an RDD to/from a DataFrame.
> > >>
> > >>
> > >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
> > wrote:
> > >>> Hi,
> > >>>
> > >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> > to
> > >>> get the community's opinion.
> > >>>
> > >>> The context is that SchemaRDD is becoming a common data format used
> for
> > >>> bringing data into Spark from external systems, and used for various
> > >>> components of Spark, e.g. MLlib's new pipeline API. We also expect
> more
> > >> and
> > >>> more users to be programming directly against SchemaRDD API rather
> than
> > >> the
> > >>> core RDD API. SchemaRDD, through its less commonly used DSL
> originally
> > >>> designed for writing test cases, always has the data-frame like API.
> In
> > >>> 1.3, we are redesigning the API to make the API usable for end users.
> > >>>
> > >>>
> > >>> There are two motivations for the renaming:
> > >>>
> > >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > >>>
> > >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> (even
> > >>> though it would contain some RDD functions like map, flatMap, etc),
> and
> > >>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> > >> Instead,
> > >>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> > >>>
> > >>>
> > >>> My understanding is that very few users program directly against the
> > >>> SchemaRDD API at the moment, because they are not well documented.
> > >> However,
> > >>> to maintain backward compatibility, we can create a type alias
> > DataFrame
> > >>> that is still named SchemaRDD. This will maintain source
> compatibility
> > >> for
> > >>> Scala. That said, we will have to update all existing materials to
> use
> > >>> DataFrame rather than SchemaRDD.
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >> For additional commands, e-mail: dev-help@spark.apache.org
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >> For additional commands, e-mail: dev-help@spark.apache.org
> > >>
> > >>
> >
> >
>

Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
hey matei,
i think that stuff such as SchemaRDD, columnar storage and perhaps also
query planning can be re-used by many systems that do analysis on
structured data. i can imagine panda-like systems, but also datalog or
scalding-like (which we use at tresata and i might rebase on SchemaRDD at
some point). SchemaRDD should become the interface for all these. and
columnar storage abstractions should be re-used between all these.

currently the sql tie-in is way beyond just the (perhaps unfortunate)
naming convention. for example a core part of the SchemaRDD abstraction is
Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
that wants to build on top of SchemaRDD to dig into catalyst, a SQL Parser
(if i understand it correctly, i have not used catalyst, but it looks
neat). i should not need to include a SQL parser just to use structured
data in say a panda-like framework.

best, koert


On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> While it might be possible to move this concept to Spark Core long-term,
> supporting structured data efficiently does require quite a bit of the
> infrastructure in Spark SQL, such as query planning and columnar storage.
> The intent of Spark SQL though is to be more than a SQL server -- it's
> meant to be a library for manipulating structured data. Since this is
> possible to build over the core API, it's pretty natural to organize it
> that way, same as Spark Streaming is a library.
>
> Matei
>
> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >
> > "The context is that SchemaRDD is becoming a common data format used for
> > bringing data into Spark from external systems, and used for various
> > components of Spark, e.g. MLlib's new pipeline API."
> >
> > i agree. this to me also implies it belongs in spark core, not sql
> >
> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > michaelmalak@yahoo.com.invalid> wrote:
> >
> >> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> Area
> >> Spark Meetup YouTube contained a wealth of background information on
> this
> >> idea (mostly from Patrick and Reynold :-).
> >>
> >> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>
> >> ________________________________
> >> From: Patrick Wendell <pw...@gmail.com>
> >> To: Reynold Xin <rx...@databricks.com>
> >> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> >> Sent: Monday, January 26, 2015 4:01 PM
> >> Subject: Re: renaming SchemaRDD -> DataFrame
> >>
> >>
> >> One thing potentially not clear from this e-mail, there will be a 1:1
> >> correspondence where you can get an RDD to/from a DataFrame.
> >>
> >>
> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>> Hi,
> >>>
> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> to
> >>> get the community's opinion.
> >>>
> >>> The context is that SchemaRDD is becoming a common data format used for
> >>> bringing data into Spark from external systems, and used for various
> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
> >> and
> >>> more users to be programming directly against SchemaRDD API rather than
> >> the
> >>> core RDD API. SchemaRDD, through its less commonly used DSL originally
> >>> designed for writing test cases, always has the data-frame like API. In
> >>> 1.3, we are redesigning the API to make the API usable for end users.
> >>>
> >>>
> >>> There are two motivations for the renaming:
> >>>
> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>
> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> >>> though it would contain some RDD functions like map, flatMap, etc), and
> >>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> >> Instead,
> >>> DataFrame.rdd will return the underlying RDD for all RDD methods.
> >>>
> >>>
> >>> My understanding is that very few users program directly against the
> >>> SchemaRDD API at the moment, because they are not well documented.
> >> However,
> >>> to maintain backward compatibility, we can create a type alias
> DataFrame
> >>> that is still named SchemaRDD. This will maintain source compatibility
> >> for
> >>> Scala. That said, we will have to update all existing materials to use
> >>> DataFrame rather than SchemaRDD.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >>
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Matei Zaharia <ma...@gmail.com>.
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.

Matei

> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> "The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API."
> 
> i agree. this to me also implies it belongs in spark core, not sql
> 
> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> michaelmalak@yahoo.com.invalid> wrote:
> 
>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
>> Spark Meetup YouTube contained a wealth of background information on this
>> idea (mostly from Patrick and Reynold :-).
>> 
>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> 
>> ________________________________
>> From: Patrick Wendell <pw...@gmail.com>
>> To: Reynold Xin <rx...@databricks.com>
>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> Sent: Monday, January 26, 2015 4:01 PM
>> Subject: Re: renaming SchemaRDD -> DataFrame
>> 
>> 
>> One thing potentially not clear from this e-mail, there will be a 1:1
>> correspondence where you can get an RDD to/from a DataFrame.
>> 
>> 
>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
>>> Hi,
>>> 
>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>>> get the community's opinion.
>>> 
>>> The context is that SchemaRDD is becoming a common data format used for
>>> bringing data into Spark from external systems, and used for various
>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
>> and
>>> more users to be programming directly against SchemaRDD API rather than
>> the
>>> core RDD API. SchemaRDD, through its less commonly used DSL originally
>>> designed for writing test cases, always has the data-frame like API. In
>>> 1.3, we are redesigning the API to make the API usable for end users.
>>> 
>>> 
>>> There are two motivations for the renaming:
>>> 
>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>> 
>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>>> though it would contain some RDD functions like map, flatMap, etc), and
>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
>> Instead,
>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
>>> 
>>> 
>>> My understanding is that very few users program directly against the
>>> SchemaRDD API at the moment, because they are not well documented.
>> However,
>>> to maintain backward compatibility, we can create a type alias DataFrame
>>> that is still named SchemaRDD. This will maintain source compatibility
>> for
>>> Scala. That said, we will have to update all existing materials to use
>>> DataFrame rather than SchemaRDD.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
what i am trying to say is: structured data != sql

On Mon, Jan 26, 2015 at 7:26 PM, Koert Kuipers <ko...@tresata.com> wrote:

> "The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API."
>
> i agree. this to me also implies it belongs in spark core, not sql
>
> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> michaelmalak@yahoo.com.invalid> wrote:
>
>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
>> Area Spark Meetup YouTube contained a wealth of background information on
>> this idea (mostly from Patrick and Reynold :-).
>>
>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>
>> ________________________________
>> From: Patrick Wendell <pw...@gmail.com>
>> To: Reynold Xin <rx...@databricks.com>
>> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
>> Sent: Monday, January 26, 2015 4:01 PM
>> Subject: Re: renaming SchemaRDD -> DataFrame
>>
>>
>> One thing potentially not clear from this e-mail, there will be a 1:1
>> correspondence where you can get an RDD to/from a DataFrame.
>>
>>
>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
>> > Hi,
>> >
>> > We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>> > get the community's opinion.
>> >
>> > The context is that SchemaRDD is becoming a common data format used for
>> > bringing data into Spark from external systems, and used for various
>> > components of Spark, e.g. MLlib's new pipeline API. We also expect more
>> and
>> > more users to be programming directly against SchemaRDD API rather than
>> the
>> > core RDD API. SchemaRDD, through its less commonly used DSL originally
>> > designed for writing test cases, always has the data-frame like API. In
>> > 1.3, we are redesigning the API to make the API usable for end users.
>> >
>> >
>> > There are two motivations for the renaming:
>> >
>> > 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>> >
>> > 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>> > though it would contain some RDD functions like map, flatMap, etc), and
>> > calling it Schema*RDD* while it is not an RDD is highly confusing.
>> Instead,
>> > DataFrame.rdd will return the underlying RDD for all RDD methods.
>> >
>> >
>> > My understanding is that very few users program directly against the
>> > SchemaRDD API at the moment, because they are not well documented.
>> However,
>> > to maintain backward compatibility, we can create a type alias DataFrame
>> > that is still named SchemaRDD. This will maintain source compatibility
>> for
>> > Scala. That said, we will have to update all existing materials to use
>> > DataFrame rather than SchemaRDD.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Koert Kuipers <ko...@tresata.com>.
"The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline API."

i agree. this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
michaelmalak@yahoo.com.invalid> wrote:

> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
> Spark Meetup YouTube contained a wealth of background information on this
> idea (mostly from Patrick and Reynold :-).
>
> https://www.youtube.com/watch?v=YWppYPWznSQ
>
> ________________________________
> From: Patrick Wendell <pw...@gmail.com>
> To: Reynold Xin <rx...@databricks.com>
> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> Sent: Monday, January 26, 2015 4:01 PM
> Subject: Re: renaming SchemaRDD -> DataFrame
>
>
> One thing potentially not clear from this e-mail, there will be a 1:1
> correspondence where you can get an RDD to/from a DataFrame.
>
>
> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> > Hi,
> >
> > We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> > get the community's opinion.
> >
> > The context is that SchemaRDD is becoming a common data format used for
> > bringing data into Spark from external systems, and used for various
> > components of Spark, e.g. MLlib's new pipeline API. We also expect more
> and
> > more users to be programming directly against SchemaRDD API rather than
> the
> > core RDD API. SchemaRDD, through its less commonly used DSL originally
> > designed for writing test cases, always has the data-frame like API. In
> > 1.3, we are redesigning the API to make the API usable for end users.
> >
> >
> > There are two motivations for the renaming:
> >
> > 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >
> > 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> > though it would contain some RDD functions like map, flatMap, etc), and
> > calling it Schema*RDD* while it is not an RDD is highly confusing.
> Instead,
> > DataFrame.rdd will return the underlying RDD for all RDD methods.
> >
> >
> > My understanding is that very few users program directly against the
> > SchemaRDD API at the moment, because they are not well documented.
> However,
> > to maintain backward compatibility, we can create a type alias for DataFrame
> > that is still named SchemaRDD. This will maintain source compatibility
> for
> > Scala. That said, we will have to update all existing materials to use
> > DataFrame rather than SchemaRDD.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: renaming SchemaRDD -> DataFrame

Posted by Michael Malak <mi...@yahoo.com.INVALID>.
And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup video on YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).

https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <pw...@gmail.com>
To: Reynold Xin <rx...@databricks.com> 
Cc: "dev@spark.apache.org" <de...@spark.apache.org> 
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame


One thing potentially not clear from this e-mail, there will be a 1:1
correspondence where you can get an RDD to/from a DataFrame.


On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, always has the data-frame like API. In
> 1.3, we are redesigning the API to make the API usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because they are not well documented. However,
> to maintain backward compatibility, we can create a type alias for DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Patrick Wendell <pw...@gmail.com>.
One thing potentially not clear from this e-mail: there will be a 1:1
correspondence, so you can get an RDD to/from a DataFrame.
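
As a purely illustrative sketch (not the finalized 1.3 API; the helper
names like toDF/createDataFrame and the shell-provided sc/sqlContext are
assumptions here), the round trip could look roughly like this in Scala:

    // Hypothetical sketch only: assumes a Spark shell with sc (SparkContext)
    // and sqlContext (SQLContext) in scope, and the DataFrame API described
    // in this thread.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}
    import sqlContext.implicits._

    case class Person(name: String, age: Int)

    // RDD -> DataFrame (assumed conversion helper)
    val people: RDD[Person] =
      sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
    val df: DataFrame = people.toDF()

    // DataFrame -> RDD: .rdd exposes the underlying RDD[Row]
    val rows: RDD[Row] = df.rdd
    rows.map(_.getString(0)).collect()   // back to plain RDD operations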

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, always has the data-frame like API. In
> 1.3, we are redesigning the API to make the API usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because they are not well documented. However,
> to maintain backward compatibility, we can create a type alias for DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: renaming SchemaRDD -> DataFrame

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It has been pretty evident for some time that that's what it is, hasn't it?

Yes, that's a better name IMO.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:

> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, always has the data-frame like API. In
> 1.3, we are redesigning the API to make the API usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because they are not well documented. However,
> to maintain backward compatibility, we can create a type alias for DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.
>
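
For anyone curious what the source-compatibility alias mentioned in the
proposal above could look like, here is a minimal Scala sketch; the package
placement and the deprecation annotation are assumptions, not the committed
implementation:

    // Hypothetical sketch of the proposed alias. Old code referring to
    // org.apache.spark.sql.SchemaRDD keeps compiling, while new code can
    // use DataFrame directly.
    package org.apache.spark

    package object sql {
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame
    }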