Posted to dev@spark.apache.org by Xiangrui Meng <me...@gmail.com> on 2016/04/05 20:01:16 UTC

Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Hi all,

More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API
has been developed under the spark.ml package, while the old RDD-based API
has been developed in parallel under the spark.mllib package. While it was
easier to implement and experiment with new APIs under a new package, it
became harder and harder to maintain as both packages grew bigger and
bigger. And new users are often confused by having two sets of APIs with
overlapping functionality.

We started to recommend the DataFrame-based API over the RDD-based API in
Spark 1.5 for its versatility and flexibility, and we saw the development
and the usage gradually shifting to the DataFrame-based API. Just counting
the lines of Scala code, from 1.5 to the current master we added ~10000
lines to the DataFrame-based API but only ~700 to the RDD-based API. So, to
focus more resources on the development of the DataFrame-based API and to
help users migrate over sooner, I want to propose switching RDD-based MLlib
APIs to maintenance mode in Spark 2.0. What does that mean exactly?

* We will not accept new features in the RDD-based spark.mllib package,
unless they block implementing new features in the DataFrame-based spark.ml
package.
* We will still accept bug fixes in the RDD-based API.
* We will add more features to the DataFrame-based API in the 2.x series to
reach feature parity with the RDD-based API.
* Once we reach feature parity (possibly in Spark 2.2), we will deprecate
the RDD-based API.
* We will remove the RDD-based API from the main Spark repo in Spark 3.0.

Though the RDD-based API is already in de facto maintenance mode, this
announcement will make that explicit, which matters to both MLlib
developers and users. So we’d greatly appreciate your feedback!

(As a side note, people sometimes use “Spark ML” to refer to the
DataFrame-based API or even the entire MLlib component. This also causes
confusion. To be clear, “Spark ML” is not an official name and there are no
plans to rename MLlib to “Spark ML” at this time.)

Best,
Xiangrui

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Holden Karau <ho...@pigscanfly.ca>.

I'm very much in favor of this; the less porting work there is, the better :)

On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> +1  By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com>
>>> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> built
>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>> API has
>>> >> been developed under the spark.ml package, while the old RDD-based
>>> API has
>>> >> been developed in parallel under the spark.mllib package. While it was
>>> >> easier to implement and experiment with new APIs under a new package,
>>> it
>>> >> became harder and harder to maintain as both packages grew bigger and
>>> >> bigger. And new users are often confused by having two sets of APIs
>>> with
>>> >> overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> API in
>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>> development
>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>> counting
>>> >> the lines of Scala code, from 1.5 to the current master we added
>>> ~10000
>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>> to
>>> >> gather more resources on the development of the DataFrame-based API
>>> and to
>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>> MLlib
>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> unless
>>> >> they block implementing new features in the DataFrame-based spark.ml
>>> >> package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> series to
>>> >> reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> deprecate
>>> >> the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> 3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> developers
>>> >> and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> causes
>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>> are no
>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> > For additional commands, e-mail: user-help@spark.apache.org
>>> >
>>>
>>
>>
>



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by DB Tsai <db...@dbtsai.com>.

+1 for renaming the jar file.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly <ch...@fregly.com> wrote:
> perhaps renaming to Spark ML would actually clear up code and documentation
> confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jo...@databricks.com>
> wrote:
>>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com>
>> wrote:
>>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally move
>>> towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman
>>> <sh...@eecs.berkeley.edu> wrote:
>>>>
>>>> Overall this sounds good to me. One question I have is that in
>>>> addition to the ML algorithms we have a number of linear algebra
>>>> (various distributed matrices) and statistical methods in the
>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>> namespace in the 2.x series ?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com>
>>>> > wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>>> >> built
>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>>> >> API has
>>>> >> been developed under the spark.ml package, while the old RDD-based
>>>> >> API has
>>>> >> been developed in parallel under the spark.mllib package. While it
>>>> >> was
>>>> >> easier to implement and experiment with new APIs under a new package,
>>>> >> it
>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>> >> bigger. And new users are often confused by having two sets of APIs
>>>> >> with
>>>> >> overlapped functions.
>>>> >>
>>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>>> >> API in
>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>>> >> development
>>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>>> >> counting
>>>> >> the lines of Scala code, from 1.5 to the current master we added
>>>> >> ~10000
>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>>> >> to
>>>> >> gather more resources on the development of the DataFrame-based API
>>>> >> and to
>>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>>> >> MLlib
>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>> >>
>>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>>> >> unless
>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>> >> package.
>>>> >> * We still accept bug fixes in the RDD-based API.
>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>> >> series to
>>>> >> reach feature parity with the RDD-based API.
>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>> >> deprecate
>>>> >> the RDD-based API.
>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>> >> 3.0.
>>>> >>
>>>> >> Though the RDD-based API is already in de facto maintenance mode,
>>>> >> this
>>>> >> announcement will make it clear and hence important to both MLlib
>>>> >> developers
>>>> >> and users. So we’d greatly appreciate your feedback!
>>>> >>
>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>> >> causes
>>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>>> >> are no
>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>> >
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Nick Pentreath <ni...@gmail.com>.

+1 for this proposal - as you mention, I think it's the de facto current
situation anyway.

Note that from a developer's view it's just the user-facing API that will be
only "ml" - the majority of the actual algorithms still operate on RDDs
under the hood currently.

On Wed, 6 Apr 2016 at 05:03, Chris Fregly <ch...@fregly.com> wrote:

> perhaps renaming to Spark ML would actually clear up code and
> documentation confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally
>>> move towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>>> shivaram@eecs.berkeley.edu> wrote:
>>>
>>>> Overall this sounds good to me. One question I have is that in
>>>> addition to the ML algorithms we have a number of linear algebra
>>>> (various distributed matrices) and statistical methods in the
>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>> namespace in the 2.x series ?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com>
>>>> wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>>> built
>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>>> API has
>>>> >> been developed under the spark.ml package, while the old RDD-based
>>>> API has
>>>> >> been developed in parallel under the spark.mllib package. While it
>>>> was
>>>> >> easier to implement and experiment with new APIs under a new
>>>> package, it
>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>> >> bigger. And new users are often confused by having two sets of APIs
>>>> with
>>>> >> overlapped functions.
>>>> >>
>>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>>> API in
>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>>> development
>>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>>> counting
>>>> >> the lines of Scala code, from 1.5 to the current master we added
>>>> ~10000
>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API.
>>>> So, to
>>>> >> gather more resources on the development of the DataFrame-based API
>>>> and to
>>>> >> help users migrate over sooner, I want to propose switching
>>>> RDD-based MLlib
>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>> >>
>>>> >> * We do not accept new features in the RDD-based spark.mllib
>>>> package, unless
>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>> >> package.
>>>> >> * We still accept bug fixes in the RDD-based API.
>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>> series to
>>>> >> reach feature parity with the RDD-based API.
>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>> deprecate
>>>> >> the RDD-based API.
>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>> 3.0.
>>>> >>
>>>> >> Though the RDD-based API is already in de facto maintenance mode,
>>>> this
>>>> >> announcement will make it clear and hence important to both MLlib
>>>> developers
>>>> >> and users. So we’d greatly appreciate your feedback!
>>>> >>
>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>> causes
>>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>>> are no
>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>> >
>>>>
>>>
>>>
>>
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Chris Fregly <ch...@fregly.com>.

perhaps renaming to Spark ML would actually clear up code and documentation confusion?

+1 for rename 

> On Apr 5, 2016, at 7:00 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> +1
> 
> This is a no brainer IMO.
> 
> 
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jo...@databricks.com> wrote:
>> +1  By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591
>> 
>>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com> wrote:
>>> This sounds good to me as well. The one thing we should pay attention to is how we update the docs so that people know to start with the spark.ml classes. Right now the docs list spark.mllib first and also seem more comprehensive in that area than in spark.ml, so maybe people naturally move towards that.
>>> 
>>> Matei
>>> 
>>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>>> 
>>>> Yes, DB (cc'ed) is working on porting the local linear algebra library over (SPARK-13944). There are also frequent pattern mining algorithms we need to port over in order to reach feature parity. -Xiangrui
>>>> 
>>>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <sh...@eecs.berkeley.edu> wrote:
>>>>> Overall this sounds good to me. One question I have is that in
>>>>> addition to the ML algorithms we have a number of linear algebra
>>>>> (various distributed matrices) and statistical methods in the
>>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>>> namespace in the 2.x series ?
>>>>> 
>>>>> Thanks
>>>>> Shivaram
>>>>> 
>>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>>> > certainly better than two.
>>>>> >
>>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>>>>> >> been developed under the spark.ml package, while the old RDD-based API has
>>>>> >> been developed in parallel under the spark.mllib package. While it was
>>>>> >> easier to implement and experiment with new APIs under a new package, it
>>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>>> >> bigger. And new users are often confused by having two sets of APIs with
>>>>> >> overlapped functions.
>>>>> >>
>>>>> >> We started to recommend the DataFrame-based API over the RDD-based API in
>>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the development
>>>>> >> and the usage gradually shifting to the DataFrame-based API. Just counting
>>>>> >> the lines of Scala code, from 1.5 to the current master we added ~10000
>>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>>>>> >> gather more resources on the development of the DataFrame-based API and to
>>>>> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
>>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>>> >>
>>>>> >> * We do not accept new features in the RDD-based spark.mllib package, unless
>>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>>> >> package.
>>>>> >> * We still accept bug fixes in the RDD-based API.
>>>>> >> * We will add more features to the DataFrame-based API in the 2.x series to
>>>>> >> reach feature parity with the RDD-based API.
>>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>>>>> >> the RDD-based API.
>>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>>>> >>
>>>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>>>> >> announcement will make it clear and hence important to both MLlib developers
>>>>> >> and users. So we’d greatly appreciate your feedback!
>>>>> >>
>>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>>> >> DataFrame-based API or even the entire MLlib component. This also causes
>>>>> >> confusion. To be clear, “Spark ML” is not an official name and there are no
>>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>>> >>
>>>>> >> Best,
>>>>> >> Xiangrui
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>>> >
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Chris Fregly <ch...@fregly.com>.

perhaps renaming to Spark ML would actually clear up code and documentation confusion?

+1 for rename 

> On Apr 5, 2016, at 7:00 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> +1
> 
> This is a no brainer IMO.
> 
> 
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jo...@databricks.com> wrote:
>> +1  By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591
>> 
>>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com> wrote:
>>> This sounds good to me as well. The one thing we should pay attention to is how we update the docs so that people know to start with the spark.ml classes. Right now the docs list spark.mllib first and also seem more comprehensive in that area than in spark.ml, so maybe people naturally move towards that.
>>> 
>>> Matei
>>> 
>>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>>> 
>>>> Yes, DB (cc'ed) is working on porting the local linear algebra library over (SPARK-13944). There are also frequent pattern mining algorithms we need to port over in order to reach feature parity. -Xiangrui
>>>> 
>>>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <sh...@eecs.berkeley.edu> wrote:
>>>>> Overall this sounds good to me. One question I have is that in
>>>>> addition to the ML algorithms we have a number of linear algebra
>>>>> (various distributed matrices) and statistical methods in the
>>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>>> namespace in the 2.x series ?
>>>>> 
>>>>> Thanks
>>>>> Shivaram
>>>>> 
>>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>>> > certainly better than two.
>>>>> >
>>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>>>>> >> been developed under the spark.ml package, while the old RDD-based API has
>>>>> >> been developed in parallel under the spark.mllib package. While it was
>>>>> >> easier to implement and experiment with new APIs under a new package, it
>>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>>> >> bigger. And new users are often confused by having two sets of APIs with
>>>>> >> overlapped functions.
>>>>> >>
>>>>> >> We started to recommend the DataFrame-based API over the RDD-based API in
>>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the development
>>>>> >> and the usage gradually shifting to the DataFrame-based API. Just counting
>>>>> >> the lines of Scala code, from 1.5 to the current master we added ~10000
>>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>>>>> >> gather more resources on the development of the DataFrame-based API and to
>>>>> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
>>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>>> >>
>>>>> >> * We do not accept new features in the RDD-based spark.mllib package, unless
>>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>>> >> package.
>>>>> >> * We still accept bug fixes in the RDD-based API.
>>>>> >> * We will add more features to the DataFrame-based API in the 2.x series to
>>>>> >> reach feature parity with the RDD-based API.
>>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>>>>> >> the RDD-based API.
>>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>>>> >>
>>>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>>>> >> announcement will make it clear and hence important to both MLlib developers
>>>>> >> and users. So we’d greatly appreciate your feedback!
>>>>> >>
>>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>>> >> DataFrame-based API or even the entire MLlib component. This also causes
>>>>> >> confusion. To be clear, “Spark ML” is not an official name and there are no
>>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>>> >>
>>>>> >> Best,
>>>>> >> Xiangrui
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>>> >
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.

+1

This is a no brainer IMO.


On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> +1  By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com>
>>> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> built
>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>> API has
>>> >> been developed under the spark.ml package, while the old RDD-based
>>> API has
>>> >> been developed in parallel under the spark.mllib package. While it was
>>> >> easier to implement and experiment with new APIs under a new package,
>>> it
>>> >> became harder and harder to maintain as both packages grew bigger and
>>> >> bigger. And new users are often confused by having two sets of APIs
>>> with
>>> >> overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> API in
>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>> development
>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>> counting
>>> >> the lines of Scala code, from 1.5 to the current master we added
>>> ~10000
>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>> to
>>> >> gather more resources on the development of the DataFrame-based API
>>> and to
>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>> MLlib
>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> unless
>>> >> they block implementing new features in the DataFrame-based spark.ml
>>> >> package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> series to
>>> >> reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> deprecate
>>> >> the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> 3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> developers
>>> >> and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> causes
>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>> are no
>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> > For additional commands, e-mail: user-help@spark.apache.org
>>> >
>>>
>>
>>
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Holden Karau <ho...@pigscanfly.ca>.

I'm very much in favor of this, the less porting work there is the better :)

On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> +1  By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com>
>>> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> built
>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>> API has
>>> >> been developed under the spark.ml package, while the old RDD-based
>>> API has
>>> >> been developed in parallel under the spark.mllib package. While it was
>>> >> easier to implement and experiment with new APIs under a new package,
>>> it
>>> >> became harder and harder to maintain as both packages grew bigger and
>>> >> bigger. And new users are often confused by having two sets of APIs
>>> with
>>> >> overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> API in
>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>> development
>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>> counting
>>> >> the lines of Scala code, from 1.5 to the current master we added
>>> ~10000
>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>> to
>>> >> gather more resources on the development of the DataFrame-based API
>>> and to
>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>> MLlib
>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> unless
>>> >> they block implementing new features in the DataFrame-based spark.ml
>>> >> package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> series to
>>> >> reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> deprecate
>>> >> the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> 3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> developers
>>> >> and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> causes
>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>> are no
>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> > For additional commands, e-mail: user-help@spark.apache.org
>>> >
>>>
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Joseph Bradley <jo...@databricks.com>.

+1  By the way, the JIRA for tracking (Scala) API parity is:
https://issues.apache.org/jira/browse/SPARK-4591

On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> This sounds good to me as well. The one thing we should pay attention to
> is how we update the docs so that people know to start with the spark.ml
> classes. Right now the docs list spark.mllib first and also seem more
> comprehensive in that area than in spark.ml, so maybe people naturally
> move towards that.
>
> Matei
>
> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>
> Yes, DB (cc'ed) is working on porting the local linear algebra library
> over (SPARK-13944). There are also frequent pattern mining algorithms we
> need to port over in order to reach feature parity. -Xiangrui
>
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
> shivaram@eecs.berkeley.edu> wrote:
>
>> Overall this sounds good to me. One question I have is that in
>> addition to the ML algorithms we have a number of linear algebra
>> (various distributed matrices) and statistical methods in the
>> spark.mllib package. Is the plan to port or move these to the spark.ml
>> namespace in the 2.x series ?
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>> > certainly better than two.
>> >
>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
>> >> Hi all,
>> >>
>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>> built
>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>> API has
>> >> been developed under the spark.ml package, while the old RDD-based
>> API has
>> >> been developed in parallel under the spark.mllib package. While it was
>> >> easier to implement and experiment with new APIs under a new package,
>> it
>> >> became harder and harder to maintain as both packages grew bigger and
>> >> bigger. And new users are often confused by having two sets of APIs
>> with
>> >> overlapped functions.
>> >>
>> >> We started to recommend the DataFrame-based API over the RDD-based API
>> in
>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>> development
>> >> and the usage gradually shifting to the DataFrame-based API. Just
>> counting
>> >> the lines of Scala code, from 1.5 to the current master we added ~10000
>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>> to
>> >> gather more resources on the development of the DataFrame-based API
>> and to
>> >> help users migrate over sooner, I want to propose switching RDD-based
>> MLlib
>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>> >>
>> >> * We do not accept new features in the RDD-based spark.mllib package,
>> unless
>> >> they block implementing new features in the DataFrame-based spark.ml
>> >> package.
>> >> * We still accept bug fixes in the RDD-based API.
>> >> * We will add more features to the DataFrame-based API in the 2.x
>> series to
>> >> reach feature parity with the RDD-based API.
>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>> deprecate
>> >> the RDD-based API.
>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>> 3.0.
>> >>
>> >> Though the RDD-based API is already in de facto maintenance mode, this
>> >> announcement will make it clear and hence important to both MLlib
>> developers
>> >> and users. So we’d greatly appreciate your feedback!
>> >>
>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>> >> DataFrame-based API or even the entire MLlib component. This also
>> causes
>> >> confusion. To be clear, “Spark ML” is not an official name and there
>> are no
>> >> plans to rename MLlib to “Spark ML” at this time.)
>> >>
>> >> Best,
>> >> Xiangrui
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> > For additional commands, e-mail: user-help@spark.apache.org
>> >
>>
>
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.

+1

From:  Matei Zaharia <ma...@gmail.com>
Date:  Tuesday, April 5, 2016 at 4:58 PM
To:  Xiangrui Meng <me...@databricks.com>
Cc:  Shivaram Venkataraman <sh...@eecs.berkeley.edu>, Sean Owen
<so...@cloudera.com>, Xiangrui Meng <me...@gmail.com>, dev
<de...@spark.apache.org>, "user @spark" <us...@spark.apache.org>, DB Tsai
<db...@dbtsai.com>
Subject:  Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

> This sounds good to me as well. The one thing we should pay attention to is
> how we update the docs so that people know to start with the spark.ml classes.
> Right now the docs list spark.mllib first and also seem more comprehensive in
> that area than in spark.ml, so maybe people naturally move towards that.
> 
> Matei
> 
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
>> 
>> Yes, DB (cc'ed) is working on porting the local linear algebra library over
>> (SPARK-13944). There are also frequent pattern mining algorithms we need to
>> port over in order to reach feature parity. -Xiangrui
>> 
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman
>> <sh...@eecs.berkeley.edu> wrote:
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>> 
>>> Thanks
>>> Shivaram
>>> 
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>>>> built
>>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>>>> API has
>>>>> >> been developed under the spark.ml package, while
>>>>> the old RDD-based API has
>>>>> >> been developed in parallel under the spark.mllib package. While it was
>>>>> >> easier to implement and experiment with new APIs under a new package,
>>>>> it
>>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>>> >> bigger. And new users are often confused by having two sets of APIs
>>>>> with
>>>>> >> overlapped functions.
>>>>> >>
>>>>> >> We started to recommend the DataFrame-based API over the RDD-based API
>>>>> in
>>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>>>> development
>>>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>>>> counting
>>>>> >> the lines of Scala code, from 1.5 to the current master we added ~10000
>>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>>>> to
>>>>> >> gather more resources on the development of the DataFrame-based API and
>>>>> to
>>>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>>>> MLlib
>>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>>> >>
>>>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>>>> unless
>>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>>> >> package.
>>>>> >> * We still accept bug fixes in the RDD-based API.
>>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>>> series to
>>>>> >> reach feature parity with the RDD-based API.
>>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>>> deprecate
>>>>> >> the RDD-based API.
>>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>>> 3.0.
>>>>> >>
>>>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>>>> >> announcement will make it clear and hence important to both MLlib
>>>>> developers
>>>>> >> and users. So we’d greatly appreciate your feedback!
>>>>> >>
>>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>>> causes
>>>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>>>> are no
>>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>>> >>
>>>>> >> Best,
>>>>> >> Xiangrui
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>> >
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Matei Zaharia <ma...@gmail.com>.

This sounds good to me as well. The one thing we should pay attention to is how we update the docs so that people know to start with the spark.ml classes. Right now the docs list spark.mllib first and also seem more comprehensive in that area than in spark.ml, so maybe people naturally move towards that.

Matei

> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <me...@databricks.com> wrote:
> 
> Yes, DB (cc'ed) is working on porting the local linear algebra library over (SPARK-13944). There are also frequent pattern mining algorithms we need to port over in order to reach feature parity. -Xiangrui
> 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <shivaram@eecs.berkeley.edu> wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <sowen@cloudera.com> wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
> >> been developed under the spark.ml package, while the old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~10000
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Xiangrui Meng <me...@databricks.com>.

Yes, DB (cc'ed) is working on porting the local linear algebra library over
(SPARK-13944). There are also frequent pattern mining algorithms we need to
port over in order to reach feature parity. -Xiangrui

On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
shivaram@eecs.berkeley.edu> wrote:

> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
>
> Thanks
> Shivaram
>
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
> >> been developed under the spark.ml package, while the old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~10000
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >
>

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.

Overall this sounds good to me. One question I have is that in
addition to the ML algorithms we have a number of linear algebra
(various distributed matrices) and statistical methods in the
spark.mllib package. Is the plan to port or move these to the spark.ml
namespace in the 2.x series ?

Thanks
Shivaram

On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
> FWIW, all of that sounds like a good plan to me. Developing one API is
> certainly better than two.
>
> On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
>> Hi all,
>>
>> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>> been developed under the spark.ml package, while the old RDD-based API has
>> been developed in parallel under the spark.mllib package. While it was
>> easier to implement and experiment with new APIs under a new package, it
>> became harder and harder to maintain as both packages grew bigger and
>> bigger. And new users are often confused by having two sets of APIs with
>> overlapped functions.
>>
>> We started to recommend the DataFrame-based API over the RDD-based API in
>> Spark 1.5 for its versatility and flexibility, and we saw the development
>> and the usage gradually shifting to the DataFrame-based API. Just counting
>> the lines of Scala code, from 1.5 to the current master we added ~10000
>> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>> gather more resources on the development of the DataFrame-based API and to
>> help users migrate over sooner, I want to propose switching RDD-based MLlib
>> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>
>> * We do not accept new features in the RDD-based spark.mllib package, unless
>> they block implementing new features in the DataFrame-based spark.ml
>> package.
>> * We still accept bug fixes in the RDD-based API.
>> * We will add more features to the DataFrame-based API in the 2.x series to
>> reach feature parity with the RDD-based API.
>> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>> the RDD-based API.
>> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>
>> Though the RDD-based API is already in de facto maintenance mode, this
>> announcement will make it clear and hence important to both MLlib developers
>> and users. So we’d greatly appreciate your feedback!
>>
>> (As a side note, people sometimes use “Spark ML” to refer to the
>> DataFrame-based API or even the entire MLlib component. This also causes
>> confusion. To be clear, “Spark ML” is not an official name and there are no
>> plans to rename MLlib to “Spark ML” at this time.)
>>
>> Best,
>> Xiangrui
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.

FWIW, all of that sounds like a good plan to me. Developing one API is
certainly better than two.

On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <me...@gmail.com> wrote:
> Hi all,
>
> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
> been developed under the spark.ml package, while the old RDD-based API has
> been developed in parallel under the spark.mllib package. While it was
> easier to implement and experiment with new APIs under a new package, it
> became harder and harder to maintain as both packages grew bigger and
> bigger. And new users are often confused by having two sets of APIs with
> overlapped functions.
>
> We started to recommend the DataFrame-based API over the RDD-based API in
> Spark 1.5 for its versatility and flexibility, and we saw the development
> and the usage gradually shifting to the DataFrame-based API. Just counting
> the lines of Scala code, from 1.5 to the current master we added ~10000
> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> gather more resources on the development of the DataFrame-based API and to
> help users migrate over sooner, I want to propose switching RDD-based MLlib
> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>
> * We do not accept new features in the RDD-based spark.mllib package, unless
> they block implementing new features in the DataFrame-based spark.ml
> package.
> * We still accept bug fixes in the RDD-based API.
> * We will add more features to the DataFrame-based API in the 2.x series to
> reach feature parity with the RDD-based API.
> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> the RDD-based API.
> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>
> Though the RDD-based API is already in de facto maintenance mode, this
> announcement will make it clear and hence important to both MLlib developers
> and users. So we’d greatly appreciate your feedback!
>
> (As a side note, people sometimes use “Spark ML” to refer to the
> DataFrame-based API or even the entire MLlib component. This also causes
> confusion. To be clear, “Spark ML” is not an official name and there are no
> plans to rename MLlib to “Spark ML” at this time.)
>
> Best,
> Xiangrui

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
