Posted to dev@predictionio.apache.org by Marcin Ziemiński <zi...@gmail.com> on 2016/08/15 21:42:39 UTC

Spark update

Hi,

The recent Spark release, 2.0.0, comes with many changes and
improvements, as does the related Spark MLlib.
-> Spark 2.0.0 release notes
<https://spark.apache.org/releases/spark-release-2-0-0.html>
-> Spark 2.0.0 overview
<https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html>
-> Announcement: DataFrame-based API is primary API
<http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api>

The changes are quite significant: they shift usage from RDDs towards
Datasets and bring many performance improvements.
The version of Spark that PredictionIO is currently built with, 1.4.0, is
quite old, and people will soon want to use the latest features in new
templates.
What is more, we could come up with a new kind of workflow, different from
DASE, that fits Spark ML pipelines better; it could be part of some future
release and set a new direction for the project.
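For anyone who has not looked at the new API yet, here is a minimal,
self-contained sketch (not PredictionIO or template code; the data, names
and app name are made up) of the DataFrame-based spark.ml Pipeline that
Spark 2.0 now treats as the primary ML API:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import org.apache.spark.sql.SparkSession

  object PipelineSketch {
    def main(args: Array[String]): Unit = {
      // SparkSession replaces SQLContext/HiveContext as the entry point in 2.0.
      val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()

      // Toy training data: (id, text, label) as a DataFrame instead of an RDD.
      val training = spark.createDataFrame(Seq(
        (0L, "spark two zero", 1.0),
        (1L, "old rdd api", 0.0)
      )).toDF("id", "text", "label")

      // Feature transformers and the estimator are composed into one Pipeline.
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
      val lr = new LogisticRegression().setMaxIter(10)
      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

      // fit() returns a PipelineModel that can transform new DataFrames.
      val model = pipeline.fit(training)
      model.transform(training).select("id", "prediction").show()

      spark.stop()
    }
  }

A DASE-alternative workflow would essentially compose estimators and
transformers like these instead of hand-written RDD code.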

I am aware that simply upgrading Spark would break a lot of existing
projects and inconvenience many people. Besides, it requires bumping the
Scala version to 2.11. Maintaining a separate branch for the new version is
hardly sustainable, so I think we could cross-build the project against two
versions of Scala - 2.10 and 2.11 - where the 2.11 build would target the
new version of Spark. Any sources that cannot compile against both versions
could be split into version-specific directories: src/main/scala-2.10/ and
src/main/scala-2.11/. sbt as of 0.13.8 should have no problems with
that -> merged
PR <https://github.com/sbt/sbt/pull/1799>
Such a setup would make it possible to work on fixes and features shared by
both Scala versions, but more importantly it would let us add new
functionality specific to the latest Spark releases without breaking the
old build.
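
As a rough illustration only (version numbers are placeholders, and this is
not the actual PredictionIO build), the relevant build.sbt fragment could
look like this:

  // Shared sources stay in src/main/scala/; anything that compiles against
  // only one Scala version goes into src/main/scala-2.10/ or
  // src/main/scala-2.11/, which sbt 0.13.8+ picks up automatically for the
  // matching Scala version.
  scalaVersion := "2.10.6"

  crossScalaVersions := Seq("2.10.6", "2.11.8")

Running "sbt +compile" or "sbt +publish" would then build both variants in
one go.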

I have already tried updating Spark in PredictionIO in my own fork and
managed to get a working version without modifying too much code -> diff
here
<https://github.com/apache/incubator-predictionio/compare/develop...Ziemin:upgrade#diff-fdc3abdfd754eeb24090dbd90aeec2ce>
Both unit tests and integration tests were successful.

What do you think? Would it affect the current release cycle negatively?
Maybe someone has a better idea of how to perform this upgrade.
Sticking to Spark 1.x forever is probably not an option, and the sooner we
upgrade, the better.

Regards,
Marcin

Re: Spark update

Posted by Donald Szeto <do...@apache.org>.
Hi Marcin,

This is really great work, and I agree that we should cross-build against
different versions of Spark. If we have a cross-building infrastructure set
up, we can be more flexible and adapt to different versions of external
dependencies as well, such as different versions of Hadoop and
Elasticsearch.

I cannot see any negative impact on the release cycle, besides the extra
work of generating more versions of the binaries. One thing we need to be
careful of is making sure that the core artifact, which all engine
templates depend on, always stays compatible with the different cross-build
configurations of PredictionIO. With Spark 2 being the starting point of
Scala 2.11 adoption, I think we are pretty safe assuming we support only
Spark < 2 for the Scala 2.10 artifacts and Spark >= 2 for the Scala 2.11
artifacts.
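
To make that rule explicit in the build, one option (just a sketch with
placeholder version numbers, not an agreed-upon configuration) would be to
key the Spark dependency off the Scala binary version being built:

  libraryDependencies += {
    // Scala 2.10 artifacts stay on the Spark 1.x line;
    // Scala 2.11 artifacts move to Spark 2.x.
    val sparkVersion = if (scalaBinaryVersion.value == "2.11") "2.0.0" else "1.4.0"
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
  }

That way engine templates built for Scala 2.10 would never pull in a Spark 2
artifact by accident, and vice versa.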

What do others think?

Regards,
Donald
