Posted to general@incubator.apache.org by Julien Le Dem <ju...@dremio.com> on 2015/11/13 01:39:40 UTC

Re: [DISCUSS] Spark-Kernel Incubator Proposal

I'd be happy to help as a mentor if you need more.

On Thu, Nov 12, 2015 at 4:17 PM, <da...@fallside.com> wrote:

> Hello, we would like to start a discussion on accepting the Spark-Kernel,
> a mechanism for applications to interactively and remotely access Apache
> Spark, into the Apache Incubator.
>
> The proposal is available online at
> https://wiki.apache.org/incubator/SparkKernelProposal, and it is appended
> to this email.
>
> We are looking for additional mentors to help with this project, and we
> would much appreciate your guidance and advice.
>
> Thank you in advance,
> David Fallside
>
>
>
> = Spark-Kernel Proposal =
>
> == Abstract ==
> Spark-Kernel provides applications with a mechanism to interactively and
> remotely access Apache Spark.
>
> == Proposal ==
> The Spark-Kernel enables interactive applications to access Apache Spark
> clusters. More specifically:
>  * Applications can send code-snippets and libraries for execution by Spark
>  * Applications can be deployed separately from Spark clusters and
> communicate with the Spark-Kernel using the provided Spark-Kernel client
>  * Execution results and streaming data can be sent back to calling
> applications
>  * Applications no longer have to be network connected to the workers on a
> Spark cluster because the Spark-Kernel acts as each application’s proxy
>  * Work has started on enabling Spark-Kernel to support languages in
> addition to Scala, namely Python (with PySpark), R (with SparkR), and SQL
> (with SparkSQL)
>
> == Background & Rationale ==
> Apache Spark provides applications with a fast, general-purpose
> distributed computing engine that supports static and streaming data,
> tabular and graph representations of data, and an extensive set of
> machine learning libraries. Consequently, a wide variety of applications
> will be written for Spark: interactive applications that require
> relatively frequent function evaluations, and batch-oriented
> applications that require one-shot or only occasional evaluation.
>
> Apache Spark provides two mechanisms for applications to connect with
> Spark. The primary mechanism launches applications on Spark clusters using
> spark-submit
> (http://spark.apache.org/docs/latest/submitting-applications.html); this
> requires developers to bundle their application code plus any dependencies
> into JAR files, and then submit them to Spark. A second mechanism is an
> ODBC/JDBC API
> (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine),
> which enables applications to issue SQL queries against SparkSQL.
>
> Our experience when developing interactive applications, such as analytic
> applications and Jupyter Notebooks, to run against Spark was that the
> spark-submit mechanism was overly cumbersome and slow (requiring JAR
> creation and forking processes to run spark-submit), and the SQL interface
> was too limiting and did not offer easy access to components other than
> SparkSQL, such as streaming. The most promising mechanism provided by
> Apache Spark was the command-line shell
> (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell),
> which enabled us to execute code snippets and dynamically control the
> tasks submitted to a Spark cluster. Spark does not provide the
> command-line shell as a consumable service, but it provided us with the
> starting point from which we developed the Spark-Kernel.
>
> == Current Status ==
> Spark-Kernel was first developed by a small team working on an
> internal-IBM Spark-related project in July 2014. In recognition of its
> likely general utility to Spark users and developers, in November 2014 the
> Spark-Kernel project was moved to GitHub and made available under the
> Apache License V2.
>
> == Meritocracy ==
> The current developers are familiar with the meritocratic open source
> development process at Apache. As the project has gathered interest on
> GitHub, the developers have started a process to invite additional
> developers into the project, and we have at least one new developer who is
> ready to contribute code to the project.
>
> == Community ==
> We started building a community around the Spark-Kernel project when we
> moved it to GitHub about one year ago. Since then we have grown to about
> 70 people, and there are regular requests and suggestions from the
> community. We believe that providing Apache Spark application developers
> with a general-purpose and interactive API holds a lot of community
> potential, especially considering possible tie-ins with the Jupyter and
> data science community.
>
> == Core Developers ==
> The core developers of the project are currently all from IBM, from the
> IBM Emerging Technology team and from IBM’s recently formed Spark
> Technology Center.
>
> == Alignment ==
> Apache, as the home of Apache Spark, is the most natural home for the
> Spark-Kernel project because it was designed to work with Apache Spark and
> to provide capabilities for interactive applications and data science
> tools not provided by Spark itself.
>
> The Spark-Kernel also has an affinity with Jupyter (jupyter.org) because
> it uses the Jupyter protocol for communications, and so Jupyter Notebooks
> can directly use the Spark-Kernel as a kernel for communicating with
> Apache Spark. However, we believe that the Spark-Kernel provides a
> general-purpose mechanism enabling a wider variety of applications than
> just Notebooks to access Spark, and so the Spark-Kernel’s greatest
> affinity is with Apache and Apache Spark.
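
To make the Jupyter protocol point above concrete, here is a minimal sketch of the kind of "execute_request" message an application might build to have a code snippet evaluated. It is illustrative only: the function name make_execute_request, the "app" username, and the sample Scala snippet are assumptions for this example, not part of the Spark-Kernel client API, and the ZeroMQ framing and HMAC signing that the protocol uses on the wire are left out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="app"):
    """Build a Jupyter-style execute_request message as a plain dict.

    Hypothetical helper for illustration; it only assembles the message
    body and does not send anything to a kernel.
    """
    header = {
        "msg_id": str(uuid.uuid4()),       # unique per message
        "session": session_id,             # shared by all messages in a session
        "username": username,              # assumed placeholder value
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.0",                  # messaging protocol version (assumed)
    }
    content = {
        "code": code,                      # the snippet the kernel should run
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
    }
    return {
        "header": header,
        "parent_header": {},               # empty: this message starts a request
        "metadata": {},
        "content": content,
    }


if __name__ == "__main__":
    # The snippet is sample Scala text; nothing is executed here.
    msg = make_execute_request("sc.parallelize(1 to 10).sum()",
                               session_id=str(uuid.uuid4()))
    print(json.dumps(msg, indent=2))
```

A kernel replies with messages whose parent_header copies the request header, which is how a client can match results and streamed output back to the snippet that produced them.
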
>
> == Known Risks ==
> === Orphaned products ===
> We believe the Spark-Kernel project has a low risk of abandonment due to
> interest in its continuing existence from several parties. More
> specifically, the Spark-Kernel provides a capability that is not provided
> by Apache Spark today and that enables a wider range of applications to
> leverage Spark. For example, IBM uses (or is considering using) the
> Spark-Kernel in several offerings including its IBM Analytics for Apache
> Spark product in the Bluemix Cloud. There are also a couple of other
> commercial users who are using it or considering its use in their offerings.
> Furthermore, Jupyter Notebooks are used by data scientists and Spark is
> gaining popularity as an analytic engine for them. Jupyter Notebooks are
> very easily enabled with the Spark-Kernel and so there is another
> constituency for it.
>
> === Inexperience with Open Source ===
> The Spark-Kernel project has been running as an open-source project
> (albeit with only IBM committers) for the past several months. The project
> has an active issue tracker and, given the interest indicated by the
> nature and volume of requests and comments, the team has publicly stated
> that it is beginning to build a process for accepting third-party
> contributions to the project.
>
> === Relationships with Other Apache Products ===
> The Spark-Kernel has a clear affinity with the Apache Spark project
> because it is designed to provide capabilities for interactive
> applications and data science tools not provided by Spark itself. The
> Spark-Kernel can be a back-end for the Zeppelin project currently
> incubating at Apache. There is interest within the Spark-Kernel community in
> developing this capability, and an experimental branch has been started.
>
> === Homogeneous Developers ===
> The current developers working on Spark-Kernel are all from IBM,
> although the group is in the process of expanding its membership to
> include members of the GitHub community who are not from IBM and who have
> been active in the Spark-Kernel community on GitHub.
>
> === Reliance on Salaried Developers ===
> The initial committers are full-time employees at IBM although not all
> work on the project full-time.
>
> === Excessive Fascination with the Apache Brand ===
> We believe the Spark-Kernel benefits Apache Spark application developers,
> and we are interested in an Apache Spark-Kernel project to benefit these
> developers by engaging a larger community, facilitating closer ties with
> the existing Spark project, and yes, gaining more visibility for the
> Spark-Kernel as a solution.
>
> We have recently become aware that the project name “Spark-Kernel” may be
> interpreted as having an association with an Apache project. If the
> project is accepted by Apache, we suggest the project name remains the
> same, but otherwise we will change it to one that does not imply any
> Apache association.
>
> === Documentation ===
> Comprehensive documentation, including “Getting Started”, API
> specifications and a Roadmap, is available from the GitHub project; see
> https://github.com/ibm-et/spark-kernel/wiki.
>
> === Initial Source ===
> The source code resides at https://github.com/ibm-et/spark-kernel.
>
> === External Dependencies ===
> The Spark-Kernel depends upon a number of Apache projects:
>  * Spark
>  * Hadoop
>  * Ivy
>  * Commons
>
> The Spark-Kernel also depends upon a number of other open source projects:
>  * JeroMQ (LGPL with Static Linking Exception,
> http://zeromq.org/area:licensing)
>  * Akka (MIT)
>  * JOpt Simple (MIT)
>  * Spring Framework Core (Apache v2)
>  * Play (Apache v2)
>  * SLF4J (MIT)
>  * Scala
>  * Scalatest (Apache v2)
>  * Scalactic (Apache v2)
>  * Mockito (MIT)
>
> == Required Resources ==
> Developer and user mailing lists
>  * private@spark-kernel.incubator.apache.org (with moderated
> subscriptions)
>  * commits@spark-kernel.incubator.apache.org
>  * dev@spark-kernel.incubator.apache.org
>  * users@spark-kernel.incubator.apache.org
>
> A git repository:
> https://git-wip-us.apache.org/repos/asf/incubator-spark-kernel.git
>
> A JIRA issue tracker: https://issues.apache.org/jira/browse/SPARK-KERNEL
>
> == Initial Committers ==
> The initial list of committers is:
>  * Leugim Bustelo (gino@bustelos.com)
>  * Jakob Odersky (jodersky@gmail.com)
>  * Luciano Resende (lresende@apache.org)
>  * Robert Senkbeil (chip.senkbeil@gmail.com)
>  * Corey Stubbs (cas5542@gmail.com)
>  * Miao Wang (wm624@hotmail.com)
>  * Sean Welleck (wellecks@gmail.com)
>
> === Affiliations ===
> All of the initial committers are employed by IBM.
>
> == Sponsors ==
> === Champion ===
>  * Sam Ruby (IBM)
>
> === Nominated Mentors ===
>  * Luciano Resende
>
> We wish to recruit additional mentors during incubation.
>
> === Sponsoring Entity ===
> The Apache Incubator.
>


-- 
Julien