Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2016/11/02 23:48:25 UTC

Blocked PySpark changes

Hi Spark Developers & Maintainers,

I know we've been talking a lot about what changes we want in
PySpark to help keep it interesting and usable (see
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html).
One underlying challenge that we haven't explicitly discussed is that much
of the slow pace of PySpark development comes from the lack of dedicated
Python reviewers.

For changes which are based around parity with an existing component,
Python contributors like myself can sometimes get reviewers from the
component (like ML) to take a look at our Python changes - but for core
changes it's even harder to get reviewers.

The general Python PR review dashboard
<https://spark-prs.appspot.com/#python> shows a number of PRs
languishing - but to specifically call out a few:

   - pip installability - https://github.com/apache/spark/pull/15659
   - KMeans summary in Python - https://github.com/apache/spark/pull/13557
   - The various Anaconda/Virtualenv support PRs (none of them have had any
     luck with committer bandwidth)
   - PySpark ML models should have params - finally starting to get
     committer review, but blocked for months
     ( https://github.com/apache/spark/pull/14653 )
   - Python meta algorithms in Scala -
     https://github.com/apache/spark/pull/13794 (out of sync with master, but
     it has been waiting for months for a committer to say whether they are
     interested in the feature or not)


For those following a lot of Python JIRAs, you've probably also noticed
many Python-related JIRAs being re-targeted for future versions, and those
targets keep getting bumped back.

The lack of core Python reviewers will make things like Arrow integration
difficult to achieve unless the situation changes.

This isn't meant to say that the current Python reviewers aren't good -
there just isn't enough Python committer bandwidth available to move these
things forward. The normal solution to this is adding more committers with
that focus area.

I'd love to hear y'all's thoughts on this.

Cheers,

Holden :)


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau