Posted to issues@spark.apache.org by "Seth Hendrickson (JIRA)" <ji...@apache.org> on 2016/11/03 21:29:00 UTC
[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

    [ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634354#comment-15634354 ] 

Seth Hendrickson edited comment on SPARK-15581 at 11/3/16 9:28 PM:
-------------------------------------------------------------------

I think the points you mention are very important to get right moving forward. We can certainly debate what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on.
- The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have as many active committers right now as SQL might, and the roadmap should reflect that. We need to be realistic.
- We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer-term or have attracted little interest (nice-to-haves), and realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been around for a long time and seem to have attracted little interest. The list should contain only the items that we expect to get done.
- As you say, every item should have a committer linked to it who is capable of merging it. They do not have to be the primary reviewer, but they should have sufficient expertise that they feel comfortable merging it after it has been appropriately reviewed. One interesting example to be wary of is that there seem to be a LOT of tree-related items on the roadmap, but Joseph has traditionally been the only (or at least the main) committer involved in tree-related JIRAs. I don't think it's realistic to target all of these tree improvements when we have limited committers available to review/merge them. We can trim them down to a realistic subset.

I propose a revised roadmap that contains two classifications of items:

1. JIRAs that will be done by the next release
2. JIRAs that will be done at some point before the next major release (e.g. 3.0)

JIRAs that are still up for debate (e.g. adding a factorization machine) should not be on the roadmap. That does not mean they will not get done, but they are not necessarily "planned" for any particular timeframe. IMO this revised roadmap can/will provide a lot more transparency, and appropriately set review expectations. If it's on the list of "will do by next minor release," then contributors should expect it to be reviewed. What does everyone else think?

Also, I took a bit of time to aggregate lists of specific JIRAs that I think fit into the two categories I listed above [here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing] (note: does not contain SparkR items). I am not (necessarily) proposing to move the list to this Google doc, and I understand this is still undergoing discussion. I just wanted to provide an example of what the above might look like.


> MLlib 2.1 Roadmap
> -----------------
>
>                 Key: SPARK-15581
>                 URL: https://issues.apache.org/jira/browse/SPARK-15581
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Blocker
>              Labels: roadmap
>             Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, for we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of MLlib.  However, we will prioritize API parity, bug fixes, and improvements over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a [complete list here| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC]. The tasks fall into two major categories:
> ** Python API for missing methods (SPARK-14813)
> ** Python API for new algorithms. Committers should create a JIRA for the Python API after merging a public feature in Scala/Java.
> h2. SparkR
> * Improve R formula support and implementation (SPARK-15540)
> * Various SparkR ML API and usability improvements
> ** Note: No linked JIRA yet, but need to create an umbrella once more issues are collected.
> * Wrap more MLlib algorithms (SPARK-16442)
> * Release SparkR on CRAN [SPARK-15799]
> h2. Pipeline API
> * Usability: Automatic feature preprocessing [SPARK-11106]
> * ML attribute API improvements (SPARK-8515)
> * test Kaggle datasets (SPARK-9941)
> * See (SPARK-5874) for a list of other possibilities
> h2. Algorithms and performance
> * Trees & ensembles scaling & speed (SPARK-14045), (SPARK-14046), (SPARK-14047)
> * Locality sensitive hashing (LSH) (SPARK-5992)
> * Similarity search / nearest neighbors (SPARK-2336)
> Additional (may be lower priority):
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * local linear algebra (SPARK-6442)
> * weighted instance support (SPARK-9610)
> ** random forest (SPARK-9478)
> ** GBT (SPARK-9612)
> * deep learning (SPARK-5575)
> ** autoencoder (SPARK-10408)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * distributed LU decomposition (SPARK-8514)
> h2. Other
> * Infra
> ** Testing for example code (SPARK-12347)
> ** Remove breeze from dependencies (SPARK-15575)
> * public dataset loader (SPARK-10388)
> * Documentation: improve organization of user guide (SPARK-8517)
> * Python Documentation: expose default values of params in some way (SPARK-15130)



