Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2017/05/12 23:31:04 UTC

[jira] [Comment Edited] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

    [ https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008898#comment-16008898 ] 

Weichen Xu edited comment on SPARK-20504 at 5/12/17 11:30 PM:
--------------------------------------------------------------

I took the following steps to check this QA issue, and I attach some output logs here. I skipped the `mllib` package, which is deprecated:


1) Used `jar -tf` to list the classes in the `ml` package for both the master and 2.1.1 builds, then used `grep` to filter out nested classes (those whose names contain `$`; a name that merely ends in `$`, however, may be a Scala `object` and should be kept). A minimal sketch of this step is below.
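
A minimal sketch of this step (the jar file names are illustrative, not the real build artifacts):
-------------------
# List all classes under the `ml` package and strip the `.class` suffix.
# Drop nested classes (a `$` followed by more characters) but keep names
# that merely end in `$`, since those are Scala objects.
jar -tf spark-mllib_2.11-master.jar \
  | grep '^org/apache/spark/ml/.*\.class$' \
  | sed 's/\.class$//; s#/#.#g' \
  | grep -v '\$.' \
  | sort > master_ml_classes.txt
# Repeat with the 2.1.1 jar to produce v2.1.1_ml_classes.txt.
-------------------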


2) Extracted the classes that exist in both the master and 2.1.1 builds, obtained their signatures with `javap -protected -s`, and compared the two sets with `diff`. I then manually checked each difference against the corresponding Scala and Java docs for consistency and potential incompatibilities. A sketch of this step follows.
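
A sketch of this step, reusing the class lists from step 1 (`$CP_MASTER` and `$CP_211` are hypothetical classpath variables pointing at the two builds):
-------------------
# Classes present in both builds (comm expects sorted input,
# which step 1 already produced).
comm -12 master_ml_classes.txt v2.1.1_ml_classes.txt > common_ml_classes.txt

# Dump the signatures of every common class from each build, then diff.
while read cls; do javap -protected -s -cp "$CP_MASTER" "$cls"; done \
  < common_ml_classes.txt > master_signatures.txt
while read cls; do javap -protected -s -cp "$CP_211" "$cls"; done \
  < common_ml_classes.txt > v2.1.1_signatures.txt
diff v2.1.1_signatures.txt master_signatures.txt
-------------------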


3) Extracted the classes added after version 2.1.1; these classes are:
-------------------
org.apache.spark.ml.classification.LinearSVC
org.apache.spark.ml.classification.LinearSVC$
org.apache.spark.ml.classification.LinearSVCAggregator
org.apache.spark.ml.classification.LinearSVCCostFun
org.apache.spark.ml.classification.LinearSVCModel
org.apache.spark.ml.classification.LinearSVCModel$
org.apache.spark.ml.classification.LinearSVCParams
org.apache.spark.ml.clustering.ExpectationAggregator
org.apache.spark.ml.feature.Imputer
org.apache.spark.ml.feature.Imputer$
org.apache.spark.ml.feature.ImputerModel
org.apache.spark.ml.feature.ImputerModel$
org.apache.spark.ml.feature.ImputerParams
org.apache.spark.ml.fpm.AssociationRules
org.apache.spark.ml.fpm.AssociationRules$
org.apache.spark.ml.fpm.FPGrowth
org.apache.spark.ml.fpm.FPGrowth$
org.apache.spark.ml.fpm.FPGrowthModel
org.apache.spark.ml.fpm.FPGrowthModel$
org.apache.spark.ml.fpm.FPGrowthParams
org.apache.spark.ml.r.BisectingKMeansWrapper
org.apache.spark.ml.r.BisectingKMeansWrapper$
org.apache.spark.ml.recommendation.TopByKeyAggregator
org.apache.spark.ml.r.FPGrowthWrapper
org.apache.spark.ml.r.FPGrowthWrapper$
org.apache.spark.ml.r.LinearSVCWrapper
org.apache.spark.ml.r.LinearSVCWrapper$
org.apache.spark.ml.source.libsvm.LibSVMOptions
org.apache.spark.ml.source.libsvm.LibSVMOptions$
org.apache.spark.ml.stat.ChiSquareTest
org.apache.spark.ml.stat.ChiSquareTest$
org.apache.spark.ml.stat.Correlation
org.apache.spark.ml.stat.Correlation$
------------------
For these classes, I used `javap -s` to get their signatures and also manually checked their corresponding Scala and Java docs.
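
For example, continuing the sketch above (same hypothetical file names and classpath variable):
-------------------
# Classes present only in master, i.e. added after 2.1.1.
comm -13 v2.1.1_ml_classes.txt master_ml_classes.txt > added_ml_classes.txt
# Dump their signatures for manual review against the generated docs.
while read cls; do javap -s -cp "$CP_MASTER" "$cls"; done \
  < added_ml_classes.txt > added_class_signatures.txt
-------------------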


After checking everything listed above, I found no problems related to Java compatibility.
The only minor issue is that for classes marked `private` in the Scala code, the `private` modifier seems to be lost when compiled into bytecode, so `javap` reports them as `public` classes and the generated Java docs include them as well. These classes include `***Aggregator`, `***CostFun`, and so on, but I think this is a problem the Scala compiler needs to resolve. See the illustration below.
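
For illustration, a check one might run (`$CP_MASTER` is the same hypothetical classpath variable as in the sketches above):
-------------------
# LinearSVCAggregator is declared `private` in the Scala source, yet the
# compiled class file carries public access, so javap prints something like
#   public class org.apache.spark.ml.classification.LinearSVCAggregator ...
javap -cp "$CP_MASTER" org.apache.spark.ml.classification.LinearSVCAggregator
-------------------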


I attach the processing script I wrote and some intermediate output files for further checking:
1) the processing script
2) the class and method signature diff between 2.1.1 and master, for `ml` classes that exist in both versions
3) the class and method signatures of the `ml` classes added after version 2.1.1
4) the classes that exist in both master and 2.1.1
5) the classes added after version 2.1.1



> ML 2.2 QA: API: Java compatibility, docs
> ----------------------------------------
>
>                 Key: SPARK-20504
>                 URL: https://issues.apache.org/jira/browse/SPARK-20504
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, Java API, ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Weichen Xu
>            Priority: Blocker
>             Fix For: 2.2.0
>
>         Attachments: 1_process_script.sh, 2_signature.diff, 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not be understood in Java, or they may be accessible only via the weirdly named Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API).
> * If needed for complex issues, create small Java unit tests which execute each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are no great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make this task easier in the future!


