Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/03/16 22:54:06 UTC
[jira] [Updated] (SPARK-28958) pyspark.ml function parity
[ https://issues.apache.org/jira/browse/SPARK-28958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-28958:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> pyspark.ml function parity
> --------------------------
>
> Key: SPARK-28958
> URL: https://issues.apache.org/jira/browse/SPARK-28958
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Priority: Major
> Attachments: ML_SYNC.pdf
>
>
> I looked into the hierarchy of both the Python and Scala sides, and found that they are quite different, which damages parity and makes the codebase hard to maintain.
> The main inconvenience is that most models in pyspark do not support any param getters or setters.
> In the py side, I think we need to do:
> 1, remove setters generated by _shared_params_code_gen.py;
> 2, add common abstract classes like the scala side, such as JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;
> 3, for each alg, add its param trait, such as LinearSVCParams;
> 4, since the shared params do not have setters, we need to add setters in the right places;
> Unfortunately, I notice that if we do 1 (remove setters generated by _shared_params_code_gen.py), all algs (classification/regression/clustering/features/fpm/recommendation) need to be modified in one batch.
> The scala side also needs some small improvements, but I think they can be left alone at first
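The layered structure proposed in the description can be sketched in plain Python. This is a simplified stand-in, not the actual pyspark.ml code: `Param`, `Params`, `HasMaxIter`, `HasRegParam`, `_LinearSVCParams`, `LinearSVC`, and `LinearSVCModel` here are minimal mock-ups that only illustrate the intended shape, where shared-param mixins carry getters only, a per-algorithm params trait groups them, and setters are added on the estimator and model where they belong.

```python
# Simplified sketch (NOT the real pyspark.ml classes) of the proposed
# hierarchy: shared-param mixins expose only getters (steps 1 and 4),
# a per-algorithm param trait groups them (step 3), and both the
# estimator and the model mix that trait in (step 2).

class Param:
    """Minimal stand-in for a pyspark.ml Param descriptor."""
    def __init__(self, name, doc):
        self.name = name
        self.doc = doc

class Params:
    """Minimal stand-in for the pyspark.ml Params base class."""
    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        self._paramMap.update(kwargs)
        return self

    def getOrDefault(self, name):
        return self._paramMap.get(name)

class HasMaxIter(Params):
    """Shared-param mixin: getter only, no generated setter."""
    maxIter = Param("maxIter", "max number of iterations (>= 0).")

    def getMaxIter(self):
        return self.getOrDefault("maxIter")

class HasRegParam(Params):
    """Shared-param mixin: getter only, no generated setter."""
    regParam = Param("regParam", "regularization parameter (>= 0).")

    def getRegParam(self):
        return self.getOrDefault("regParam")

class _LinearSVCParams(HasMaxIter, HasRegParam):
    """Per-algorithm param trait, shared by estimator and model."""
    pass

class LinearSVC(_LinearSVCParams):
    """Estimator: setters are defined here, in the right place."""
    def setMaxIter(self, value):
        return self._set(maxIter=value)

    def setRegParam(self, value):
        return self._set(regParam=value)

class LinearSVCModel(_LinearSVCParams):
    """Model: inherits the same getters, so params stay readable."""
    pass

svc = LinearSVC().setMaxIter(100).setRegParam(0.01)
print(svc.getMaxIter(), svc.getRegParam())

model = LinearSVCModel()._set(maxIter=100)
print(model.getMaxIter())
```

Because both `LinearSVC` and `LinearSVCModel` inherit the same `_LinearSVCParams` trait, the model gets every getter for free, which is exactly the parity gap the issue describes on the Python side.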
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org