Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2018/11/26 10:16:00 UTC
[jira] [Assigned] (SPARK-26172) Unify String Params' case-insensitivity in ML
[ https://issues.apache.org/jira/browse/SPARK-26172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-26172:
------------------------------------
Assignee: (was: Apache Spark)
> Unify String Params' case-insensitivity in ML
> ---------------------------------------------
>
> Key: SPARK-26172
> URL: https://issues.apache.org/jira/browse/SPARK-26172
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Priority: Major
>
> For now, there are three ways of dealing with case-insensitivity in ML:
> 1. support case-insensitivity, e.g. {{LogisticRegression}};
> 2. support case-insensitivity, but with the getter returning the lower-case value (not the value passed to the setter), e.g. {{ALS}}, {{DecisionTreeClassifier}};
> 3. do not support case-insensitivity, e.g. {{NaiveBayes}}.
>
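The three behaviours listed above can be contrasted with a small, self-contained sketch. None of the classes below are Spark classes: `KeepsOriginal`, `StoresLowerCase`, `CaseSensitive` and the allowed values are hypothetical stand-ins, and the sketch assumes that the first way returns the value exactly as it was passed to the setter.

```scala
import java.util.Locale

// Hypothetical stand-ins for the three behaviours; these are NOT Spark classes.
object CaseHandlingSketch {
  // Illustrative option set, chosen only for the example.
  private val allowed = Set("auto", "binomial", "multinomial")

  // Way 1: any casing is accepted; the getter returns what the caller set.
  final class KeepsOriginal {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v.toLowerCase(Locale.ROOT)), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }

  // Way 2: any casing is accepted, but the lower-cased form is stored and returned.
  final class StoresLowerCase {
    private var value = "auto"
    def set(v: String): this.type = {
      val lower = v.toLowerCase(Locale.ROOT)
      require(allowed.contains(lower), s"unsupported value: $v")
      value = lower
      this
    }
    def get: String = value
  }

  // Way 3: only the exact spelling is accepted.
  final class CaseSensitive {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }
}
```

With these stand-ins, setting "Binomial" round-trips unchanged in the first class, comes back lower-cased from the second, and is rejected with an IllegalArgumentException by the third.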
> This situation results in confusing usage.
> I think we should choose the *first* way to support case-insensitivity of all non-columnName string params, including:
> * LogisticRegression: family
> * MultilayerPerceptronClassifier: {{solver}}
> * NaiveBayes: modelType
> * DecisionTreeClassifier: impurity
> * RandomForestClassifier: featureSubsetStrategy, impurity
> * GBTClassifier: featureSubsetStrategy, impurity, {{lossType}}
>
> * LinearRegression: solver, loss
> * GeneralizedLinearRegression: family, link, solver
> * DecisionTreeRegressor: impurity
> * RandomForestRegressor: featureSubsetStrategy, impurity
> * GBTRegressor: featureSubsetStrategy, impurity, {{lossType}}
>
> * KMeans: initMode
> * LDA: optimizer
> * PowerIterationClustering: initMode
>
> * ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
>
> * Bucketizer: handleInvalid
> * ChiSqSelector: selectorType
> * Imputer: strategy
> * QuantileDiscretizer: handleInvalid
> * RFormula: handleInvalid, stringIndexerOrderType
> * StringIndexer: handleInvalid, stringOrderType
> * VectorAssembler: handleInvalid
> * VectorIndexer: handleInvalid
> * VectorSizeHint: handleInvalid
> * OneHotEncoderEstimator: handleInvalid (*this will be left alone until the breaking change*)
>
> * BinaryClassificationEvaluator: metricName
> * MulticlassClassificationEvaluator: metricName
> * RegressionEvaluator: metricName
> * ClusteringEvaluator: metricName, distanceMeasure
>
>
>
> To do this:
> * methods {{lowerCaseInArray}} and {{upperCaseInArray}} are created in {{ParamValidators}} to check case-insensitivity;
> * methods {{$$(param: Param[String])}} and {{%%(param: Param[String])}} are created in trait {{Params}} to conveniently lower-case/upper-case the param value; this minimizes the modifications to existing code, since in many cases we only need to change {{$(param)}} to {{$$(param)}};
> * in *SharedParamsCodeGen*, *handleInvalid* and *distanceMeasure* are updated to use {{lowerCaseInArray}}.
>
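A minimal sketch of how the proposed helpers might fit together, written without any Spark dependency. The names `lowerCaseInArray`, `upperCaseInArray`, `$$` and `%%` come from the description above; their signatures here are assumptions, and `StringParam`, `Validators`, `ParamsLike` and `ToyEstimator` are hypothetical stand-ins for `Param[String]`, `ParamValidators`, `Params` and a real estimator.

```scala
import java.util.Locale
import scala.collection.mutable

object CaseInsensitiveParamsSketch {
  // Hypothetical stand-in for Param[String]: a name plus a validator.
  final case class StringParam(name: String, isValid: String => Boolean)

  // Hypothetical stand-in for ParamValidators; signatures are assumed.
  object Validators {
    // Accepts a value if its lower-cased form is one of the allowed options.
    def lowerCaseInArray(allowed: Array[String]): String => Boolean =
      (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

    // Accepts a value if its upper-cased form is one of the allowed options.
    def upperCaseInArray(allowed: Array[String]): String => Boolean =
      (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))
  }

  // Hypothetical stand-in for the Params trait.
  trait ParamsLike {
    private val values = mutable.Map.empty[String, String]

    protected def set(param: StringParam, value: String): this.type = {
      require(param.isValid(value), s"${param.name}: invalid value '$value'")
      values(param.name) = value // stored exactly as passed by the caller
      this
    }

    // Value as set by the caller (throws if the param was never set).
    protected def $(param: StringParam): String = values(param.name)
    // Lower-cased view, for internal comparisons.
    protected def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)
    // Upper-cased view, for internal comparisons.
    protected def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
  }

  // Toy estimator with one case-insensitive string param.
  final class ToyEstimator extends ParamsLike {
    val family: StringParam = StringParam(
      "family",
      Validators.lowerCaseInArray(Array("auto", "binomial", "multinomial")))

    def setFamily(v: String): this.type = set(family, v)
    def getFamily: String = $(family)              // original casing preserved
    def isBinomial: Boolean = $$(family) == "binomial" // compared in lower case
  }
}
```

In this sketch the getter keeps the caller's casing (the first way described above), while internal logic switches from `$(family)` to `$$(family)` wherever it compares against a lower-case constant.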
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)