Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2018/11/26 10:09:00 UTC
[jira] [Created] (SPARK-26172) Unify String Params' case-insensitivity in ML
zhengruifeng created SPARK-26172:
------------------------------------
Summary: Unify String Params' case-insensitivity in ML
Key: SPARK-26172
URL: https://issues.apache.org/jira/browse/SPARK-26172
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng
Currently, ML handles case-insensitivity of string params in three different ways:
1. case-insensitive, e.g. {{LogisticRegression}};
2. case-insensitive, but with the getter returning the lower-case value (not the value passed to the setter), e.g. {{ALS}}, {{DecisionTreeClassifier}};
3. case-sensitive, i.e. case-insensitivity is not supported, e.g. {{NaiveBayes}}.
This inconsistency causes confusion in usage.
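The three behaviours can be sketched in plain Scala as follows. This is only an illustration: the classes {{KeepOriginal}}, {{StoreLowerCase}} and {{CaseSensitive}} and the option set are hypothetical stand-ins, not Spark's actual implementations.

```scala
import java.util.Locale

// Hypothetical stand-ins for the three behaviours; not Spark classes.
object CaseHandlingSketch {
  // Illustrative option set, stored in lower case.
  private val allowed = Set("auto", "binomial", "multinomial")

  // Way 1: validate case-insensitively, keep the value exactly as passed.
  class KeepOriginal {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v.toLowerCase(Locale.ROOT)), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }

  // Way 2: validate case-insensitively, but store the lower-cased value,
  // so the getter no longer returns what the caller passed in.
  class StoreLowerCase {
    private var value = "auto"
    def set(v: String): this.type = {
      val lower = v.toLowerCase(Locale.ROOT)
      require(allowed.contains(lower), s"unsupported value: $v")
      value = lower
      this
    }
    def get: String = value
  }

  // Way 3: case-sensitive validation; "Binomial" is rejected.
  class CaseSensitive {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }
}
```

With these stand-ins, setting "Binomial" is returned unchanged by the first class, comes back as "binomial" from the second, and fails with an IllegalArgumentException in the third.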
I think we should choose the *first* way to support case-insensitivity of all non-columnName string params, including:
* LogisticRegression: family
* MultilayerPerceptronClassifier: solver
* NaiveBayes: modelType
* DecisionTreeClassifier: impurity
* RandomForestClassifier: featureSubsetStrategy, impurity
* GBTClassifier: featureSubsetStrategy, impurity, lossType
* LinearRegression: solver, loss
* GeneralizedLinearRegression: family, link, solver
* DecisionTreeRegressor: impurity
* RandomForestRegressor: featureSubsetStrategy, impurity
* GBTRegressor: featureSubsetStrategy, impurity, lossType
* KMeans: initMode
* LDA: optimizer
* PowerIterationClustering: initMode
* ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
* Bucketizer: handleInvalid
* ChiSqSelector: selectorType
* Imputer: strategy
* QuantileDiscretizer: handleInvalid
* RFormula: handleInvalid, stringIndexerOrderType
* StringIndexer: handleInvalid, stringOrderType
* VectorAssembler: handleInvalid
* VectorIndexer: handleInvalid
* VectorSizeHint: handleInvalid
* OneHotEncoderEstimator: handleInvalid (*this will be left alone until the breaking change*)
* BinaryClassificationEvaluator: metricName
* MulticlassClassificationEvaluator: metricName
* RegressionEvaluator: metricName
* ClusteringEvaluator: metricName, distanceMeasure
To do this:
* methods {{lowerCaseInArray}} and {{upperCaseInArray}} are created in {{ParamValidators}} to validate values case-insensitively;
* methods {{$$(param: Param[String])}} and {{%%(param: Param[String])}} are created in trait {{Params}} to conveniently lower-case/upper-case the param value; this minimizes the modifications to existing code, since in many cases we only need to change {{$(param)}} to {{$$(param)}};
* in {{SharedParamsCodeGen}}, {{handleInvalid}} and {{distanceMeasure}} are updated to use {{lowerCaseInArray}}.
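A minimal sketch of how these helpers might fit together, written in plain Scala without a Spark dependency. Only the names {{lowerCaseInArray}}, {{upperCaseInArray}}, {{$$}} and {{%%}} come from the proposal above; their signatures here are guesses, and {{ParamValidatorsSketch}}, {{StringParam}}, {{ParamsSketch}} and {{NaiveBayesLike}} are hypothetical stand-ins for the real {{ParamValidators}}, {{Param[String]}} and {{Params}}.

```scala
import java.util.Locale

// Hypothetical stand-in for ParamValidators; signatures are assumptions.
object ParamValidatorsSketch {
  // Accepts a value if its lower-cased form is in `allowed`
  // (`allowed` is expected to contain lower-case options).
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Same idea for options that are conventionally upper-case.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))
}

// Hypothetical, much-simplified stand-in for Param[String].
final case class StringParam(name: String, isValid: String => Boolean)

// Hypothetical, much-simplified stand-in for trait Params.
trait ParamsSketch {
  private val values = scala.collection.mutable.Map.empty[String, String]

  protected def set(param: StringParam, value: String): this.type = {
    require(param.isValid(value), s"${param.name} given invalid value $value")
    values(param.name) = value // stored exactly as passed (the "first way")
    this
  }

  // Mirrors the existing $(param) shorthand: the value as set.
  protected def $(param: StringParam): String = values(param.name)
  // Proposed helpers: lower-/upper-cased view of the value for internal use.
  protected def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)
  protected def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
}

// Illustrative estimator-like class; the option names are only examples.
class NaiveBayesLike extends ParamsSketch {
  val modelType: StringParam = StringParam(
    "modelType",
    ParamValidatorsSketch.lowerCaseInArray(Array("multinomial", "bernoulli")))

  def setModelType(v: String): this.type = set(modelType, v)
  def getModelType: String = $(modelType)                  // value as passed
  def isBernoulli: Boolean = $$(modelType) == "bernoulli"  // internal comparison
}
```

Under these assumptions, the getter keeps returning what the user passed, while internal comparisons that previously read {{$(param)}} switch to {{$$(param)}} and see the normalized value.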