Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2018/11/26 10:09:00 UTC

[jira] [Created] (SPARK-26172) Unify String Params' case-insensitivity in ML

zhengruifeng created SPARK-26172:
------------------------------------

             Summary: Unify String Params' case-insensitivity in ML
                 Key: SPARK-26172
                 URL: https://issues.apache.org/jira/browse/SPARK-26172
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng


Currently, string params in ML handle case-insensitivity in three different ways:

1. support case-insensitivity, e.g. {{LogisticRegression}};

2. support case-insensitivity, but with the getter returning the lower-case value (not the value passed to the setter), e.g. {{ALS}}, {{DecisionTreeClassifier}};

3. do not support case-insensitivity, e.g. {{NaiveBayes}}.

 

This inconsistency makes the params confusing to use.
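
To illustrate the difference, here is a minimal, Spark-free sketch of the three behaviours. The class names ({{KeepOriginal}}, {{StoreLowerCase}}, {{ExactMatchOnly}}) and the allowed values are made up for illustration; they are not the actual ML implementations:

```scala
import java.util.Locale

// Toy stand-ins (NOT Spark classes) for the three behaviours listed above.
object CaseHandlingSketch {
  // Hypothetical set of allowed values, all lower case.
  private val supported = Set("auto", "binomial", "multinomial")

  private def lower(s: String): String = s.toLowerCase(Locale.ROOT)

  // Way 1: validate case-insensitively, keep the value exactly as passed.
  final class KeepOriginal {
    private var family = "auto"
    def setFamily(value: String): this.type = {
      require(supported.contains(lower(value)), s"Unsupported family: $value")
      family = value
      this
    }
    def getFamily: String = family
  }

  // Way 2: validate case-insensitively, but store the lower-cased value,
  // so the getter no longer returns what the caller passed in.
  final class StoreLowerCase {
    private var family = "auto"
    def setFamily(value: String): this.type = {
      require(supported.contains(lower(value)), s"Unsupported family: $value")
      family = lower(value)
      this
    }
    def getFamily: String = family
  }

  // Way 3: exact match only; "Binomial" is rejected.
  final class ExactMatchOnly {
    private var family = "auto"
    def setFamily(value: String): this.type = {
      require(supported.contains(value), s"Unsupported family: $value")
      family = value
      this
    }
    def getFamily: String = family
  }
}
```

With this sketch, passing "Binomial" to the setter round-trips unchanged in the first class, comes back lower-cased from the second, and is rejected by the third.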

I think we should adopt the *first* way and support case-insensitivity for all string params that are not column names, including:
 * LogisticRegression: family
 * MultilayerPerceptronClassifier: solver
 * NaiveBayes: modelType
 * DecisionTreeClassifier: impurity
 * RandomForestClassifier: featureSubsetStrategy, impurity
 * GBTClassifier: featureSubsetStrategy, impurity, lossType
 * LinearRegression: solver, loss
 * GeneralizedLinearRegression: family, link, solver
 * DecisionTreeRegressor: impurity
 * RandomForestRegressor: featureSubsetStrategy, impurity
 * GBTRegressor: featureSubsetStrategy, impurity, lossType
 * KMeans: initMode
 * LDA: optimizer
 * PowerIterationClustering: initMode
 * ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
 * Bucketizer: handleInvalid
 * ChiSqSelector: selectorType
 * Imputer: strategy
 * QuantileDiscretizer: handleInvalid
 * RFormula: handleInvalid, stringIndexerOrderType
 * StringIndexer: handleInvalid, stringOrderType
 * VectorAssembler: handleInvalid
 * VectorIndexer: handleInvalid
 * VectorSizeHint: handleInvalid
 * OneHotEncoderEstimator: handleInvalid (*this will be left alone until the breaking change*)
 * BinaryClassificationEvaluator: metricName
 * MulticlassClassificationEvaluator: metricName
 * RegressionEvaluator: metricName
 * ClusteringEvaluator: metricName, distanceMeasure

To do this:
 * methods {{lowerCaseInArray}} and {{upperCaseInArray}} are created in {{ParamValidators}} to validate values case-insensitively;
 * methods {{$$(param: Param[String])}} and {{%%(param: Param[String])}} are created in trait {{Params}} to conveniently get the lower-/upper-cased param value. This minimizes the modifications to existing code, since in many cases we only need to change {{$(param)}} to {{$$(param)}};
 * in {{SharedParamsCodeGen}}, {{handleInvalid}} and {{distanceMeasure}} are updated to use {{lowerCaseInArray}}.
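
A rough sketch of how these helpers could look, written without any Spark dependency. {{StringParam}} and {{ParamHolder}} are simplified, hypothetical stand-ins for {{Param[String]}} and the {{Params}} trait, and the method bodies are an assumption about the intended behaviour rather than the final patch:

```scala
import java.util.Locale

// Hypothetical, Spark-free sketch; the real Param, Params and ParamValidators
// live in org.apache.spark.ml.param and are more involved than this.
object CaseInsensitiveParamsSketch {
  // Validator: accept a value if its lower-cased form is in `allowed`.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Validator: accept a value if its upper-cased form is in `allowed`.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))

  // Minimal stand-in for a string param with an attached validator.
  final case class StringParam(name: String, isValid: String => Boolean)

  // Minimal stand-in for the Params trait.
  class ParamHolder {
    private val values = scala.collection.mutable.Map.empty[String, String]

    def set(param: StringParam, value: String): this.type = {
      require(param.isValid(value), s"${param.name} given invalid value $value")
      values(param.name) = value // stored exactly as passed by the caller
      this
    }

    // Like the existing $(param): the value exactly as it was set.
    def $(param: StringParam): String = values(param.name)

    // Proposed $$(param): the value lower-cased, for internal matching.
    def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)

    // Proposed %%(param): the value upper-cased, for internal matching.
    def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
  }
}
```

The point of this design is that the stored value is never normalized, so the getter keeps returning what the user passed, while internal code that matches on the value switches from {{$(param)}} to {{$$(param)}} or {{%%(param)}}.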

 

 


