You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/07 17:47:18 UTC

[GitHub] [spark] huaxingao commented on issue #27841: [SPARK-31077][ML] Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel

huaxingao commented on issue #27841: [SPARK-31077][ML] Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel
URL: https://github.com/apache/spark/pull/27841#issuecomment-596120093
 
 
   @srowen Thank you very much for your review!
   
   I am trying to remove the dependency on mllib.ChiSqSelectorModel. After the change, ChiSqSelector.fit, ChiSqSelectorModel  constructor and implementation are almost the same as the FValueSelector implementation, then I will extract all the common code between ```ChiSqSelector``` and ```FValueSelector``` to put in an abstract ```Selector```. 
   
   So basically here is the final goal: 
   
   ```
   abstract class Selector[T <: SelectorModel[T]]
   it contains all the params, fit method + 2 abstract methods:
   
   // Compute the statistics information (pValue + chisq statistics/fvalue)
   protected[this] def getSelectionTestResult(dataset: Dataset[_]): Array[SelectionTestResult]
   
   // Create SelectorModel (could be ChiSqModel or FValueModel or ANOVAModel)
   protected[this] def createSelectorModel(
         uid: String,
         statistics: Array[Double]): T
   ```
   
   Then I will have three concrete Selector classes:
   ```
   1. 
   // categorical features and categorical labels.
   class ChiSqSelector extends Selector[ChiSqSelectorModel]
   implement getSelectionTestResult to return pValue and chisq stats
   implement createSelectorModel to create a ChiSqSelectorModel
   
   2.
   // continuous features and continuous labels
   class FValueSelector extends Selector[FValueSelectorModel]
   implement getSelectionTestResult to return pValue and f regression value
   implement FValueSelector to create a FValueSelectorModel
   
   3.
   // continuous features and categorical labels
   class ANOVASelector extends Selector[ANOVASelectorModel]
   implement getSelectionTestResult to return pValue and ANOVA fvalue
   implement ANOVASelector to create a ANOVASelectorModel
   ```
   
   The initial PR is https://github.com/apache/spark/pull/27527. I broke it to several small PRs so it will be easier to review:
   
   1. add stat.FValueRegressionTest (https://github.com/apache/spark/pull/27623)
   2. add feature.FValueSelector (https://github.com/apache/spark/pull/27679)
   3. this PR to remove ChiSqSelector dependency on mllib.ChiSqSelectorModel. 
   4. extract all the common code to abstract Selector. Make FValueSelector and ChiSqSelector implement abstract Selector.
   5. Add ANOVASelector (the implementation of ANOVA is not in https://github.com/apache/spark/pull/27527)
   
   
   Maybe I should combine step 3 and 4 in one PR?
   
   For the binary test files changes: I initially planned to exposed pValue and chisq statistics in ChisqSelectorModel so the constructor is like this:
   ```
   final class ChiSqSelectorModel private[ml] (
       @Since("1.6.0") override val uid: String,
       override val selectedFeatures: Array[Int],
       @Since("3.1.0") override val pValues: Array[Double],
       @Since("3.1.0")override val statistic: Array[Double])
   ```
   Then when I load the model, I will need to do this:
   ```
         val model = if (majorVersion < 3 || (majorVersion == 3 && minorVersion < 1)) {
           // model prior to 3.1.0
           val data = df.select("selectedFeatures").head()
           ...
         } else {
           val data = df.select("selectedFeatures", "pValue", "statistics").head()
          ...
         }
   ```
   So I have the binary test files to test the prior 3.1.0 model. But then I wasn't so sure if I need to expose pValue and chisq statistics, and I decided to rethink this later in the final step when I do the abstract Selector. I changed the constructor and reader code but forgot to remove the binary test files. I will clean up now. Sorry my mistake confused you. 
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org