Posted to dev@spark.apache.org by Chunnan Yao <ya...@gmail.com> on 2015/03/12 08:13:55 UTC

Is this a bug in MLlib.stat.test? About the mapPartitions API used in Chi-Squared test

Hi everyone!
I am currently digging into MLlib in Spark 1.2.1. While reading the code of
MLlib.stat.test, in the file ChiSqTest.scala under
/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I was confused
by the usage of the mapPartitions API in the function

def chiSquaredFeatures(data: RDD[LabeledPoint],
      methodName: String = PEARSON.name): Array[ChiSqTestResult]
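For context, the public entry point that reaches this function is
Statistics.chiSqTest on an RDD[LabeledPoint]. A minimal usage sketch, with
made-up data and assuming a SparkContext named sc:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0))))
// One ChiSqTestResult per feature, testing independence of feature and label.
val results = Statistics.chiSqTest(data)
results.foreach(r => println(r.pValue))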

According to my knowledge of statistical testing, the chi-squared test
requires reasonably large expected counts in its contingency matrix (at
least 5 in 80% of the cells) for the chi-squared approximation to be
accurate (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). The
number of feature and label categories therefore cannot be too large;
otherwise there would be too few items in each cell, violating this
precondition of the chi-squared test.
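To make that concrete, here is a minimal sketch against the public
Statistics.chiSqTest API (the counts are invented for illustration):

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.Statistics

// A 2x2 contingency matrix whose cells all hold at least 5 observations,
// so the asymptotic chi-squared approximation is reasonable.
// Matrices.dense is column-major: column 0 = (30, 20), column 1 = (25, 25).
val observed = Matrices.dense(2, 2, Array(30.0, 20.0, 25.0, 25.0))
val result = Statistics.chiSqTest(observed)
println(result.pValue)

// If the same 100 observations were spread over, say, a 10 x 10 matrix,
// expected counts would average 1 per cell and the approximation would
// break down. That is exactly what too many categories produces.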

I do see that in the function above Spark throws an exception when
distinctLabels.size or distinctFeatures.size exceeds maxCategories, which is
defined as 10000. However, the two HashSets distinctLabels and
distinctFeatures are initialized inside mapPartitions, so Spark only checks
the number of feature and label categories within a single partition. The
reduced result, the contingency matrix, can therefore still exceed that
number of categories, and its small cell counts make the chi-squared
approximation inaccurate. I have written a unit test on this function that
demonstrates the problem.
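A standalone toy sketch of the per-partition blind spot (this is not the
actual ChiSqTest code; it assumes a SparkContext named sc):

import scala.collection.mutable

val maxCategories = 10000
// 40000 distinct values spread evenly across 8 partitions.
val data = sc.parallelize(0 until 40000, 8)

val perPartitionDistinct = data.mapPartitions { iter =>
  val distinct = mutable.HashSet.empty[Int]  // fresh set per partition
  iter.foreach { v =>
    distinct += v
    // Each partition only ever sees about 5000 distinct values, so this
    // check never fires even though there are 40000 categories overall.
    require(distinct.size <= maxCategories, "too many categories")
  }
  Iterator.single(distinct.size)
}.collect()  // Array(5000, 5000, ...), every entry under the limit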

Maybe I am just misunderstanding something. Could anyone please give me a
hint on this issue?



-----
Feel the sparking Spark!
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Is-this-a-bug-in-MLlib-stat-test-About-the-mapPartitions-API-used-in-Chi-Squared-test-tp11015.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Is this a bug in MLlib.stat.test? About the mapPartitions API used in Chi-Squared test

Posted by Joseph Bradley <jo...@databricks.com>.
The checks against maxCategories are not for statistical purposes; they are
there to make sure communication does not blow up.  There are currently no
checks to ensure that there are enough entries per cell for statistically
significant results.  That is up to the user.

I do like the idea of adding a warning.  A reasonable fix for now might be
to print a logWarning message and add a note to the documentation.  On the
JIRA, we could also discuss whether the result should be set to some value
to indicate a meaningless test (e.g., a very bad fixed pValue).
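A hypothetical sketch of such a warning (the method name and the simple
average-count threshold are invented for illustration; this is not the fix
that was actually committed):

// `counts` would be the contingency table built for one feature,
// keyed by (label, feature value).
def warnIfTooSparse(counts: Map[(Double, Double), Long],
    numLabels: Int, numValues: Int): Unit = {
  val avgCellCount = counts.values.sum.toDouble / (numLabels * numValues)
  if (avgCellCount < 5.0) {
    // Inside ChiSqTest itself this would be a logWarning(...) call.
    Console.err.println(s"Chi-squared approximation may be unreliable: " +
      s"average cell count is $avgCellCount (< 5)")
  }
}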

I made a JIRA to track this issue: SPARK-6312

Joseph
