Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/10/04 09:22:20 UTC

[jira] [Updated] (SPARK-17768) Small {Sum,Count,Mean}Evaluator problems and suboptimalities

     [ https://issues.apache.org/jira/browse/SPARK-17768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17768:
------------------------------
    Description: 
This tracks a few related issues with org.apache.spark.partial.(Count,Mean,Sum)Evaluator and their "Grouped" counterparts:

- GroupedMeanEvaluator and GroupedSumEvaluator are unused, as is the StudentTCacher support class
- CountEvaluator can return a lower bound < 0, even though counts can't be negative
- MeanEvaluator fails when given exactly 1 datum (this yields a t-test with 0 degrees of freedom)
- CountEvaluator uses a normal distribution, which may be an inappropriate approximation (and leads to the negative lower bound above)
- The test for SumEvaluator asserts incorrect expected sums -- e.g. if the 10% of the data observed so far has a sum of 2, the expected total should be 20, not 38
- CountEvaluator and MeanEvaluator have no unit tests that would catch these problems
- The distribution code is duplicated across CountEvaluator and GroupedCountEvaluator
- The statistics used in each evaluator need some documentation; I had to guess at them
- (The code could use a few cleanups and optimizations too)
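
To illustrate the negative lower bound, here is a minimal Python sketch of a generic normal-approximation interval (mean +/- z * stdev). It is not the CountEvaluator code; the function names and the mean/stdev values are hypothetical, chosen only to show how a wide spread around a small estimate pushes the lower bound below zero. Clamping at 0 is shown as one possible remedy, not necessarily the one adopted for this issue.

```python
from statistics import NormalDist


def normal_interval(mean, stdev, confidence=0.95):
    """Two-sided normal-approximation interval: mean +/- z * stdev."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean - z * stdev, mean + z * stdev


def count_interval(mean, stdev, confidence=0.95):
    """Same interval, with the lower bound clamped at 0 (counts are non-negative)."""
    low, high = normal_interval(mean, stdev, confidence)
    return max(0.0, low), high


# Hypothetical estimate: a small count with a large standard deviation.
raw_low, raw_high = normal_interval(mean=3.0, stdev=4.0)   # raw_low is roughly -4.84
low, high = count_interval(mean=3.0, stdev=4.0)            # low is 0.0
```

Clamping only hides the symptom; a distribution defined on non-negative values would avoid it by construction.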

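The single-datum failure can be sketched in the same way. A sample of n values leaves n - 1 degrees of freedom, so one value leaves none, and neither the sample variance nor a t-based interval is defined. The helper below is a hypothetical illustration using Python's standard library, not MeanEvaluator itself; it assumes at least one value is passed.

```python
import statistics


def sample_stats(values):
    """Return (mean, sample variance, degrees of freedom).

    The variance is None when there are fewer than 2 values, because the
    n - 1 denominator (the degrees of freedom) would be 0.
    """
    n = len(values)
    mean = statistics.fmean(values)
    dof = n - 1
    variance = statistics.variance(values) if dof > 0 else None
    return mean, variance, dof


one = sample_stats([5.0])        # variance is None, dof is 0
two = sample_stats([1.0, 3.0])   # variance is 2.0, dof is 1
```

A caller can check for the 0-DOF case up front and report an unbounded interval instead of failing.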

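The expected sum in the SumEvaluator bullet is simple scaling: if the observed fraction of the data sums to 2 and that fraction is 10%, the estimate for the whole is 2 / 0.1 = 20. A minimal sketch follows, assuming the unobserved data resembles the observed data; the function and parameter names are hypothetical, and "10%" is modeled here as 1 of 10 equal parts.

```python
def extrapolate_sum(observed_sum, parts_merged, parts_total):
    """Scale a partial sum up to the full data set by the fraction observed."""
    if parts_merged <= 0:
        raise ValueError("need at least one observed part to extrapolate")
    return observed_sum * parts_total / parts_merged


# 1 of 10 parts observed (10% of the data), with a partial sum of 2.
estimate = extrapolate_sum(2.0, parts_merged=1, parts_total=10)   # 20.0
```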

  was:
This tracks a few related issues with org.apache.spark.partial.(Count,Mean,Sum)Evaluator and their "Grouped" counterparts:

- GroupedMeanEvaluator and GroupedSumEvaluator are unused, as is the StudentTCacher support class
- CountEvaluator can return a lower bound < 0, even though counts can't be negative
- MeanEvaluator fails when given exactly 1 datum (this yields a t-test with 0 degrees of freedom)
- CountEvaluator uses a normal distribution, which may be an inappropriate approximation (and leads to the negative lower bound above)
- CountEvaluator and MeanEvaluator have no unit tests that would catch these problems
- Code is duplicated across CountEvaluator and GroupedCountEvaluator
- SumEvaluator might have an issue related to CountEvaluator (or could delegate to compute CountEvaluator times MeanEvaluator?)
- The stats in each could use a bit of documentation as I had to guess at them
- (Code could use a few cleanups and optimizations too)




> Small {Sum,Count,Mean}Evaluator problems and suboptimalities
> ------------------------------------------------------------
>
>                 Key: SPARK-17768
>                 URL: https://issues.apache.org/jira/browse/SPARK-17768
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Sean Owen
>            Assignee: Sean Owen


