Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/01/14 02:44:39 UTC

[jira] [Resolved] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large

     [ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-12026.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.1
                   2.0.0

Issue resolved by pull request 10146
[https://github.com/apache/spark/pull/10146]

> ChiSqTest gets slower and slower over time when number of features is large
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12026
>                 URL: https://issues.apache.org/jira/browse/SPARK-12026
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Hunter Kelly
>            Assignee: yuhao yang
>              Labels: mllib, stats
>             Fix For: 2.0.0, 1.6.1
>
>         Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction. My understanding is that internally it creates jobs that each run on a batch of 1000 features at a time.
> I was under the impression that the features are treated as independent, but this does not appear to be the case. When the number of features is large (160k in my case), each batch gets slower and slower. As an example, running on 25 m3.2xlarge instances on Amazon EMR, it started at just over 1 minute per batch. By the end, each batch was taking over 30 minutes.
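
For illustration only, here is a rough sketch of the batching the reporter describes, written in plain Python rather than Spark. It is not the MLlib implementation; the function names (`batches`, `chi_sq_statistic`, `chi_sq_by_batch`) and the in-memory data layout are assumptions made for this example. It shows why per-batch cost would be expected to stay roughly constant when features are processed independently: each batch only touches its own 1000 columns.

```python
from collections import Counter


def batches(num_features, batch_size=1000):
    """Yield (start, end) index ranges that cover all feature indices."""
    for start in range(0, num_features, batch_size):
        yield start, min(start + batch_size, num_features)


def chi_sq_statistic(pairs):
    """Pearson chi-squared statistic for a list of (feature_value, label) pairs."""
    observed = Counter(pairs)
    row_totals = Counter(value for value, _ in pairs)
    col_totals = Counter(label for _, label in pairs)
    n = len(pairs)
    stat = 0.0
    for value, row_total in row_totals.items():
        for label, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def chi_sq_by_batch(rows, labels, batch_size=1000):
    """Compute one statistic per feature, visiting the features batch by batch.

    Hypothetical helper: each batch depends only on its own columns, so no
    state is carried over from earlier batches.
    """
    num_features = len(rows[0])
    stats = [0.0] * num_features
    for start, end in batches(num_features, batch_size):
        for j in range(start, end):
            column = [(row[j], label) for row, label in zip(rows, labels)]
            stats[j] = chi_sq_statistic(column)
    return stats
```

With 160k features and a batch size of 1000, `batches(160000)` yields 160 ranges, which corresponds to the number of batches implied by the report.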


