Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:34:48 UTC

[jira] [Resolved] (SPARK-14464) Logistic regression performs poorly for very large vectors, even when the number of non-zero features is small

     [ https://issues.apache.org/jira/browse/SPARK-14464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-14464.
----------------------------------
    Resolution: Incomplete

> Logistic regression performs poorly for very large vectors, even when the number of non-zero features is small
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14464
>                 URL: https://issues.apache.org/jira/browse/SPARK-14464
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Daniel Siegmann
>            Priority: Major
>              Labels: bulk-closed
>
> When training (a.k.a. fitting) org.apache.spark.ml.classification.LogisticRegression, aggregation is done using arrays (which are dense structures). This is the case regardless of whether the features of each instance are stored in a sparse or a dense vector.
> When the feature vectors are very large, performance is poor because there's a lot of overhead in transmitting these large arrays across workers.
> However, just because the feature vectors are large doesn't mean all of those features are actually being used. If the features that are actually in use are sparse, these very large arrays are allocated unnecessarily.
> To address this case, there should be an option to aggregate using sparse vectors. It should be up to the user to set this explicitly as a parameter on the estimator, since the user should have some idea whether it is necessary for their particular case.
> As an example, I have a use case where the feature vector size is around 20 million, but there are only 7-8 thousand non-zero features. The time to train a model with only 10 workers is 1.3 hours on my test cluster. With 100 workers this balloons to over 10 hours on the same cluster! Also, spark.driver.maxResultSize is set to 112 GB to accommodate the data being pulled back to the driver.
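
For concreteness, here is a minimal, self-contained sketch of that scenario (not part of the original report). It is written for spark-shell against the Spark 2.x org.apache.spark.ml.linalg API (the issue is filed against 1.6.0, where the vector classes live in org.apache.spark.mllib.linalg), and the row count, seeding, and label generation are illustrative placeholders. Every row carries a SparseVector, yet fit() still aggregates gradients in dense buffers of length numFeatures; the sparse-aggregation option proposed above does not exist, so nothing in the snippet opts out of that behavior.

    // Spark 2.x spark-shell sketch; assumes `spark` and spark.implicits._ are available.
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._
    import scala.util.Random

    val numFeatures = 20000000   // ~20 million-dimensional feature space
    val nnzPerRow   = 8000       // only a few thousand active features per row
    val numRows     = 10000      // illustrative row count, not from the report

    val data = spark.sparkContext.parallelize(0 until numRows).map { i =>
      val rng     = new Random(i)
      val indices = Array.fill(nnzPerRow)(rng.nextInt(numFeatures)).distinct.sorted
      val values  = Array.fill(indices.length)(rng.nextDouble())
      val label   = if (rng.nextBoolean()) 1.0 else 0.0
      (label, Vectors.sparse(numFeatures, indices, values))
    }.toDF("label", "features")

    // The rows are sparse, but the gradient aggregation inside fit() allocates
    // dense buffers of length numFeatures on each task, which drives the transfer
    // overhead (and the large spark.driver.maxResultSize) described above.
    val model = new LogisticRegression().setMaxIter(10).fit(data)
    println(s"coefficients nnz = ${model.coefficients.numNonzeros}")

With a sparse aggregation option of the kind proposed here, the per-task buffers (and the data sent back to the driver) would scale with the number of non-zero features rather than with numFeatures.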



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org