Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/06/17 03:59:05 UTC

[jira] [Commented] (SPARK-16008) ML Logistic Regression aggregator serializes unnecessary data

    [ https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335359#comment-15335359 ] 

Apache Spark commented on SPARK-16008:
--------------------------------------

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/13729

> ML Logistic Regression aggregator serializes unnecessary data
> -------------------------------------------------------------
>
>                 Key: SPARK-16008
>                 URL: https://issues.apache.org/jira/browse/SPARK-16008
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Seth Hendrickson
>
> The LogisticAggregator class collects gradient updates in the ML logistic regression algorithm. The class stores a reference to the coefficients array, whose length equals the number of features, and a reference to an array of feature standard deviations, also of length numFeatures. When a task completes, the aggregator is serialized, which also serializes copies of these two arrays. The arrays do not need to be serialized, since only the gradient updates are being aggregated. This causes performance problems when the number of features is large and can trigger excess garbage collection when the executor does not have much spare memory.
> As a result, each task serializes 2 * numFeatures values of excess data. When multiclass logistic regression is implemented, the excess will grow to numFeatures + numClasses * numFeatures. A sketch of one way to avoid the copies follows below.
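
Below is a minimal sketch of the pattern described above, not Spark's actual implementation: a simplified binary-logistic aggregator in which the read-only arrays arrive via broadcast variables and are dereferenced through @transient lazy vals, so only the gradient buffer and loss travel back with the serialized object. All names here (LogisticAggregatorSketch, bcCoefficients, bcFeaturesStd, gradientSum) are illustrative assumptions, not the identifiers in the patch.

import org.apache.spark.broadcast.Broadcast

// Sketch: only gradientSum and lossSum are shipped back with the serialized
// aggregator. The Broadcast handles are tiny, and the @transient lazy vals
// are re-materialized from the broadcast values after deserialization on
// each executor instead of being copied into every task result.
class LogisticAggregatorSketch(
    bcCoefficients: Broadcast[Array[Double]],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // Not serialized: recovered from the broadcast value where needed.
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value
  @transient private lazy val featuresStd: Array[Double] = bcFeaturesStd.value

  private val numFeatures = bcCoefficients.value.length
  val gradientSum = new Array[Double](numFeatures)
  var lossSum = 0.0

  // Add one (label, features) example; label is 0.0 or 1.0.
  def add(label: Double, features: Array[Double]): this.type = {
    var margin = 0.0
    var i = 0
    while (i < numFeatures) {
      if (featuresStd(i) != 0.0) {
        margin += coefficients(i) * (features(i) / featuresStd(i))
      }
      i += 1
    }
    // Gradient of the negative log-likelihood: (sigmoid(margin) - label) * x
    val multiplier = 1.0 / (1.0 + math.exp(-margin)) - label
    i = 0
    while (i < numFeatures) {
      if (featuresStd(i) != 0.0) {
        gradientSum(i) += multiplier * (features(i) / featuresStd(i))
      }
      i += 1
    }
    // -log P(label | x) = log(1 + e^(-margin)) + (1 - label) * margin
    lossSum += math.log1p(math.exp(-margin)) + (1.0 - label) * margin
    this
  }

  def merge(other: LogisticAggregatorSketch): this.type = {
    var i = 0
    while (i < numFeatures) {
      gradientSum(i) += other.gradientSum(i)
      i += 1
    }
    lossSum += other.lossSum
    this
  }
}

With this shape, the driver would combine partitions with something like data.treeAggregate(new LogisticAggregatorSketch(bcCoef, bcStd))((agg, p) => agg.add(p._1, p._2), (a, b) => a.merge(b)), and only the gradient buffer and loss sum come back from each task.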



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org