Posted to issues@spark.apache.org by "Saurabh Agrawal (JIRA)" <ji...@apache.org> on 2017/07/20 05:16:00 UTC

[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform

    [ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094182#comment-16094182 ] 

Saurabh Agrawal edited comment on SPARK-21476 at 7/20/17 5:15 AM:
------------------------------------------------------------------

I'm saying that the trees in the model get serialized with each task, which increases task deserialization time when the forest is large.
I see that RandomForestClassificationModel has a transformImpl that first broadcasts the model and then calls predict on the broadcast value (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L207-L213). However, the transform method in ProbabilisticClassificationModel never invokes transformImpl. Instead, ProbabilisticClassificationModel uses the concrete class's definition of predictRaw.

transform is a distributed operation, but the trees contained in the model are not broadcast; they are serialized with each task instead. Is this the intended behavior?
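
To make the cost difference concrete, here is a minimal toy sketch in plain Python. It is not Spark code and does not use any Spark API: the "forest" is a hypothetical nested list of thresholds, pickle stands in for task serialization, and the broadcast handle is just an integer. Under those assumptions, it compares the bytes serialized when the model travels with every task against the bytes serialized when the model is shipped once per executor and each task carries only a handle.

```python
import pickle


def make_toy_forest(num_trees, nodes_per_tree):
    # Hypothetical stand-in for a tree ensemble: each "tree" is a list of thresholds.
    return [[float(i) for i in range(nodes_per_tree)] for _ in range(num_trees)]


def bytes_without_broadcast(forest, num_tasks):
    # The model is captured by every task, so it is serialized num_tasks times.
    return sum(len(pickle.dumps((forest, task_id))) for task_id in range(num_tasks))


def bytes_with_broadcast(forest, num_tasks, num_executors):
    # The model is shipped once per executor; each task carries only a small handle.
    handle = 0  # hypothetical broadcast id, not a real Spark object
    shipped_once = num_executors * len(pickle.dumps(forest))
    per_task = sum(len(pickle.dumps((handle, task_id))) for task_id in range(num_tasks))
    return shipped_once + per_task


forest = make_toy_forest(num_trees=50, nodes_per_tree=200)
direct = bytes_without_broadcast(forest, num_tasks=100)
shared = bytes_with_broadcast(forest, num_tasks=100, num_executors=4)
print(f"serialized with each task:   {direct} bytes")
print(f"broadcast once per executor: {shared} bytes")
```

In this toy model the per-task cost grows with the size of the forest in the first case and stays roughly constant in the second, which is the effect described above. The task and executor counts are arbitrary illustration values.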




was (Author: sagraw):
I'm saying that the trees in the model get serialized with each task which increases the task deserialization time if the forest is big. 
I see that there is a transformImpl in RandomForestClassificationModel which is broadcasting itself first and then calling predict on the broadcast value (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L207-L213). But transformImpl is not getting invoked by the transform method in ProbabilisticClassificationModel. Instead ProbabilisticClassificationModel uses the concrete class definition of predictRaw.



> RandomForest classification model not using broadcast in transform
> ------------------------------------------------------------------
>
>                 Key: SPARK-21476
>                 URL: https://issues.apache.org/jira/browse/SPARK-21476
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction with pipelines that use RandomForestClassificationModel. While digging into the source, I found that the transform method of RandomForestClassificationModel binds to the one in its parent, ProbabilisticClassificationModel, and that the only concrete definition RandomForestClassificationModel provides that transform actually uses is predictRaw. predictRaw does not use broadcasting.


