Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:45:11 UTC

[jira] [Resolved] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

     [ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24106.
----------------------------------
    Resolution: Incomplete

> Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24106
>                 URL: https://issues.apache.org/jira/browse/SPARK-24106
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0
>         Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topics with 6 partitions
>            Reporter: Tamilselvan Veeramani
>            Priority: Major
>              Labels: bulk-closed, performance
>
> The RandomForestClassificationModel is broadcast to the executors for every mini-batch in Spark Streaming when we try to compute probabilities.
> RF model size: 45 MB
> Spark Kafka streaming job jar size: 8 MB (including the Kafka dependencies)
> The following log shows the model being broadcast to the executors for every mini-batch when we call rf_model.transform(dataset).select("probability").
> Because of this, task deserialization time grows to 6 to 7 seconds for the 45 MB RF model, even though the actual processing time per mini-batch is only 400 to 600 ms. A rough sketch of the scoring job follows the log excerpt below.
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
> 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
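> For reference, the scoring job is roughly along the lines below (the feature parsing, model path, broker address and console sink are illustrative assumptions; only the 45 MB model, the 6-partition input topic and the transform/select call come from our setup):
>
> import org.apache.spark.ml.classification.RandomForestClassificationModel
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
>
> val spark = SparkSession.builder.appName("rf-streaming-scoring").getOrCreate()
> import spark.implicits._
>
> // ~45 MB random forest model, loaded once on the driver (path is illustrative)
> val rfModel = RandomForestClassificationModel.load("hdfs:///models/rf_model")
>
> // 6-partition Kafka input topic (broker address is illustrative)
> val input = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")
>   .option("subscribe", "Kafka_input_topic")
>   .load()
>
> // Assume the Kafka value is a comma-separated list of doubles and that the
> // model was trained with the default "features" column name
> val toFeatures = udf { s: String => Vectors.dense(s.split(",").map(_.toDouble)) }
> val featured = input.select(toFeatures($"value".cast("string")).as("features"))
>
> // The call from this report: every mini-batch goes through transform(), and the
> // 45 MB model is shipped to the executors again before "probability" is computed
> val scored = rfModel.transform(featured).select("probability")
>
> scored.writeStream.format("console").start().awaitTermination()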
> After two to three weeks of struggling with this, I found a potential solution that should help many people who want to use an RF model for "probability" in a real-time streaming context.
> In the current version of Spark, the transformImpl method of the RandomForestClassificationModel class implements only the "prediction" column. The same mechanism can be leveraged to also produce the "probability" column in transformImpl.
> I modified the code along these lines and deployed it on our servers, and every mini-batch now completes in 400 to 500 ms. A sketch of the change follows below.
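> Roughly, the change looks like the sketch below (not the final patch; it assumes the existing broadcast-based transformImpl, a predictProbability-style helper visible inside the class, and org.apache.spark.sql.functions.{col, udf} plus org.apache.spark.ml.linalg.Vector in scope):
>
> override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
>   // Broadcast the model once and reuse it for both output columns
>   val bcastModel = dataset.sparkSession.sparkContext.broadcast(this)
>
>   val predictUDF = udf { features: Vector =>
>     bcastModel.value.predict(features)
>   }
>   // New part: produce "probability" from the same broadcast copy instead of the
>   // generic path that pulls the full model into the task closure every batch
>   val probabilityUDF = udf { features: Vector =>
>     bcastModel.value.predictProbability(features)  // helper name assumed, not verified
>   }
>
>   dataset
>     .withColumn($(predictionCol), predictUDF(col($(featuresCol))))
>     .withColumn($(probabilityCol), probabilityUDF(col($(featuresCol))))
> }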
> I see many people out there facing this issue with no solution provided in any of the forums. Can you please review this and include the fix in the next release? Thanks.


