Posted to issues@spark.apache.org by "Tamilselvan Veeramani (JIRA)" <ji...@apache.org> on 2018/04/26 19:58:00 UTC

[jira] [Created] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

Tamilselvan Veeramani created SPARK-24106:
---------------------------------------------

             Summary: Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
                 Key: SPARK-24106
                 URL: https://issues.apache.org/jira/browse/SPARK-24106
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.3.0, 2.2.1, 2.2.0
         Environment: Spark yarn / Standalone cluster
2 master nodes - 32 cores - 124 GB
9 worker nodes - 32 cores - 124 GB
Kafka input and output topic with 6 partition
            Reporter: Tamilselvan Veeramani
             Fix For: 2.4.0, 2.3.0


The RandomForestClassificationModel is broadcast to the executors for every mini batch in Spark streaming when computing the probability.

RF model size: 45 MB
Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)

The following log shows the model being broadcast to the executors for every mini batch when we call rf_model.transform(dataset).select("probability").
Because of this, task deserialization time increases to 6 to 7 seconds for the 45 MB RF model, although the processing time is just 400 to 600 ms per mini batch.

18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)

After 2 to 3 weeks of struggle, I found a potential solution that should help the many people who are looking to use an RF model for “probability” in a real-time streaming context.
In the current version of Spark, the transformImpl method of the RandomForestClassificationModel class implements only “prediction”. The same approach can be leveraged to implement “probability” in the transformImpl method of RandomForestClassificationModel as well.
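
To illustrate why “probability” and “prediction” can come out of the same code path, here is a minimal, Spark-free sketch in plain Java. It is not the patch described in this issue and not Spark's actual implementation. ToyForest, demo() and the two hard-coded "trees" are hypothetical names invented for this example. It assumes each tree returns a class distribution for a feature vector, that the forest probability is the normalized sum of those distributions, and that the prediction is the class with the highest probability.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical toy forest for illustration only; NOT Spark's RandomForestClassificationModel.
final class ToyForest {
    // Each "tree" maps a feature vector to a per-class distribution (assumed to sum to 1).
    private final List<Function<double[], double[]>> trees;
    private final int numClasses;

    ToyForest(List<Function<double[], double[]>> trees, int numClasses) {
        this.trees = trees;
        this.numClasses = numClasses;
    }

    // Sum the per-tree distributions, then normalize so the result sums to 1.
    double[] probability(double[] features) {
        double[] sum = new double[numClasses];
        for (Function<double[], double[]> tree : trees) {
            double[] dist = tree.apply(features);
            for (int c = 0; c < numClasses; c++) {
                sum[c] += dist[c];
            }
        }
        double total = 0.0;
        for (double v : sum) {
            total += v;
        }
        for (int c = 0; c < numClasses; c++) {
            sum[c] /= total;
        }
        return sum;
    }

    // The prediction is the index of the largest probability.
    static int argmax(double[] p) {
        int best = 0;
        for (int c = 1; c < p.length; c++) {
            if (p[c] > p[best]) {
                best = c;
            }
        }
        return best;
    }

    // Two hard-coded stumps on the first feature (made-up thresholds and distributions).
    static ToyForest demo() {
        List<Function<double[], double[]>> trees = Arrays.asList(
            f -> f[0] > 0.5 ? new double[] {0.1, 0.9} : new double[] {0.8, 0.2},
            f -> f[0] > 0.3 ? new double[] {0.4, 0.6} : new double[] {0.7, 0.3});
        return new ToyForest(trees, 2);
    }

    public static void main(String[] args) {
        ToyForest forest = demo();
        double[] p = forest.probability(new double[] {0.4});
        // One pass over the trees yields both outputs.
        System.out.println("probability = " + Arrays.toString(p));
        System.out.println("prediction  = " + argmax(p));
    }
}
```

The point of the sketch is only that the prediction falls out of the probability vector, so a single evaluation of the model per row can serve both columns.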

I have modified the code and deployed it on our server, and each mini batch now completes in as little as 400 ms to 500 ms.

I see many people out there facing this issue, and no solution has been provided in any of the forums. Can you please review this and include the fix in the next release? Thanks.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
