Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/01 15:13:05 UTC

[GitHub] koeninger commented on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer

URL: https://github.com/apache/spark/pull/22138#issuecomment-468697607
 
 
   From a quick look at the test, it does not appear to do anything that you
   would expect to result in hitting the same TopicPartition multiple times
   concurrently, correct?
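   For illustration, one query shape that could exercise that path is a
   self-union of a single Kafka source: every micro-batch then scans each
   TopicPartition twice, so two tasks on the same executor may request a
   consumer for the same TopicPartition at the same time. The sketch below is
   illustrative only and not taken from the test under discussion; the topic
   name, bootstrap servers and console sink are placeholders.

       // Illustrative sketch only (not the test in the linked gist): a self-union
       // of one Kafka source, so each micro-batch reads every TopicPartition twice
       // and concurrent access to the same TopicPartition becomes plausible.
       import org.apache.spark.sql.SparkSession

       object SameTopicPartitionSketch {
         def main(args: Array[String]): Unit = {
           val spark = SparkSession.builder().appName("same-tp-sketch").getOrCreate()

           val kafka = spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
             .option("subscribe", "trucking-data")                // placeholder topic
             .load()

           // Two scans of the same source in one query: tasks from both scans can
           // land on the same executor and hit the same TopicPartition concurrently.
           kafka.union(kafka)
             .writeStream
             .format("console") // placeholder sink, just to keep the query running
             .start()
             .awaitTermination()
         }
       }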
   
   On Fri, Mar 1, 2019 at 9:04 AM Jungtaek Lim <no...@github.com>
   wrote:
   
   > @koeninger <https://github.com/koeninger>
    > I've just run some experiments on my local dev machine.
   >
   > The source topic has around 1,100,000 records in 10 partitions.
   >
    > Records were generated with the utility below:
   >
   > https://github.com/HeartSaVioR/sam-trucking-data-utils/blob/fix-for-spark-structured-streaming/README.md
    > (Actually, each query pulls exactly the same records, so this should not be
    > a big deal.)
   >
    > The test code is here:
   > https://gist.github.com/HeartSaVioR/74c7e78e5901b1974ccc400502fb6af2
   >
    > Each query fetched 5,000 records per batch and ran 221 batches.
   >
    > The query status file is parsed with the command below (requires jq and
    > datamash):
   >
   > cat experiment-SPARK-25151-master-query-v1.log | grep "addBatch" | jq '. | {addBatch: .durationMs.addBatch}' | grep "addBatch" | awk -F " " '{print $2}' | datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
   >
   >
   >    - master branch
   >
    > attempt   max   min   mean              median   p90   p95     p99
    > 1         449   4     7.6981981981982   5        7     11      21.37
    > 2         490   4     7.8198198198198   5        8     10.95   19.37
    > 3         442   4     7.4324324324324   5        8     10      16.16
   >
   >    - this patch
   >
    > attempt   max   min   mean              median   p90   p95     p99
    > 1         501   4     7.8513513513514   5        7.9   9.95    18.79
    > 2         411   4     7.4054054054054   5        8     9       16.37
    > 3         431   3     7.5630630630631   5        8     11      16
   >
    > Based on these numbers I would not say this patch is faster than master
    > (though it shows better numbers at the 95th and 99th percentiles), but I
    > would say it does not introduce a performance regression, while bringing a
    > bug fix and improvements.
   >
    > You can see these numbers do not contribute much to the overall latency per
    > batch - once the cache logic works as expected, consumer retrieval is not on
    > the critical path and its overhead is negligible.
   >
    > Do these results persuade you to review the patch? Or do you want to tune
    > the parameters in the test environment?
   >
   
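   The "overhead for retrieving" above is the cost of borrowing a pooled
   consumer for a topic-partition on each fetch. As a rough illustration of the
   idea behind the patch, the sketch below uses Apache Commons Pool 2 keyed by
   TopicPartition; the key type, factory and class names in the actual change
   differ, and kafkaParams is a placeholder.

       // Rough illustration of keyed consumer pooling with Apache Commons Pool 2;
       // names are illustrative, not the ones used in the patch.
       import java.{util => ju}
       import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
       import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool}
       import org.apache.kafka.clients.consumer.KafkaConsumer
       import org.apache.kafka.common.TopicPartition

       class ConsumerFactory(kafkaParams: ju.Map[String, Object])
         extends BaseKeyedPooledObjectFactory[TopicPartition, KafkaConsumer[Array[Byte], Array[Byte]]] {

         // Called only when no idle consumer exists for this TopicPartition.
         override def create(tp: TopicPartition): KafkaConsumer[Array[Byte], Array[Byte]] = {
           // kafkaParams must carry bootstrap servers and byte-array deserializers.
           val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
           consumer.assign(ju.Collections.singletonList(tp))
           consumer
         }

         override def wrap(c: KafkaConsumer[Array[Byte], Array[Byte]])
           : PooledObject[KafkaConsumer[Array[Byte], Array[Byte]]] =
           new DefaultPooledObject(c)

         override def destroyObject(
             tp: TopicPartition,
             p: PooledObject[KafkaConsumer[Array[Byte], Array[Byte]]]): Unit =
           p.getObject.close()
       }

       // Borrow/return around each fetch; borrowing an already-pooled consumer is
       // cheap, which is why retrieval should not show up in addBatch latency:
       //   val pool = new GenericKeyedObjectPool(new ConsumerFactory(kafkaParams))
       //   val consumer = pool.borrowObject(tp)
       //   try { /* consumer.poll(...) */ } finally { pool.returnObject(tp, consumer) }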

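   For reference on how the addBatch numbers are collected: the jq/datamash
   pipeline quoted above reads the query progress JSON, which Spark exposes per
   micro-batch with durationMs.addBatch inside. A minimal sketch of appending
   that JSON to a log file via a StreamingQueryListener is shown below; it is
   not the linked gist, and the class name and file path are placeholders.

       // Minimal sketch: log each micro-batch's progress JSON (which contains
       // durationMs.addBatch) to a file that the jq/datamash pipeline can parse.
       import java.io.{FileWriter, PrintWriter}
       import org.apache.spark.sql.streaming.StreamingQueryListener
       import org.apache.spark.sql.streaming.StreamingQueryListener.{
         QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

       class ProgressLogListener(path: String) extends StreamingQueryListener {
         private val out = new PrintWriter(new FileWriter(path, true)) // append mode

         override def onQueryStarted(event: QueryStartedEvent): Unit = ()

         override def onQueryProgress(event: QueryProgressEvent): Unit = {
           out.println(event.progress.json) // one JSON object per line
           out.flush()
         }

         override def onQueryTerminated(event: QueryTerminatedEvent): Unit = out.close()
       }

       // Register on the active session before starting the query:
       //   spark.streams.addListener(new ProgressLogListener("experiment-query.log"))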