Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/28 22:56:07 UTC

[GitHub] [spark] yijiacui-db edited a comment on pull request #31944: [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay.

yijiacui-db edited a comment on pull request #31944:
URL: https://github.com/apache/spark/pull/31944#issuecomment-828832285


   > > > I've tested it on a real cluster and it works fine.
   > > > Just a question: how is this intended to be used for dynamic allocation?
   > > 
   > > 
   > > Users can implement this interface in their customized SparkDataStream and learn how far it is falling behind through the progress listener. Maybe this can provide more useful information to guide or trigger auto scaling.
   > 
   > This is a valid use case. But my question is whether the current offsets in `SourceProgress` already provide the information the use case needs (consumed offset, available offset). The source progress should also be available for a customized SparkDataStream. Do you mean the metrics from the customized SparkDataStream are not offset related?
   
   Yes. The available offset is retrieved through `reportLatestOffset`, which Kafka already implements, so for Kafka the new metric is duplicated: the latest consumed offset and the available offset are enough to compute how far the stream is falling behind.
   But for other customized Spark data streams, `reportLatestOffset` may not be implemented, so the source progress report gives no way to learn the latest available offset for that computation. Also, customized metrics, for example how far the application is falling behind the latest data, can be represented in other ways (not only as a number of offsets), depending on how the stream defines them.
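
   To make the offset-based computation concrete, here is a minimal sketch of deriving lag from consumed and available offsets. `OffsetLagSketch` and its methods are hypothetical names for illustration, not Spark or Kafka APIs, and partitions are assumed to be keyed by integer ids.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: computes how many records a stream is
// behind, given per-partition consumed and latest available offsets.
final class OffsetLagSketch {
    private OffsetLagSketch() {}

    // For each partition in `available`, lag = available - consumed, floored at 0.
    // Assumption: a partition with no consumed offset yet counts as consumed = 0.
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> consumed,
                                              Map<Integer, Long> available) {
        Map<Integer, Long> lag = new HashMap<>();
        for (Map.Entry<Integer, Long> e : available.entrySet()) {
            long done = consumed.getOrDefault(e.getKey(), 0L);
            lag.put(e.getKey(), Math.max(0L, e.getValue() - done));
        }
        return lag;
    }

    // Largest lag across partitions, or 0 when there are no partitions.
    static long maxLag(Map<Integer, Long> lag) {
        return lag.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }
}
```

   This only works when the source can report a latest available offset, which is the case the paragraph above says cannot be assumed for every stream.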
   
   We want to introduce this metrics interface so that users can implement it for their data stream and obtain the metrics they want from the source progress report. The Kafka stream is just an example of how users can implement it and retrieve that information; it happens to also expose the latest available offset, which makes the example look somewhat duplicated and hard to reason about.
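
   As a rough illustration of a metric that is not a number of offsets, here is a sketch of a stream reporting time-based lag. The `SourceMetricsReporter` interface, its signature, the `TimestampStreamSketch` class, and the `secondsBehindLatest` key are all assumptions made up for this example; they are not the API in this pull request.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface for illustration only; the name and signature are
// assumptions, not the interface added by the pull request.
interface SourceMetricsReporter<O> {
    Map<String, String> metrics(Optional<O> latestConsumedOffset);
}

// A made-up stream whose "offset" is an event timestamp in milliseconds. It does
// not report a latest available offset, so it expresses lag as time, not records.
final class TimestampStreamSketch implements SourceMetricsReporter<Long> {
    private final long newestEventMillis;

    TimestampStreamSketch(long newestEventMillis) {
        this.newestEventMillis = newestEventMillis;
    }

    @Override
    public Map<String, String> metrics(Optional<Long> latestConsumedOffset) {
        Map<String, String> m = new LinkedHashMap<>();
        // Report nothing until at least one batch has been consumed.
        latestConsumedOffset.ifPresent(consumed -> {
            long behindMillis = Math.max(0L, newestEventMillis - consumed);
            m.put("secondsBehindLatest", Long.toString(behindMillis / 1000L));
        });
        return m;
    }
}
```

   A listener reading the progress report could then act on `secondsBehindLatest` without knowing anything about the stream's offset format.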

