Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/30 01:40:52 UTC

[GitHub] [spark] wenxuanguan commented on issue #25618: [SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming

URL: https://github.com/apache/spark/pull/25618#issuecomment-526421834
 
 
   > By reading the doc without super deep understanding I've found this in the caveats section:
   > 
   > ```
   > If job failed before ResumeTransaction more than 60 seconds, the default value
   > of configuration transaction.timeout.ms, data send to Kafka cluster will be discarded
   > and lead to data loss. So We set transaction.timeout.ms to 900000, the default
   > value of max.transaction.timeout.ms in Kafka cluster, to reduce the risk of data loss
   > if user not defined
   > ```
   > 
   > The `to reduce the risk of data loss` part disturbs me a bit, is it exactly-once then or not?
   > @HeartSaVioR AFAIK you've proposed exactly once SPIP before but there were concerns.
   
   @gaborgsomogyi @HeartSaVioR Thanks for your replies about the `transaction.timeout.ms` config and data loss.
   The common scenario is that the producer fails to commit a transaction for some reason, such as a Kafka broker going down, and the Spark job fails. After the Kafka broker recovers, the job is restarted and the transaction resumes. So as long as the time between the transaction commit failure and the job restart, whether by a job attempt or manually, does not exceed `transaction.timeout.ms`, no data is lost.
   The producer default for `transaction.timeout.ms` is 60 seconds, so to leave enough time to fix the failure we reset it to 900000, the default value of the Kafka broker config, if the user has not defined it. We do not go higher because the request will fail if the producer config `transaction.timeout.ms` is larger than the Kafka broker config.
   I think this is what we can do in code, and we also note it for users in the documentation. There are other ways to avoid the problem, such as increasing `transaction.timeout.ms`, but that depends on the user. So if the user has defined `transaction.timeout.ms`, we just check whether it is large enough.
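
To make the rule above concrete, here is a minimal Python sketch of the timeout resolution described in this comment. It is not the code in the PR: the function names, the `min_required_ms` threshold, and the choice to raise an error on a bad user value are all assumptions for illustration. The two constants are the documented Kafka defaults (producer `transaction.timeout.ms` = 60000 ms, broker `transaction.max.timeout.ms` = 900000 ms).

```python
from typing import Dict

# Documented Kafka defaults, in milliseconds.
PRODUCER_DEFAULT_TIMEOUT_MS = 60_000   # producer transaction.timeout.ms
BROKER_MAX_TIMEOUT_MS = 900_000        # broker transaction.max.timeout.ms


def resolve_transaction_timeout_ms(
    user_params: Dict[str, str],
    broker_max_ms: int = BROKER_MAX_TIMEOUT_MS,
    min_required_ms: int = PRODUCER_DEFAULT_TIMEOUT_MS,
) -> int:
    """Pick the producer transaction timeout (hypothetical helper).

    min_required_ms is an assumed threshold; the comment only says the
    user value is checked for being "large enough".
    """
    raw = user_params.get("transaction.timeout.ms")
    if raw is None:
        # User did not set it: raise it to the broker-side maximum so there
        # is as much time as possible to fix the failure and restart.
        return broker_max_ms

    timeout_ms = int(raw)
    if timeout_ms > broker_max_ms:
        # The broker rejects a producer timeout above its own maximum.
        raise ValueError(
            f"transaction.timeout.ms={timeout_ms} exceeds broker maximum {broker_max_ms}"
        )
    if timeout_ms < min_required_ms:
        # Assumed policy: reject values too small to allow a restart.
        raise ValueError(
            f"transaction.timeout.ms={timeout_ms} is below the required {min_required_ms}"
        )
    return timeout_ms


def can_resume(downtime_ms: int, timeout_ms: int) -> bool:
    """True if the job restarted before the open transaction timed out."""
    return downtime_ms <= timeout_ms


if __name__ == "__main__":
    default_timeout = resolve_transaction_timeout_ms({})
    five_minutes_ms = 5 * 60 * 1000
    # A 5-minute outage fits within the raised timeout but not within 60 s.
    print(default_timeout, can_resume(five_minutes_ms, default_timeout))
    print(can_resume(five_minutes_ms, PRODUCER_DEFAULT_TIMEOUT_MS))
```

With the raised timeout, a five-minute broker outage followed by a restart can still resume the transaction, whereas under the 60-second producer default the transaction would already have timed out and its data would have been discarded.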
   
