Posted to commits@samza.apache.org by GitBox <gi...@apache.org> on 2020/05/20 06:53:07 UTC

[GitHub] [samza] lakshmi-manasa-g commented on pull request #1362: SAMZA-2526: Azure blob system producer: do not commit blobs if avro DataFileWriter.close fails

lakshmi-manasa-g commented on pull request #1362:
URL: https://github.com/apache/samza/pull/1362#issuecomment-631274573


   @bkonold 
   
   Let me start with what flush and close mean for the AzureBlobWriter interface. The interface expectation is that after flush, all of the data sent so far has been uploaded to Azure as a block, and close means the blob is sealed (aka created or committed) with all of the blocks uploaded so far. Thus flush can be a frequently occurring operation, whereas close is the final operation, after which the writer cannot be reopened to write any more messages.
   So the flow is something like this:
                               writer.write (several times) --> writer.flush (can go back to write) --> writer.close
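
   To make the write/flush/close contract concrete, here is a minimal in-memory sketch. SketchBlobWriter and all of its methods are hypothetical names made up for illustration, not the actual AzureBlobWriter API, and having close flush any pending data first is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a blob writer; NOT Samza's AzureBlobWriter.
class SketchBlobWriter {
    private final List<byte[]> pending = new ArrayList<>();              // written, not yet uploaded
    private final List<List<byte[]>> uploadedBlocks = new ArrayList<>(); // one entry per flush
    private boolean committed = false;
    private boolean closed = false;

    // Buffer a message; allowed any number of times while the writer is open.
    void write(byte[] message) {
        ensureOpen();
        pending.add(message);
    }

    // "Upload" everything buffered so far as one block; writing may continue afterwards.
    void flush() {
        ensureOpen();
        if (!pending.isEmpty()) {
            uploadedBlocks.add(new ArrayList<>(pending));
            pending.clear();
        }
    }

    // Seal the blob from the uploaded blocks; the writer cannot be used again.
    // Assumption of this sketch: close flushes leftover data before committing.
    void close() {
        ensureOpen();
        flush();
        committed = true;
        closed = true;
    }

    int uploadedBlockCount() {
        return uploadedBlocks.size();
    }

    boolean isCommitted() {
        return committed;
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
    }
}
```

   In this sketch a blob exists only after close; flush alone uploads blocks but leaves them uncommitted.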
                                              
   Now coming to the AzureBlobAvroWriter impl: flush and close mean the same thing as the interface expects. The only difference arises from the use of Avro's DataFileWriter, which is layered as follows:
               avroWriter --> DataFileWriter (avro piece) --> AzureBlobOutputStream (uploads/commits to Azure)
   
   AzureBlobAvroWriter.close has to call DataFileWriter.close, which internally calls flush and writes some final bytes (end-of-block markers and the like) to the output stream, which puts them into the last block of the blob.
   
   Now finally coming to the current issue: earlier, the blob was committed by calling AzureBlobOutputStream.close even when DataFileWriter.close failed. This meant the blocks uploaded so far were used to make the blob, but that blob might be missing the last bytes from DataFileWriter, leading to an invalid/unreadable blob. The better thing to do is to discard the blob so that the user of the system producer has a chance to retry all the messages for the blob.
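
   The close ordering described above can be sketched roughly as follows. Every class and method name here (SketchBlobStream, SketchEncoder, SketchAvroWriter, commit, discard) is a hypothetical stand-in for illustration and is not taken from the PR; the encoder simply simulates a DataFileWriter whose close either writes trailing bytes or throws.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the output stream that uploads blocks and commits the blob.
class SketchBlobStream {
    final List<String> blocks = new ArrayList<>();
    boolean committed = false;
    boolean discarded = false;

    void uploadBlock(String block) {
        blocks.add(block);
    }

    // Seal the blob from the blocks uploaded so far.
    void commit() {
        committed = true;
    }

    // Drop the uploaded blocks without ever creating a blob.
    void discard() {
        blocks.clear();
        discarded = true;
    }
}

// Hypothetical stand-in for the Avro piece: close must write trailing bytes, and may fail.
class SketchEncoder {
    private final SketchBlobStream out;
    private final boolean failOnClose;

    SketchEncoder(SketchBlobStream out, boolean failOnClose) {
        this.out = out;
        this.failOnClose = failOnClose;
    }

    void append(String record) {
        out.uploadBlock(record);
    }

    void close() throws IOException {
        if (failOnClose) {
            throw new IOException("simulated encoder close failure");
        }
        out.uploadBlock("<trailing bytes>");
    }
}

// Hypothetical writer that commits the blob only if the encoder closed cleanly.
class SketchAvroWriter {
    private final SketchEncoder encoder;
    private final SketchBlobStream stream;

    SketchAvroWriter(SketchEncoder encoder, SketchBlobStream stream) {
        this.encoder = encoder;
        this.stream = stream;
    }

    void write(String record) {
        encoder.append(record);
    }

    void close() throws IOException {
        try {
            encoder.close();   // may fail before the trailing bytes are written
        } catch (IOException e) {
            stream.discard();  // never commit a blob that lacks its trailing bytes
            throw e;           // surface the failure so the caller can retry
        }
        stream.commit();       // reached only when the trailing bytes made it out
    }
}
```

   The exception is rethrown rather than swallowed so that the caller learns the blob was not created and can resend the messages.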
   
   I know this is a very long response, but I hope it helps clarify things.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org