You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by GitBox <gi...@apache.org> on 2020/05/18 22:44:43 UTC

[GitHub] [samza] lakshmi-manasa-g opened a new pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

lakshmi-manasa-g opened a new pull request #1364:
URL: https://github.com/apache/samza/pull/1364


   Feature: Enable addition of number of records in a blob to the blob's metadata. 
   
   Changes: BlobMetadataContext passed as input to BlobMetadataGenerator interface now have an additional field giving the number of records in the blob. This can be used as needed by the generator.
   
   API changes: BlobMetadataContext passed as input to BlobMetadataGenerator interface now have an additional field giving the number of records in the blob. This can be used as needed by the generator.
   
   Upgrade Instructions: None
   
   Usage Instructions: None
   
   Tests: unit test updated.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] lakshmi-manasa-g commented on a change in pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
lakshmi-manasa-g commented on a change in pull request #1364:
URL: https://github.com/apache/samza/pull/1364#discussion_r427783902



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -85,6 +85,7 @@
 
   private volatile boolean isClosed = false;
   private long totalUploadedBlockSize = 0;
+  private long totalNumberOfRecordsInBlob = 0;

Review comment:
       AtomicLong is a nice thing to do. However, the methods of this class are all synchronized as there are other aspects beyond this variable that require single threaded access. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] bkonold commented on a change in pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
bkonold commented on a change in pull request #1364:
URL: https://github.com/apache/samza/pull/1364#discussion_r427769960



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/utils/BlobMetadataContext.java
##########
@@ -25,10 +25,12 @@
 public class BlobMetadataContext {
   private final String streamName;
   private final long blobSize;
+  private final long numberOfMessagesInBlob;
 
-  public BlobMetadataContext(String streamName, long blobSize) {
+  public BlobMetadataContext(String streamName, long blobSize, long numberOfMessagesInBlob) {
     this.streamName = streamName;
     this.blobSize = blobSize;
+    this.numberOfMessagesInBlob = numberOfMessagesInBlob;

Review comment:
       What will this be used for or what do we anticipate it will be used for?

##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -85,6 +85,7 @@
 
   private volatile boolean isClosed = false;
   private long totalUploadedBlockSize = 0;
+  private long totalNumberOfRecordsInBlob = 0;

Review comment:
       Would it be better to use AtomicLong here to ensure that all access to the member are atomic?

##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -85,6 +85,7 @@
 
   private volatile boolean isClosed = false;
   private long totalUploadedBlockSize = 0;
+  private long totalNumberOfRecordsInBlob = 0;

Review comment:
       Was there any consideration given to using AtomicLong here instead? If threading is involved, it could more clearly express that any modifications will be atomic without the need to worry about method synchronization.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] cameronlee314 commented on pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
cameronlee314 commented on pull request #1364:
URL: https://github.com/apache/samza/pull/1364#issuecomment-634962641


   @lakshmi-manasa-g Can you please create a JIRA ticket for this and update the PR title with it? I can merge it once that is done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] cameronlee314 commented on a change in pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
cameronlee314 commented on a change in pull request #1364:
URL: https://github.com/apache/samza/pull/1364#discussion_r429424131



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -230,6 +231,10 @@ public synchronized void releaseBuffer() throws IOException {
     }
   }
 
+  public synchronized void incrementNumberOfRecordsInBlob() {

Review comment:
       I suggest adding some javadocs here to clarify how this needs to be used (i.e. only for passing metadata for the blob when closing the blob).
   Since `DataFileWriter` may still be buffering some data, this `totalNumberOfRecordsInBlob` may not reflect the exact number of records in the output stream until flush/close are actually called. Also, it would be good to clarify that this needs to be atomically incremented along with appending the record to the `DataFileWriter` (or else the final record count may not be correct when closing the blob). It looks like you are doing the atomic update already in `AzureBlobAvroWriter`, but the general management of writers and output streams is a bit complex, so it would be good to add some more details.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] lakshmi-manasa-g commented on pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
lakshmi-manasa-g commented on pull request #1364:
URL: https://github.com/apache/samza/pull/1364#issuecomment-631523597


   yes, generator may have additional metadata.
    it is for two reasons that it was designed this way.
   
   1. to let users have full control over what their k-v pairs should look like - we only give the value but the key is decided by the users (aka generator). For example the streamName maybe used to form a k-v pair: "table":"_source_streamName_table" to help with ingestion.
   
   2. to let users add more k-v pairs if they need to. for example, "dataFormat":"avro" for which users do not need an input from the system producer


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] bkonold commented on a change in pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
bkonold commented on a change in pull request #1364:
URL: https://github.com/apache/samza/pull/1364#discussion_r427806699



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/utils/BlobMetadataContext.java
##########
@@ -25,10 +25,12 @@
 public class BlobMetadataContext {
   private final String streamName;
   private final long blobSize;
+  private final long numberOfMessagesInBlob;
 
-  public BlobMetadataContext(String streamName, long blobSize) {
+  public BlobMetadataContext(String streamName, long blobSize, long numberOfMessagesInBlob) {
     this.streamName = streamName;
     this.blobSize = blobSize;
+    this.numberOfMessagesInBlob = numberOfMessagesInBlob;

Review comment:
       I see. Thanks for explaining.
   
   I notice that the interface of the generator takes a BlobMetadataContext and returns a Map<String,String>; out of curiosity, is it intended that a generator may have additional metadata (beyond what is carried within the BlobMetadataContext) to attach to the blob? 

##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -85,6 +85,7 @@
 
   private volatile boolean isClosed = false;
   private long totalUploadedBlockSize = 0;
+  private long totalNumberOfRecordsInBlob = 0;

Review comment:
       Cool. That's fine then.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] lakshmi-manasa-g commented on a change in pull request #1364: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
lakshmi-manasa-g commented on a change in pull request #1364:
URL: https://github.com/apache/samza/pull/1364#discussion_r427782229



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/utils/BlobMetadataContext.java
##########
@@ -25,10 +25,12 @@
 public class BlobMetadataContext {
   private final String streamName;
   private final long blobSize;
+  private final long numberOfMessagesInBlob;
 
-  public BlobMetadataContext(String streamName, long blobSize) {
+  public BlobMetadataContext(String streamName, long blobSize, long numberOfMessagesInBlob) {
     this.streamName = streamName;
     this.blobSize = blobSize;
+    this.numberOfMessagesInBlob = numberOfMessagesInBlob;

Review comment:
       This value will be given to the metadata generator Impl passed in by the user of the SystemProducer through a config (https://github.com/apache/samza/pull/1358).  
   
   So along with streamName, blobSize, this numberOfMessagesInBlob is anticipated to be used to build metadata properties of the blob created. Metadata proprties are kv pairs attached to the blob that can then be leveraged for kusto ingestion. So I expect the usage to be something like <stream:streamName, rawSize: blobSize, records: numberofMessages>




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] cameronlee314 merged pull request #1364: SAMZA-2534: AzureBlobSystemProducer: Enable adding of number of records in blob as metadata of the blob

Posted by GitBox <gi...@apache.org>.
cameronlee314 merged pull request #1364:
URL: https://github.com/apache/samza/pull/1364


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org