You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by GitBox <gi...@apache.org> on 2020/05/08 02:47:32 UTC

[GitHub] [samza] lakshmi-manasa-g opened a new pull request #1358: Add metadata generator

lakshmi-manasa-g opened a new pull request #1358:
URL: https://github.com/apache/samza/pull/1358


   **Feature:** This PR extends the existing AzureBlobSystemProducer by adding the ability to enhance an azure storage blob with metadata - basically key-value pairs to the blob's metadata (sits outside of the blob's actual content). This aids tremendously in ingesting these blobs into Kusto (Azure's data explorer). 
   
   **Changes:** New interfaces BlobMetadataGeneratorFactory, BlobMetadataGenerator and class BlobMetadataContext added where the user needs to implement BlobMetadataGeneratorFactory and BlobMetadataGenerator which takes in a BlobMetadataContext containing the stream name and the size of the current blob being committed. An instance of the generator will be created when the blob is about to be committed. Generator is invoked to get metadata and the returned metadata is attached to the blob.
   Any additional configs (key:value pairs) needed for this generator can be passed as with prefix systems.<system-name>.azureblob.metadataGeneratorConfig.\<key\> with value \<value\>. 
   
   A default impl is provided which does not add anything to the blob.
   
   **API changes:** New interfaces BlobMetadataGeneratorFactory, BlobMetadataGenerator and class BlobMetadataContext added. New configs with prefix systems.<system-name>.azureblob.metadataGeneratorConfig. The factory is wired in through systems.<system-name>.azureblob.metadataPropertiesGeneratorFactory.
   
   **Upgrade instructions:** Backwards compatible. If an impl is not provided the default no-op impl is used.
   
   **Usage instructions:** Wire in the generator factory through systems.<system-name>.azureblob.metadataPropertiesGeneratorFactory and pass in additional configs with prefix systems.<system-name>.azureblob.metadataGeneratorConfig 
   
   **Tests:** Existing tests updated to assert for metadata attached.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] cameronlee314 commented on a change in pull request #1358: SAMZA-2522: Add metadata generator for azure storage blob

Posted by GitBox <gi...@apache.org>.
cameronlee314 commented on a change in pull request #1358:
URL: https://github.com/apache/samza/pull/1358#discussion_r425509989



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/AzureBlobConfig.java
##########
@@ -94,6 +95,18 @@
   public static final String SYSTEM_SUFFIX_RANDOM_STRING_TO_BLOB_NAME = SYSTEM_AZUREBLOB_PREFIX + "suffixRandomStringToBlobName";
   private static final boolean SYSTEM_SUFFIX_RANDOM_STRING_TO_BLOB_NAME_DEFAULT = true;
 
+  // full class name of an implementation of org.apache.samza.system.azureblob.utils.BlobMetadataGeneratorFactory
+  // this factory should return an implementation of org.apache.samza.system.azureblob.utils.BlobMetadataGenerator
+  // this generator will be invoked when a blob is committed to add metadata properties to it
+  public static final String SYSTEM_BLOB_METADATA_PROPERTIES_GENERATOR_FACTORY = SYSTEM_AZUREBLOB_PREFIX + "metadataPropertiesGeneratorFactory";
+  private static final String SYSTEM_BLOB_METADATA_PROPERTIES_GENERATOR_FACTORY_DEFAULT =
+      "org.apache.samza.system.azureblob.utils.NullBlobMetadataGeneratorFactory";
+
+  // Additional configs for the metadata generator should be prefixed with this string which is passed to the generator.
+  // for example, to pass a "key":"value" pair to the metadata generator, add config like
+  // systems.<system-name>.azureblob.metadataGeneratorConfig.<key> with value <value>
+  public static final String SYSTEM_BLOB_METADATA_GENERATOR_CONFIG_PREFIX = SYSTEM_AZUREBLOB_PREFIX + "metadataGeneratorConfig";

Review comment:
       Does this need to have an extra `.` at the end so it gets stripped off in `getSystemMetadataGeneratorConfigs`?

##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/AzureBlobConfig.java
##########
@@ -185,4 +198,13 @@ public long getMaxBlobSize(String systemName) {
   public long getMaxMessagesPerBlob(String systemName) {
     return getLong(String.format(SYSTEM_MAX_MESSAGES_PER_BLOB, systemName), SYSTEM_MAX_MESSAGES_PER_BLOB_DEFAULT);
   }
+
+  public String getSystemMetadataPropertiesGeneratorFactory(String systemName) {
+    return get(String.format(SYSTEM_BLOB_METADATA_PROPERTIES_GENERATOR_FACTORY, systemName),
+        SYSTEM_BLOB_METADATA_PROPERTIES_GENERATOR_FACTORY_DEFAULT);
+  }
+
+  public Config getSystemMetadataGeneratorConfigs(String systemName) {
+    return subset(String.format(SYSTEM_BLOB_METADATA_GENERATOR_CONFIG_PREFIX, systemName));

Review comment:
       Minor: For consistency, consider naming these `getSystemBlobMetadata...`

##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -172,8 +193,8 @@ public synchronized void close() {
       future.get((long) flushTimeoutMs, TimeUnit.MILLISECONDS);
       LOG.info("For blob: {} committing blockList size:{}", blobAsyncClient.getBlobUrl().toString(), blockList.size());
       metrics.updateAzureCommitMetrics();
-      Map<String, String> blobMetadata = Collections.singletonMap(BLOB_RAW_SIZE_BYTES_METADATA, Long.toString(totalUploadedBlockSize));

Review comment:
       If an existing case needs this BLOB_RAW_SIZE_BYTES_METADATA, then will they need to implement their own blob metadata generator?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] lakshmi-manasa-g commented on pull request #1358: SAMZA-2522: Add metadata generator for azure storage blob

Posted by GitBox <gi...@apache.org>.
lakshmi-manasa-g commented on pull request #1358:
URL: https://github.com/apache/samza/pull/1358#issuecomment-629381702


   the exception is happening in Samza-core_2.11:test whereas these changes are all in samza-azure module. Hence these changes can not be causing the build failure seen in https://travis-ci.org/github/apache/samza/builds/687272039. Also triggered another build with an empty commit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] lakshmi-manasa-g commented on a change in pull request #1358: SAMZA-2522: Add metadata generator for azure storage blob

Posted by GitBox <gi...@apache.org>.
lakshmi-manasa-g commented on a change in pull request #1358:
URL: https://github.com/apache/samza/pull/1358#discussion_r425530536



##########
File path: samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobOutputStream.java
##########
@@ -172,8 +193,8 @@ public synchronized void close() {
       future.get((long) flushTimeoutMs, TimeUnit.MILLISECONDS);
       LOG.info("For blob: {} committing blockList size:{}", blobAsyncClient.getBlobUrl().toString(), blockList.size());
       metrics.updateAzureCommitMetrics();
-      Map<String, String> blobMetadata = Collections.singletonMap(BLOB_RAW_SIZE_BYTES_METADATA, Long.toString(totalUploadedBlockSize));

Review comment:
       ah, yes. they will need to implement their own generator. Updated the PR description to reflect backwards incompatible.  is that ok?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] cameronlee314 merged pull request #1358: SAMZA-2522: Add metadata generator for azure storage blob

Posted by GitBox <gi...@apache.org>.
cameronlee314 merged pull request #1358:
URL: https://github.com/apache/samza/pull/1358


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org