You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/12/26 00:59:42 UTC

[GitHub] [pinot] xiangfu0 opened a new pull request, #10034: Create metadata only tarball for metadata push job

xiangfu0 opened a new pull request, #10034:
URL: https://github.com/apache/pinot/pull/10034

   Instructions:
   1. The PR has to be tagged with at least one of the following labels (*):
      1. `feature`
      2. `bugfix`
      3. `performance`
      4. `ui`
      5. `backward-incompat`
      6. `release-notes` (**)
   2. Remove these instructions before publishing the PR.
    
   (*) Other labels to consider:
   - `testing`
   - `dependencies`
   - `docker`
   - `kubernetes`
   - `observability`
   - `security`
   - `code-style`
   - `extension-point`
   - `refactor`
   - `cleanup`
   
   (**) Use `release-notes` label for scenarios like:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #10034:
URL: https://github.com/apache/pinot/pull/10034#issuecomment-1364866321

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/10034?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#10034](https://codecov.io/gh/apache/pinot/pull/10034?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5b2f56f) into [master](https://codecov.io/gh/apache/pinot/commit/3fb211845d75e9e5657fab79b5766a20ee232024?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3fb2118) will **decrease** coverage by `39.18%`.
   > The diff coverage is `0.00%`.
   
   > :exclamation: Current head 5b2f56f differs from pull request most recent head cbdfc39. Consider uploading reports for the commit cbdfc39 to get more accurate results
   
   ```diff
   @@              Coverage Diff              @@
   ##             master   #10034       +/-   ##
   =============================================
   - Coverage     64.28%   25.10%   -39.19%     
   + Complexity     5670       44     -5626     
   =============================================
     Files          1939     1982       +43     
     Lines        105042   107191     +2149     
     Branches      16045    16308      +263     
   =============================================
   - Hits          67531    26907    -40624     
   - Misses        32618    77533    +44915     
   + Partials       4893     2751     -2142     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `25.10% <0.00%> (?)` | |
   | unittests1 | `?` | |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/10034?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...estion/batch/common/SegmentGenerationJobUtils.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtcGx1Z2lucy9waW5vdC1iYXRjaC1pbmdlc3Rpb24vcGlub3QtYmF0Y2gtaW5nZXN0aW9uLWNvbW1vbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvcGx1Z2luL2luZ2VzdGlvbi9iYXRjaC9jb21tb24vU2VnbWVudEdlbmVyYXRpb25Kb2JVdGlscy5qYXZh) | `0.00% <0.00%> (-90.00%)` | :arrow_down: |
   | [...n/batch/standalone/SegmentGenerationJobRunner.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtcGx1Z2lucy9waW5vdC1iYXRjaC1pbmdlc3Rpb24vcGlub3QtYmF0Y2gtaW5nZXN0aW9uLXN0YW5kYWxvbmUvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3BsdWdpbi9pbmdlc3Rpb24vYmF0Y2gvc3RhbmRhbG9uZS9TZWdtZW50R2VuZXJhdGlvbkpvYlJ1bm5lci5qYXZh) | `0.00% <0.00%> (-72.49%)` | :arrow_down: |
   | [...he/pinot/segment/local/utils/SegmentPushUtils.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9TZWdtZW50UHVzaFV0aWxzLmphdmE=) | `0.00% <0.00%> (-11.98%)` | :arrow_down: |
   | [...ingestion/batch/spec/SegmentGenerationJobSpec.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvaW5nZXN0aW9uL2JhdGNoL3NwZWMvU2VnbWVudEdlbmVyYXRpb25Kb2JTcGVjLmphdmE=) | `0.00% <0.00%> (-56.25%)` | :arrow_down: |
   | [...ngestion/batch/spec/SegmentGenerationTaskSpec.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvaW5nZXN0aW9uL2JhdGNoL3NwZWMvU2VnbWVudEdlbmVyYXRpb25UYXNrU3BlYy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ain/java/org/apache/pinot/spi/utils/LoopUtils.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdXRpbHMvTG9vcFV0aWxzLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...in/java/org/apache/pinot/spi/utils/BytesUtils.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdXRpbHMvQnl0ZXNVdGlscy5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...java/org/apache/pinot/spi/trace/BaseRecording.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdHJhY2UvQmFzZVJlY29yZGluZy5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...java/org/apache/pinot/spi/trace/NoOpRecording.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvdHJhY2UvTm9PcFJlY29yZGluZy5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...ava/org/apache/pinot/spi/config/table/FSTType.java](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvY29uZmlnL3RhYmxlL0ZTVFR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | ... and [1606 more](https://codecov.io/gh/apache/pinot/pull/10034/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on PR #10034:
URL: https://github.com/apache/pinot/pull/10034#issuecomment-1366106890

   > Doesn't this introduce inconsistency between the two steps. Assume the following scenario -
   > 
   > * Generate one segment tar using StandaloneJobRunner, we also generate the corresponding metadata tar for it
   > * Replaced that segment tar in the output dir with some other tar file with the same name
   > * We trigger metadata push which pushes the old metadata tar, however the segment tar file contains different data
   
   You are 100% correct. I can make the segment creation job to delete the metadata tarball as well if not exists.
   But if users intend to manually update the segment tarball, then it's pretty hard to detect that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] zhtaoxiang commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
zhtaoxiang commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1059725445


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+          _spec.isOverwriteOutput());
 
+
+      String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   Thanks for the udpate!
   
   I feel that there are 3 cases
   1. `_spec.isOverwriteOutput()` is true -> we need to delete metadata from output dir when it exists
   2. `_spec.isOverwriteOutput()` is false and `taskSpec.isCreateMetadataTarGz` is true, we should try to create metadata targz locally and try to copy it to output dir in case the metadata targz does not exist (due to user switches off then on the `isCreateMetadataTarGz` flag)
   3. `_spec.isOverwriteOutput()` is false and `taskSpec.isCreateMetadataTarGz` is false, we should try to delete metadata targz from output dir in case the metadata targz exists (due to user switches on then off the `isCreateMetadataTarGz` flag)
   
   I think we missed case 3, what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] shwin commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
shwin commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1080812390


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/SegmentPushUtils.java:
##########
@@ -284,7 +284,22 @@ public static void sendSegmentUriAndMetadata(SegmentGenerationJobSpec spec, Pino
       String segmentName = fileName.endsWith(Constants.TAR_GZ_FILE_EXT)
           ? fileName.substring(0, fileName.length() - Constants.TAR_GZ_FILE_EXT.length()) : fileName;
       SegmentNameUtils.validatePartialOrFullSegmentName(segmentName);
-      File segmentMetadataFile = generateSegmentMetadataFile(fileSystem, URI.create(tarFilePath));
+      File segmentMetadataFile;
+      // Check if there is a segment metadata tar gz file named `segmentName.metadata.tar.gz`, already in the remote
+      // directory. This is to avoid generating a new segment metadata tar gz file every time we push a segment,
+      // which requires downloading the entire segment tar gz file.
+      URI metadataTarGzFilePath = URI.create(
+          new File(tarFilePath).getParentFile() + File.separator + segmentName + Constants.METADATA_TAR_GZ_FILE_EXT);
+      if (spec.getPushJobSpec().isPreferMetadataTarGz() && fileSystem.exists(metadataTarGzFilePath)) {

Review Comment:
   I think https://replit.com/join/kndnjhlrzw-ashwinraja2 shows what's going on



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on PR #10034:
URL: https://github.com/apache/pinot/pull/10034#issuecomment-1367689687

   > > > Doesn't this introduce inconsistency between the two steps. Assume the following scenario -
   > > > 
   > > > * Generate one segment tar using StandaloneJobRunner, we also generate the corresponding metadata tar for it
   > > > * Replaced that segment tar in the output dir with some other tar file with the same name
   > > > * We trigger metadata push which pushes the old metadata tar, however the segment tar file contains different data
   > > 
   > > 
   > > You are 100% correct. I can make the segment creation job to delete the metadata tarball as well if not exists. But if users intend to manually update the segment tarball, then it's pretty hard to detect that.
   > 
   > Thinking out loud here, but may be we can introduce a crc check and keep segment tar crc in the metadata tar?
   
   The purpose of using metadata push is to avoid the segment download from controller. The CRC is in segment metadata, but we cannot validate it without download the segment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1059217816


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+          _spec.isOverwriteOutput());
 
+
+      String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   > > > Doesn't this introduce inconsistency between the two steps. Assume the following scenario -
   > > > 
   > > > * Generate one segment tar using StandaloneJobRunner, we also generate the corresponding metadata tar for it
   > > > * Replaced that segment tar in the output dir with some other tar file with the same name
   > > > * We trigger metadata push which pushes the old metadata tar, however the segment tar file contains different data
   > > 
   > > 
   > > You are 100% correct. I can make the segment creation job to delete the metadata tarball as well if not exists. But if users intend to manually update the segment tarball, then it's pretty hard to detect that.
   > 
   > Thinking out loud here, but may be we can introduce a crc check and keep segment tar crc in the metadata tar?
   
   The purpose of using metadata push is to avoid the segment download from controller. The CRC is in segment metadata, but we cannot access it without download the segment.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] zhtaoxiang commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
zhtaoxiang commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1058115858


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-2.4/src/main/java/org/apache/pinot/plugin/ingestion/batch/spark/SparkSegmentGenerationJobRunner.java:
##########
@@ -290,20 +291,26 @@ public void call(String pathAndIdx)
           long compressedSegmentSize = FileUtils.sizeOf(localSegmentTarFile);
           LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
               DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
-          //move segment to output PinotFS
-          URI outputSegmentTarURI =
-              SegmentGenerationUtils.getRelativeOutputPath(finalInputDirURI, inputFileURI, finalOutputDirURI)
-                  .resolve(segmentTarFileName);
-          LOGGER.info("Trying to move segment tar file from: [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-          if (!_spec.isOverwriteOutput() && PinotFSFactory.create(outputSegmentTarURI.getScheme())
-              .exists(outputSegmentTarURI)) {
-            LOGGER.warn("Not overwrite existing output segment tar file: {}",
-                finalOutputDirFS.exists(outputSegmentTarURI));
-          } else {
-            finalOutputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+          // Move segment to output PinotFS
+          URI relativeOutputPath =
+              SegmentGenerationUtils.getRelativeOutputPath(finalInputDirURI, inputFileURI, finalOutputDirURI);
+          URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+
+          // Create and upload segment metadata tar file
+          String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   the same here.



##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+          _spec.isOverwriteOutput());
 
+
+      String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   When `_spec.isOverwriteOutput()` is false and `taskSpec.isCreateMetadataTarGz()` is true, we should not delete `outputMetadataTarURI`. Otherwise, segment tar file will NOT be replaced and metadata tar file will be replaced, which causes inconsistency.



##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/src/main/java/org/apache/pinot/plugin/ingestion/batch/spark3/SparkSegmentGenerationJobRunner.java:
##########
@@ -289,20 +289,26 @@ public void call(String pathAndIdx)
           long compressedSegmentSize = FileUtils.sizeOf(localSegmentTarFile);
           LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
               DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
-          //move segment to output PinotFS
-          URI outputSegmentTarURI =
-              SegmentGenerationUtils.getRelativeOutputPath(finalInputDirURI, inputFileURI, finalOutputDirURI)
-                  .resolve(segmentTarFileName);
-          LOGGER.info("Trying to move segment tar file from: [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-          if (!_spec.isOverwriteOutput() && PinotFSFactory.create(outputSegmentTarURI.getScheme())
-              .exists(outputSegmentTarURI)) {
-            LOGGER.warn("Not overwrite existing output segment tar file: {}",
-                finalOutputDirFS.exists(outputSegmentTarURI));
-          } else {
-            finalOutputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+          // Move segment to output PinotFS
+          URI relativeOutputPath =
+              SegmentGenerationUtils.getRelativeOutputPath(finalInputDirURI, inputFileURI, finalOutputDirURI);
+          URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+          SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+              _spec.isOverwriteOutput());
+
+          // Create and upload segment metadata tar file
+          String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   the same here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1059219609


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,24 @@ protected void map(LongWritable key, Text value, Context context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+          _spec.isOverwriteOutput());
 
+
+      String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");

Review Comment:
   I think we will delete the metadata tar gz file once `_spec.isOverwriteOutput()` is true.
   Just in the case of user switch on then off the `isCreateMetadataTarGz` flag. Then this code will only generate segment tar gz and delete the metadata tar gz file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
KKcorps commented on PR #10034:
URL: https://github.com/apache/pinot/pull/10034#issuecomment-1366669679

   
   
   
   
   > > Doesn't this introduce inconsistency between the two steps. Assume the following scenario -
   > > 
   > > * Generate one segment tar using StandaloneJobRunner, we also generate the corresponding metadata tar for it
   > > * Replaced that segment tar in the output dir with some other tar file with the same name
   > > * We trigger metadata push which pushes the old metadata tar, however the segment tar file contains different data
   > 
   > You are 100% correct. I can make the segment creation job to delete the metadata tarball as well if not exists. But if users intend to manually update the segment tarball, then it's pretty hard to detect that.
   
   Thinking out loud here, but may be we can introduce a crc check and keep segment tar crc in the metadata tar?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] shwin commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
shwin commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1080811688


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/SegmentPushUtils.java:
##########
@@ -284,7 +284,22 @@ public static void sendSegmentUriAndMetadata(SegmentGenerationJobSpec spec, Pino
       String segmentName = fileName.endsWith(Constants.TAR_GZ_FILE_EXT)
           ? fileName.substring(0, fileName.length() - Constants.TAR_GZ_FILE_EXT.length()) : fileName;
       SegmentNameUtils.validatePartialOrFullSegmentName(segmentName);
-      File segmentMetadataFile = generateSegmentMetadataFile(fileSystem, URI.create(tarFilePath));
+      File segmentMetadataFile;
+      // Check if there is a segment metadata tar gz file named `segmentName.metadata.tar.gz`, already in the remote
+      // directory. This is to avoid generating a new segment metadata tar gz file every time we push a segment,
+      // which requires downloading the entire segment tar gz file.
+      URI metadataTarGzFilePath = URI.create(
+          new File(tarFilePath).getParentFile() + File.separator + segmentName + Constants.METADATA_TAR_GZ_FILE_EXT);
+      if (spec.getPushJobSpec().isPreferMetadataTarGz() && fileSystem.exists(metadataTarGzFilePath)) {

Review Comment:
   Mentioned this in slack (https://apache-pinot.slack.com/archives/C011C9JHN7R/p1674101587879929?thread_ts=1674101384.733239&cid=C011C9JHN7R) but I think this line doesn't work with S3 paths:
   
   `File(tarFilePath).getParentFile()` will end up returning `s3:/bucket/path` NOT `s3://bucket/path` which ends up messing the subsequent existence code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] zhtaoxiang commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
zhtaoxiang commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1059725960


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,26 @@ protected void map(LongWritable key, Text value, Context context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+          _spec.isOverwriteOutput());
+
+      // Create and upload segment metadata tar file
+      String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");
+      URI outputMetadataTarURI = relativeOutputPath.resolve(metadataTarFileName);
+      if (_spec.isOverwriteOutput() && outputDirFS.exists(outputMetadataTarURI)) {

Review Comment:
   Thanks for the udpate!
   
   I feel that there are 3 cases
   1. `_spec.isOverwriteOutput()` is true -> we need to delete metadata from output dir when it exists
   2. `_spec.isOverwriteOutput()` is false and `taskSpec.isCreateMetadataTarGz` is true, we should try to create metadata targz locally and try to copy it to output dir in case the metadata targz does not exist (due to user switches off then on the `isCreateMetadataTarGz` flag)
   3. `_spec.isOverwriteOutput()` is false and `taskSpec.isCreateMetadataTarGz` is false, we should try to delete metadata targz from output dir in case the metadata targz exists (due to user switches on then off the `isCreateMetadataTarGz` flag)
   
   I think we missed case 3, what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] zhtaoxiang commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
zhtaoxiang commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1057840220


##########
pinot-spi/src/main/java/org/apache/pinot/spi/ingestion/batch/spec/SegmentGenerationJobSpec.java:
##########
@@ -139,6 +139,11 @@ public class SegmentGenerationJobSpec implements Serializable {
    */
   private String _authToken;
 
+  /**
+   * Create a separated metadata only tar gz file to reduce the data transfer of segment metadata push job.
+   */
+  private boolean _createMetadataTarGz = true;

Review Comment:
   I feel that we should disable it by default as for now to avoid potential problems. We can enable it by default after we have fully tested it in some envs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
KKcorps commented on PR #10034:
URL: https://github.com/apache/pinot/pull/10034#issuecomment-1365818299

   Doesn't this introduce inconsistency between the two steps. Assume the following scenario - 
   
   * Generate one segment tar using StandaloneJobRunner, we also generate the corresponding metadata tar for it
   * Replaced that segment tar in the output dir with some other tar file with the same name
   * We trigger metadata push which pushes the old metadata tar, however the segment tar file contains different data


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1057854104


##########
pinot-spi/src/main/java/org/apache/pinot/spi/ingestion/batch/spec/SegmentGenerationJobSpec.java:
##########
@@ -139,6 +139,11 @@ public class SegmentGenerationJobSpec implements Serializable {
    */
   private String _authToken;
 
+  /**
+   * Create a separated metadata only tar gz file to reduce the data transfer of segment metadata push job.
+   */
+  private boolean _createMetadataTarGz = true;

Review Comment:
   will update the default 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on a diff in pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on code in PR #10034:
URL: https://github.com/apache/pinot/pull/10034#discussion_r1061980199


##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/src/main/java/org/apache/pinot/plugin/ingestion/batch/hadoop/HadoopSegmentCreationMapper.java:
##########
@@ -183,13 +184,26 @@ protected void map(LongWritable key, Text value, Context context) {
       LOGGER.info("Size for segment: {}, uncompressed: {}, compressed: {}", segmentName,
           DataSizeUtils.fromBytes(uncompressedSegmentSize), DataSizeUtils.fromBytes(compressedSegmentSize));
       //move segment to output PinotFS
-      URI outputSegmentTarURI = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI)
-          .resolve(segmentTarFileName);
-      LOGGER.info("Copying segment tar file from [{}] to [{}]", localSegmentTarFile, outputSegmentTarURI);
-      outputDirFS.copyFromLocalFile(localSegmentTarFile, outputSegmentTarURI);
+      URI relativeOutputPath = SegmentGenerationUtils.getRelativeOutputPath(inputDirURI, inputFileURI, outputDirURI);
+      URI outputSegmentTarURI = relativeOutputPath.resolve(segmentTarFileName);
+      SegmentGenerationJobUtils.moveLocalTarFileToRemote(localSegmentTarFile, outputSegmentTarURI,
+          _spec.isOverwriteOutput());
+
+      // Create and upload segment metadata tar file
+      String metadataTarFileName = URLEncoder.encode(segmentName + Constants.METADATA_TAR_GZ_FILE_EXT, "UTF-8");
+      URI outputMetadataTarURI = relativeOutputPath.resolve(metadataTarFileName);
+      if (_spec.isOverwriteOutput() && outputDirFS.exists(outputMetadataTarURI)) {

Review Comment:
   Updated the logic to delete the metadata file when `_spec.isOverwriteOutput() || !_spec.isCreateMetadataTarGz()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 merged pull request #10034: Create metadata only tarball for metadata push job

Posted by GitBox <gi...@apache.org>.
xiangfu0 merged PR #10034:
URL: https://github.com/apache/pinot/pull/10034


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org