Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/12/12 17:42:02 UTC

[GitHub] [incubator-pinot] plaisted opened a new issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

plaisted opened a new issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349


   When running an ingestion job using the 'standalone' execution framework, the files written to 'outputDirURI' persist after the job completes. A couple of issues arise from this:
   
   - Subsequent ingestion runs push the leftover files from previous runs in addition to the files for the current run.
   - If concurrent jobs run against the same storage location, they attempt to load each other's files.
   
   I haven't dug into the code but it seems like the job should:
   - clean up after itself
   - only load segments from the outputDirURI that it created in the job
   
   If there are reasons why it shouldn't or can't do this, additional documentation on the behavior and purpose of the outputDirURI with standalone jobs would be helpful to call out the cleanup and URI uniqueness requirements.
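
Until the job cleans up after itself, one caller-side way to meet the uniqueness requirement is to give every run its own output directory. The following is a minimal Python sketch of that idea, assuming a local filesystem; `unique_output_dir` and the directory naming scheme are hypothetical and not part of Pinot.

```python
import tempfile
import uuid
from pathlib import Path


def unique_output_dir(base_dir: Path, job_name: str) -> Path:
    """Create and return a fresh subdirectory of base_dir for a single run."""
    run_dir = base_dir / f"{job_name}-{uuid.uuid4().hex}"
    # exist_ok=False makes an (unlikely) name collision fail loudly
    # instead of silently sharing a directory with another run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


# Usage: two runs of the same job never see each other's files.
with tempfile.TemporaryDirectory() as base:
    first = unique_output_dir(Path(base), "ingest")
    second = unique_output_dir(Path(base), "ingest")
    (first / "segment_0.tar.gz").touch()  # placeholder, not a real segment
    print(first != second, list(second.iterdir()))
```

The returned path would then be supplied as the `outputDirURI` for that run, so leftovers from earlier runs and files from concurrent runs stay out of its way.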
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] amarnathkarthik commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
amarnathkarthik commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-749460499


   @fx19880617 Thanks for the clarification and suggestion.
   
   Shouldn't we just delete the files, rather than the whole output directory, for both Tar and Uri push? Also, instead of merging the segment generation and push tasks, maybe add a cleanup step that removes the tar files from the output directory once the push finishes, similar to `SegmentMetadataPush`. This avoids removing segment files generated by other jobs.
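
The difference between deleting files and deleting the directory can be sketched as follows. This is an illustrative Python sketch only, assuming local tar files; `cleanup_pushed_segments` and the file names are hypothetical and do not mirror Pinot's implementation.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def cleanup_pushed_segments(pushed_files: Iterable[Path]) -> List[Path]:
    """Delete only the given segment tar files; return those actually removed."""
    removed = []
    for tar_file in pushed_files:
        # Skip files that are already gone rather than failing the cleanup.
        if tar_file.exists():
            tar_file.unlink()
            removed.append(tar_file)
    return removed


# Usage: two jobs share one output directory.
with tempfile.TemporaryDirectory() as out:
    out_dir = Path(out)
    mine = out_dir / "segment_job_a_0.tar.gz"
    theirs = out_dir / "segment_job_b_0.tar.gz"
    mine.touch()
    theirs.touch()

    cleanup_pushed_segments([mine])
    remaining = sorted(p.name for p in out_dir.iterdir())
    print(remaining)
```

Because the cleanup works from an explicit list of files, anything another job wrote to the same directory is left untouched.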




[GitHub] [incubator-pinot] fx19880617 commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
fx19880617 commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-749282776


   1. For the ingestion job, it's by design to keep the segments in the output directory. The reason is that for URI and METADATA push jobs, the output dir is treated as the source of truth for the segments. E.g. users will use this job to generate segments and write them directly into s3, then push metadata to Pinot so that it loads the segments from the same s3 directory.
   
   I think it's ok to add a config like `cleanUpOutputDir` to delete the output directory if the push mode is `TAR`; the default value should be false.
   
   2. We usually expect the ingestion job output directory to be empty, but you are right: if segments are already there or are still being built, the job will push them all.
   
   To solve this I feel we can:
   - Merge segment generation and push into one task;
   - Let the segment generation job return an array of generated tar file URIs;
   - Have the push task take the array and do the work.
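
The three steps above can be sketched roughly as follows. This is a minimal Python illustration, assuming local tar files; `generate_segments`, `push_segments`, and the `push_one` callback are hypothetical stand-ins, not Pinot APIs.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def generate_segments(output_dir: Path, segment_names: List[str]) -> List[str]:
    """Stand-in for segment generation: write tar files and return their URIs."""
    uris = []
    for name in segment_names:
        tar_path = output_dir / f"{name}.tar.gz"
        tar_path.write_bytes(b"")  # placeholder content, not a real segment
        uris.append(tar_path.as_uri())
    return uris


def push_segments(segment_uris: List[str], push_one: Callable[[str], None]) -> int:
    """Push exactly the URIs handed over by generation; never scan the directory."""
    for uri in segment_uris:
        push_one(uri)
    return len(segment_uris)


# Usage: a leftover file from an earlier run is ignored by the push step.
with tempfile.TemporaryDirectory() as out:
    out_dir = Path(out)
    (out_dir / "leftover_from_old_run.tar.gz").touch()

    pushed: List[str] = []
    generated = generate_segments(out_dir, ["segment_0", "segment_1"])
    push_segments(generated, push_one=pushed.append)
    print(len(pushed))
```

Passing the list explicitly means the push step no longer depends on the output directory being empty when the job starts.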
   
   




[GitHub] [incubator-pinot] amarnathkarthik commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
amarnathkarthik commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-749209495


   Here is my analysis. I looked at `LaunchDataIngestionJobCommand.java` and all of its job runners (segment generation and push jobs). I did not see any implementation that cleans up `outputDirURI` after the push, but I do see tempDir cleanup after the generated segments are copied to outputDirURI.
   
   @kishoreg @plaisted We have 2 options; let me know which one would be appropriate:
   
   1. **Backward compatible** - Introduce a new property, `SegmentGenerationJobSpec._cleanupOutputAfterPush`, to remove segments from `outputDirURI` once the push call finishes successfully. `cleanupOutputAfterPush` would be configurable via YAML and default to `false`.
   2. **Cleanup by default** - Enhance the push job to remove the segments from the outputDirURI once the push call finishes successfully.
   
   Note: 
   Neither the Spark nor the Hadoop push jobs support cleanup today.
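
The backward-compatible option can be sketched as an opt-in flag that is checked only after every push has succeeded. This is a minimal Python illustration under stated assumptions: `PushJobSpec`, `run_push`, and `push_one` are hypothetical names, and the flag merely mirrors the proposed `cleanupOutputAfterPush` property.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class PushJobSpec:
    # Hypothetical stand-in for the proposed property; False keeps today's behavior.
    cleanup_output_after_push: bool = False


def run_push(spec: PushJobSpec, segment_files: List[Path],
             push_one: Callable[[Path], None]) -> None:
    """Push every segment, then optionally delete the pushed files."""
    for segment in segment_files:
        push_one(segment)  # an exception here aborts before any cleanup
    if spec.cleanup_output_after_push:
        for segment in segment_files:
            segment.unlink()


# Usage: the default keeps the file, the opt-in flag removes it.
with tempfile.TemporaryDirectory() as out:
    segment = Path(out) / "segment_0.tar.gz"
    segment.touch()

    run_push(PushJobSpec(), [segment], push_one=lambda s: None)
    kept_by_default = segment.exists()

    run_push(PushJobSpec(cleanup_output_after_push=True), [segment],
             push_one=lambda s: None)
    removed_when_enabled = not segment.exists()
    print(kept_by_default, removed_when_enabled)
```

Option 2 would correspond to flipping the default to `True`, which changes behavior for existing URI and METADATA push users who rely on the output directory.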




[GitHub] [incubator-pinot] plaisted closed issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
plaisted closed issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349


   




[GitHub] [incubator-pinot] plaisted commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
plaisted commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-754346251


   Ty @amarnathkarthik will check the changes out




[GitHub] [incubator-pinot] amarnathkarthik commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
amarnathkarthik commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-754333990


   @plaisted Could you please mark this issue closed? I do not see an option to close.
   
   CC: @fx19880617 




[GitHub] [incubator-pinot] amarnathkarthik commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
amarnathkarthik commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-748671563


   @kishoreg @plaisted Please assign this task to me; I would like to work on this issue.




[GitHub] [incubator-pinot] fx19880617 commented on issue #6349: Standalone ingestion jobs do not clean up output files after completing loading

Posted by GitBox <gi...@apache.org>.
fx19880617 commented on issue #6349:
URL: https://github.com/apache/incubator-pinot/issues/6349#issuecomment-755682763


   Thanks, @amarnathkarthik !!!
   
   This is awesome!




