Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/01/29 22:06:41 UTC

[GitHub] [incubator-pinot] kkrugler commented on issue #6492: HadoopSegmentGenerationJobRunner isn't setting up Hadoop distributed cache correctly

kkrugler commented on issue #6492:
URL: https://github.com/apache/incubator-pinot/issues/6492#issuecomment-770075892


   While digging into the code, I found several more issues that I'm fixing in the same PR:
   
   - The mapper was creating its temp plugin directory in the current working directory (probably only an issue during tests).
   - The mapper wasn't registering Pinot file systems before starting segment generation.
   - The mapper was incorrectly using the `overwriteOutput` flag when writing segments to the staging dir.
   - The job runner wasn't clearing out the staging directory at the start of execution. A partial directory could be left around after a failed run, which would cause the next attempt to fail because the output sub-dir already exists.
   - The job runner wasn't adding the scheme to input file paths before writing them out to the (temp) Hadoop input files.
   - The job runner was setting the job jar class to itself, but that class lives inside the Hadoop batch ingest plugin, which meant the "pinot-all" jar wasn't being distributed to Hadoop slaves.
   - The job runner wasn't disabling speculative execution, which could cause the job to fail when two mappers write to the same output file in the staging directory.
   - The job runner was using the (very slow) copy command, rather than the move command, when updating the real output directory with the generated segments from the staging directory. It also wasn't honoring the `overwriteOutput` flag during the copy.
   - The job runner was adding the plugin tarball as an archive, which meant the Hadoop distributed cache system was trying to unpack it while the mapper code was also trying to expand it. I changed it to add the tarball as a plain file and left the mapper code as-is, though it might be cleaner to leverage Hadoop's code.
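
   To illustrate the staging-directory items above (clearing stale state at startup, and moving rather than copying while honoring `overwriteOutput`), here is a minimal sketch. It is not the code from the PR: it assumes a local file system and uses `java.nio.file` instead of Pinot's or Hadoop's file system APIs, and the class and method names (`StagingDirSketch`, `resetStagingDir`, `publishSegments`) are hypothetical.

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardCopyOption;
   import java.util.Comparator;
   import java.util.List;
   import java.util.stream.Collectors;
   import java.util.stream.Stream;

   // Hypothetical helper for illustration only; none of these names come from Pinot.
   final class StagingDirSketch {

     /** Deletes anything left in the staging directory by a failed run, then recreates it empty. */
     static void resetStagingDir(Path stagingDir) throws IOException {
       if (Files.exists(stagingDir)) {
         List<Path> toDelete;
         try (Stream<Path> paths = Files.walk(stagingDir)) {
           // Reverse order puts children before their parent directories.
           toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
         }
         for (Path p : toDelete) {
           Files.delete(p);
         }
       }
       Files.createDirectories(stagingDir);
     }

     /** Moves every regular file under stagingDir to the same relative path under outputDir. */
     static void publishSegments(Path stagingDir, Path outputDir, boolean overwriteOutput)
         throws IOException {
       List<Path> files;
       try (Stream<Path> paths = Files.walk(stagingDir)) {
         files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
       }
       for (Path src : files) {
         Path dst = outputDir.resolve(stagingDir.relativize(src));
         Files.createDirectories(dst.getParent());
         if (overwriteOutput) {
           // Replace an existing segment file of the same name.
           Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
         } else {
           // Without REPLACE_EXISTING, move throws FileAlreadyExistsException if dst exists.
           Files.move(src, dst);
         }
       }
     }
   }
   ```

   Calling `resetStagingDir` before the job starts avoids the "output sub-dir already exists" failure after an aborted run, and `publishSegments` makes the `overwriteOutput` decision explicit at the point where files land in the real output directory.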
   
   I also added a test for standalone batch ingestion, since it now shares some code with the Hadoop batch ingestion and didn't appear to have any existing tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


