Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/01/26 21:05:12 UTC

[GitHub] [incubator-pinot] kkrugler opened a new issue #6492: HadoopSegmentGenerationJobRunner isn't setting up Hadoop distributed cache correctly

kkrugler opened a new issue #6492:
URL: https://github.com/apache/incubator-pinot/issues/6492


   Currently the code creates a tarball of the plugin directory inside of the Pinot distribution directory, and then calls `job.addCacheArchive()` with a `file://<path to tarball>` URI. This won't work, as all that call does is store the path in the JobConf. On the slaves, that path is used as the source for copying files locally, but the `file://xxx` path doesn't exist there.
   
   The Hadoop [DistributedCache documentation](https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/filecache/DistributedCache.html) says:
   
   > Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The DistributedCache assumes that the files specified via urls are already present on the FileSystem at the path specified by the url and are accessible by every machine in the cluster.
   
   So the `HadoopSegmentGenerationJobRunner` needs to copy these files to HDFS and set the distributed cache path to that location. There are a couple of options for where to put them. If you use the standard Hadoop command-line `-files xxx` parameter (as an example), the standard Hadoop tool framework copies the file(s) to a job-specific directory inside the "staging" directory, so we could try to leverage that same location. But since Pinot already requires that a staging directory be specified in the job spec file, and this has to be in HDFS for a distributed job, we could instead use an explicit sub-dir within that directory.
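   A minimal sketch of the staging-directory approach, using only the JDK. The class and method names here are hypothetical, not Pinot's actual API; in the real runner the tarball would be copied to this URI (e.g. via Hadoop's `FileSystem.copyFromLocalFile`) before calling `job.addCacheArchive(uri)`:

```java
import java.net.URI;

public class StagingCachePath {
    // Hypothetical helper: derive a distributed-cache location for the
    // plugin tarball from the job-spec staging directory (which must be
    // on HDFS for a distributed job).
    static URI cacheArchiveUri(String stagingDir, String tarballName) {
        // Ensure a trailing slash so URI.resolve() treats stagingDir as a directory.
        if (!stagingDir.endsWith("/")) {
            stagingDir += "/";
        }
        return URI.create(stagingDir).resolve("plugins/" + tarballName);
    }

    public static void main(String[] args) {
        URI uri = cacheArchiveUri("hdfs://namenode:8020/pinot/staging", "pinot-plugins.tar.gz");
        // The runner would copy the local tarball to this URI and then
        // register it with job.addCacheArchive(uri).
        System.out.println(uri);
    }
}
```

   Because the URI lives under the job's own staging directory, every slave can read it, and cleaning up the staging directory also removes the cached tarball.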


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] kkrugler commented on issue #6492: HadoopSegmentGenerationJobRunner isn't setting up Hadoop distributed cache correctly

Posted by GitBox <gi...@apache.org>.
kkrugler commented on issue #6492:
URL: https://github.com/apache/incubator-pinot/issues/6492#issuecomment-770075892


   While digging into the code, I found a few more issues that I'm fixing in the same PR:
   
   - Mapper was creating the temp plugin directory in the current working directory (probably only an issue during tests).
   - Mapper wasn't registering Pinot file systems before starting segment generation.
   - Mapper was incorrectly applying the `overwriteOutput` flag when writing segments to the staging dir.
   - Job runner wasn't clearing out the staging directory at the start of execution; a partial directory could be left around after a failed run, causing the next attempt to fail because the output sub-dir already exists.
   - Job runner wasn't adding the scheme to input file paths before writing them out to the (temp) Hadoop input files.
   - Job runner was setting the job jar class to itself, but this class is inside the Hadoop batch ingest plugin, which meant the "pinot-all" jar wasn't being distributed to Hadoop slaves.
   - Job runner wasn't disabling speculative execution, which could cause the job to fail when two mappers wrote to the same output file in the staging directory.
   - Job runner was using the (very slow) copy command, rather than the move command, when updating the real output directory with the generated segments from the staging directory. It also wasn't honoring the `overwriteOutput` flag during the copy.
   - Job runner was adding the plugin tarball as an archive, which meant the Hadoop distributed cache was trying to unpack it while the mapper code was also trying to expand it. I changed it to add the tarball as a plain file and left the mapper code as-is, though it might be cleaner to leverage Hadoop's unpacking.
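   The "adding the scheme to input file paths" fix can be sketched with the JDK alone. The helper name is illustrative, not Pinot's actual code; the idea is that a schemeless path written to the Hadoop input files would otherwise be resolved against whatever default filesystem each mapper happens to see:

```java
import java.net.URI;

public class QualifyInputPath {
    // Hypothetical sketch: a path with no scheme is qualified against the
    // default filesystem URI, so every mapper resolves it to the same place.
    static String qualify(String path, URI defaultFs) {
        URI u = URI.create(path);
        if (u.getScheme() != null) {
            return path;  // already fully qualified; leave untouched
        }
        return defaultFs.resolve(u).toString();
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://namenode:8020/");
        System.out.println(qualify("/input/data.avro", fs));      // gets hdfs:// scheme
        System.out.println(qualify("file:///tmp/local.avro", fs)); // unchanged
    }
}
```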
   
   I also added a test for standalone batch ingestion, as it now shares some code with the Hadoop batch ingestion path and didn't seem to have any existing test coverage.

