Posted to dev@hive.apache.org by dam6923 <da...@gmail.com> on 2018/07/27 18:49:35 UTC
MapReduce Output File Names
Hello,
When Hive MapReduce jobs create HDFS output files, they use the format:
000000_0.gz
000000_0.gz_copy_1
000000_0.gz_copy_2
000000_0.gz_copy_3
...
This seems like it could become a long-running list over time. In
fact, the code says "leave the below loop for now until a better
approach is found."
https://github.com/apache/hive/blob/758ff449099065a84c46d63f9418201c8a6731b1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3710
Would it be problematic to simply prefix a random number, or
timestamp, on the front of the file name to make it unique? This
would save the code from having to loop to ask the FileSystem
(NameNode) "is copy 1 there?", "is copy 2 there?", "is copy 3 there?"
etc.
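To make the cost concrete, here is a minimal sketch of the probing pattern described above. It is not the code from Hive.java: the class name CopySuffixProbe is hypothetical, and an in-memory set stands in for the FileSystem (NameNode) so that the number of existence checks can be counted.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; an in-memory set replaces the real FileSystem.
final class CopySuffixProbe {
    private final Set<String> existing = new HashSet<>();
    private int lookups = 0;

    // Stand-in for an "is this file there?" call to the NameNode.
    private boolean exists(String name) {
        lookups++;
        return existing.contains(name);
    }

    // Try the base name first, then base_copy_1, base_copy_2, ... until one is free.
    String claimNextName(String base) {
        String candidate = base;
        for (int i = 1; exists(candidate); i++) {
            candidate = base + "_copy_" + i;
        }
        existing.add(candidate);
        return candidate;
    }

    int lookups() {
        return lookups;
    }

    public static void main(String[] args) {
        CopySuffixProbe fs = new CopySuffixProbe();
        for (int n = 0; n < 4; n++) {
            System.out.println(fs.claimNextName("000000_0.gz"));
        }
        System.out.println("existence checks: " + fs.lookups());
    }
}
```

In this sketch the k-th insert needs k existence checks, so the total number of checks grows quadratically with the number of inserts into the same directory.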
Thanks.
Re: MapReduce Output File Names
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Would it be problematic to simply prefix a random number, or
> timestamp, on the front of the file name to make it unique?
Bucketed tables rely on the file name prefix to determine which bucket a file belongs to.
So if you have a bucketed table and insert into it twice, then this turns into
0000_0 + 0000_0_Copy_1
Both files logically belong to the 1st bucket (if this is a sorted table, reading that bucket is a sort-merge of the two files, not one file after the other).
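A rough sketch of why a prefix matters, assuming the bucket number is simply the leading run of digits in the file name. The class BucketPrefix and its regular expression are hypothetical and are not Hive's actual parsing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the pattern is an assumption, not Hive's parser.
final class BucketPrefix {
    // Leading digits = bucket number, then "_<digits>", then any suffix (e.g. _copy_1).
    private static final Pattern NAME = Pattern.compile("^(\\d+)_\\d+.*$");

    static int bucketOf(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized file name: " + fileName);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(bucketOf("000000_0"));
        System.out.println(bucketOf("000000_0_copy_1"));
        // A timestamp put in front is read as the bucket number instead.
        System.out.println(bucketOf("1532717375_000000_0"));
    }
}
```

Under this assumption a suffix such as _copy_1 leaves the bucket number intact, while a random number or timestamp in front of the name would be read as the bucket number itself.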
That loop has a set of race conditions on anything with weak consistency like S3, which is why Hive managed tables switched to delta_<id>/0000_0 instead of _Copy_<n> starting in Hive 3.0.
The "id" is stored in the table metadata, so that no two queries will use the same delta_<id> dir.
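A minimal sketch of the idea, assuming a single atomic counter stands in for the id kept in the table metadata. The class DeltaDirAllocator is hypothetical, and the real directory names in Hive may carry more detail than the plain delta_<id> shown here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; the counter replaces the table metadata.
final class DeltaDirAllocator {
    private final AtomicLong nextId = new AtomicLong(1);

    // Each caller gets its own id, so the path is unique without asking
    // the file system whether anything already exists.
    String allocate(String bucketFile) {
        long id = nextId.getAndIncrement();
        return "delta_" + id + "/" + bucketFile;
    }

    public static void main(String[] args) {
        DeltaDirAllocator metadata = new DeltaDirAllocator();
        System.out.println(metadata.allocate("000000_0"));
        System.out.println(metadata.allocate("000000_0"));
    }
}
```

Because uniqueness comes from the id rather than from the file name, the bucket file keeps its usual prefix and no existence-check loop is needed.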
Cheers,
Gopal