
Posted to dev@hive.apache.org by dam6923 <da...@gmail.com> on 2018/07/27 18:49:35 UTC

MapReduce Output File Names

Hello,

When Hive MapReduce jobs create HDFS output files, they use the format:

000000_0.gz
000000_0.gz_copy_1
000000_0.gz_copy_2
000000_0.gz_copy_3
...

This list could grow without bound as more inserts run.  In
fact, the code says "leave the below loop for now until a better
approach is found."

https://github.com/apache/hive/blob/758ff449099065a84c46d63f9418201c8a6731b1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3710

Would it be problematic to simply prefix a random number, or
timestamp, on the front of the file name to make it unique?  This
would save the code from having to loop, asking the FileSystem
(NameNode) "is copy 1 there?", "is copy 2 there?", "is copy 3 there?"
etc.
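
To make the cost concrete, here is a minimal sketch of the two naming strategies. It is not Hive's code: a Python set stands in for the FileSystem, and the function names `next_copy_name` and `unique_prefixed_name` are made up for illustration.

```python
import uuid


def next_copy_name(existing, base):
    """Probe base, base_copy_1, base_copy_2, ... until a free name is found.

    `existing` is a set of names standing in for the FileSystem.
    Returns (free_name, number_of_existence_checks).
    """
    checks = 1
    if base not in existing:
        return base, checks
    n = 1
    while True:
        candidate = f"{base}_copy_{n}"
        checks += 1  # each probe would be one round trip to the NameNode
        if candidate not in existing:
            return candidate, checks
        n += 1


def unique_prefixed_name(base):
    """Build a name that needs no existence check (hypothetical alternative)."""
    return f"{uuid.uuid4().hex}_{base}"
```

With k files already present for the same base name, the probing loop makes k + 1 existence checks, while the prefixed name makes none.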

Thanks.

Re: MapReduce Output File Names

Posted by Gopal Vijayaraghavan <go...@apache.org>.

>    Would it be problematic to simply prefix a random number, or
>    timestamp, on the front of the file name to make it unique?  

Bucketed tables rely on the file name prefix to determine which bucket a file belongs to.

So if you have a bucketed table and insert into it twice, you end up with

0000_0 + 0000_0_Copy_1

both of which logically belong to the 1st bucket (if this is a sorted table, reading them back is a sort-merge, not one file after the other).
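
As a rough illustration of why the prefix matters, the sketch below derives a bucket number from the leading digits of a file name. It is not Hive's actual parsing code. The regular expression and the names `bucket_id_from_name` and `group_by_bucket` are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed layout: <bucket digits>_<attempt digits>[optional suffix]
_BUCKET_FILE = re.compile(r"^(\d+)_\d+")


def bucket_id_from_name(file_name):
    """Return the bucket number encoded in the prefix, or None if absent."""
    match = _BUCKET_FILE.match(file_name)
    return int(match.group(1)) if match else None


def group_by_bucket(file_names):
    """Group file names by the bucket number their prefix encodes."""
    groups = defaultdict(list)
    for name in file_names:
        groups[bucket_id_from_name(name)].append(name)
    return dict(groups)
```

Under this assumed layout, 0000_0 and 0000_0_Copy_1 both map to bucket 0. A name such as ts1532717375_000000_0 maps to no bucket at all, and a purely numeric timestamp prefix would be misread as a bucket number.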

That loop has a set of race conditions on stores with weak consistency, such as S3, which is why Hive managed tables switched to delta_<id>/0000_0 instead of _Copy_<n> starting in Hive 3.0.

The "id" is stored in the table metadata, so no two queries will use the same delta_<id> dir.
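
The idea can be sketched as follows, assuming a single in-process counter stands in for the table metadata. The class `TableMetadata` and the function `delta_output_path` are hypothetical, and the directory name uses the simplified delta_<id> form from this message rather than Hive's real layout.

```python
import threading


class TableMetadata:
    """Stand-in for table metadata that hands out ids atomically."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 1

    def allocate_id(self):
        # The lock ensures no two writers ever receive the same id.
        with self._lock:
            allocated = self._next_id
            self._next_id += 1
            return allocated


def delta_output_path(meta, bucket_file="0000_0"):
    """Return a per-query output path; no existence probing is required."""
    return f"delta_{meta.allocate_id()}/{bucket_file}"
```

Because uniqueness comes from the id allocation rather than from asking the file system what already exists, the bucket file name inside each directory can stay 0000_0.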

Cheers,
Gopal