Posted to issues@hive.apache.org by "Marton Bod (Jira)" <ji...@apache.org> on 2020/02/27 16:00:00 UTC

[jira] [Commented] (HIVE-22938) Investigate possibility of removing empty bucket file creation mechanism in Hive-on-MR

    [ https://issues.apache.org/jira/browse/HIVE-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046751#comment-17046751 ] 

Marton Bod commented on HIVE-22938:
-----------------------------------

[~ashutoshc], [~gopalv] - I'm investigating whether we can stop creating empty bucket files when using MR/Spark (it seems we already don't create them with Tez). So far I have not found a scenario that makes use of these empty files: in my local tests, I manually deleted some of them from the delta directories and saw no anomalies afterwards, either when reading the data back or when running compaction. But I might be missing some other area - do you have any ideas where the empty bucket files might be important?

> Investigate possibility of removing empty bucket file creation mechanism in Hive-on-MR
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-22938
>                 URL: https://issues.apache.org/jira/browse/HIVE-22938
>             Project: Hive
>          Issue Type: Task
>            Reporter: Marton Bod
>            Priority: Major
>
> As a follow-up to HIVE-22918, this ticket is to investigate whether the empty bucket file creation mechanism can be removed safely when using MR as the engine. 
> For a bucketed table with N buckets, each insert generates N bucket files in the delta directory, regardless of how many buckets actually receive data. For example, if a table has 500 buckets and we insert a single record, 499 empty bucket files are generated alongside the single bucket file that contains the actual data. This makes the operation substantially slower in some cases. This behaviour only seems to happen when using MR as the execution engine.
> Some components/parts of the code might depend on this behaviour though, so it needs to be verified that removing this logic does not interfere with anything.
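
To make the arithmetic in the description concrete, here is a minimal sketch that counts how many bucket files one insert would produce with and without empty-file creation. It is only a toy model: the function name bucket_file_counts, the create_empty_files flag, and the "integer key modulo number of buckets" assignment are assumptions made for illustration and are not Hive's actual bucketing hash or any Hive API.

```python
def bucket_file_counts(num_buckets, row_keys, create_empty_files=True):
    """Return (files_with_data, empty_files) for a single insert.

    Hypothetical model: each row lands in bucket `key % num_buckets`.
    This is an illustrative assumption, not Hive's real bucketing function.
    """
    # Buckets that receive at least one row in this insert.
    written = {key % num_buckets for key in row_keys}
    # With empty-file creation enabled, every remaining bucket
    # still gets a (data-less) file in the delta directory.
    empty = num_buckets - len(written) if create_empty_files else 0
    return len(written), empty


# The 500-bucket, single-record case from the description.
print(bucket_file_counts(500, [42]))                            # (1, 499)
print(bucket_file_counts(500, [42], create_empty_files=False))  # (1, 0)
```

In this model the number of files with data is the same either way; only the count of empty files changes, which is the part the ticket proposes to investigate removing.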


