
Posted to issues@hive.apache.org by "Thomas Poepping (JIRA)" <ji...@apache.org> on 2017/02/08 20:14:41 UTC

[jira] [Commented] (HIVE-15852) Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException

    [ https://issues.apache.org/jira/browse/HIVE-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858485#comment-15858485 ] 

Thomas Poepping commented on HIVE-15852:
----------------------------------------

[~ashutoshc] Ashutosh, sorry it took so long to open this Jira issue. Here's a summary of what I've found so far. Reverting HIVE-13040 would be the easiest solution, but I really don't want to do that: I think the performance gains can be large, especially in the blobstore (s3a or azure) case, where empty file creation is far from free.

Happy to hear suggestions, and start a conversation.

> Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15852
>                 URL: https://issues.apache.org/jira/browse/HIVE-15852
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.1
>            Reporter: Thomas Poepping
>
> Due to HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ), which stops Hive on Tez from creating empty files to represent empty buckets, a couple of things are broken.
> First of all, if there are empty buckets (which is possible with large datasets in the partitioned-bucketed case), tablesampling will not work if you're referencing a bucket number higher than the number of files.
> e.g. In some partition 'p', there are three rows. The table 't' is clustered into ten buckets. Even if every row hashes to a different bucket, only three bucket files will be created. If we do select * from t tablesample (bucket x out of 10) where <selecting from p> (where x > 3), an ArrayIndexOutOfBoundsException will be thrown because Hive assumes there are only three buckets.
> Second, other applications (such as Pig) may be making assumptions about the number of files equaling the number of buckets.
> Possible fixes:
> * Revert HIVE-13040
> * Change how tablesampling is implemented to accept the possibility that number of files != number of buckets
> ** Would require coordination across projects to change assumptions
> Things to consider:
> * what performance gains are there from not creating empty files?
> * if the gains are large, are we willing to lose them? (by reverting HIVE-13040)
> * _how else can we avoid creating unnecessary files, while still maintaining invariants other applications expect?_
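
To make the failure mode in the description concrete, here is a minimal sketch in Java. It is not Hive's actual sampling code: the class and method names are hypothetical, and it assumes bucket files are named with a zero-based bucket index before the underscore (e.g. 000004_0 for bucket 5). The naive lookup indexes the file list by bucket number, which only works when there is one file per bucket; the tolerant lookup derives the bucket from the file name and treats a missing file as an empty bucket.

```java
import java.util.Optional;

public class BucketLookupSketch {

    // Naive lookup: assumes files[i] holds bucket i + 1, i.e. exactly one
    // file per bucket. Throws ArrayIndexOutOfBoundsException when the
    // requested bucket number exceeds the number of files.
    public static String naiveLookup(String[] files, int bucket) {
        return files[bucket - 1];
    }

    // Tolerant lookup: reads the zero-based bucket index from the file name
    // (assumed naming: "<index>_<attempt>") and returns empty when no file
    // exists for the bucket, i.e. the bucket has no rows.
    public static Optional<String> tolerantLookup(String[] files, int bucket) {
        for (String file : files) {
            int index = Integer.parseInt(file.substring(0, file.indexOf('_')));
            if (index == bucket - 1) {
                return Optional.of(file);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Three rows landed in buckets 2, 5 and 9 of a ten-bucket table,
        // so only three files exist.
        String[] files = {"000001_0", "000004_0", "000008_0"};

        try {
            naiveLookup(files, 7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive lookup failed for bucket 7");
        }

        System.out.println("bucket 5 present: " + tolerantLookup(files, 5).isPresent());
        System.out.println("bucket 7 present: " + tolerantLookup(files, 7).isPresent());
    }
}
```

Note that the naive lookup is also silently wrong for small bucket numbers here (bucket 1 would return the file for bucket 2), which is part of why other applications that assume files == buckets are a concern.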



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)