Posted to issues@hive.apache.org by "Abdullah Yousufi (JIRA)" <ji...@apache.org> on 2016/08/15 23:48:20 UTC

[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation

    [ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421901#comment-15421901 ] 

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

I tried the listFiles() optimization locally by modifying Hive to call the function on the root directory of a partitioned table. This does speed up a select * query on a partitioned table, but the approach does not extend well to queries that do partition elimination, since in those cases it makes more sense to pass in only the relevant partitions, as Hive currently does.
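To make the trade-off concrete, here is a minimal sketch that models an object store as a sorted set of keys. It is not Hive or Hadoop code: the class ListingSketch, its methods, and the key layout are all hypothetical, and listCalls only stands in for the number of listing requests. It compares one recursive listing of the table root with one listing per surviving partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical in-memory model of an object store; not the Hadoop FileSystem API.
public class ListingSketch {
    private final NavigableSet<String> keys = new TreeSet<>();

    // Counts prefix listings, as a rough stand-in for listing requests.
    public int listCalls = 0;

    public ListingSketch(List<String> initialKeys) {
        keys.addAll(initialKeys);
    }

    // One prefix scan: loosely models a single recursive listing under a prefix.
    public List<String> listPrefix(String prefix) {
        listCalls++;
        List<String> out = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, starting at the prefix.
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break;
            }
            out.add(key);
        }
        return out;
    }

    // Whole-table approach: a single listing of the table root.
    public List<String> listTable(String root) {
        return listPrefix(root);
    }

    // Partition-elimination approach: one listing per partition that survives pruning.
    public List<String> listPartitions(String root, List<String> partitions) {
        List<String> out = new ArrayList<>();
        for (String partition : partitions) {
            out.addAll(listPrefix(root + partition + "/"));
        }
        return out;
    }
}
```

In this model the whole-table listing always costs one call but returns every key, while the per-partition listing costs one call per partition and returns only the keys the query needs. That is why the root-level listing helps select * but is a poor fit once partitions are eliminated.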

I'm thinking it might make sense to remove the following list call in Hive for S3 partitioned tables, since the listing for split computation is going to happen later anyway in Hadoop's FileInputFormat.java.

FetchOperator.java#getNextPath()
{code}
if (fs.exists(currPath)) {
  for (FileStatus fStat : listStatusUnderPath(fs, currPath)) {
    if (fStat.getLen() > 0) {
      return true;
    }
  }
}
{code}

My question is whether it sounds reasonable to remove this check. It seems that FileInputFormat.java#getSplits() may return errors if the partition directory does not contain any files; is there a better way to handle that case?
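One possible direction, sketched here against the local filesystem with java.nio rather than Hadoop's FileSystem, is to fold the emptiness check into the single listing that is needed anyway and to treat a missing or empty partition directory as "no input" instead of an error. The class EmptyPartitionSketch and the method nonEmptyFiles are hypothetical names for illustration; this is not how Hive or FileInputFormat behaves today.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; uses the local filesystem as a stand-in for S3.
public class EmptyPartitionSketch {

    // Lists the partition directory once and keeps only non-empty regular files.
    // A missing directory yields an empty list rather than an exception.
    public static List<Path> nonEmptyFiles(Path partitionDir) throws IOException {
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(partitionDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && Files.size(entry) > 0) {
                    inputs.add(entry);
                }
            }
        } catch (NoSuchFileException e) {
            // Assumption: a missing partition directory means "nothing to read".
        }
        return inputs;
    }
}
```

A caller could skip split computation for a partition when the returned list is empty, which avoids both the separate exists() call and the extra listing while still guarding against empty directories.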

> Enable faster S3 Split Computation
> ----------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Abdullah Yousufi
>
> Split size computation may be improved by the optimizations for listFiles() in HADOOP-13208



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)