Posted to dev@tez.apache.org by "Richard Bross (JIRA)" <ji...@apache.org> on 2018/04/02 20:48:00 UTC

[jira] [Created] (TEZ-3908) Tez fails to create files for all Hive buckets specified in DDL

Richard Bross created TEZ-3908:
----------------------------------

             Summary: Tez fails to create files for all Hive buckets specified in DDL
                 Key: TEZ-3908
                 URL: https://issues.apache.org/jira/browse/TEZ-3908
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.4
            Reporter: Richard Bross


When the Hive DDL specifies a clustering statement, e.g.

"CLUSTERED BY (x) INTO n BUCKETS",

Tez may not create all of the bucket files if the data is sparse, causing query failures with Presto.
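
For reference, the kind of table definition involved looks like the following sketch (table name, columns, and bucket count are purely illustrative):

    CREATE TABLE web_events (
      user_id    BIGINT,
      event_name STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC;

With this DDL the metastore records 32 buckets for the table, so a bucket-aware reader such as Presto expects 32 bucket files in every partition directory.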

When an INSERT OVERWRITE is done on a partition, the MapReduce engine always creates the full number of bucket files defined by the DDL in the metastore.
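
A typical insert of the kind described, sketched against the illustrative table above (the source table raw_events is also made up for the example):

    -- on older Hive releases, bucketed inserts also needed: SET hive.enforce.bucketing=true;
    INSERT OVERWRITE TABLE web_events PARTITION (event_date = '2018-04-01')
    SELECT user_id, event_name
    FROM raw_events
    WHERE event_date = '2018-04-01';

Under the MapReduce engine this rewrite of the partition produces all 32 bucket files, padding with empty files where a bucket received no rows; under Tez only the buckets that actually received rows end up with a file.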

Tez only creates a bucket file if there is data to write into it. When the data is too sparse to force all of the files to be created, the files on disk no longer match what the Hive metastore declares.
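
As a rough illustration of the mismatch, assuming the 32-bucket table sketched above and a partition whose rows hash into only two buckets (paths are illustrative; Hive's usual bucket file naming is 000000_0, 000001_0, ...):

    MapReduce: .../web_events/event_date=2018-04-01/000000_0 ... 000031_0   (32 files, some empty)
    Tez:       .../web_events/event_date=2018-04-01/000003_0, 000017_0      (only the non-empty buckets)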

Dependent applications such as Presto will then fail to execute queries (please see https://github.com/prestodb/presto/issues/10301).

There should be a configuration variable that forces Tez to create all of the bucket files, even those that will be zero length, so that the output matches the metastore definition and stays backwards compatible with what MapReduce did.
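
What such a switch could look like at the session level; the property name below is hypothetical (it does not exist in Hive or Tez today) and is only meant to illustrate the requested behavior:

    -- hypothetical property name, for illustration only
    SET hive.tez.create.empty.bucket.files=true;

    INSERT OVERWRITE TABLE web_events PARTITION (event_date = '2018-04-01')
    SELECT user_id, event_name FROM raw_events WHERE event_date = '2018-04-01';
    -- with the switch on, all 32 bucket files would be written, the missing ones as zero-length files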

Since Hive has deprecated MapReduce in favor of Tez, any existing users who meet all of the following conditions will see query failures:
 * Have a table defined with CLUSTERED BY
 * Have a partition whose data is too sparse to force Tez to create all of the bucket files
 * Query with Presto

If a query includes *any* partition without the full complement of buckets, the Presto query will fail.

As a real-world example, our inserts are done with Hive/Tez and our query UIs are all set up to use Presto. We have run into these failures, and currently the only non-hack fix is to refactor our DDL so that it does not use CLUSTERED BY.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)