Posted to dev@tez.apache.org by "Richard Bross (JIRA)" <ji...@apache.org> on 2018/04/02 20:48:00 UTC
[jira] [Created] (TEZ-3908) Tez fails to create files for all Hive
buckets specified in DDL
Richard Bross created TEZ-3908:
----------------------------------
Summary: Tez fails to create files for all Hive buckets specified in DDL
Key: TEZ-3908
URL: https://issues.apache.org/jira/browse/TEZ-3908
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.8.4
Reporter: Richard Bross
When the Hive DDL specifies a clustering clause, e.g.
"CLUSTERED BY (x) INTO n BUCKETS",
Tez may not create all the bucket files if the data is sparse, causing query failures with Presto.
When an INSERT OVERWRITE was done on a partition, the MapReduce engine always created the number of bucket files defined in the metastore DDL.
Tez only creates a bucket file if there will be data in it. When the data is too sparse to populate every bucket, the number of files no longer matches the bucket count recorded in the Hive metastore.
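The difference between the two behaviours can be illustrated with a small sketch. It is not Tez or Hive code: the function names are hypothetical, the rows are plain integers, and the bucket function (non-negative value modulo the bucket count) is an assumed stand-in for whatever hash the engine actually applies.

```python
from collections import defaultdict


def assign_buckets(values, num_buckets):
    """Group integer values by bucket id using an assumed modulo hash."""
    buckets = defaultdict(list)
    for v in values:
        buckets[(v & 0x7FFFFFFF) % num_buckets].append(v)
    return buckets


def files_written(values, num_buckets, write_empty):
    """Return the bucket ids for which a file would be written.

    write_empty=True models the behaviour the report attributes to MapReduce;
    write_empty=False models the behaviour it attributes to Tez.
    """
    if write_empty:
        return list(range(num_buckets))
    return sorted(assign_buckets(values, num_buckets))


sparse_rows = [3, 11, 19]  # every value maps to bucket 3 when there are 8 buckets
print(files_written(sparse_rows, 8, write_empty=True))   # one file per declared bucket
print(files_written(sparse_rows, 8, write_empty=False))  # only the non-empty bucket
```

With three rows and eight declared buckets, the second call yields a single file, which is the mismatch described above.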
Dependent applications, such as Presto, will then fail to execute queries (please see https://github.com/prestodb/presto/issues/10301).
There should be a configuration variable that forces Tez to create all the bucket files, including those that will be zero length, as MapReduce did, so that the files match the metastore and backwards compatibility is preserved.
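Until such a setting exists, the effect it would have can be approximated after the fact. The following is only a sketch of the idea on a local directory: the helper names are hypothetical, and the file naming (zero-padded bucket id followed by "_0") is an assumption about the layout that should be checked against the actual partition directory before relying on it.

```python
from pathlib import Path
import tempfile


def bucket_file_name(bucket_id):
    # Assumed naming convention: zero-padded bucket id plus "_0", e.g. 000003_0.
    return f"{bucket_id:06d}_0"


def pad_missing_buckets(partition_dir, num_buckets):
    """Create zero-length files for bucket ids that have no file.

    Existing files are left untouched. Returns the names of the files created.
    """
    partition_dir = Path(partition_dir)
    created = []
    for bucket_id in range(num_buckets):
        path = partition_dir / bucket_file_name(bucket_id)
        if not path.exists():
            path.touch()
            created.append(path.name)
    return created


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a partition where only bucket 3 received data.
    (Path(tmp) / bucket_file_name(3)).write_text("row data\n")
    print(pad_missing_buckets(tmp, 4))
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

A real table would live on HDFS or an object store rather than a local path, so the same logic would need that store's client in place of pathlib.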
Since Hive has deprecated MapReduce in favor of Tez, existing users who meet all of the following conditions will have query failures:
* Have a table with CLUSTERED BY
* Have a partition whose data is too sparse to force Tez to create all bucket files
* Query with Presto
If a query includes *any* partition without the full complement of buckets, the Presto query will fail.
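To find out which partitions are affected, the file listings can be compared against the declared bucket count. The sketch below assumes the listings have already been collected into a dictionary (partition name to list of file names) and models the reader's check as a simple count comparison; both the function names and that model are assumptions, not Presto's actual validation logic.

```python
def incomplete_partitions(partition_files, num_buckets):
    """Return {partition: file_count} for partitions with fewer files than buckets."""
    return {
        name: len(files)
        for name, files in partition_files.items()
        if len(files) < num_buckets
    }


def query_would_fail(partition_files, num_buckets):
    """True if any partition touched by the query lacks its full set of bucket files."""
    return bool(incomplete_partitions(partition_files, num_buckets))


# Hypothetical listings for a table declared with 4 buckets.
listings = {
    "dt=2018-03-30": ["000000_0", "000001_0", "000002_0", "000003_0"],
    "dt=2018-03-31": ["000001_0"],
}
print(incomplete_partitions(listings, 4))
print(query_would_fail(listings, 4))
```

A single short partition is enough to make the second function return True, which mirrors the failure mode described above.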
As a real-world example, our inserts are done with Hive/Tez and our query UIs are all set up to use Presto/Tez. We have run into these failures, and currently the only non-hack fix is to refactor our DDL to not use CLUSTERED BY.