Posted to user@hive.apache.org by "Richard A. Bross" <rb...@oaktreepeak.com> on 2018/03/31 15:17:53 UTC

Hive, Tez, clustering, buckets, and Presto

Hello there, 

I've done a ton of reading and testing between Hive, Presto, Tez, etc. In our test environment I have 8 tables being loaded with Hive, and the loading works fine under certain circumstances.

Most of our data comes from JSON files in S3; we create a temporary external table over them and then perform an INSERT OVERWRITE to transform the data to ORC format in Hive-managed tables.

A simple example of the CREATE statement for the internal table looks like this (I've left out a number of irrelevant columns): 

CREATE TABLE IF NOT EXISTS measurements (
    id STRING,
    session STRING,
    tt_ts INT,
    r_crs_id INT,
    user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
tblproperties ("orc.compress"="ZLIB");


If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

I've seen some other postings on Stack Exchange and via Google where others have run into this.  We really need to use Presto to query Hive, and we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

So a few questions (thanks for any help):

The easiest fix for us would be a way to force Tez to create all buckets the way mr does, so the behavior doesn't change. Is that possible? "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets, it's not working with the Tez engine.

Alternatively, I was thinking that maybe someone could help me with the ORC file configuration so that r_crs_id is at least row-optimized and sorted within the ORC files.  I've been reading up on ORC and the Hive statements related to ORC files, but I seem to have hit my limit of understanding.  It appears that I should at least have the rows sorted by r_crs_id when they are written, but I'm having problems figuring out how to apply that with the INSERT statement.

It appears that the ORC table should be created with 'orc.create.index'='true' for a start.

Currently our INSERT looks like this (we never UPDATE or DELETE):

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT *;

Something like this?

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT * SORT BY r_crs_id;

Thanks for any help from more experienced Hive developers (most of you).



Re: Hive, Tez, clustering, buckets, and Presto

Posted by Edward Capriolo <ed...@gmail.com>.
True. The spec does not mandate the bucket files have to be there if they
are empty. (missing directories are 0 row tables).

Thanks,
Edward

On Tue, Apr 3, 2018 at 4:42 PM, Richard A. Bross <rb...@oaktreepeak.com> wrote:

> Gopal,
>
> The Presto devs say they are willing to make the changes to adhere to the
> Hive bucket spec.  I quoted
>
> "Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets (Utilities::
> getBucketIdFromFile)."
>
> so they're asking "where is the Hive bucketing spec".  Is it just to read
> the code for that function?  They were looking for something more explicit,
> I think.
>
> Thanks
>
> ----- Original Message -----
> From: "Gopal Vijayaraghavan" <go...@apache.org>
> To: user@hive.apache.org
> Sent: Tuesday, April 3, 2018 3:15:46 AM
> Subject: Re: Hive, Tez, clustering, buckets, and Presto
>
> >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER
> BY.  My understanding was that this was related to the number of buckets,
> but you are relating it to ORC stripes.  It is odd that no examples that
> I've seen include the SORTED BY statement other than in relation to ORC
> indexes (that I understand).  So the question is; regardless of whether
> efficient ORC stripes are created (wouldn't I have to also specify
> 'orc.create.index’=’true’ for this to have much of an effect)
>
> ORC + bucketing has been something I've spent a lot of time with - a lot
> of this has to do with secondary characteristics of data (i.e same device
> has natural progressions for metrics), which when combined with a columnar
> format & ordering within files produces better storage and runtimes
> together (which I guess is usually a trade-off).
>
> Without a SORTED BY, the organizing function for the data-shuffle does not
> order in any specific way - the partition key for the shuffle is the
> modulus, while the order key is 0 bytes long, so it sorts by (modulus,)
> which for a quick-sort also loses the input order into the shuffle & each
> bucket file is produced in random order within itself.
>
> An explicit sort with bucketing is what I recommend to most of the HDP
> customers who have performance problems with ORC.
>
> This turns the shuffle key into (modulus, key1, key2) producing more
> predictable order during shuffle.
>
> Then the key1 can be RLE encoded so that ORC vector impl will pass it on
> as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows
> for integers.
>
> https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
>
> was written as a warning to customers who use bucketing to try & solve
> performance problems, but have ended up bucketing as their main problem.
>
> Most of what I have written above was discussed a few years back and in
> general, bucketing on a high cardinality column + sorting on a low
> cardinality together has given good results to my customers.
>
> >    I hadn't thought of the even number issue, not having looked at the
> function; I had assumed that it was a hash, not a modulus; shame on me.
> Reading the docs I see that hash is only used on string columns
>
> Actually a hash is used in theory, but I entirely blame Java for it - the
> Java hash is an identity function for Integers.
>
> scala> 42.hashCode
> res1: Int = 42
>
> scala> 42L.hashCode
> res2: Int = 42
>
> > Finally, I'm not sure that I got a specific answer to my original
> question, which is can I force Tez to create all bucket files so Presto
> queries can succeed?  Anyway, I will be testing today and the solution will
> either be to forgo buckets completely or to simply rely on ORC indexes.
>
> There's no config to do that today & Presto is already incompatible with
> Hive 3.0 tables (Update/Delete support).
>
> Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets (Utilities::
> getBucketIdFromFile).
>
> The file-count is a very flaky way to check if the table is bucketed
> correctly - either you trust the user to have properly bucketed the table
> or you don't use it. Failing to work on valid tables does look pretty bad,
> instead of soft fallbacks.
>
> I wrote a few UDFs which was used to validate suspect tables and fix them
> for customers who had bad historical data, which was loaded with
> "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
>
> https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/
> master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
>
> LLAP has a bucket pruning implementation if Presto wants to copy from it
> (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
>
> Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
> Runtime: https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
>
> That actually does things according to the Hive bucketing contract where
> uncovered buckets are assumed to have 0 rows without a file present & not
> error out instead.
>
> If you do have the ability to redeploy Hive, the change you are looking
> for is a 1-liner to enable.
>
> https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
>
> Cheers,
> Gopal
>
>

Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
This is great information, Gopal, thank you.

I wish I had the time to create a comparison for our use case between bucketed ORC files on S3 and ORC files without buckets.  Unfortunately it's a chicken-and-egg issue, since I won't have enough data volume until we are in production, which should be shortly.

Thanks again.

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?

This worked the other way around in time, than writing a spec first - ACIDv1 implemented Streaming ingest via Storm, it used an explicit naming "bucket_<n>" for the filename.

Since until the compaction runs the actual base files don't exist, the ACID bucketing implementation has to handle missing buckets as 0 rows in base file + possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work similarly, for the ability to do bucket map-joins between ACID & non-ACID bucketed tables. Particularly about the modulus for -ve numbers, which was broken in Hive-1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for bucketed tables work the same way for ACID & non-ACID tables.

Because of that spec lives in the comments now as a Regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered Presto should be expecting >1 files per bucket.

I saw a JIRA comment which said "sort in file order and assign buckets", you can see that is only applicable for the 1st insert to table (& the regex will remove the copy numbering).

And oddly enough this week, I saw an academic paper with a negative analysis of Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <rb...@oaktreepeak.com> wrote:

    Gopal,
    
    The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec.  I quoted 
    
    "Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."
    
    so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?  They were looking for something more explicit, I think.
    
    Thanks
    
    ----- Original Message -----
    From: "Gopal Vijayaraghavan" <go...@apache.org>
    To: user@hive.apache.org
    Sent: Tuesday, April 3, 2018 3:15:46 AM
    Subject: Re: Hive, Tez, clustering, buckets, and Presto
    
    >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)
    
    ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).
    
    Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.
    
    An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.
    
    This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.
    
    Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.
    
    https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
    
    was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.
    
    Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.
    
    >    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns
    
    Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.
    
    scala> 42.hashCode
    res1: Int = 42
    
    scala> 42L.hashCode
    res2: Int = 42
    
    > Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.
    
    There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).
    
    Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
    
    The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.
    
    I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
    
    https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
    
    LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
    
    Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
    Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
    
    That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.
    
    If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.
    
    https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
    
    Cheers,
    Gopal
    
    



Re: Hive, Tez, clustering, buckets, and Presto

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?

This happened the other way around in time, rather than a spec being written first - ACIDv1 implemented Streaming ingest via Storm, and it used an explicit "bucket_<n>" naming convention for the filename.

Since until the compaction runs the actual base files don't exist, the ACID bucketing implementation has to handle missing buckets as 0 rows in base file + possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work similarly, for the ability to do bucket map-joins between ACID & non-ACID bucketed tables. Particularly about the modulus for -ve numbers, which was broken in Hive-1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec now lives in the comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered Presto should be expecting >1 files per bucket.

I saw a JIRA comment which said "sort in file order and assign buckets"; you can see that is only applicable for the 1st insert into the table (& the regex will remove the copy numbering).
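
To make the file-name side of that concrete, here is a small sketch (the regex below is an illustrative approximation, not a copy of the pattern in Utilities.java) that maps the file names from the listing above back to bucket ids and groups the _copy_N files:

import re
from collections import defaultdict

# Illustrative approximation of the bucket-file naming shown in the listing
# above (000000_0, 000000_0_copy_1, ...); the authoritative pattern lives in
# Hive's Utilities.java, linked earlier in this message.
BUCKET_FILE = re.compile(r"^(\d+)_\d+(?:_copy_\d+)?$")

def bucket_id(file_name):
    m = BUCKET_FILE.match(file_name)
    return int(m.group(1)) if m else None

files = [
    "000000_0", "000000_0_copy_1",
    "000001_0", "000001_0_copy_1",
    "000003_0",                      # bucket 2 intentionally missing
]

by_bucket = defaultdict(list)
for f in files:
    by_bucket[bucket_id(f)].append(f)

# Per the contract, a bucket with no file (bucket 2 here) simply has 0 rows;
# a reader should not treat the missing file as a malformed table.
for b in range(4):
    print(b, by_bucket.get(b, []))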

And oddly enough this week, I saw an academic paper with a negative analysis of Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <rb...@oaktreepeak.com> wrote:

    Gopal,
    
    The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec.  I quoted 
    
    "Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."
    
    so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?  They were looking for something more explicit, I think.
    
    Thanks
    
    ----- Original Message -----
    From: "Gopal Vijayaraghavan" <go...@apache.org>
    To: user@hive.apache.org
    Sent: Tuesday, April 3, 2018 3:15:46 AM
    Subject: Re: Hive, Tez, clustering, buckets, and Presto
    
    >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)
    
    ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).
    
    Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.
    
    An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.
    
    This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.
    
    Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.
    
    https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
    
    was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.
    
    Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.
    
    >    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns
    
    Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.
    
    scala> 42.hashCode
    res1: Int = 42
    
    scala> 42L.hashCode
    res2: Int = 42
    
    > Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.
    
    There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).
    
    Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
    
    The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.
    
    I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
    
    https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
    
    LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
    
    Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
    Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
    
    That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.
    
    If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.
    
    https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
    
    Cheers,
    Gopal
    
    



Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Gopal,

The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?  They were looking for something more explicit, I think.

Thanks

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.

>    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal


Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Gopal,

Thanks for this.  Great information and something to look at more closely to better understand the internals.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.

>    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal



Re: Hive, Tez, clustering, buckets, and Presto

Posted by Gopal Vijayaraghavan <go...@apache.org>.
>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.
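
A toy sketch of that effect (illustrative only, not Hive's actual shuffle code) - partitioning rows by the modulus alone leaves each bucket file in arbitrary order, while sorting on (modulus, key) puts equal key values next to each other, which is what makes the long RLE runs possible:

import random
from itertools import groupby

random.seed(0)
NUM_BUCKETS = 4

# Toy rows: a low-cardinality key (think r_crs_id) arriving in arbitrary order.
rows = [random.choice([10, 11, 12, 13, 14, 15, 16, 17]) for _ in range(32)]

def runs(values):
    # Count RLE runs: fewer runs means run-length encoding works better.
    return sum(1 for _ in groupby(values))

def bucket(v):
    return v % NUM_BUCKETS

# Shuffle key = (modulus,) only: each bucket file keeps the arbitrary input order.
unsorted_buckets = {b: [v for v in rows if bucket(v) == b] for b in range(NUM_BUCKETS)}

# Shuffle key = (modulus, key): each bucket file is also sorted by the key.
sorted_buckets = {b: sorted(vs) for b, vs in unsorted_buckets.items()}

for b in range(NUM_BUCKETS):
    print(b, "runs unsorted:", runs(unsorted_buckets[b]),
          "runs sorted:", runs(sorted_buckets[b]))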

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.

>    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
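
As a rough illustration of that contract (a sketch only - the real logic lives in Utilities::getBucketIdFromFile and the related writer code): for an integer column the Java hash is the value itself, so the expected bucket is essentially the value modulo the bucket count, and that bucket number is what the file name prefix encodes. The non-negative masking below is an assumption made for illustration, in line with the negative-modulus issue mentioned elsewhere in the thread.

# Sketch of value -> bucket id, not the actual Hive implementation.
def int_hash(value):
    # Java's Integer.hashCode() is the identity function, as shown above.
    return value

def bucket_for(value, num_buckets):
    # Assumed non-negative masking before the modulus (illustration only).
    return (int_hash(value) & 0x7FFFFFFF) % num_buckets

# A row with r_crs_id = 20000 in a 64-bucket table would be expected in the
# bucket below, i.e. in a file whose name starts with that bucket id
# (e.g. 000032_0).
print(bucket_for(20000, 64))   # -> 32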

The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use bucketing at all. Failing hard on valid tables, instead of falling back softly, does look pretty bad.

I wrote a few UDFs which were used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or during the short hive-0.13 period affected by HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract: uncovered buckets are assumed to have 0 rows when no file is present, instead of erroring out.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal



Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Gopal,

Thanks for taking the time to try and help.  A few things in relation to your response:

* Yes, the 'epoch' column is an hourly timestamp.  Clustering by a column with high cardinality would make little sense.
* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is: regardless of whether efficient ORC stripes are created, wouldn't I have to also specify 'orc.create.index'='true' for this to have much of an effect?
* Thanks for the hint about using primes when specifying the number of buckets.  Given that r_crs_id covers a wide range, I think it would have worked OK, but any optimizations are appreciated, and I hadn't thought of the even-number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs, I see that a hash is only used on string columns.

Finally, I'm not sure that I got a specific answer to my original question, which is: can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today, and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

Thanks.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal    


Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
I'm really confused and could use help understanding.  The Hive documentation here https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

says:

"Bucketed tables are fantastic in that they allow much more efficient sampling than do non-bucketed tables, and they may later allow for time saving operations such as mapside joins. However, the bucketing specified at table creation is not enforced when the table is written to, and so it is possible for the table's metadata to advertise properties which are not upheld by the table's actual layout. This should obviously be avoided. Here's how to do it right."

"This should obviously be avoided".  Yet with the Tez engine it seems to be unavoidable when the data does not allow for all the buckets to be populated.

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal    


Re: Hive, Tez, clustering, buckets, and Presto

Posted by Gopal Vijayaraghavan <go...@apache.org>.
There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.

Previously you could set "enforce.bucketing=false" and the inserts of data would not use the fixed number of buckets & could play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, consider what happens when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.
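
The same experiment as the two one-liners above, written out as a small Python 3 sketch (xrange is Python 2; range behaves the same way here):

# Count how many of the 64 (or 61) buckets actually receive data when the
# column values are all even.
def buckets_filled(num_buckets, values):
    return len({v % num_buckets for v in values})

even_values = [x * 2 for x in range(1000)]

print(buckets_filled(64, even_values))  # 32 -- half the buckets stay empty
print(buckets_filled(61, even_values))  # 61 -- every bucket gets data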

FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal