Posted to user@hive.apache.org by "Richard A. Bross" <rb...@oaktreepeak.com> on 2018/03/31 15:17:53 UTC

Hive, Tez, clustering, buckets, and Presto

Hello there, 

I've done a ton of reading and testing between Hive, Presto, Tez, etc. In our test environment I have 8 tables being loaded with Hive, and the loading works fine under certain circumstances.

Most of our data comes from JSON files in S3; we create a temporary external table over them and then perform an INSERT OVERWRITE to transform the data to ORC format in Hive-managed tables.

A simple example of the CREATE statement for the internal table looks like this (I've left out a number of irrelevant columns): 

CREATE TABLE IF NOT EXISTS measurements (
    id STRING,
    session STRING,
    tt_ts INT,
    r_crs_id INT,
    user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
tblproperties ("orc.compress"="ZLIB");


If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

I've seen some other postings on Stack Exchange and via Google where others have run into this.  We really need to use Presto to query Hive, and we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

So a few questions (thanks for any help):

The easiest fix for us would be a way to force Tez to create all buckets the way mr does, so the behavior doesn't change. Is that possible? "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets, it's not working with the Tez engine.

Alternatively, I was thinking that maybe someone could help me with the ORC file configuration so that r_crs_id is at least row-optimized and sorted within the ORC files.  I've been reading up on ORC and the Hive statements related to ORC files, but I seem to have hit my limit of understanding.  It appears that I should at least have the rows sorted by r_crs_id when they are written, but I'm having problems figuring out how to apply that with the INSERT statement.

It appears that the ORC table should be created with 'orc.create.index'='true' for a start.

Currently our INSERT looks like this (we never UPDATE or DELETE):

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT *;

Something like this?

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT * SORT BY r_crs_id;

Thanks for any help from more experienced Hive developers (most of you).



Re: Hive, Tez, clustering, buckets, and Presto

Posted by Edward Capriolo <ed...@gmail.com>.
True. The spec does not mandate the bucket files have to be there if they
are empty. (missing directories are 0 row tables).

Thanks,
Edward

On Tue, Apr 3, 2018 at 4:42 PM, Richard A. Bross <rb...@oaktreepeak.com> wrote:

> Gopal,
>
> The Presto devs say they are willing to make the changes to adhere to the
> Hive bucket spec.  I quoted
>
> "Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets (Utilities::
> getBucketIdFromFile)."
>
> so they're asking "where is the Hive bucketing spec".  Is it just to read
> the code for that function?  They were looking for something more explicit,
> I think.
>
> Thanks
>
> ----- Original Message -----
> From: "Gopal Vijayaraghavan" <go...@apache.org>
> To: user@hive.apache.org
> Sent: Tuesday, April 3, 2018 3:15:46 AM
> Subject: Re: Hive, Tez, clustering, buckets, and Presto
>
> >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER
> BY.  My understanding was that this was related to the number of buckets,
> but you are relating it to ORC stripes.  It is odd that no examples that
> I've seen include the SORTED BY statement other than in relation to ORC
> indexes (that I understand).  So the question is; regardless of whether
> efficient ORC stripes are created (wouldn't I have to also specify
> 'orc.create.index’=’true’ for this to have much of an effect)
>
> ORC + bucketing has been something I've spent a lot of time with - a lot
> of this has to do with secondary characteristics of data (i.e same device
> has natural progressions for metrics), which when combined with a columnar
> format & ordering within files produces better storage and runtimes
> together (which I guess is usually a trade-off).
>
> Without a SORTED BY, the organizing function for the data-shuffle does not
> order in any specific way - the partition key for the shuffle is the
> modulus, while the order key is 0 bytes long, so it sorts by (modulus,)
> which for a quick-sort also loses the input order into the shuffle & each
> bucket file is produced in random order within itself.
>
> An explicit sort with bucketing is what I recommend to most of the HDP
> customers who have performance problems with ORC.
>
> This turns the shuffle key into (modulus, key1, key2) producing more
> predictable order during shuffle.
>
> Then the key1 can be RLE encoded so that ORC vector impl will pass it on
> as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows
> for integers.
>
> https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
>
> was written as a warning to customers who use bucketing to try & solve
> performance problems, but have ended up bucketing as their main problem.
>
> Most of what I have written above was discussed a few years back and in
> general, bucketing on a high cardinality column + sorting on a low
> cardinality together has given good results to my customers.
>
> >    I hadn't thought of the even number issue, not having looked at the
> function; I had assumed that it was a hash, not a modulus; shame on me.
> Reading the docs I see that hash is only used on string columns
>
> Actually a hash is used in theory, but I entirely blame Java for it - the
> Java hash is an identity function for Integers.
>
> scala> 42.hashCode
> res1: Int = 42
>
> scala> 42L.hashCode
> res2: Int = 42
>
> > Finally, I'm not sure that I got a specific answer to my original
> question, which is can I force Tez to create all bucket files so Presto
> queries can succeed?  Anyway, I will be testing today and the solution will
> either be to forgo buckets completely or to simply rely on ORC indexes.
>
> There's no config to do that today & Presto is already incompatible with
> Hive 3.0 tables (Update/Delete support).
>
> Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets (Utilities::
> getBucketIdFromFile).
>
> The file-count is a very flaky way to check if the table is bucketed
> correctly - either you trust the user to have properly bucketed the table
> or you don't use it. Failing to work on valid tables does look pretty bad,
> instead of soft fallbacks.
>
> I wrote a few UDFs which was used to validate suspect tables and fix them
> for customers who had bad historical data, which was loaded with
> "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
>
> https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/
> master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
>
> LLAP has a bucket pruning implementation if Presto wants to copy from it
> (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
>
> Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
> Runtime: https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
>
> That actually does things according to the Hive bucketing contract where
> uncovered buckets are assumed to have 0 rows without a file present & not
> error out instead.
>
> If you do have the ability to redeploy Hive, the change you are looking
> for is a 1-liner to enable.
>
> https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
>
> Cheers,
> Gopal
>
>

Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
This is great information, Gopal, thank you.

I wish I had the time to create a comparison for our use case between bucketed ORC files on S3 and ORC files without buckets.  Unfortunately it's a chicken-and-egg issue, since I won't have enough data volume until we are in production, which should be shortly.

Thanks again.

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?

This worked the other way around in time, than writing a spec first - ACIDv1 implemented Streaming ingest via Storm, it used an explicit naming "bucket_<n>" for the filename.

Since until the compaction runs the actual base files don't exist, the ACID bucketing implementation has to handle missing buckets as 0 rows in base file + possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work similarly, for the ability to do bucket map-joins between ACID & non-ACID bucketed tables. Particularly about the modulus for -ve numbers, which was broken in Hive-1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for bucketed tables work the same way for ACID & non-ACID tables.

Because of that spec lives in the comments now as a Regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered Presto should be expecting >1 files per bucket.

I saw a JIRA comment which said "sort in file order and assign buckets", you can see that is only applicable for the 1st insert to table (& the regex will remove the copy numbering).

And oddly enough this week, I saw an academic paper with a negative analysis of Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <rb...@oaktreepeak.com> wrote:

    Gopal,
    
    The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec.  I quoted 
    
    "Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."
    
    so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?  They were looking for something more explicit, I think.
    
    Thanks
    
    ----- Original Message -----
    From: "Gopal Vijayaraghavan" <go...@apache.org>
    To: user@hive.apache.org
    Sent: Tuesday, April 3, 2018 3:15:46 AM
    Subject: Re: Hive, Tez, clustering, buckets, and Presto
    
    >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)
    
    ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).
    
    Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.
    
    An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.
    
    This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.
    
    Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.
    
    https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
    
    was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.
    
    Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.
    
    >    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns
    
    Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.
    
    scala> 42.hashCode
    res1: Int = 42
    
    scala> 42L.hashCode
    res2: Int = 42
    
    > Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.
    
    There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).
    
    Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
    
    The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.
    
    I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
    
    https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
    
    LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
    
    Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
    Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
    
    That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.
    
    If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.
    
    https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
    
    Cheers,
    Gopal
    
    



Re: Hive, Tez, clustering, buckets, and Presto

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?

This happened the other way around in time, rather than a spec being written first - ACIDv1 implemented Streaming ingest via Storm, and it used an explicit "bucket_<n>" naming convention for the filename.

Since until the compaction runs the actual base files don't exist, the ACID bucketing implementation has to handle missing buckets as 0 rows in base file + possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work similarly, for the ability to do bucket map-joins between ACID & non-ACID bucketed tables. Particularly about the modulus for -ve numbers, which was broken in Hive-1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec now lives in the comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered Presto should be expecting >1 files per bucket.

I saw a JIRA comment which said "sort in file order and assign buckets"; you can see that is only applicable for the 1st insert into the table (& the regex will remove the copy numbering).
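
To make the file-name side of that concrete, here is a small sketch (the regex below is an illustrative approximation, not a copy of the pattern in Utilities.java) that maps the file names from the listing above back to bucket ids and groups the _copy_N files:

import re
from collections import defaultdict

# Illustrative approximation of the bucket-file naming shown in the listing
# above (000000_0, 000000_0_copy_1, ...); the authoritative pattern lives in
# Hive's Utilities.java, linked earlier in this message.
BUCKET_FILE = re.compile(r"^(\d+)_\d+(?:_copy_\d+)?$")

def bucket_id(file_name):
    m = BUCKET_FILE.match(file_name)
    return int(m.group(1)) if m else None

files = [
    "000000_0", "000000_0_copy_1",
    "000001_0", "000001_0_copy_1",
    "000003_0",                      # bucket 2 intentionally missing
]

by_bucket = defaultdict(list)
for f in files:
    by_bucket[bucket_id(f)].append(f)

# Per the contract, a bucket with no file (bucket 2 here) simply has 0 rows;
# a reader should not treat the missing file as a malformed table.
for b in range(4):
    print(b, by_bucket.get(b, []))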

And oddly enough this week, I saw an academic paper with a negative analysis of Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <rb...@oaktreepeak.com> wrote:

    Gopal,
    
    The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec.  I quoted 
    
    "Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."
    
    so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?  They were looking for something more explicit, I think.
    
    Thanks
    
    ----- Original Message -----
    From: "Gopal Vijayaraghavan" <go...@apache.org>
    To: user@hive.apache.org
    Sent: Tuesday, April 3, 2018 3:15:46 AM
    Subject: Re: Hive, Tez, clustering, buckets, and Presto
    
    >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)
    
    ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).
    
    Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.
    
    An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.
    
    This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.
    
    Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.
    
    https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
    
    was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.
    
    Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.
    
    >    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns
    
    Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.
    
    scala> 42.hashCode
    res1: Int = 42
    
    scala> 42L.hashCode
    res2: Int = 42
    
    > Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.
    
    There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).
    
    Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
    
    The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.
    
    I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
    
    https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
    
    LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
    
    Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
    Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
    
    That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.
    
    If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.
    
    https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
    
    Cheers,
    Gopal
    
    



Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Gopal,

The Presto devs say they are willing to make the changes to adhere to the Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the code for that function?  They were looking for something more explicit, I think.

Thanks

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.

>    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal


Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Gopal,

Thanks for this.  Great information and something to look at more closely to better understand the internals.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.

>    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use it. Failing to work on valid tables does look pretty bad, instead of soft fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where uncovered buckets are assumed to have 0 rows without a file present & not error out instead.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal



Re: Hive, Tez, clustering, buckets, and Presto

Posted by Gopal Vijayaraghavan <go...@apache.org>.
>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is; regardless of whether efficient ORC stripes are created (wouldn't I have to also specify 'orc.create.index’=’true’ for this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e same device has natural progressions for metrics), which when combined with a columnar format & ordering within files produces better storage and runtimes together (which I guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,) which for a quick-sort also loses the input order into the shuffle & each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for integers.
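
A toy sketch of that effect (illustrative only, not Hive's actual shuffle code) - partitioning rows by the modulus alone leaves each bucket file in arbitrary order, while sorting on (modulus, key) puts equal key values next to each other, which is what makes the long RLE runs possible:

import random
from itertools import groupby

random.seed(0)
NUM_BUCKETS = 4

# Toy rows: a low-cardinality key (think r_crs_id) arriving in arbitrary order.
rows = [random.choice([10, 11, 12, 13, 14, 15, 16, 17]) for _ in range(32)]

def runs(values):
    # Count RLE runs: fewer runs means run-length encoding works better.
    return sum(1 for _ in groupby(values))

def bucket(v):
    return v % NUM_BUCKETS

# Shuffle key = (modulus,) only: each bucket file keeps the arbitrary input order.
unsorted_buckets = {b: [v for v in rows if bucket(v) == b] for b in range(NUM_BUCKETS)}

# Shuffle key = (modulus, key): each bucket file is also sorted by the key.
sorted_buckets = {b: sorted(vs) for b, vs in unsorted_buckets.items()}

for b in range(NUM_BUCKETS):
    print(b, "runs unsorted:", runs(unsorted_buckets[b]),
          "runs sorted:", runs(sorted_buckets[b]))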

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve performance problems, but have ended up bucketing as their main problem.

Most of what I have written above was discussed a few years back and in general, bucketing on a high cardinality column + sorting on a low cardinality together has given good results to my customers.

>    I hadn't thought of the even number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, which is can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust the Hive bucketing spec & get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).
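
As a rough illustration of that contract (a sketch only - the real logic lives in Utilities::getBucketIdFromFile and the related writer code): for an integer column the Java hash is the value itself, so the expected bucket is essentially the value modulo the bucket count, and that bucket number is what the file name prefix encodes. The non-negative masking below is an assumption made for illustration, in line with the negative-modulus issue mentioned elsewhere in the thread.

# Sketch of value -> bucket id, not the actual Hive implementation.
def int_hash(value):
    # Java's Integer.hashCode() is the identity function, as shown above.
    return value

def bucket_for(value, num_buckets):
    # Assumed non-negative masking before the modulus (illustration only).
    return (int_hash(value) & 0x7FFFFFFF) % num_buckets

# A row with r_crs_id = 20000 in a 64-bucket table would be expected in the
# bucket below, i.e. in a file whose name starts with that bucket id
# (e.g. 000032_0).
print(bucket_for(20000, 64))   # -> 32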

The file-count is a very flaky way to check if the table is bucketed correctly - either you trust the user to have properly bucketed the table or you don't use bucketing at all. Failing hard on valid tables, instead of falling back softly, does look pretty bad.

I wrote a few UDFs which were used to validate suspect tables and fix them for customers who had bad historical data, which was loaded with "enforce.bucketing=false" or during the short hive-0.13 period affected by HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract: uncovered buckets are assumed to have 0 rows when no file is present, instead of erroring out.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal



Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Gopal,

Thanks for taking the time to try and help.  A few things in relation to your response:

* Yes, the 'epoch' column is an hourly timestamp.  Clustering by a column with high cardinality would make little sense.
* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  My understanding was that this was related to the number of buckets, but you are relating it to ORC stripes.  It is odd that no examples that I've seen include the SORTED BY statement other than in relation to ORC indexes (that I understand).  So the question is: regardless of whether efficient ORC stripes are created, wouldn't I have to also specify 'orc.create.index'='true' for this to have much of an effect?
* Thanks for the hint about using primes when specifying the number of buckets.  Given that r_crs_id covers a wide range, I think it would have worked OK, but any optimizations are appreciated, and I hadn't thought of the even-number issue, not having looked at the function; I had assumed that it was a hash, not a modulus; shame on me.  Reading the docs, I see that a hash is only used on string columns.

Finally, I'm not sure that I got a specific answer to my original question, which is: can I force Tez to create all bucket files so Presto queries can succeed?  Anyway, I will be testing today, and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

Thanks.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal    


Re: Hive, Tez, clustering, buckets, and Presto

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
I'm really confused and could use help understanding.  The Hive documentation here https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

says:

"Bucketed tables are fantastic in that they allow much more efficient sampling than do non-bucketed tables, and they may later allow for time saving operations such as mapside joins. However, the bucketing specified at table creation is not enforced when the table is written to, and so it is possible for the table's metadata to advertise properties which are not upheld by the table's actual layout. This should obviously be avoided. Here's how to do it right."

"This should obviously be avoided".  Yet with the Tez engine it seems to be unavoidable when the data does not allow for all the buckets to be populated.

----- Original Message -----
From: "Gopal Vijayaraghavan" <go...@apache.org>
To: user@hive.apache.org
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto


There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.

Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal    


Re: Hive, Tez, clustering, buckets, and Presto

Posted by Gopal Vijayaraghavan <go...@apache.org>.
There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when the partition is created there are always 64 bucket files, even if some are 0 length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are created, and Presto fails since the table properties in the metastore (64 buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is always set to "true", but I can't seem to find a description of its intent.  Anyway, if it's supposed to force the creation of buckets it's not working with the Tez engine.

Previously you could set "enforce.bucketing=false" and the inserts of data would not use the fixed number of buckets & could play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, consider what happens when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.
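
The same experiment as the two one-liners above, written out as a small Python 3 sketch (xrange is Python 2; range behaves the same way here):

# Count how many of the 64 (or 61) buckets actually receive data when the
# column values are all even.
def buckets_filled(num_buckets, values):
    return len({v % num_buckets for v in values})

even_values = [x * 2 for x in range(1000)]

print(buckets_filled(64, even_values))  # 32 -- half the buckets stay empty
print(buckets_filled(61, even_values))  # 61 -- every bucket gets data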

FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal