Posted to user@hive.apache.org by 何宝宁 <ba...@ecreditpal.com> on 2018/07/25 04:17:05 UTC

Total length of orc clustered table is always 2^31 in TezSplitGrouper

Hi,

While tuning the initial number of mappers with Hive on Tez, I found that when an ORC table is clustered, the total length returned by the estimator is always 2^31.

Hive: 2.3.3
Tez: 0.8.4 (TezSplitGrouper.java:197)

How to replicate:

create table test (f1 string, f2 string) clustered by (f1) into 1 buckets stored as orc tblproperties('transactional'='true');
insert into test values('s1', 's2'), ('s3', 's4');
select count(*) from test;

Search for 'Total length' in the log sys_dag_xxx; the value is 2147483648.

Any suggestions would be appreciated.

Bob He
Thanks

Re: Total length of orc clustered table is always 2^31 in TezSplitGrouper

Posted by 何宝宁 <ba...@ecreditpal.com>.
Thank you, Gopal, for pointing out the root cause. After running alter table xxx compact 'major' to force a compaction, the total length is correct!
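For the reproduction table from the first message, the compaction request and a status check might look like the sketch below. ALTER TABLE ... COMPACT and SHOW COMPACTIONS are standard Hive statements; the table name test comes from the reproduction steps, and it is an assumption here that the compactor is enabled on the metastore. The request is only queued by the first statement and is carried out in the background, so the total length will not change until the compaction has finished.

```sql
-- Queue a major compaction for the example table (table name taken
-- from the reproduction steps; substitute your own table).
ALTER TABLE test COMPACT 'major';

-- The request runs asynchronously; list queued and running compactions
-- and their current state.
SHOW COMPACTIONS;

-- Once the compaction has completed, rerun the query and search the
-- DAG log for 'Total length' again.
SELECT count(*) FROM test;
```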

Is there any way to trigger a compaction immediately after inserting values?

Bob He
Thanks

On 25 Jul 2018, at 1:45 PM, Gopal Vijayaraghavan <go...@apache.org> wrote:

> Search ’Total length’ in log sys_dag_xxx, it is 2147483648.

This is the “placeholder” value (2^31, i.e. INT_MAX + 1) for uncompacted ACID tables.

This is because with ACIDv1 there is no way to generate splits against uncompacted files, so this gets “an empty bucket + unknown number of inserts + updates” placeholder value.

Cheers,
Gopal


Re: Total length of orc clustered table is always 2^31 in TezSplitGrouper

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Search ’Total length’ in log sys_dag_xxx, it is 2147483648.


This is the “placeholder” value (2^31, i.e. INT_MAX + 1) for uncompacted ACID tables.

This is because with ACIDv1 there is no way to generate splits against uncompacted files, so this gets “an empty bucket + unknown number of inserts + updates” placeholder value.


Cheers,

Gopal