Posted to user@hive.apache.org by no jihun <je...@gmail.com> on 2016/04/08 16:01:13 UTC

ORC compaction not happening.

Hello.

Can anyone give me some advice?

I am trying to make this scenario work.

A. Create a bucketed ORC table.

  create table table_orc (field1 string, field2 string)   -- column types illustrative
  clustered by (field1, field2) into 64 buckets
  stored as ORC


B. Add rows to table_orc *HOURLY*.

  insert into table_orc
  select * from hourly_row_2016040821
  distribute by (field1, field2)


# After creating the table with query A
# and then running query B once,

there is one file per bucket.
[image: inline image 1]


Now, one hour later,
I run query B again to import the next hour's data into the same table:

  insert into table_orc
  select * from hourly_row_2016040822
  distribute by (field1, field2)


I expected to see transaction (delta) files,
  as the ORC documentation describes (https://orc.apache.org/docs/acid.html).
  [image: inline image 2]
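
If I understand that page correctly, the table directory should end up
looking roughly like this after a couple of inserts (transaction ids and
bucket numbers here are only illustrative):

  table_orc/delta_0000001_0000001/bucket_00000
  table_orc/delta_0000002_0000002/bucket_00000
  ...
  table_orc/base_0000002/bucket_00000      <- after a major compaction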


But I only found XXXX_copy_i files.
  [image: inline image 3]
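
Instead, the table directory looks more like this (file names
illustrative, one set of files per hourly insert):

  table_orc/000000_0
  ...
  table_orc/000063_0
  table_orc/000000_0_copy_1
  ...
  table_orc/000063_0_copy_1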

and compaction never happens.

These are the ACID settings in Ambari:
  [image: inline image 4]


Is this the expected result?

How can I run "insert into X select from Y" multiple times
and still keep one file per bucket through compaction?

Is there no way to do this with insert queries?


Any advice will be appreciated.

Thank you.

Re: ORC compaction not happening.

Posted by Mich Talebzadeh <mi...@gmail.com>.
I am guessing here, but you may need to define the table as ORC transactional:

hive> show create table sales3;
OK
CREATE TABLE `sales3`(
  `prod_id` bigint,
  `cust_id` bigint,
  `time_id` timestamp,
  `channel_id` bigint,
  `promo_id` bigint,
  `quantity_sold` decimal(10,0),
  `amount_sold` decimal(10,0))
CLUSTERED BY (
  prod_id,
  cust_id,
  time_id,
  channel_id,
  promo_id)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales3'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='512',
  'numRows'='5000000',
  'orc.compress'='SNAPPY',
  'rawDataSize'='0',
  'totalSize'='86027477',
  'transactional'='true',
  'transient_lastDdlTime'='1457429932')

2016-04-08T15:31:46,139 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
oraclehadoop.sales3
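
If you want to try that on your table, a minimal sketch could be something
like the one below. The column types are my assumption, and the compactor
settings also have to be enabled (the initiator on the metastore side) so
that the Initiator/Worker threads actually run:

-- client/session side, usually required for ACID tables
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.enforce.bucketing=true;        -- needed on older Hive releases

-- metastore side (hive-site.xml):
--   hive.compactor.initiator.on=true
--   hive.compactor.worker.threads=1  (or more)

create table table_orc (field1 string, field2 string)   -- types assumed
clustered by (field1, field2) into 64 buckets
stored as orc
tblproperties ('transactional'='true');

With that in place, each hourly insert should write delta_* directories
rather than _copy_N files, and a major compaction should eventually merge
them back into one base file per bucket.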

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


