Posted to dev@hive.apache.org by Sachin Pasalkar <Sa...@symantec.com> on 2015/09/14 12:03:31 UTC

How to enable compaction for table with external data?

Hi,

We are writing ORC files directly from a Storm topology instead of using Hive Streaming (due to a performance issue with our data). However, we want to compact the data, so we added the "NO_AUTO_COMPACTION"="false" option to the table we created to read the data (1.6 GB scattered across multiple small ORC files). Does "NO_AUTO_COMPACTION" mean compaction is skipped only while Hive Streaming is used? If not, why did it not compact our data into one file?

We also tried triggering compaction manually from Java code, using org.apache.hadoop.hive.metastore.txn.TxnHandler's compact API; SHOW COMPACTIONS then reports that a compaction has started, but the data is still not compacted. I don't want to run the manual commands from the command line.

Is there any way?

PS: We are writing all files in one directory only.

Thanks,
Sachin
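
For context, both automatic and manual compaction in Hive apply only to transactional (ACID) tables. A minimal sketch of such a table and a manual compaction request, using a hypothetical table name, looks like:

```sql
-- Sketch only; table name and columns are hypothetical.
-- In Hive of this era, ACID tables must be bucketed ORC tables
-- with 'transactional'='true' in their properties.
CREATE TABLE events (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Request a major compaction (merges base and delta files into one base):
ALTER TABLE events COMPACT 'major';

-- Check queued/running compactions:
SHOW COMPACTIONS;
```

ORC files dropped into the table directory outside of Hive transactions are not delta files, so the compactor has nothing to merge.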



Re: How to enable compaction for table with external data?

Posted by Alan Gates <al...@gmail.com>.
Sorry for the slow response, I missed the email in my inbox.

When you write the data directly from a Storm topology, how are you 
communicating to Hive that the new data exists?  When streaming data in 
via Hive Streaming, the txn commit tells the system that new data is 
arriving in that table or partition and thus that it should watch for a 
need to compact that table or partition.  Are you doing txn commits via 
the metastore thrift interface?

Regardless of this, when you've written data in and you manually request 
a compaction, the compaction should occur if there are delta files.  
Can you share the arguments you are passing to the compact call and the 
output of the SHOW COMPACTIONS you issued afterwards?

Alan.
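
The txn commits Alan asks about require the client side to be configured for ACID transactions; roughly, this is done in hive-site.xml as follows (a sketch, per the Hive transactions documentation of that era):

```xml
<!-- Sketch: client-side settings needed for ACID transactions.
     Without a transaction manager, no commits are recorded and
     the compactor is never told that new data has arrived. -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
```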

> Sachin Pasalkar <ma...@symantec.com>
> September 15, 2015 at 22:35
> Yes, below are the values set in Hive. Initially I had not mentioned 
> NO_AUTO_COMPACTION in my table definition, which didn't work, so I 
> then set it with the value false.
>
> hive.compactor.initiator.on
> hive.compactor.worker.threads
> hive.compactor.worker.timeout
> hive.compactor.check.interval
> hive.compactor.delta.num.threshold
> hive.compactor.delta.pct.threshold
>
> Thanks,
> Sachin
>
> From: Alan Gates <alanfgates@gmail.com>
> Reply-To: "dev@hive.apache.org" <dev@hive.apache.org>
> Date: Tuesday, 15 September 2015 10:30 pm
> To: "dev@hive.apache.org" <dev@hive.apache.org>
> Subject: Re: How to enable compaction for table with external data?
>
> If you want it to compact automatically you should not put 
> NO_AUTO_COMPACTION in the table properties.
>
> First question, did you turn on the compactor on your metastore thrift 
> server?  To do this you need to set a couple of values in the 
> metastore's hive-site.xml:
>
> hive.compactor.initiator.on=true
> hive.compactor.worker.threads=1 # or more
>
> Alan.
>
> Alan Gates <ma...@gmail.com>
> September 15, 2015 at 10:00
> If you want it to compact automatically you should not put 
> NO_AUTO_COMPACTION in the table properties.
>
> First question, did you turn on the compactor on your metastore thrift 
> server?  To do this you need to set a couple of values in the 
> metastore's hive-site.xml:
>
> hive.compactor.initiator.on=true
> hive.compactor.worker.threads=1 # or more
>
> Alan.
>
> Sachin Pasalkar <ma...@symantec.com>
> September 14, 2015 at 3:03
> Hi,
>
> We are writing ORC files directly from a Storm topology instead of 
> using Hive Streaming (due to a performance issue with our data). 
> However, we want to compact the data, so we added the 
> "NO_AUTO_COMPACTION"="false" option to the table we created to read 
> the data (1.6 GB scattered across multiple small ORC files). Does 
> "NO_AUTO_COMPACTION" mean compaction is skipped only while Hive 
> Streaming is used? If not, why did it not compact our data into one file?
>
> We also tried triggering compaction manually from Java code, using 
> org.apache.hadoop.hive.metastore.txn.TxnHandler's compact API; SHOW 
> COMPACTIONS then reports that a compaction has started, but the data 
> is still not compacted. I don't want to run the manual commands from 
> the command line.
>
> Is there any way?
>
> PS: We are writing all files in one directory only.
>
> Thanks,
> Sachin
>
>
>

Re: How to enable compaction for table with external data?

Posted by Sachin Pasalkar <Sa...@symantec.com>.
Yes, below are the values set in Hive. Initially I had not mentioned NO_AUTO_COMPACTION in my table definition, which didn't work, so I then set it with the value false.

hive.compactor.initiator.on
hive.compactor.worker.threads
hive.compactor.worker.timeout
hive.compactor.check.interval
hive.compactor.delta.num.threshold
hive.compactor.delta.pct.threshold

Thanks,
Sachin

From: Alan Gates <al...@gmail.com>
Reply-To: "dev@hive.apache.org" <de...@hive.apache.org>
Date: Tuesday, 15 September 2015 10:30 pm
To: "dev@hive.apache.org" <de...@hive.apache.org>
Subject: Re: How to enable compaction for table with external data?

If you want it to compact automatically you should not put NO_AUTO_COMPACTION in the table properties.

First question, did you turn on the compactor on your metastore thrift server?  To do this you need to set a couple of values in the metastore's hive-site.xml:

hive.compactor.initiator.on=true
hive.compactor.worker.threads=1 # or more

Alan.

Sachin Pasalkar <ma...@symantec.com>
September 14, 2015 at 3:03
Hi,

We are writing ORC files directly from a Storm topology instead of using Hive Streaming (due to a performance issue with our data). However, we want to compact the data, so we added the "NO_AUTO_COMPACTION"="false" option to the table we created to read the data (1.6 GB scattered across multiple small ORC files). Does "NO_AUTO_COMPACTION" mean compaction is skipped only while Hive Streaming is used? If not, why did it not compact our data into one file?

We also tried triggering compaction manually from Java code, using org.apache.hadoop.hive.metastore.txn.TxnHandler's compact API; SHOW COMPACTIONS then reports that a compaction has started, but the data is still not compacted. I don't want to run the manual commands from the command line.

Is there any way?

PS: We are writing all files in one directory only.

Thanks,
Sachin




Re: How to enable compaction for table with external data?

Posted by Alan Gates <al...@gmail.com>.
If you want it to compact automatically you should not put 
NO_AUTO_COMPACTION in the table properties.

First question, did you turn on the compactor on your metastore thrift 
server?  To do this you need to set a couple of values in the 
metastore's hive-site.xml:

hive.compactor.initiator.on=true
hive.compactor.worker.threads=1 # or more

Alan.
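
In hive-site.xml form, the two settings above would look roughly like this (a sketch):

```xml
<!-- Sketch: metastore-side settings that enable the compactor.
     The initiator thread looks for tables needing compaction;
     worker threads actually run the compaction jobs. -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value> <!-- or more -->
</property>
```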

> Sachin Pasalkar <ma...@symantec.com>
> September 14, 2015 at 3:03
> Hi,
>
> We are writing ORC files directly from a Storm topology instead of 
> using Hive Streaming (due to a performance issue with our data). 
> However, we want to compact the data, so we added the 
> "NO_AUTO_COMPACTION"="false" option to the table we created to read 
> the data (1.6 GB scattered across multiple small ORC files). Does 
> "NO_AUTO_COMPACTION" mean compaction is skipped only while Hive 
> Streaming is used? If not, why did it not compact our data into one file?
>
> We also tried triggering compaction manually from Java code, using 
> org.apache.hadoop.hive.metastore.txn.TxnHandler's compact API; SHOW 
> COMPACTIONS then reports that a compaction has started, but the data 
> is still not compacted. I don't want to run the manual commands from 
> the command line.
>
> Is there any way?
>
> PS: We are writing all files in one directory only.
>
> Thanks,
> Sachin
>
>
>