Posted to user@hive.apache.org by patcharee <Pa...@uni.no> on 2015/04/20 17:29:38 UTC

merge small orc files

Hi,

How do I configure hive-site.xml to automatically merge small
ORC files (output from a MapReduce job) in Hive 0.14?

This is my current configuration:

     <property>
       <name>hive.merge.mapfiles</name>
       <value>true</value>
     </property>

     <property>
       <name>hive.merge.mapredfiles</name>
       <value>true</value>
     </property>

     <property>
       <name>hive.merge.orcfile.stripe.level</name>
       <value>true</value>
     </property>

However, the output from a MapReduce job, which is stored as ORC 
files, was not merged. This is the output:

-rwxr-xr-x   1 root hdfs          0 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/_SUCCESS
-rwxr-xr-x   1 root hdfs      29072 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-00000
-rwxr-xr-x   1 root hdfs      29049 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-00001
-rwxr-xr-x   1 root hdfs      29075 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-00002

Any ideas?

BR,
Patcharee

Re: merge small orc files

Posted by Xuefu Zhang <xz...@cloudera.com>.
Also check hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
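
For reference, here is a sketch of how those two properties could be set 
in hive-site.xml. The values shown are the stock defaults as I recall 
them (256 MB per merged file, 16 MB average-size threshold); treat them 
as illustrative, not as recommendations:

     <property>
       <name>hive.merge.size.per.task</name>
       <value>256000000</value>
     </property>

     <property>
       <name>hive.merge.smallfiles.avgsize</name>
       <value>16000000</value>
     </property>

The merge stage only kicks in when the average size of the job's output 
files falls below hive.merge.smallfiles.avgsize; it then combines them 
into files of roughly hive.merge.size.per.task bytes.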

On Mon, Apr 20, 2015 at 8:29 AM, patcharee <Pa...@uni.no>
wrote:

> Hi,
>
> How to set the configuration hive-site.xml to automatically merge small
> orc file (output from mapreduce job) in hive 0.14 ?
>
> This is my current configuration>
>
>     <property>
>       <name>hive.merge.mapfiles</name>
>       <value>true</value>
>     </property>
>
>     <property>
>       <name>hive.merge.mapredfiles</name>
>       <value>true</value>
>     </property>
>
>     <property>
>       <name>hive.merge.orcfile.stripe.level</name>
>       <value>true</value>
>     </property>
>
> However the output from a mapreduce job, which is stored into an orc file,
> was not merged. This is the output>
>
> -rwxr-xr-x   1 root hdfs          0 2015-04-20 15:23
> /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
> -rwxr-xr-x   1 root hdfs      29072 2015-04-20 15:23
> /apps/hive/warehouse/coordinate/zone=2/part-r-00000
> -rwxr-xr-x   1 root hdfs      29049 2015-04-20 15:23
> /apps/hive/warehouse/coordinate/zone=2/part-r-00001
> -rwxr-xr-x   1 root hdfs      29075 2015-04-20 15:23
> /apps/hive/warehouse/coordinate/zone=2/part-r-00002
>
> Any ideas?
>
> BR,
> Patcharee
>

Re: merge small orc files

Posted by patcharee <Pa...@uni.no>.
Hi Gopal,

The table created is not a bucketed table, but a dynamically 
partitioned table. I took the test script from 
https://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/orc_merge7.q

create table orc_merge5 (userid bigint, string1 string, subtype double, 
decimal1 decimal, ts timestamp) stored as orc;

create table orc_merge5a (userid bigint, string1 string, subtype double, 
decimal1 decimal, ts timestamp) partitioned by (st double) stored as orc;

I sent you the "desc formatted" output of the table and the application 
log. I just found out that there are some TezExceptions which could be 
the cause of the problem. Please let me know how to fix it.

BR,

Patcharee


On 21. april 2015 13:10, Gopal Vijayaraghavan wrote:
>
>> alter table <table> concatenate do not work? I have a dynamic
>> partitioned table (stored as orc). I tried to alter concatenate, but it
>> did not work. See my test result.
> ORC fast concatenate does work on partitioned tables, but it doesn't work
> on bucketed tables.
>
> Bucketed tables cannot merge files, since the file count is capped by the
> numBuckets parameter.
>
>> hive> dfs -ls
>> ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
>> Found 2 items
>> -rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33
>> /apps/hive/warehouse/orc_merge5a/st=0.8/000000_0
>> -rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33
>> /apps/hive/warehouse/orc_merge5a/st=0.8/000001_0
> Is this a bucketed table?
>
> When you look at the point of view of split generation & cluster
> parallelism, bucketing is an anti-pattern, since in most query schemas it
> significantly slows down the slowest task.
>
> Making the fastest task faster isn't often worth it, if the overall query
> time goes up.
>
> Also if you want to, you can send me the yarn logs -applicationId <app-id>
> and the desc formatted of the table, which will help me understand what's
> happening better.
>
> Cheers,
> Gopal
>
>


Re: merge small orc files

Posted by Gopal Vijayaraghavan <go...@apache.org>.

>alter table <table> concatenate do not work? I have a dynamic
>partitioned table (stored as orc). I tried to alter concatenate, but it
>did not work. See my test result.

ORC fast concatenate does work on partitioned tables, but it doesn't work
on bucketed tables.

Bucketed tables cannot merge files, since the file count is capped by the
numBuckets parameter.
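
To illustrate with a hypothetical table (not one from this thread): a 
table declared with a fixed bucket count always keeps exactly that many 
files per partition, one per bucket, so concatenate has nothing to merge:

    -- each partition will always hold exactly 4 files
    -- (000000_0 .. 000003_0), one per bucket
    create table orc_bucketed (userid bigint, string1 string)
      partitioned by (st double)
      clustered by (userid) into 4 buckets
      stored as orc;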

>hive> dfs -ls 
>${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
>Found 2 items
>-rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33
>/apps/hive/warehouse/orc_merge5a/st=0.8/000000_0
>-rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33
>/apps/hive/warehouse/orc_merge5a/st=0.8/000001_0

Is this a bucketed table?

When you look at the point of view of split generation & cluster
parallelism, bucketing is an anti-pattern, since in most query schemas it
significantly slows down the slowest task.

Making the fastest task faster isn't often worth it, if the overall query
time goes up.

Also if you want to, you can send me the "yarn logs -applicationId 
<app-id>" output and the "desc formatted" of the table, which will help 
me understand what's happening better.

Cheers,
Gopal



Re: merge small orc files

Posted by patcharee <Pa...@uni.no>.
Hi Gopal,

Thanks for your explanation.

What could be the case that SET hive.merge.orcfile.stripe.level=true && 
alter table <table> concatenate do not work? I have a dynamic 
partitioned table (stored as orc). I tried to alter concatenate, but it 
did not work. See my test result.

hive> SET hive.merge.orcfile.stripe.level=true;
hive> alter table orc_merge5a partition(st=0.8) concatenate;
Starting Job = job_1424363133313_0053, Tracking URL = 
http://service-test-1-2.testlocal:8088/proxy/application_1424363133313_0053/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job  -kill 
job_1424363133313_0053
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-04-21 12:32:56,165 null map = 0%,  reduce = 0%
2015-04-21 12:33:05,964 null map = 100%,  reduce = 0%
Ended Job = job_1424363133313_0053
Loading data to table default.orc_merge5a partition (st=0.8)
Moved: 
'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/000000_0' 
to trash at: 
hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Moved: 
'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/000002_0' 
to trash at: 
hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Partition default.orc_merge5a{st=0.8} stats: [numFiles=2, numRows=0, 
totalSize=1067, rawDataSize=0]
MapReduce Jobs Launched:
Stage-null:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 22.839 seconds
hive> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33 
/apps/hive/warehouse/orc_merge5a/st=0.8/000000_0
-rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33 
/apps/hive/warehouse/orc_merge5a/st=0.8/000001_0

It seems nothing happened when I altered table concatenate. Any ideas?

BR,
Patcharee

On 21. april 2015 04:41, Gopal Vijayaraghavan wrote:
> Hi,
>
>> How to set the configuration hive-site.xml to automatically merge small
>> orc file (output from mapreduce job) in hive 0.14 ?
> Hive cannot add work-stages to a map-reduce job.
>
> Hive follows merge.mapfiles=true when Hive generates a plan, by adding
> more work to the plan as a conditional task.
>
>> -rwxr-xr-x   1 root hdfs      29072 2015-04-20 15:23
>> /apps/hive/warehouse/coordinate/zone=2/part-r-00000
> This looks like it was written by an MRv2 Reducer and not by the Hive
> FileSinkOperator & handled by the MR outputcommitter instead of the Hive
> MoveTask.
>
> But 0.14 has an option which helps, "hive.merge.orcfile.stripe.level". If
> that is true (like your setting), then do
>
> "alter table <table> concatenate"
>
> which effectively concatenates ORC blocks (without decompressing them),
> while maintaining metadata linkage of start/end offsets in the footer.
>
> Cheers,
> Gopal
>
>


Re: merge small orc files

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,

>How to set the configuration hive-site.xml to automatically merge small
>orc file (output from mapreduce job) in hive 0.14 ?

Hive cannot add work-stages to a map-reduce job.

Hive follows merge.mapfiles=true when Hive generates a plan, by adding
more work to the plan as a conditional task.

>-rwxr-xr-x   1 root hdfs      29072 2015-04-20 15:23
>/apps/hive/warehouse/coordinate/zone=2/part-r-00000

This looks like it was written by an MRv2 Reducer and not by the Hive
FileSinkOperator & handled by the MR outputcommitter instead of the Hive
MoveTask.
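
If the data has already landed as reducer output, one workaround (a 
sketch only; the hive.merge settings must be on, and the column names 
below are placeholders for your actual schema) is to rewrite the 
partition through a Hive query, so that the FileSinkOperator and the 
conditional merge task get a chance to run:

    set hive.merge.mapredfiles=true;
    -- rewriting the partition through Hive lets the merge stage run;
    -- replace col1, col2 with the table's actual columns
    insert overwrite table coordinate partition (zone=2)
    select col1, col2 from coordinate where zone=2;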

But 0.14 has an option which helps, "hive.merge.orcfile.stripe.level". If
that is true (like your setting), then do

"alter table <table> concatenate"

which effectively concatenates ORC blocks (without decompressing them),
while maintaining metadata linkage of start/end offsets in the footer.

Cheers,
Gopal