Posted to user@hive.apache.org by Shangzhong zhu <sh...@gmail.com> on 2014/07/31 22:18:19 UTC

Hive ORC file map type performance

Hi, we are upgrading our Hive file format from RCFile to ORC and running
some performance evaluations.

Our test table has a map type column. One thing we noticed is that if
the query involves retrieving data from the map column, ORC actually
performs worse than RCFile.

In general, the ORC file format uses far fewer mappers than RCFile.
However, the total running time for ORC is still longer than for RCFile.

Any insights on that? Does the map type in ORC files carry an additional cost?

Here are some settings:
Hive: 0.12
Table partitioned by dt and service. (2 partition columns)
ORC File:
orc.compress            SNAPPY
orc.compress.size       100000
orc.row.index.stride    5000
hive.exec.orc.default.stripe.size=10000000

RCFile:
compression: LZO
default row group size:
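
For reference, the ORC settings above would correspond to a table
definition roughly like the one below. This is only a sketch: the column
types for id and param, and the STRING type of the two partition columns,
are assumptions for illustration and not our actual schema. The property
names and values are the ones listed above.

```sql
-- Session-level stripe size, as listed in the settings above
SET hive.exec.orc.default.stripe.size=10000000;

-- Hypothetical DDL. Column and partition types are assumed, not the real schema
CREATE TABLE default.test_hourly_orc_tbl (
  id    STRING,
  param MAP<STRING, STRING>
)
PARTITIONED BY (dt STRING, service STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"         = "SNAPPY",
  "orc.compress.size"    = "100000",
  "orc.row.index.stride" = "5000"
);
```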

Sample query:
SELECT
t.id,
SUBSTR(t.service, 11) AS Platform,
SUBSTR(t.param['data'], 7) AS Speed
FROM
default.test_hourly_orc_tbl t
WHERE
(t.service='service1-2014' OR t.service='service2-2014' OR
t.service='service3-2014' )
AND t.dt >= '2014-05-20' AND t.dt <= '2014-05-20'
AND SUBSTR(t.param['data'], 0, 6) = 'Speed.'
ORDER BY
Platform;
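
For comparison, a variant of the same query that does not touch the map
column would look roughly like this. It is an illustrative sketch of the
kind of query meant above by "not retrieving data from the map column",
not a query copied from our test runs, and no timings are implied.

```sql
-- Control query: same table, partitions and ordering, but param is not read
SELECT
t.id,
SUBSTR(t.service, 11) AS Platform
FROM
default.test_hourly_orc_tbl t
WHERE
(t.service='service1-2014' OR t.service='service2-2014' OR
t.service='service3-2014' )
AND t.dt >= '2014-05-20' AND t.dt <= '2014-05-20'
ORDER BY
Platform;
```

Running both queries against the ORC and RCFile copies of the table should
show whether the extra time is specific to reading the map column.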