Posted to dev@orc.apache.org by Gopal Vijayaraghavan <go...@apache.org> on 2016/11/04 05:07:55 UTC

[discuss] Fixing DecimalColumnVector cache misses

Hi,

(x-posted for discussion)

Hive's storage-api + ORC vector readers have a cache miss built into them for the case of Decimal readers.

With LLAP, two distinct cache misses are basically dragging Decimal performance down.

DecimalColumnVector -> HiveDecimalWritable -> HiveDecimal(BigInteger) -> new BigDecimal()

The writable is entirely overhead, and so are the BigInteger -> BigDecimal conversions, particularly since a HiveDecimal is always a separate heap object behind a reference, unlike a primitive "long".

Modifying the writable involves a fresh allocation of a HiveDecimal, which makes the object reference a rather unsightly cache miss (this is TPC-H Q1).
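
To make the chain above a bit more concrete, here is a rough sketch that uses only java.math. BoxedDecimal, BoxedWritable and both sum methods are hypothetical stand-ins written for illustration; they are not the real DecimalColumnVector, HiveDecimalWritable or HiveDecimal classes. The flat variant assumes every unscaled value fits in a 64-bit long (precision of 18 digits or fewer), which is an assumption of this sketch and not something the thread has settled on.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalLayoutSketch {

    /** Hypothetical stand-in for an immutable decimal value (not Hive's HiveDecimal). */
    static final class BoxedDecimal {
        final BigDecimal value;

        BoxedDecimal(BigInteger unscaled, int scale) {
            // BigInteger -> BigDecimal: a second object built on top of the first.
            this.value = new BigDecimal(unscaled, scale);
        }
    }

    /** Hypothetical stand-in for a mutable writable that wraps an immutable value. */
    static final class BoxedWritable {
        BoxedDecimal current;

        void set(long unscaled, int scale) {
            // Every update allocates a fresh value object, so the reference
            // held by the writable points at a new heap location each time.
            current = new BoxedDecimal(BigInteger.valueOf(unscaled), scale);
        }
    }

    /** Sums a column stored as writable -> value -> BigDecimal references. */
    public static BigDecimal sumBoxed(long[] unscaled, int scale) {
        BoxedWritable[] vector = new BoxedWritable[unscaled.length];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = new BoxedWritable();
            vector[i].set(unscaled[i], scale);
        }
        BigDecimal sum = BigDecimal.ZERO;
        for (BoxedWritable w : vector) {
            // Each row follows several references before reaching the digits.
            sum = sum.add(w.current.value);
        }
        return sum;
    }

    /** Sums the same column stored as contiguous unscaled longs with one shared scale. */
    public static BigDecimal sumFlat(long[] unscaled, int scale) {
        long[] vector = unscaled.clone(); // contiguous primitive storage
        long sum = 0;
        for (long v : vector) {
            sum = Math.addExact(sum, v); // throws ArithmeticException on overflow
        }
        // A single BigDecimal is created at the end, not one per row.
        return BigDecimal.valueOf(sum, scale);
    }

    public static void main(String[] args) {
        long[] raw = {12345L, 250L, -100L}; // 123.45, 2.50, -1.00 at scale 2
        System.out.println(sumBoxed(raw, 2));
        System.out.println(sumFlat(raw, 2));
    }
}
```

Both methods return the same total; the difference is only in how many objects get allocated and how many references get chased per row.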



Changing this in hive/storage-api will produce a chicken-egg scenario between hive/storage-api -> orc -> hive/ql/exec/vectorization, across projects.

I'm conflicted on how to change DecimalColumnVector in one shot without breaking things (and, if possible, remove BigInteger allocations in the read-path as a further optimization).

Suggestions/discuss?

Cheers,
Gopal