Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2009/03/17 17:42:50 UTC

[jira] Commented: (HIVE-352) Make Hive support column based storage

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716 ] 

Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------

thanks for taking this on. this could be pretty awesome.

traditionally the arguments for columnar storage have been limited 'scan bandwidth' and compression. In practice - we see that scan bandwidth has two components:
1. disk/file-system bandwidth to read data
2. compute cost to scan data

most columnar stores optimize for both (especially because in shared-disk architectures - #1 is at a premium). However - our limited experience says that in Hadoop #1 is almost infinite. #2 can still be a bottleneck though. (it is possible that this observation holds because of high hadoop/java compute overheads - regardless - this seems to be the reality).

Given this - i like the idea of a scheme where columns are stored as independent streams inside a block oriented file format (each file block contains a set of rows, however - the organization inside blocks is by column). This does not optimize for #1 - but does optimize for #2 (potentially in conjunction with Hive's interfaces to get one column at a time from IO Libraries). It also gives us nearly equivalent compression.

(The alternative scheme of having different file(s) per column is also complicated by the fact that locality is almost impossible to ensure - and there is no reasonable way of asking hdfs to colocate different file segments in the near future).
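The block-within-file layout above could be sketched roughly as follows - a hypothetical "row group" that accepts rows but buffers each column in its own stream, so a reader can materialize only the columns a query touches. The class and the toy length-prefix encoding are illustrative assumptions, not Hive code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch: a block holds a batch of rows, but the bytes
// inside the block are organized column by column.
class ColumnarBlock {
    private final ByteArrayOutputStream[] columnStreams;

    ColumnarBlock(int numColumns) {
        columnStreams = new ByteArrayOutputStream[numColumns];
        for (int i = 0; i < numColumns; i++) {
            columnStreams[i] = new ByteArrayOutputStream();
        }
    }

    // Append one row: value i is appended to the stream for column i.
    void addRow(List<String> row) throws IOException {
        for (int i = 0; i < row.size(); i++) {
            byte[] v = row.get(i).getBytes(StandardCharsets.UTF_8);
            columnStreams[i].write(v.length);  // toy length prefix (values < 128 bytes)
            columnStreams[i].write(v);
        }
    }

    // A reader asks for just the columns it needs - the other streams
    // never have to be scanned or decompressed.
    byte[] columnBytes(int column) {
        return columnStreams[column].toByteArray();
    }
}
```

A query projecting one column out of many would then touch only that column's stream within each block, which is exactly the #2 (compute) saving described above.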

--

i would love to understand how you are planning to approach this. will we still use sequencefiles as a container - or should we ditch it? (it wasn't a great fit for hive - given that we don't use the key field - but it was the best thing we could find). We have seen that having a number of open codecs can hurt memory usage - that's one open question for me: can we actually afford to open N concurrent compressed streams (assuming each column is stored compressed separately)?
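To make the memory question concrete, here is a minimal sketch of what "N concurrent compressed streams" implies with plain java.util.zip: one live Deflater (each with its own native zlib window/dictionary buffers) per column, all open at once while a block is being written. The class is an illustrative assumption, not an existing Hive or Hadoop codec API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical sketch: compressing each column independently means N
// Deflater instances held open simultaneously for one file being written.
class PerColumnCompressors implements AutoCloseable {
    private final Deflater[] deflaters;
    private final DeflaterOutputStream[] streams;
    private final ByteArrayOutputStream[] sinks;

    PerColumnCompressors(int numColumns) {
        deflaters = new Deflater[numColumns];
        streams = new DeflaterOutputStream[numColumns];
        sinks = new ByteArrayOutputStream[numColumns];
        for (int i = 0; i < numColumns; i++) {
            deflaters[i] = new Deflater(Deflater.BEST_SPEED);
            sinks[i] = new ByteArrayOutputStream();
            streams[i] = new DeflaterOutputStream(sinks[i], deflaters[i]);
        }
    }

    void write(int column, byte[] data) throws IOException {
        streams[column].write(data);
    }

    byte[] finish(int column) throws IOException {
        streams[column].finish();
        return sinks[column].toByteArray();
    }

    @Override
    public void close() {
        for (Deflater d : deflaters) d.end();  // release native zlib memory
    }
}
```

For a wide table, the per-Deflater buffers add up - which is the affordability question: whether that fixed overhead times N columns times M concurrent writers fits in a task's heap.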

It also seems that one could define a ColumnarInputFormat/OutputFormat as a generic api with different implementations and pluggable containers underneath - and a scheme of either file-per-column or columnar-within-block. in that sense we could build something more generic for hadoop (and then just make sure that hive's lazy serde uses the columnar api for data access - instead of the row-based api exposed by the current inputformat).
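A column-oriented read contract along those lines might look like the sketch below - select columns up front, iterate blocks, fetch only selected columns - with a toy in-memory implementation to show the access pattern. All names here are illustrative assumptions, not an existing Hadoop API:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical column-at-a-time contract, as opposed to the row-at-a-time
// record reader exposed by today's InputFormat.
interface ColumnarReader extends AutoCloseable {
    // Declare which columns the query needs; the underlying container
    // (columnar block, file-per-column, ...) can skip the rest entirely.
    void selectColumns(List<Integer> columnIds) throws IOException;

    // Advance to the next row group; returns false at end of input.
    boolean nextBlock() throws IOException;

    // Fetch the raw bytes of one selected column in the current block.
    byte[] readColumn(int columnId) throws IOException;

    @Override
    void close() throws IOException;
}

// Toy container for illustration: blocks[b][c] holds the bytes of
// column c in block b.
class InMemoryColumnarReader implements ColumnarReader {
    private final byte[][][] blocks;
    private List<Integer> selected = List.of();
    private int current = -1;

    InMemoryColumnarReader(byte[][][] blocks) { this.blocks = blocks; }

    public void selectColumns(List<Integer> columnIds) { selected = columnIds; }

    public boolean nextBlock() { return ++current < blocks.length; }

    public byte[] readColumn(int columnId) throws IOException {
        if (!selected.contains(columnId))
            throw new IOException("column " + columnId + " was not selected");
        return blocks[current][columnId];
    }

    public void close() {}
}
```

The same interface could then be backed by either container scheme, and a lazy serde would call readColumn only for the columns a query references.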

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: he yongqiang
>
> Column-based storage has been proven to be a better storage layout for OLAP. 
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance hive to support column-based storage. 
> Actually, we have done some work on column-based storage on top of hdfs; i think it will need some review and refactoring to port it to Hive.
> Any thoughts?
