Posted to dev@flume.apache.org by "alex gemini (JIRA)" <ji...@apache.org> on 2012/11/06 14:26:12 UTC

[jira] [Commented] (FLUME-1669) Add support for columnar event serializer in HDFS

    [ https://issues.apache.org/jira/browse/FLUME-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491442#comment-13491442 ] 

alex gemini commented on FLUME-1669:
------------------------------------

The block size of a sequence file or Avro file is usually quite small, whereas a columnar format only pays off when the block size is large, usually a few GB at minimum (see the Trevni spec, "Design" section, line 2). Holding that much data in memory is not practical, considering that the service may crash or reload its configuration. It would be better to write sequence or Avro files to a directory and then, when Flume rolls to the next directory, merge the finished directory into the columnar format. Another thing to note is that the query engines (Hive, Pig and others) currently do not support a single directory containing two different file formats, although Hive does support a table whose partitions use different file formats. So I think Flume should perhaps monitor two directories. One is the directory currently being written, which receives small Avro or sequence files from multiple writers. When the data stream rolls to the next directory, Flume would merge those Avro or sequence files into the Trevni columnar format, maybe using MapReduce.
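
A minimal sketch of the roll-then-merge flow proposed above, in Python and using only the local filesystem. The function names, directory names and file extensions are hypothetical, and the merge is a plain concatenation standing in for the MapReduce job that would write Trevni. None of this is Flume or Trevni API.

```python
import shutil
import tempfile
from pathlib import Path


def write_small_file(current_dir, name, events):
    """Stand-in for one Flume writer producing a small row-oriented file."""
    current_dir.mkdir(parents=True, exist_ok=True)
    path = current_dir / name
    path.write_text("".join(event + "\n" for event in events))
    return path


def compact_rolled_dir(rolled_dir, columnar_dir):
    """Merge every small file of a rolled directory into one large file.

    Assumption: in the proposal this step would convert to Trevni, possibly
    as a MapReduce job. Here it only concatenates the files, in name order.
    """
    columnar_dir.mkdir(parents=True, exist_ok=True)
    target = columnar_dir / (rolled_dir.name + ".merged")
    tmp = columnar_dir / (rolled_dir.name + ".tmp")
    with tmp.open("w") as out:
        for small in sorted(rolled_dir.iterdir()):
            out.write(small.read_text())
    # Publish under the final name only once the merge is complete, so a
    # reader of the columnar directory never sees a half-written file.
    tmp.rename(target)
    # The small files are no longer needed after a successful merge.
    shutil.rmtree(rolled_dir)
    return target


def demo():
    root = Path(tempfile.mkdtemp())
    current = root / "current" / "2012-11-06-13"  # hypothetical hourly bucket
    write_small_file(current, "writer-a.seq", ["e1", "e2"])
    write_small_file(current, "writer-b.seq", ["e3"])
    # Flume rolls to the next directory; the finished one is merged.
    merged = compact_rolled_dir(current, root / "columnar")
    return merged.read_text().splitlines()


if __name__ == "__main__":
    print(demo())
```

Keeping the small files and the merged output in separate directories means a query engine pointed at either directory sees only one file format, which is the constraint mentioned above.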
                
> Add support for columnar event serializer in HDFS
> -------------------------------------------------
>
>                 Key: FLUME-1669
>                 URL: https://issues.apache.org/jira/browse/FLUME-1669
>             Project: Flume
>          Issue Type: New Feature
>          Components: Sinks+Sources
>            Reporter: Mubarak Seyed
>            Assignee: Mubarak Seyed
>              Labels: noob
>             Fix For: v1.4.0
>
>
> Motivation:
> Columnar storage is preferred for better performance and compression for low-latency analytical workloads. Avro 1.7.2 supports a column-major file format [1]
> and we can implement {{AbstractTrevniAvroEventSerializer}} (like {{AbstractAvroEventSerializer}}). {{HDFSSink}} can have a serializer type to store events in the Trevni column-major file format.
> [1]    http://avro.apache.org/docs/current/trevni/spec.html
>        https://github.com/cutting/trevni

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira