Posted to dev@flume.apache.org by "E. Sammer (Created) (JIRA)" <ji...@apache.org> on 2011/10/06 21:09:30 UTC

[jira] [Created] (FLUME-776) Create generic APIs for input / output formats and serialization

Create generic APIs for input / output formats and serialization
----------------------------------------------------------------

                 Key: FLUME-776
                 URL: https://issues.apache.org/jira/browse/FLUME-776
             Project: Flume
          Issue Type: New Feature
    Affects Versions: NG alpha 1
            Reporter: E. Sammer
            Priority: Blocker
             Fix For: NG alpha 1


Flume should have a generic set of APIs to handle input and output formats as well as event serialization.

These APIs should offer the same level of abstraction as Hadoop's InputFormat, OutputFormat, RecordReader, RecordWriter, and serializer interfaces / classes. The only reason for not using Hadoop's own implementations of these APIs is that we want to avoid that dependency and everything that comes with it. Examples of API usage would be:

* HDFS sink, text file output, events serialized as JSON
* HDFS sink, text file output, events serialized as text, Snappy compressed
* HDFS sink, Avro file output, events serialized as Avro records, GZIP compressed
* HBase sink, event fields[1] serialized as Thrift

[1] The case of HBase is odd in that the event needs to be broken into individual fields (i.e. extracted to a complex type). This means some kind of custom mapping / extraction code or configuration needs to be supplied by the user; we're not overly concerned with that for this issue.

The implementations of the formats (text file, Avro), serializations (JSON, Avro, Thrift), and compression codecs (Snappy, GZIP) listed above are just examples. We'll open separate JIRAs for implementations. The scope of this JIRA is the framework / infrastructure.
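
A minimal sketch of how such a framework might keep serialization and compression as independent choices, corresponding to the "events serialized as text, GZIP compressed" example above. Every type name here (SketchEvent, EventSerializer, CompressionCodec, TextLineSerializer, GzipCodec, FormatSketch) is hypothetical and is not taken from Flume or Hadoop; an in-memory buffer stands in for the sink's real output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// All names below are illustrative assumptions, not Flume APIs.

/** Minimal stand-in for an event: just an opaque body. */
final class SketchEvent {
    final byte[] body;
    SketchEvent(byte[] body) { this.body = body; }
}

/** Serialization axis: turns one event into bytes on a stream. */
interface EventSerializer {
    void write(SketchEvent event, OutputStream out) throws IOException;
}

/** Compression axis: wraps the raw output stream. */
interface CompressionCodec {
    OutputStream wrap(OutputStream raw) throws IOException;
}

/** Writes the event body followed by a newline. */
final class TextLineSerializer implements EventSerializer {
    @Override
    public void write(SketchEvent event, OutputStream out) throws IOException {
        out.write(event.body);
        out.write('\n');
    }
}

/** GZIP compression using the JDK's built-in stream. */
final class GzipCodec implements CompressionCodec {
    @Override
    public OutputStream wrap(OutputStream raw) throws IOException {
        return new GZIPOutputStream(raw);
    }
}

final class FormatSketch {
    /** Serializes all events through the codec into an in-memory buffer. */
    static byte[] writeAll(Iterable<SketchEvent> events,
                           EventSerializer serializer,
                           CompressionCodec codec) throws IOException {
        ByteArrayOutputStream sinkTarget = new ByteArrayOutputStream();
        // Closing the wrapped stream flushes the compressor's trailer.
        try (OutputStream out = codec.wrap(sinkTarget)) {
            for (SketchEvent e : events) {
                serializer.write(e, out);
            }
        }
        return sinkTarget.toByteArray();
    }

    /** Helper for checking the result: decompresses GZIP bytes to a UTF-8 string. */
    static String gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

Under this assumption, swapping JSON for text or Snappy for GZIP would mean passing a different EventSerializer or CompressionCodec to writeAll, without touching the sink itself.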

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (FLUME-776) Create generic APIs for input / output formats and serialization

Posted by "Mingjie Lai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122210#comment-13122210 ] 

Mingjie Lai commented on FLUME-776:
-----------------------------------

@esammer. Will pre-NG sources/sinks be compatible with NG? Sounds like no. 
                


        

[jira] [Commented] (FLUME-776) Create generic APIs for input / output formats and serialization

Posted by "Mingjie Lai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122304#comment-13122304 ] 

Mingjie Lai commented on FLUME-776:
-----------------------------------

Not really for existing sources/sinks. But we're using a customized HBase sink and UDP source now, so I hope the API won't change too much. 
                


        

[jira] [Commented] (FLUME-776) Create generic APIs for input / output formats and serialization

Posted by "E. Sammer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122270#comment-13122270 ] 

E. Sammer commented on FLUME-776:
---------------------------------

Mingjie:

We're not aiming for backward compatibility, no. I think we'd like to make sure we capture what is really important to people. Is there anything specific you're thinking of?
                


        

[jira] [Updated] (FLUME-776) Create generic APIs for input / output formats and serialization

Posted by "E. Sammer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer updated FLUME-776:
----------------------------

    Fix Version/s:     (was: NG alpha 1)
                   NG alpha 2
    


        

[jira] [Commented] (FLUME-776) Create generic APIs for input / output formats and serialization

Posted by "Joe Crobak (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151719#comment-13151719 ] 

Joe Crobak commented on FLUME-776:
----------------------------------

We are generating Avro Events at the client, encoding these as bytes, and storing them in the body of a FlumeEvent. When these Events get to HDFS, it would be great to write out an Avro data file using the schema of the events in the body of the FlumeEvent (or as a Record with a nested Record in the body). I was thinking we could give the sink a pointer to the avsc file containing the schema to use for writing the data file.

Perhaps it's a special case, but I thought I'd throw that out there as a use-case to consider.
                


        

[jira] [Updated] (FLUME-776) Create generic APIs for input / output formats and serialization

Posted by "E. Sammer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer updated FLUME-776:
----------------------------

    Fix Version/s:     (was: NG alpha 2)
                   NG beta 1

Moving to the next milestone. Unlikely to happen by the end of this week.
                
