Posted to dev@flume.apache.org by "Roshan Naik (JIRA)" <ji...@apache.org> on 2012/11/20 21:08:58 UTC

[jira] [Created] (FLUME-1734) Create a HCatalog Sink

Roshan Naik created FLUME-1734:
----------------------------------

             Summary: Create a HCatalog Sink 
                 Key: FLUME-1734
                 URL: https://issues.apache.org/jira/browse/FLUME-1734
             Project: Flume
          Issue Type: New Feature
          Components: Sinks+Sources
    Affects Versions: v1.2.0
            Reporter: Roshan Naik
            Assignee: Roshan Naik


Create a sink that would stream data into HCatalog partitions. The primary goal is that once the data is loaded into Hadoop, it should be automatically queryable (using, say, Hive or Pig) without requiring additional post-processing steps on behalf of the users. The sink should manage the creation of new partitions and commit them periodically.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (FLUME-1734) Create a HCatalog Sink

Posted by "Mike Percy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502341#comment-13502341 ] 

Mike Percy commented on FLUME-1734:
-----------------------------------

Hi Roshan,
Cool! A couple of aspects to consider as you are mulling this over:
* If a Flume {{Transaction}} is committed by the sink, then the data must be persisted. We need to avoid getting into states where any committed {{Channel.take()}} could be lost somehow. One way to do that today (it requires some setup, though) is to write to an external Hive table and then periodically do a LOAD via Oozie or something similar, which could move the files out of the external table and into the desired partitions.
* If the HCat APIs don't work with secure metastores or secure HDFS yet, it might be worth considering other APIs at the moment. However, if it can navigate the necessary Hive & Hadoop security features to partition and write the data, it sounds great to me! This is just my opinion; of course, you are welcome to take it or leave it.
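
The first point above (commit only after the data is persisted) can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the real Flume API: `InMemoryChannel`, `drain_batch`, and the `persist` callback are hypothetical names that only model take/commit/rollback behavior.

```python
class InMemoryChannel:
    """Hypothetical stand-in for a channel with take/commit/rollback."""

    def __init__(self, events):
        self._queue = list(events)
        self._in_flight = []  # events taken but not yet committed

    def take(self):
        if not self._queue:
            return None
        event = self._queue.pop(0)
        self._in_flight.append(event)
        return event

    def commit(self):
        # Taken events are gone for good once the transaction commits.
        self._in_flight.clear()

    def rollback(self):
        # Return taken events to the head of the queue in original order.
        self._queue = self._in_flight + self._queue
        self._in_flight = []

    def pending(self):
        return list(self._queue)


def drain_batch(channel, persist, batch_size):
    """Take up to batch_size events, persist them, and only then commit."""
    batch = []
    try:
        while len(batch) < batch_size:
            event = channel.take()
            if event is None:
                break
            batch.append(event)
        if batch:
            persist(batch)  # assumed to return only once the data is durable
        channel.commit()
        return len(batch)
    except Exception:
        # Persisting failed, so nothing taken in this transaction may be lost.
        channel.rollback()
        raise
```

If `persist` raises, the events go back to the channel and can be retried, so a committed take never refers to data that was not written.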
                
> Create a HCatalog Sink 
> -----------------------
>
>                 Key: FLUME-1734
>                 URL: https://issues.apache.org/jira/browse/FLUME-1734
>             Project: Flume
>          Issue Type: New Feature
>          Components: Sinks+Sources
>    Affects Versions: v1.2.0
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>              Labels: features
>
> Create a sink that would stream data into HCatalog partitions. The primary goal is that once the data is loaded into Hadoop, it should be automatically queryable (using, say, Hive or Pig) without requiring additional post-processing steps on behalf of the users. The sink should manage the creation of new partitions and commit them periodically.


[jira] [Commented] (FLUME-1734) Create a HCatalog Sink

Posted by "Mike Percy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501640#comment-13501640 ] 

Mike Percy commented on FLUME-1734:
-----------------------------------

Hey Roshan,
Sounds interesting. Please pardon my limited knowledge about HCatalog, but I have a few questions about the approach you are proposing.

1. Would all of the partitions be calculated on the client side? Or would all of that loading logic happen via map/reduce jobs? Or would it be a mix?
2. If client side, what are the HCatalog API calls that can be used to stream the data onto HDFS?
3. Would this be able to support a secure Metastore? What about Kerberized HDFS clusters?
4. How much overlap do you see with the HDFS sink?

The HCatalog docs that I've found only seem to talk about using HCatalog in the context of Hive, Pig, and other types of MapReduce jobs.
                


[jira] [Commented] (FLUME-1734) Create a HCatalog Sink

Posted by "Roshan Naik (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502241#comment-13502241 ] 

Roshan Naik commented on FLUME-1734:
------------------------------------

Mike,
 1. There will be no MapReduce. This will all be client side (i.e., Flume agents) streaming data in parallel into HCatalog. Clients will compute the specific partition into which the data will be written. Periodically (at a configurable interval) they would 'commit' the currently open partition and roll over to a new partition. Until the partition is committed, the data will not be queryable. There is one restriction: once a partition is committed, its data cannot be modified.

 2. org.apache.hcatalog.data.transfer.* 

 3. I have not verified the secure mode HCat operation, but it appears to be supported. Will get back to you.

 4. At the moment, I don't see much code overlap with the HDFS sink for the core data movement functionality. There may always be room for sharing other smaller tidbits.
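
The flow described in point 1 (the client computes the partition, data becomes visible only on commit, committed partitions are immutable) could look roughly like the sketch below. It is plain Python and does not call org.apache.hcatalog.data.transfer or any other HCatalog API. `partition_for`, `PartitionWriter`, and the time-bucketed partition key are assumptions made for illustration, and rollover is triggered here by an event's timestamp crossing into a new interval rather than by a timer.

```python
from datetime import datetime, timezone


def partition_for(timestamp_s, interval_s):
    """Map an event timestamp (epoch seconds) to the key of its time bucket."""
    start = timestamp_s - (timestamp_s % interval_s)
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%d-%H%M")


class PartitionWriter:
    """Hypothetical client-side writer with one open partition at a time."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.open_key = None
        self.open_rows = []
        self.committed = {}  # partition key -> rows visible to queries

    def write(self, timestamp_s, row):
        key = partition_for(timestamp_s, self.interval_s)
        if key in self.committed:
            # The stated restriction: committed partitions cannot be modified.
            raise ValueError("partition %s is already committed" % key)
        if self.open_key is not None and key != self.open_key:
            self.commit()  # roll over to the new partition
        self.open_key = key
        self.open_rows.append(row)

    def commit(self):
        if self.open_key is None:
            return
        self.committed[self.open_key] = tuple(self.open_rows)
        self.open_key, self.open_rows = None, []

    def query(self, key):
        # Rows in the open, uncommitted partition are not queryable yet.
        return self.committed.get(key, ())
```

With `interval_s=3600`, events stamped within the same hour land in the same partition, and a late event for an hour that has already been committed is rejected rather than appended.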



                


[jira] [Comment Edited] (FLUME-1734) Create a HCatalog Sink

Posted by "Roshan Naik (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/FLUME-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502241#comment-13502241 ] 

Roshan Naik edited comment on FLUME-1734 at 11/21/12 7:41 PM:
--------------------------------------------------------------

Mike,
 1. There will be no MapReduce. This will all be client side (i.e., Flume agents) streaming data in parallel into HCatalog. Clients will compute the specific partition into which the data will be written. Periodically (at a configurable interval) they would 'commit' the currently open partition and roll over to a new partition. Until the partition is committed, the data will not be queryable. There is one restriction: once a partition is committed, its data cannot be modified.

 2. org.apache.hcatalog.data.transfer.* 

 3. I have not verified the secure mode HCat operation, but it appears to be supported. Will get back to you.

 4. At the moment, I don't see much code overlap with the HDFS sink for the core data movement functionality. There may always be room for sharing other smaller tidbits.



                
