Posted to commits@nifi.apache.org by "Bryan Bende (JIRA)" <ji...@apache.org> on 2015/10/06 16:56:26 UTC

[jira] [Commented] (NIFI-817) Create Processors to interact with HBase

    [ https://issues.apache.org/jira/browse/NIFI-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945138#comment-14945138 ] 

Bryan Bende commented on NIFI-817:
----------------------------------

All, going to pick up where Mark left off and try to make progress on this ticket for 0.4.0...

On the extraction side of things, here is what I gathered from reading the above discussion...
* The GetHBase processor in the patch saves state on a single node, but it also needs to save state across the cluster, most likely in a way similar to what ListHDFS does with the distributed cache
* Would like a property/properties to specify columns and column families to return, and possibly filters as well
* Consider using Avro as an output mechanism to provide a schema for the results
* Consider using a replication end-point to stream WALs
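As a rough sketch of the column/column-family property from the list above, here is one way such a property value could be parsed. The "family:qualifier" comma-separated format and all names here are assumptions for illustration, not the final design:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for a "Columns" property value such as
// "cf1:colA, cf2" -- "family:qualifier" selects a single column,
// while a bare "family" selects the whole column family.
public class ColumnsPropertyParser {

    // Returns [family, qualifier] pairs; qualifier is null for a whole family.
    public static List<String[]> parse(String value) {
        List<String[]> columns = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            return columns;
        }
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            int sep = trimmed.indexOf(':');
            if (sep < 0) {
                columns.add(new String[] { trimmed, null });            // whole family
            } else {
                columns.add(new String[] { trimmed.substring(0, sep),
                                           trimmed.substring(sep + 1) }); // single column
            }
        }
        return columns;
    }
}
```

The processor would then translate each pair into the corresponding HBase Scan configuration (a whole family vs. a specific family/qualifier).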

I looked at the replication endpoint a little bit and it does seem like an interesting concept. My understanding is that you deploy a jar containing the implementation of your endpoint to the lib directory of every region server; the endpoint is then responsible for sending data to the other system, and some code also has to be run to register/turn on your endpoint. The best example I found was this:
https://github.com/risdenk/hbase-custom-replication-endpoint-example

We would have to figure out how this replication endpoint would send data to NiFi. The first thing that comes to mind is the SiteToSiteClient, but I haven't really thought this through. I'm wondering if we should proceed for now on the GetHBase processor (with the improvements above) and track this replication idea in another ticket, since it would likely have a much different feel than a regular processor. Thoughts?

The put side of things seems to be more straightforward... I refactored the processor in the current patch to pull in a configurable batch of FlowFiles on each call to onTrigger, group them by table, and make one call to table.put(List<Put>) per table. In the best case all FlowFiles are for the same table and it is a single call; in the worst case they are all for different tables and it is no different from processing each FlowFile one at a time.
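A minimal sketch of that grouping step, using plain strings in place of FlowFiles and Puts (in the real processor the table name would come from a FlowFile attribute; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative batching: group incoming items by table name so that
// each distinct table gets exactly one table.put(List<Put>) call
// per onTrigger invocation.
public class PutBatcher {

    // Input: [tableName, put] pairs; output: table name -> puts for that table.
    public static Map<String, List<String>> groupByTable(List<String[]> flowFiles) {
        Map<String, List<String>> byTable = new LinkedHashMap<>();
        for (String[] ff : flowFiles) {
            String table = ff[0]; // e.g. value of a hypothetical "hbase.table" attribute
            String put = ff[1];   // stand-in for the Put built from the FlowFile content
            byTable.computeIfAbsent(table, t -> new ArrayList<>()).add(put);
        }
        // The caller would then issue one table.put(entry.getValue()) per entry.
        return byTable;
    }
}
```

With all FlowFiles destined for one table this collapses to a single batched call; with all-different tables it degrades gracefully to one call each.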


> Create Processors to interact with HBase
> ----------------------------------------
>
>                 Key: NIFI-817
>                 URL: https://issues.apache.org/jira/browse/NIFI-817
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Bryan Bende
>             Fix For: 0.4.0
>
>         Attachments: 0001-NIFI-817-Initial-implementation-of-HBase-processors.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)