Posted to issues@nifi.apache.org by "Bjorn Olsen (JIRA)" <ji...@apache.org> on 2017/03/24 14:39:42 UTC

[jira] [Comment Edited] (NIFI-3644) Add DetectDuplicateUsingHBase processor

    [ https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940467#comment-15940467 ] 

Bjorn Olsen edited comment on NIFI-3644 at 3/24/17 2:39 PM:
------------------------------------------------------------

Hi Joe

Thanks for the suggestion. I hadn't considered writing an HBase version of DistributedMapCache.

I've already written my own DetectDuplicateUsingHBase processor today, as I needed something that was quick to develop.

Working code is here (much of it copy-pasted from DetectDuplicate):
https://github.com/baolsen/nifi/blob/DetectDuplicateUsingHBase/nifi-nar-bundles/nifi-hbase-bundle/nifi-hbase-processors/src/main/java/org/apache/nifi/hbase/DetectDuplicateUsingHBase.java

It seems that implementing an HBase-based DistributedMapCache would be more complex, but also more reusable.
Can you point me to any documentation for this sort of thing?

Lastly, do you think it is worth including DetectDuplicateUsingHBase, or would it be better to wait for a more reusable option?

I'm a bit tight on time, and Java and NiFi are both new to me.
Meanwhile I can keep DetectDuplicateUsingHBase for my own use, so no worries there.


> Add DetectDuplicateUsingHBase processor
> ---------------------------------------
>
>                 Key: NIFI-3644
>                 URL: https://issues.apache.org/jira/browse/NIFI-3644
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Bjorn Olsen
>            Priority: Minor
>
> The DetectDuplicate processor makes use of a distributed map cache for maintaining a list of unique file identifiers (such as hashes).
> The distributed map cache functionality could be provided by an HBase table, which then allows for reliably storing a huge volume of file identifiers and auditing information. The downside of this approach is of course that HBase is required.
> Storing the unique file identifiers in a reliable, query-able manner along with some audit information is of benefit to several use cases.
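
To make the idea in the description concrete, here is a minimal sketch of the duplicate check it relies on: atomically record an identifier unless it is already present, and treat "already present" as a duplicate. This is not the code from the linked branch and not NiFi's or HBase's API. `IdentifierStore`, `InMemoryStore` and `isDuplicate` are hypothetical names, and a `ConcurrentHashMap` stands in for the HBase table so the example runs on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateDetectionSketch {

    /** Hypothetical store abstraction; not a NiFi or HBase interface. */
    public interface IdentifierStore {
        /**
         * Atomically stores auditInfo under identifier if the identifier is absent.
         * Returns the previously stored audit info, or null if the identifier was new.
         */
        String putIfAbsent(String identifier, String auditInfo);
    }

    /** In-memory stand-in for a table keyed by file identifier (assumption for the sketch). */
    public static final class InMemoryStore implements IdentifierStore {
        private final Map<String, String> rows = new ConcurrentHashMap<>();

        @Override
        public String putIfAbsent(String identifier, String auditInfo) {
            // ConcurrentHashMap.putIfAbsent is atomic and returns the old value or null.
            return rows.putIfAbsent(identifier, auditInfo);
        }
    }

    /** Returns true if the identifier had already been recorded before this call. */
    public static boolean isDuplicate(IdentifierStore store, String identifier, String auditInfo) {
        return store.putIfAbsent(identifier, auditInfo) != null;
    }

    public static void main(String[] args) {
        IdentifierStore store = new InMemoryStore();
        System.out.println(isDuplicate(store, "hash-1", "a.txt")); // first sighting
        System.out.println(isDuplicate(store, "hash-1", "b.txt")); // same identifier again
        System.out.println(isDuplicate(store, "hash-2", "c.txt")); // different identifier
    }
}
```

An HBase-backed implementation of the same interface would need an equivalent atomic conditional write, so that two flow files with the same identifier arriving at the same time cannot both be treated as new. The audit information kept alongside each identifier is what makes the stored data query-able later.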


