
Posted to issues@nifi.apache.org by "Adam Fisher (JIRA)" <ji...@apache.org> on 2019/03/30 16:22:00 UTC

[jira] [Commented] (NIFI-6166) Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor

    [ https://issues.apache.org/jira/browse/NIFI-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805872#comment-16805872 ] 

Adam Fisher commented on NIFI-6166:
-----------------------------------

The initial implementation of the *DetectDuplicateRecord* processor is currently underway and must be completed before this issue can be implemented.

> Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor
> ------------------------------------------------------------------------
>
>                 Key: NIFI-6166
>                 URL: https://issues.apache.org/jira/browse/NIFI-6166
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Adam Fisher
>            Priority: Minor
>              Labels: features
>
> Currently the *DetectDuplicateRecord* processor supports the *HASH_SET_VALUE* and *BLOOM_FILTER_VALUE* filter types. Adding *DISTRIBUTED_HASH_SET_VALUE* as a third option would be useful when you want to check a large dataset for duplicates without loading all of the cached entries into memory:
> {code:java}
>     static final AllowableValue DISTRIBUTED_HASH_SET_VALUE = new AllowableValue("distributed-hash-set", "Distributed HashSet",
>             "Exactly matches records seen before with 100% accuracy at the expense of more storage usage. " +
>             "Stores one entry per record in the distributed cache, and checks the cache directly rather than loading the filter into memory during duplicate detection. " +
>             "This filter is preferred when processing large data sets and complete accuracy is preferred.");
> {code}
> When the user selects this filter type, the cache entry identifier should probably be treated as a prefix, so that the keys of entries in the cache look like this:
> {code:java}
> CacheKey = CacheEntryIdentifier + Hash(RecordPath1 + "~" + RecordPath2 + "~" + RecordPath3 + "~" + ...)
> {code}
> {code}
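
The key scheme and the "check the cache directly" behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the processor's implementation: the class and method names are hypothetical, SHA-256 is an assumed choice for the unspecified Hash function, and a ConcurrentHashMap stands in for the distributed cache.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; none of these names come from the NiFi code base.
final class DistributedHashSetSketch {

    // Separator between record path values, as proposed in the issue.
    private static final String JOIN_CHAR = "~";

    // Assumption: an in-memory map standing in for the distributed cache.
    private final ConcurrentMap<String, Boolean> cache = new ConcurrentHashMap<>();

    // Used as a key prefix rather than as the key of a single cache entry.
    private final String cacheEntryIdentifier;

    DistributedHashSetSketch(String cacheEntryIdentifier) {
        this.cacheEntryIdentifier = cacheEntryIdentifier;
    }

    // CacheKey = prefix + Hash(value1 + "~" + value2 + "~" + ...)
    static String cacheKey(String prefix, List<String> recordPathValues) {
        String joined = String.join(JOIN_CHAR, recordPathValues);
        try {
            // Assumption: SHA-256; the issue does not name a hash function.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return prefix + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Returns true if an identical record was already seen.
    // One cache entry is stored per record; no filter is loaded into memory.
    boolean isDuplicate(List<String> recordPathValues) {
        String key = cacheKey(cacheEntryIdentifier, recordPathValues);
        // putIfAbsent returns the previous value, or null if the key was new.
        return cache.putIfAbsent(key, Boolean.TRUE) != null;
    }
}
```

With this sketch, the first call to isDuplicate for a given list of values returns false and any later call with the same values returns true. One design point to keep in mind: joining values with "~" is ambiguous when a value itself contains "~" (for example, "a~b", "c" and "a", "b~c" produce the same joined string), so a real implementation might need to escape or length-prefix the values.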


