Posted to issues@nifi.apache.org by "Adam Fisher (JIRA)" <ji...@apache.org> on 2019/03/30 16:22:00 UTC
[jira] [Commented] (NIFI-6166) Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor
[ https://issues.apache.org/jira/browse/NIFI-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805872#comment-16805872 ]
Adam Fisher commented on NIFI-6166:
-----------------------------------
The initial implementation of the *DetectDuplicateRecord* processor is still underway and must be completed before this can be implemented.
> Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor
> ------------------------------------------------------------------------
>
> Key: NIFI-6166
> URL: https://issues.apache.org/jira/browse/NIFI-6166
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Adam Fisher
> Priority: Minor
> Labels: features
>
> Currently, the *DetectDuplicateRecord* processor supports *HASH_SET_VALUE* and *BLOOM_FILTER_VALUE*. Adding *DISTRIBUTED_HASH_SET_VALUE* as a third option could be useful when you want to check a large dataset for duplicates without loading all the cached entries into memory:
> {code:java}
> static final AllowableValue DISTRIBUTED_HASH_SET_VALUE = new AllowableValue("distributed-hash-set", "Distributed HashSet",
> "Exactly matches records seen before with 100% accuracy at the expense of more storage usage. " +
> "Stores one entry per record in the distributed cache, and checks the cache directly rather than loading the filter into memory during duplicate detection. " +
> "This filter is preferred when processing large data sets and complete accuracy is preferred.");
> {code}
> When the user selects this filter type, the cache entry identifier should probably be treated as a prefix, so that the keys of the cache entries would look like this:
> {code:java}
> CacheKey = CacheEntryIdentifier + Hash(RecordPath1 + "~" + RecordPath2 + "~" + RecordPath3 + "~" + ...)
> {code}
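The key scheme quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the processor's implementation: the class and method names are hypothetical, SHA-256 is an assumed choice for Hash(...), and a local ConcurrentHashMap stands in for the distributed cache so that the example runs on its own.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; none of these names come from NiFi.
final class DistributedHashSetSketch {

    // Stand-in for the distributed cache (assumption): one entry per record key.
    private final ConcurrentMap<String, Boolean> cache = new ConcurrentHashMap<>();

    // CacheKey = CacheEntryIdentifier + Hash(value1 + "~" + value2 + "~" + ...)
    static String cacheKey(String cacheEntryIdentifier, List<String> recordPathValues) {
        String joined = String.join("~", recordPathValues);
        try {
            // SHA-256 is an assumed hash; the issue does not name one.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            // The cache entry identifier acts as a prefix of the key.
            return cacheEntryIdentifier + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Returns true if a record with the same values was already seen.
    // The cache is queried directly; no filter is loaded into memory.
    boolean isDuplicate(String cacheEntryIdentifier, List<String> recordPathValues) {
        String key = cacheKey(cacheEntryIdentifier, recordPathValues);
        // putIfAbsent returns the previous value, or null if the key was new.
        return cache.putIfAbsent(key, Boolean.TRUE) != null;
    }
}
```

With this sketch, the first call to isDuplicate for a given set of values returns false and stores the key, and any later call with the same values returns true. One caveat of joining with "~": values that themselves contain "~" can produce the same joined string for different records, so a real implementation may want to escape or length-prefix the values.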