Posted to issues@nifi.apache.org by "Adam Fisher (JIRA)" <ji...@apache.org> on 2019/02/17 00:00:01 UTC

[jira] [Commented] (NIFI-6047) Add DetectDuplicateRecord Processor

    [ https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770255#comment-16770255 ] 

Adam Fisher commented on NIFI-6047:
-----------------------------------

I would open a PR for this, but it doesn't look like I have assign privileges in JIRA :(

If someone else wants to take a look at this, add tests, and bring it across the finish line, it would really benefit a lot of useful user scenarios.

> Add DetectDuplicateRecord Processor
> -----------------------------------
>
>                 Key: NIFI-6047
>                 URL: https://issues.apache.org/jira/browse/NIFI-6047
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Adam Fisher
>            Priority: Major
>              Labels: features
>         Attachments: DetectDuplicateRecord.java
>
>
> Add a new standard NiFi processor to supplement the DetectDuplicate processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines whether each record has already been seen. The names of the user-defined properties determine the RecordPath values used to decide whether a record is unique. If no user-defined properties are present, the entire record is used as the input to determine uniqueness. All duplicate records are routed to 'duplicate'. If the record is not determined to be a duplicate, the Processor routes the record to 'non-duplicate'.*_
> This processor makes two different filtering data structures available depending on the level of precision and the number of records the user wishes to process:
>  * A *HashSet* filter type will guarantee 100% duplicate detection at the expense of storing one hash per record.
>  * A *BloomFilter* filter type will use constant space by offering only probabilistic guarantees. This is useful when processing an extremely large number of records and some false positives are acceptable (i.e. some records may be marked as duplicates even though they have not been seen before).
> h4. *{color:#654982}I have started an initial implementation of the idea in the attached DetectDuplicateRecord.java file.{color}* 
> At this point, it should just need unit testing and sanity checks, as I have not yet tested it. If I have time to come back to this later, I will try to integrate it into a pull request unless someone else gets to it first. Comments are welcome.
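
As a rough illustration of the two filter types described in the quoted issue above, here is a minimal, self-contained Java sketch. It is not the attached DetectDuplicateRecord.java and does not use the NiFi record API; it assumes the uniqueness key for each record has already been derived from the configured RecordPath values (or from the whole serialized record when no user-defined properties are set). The class name, method names, sizing parameters, and example key are hypothetical, and the BloomFilter comes from Google Guava, which would need to be on the classpath.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;

    /**
     * Hypothetical sketch of the two duplicate-detection strategies described
     * in NIFI-6047: an exact HashSet filter and a probabilistic BloomFilter.
     * The real processor would derive the key from the configured RecordPath
     * values (or from the whole record when no user-defined properties exist)
     * and route each record to 'duplicate' or 'non-duplicate'.
     */
    public class RecordDuplicateCheckSketch {

        /** Exact strategy: one stored entry per record, no false positives. */
        private final Set<String> seenKeys = new HashSet<>();

        /** Probabilistic strategy: roughly constant space, tunable false-positive rate. */
        private final BloomFilter<String> bloom =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                        1_000_000,   // expected number of records (assumed for the sketch)
                        0.001);      // acceptable false-positive probability (assumed)

        /** Returns true if the key was already seen, using the exact HashSet filter. */
        public boolean isDuplicateExact(String recordKey) {
            // Set.add() returns false when the key is already present
            return !seenKeys.add(recordKey);
        }

        /** Returns true if the key was *probably* already seen, using the BloomFilter. */
        public boolean isDuplicateProbabilistic(String recordKey) {
            if (bloom.mightContain(recordKey)) {
                return true;          // may occasionally be a false positive
            }
            bloom.put(recordKey);     // first time this key is observed
            return false;
        }

        public static void main(String[] args) {
            RecordDuplicateCheckSketch sketch = new RecordDuplicateCheckSketch();
            String key = "customerId=42|email=a@example.com";  // e.g. joined RecordPath values
            System.out.println(sketch.isDuplicateExact(key));          // false (first occurrence)
            System.out.println(sketch.isDuplicateExact(key));          // true  (duplicate)
            System.out.println(sketch.isDuplicateProbabilistic(key));  // false (first occurrence)
            System.out.println(sketch.isDuplicateProbabilistic(key));  // true  (duplicate)
        }
    }

The trade-off is the one called out in the issue: the HashSet guarantees exact detection but stores one entry per record, while the BloomFilter keeps memory roughly constant at the cost of occasionally routing a genuinely new record to 'duplicate'.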



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)