Posted to issues@nifi.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/03/10 00:09:00 UTC

[jira] [Commented] (NIFI-6047) Add DetectDuplicateRecord Processor

    [ https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503913#comment-17503913 ] 

ASF subversion and git services commented on NIFI-6047:
-------------------------------------------------------

Commit df00cc6cb576c11ae3ef0f1c6f64454598298936 in nifi's branch refs/heads/main from Mike Thomsen
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=df00cc6 ]

NIFI-6047 Cleaned up code to allow tests to run against 1.13.0-snapshot
Removed DMC.
NIFI-6047 Started integrating changes from NIFI-6014.
NIFI-6047 Added DMC tests.
NIFI-6047 Added cache identifier recordpath test.
NIFI-6047 Added additional details.
NIFI-6047 Removed old additional details.
NIFI-6047 made some changes requested in a follow up review.
NIFI-6047 latest.
NIFI-6047 Finished updates
First round of code review cleanup
Latest
Removed EL from the dynamic properties.
Finished code review requested refactoring.
Checkstyle fix.
Removed a Java 11 API
NIFI-6047 Renamed processor to DeduplicateRecord

Signed-off-by: Matthew Burgess <ma...@apache.org>

This closes #4646


> Add DetectDuplicateRecord Processor
> -----------------------------------
>
>                 Key: NIFI-6047
>                 URL: https://issues.apache.org/jira/browse/NIFI-6047
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Adam Fisher
>            Assignee: Adam Fisher
>            Priority: Major
>              Labels: features
>          Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> Add a new standard NiFi processor to supplement the DetectDuplicate processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines whether each record has already been seen. The names of the user-defined properties determine the RecordPath values used to decide whether a record is unique. If no user-defined properties are present, the entire record is used as the input to determine uniqueness. All duplicate records are routed to 'duplicate'. If a record is not determined to be a duplicate, the Processor routes it to 'non-duplicate'.*_
> This processor makes two filtering data structures available, depending on the level of precision and the number of records the user wishes to process:
>  * A *HashSet* filter type guarantees 100% duplicate detection at the expense of storing one hash per record.
>  * A *BloomFilter* filter type uses constant space in exchange for probabilistic rather than exact detection. This is useful when processing an extremely large number of records and some false positives are acceptable (i.e., some records may be marked as duplicate even though they have not been seen before).
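
The trade-off between the two filter types described above can be illustrated with a minimal, self-contained Java sketch. This is not the processor's code: the class `DedupSketch`, the `SeenFilter` interface, the choice of SHA-256, and the bit-array size and probe count are all assumptions made for illustration. The plain string keys stand in for whatever value the RecordPath properties (or the whole record) would yield.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the NiFi processor: every name here is illustrative.
class DedupSketch {

    /** Returns true if the key was (possibly) seen before, then remembers it. */
    interface SeenFilter {
        boolean seenBefore(String key);
    }

    /** Exact filter: keeps one SHA-256 digest (as hex) per distinct key. */
    static final class HashSetFilter implements SeenFilter {
        private final Set<String> digests = new HashSet<>();

        @Override
        public boolean seenBefore(String key) {
            // Set.add returns false when the digest was already present.
            return !digests.add(toHex(sha256(key)));
        }
    }

    /** Bloom filter: fixed-size bit array; false positives possible, false negatives not. */
    static final class BloomFilter implements SeenFilter {
        private final BitSet bits;
        private final int size;
        private final int probes;

        BloomFilter(int size, int probes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.probes = probes;
        }

        @Override
        public boolean seenBefore(String key) {
            byte[] digest = sha256(key);
            int h1 = toInt(digest, 0);
            int h2 = toInt(digest, 4);
            boolean allSet = true;
            for (int i = 0; i < probes; i++) {
                // Double hashing: derive the i-th probe position from two base hashes.
                int idx = Math.floorMod(h1 + i * h2, size);
                if (!bits.get(idx)) {
                    allSet = false;
                    bits.set(idx);
                }
            }
            return allSet;
        }
    }

    /** Keeps the first occurrence of each key; later occurrences count as duplicates. */
    static List<String> nonDuplicates(List<String> keys, SeenFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String key : keys) {
            if (!filter.seenBefore(key)) {
                kept.add(key);
            }
        }
        return kept;
    }

    private static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    private static int toInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(nonDuplicates(keys, new HashSetFilter()));        // exact result
        System.out.println(nonDuplicates(keys, new BloomFilter(1024, 3)));   // may drop unseen keys
    }
}
```

In this sketch the exact filter's memory grows with the number of distinct keys, while the Bloom filter's memory is fixed when it is constructed. The price is that the Bloom filter can flag a key it has never seen, so its output is always a subset of the exact filter's output.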


