Posted to dev@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2020/11/01 22:19:40 UTC

How to proceed on two PRs for deduplicating records

A first-time contributor named Adam Fisher and I submitted PRs for a
"deduplicate record" processor at roughly the same time. His focused
mainly on removing duplicates from within a record set, using the
record set itself as the source of truth, whereas mine relied on a
DistributedMapCache and record path operations to focus on data
lake-wide deduplication.
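
To make the difference between the two use cases concrete, here is a
rough sketch in plain Java. It is not code from either PR: the class
and method names are hypothetical, records are reduced to strings, and
an ordinary Set stands in for the DistributedMapCache, whose real
client API is not shown here. The key function stands in for whatever
record path expression would select the deduplication key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; not taken from either PR.
class DedupSketch {

    // Use case 1: the record set is its own source of truth.
    // The "seen" set lives only for the duration of one record set.
    static <R> List<R> dedupWithinSet(List<R> records, Function<R, String> keyOf) {
        Set<String> seen = new HashSet<>();
        List<R> unique = new ArrayList<>();
        for (R record : records) {
            // add() returns false when the key was already present
            if (seen.add(keyOf.apply(record))) {
                unique.add(record);
            }
        }
        return unique;
    }

    // Use case 2: keys are checked against a cache that outlives any
    // single record set. A plain Set is an assumed stand-in for a
    // shared, distributed cache.
    static <R> List<R> dedupAgainstCache(List<R> records,
                                         Function<R, String> keyOf,
                                         Set<String> sharedCache) {
        List<R> unique = new ArrayList<>();
        for (R record : records) {
            if (sharedCache.add(keyOf.apply(record))) {
                unique.add(record);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> first = List.of("a", "b", "a");
        List<String> second = List.of("b", "c");

        // Within-set: "b" survives in both sets, since nothing is shared.
        System.out.println(dedupWithinSet(first, r -> r));
        System.out.println(dedupWithinSet(second, r -> r));

        // Cache-backed: "b" in the second set is dropped as already seen.
        Set<String> cache = new HashSet<>();
        System.out.println(dedupAgainstCache(first, r -> r, cache));
        System.out.println(dedupAgainstCache(second, r -> r, cache));
    }
}
```

The loop is the same in both methods; what differs is how long the set
of seen keys lives, which is the distinction between the two PRs.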

Here's his PR for reference: https://github.com/apache/nifi/pull/3317

The Git history on his PR is fairly broken at this point (I tried a
rebase and found some really bad merge commits), but I was able to
squash it and cherry-pick it onto main.

I think they're two separate use cases and should probably be two
separate processors in order to keep things simple.

Before I put much effort into pushing both PRs along, I'd like to know
if anyone else has any preferences/ideas on this.

Thanks,

Mike