Posted to users@nifi.apache.org by "Beutel, Maximilian " <ma...@credit-suisse.com> on 2019/02/20 06:08:02 UTC

Avoid duplicate rows when inserting into table

Hello!

Since I'm new to NiFi I'm still trying to wrap my head around certain best practices.

My use case is as follows: I need to ingest a list of projects via an HTTP API. The API returns a list in a format like:

[
  { "id": "1", "name": "Project A" },
  { "id": "2", "name": "Project B" }
]

Now, the problem with this API is that it always returns the _full_ list of projects, never a delta. So if more projects have been added since the last run, it will still include project IDs 1 and 2 as well, like so:

[
  { "id": "1", "name": "Project A" },
  { "id": "2", "name": "Project B" },
  { "id": "3", "name": "Project C" }
]


I need to insert this list into a database table, and I thought I'd use NiFi. My initial flow looked like:

InvokeHTTP (fetch JSON HTTP response)
into JoltTransformJSON (make the returned response a bit nicer)
into ConvertJSONToSQL (insert the whole lot into db)

The issue I have is that the insert statement generated by ConvertJSONToSQL will fail, since the database table has a unique key on the project name field and the payload I want to insert always includes all projects, including some that are already there.

My question is: how do people usually handle such a use case in NiFi? I can think of filtering the API response against the list of already existing project names (though I'm not sure how), but I would rather do an insert-ignore or something that just ignores duplicate-record errors - which is not supported by ConvertJSONToSQL as far as I'm aware.
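To illustrate what I mean by "insert-ignore", here is a minimal sketch in Python with SQLite (the table and column names are made up for illustration, and the exact syntax differs per database - e.g. MySQL uses INSERT IGNORE, PostgreSQL uses ON CONFLICT DO NOTHING):

```python
import sqlite3

# In-memory table with a UNIQUE constraint on the project name,
# mirroring the table described above (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id TEXT, name TEXT UNIQUE)")

payload = [
    {"id": "1", "name": "Project A"},
    {"id": "2", "name": "Project B"},
]

# First run: both rows are inserted.
conn.executemany(
    "INSERT OR IGNORE INTO projects (id, name) VALUES (:id, :name)", payload
)

# Second run: the API returns the full list again, plus a new project.
payload.append({"id": "3", "name": "Project C"})
conn.executemany(
    "INSERT OR IGNORE INTO projects (id, name) VALUES (:id, :name)", payload
)

# Duplicates are silently skipped; only three rows remain.
rows = conn.execute("SELECT id, name FROM projects ORDER BY id").fetchall()
print(rows)
```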

Maybe I'm also approaching this problem from the wrong side, so I would be grateful to receive feedback/recommendations.

Thanks!

Max

=============================================================================== 
Please access the attached hyperlink for an important electronic communications disclaimer: 
http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html 
=============================================================================== 

Re: Avoid duplicate rows when inserting into table

Posted by Mike Thomsen <mi...@gmail.com>.
I think the SQL processors other than PutDatabaseRecord also support
"upsert" functionality, so that might also help.


RE: Avoid duplicate rows when inserting into table

Posted by "Beutel, Maximilian " <ma...@credit-suisse.com>.
Thanks for the suggestion - for the record, this is exactly what I did and it worked well.

At first I had a setup where I truncated the table before inserting, but that leaves the table completely empty for a (short) while, which is not desirable in all cases; the flow is also clearer when we rely on the UNIQUE key constraint.

Max


Re: Avoid duplicate rows when inserting into table

Posted by Mike Thomsen <mi...@gmail.com>.
The easiest way to do this would be to create a UNIQUE constraint on the
project name and just send one insert at a time. Then each individual
failed insert will get routed to failure.

For the sake of safety here, if you have multiple flows that feed into a
common SQL ingest point, you might want to put this functionality on a
separate flow so that you can assume "failure = duplicate according to the
UNIQUE constraint."


Re: Avoid duplicate rows when inserting into table

Posted by Adam Fisher <fi...@gmail.com>.
Maybe you could use something like SplitRecord, DetectDuplicate, and
MergeRecord to get the file how you want it. This would split it into
smaller FlowFiles, check whether each record has been seen before, keep
only the unique ones, and then merge them back into one file.
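The split/detect/merge idea boils down to filtering records by a key that has been seen before. A minimal sketch of that logic (within a single batch; in NiFi, DetectDuplicate backs the "seen" check with a distributed cache so duplicates are remembered across files):

```python
def dedupe_by_key(records, key):
    """Keep only the first record for each key value - roughly what the
    split -> detect-duplicate -> merge pattern achieves in a flow."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

records = [
    {"id": "1", "name": "Project A"},
    {"id": "2", "name": "Project B"},
    {"id": "1", "name": "Project A"},
]
result = dedupe_by_key(records, "name")
print(result)
```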

I'm actually collaborating with someone on a DetectDuplicateRecord
processor so eventually you won't need to break your input into separate
files. This is expected to come out in a later release of NiFi.


Adam
