Posted to users@nifi.apache.org by Vibhath Ileperuma <vi...@gmail.com> on 2021/03/17 11:04:20 UTC

Data duplication When NIFI is restarted

Hi all,

I notice that if the NiFi instance is terminated while a processor is
processing a flow file, that processor starts processing the flow file
again from the beginning when NiFi is restarted.
I'm using the PutKudu and PutParquet processors to write data into Kudu
and Parquet format. Due to this behaviour:

   1. PutKudu reports primary key violation errors after a restart. I'm
   using the INSERT operation and can't use INSERT_IGNORE or UPSERT, since
   I need to be notified if the incoming data contains duplicates.
   2. Since I need to write the data in a single flow file into multiple
   Parquet files (by specifying the row group size), the PutParquet
   processor can generate multiple Parquet files with the same content
   after a restart (data can be duplicated).

I would be grateful if you could suggest a way to overcome this problem.

Thanks & Regards

*Vibhath Ileperuma*

Re: Data duplication When NIFI is restarted

Posted by Bryan Bende <bb...@gmail.com>.
If a processor uses the session to take a flow file from an incoming
queue, and NiFi crashes before session.commit is called, that flow file
will be back in the original queue when NiFi starts again, since the
session never updated the repositories.

So it is possible for a destination processor to obtain a flow file and
start sending data to the destination system, and then for NiFi to
crash, which means the session was never committed and the flow file
will be back in the incoming queue.

The ideal way to solve this is for the destination system to offer some
type of transaction, such that no data is actually made available in the
destination system until that transaction is committed; the NiFi session
is then committed immediately afterwards, which makes it very unlikely
for NiFi to crash between those exact two lines of code.
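As a rough illustration of that ordering, here is a minimal,
self-contained simulation. The classes below are plain-Python stand-ins,
not the real NiFi or Kudu APIs; the point is only the commit order:
destination transaction first, session immediately after.

```python
from collections import deque


class Session:
    """Stand-in for flow file queue + session semantics: a flow file
    leaves the queue permanently only when the session is committed."""

    def __init__(self, queue):
        self.queue = queue
        self.in_flight = None

    def get(self):
        self.in_flight = self.queue.popleft() if self.queue else None
        return self.in_flight

    def commit(self):
        self.in_flight = None          # repositories updated; flow file is gone

    def rollback(self):                # crash/restart: flow file returns to the queue
        if self.in_flight is not None:
            self.queue.appendleft(self.in_flight)
        self.in_flight = None


class TxDestination:
    """Stand-in for a transactional destination: rows become visible
    only when the transaction is committed."""

    def __init__(self):
        self.visible, self.pending = [], []

    def write(self, row):
        self.pending.append(row)

    def commit(self):
        self.visible += self.pending
        self.pending = []

    def abort(self):
        self.pending = []


def on_trigger(session, dest, crash_before_dest_commit=False):
    flow_file = session.get()
    if flow_file is None:
        return
    dest.write(flow_file)
    if crash_before_dest_commit:       # simulated crash: nothing committed anywhere
        dest.abort()
        session.rollback()
        return
    dest.commit()                      # 1) data becomes visible atomically
    session.commit()                   # 2) immediately after, drop the flow file


queue = deque(["row-1"])
dest = TxDestination()
session = Session(queue)

# Crash before the destination commit: nothing visible, flow file requeued.
on_trigger(session, dest, crash_before_dest_commit=True)
print(len(dest.visible), len(queue))   # 0 1

# Retry after the "restart" succeeds exactly once: no duplicates.
on_trigger(session, dest)
print(len(dest.visible), len(queue))   # 1 0
```

A real processor would work against NiFi's ProcessSession and the
destination client's own transaction API, but the safe ordering is the
same two adjacent lines at the end of on_trigger.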

PutParquet uses the HDFS client, which doesn't have a transaction
concept spanning multiple files, and Kudu has some transaction support,
but only in limited scenarios.


Re: Data duplication When NIFI is restarted

Posted by Jo...@swisscom.com.
I'm just jumping in; we are seeing this issue as well when we restart the NiFi process from time to time.

We are aware of the nifi.properties parameter "nifi.flowcontroller.graceful.shutdown.period=10 sec", but to be honest we haven't tried raising it yet. Maybe it takes more than 10 s to fully execute PutKudu, I really don't know.
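For reference, the property lives in conf/nifi.properties; raising it
gives components more time to finish their current session before
shutdown is forced (the value shown is the default from the thread):

```
# Maximum time the flow controller waits for components to stop during a
# graceful shutdown before NiFi forces termination.
nifi.flowcontroller.graceful.shutdown.period=10 sec
```

Note that a hard kill (kill -9, the OOM killer) bypasses graceful
shutdown entirely, so this setting only helps for orderly restarts.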

Cheers Josef





Re: Data duplication When NIFI is restarted

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi Pierre,

The NiFi flow I'm implementing can run continuously for a long time
(maybe a couple of weeks or months). During this period it can be
terminated by a memory issue or some other system issue, can't it? In
such a case, I may need to restart NiFi manually and run the flow from
where it stopped.

Thanks & Regards

*Vibhath Ileperuma*


Re: Data duplication When NIFI is restarted

Posted by Pierre Villard <pi...@gmail.com>.
Hi Vibhath,

How is NiFi terminated/restarted?

Thanks,
Pierre
