Posted to users@nifi.apache.org by dan young <da...@gmail.com> on 2017/10/25 12:01:38 UTC

ListGCSBucket and duplicates.

Hello everyone,

We're using the ListGCSBucket processor in a clustered environment and are
seeing duplicates during subsequent runs.  We have the processor set to run
on a cron schedule every 15 min, and we're seeing duplicate files being
listed.  I've looked at each of the flowfile attributes, and they appear to
be the same, i.e. the same gcs.create.time, gcs.etag, gcs.generation,
etc.  Any idea why the same file is being listed twice?
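
One way to confirm that two listings really are the same object version is
to group the listed flowfiles by their identifying attributes. The sketch
below is only an illustration: gcs.generation and gcs.etag come from the
message above, while the gcs.bucket and filename attribute names, the
helper name, and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_listings(flowfiles):
    """Return groups of listed flowfiles that share the same object identity."""
    groups = defaultdict(list)
    for attrs in flowfiles:
        # Assumed key: bucket + object name + generation identify one object version.
        key = (attrs.get("gcs.bucket"), attrs.get("filename"), attrs.get("gcs.generation"))
        groups[key].append(attrs)
    # Keep only the identities that were listed more than once.
    return {key: items for key, items in groups.items() if len(items) > 1}


# Made-up sample data, for illustration only.
listings = [
    {"gcs.bucket": "example-bucket", "filename": "a.json", "gcs.generation": "100", "gcs.etag": "e1"},
    {"gcs.bucket": "example-bucket", "filename": "b.json", "gcs.generation": "200", "gcs.etag": "e2"},
    {"gcs.bucket": "example-bucket", "filename": "a.json", "gcs.generation": "100", "gcs.etag": "e1"},
]
duplicates = find_duplicate_listings(listings)
```

If a group contains more than one entry with an identical generation, the
object itself was not rewritten between listings; it was simply listed twice.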

[image: Screen Shot 2017-10-25 at 5.50.33 AM.png][image: Screen Shot
2017-10-25 at 5.51.00 AM.png]

[image: Screen Shot 2017-10-25 at 6.00.08 AM.png]


Regards,

Dano

Re: ListGCSBucket and duplicates.

Posted by dan young <da...@gmail.com>.
It is set to only run on the primary node.

On Fri, Oct 27, 2017, 1:30 PM Andrew Grande <ap...@gmail.com> wrote:

> Can you check the runtime scheduling strategy for the List processor? It
> must be primary node only, otherwise every node lists the same bucket and
> generates duplicates.
>
> Andrew

Re: ListGCSBucket and duplicates.

Posted by Andrew Grande <ap...@gmail.com>.
Can you check the runtime scheduling strategy for the List processor? It
must be primary node only, otherwise every node lists the same bucket and
generates duplicates.
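
The reasoning above can be illustrated with a toy model. This is not NiFi
code; the node dictionaries, the run_listing helper, and the object names
are hypothetical and only show why running a List processor on every node
multiplies the output.

```python
def run_listing(nodes, objects, primary_only):
    """Toy model: each node that runs the listing emits one flowfile per object."""
    listing_nodes = [n for n in nodes if n["primary"]] if primary_only else nodes
    return [(node["name"], obj) for node in listing_nodes for obj in objects]


# A hypothetical 3-node cluster with one primary node.
nodes = [
    {"name": "node1", "primary": True},
    {"name": "node2", "primary": False},
    {"name": "node3", "primary": False},
]
objects = ["a.json", "b.json"]

all_nodes = run_listing(nodes, objects, primary_only=False)  # every object listed 3 times
primary = run_listing(nodes, objects, primary_only=True)     # every object listed once
```

In this thread the processor was reported as already set to run on the
primary node only, so this model explains a common cause of duplicates
rather than the one observed here.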

Andrew

On Fri, Oct 27, 2017, 8:09 AM dan young <da...@gmail.com> wrote:

> Hey James,
>
> I changed the Age Off to 1 day yesterday, and I just checked, and we
> don't have any duplicates in BigQuery. So that looks promising, although I
> will continue to monitor. As far as the gaps, they are all over the place.
> We see anywhere from 15 min to 2 hrs.  The 2-hour gap was surprising and led
> me to extend the age off to a day.  Most of the ones I checked this morning
> have a 15 min gap.
>
> Regards,
>
> Dano

Re: ListGCSBucket and duplicates.

Posted by dan young <da...@gmail.com>.
Hey James,

I changed the Age Off to 1 day yesterday, and I just checked, and we
don't have any duplicates in BigQuery. So that looks promising, although I
will continue to monitor. As far as the gaps, they are all over the place.
We see anywhere from 15 min to 2 hrs.  The 2-hour gap was surprising and led
me to extend the age off to a day.  Most of the ones I checked this morning
have a 15 min gap.

Regards,

Dano



On Thu, Oct 26, 2017 at 10:59 PM James Wing <jv...@gmail.com> wrote:

> I was able to reproduce this issue.  I ran ListGCSBucket at 30-second
> intervals, and the duplicates I caught appeared within 30 seconds or 1
> minute of the first. That seems good in the sense that DetectDuplicate is
> probably effective in weeding duplicates out.
>
> You mentioned changing the Age Off Duration to 1 day. What has been your
> experience with the time gap between the first file and the duplicate?
>
> https://issues.apache.org/jira/browse/NIFI-4533

Re: ListGCSBucket and duplicates.

Posted by James Wing <jv...@gmail.com>.
I was able to reproduce this issue.  I ran ListGCSBucket at 30-second
intervals, and the duplicates I caught appeared within 30 seconds or 1
minute of the first. That seems good in the sense that DetectDuplicate is
probably effective in weeding duplicates out.

You mentioned changing the Age Off Duration to 1 day. What has been your
experience with the time gap between the first file and the duplicate?

https://issues.apache.org/jira/browse/NIFI-4533


On Thu, Oct 26, 2017 at 10:49 AM, dan young <da...@gmail.com> wrote:

> Hey James,
>
> It's happening more frequently than I would expect. I did adjust the
> ListGCSBucket to run every 5 min and added a DetectDuplicate processor
> that's been pretty good at filtering these out, although I've had to adjust
> the Age Off Duration a few times.  I currently have it set to a day, so
> we'll see how it goes today.
>
> As far as a correlation, I have not been able to nail down the timing
> yet. It is a bursty write to GCS though; we're writing all our dialect
> testing data out to GCS via Jenkins.  So depending on how fast a given
> dialect test runs, and how many we have running at a time, it can vary.
> NiFi picks up from there and runs it through the meat grinder on its way
> into Google BigQuery. When I check the GCS metadata on the file, nothing
> leads me to believe that the files are being written twice, or are
> modified in any way that would change the timestamps.
>
> As an example, when looking at
> results_2017-10-26_athena_1beb7976-dc23-42c4-876e-5ddbe8cb91e1_1.0.json in
> GCS, we can see the creation/updated times are the same, and when I look at
> the data provenance for this file, we see a flowfile being created at
> 02:05:03 and then again at 04:50:03.
>
> When I look at the timings of the Jenkins job start/complete, etc., and
> the file time in GCS, they line up.
>
> NiFi 1.3, 3-node cluster.
>
> See the attached screenshots.
>
> [image: Screen Shot 2017-10-26 at 11.33.28 AM.png]

Re: ListGCSBucket and duplicates.

Posted by dan young <da...@gmail.com>.
Hey James,

It's happening more frequently than I would expect. I did adjust the
ListGCSBucket to run every 5 min and added a DetectDuplicate processor
that's been pretty good at filtering these out, although I've had to adjust
the Age Off Duration a few times.  I currently have it set to a day, so
we'll see how it goes today.
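
To illustrate why the Age Off Duration matters, here is a minimal sketch of
duplicate detection with an age-off window. It is not how DetectDuplicate is
implemented (that processor relies on a distributed map cache service); the
class name, the in-memory dictionary, and the key format are assumptions
made for the example.

```python
class AgeOffDeduplicator:
    """Remember keys for a limited time and flag repeats seen within that window."""

    def __init__(self, age_off_seconds):
        self.age_off_seconds = age_off_seconds
        self._seen = {}  # key -> time (in seconds) the key was last recorded

    def is_duplicate(self, key, now):
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < self.age_off_seconds:
            return True
        # Unknown or aged-off key: record it and treat it as new.
        self._seen[key] = now
        return False


# Hypothetical key built from the object name and its generation.
key = "example.json/100"

one_hour = AgeOffDeduplicator(age_off_seconds=60 * 60)
one_hour.is_duplicate(key, now=0)
missed = one_hour.is_duplicate(key, now=2 * 60 * 60)  # entry already aged off

one_day = AgeOffDeduplicator(age_off_seconds=24 * 60 * 60)
one_day.is_duplicate(key, now=0)
caught = one_day.is_duplicate(key, now=2 * 60 * 60)  # still inside the window
```

A duplicate that shows up two hours after the first listing slips past a
one-hour window but is caught by a one-day window, which is the motivation
for lengthening the age-off.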

As far as a correlation, I have not been able to nail down the timing
yet. It is a bursty write to GCS though; we're writing all our dialect
testing data out to GCS via Jenkins.  So depending on how fast a given
dialect test runs, and how many we have running at a time, it can vary.
NiFi picks up from there and runs it through the meat grinder on its way
into Google BigQuery. When I check the GCS metadata on the file, nothing
leads me to believe that the files are being written twice, or are
modified in any way that would change the timestamps.

As an example, when looking at
results_2017-10-26_athena_1beb7976-dc23-42c4-876e-5ddbe8cb91e1_1.0.json in
GCS, we can see the creation/updated times are the same, and when I look at
the data provenance for this file, we see a flowfile being created at
02:05:03 and then again at 04:50:03.
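
For reference, the gap between those two provenance timestamps can be worked
out as below. The sketch assumes both events fall on the same day, which the
message does not state explicitly.

```python
from datetime import datetime

# Provenance timestamps quoted above; same-day occurrence is an assumption.
first_listing = datetime.strptime("02:05:03", "%H:%M:%S")
second_listing = datetime.strptime("04:50:03", "%H:%M:%S")

gap = second_listing - first_listing
gap_seconds = gap.total_seconds()

print(gap)          # 2:45:00
print(gap_seconds)  # 9900.0
```

Any age-off window shorter than this gap would let the second listing
through as if it were new.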

When I look at the timings of the Jenkins job start/complete, etc., and the
file time in GCS, they line up.

NiFi 1.3, 3-node cluster.

See the attached screenshots.

[image: Screen Shot 2017-10-26 at 11.33.28 AM.png]




On Thu, Oct 26, 2017 at 10:07 AM James Wing <jv...@gmail.com> wrote:

> Dano,
>
> Do you have a feel for how often it is happening, and if it correlates
> with either heavy or light activity in GCS?  I do not know that there is a
> problem, but it certainly seems possible that you are not imagining things.
>
> Thanks,
>
> James

Re: ListGCSBucket and duplicates.

Posted by James Wing <jv...@gmail.com>.
Dano,

Do you have a feel for how often it is happening, and if it correlates with
either heavy or light activity in GCS?  I do not know that there is a
problem, but it certainly seems possible that you are not imagining things.

Thanks,

James

On Wed, Oct 25, 2017 at 5:01 AM, dan young <da...@gmail.com> wrote:

> Hello everyone,
>
> We're using the ListGCSBucket processor in a clustered environment and are
> seeing duplicates during subsequent runs.  We have the processor set to run
> on a cron schedule every 15 min, and we're seeing duplicate files being
> listed.  I've looked at each of the flowfile attributes, and they appear to
> be the same, i.e. the same gcs.create.time, gcs.etag, gcs.generation,
> etc.  Any idea why the same file is being listed twice?
>
> [image: Screen Shot 2017-10-25 at 5.50.33 AM.png][image: Screen Shot
> 2017-10-25 at 5.51.00 AM.png]
>
> [image: Screen Shot 2017-10-25 at 6.00.08 AM.png]
>
>
> Regards,
>
> Dano
>