Posted to users@nifi.apache.org by "Peter Wicks (pwicks)" <pw...@micron.com> on 2016/09/20 14:22:42 UTC

Requesting Obscene FlowFile Batch Sizes

I'm using ConvertJSONToSQL, followed by PutSQL.  I'm using Teradata, which supports a special JDBC mode called FastLoad, designed for a minimum of 100,000 rows of data per batch.

What I'm finding is that when PutSQL requests a new batch of FlowFiles from the queue, which has over 1 million rows in it, with a batch size of 1000000, it never returns more than 10k.  How can I get my obscenely sized batch request to return all the FlowFiles I'm asking for?

Thanks,
  Peter

Re: Requesting Obscene FlowFile Batch Sizes

Posted by Andy LoPresto <al...@gmail.com>.

Hi Peter,

Thanks for letting us know you found a solution and for the additional context. Provenance performance is a key area of focus in the next couple of releases, so hopefully we will have that fixed soon.

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


RE: Requesting Obscene FlowFile Batch Sizes

Posted by "Peter Wicks (pwicks)" <pw...@micron.com>.

Andy/Bryan,

Thanks for all of the detail, it’s been helpful.
I actually did an experiment this morning where I modified the processor to keep calling `get` until it had all 1 million FlowFiles.  Because the calls were sequential, NiFi was able to move files out of swap and into the active queue between requests. I was able to retrieve them and process them through, which was great until… NiFi tried to move them through provenance.  At that point NiFi ran out of memory and fell over (stopped responding).  Right before that, I received several bulletins saying that Provenance was being written to too quickly and was being slowed down.
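
For readers who want to picture the experiment, here is a minimal, self-contained sketch of the "keep calling `get`" loop. It does not use the NiFi API: `SwappingQueue` is a hypothetical stand-in for a connection queue that only hands out what is currently in memory, and swap-in is assumed to happen synchronously after each call, which is a simplification of the background-thread behaviour mentioned elsewhere in this thread. The 10,000 cap and the 1,000,000 total are taken from the numbers in the original question.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class DrainUntilFull {

    /** Hypothetical stand-in for a NiFi connection queue; NOT the real NiFi API. */
    static class SwappingQueue {
        private final Deque<Integer> active = new ArrayDeque<>();
        private final int swapThreshold;
        private int swapped;   // items parked outside the in-memory queue
        private int nextId;

        SwappingQueue(int total, int swapThreshold) {
            this.swapThreshold = swapThreshold;
            this.swapped = total;
            swapIn();
        }

        /** Refill the in-memory part, never holding more than swapThreshold items. */
        private void swapIn() {
            while (active.size() < swapThreshold && swapped > 0) {
                active.add(nextId++);
                swapped--;
            }
        }

        /** Batch poll: returns at most what is currently held in memory. */
        List<Integer> get(int maxResults) {
            List<Integer> batch = new ArrayList<>();
            while (batch.size() < maxResults && !active.isEmpty()) {
                batch.add(active.poll());
            }
            swapIn(); // assumption: swapped items become active before the next call
            return batch;
        }
    }

    /** Keep polling until 'wanted' items are collected or the queue runs dry. */
    static List<Integer> drain(SwappingQueue queue, int wanted) {
        List<Integer> all = new ArrayList<>(wanted);
        while (all.size() < wanted) {
            List<Integer> batch = queue.get(wanted - all.size());
            if (batch.isEmpty()) {
                break; // nothing left to fetch
            }
            all.addAll(batch);
        }
        return all;
    }

    public static void main(String[] args) {
        int single = new SwappingQueue(1_000_000, 10_000).get(1_000_000).size();
        int looped = drain(new SwappingQueue(1_000_000, 10_000), 1_000_000).size();
        System.out.println("single get: " + single + ", looped get: " + looped);
    }
}
```

In this toy model a single call returns 10,000 items while the loop collects all 1,000,000. Note that the loop ends up holding every item in memory at once, which is the same pressure that caused trouble once provenance got involved.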

I found another solution to my mass insert and got it up and running. Using a Teradata JDBC proprietary flag called FastLoadCSV, and a new custom processor, I was able to pass in a CSV file to my JDBC driver and get the same result.  In this scenario there was just a single FlowFile and everything went smoothly.

Thanks again!

Peter Wicks




Re: Requesting Obscene FlowFile Batch Sizes

Posted by Bryan Bende <bb...@gmail.com>.

Andy,

That was my thinking. An easy test might be to bump the threshold up to
100k (increase heap if needed) and see if it starts grabbing 100k every
time.

If it does, then I would think it is swapping related, and then we need to
figure out if you really want to get all 1 million in a single batch, and if
there's enough heap to support that.

-Bryan


Re: Requesting Obscene FlowFile Batch Sizes

Posted by Joe Witt <jo...@gmail.com>.

It would buy time but either way it becomes a magic value people have
to know about.  This is not unlike the SplitText scenario where we
recommend doing two-phase splits.  The problem is that for the
ProcessSession we hold information about the flowfiles (not their
content) in memory and the provenance events in memory.  When we're
talking hundreds of thousands or more events in a session that adds up
really quickly.  Users should not need to know/worry about this sort of
thing.  We need to have a way to prestage these things to the
respective repositories (provenance/flowfile) so this can go back to
where it belongs as a framework concern.  Easier said than done but a
good goal for us to have.
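
To make the two-phase split comparison concrete, here is a small back-of-the-envelope sketch. It only counts how many FlowFiles a single session would have to track at its peak; it does not model memory in bytes and does not call any NiFi code. The chunk size of 10,000 is an assumed value chosen for illustration, and the model assumes each chunk is split in its own session.

```java
class TwoPhaseSplit {

    /** Largest number of FlowFiles one session creates when splitting in a single step. */
    static long onePhasePeak(long totalLines) {
        return totalLines; // every line becomes a FlowFile in the same session
    }

    /**
     * Largest number created by any single session when splitting in two steps:
     * first into chunks of chunkSize lines, then each chunk into single lines.
     */
    static long twoPhasePeak(long totalLines, long chunkSize) {
        long chunks = (totalLines + chunkSize - 1) / chunkSize; // ceiling division
        long largestChunk = Math.min(totalLines, chunkSize);
        return Math.max(chunks, largestChunk);
    }

    public static void main(String[] args) {
        long total = 1_000_000L;
        long chunkSize = 10_000L; // assumed value, for illustration only
        System.out.println("one phase peak: " + onePhasePeak(total));
        System.out.println("two phase peak: " + twoPhasePeak(total, chunkSize));
    }
}
```

With these numbers the single-step split tracks 1,000,000 FlowFiles in one session, while the two-step split never tracks more than 10,000 in any one session. That is the "magic value" trade-off described above: it works, but only if the user knows to do it.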

Peter's use case is a good one to rally around as the way he wanted
it to work is reasonable and intuitive and we should try to make that
happen.

Thanks
Joe


Re: Requesting Obscene FlowFile Batch Sizes

Posted by Andy LoPresto <al...@apache.org>.

Bryan,

That’s a good point. Would running with a larger Java heap and higher swap threshold allow Peter to get larger batches out?

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


Re: Requesting Obscene FlowFile Batch Sizes

Posted by Bryan Bende <bb...@gmail.com>.

Peter,

Does 10k happen to be your swap threshold in nifi.properties by any chance
(it defaults to 20k I believe)?
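
As a quick way to check this, the sketch below reads the `nifi.queue.swap.threshold` key from a nifi.properties excerpt and compares it with the requested batch size. The key name and the 20000 default are NiFi's; the class and method names are made up for this example, and whether the threshold really explains the 10k limit is still only a suspicion at this point.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SwapThresholdCheck {

    /** Default that NiFi ships with, as recalled in the message above. */
    static final int DEFAULT_SWAP_THRESHOLD = 20_000;

    /** Reads nifi.queue.swap.threshold from properties text, falling back to the default. */
    static int swapThreshold(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("nifi.queue.swap.threshold",
                String.valueOf(DEFAULT_SWAP_THRESHOLD));
        return Integer.parseInt(raw.trim());
    }

    /** True when one request asks for more FlowFiles than the queue keeps in memory. */
    static boolean exceedsThreshold(int requestedBatchSize, int threshold) {
        return requestedBatchSize > threshold;
    }

    public static void main(String[] args) {
        // Illustrative excerpt of nifi.properties; the value shown is the default.
        String excerpt = "nifi.queue.swap.threshold=20000\n";
        int threshold = swapThreshold(excerpt);
        System.out.println("threshold=" + threshold
                + ", batch of 1000000 exceeds it: " + exceedsThreshold(1_000_000, threshold));
    }
}
```

If the requested batch size is above the threshold, raising the threshold is one experiment to try, keeping in mind that a larger in-memory queue needs more heap (the JVM -Xms/-Xmx arguments in conf/bootstrap.conf).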

I suspect the behavior you are seeing could be due to the way swapping
works, but Mark or others could probably confirm.

I found this thread where Mark explained how swapping works with a
background thread, and I believe it still works this way:
http://apache-nifi.1125220.n5.nabble.com/Nifi-amp-Spark-receiver-performance-configuration-td524.html

-Bryan
