Posted to dev@nifi.apache.org by Richard Miskin <r....@gmail.com> on 2016/04/15 17:57:04 UTC

Use of List in StandardProvenanceEventRecord.Builder

Hi,

I’m trying to track down a performance problem that I’ve spotted in a custom NiFi processor that I’ve written. When triggered by an incoming FlowFile, the processor loads many (up to about 500,000) records from a database and produces an output file in a custom format. I’m trying to leverage NiFi provenance to track what has gone into the merged file, so the processor creates an individual FlowFile for each database record, parented from the incoming FlowFile and with various attributes set. The output FlowFile is then created as a merge of all the database-record FlowFiles.

As I don’t require the individual database-record FlowFiles outside the processor, I call session.remove(Collection<FlowFile>) rather than transferring them. This works fine for small numbers of records, but the call to remove gets very slow as the number of FlowFiles increases, taking over a minute for 100,000 records.
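
For reference, the shape of the processor is roughly the following. This is a minimal sketch, not the real code: loadRecords(), writeMergedOutput(), Record and REL_SUCCESS are hypothetical placeholders, while the ProcessSession calls are the standard NiFi API.

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) {
        FlowFile request = session.get();
        if (request == null) {
            return;
        }

        final List<FlowFile> children = new ArrayList<>();
        for (final Record record : loadRecords(request)) {        // hypothetical DB load
            FlowFile child = session.create(request);             // parented from the request
            child = session.putAttribute(child, "record.id", record.getId());
            children.add(child);
        }

        FlowFile merged = writeMergedOutput(session, request, children); // hypothetical merge
        session.transfer(merged, REL_SUCCESS);

        // The children aren't needed downstream, so remove them rather than
        // transfer them; this is the call that becomes slow at high counts.
        session.remove(children);
        session.remove(request);
    }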

I need to do some further testing to be sure of the cause, but looking through the code I see that StandardProvenanceEventRecord.Builder contains a List<String> to hold the child UUIDs. The call to session.remove() eventually calls down to List.remove(), which is a linear operation on each call, so removing all the children this way is O(n^2) overall and gets progressively slower as the List grows.

Given that the entries in the List<String> are UUIDs, could this reasonably be changed to a Set<String>? Presumably there should never be duplicates, but does the order of entries matter?
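
If order does matter, a LinkedHashSet would keep insertion order while still removing in constant time. As a generic illustration of the difference (plain Java, nothing NiFi-specific):

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.UUID;

    // Demonstrates why repeated remove() on a List is quadratic overall while a
    // LinkedHashSet removes in constant time and preserves insertion order.
    public class RemovalDemo {
        public static void main(String[] args) {
            final int n = 100_000;
            final ArrayList<String> list = new ArrayList<>();
            final LinkedHashSet<String> set = new LinkedHashSet<>();
            for (int i = 0; i < n; i++) {
                final String uuid = UUID.randomUUID().toString();
                list.add(uuid);
                set.add(uuid);
            }

            long start = System.nanoTime();
            for (final String uuid : new ArrayList<>(list)) {
                list.remove(uuid);   // O(n) per call: scan plus array shift
            }
            System.out.printf("ArrayList:     %d ms%n", (System.nanoTime() - start) / 1_000_000);

            start = System.nanoTime();
            for (final String uuid : new ArrayList<>(set)) {
                set.remove(uuid);    // O(1) per call: hash lookup
            }
            System.out.printf("LinkedHashSet: %d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }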

Regards,
Richard

Re: Use of List in StandardProvenanceEventRecord.Builder

Posted by Richard Miskin <r....@gmail.com>.
Mark, Joe,


I’d spotted the ‘correlation attribute’ and I’d set that to a UUID relating to the incoming request; that much is fine. The problem I had was how to set things up so that MergeContent knows when it has got all the files. The options seemed to be batch-size or time related, neither of which seemed guaranteed to get all the FlowFiles together.

Looking again at MergeContent, I could potentially use defragment rather than bin-packing, which would allow me to set the expected number of FlowFiles to merge.

The processor that loads from the database would probably still need to read all of the data and place the many FlowFiles on the queue in one commit, to ensure that only complete files can go out.
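
For what it’s worth, a sketch of the stamping that defragment mode would need. I believe fragment.identifier, fragment.index and fragment.count are the standard attributes MergeContent’s Defragment strategy reads; batchId, i, total and child are placeholders:

    // Stamp the fragment attributes on each record FlowFile so that
    // MergeContent (Merge Strategy = Defragment) knows when the bundle is complete.
    final Map<String, String> fragmentAttrs = new HashMap<>();
    fragmentAttrs.put("fragment.identifier", batchId);          // same value for the whole batch
    fragmentAttrs.put("fragment.index", String.valueOf(i));     // position of this record
    fragmentAttrs.put("fragment.count", String.valueOf(total)); // expected number of records
    child = session.putAllAttributes(child, fragmentAttrs);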

Thanks,
Richard


> On 15 Apr 2016, at 19:22, Mark Payne <ma...@hotmail.com> wrote:


Re: Use of List in StandardProvenanceEventRecord.Builder

Posted by Mark Payne <ma...@hotmail.com>.
Just to build on what Joe said - the correlation attribute can be used to ensure that FlowFiles
are bunched together appropriately. So you could, for instance, create an attribute named "batch.id"
and set it to a UUID. So if you perform a query against your database, you can generate a new UUID
and then stamp that UUID on all FlowFiles that are generated from that query. Then, in MergeContent
you would set "Correlation Attribute" to "batch.id" and you should be good. Since this attribute will be
the same on all FlowFiles that get bundled together, it will also be retained in the new, merged FlowFile,
so you can use that same Correlation Attribute in the follow-on MergeContent processor.
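
A minimal sketch of that (generatedFlowFiles and REL_SUCCESS are placeholders; putAttribute and transfer are the standard ProcessSession calls):

    // One UUID per query, stamped on every FlowFile generated from that query.
    final String batchId = UUID.randomUUID().toString();
    for (FlowFile flowFile : generatedFlowFiles) {
        flowFile = session.putAttribute(flowFile, "batch.id", batchId);
        session.transfer(flowFile, REL_SUCCESS);
    }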

> On Apr 15, 2016, at 1:50 PM, Joe Witt <jo...@gmail.com> wrote:


Re: Use of List in StandardProvenanceEventRecord.Builder

Posted by Joe Witt <jo...@gmail.com>.
Richard,

MergeContent supports a concept called 'correlation attribute' which
will merge things together based on a matching correlation attribute
value.  That might be useful for your case.

Regarding the heap use you are observing, I'd be happy to work through that
more with you.

Thanks
Joe

On Fri, Apr 15, 2016 at 1:45 PM, Richard Miskin <r....@gmail.com> wrote:

Re: Use of List in StandardProvenanceEventRecord.Builder

Posted by Richard Miskin <r....@gmail.com>.
Hi Mark,

Thanks for the pointer, I’d not spotted that I was losing my provenance information. I’d changed my code from transferring the temporary FlowFiles to an auto-terminated relationship to calling session.remove(), and had assumed that the provenance report was the same. I’ve just tested it and you’re quite right: using session.remove() discards the provenance information.

Heap usage has been an issue, but it seems to be okay at present, admittedly with several GB of heap allocated.

I did look at combining the files using one processor to load the data and then MergeContent to combine them. But every record loaded for a specific request must be combined into a single file, and I couldn’t find a suitable way of guaranteeing that with MergeContent.

Thanks for your help,
Richard

> On 15 Apr 2016, at 17:10, Mark Payne <ma...@hotmail.com> wrote:


Re: Use of List in StandardProvenanceEventRecord.Builder

Posted by Mark Payne <ma...@hotmail.com>.
Richard,

So the order of the children may be important for some people. It certainly is reasonable to care
about the order in which the children were created.

The larger concern, though, is that if we moved to a Set such as HashSet, the amount of heap
consumed would be remarkably different. Since this collection is sometimes quite large, a Set
would be potentially problematic.
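
For a rough sense of scale (back-of-envelope, assuming a 64-bit JVM with compressed oops): an ArrayList costs about 4 bytes per child for the reference, whereas each HashSet entry carries a HashMap.Node of roughly 32 bytes plus its share of the bucket array, call it 36-40 bytes per child. For an event with 500,000 children that is on the order of 2 MB versus 18-20 MB for the collection alone, before counting the UUID strings themselves.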

That said, with the approach that you are taking, I don't think you're going to get the result that
you are looking for, because as you remove the FlowFiles, the events generated for them are
also removed. So you won't end up getting any Provenance events anyway.

One possible way to achieve what you are looking for is to instead emit each of those FlowFiles
individually and then use a MergeContent processor to merge the FlowFiles back together.
Using this approach, though, you will certainly run into heap concerns if you are trying to merge
500,000 FlowFiles in a single iteration. Typically, the approach that we would follow is to merge
say 10,000 FlowFiles at a time and then have a subsequent MergeContent that would merge
together 50 of those 10,000-FlowFile-bundles.
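
A sketch of that two-stage layout in MergeContent properties (the property names are MergeContent's; the counts are just the example numbers above):

    MergeContent #1 (first stage)
        Merge Strategy: Bin-Packing Algorithm
        Minimum Number of Entries: 10000
        Maximum Number of Entries: 10000

    MergeContent #2 (second stage)
        Merge Strategy: Bin-Packing Algorithm
        Minimum Number of Entries: 50
        Maximum Number of Entries: 50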

Thanks
-Mark


> On Apr 15, 2016, at 11:57 AM, Richard Miskin <r....@gmail.com> wrote: