Posted to users@nifi.apache.org by Richard Beare <ri...@gmail.com> on 2021/05/11 22:06:34 UTC

advice - avro record fields to attributes

Hi,
Warning about a likely newbie question.

I'm extracting records from an SQL DB that include a blob that I need
to feed through some custom Groovy/Java. I have the basic version
working using Avro records throughout. However, there is a
complication: a small proportion of blobs span multiple rows and
require concatenation before processing. Thus I need to ensure that
all the parts belonging to one blob get assembled into the same
flowfile before performing the concatenation (probably using a custom
Groovy script, because there are some suffixes to remove before
concatenation).
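
Something along these lines is what I have in mind for the
concatenation step - a minimal Groovy sketch, with a made-up suffix
value and with the parts assumed to already be ordered by their
sequence number:

    // Strip a known trailing suffix from each part, then concatenate.
    byte[] assembleBlob(List<byte[]> parts, byte[] suffix) {
        def out = new ByteArrayOutputStream()
        parts.each { part ->
            int len = part.length
            // Drop the suffix if this part ends with it
            if (len >= suffix.length &&
                    Arrays.equals(Arrays.copyOfRange(part, len - suffix.length, len), suffix)) {
                len -= suffix.length
            }
            out.write(part, 0, len)
        }
        out.toByteArray()
    }

    def whole = assembleBlob(['abcEND-OF-PART'.bytes, 'defEND-OF-PART'.bytes], 'END-OF-PART'.bytes)
    assert new String(whole) == 'abcdef'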

I've added a "PARTS" field via my initial SQL query, and there is also
a sequence number column and an ID column. My plan was to use
PartitionRecord based on the ID, set the fragment ID and count
attributes, and then reassemble with MergeRecord using the defragment
strategy.

My problem is that I can't figure out how to get record fields into
attributes. I'm hoping there is a RecordPath/Expression Language
combination allowing this. Any suggestions?

My fallback is to separate the initial SQL queries into two parts: one
for batches of single-row blobs and a second that collects the
multi-row ones one at a time.

Re: advice - avro record fields to attributes

Posted by Richard Beare <ri...@gmail.com>.
A working solution:

PartitionRecord on blobID and blobSize, then RouteOnAttribute to match
flowfiles where blobSize equals record.count (i.e. the partition
already holds every part). The non-matching branch goes through
PartitionRecord again, this time on blobID, blobSize and blobIndex,
which functions as a split but attaches the field values as
attributes. Then through UpdateAttribute to copy those attributes to
the appropriate "fragment" attributes, then on to the merge.


Re: advice - avro record fields to attributes

Posted by Richard Beare <ri...@gmail.com>.
I have a field in the record saying how many parts the blob has been
broken into - how do I put that information into an attribute?


Re: advice - avro record fields to attributes

Posted by Mark Payne <ma...@hotmail.com>.
Richard,

Yes, this seems reasonable, but you’ll need to know how many ‘fragments’ are in each bundle. Do you have that information? If so, you can use PartitionRecord to pull that information out into attributes, and then use UpdateAttribute to make sure that the appropriate attributes are specified. Then I think you’d probably need to have a custom processor that extends BinFiles. BinFiles is an abstract class that MergeContent extends. It has a couple of different abstract methods, but the important one is this method:

protected abstract BinProcessingResult processBin(Bin unmodifiableBin, ProcessContext context) throws ProcessException;

The others are more setup/config/validation types of things and are likely either empty implementations or simple one-liners.
MergeContent would be a good example to look at to fully understand how to handle these methods.

Hope this helps!
-Mark
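
As a rough illustration of that shape, here is a minimal Groovy
skeleton (a real processor would normally be Java). Only processBin's
signature is quoted above; the other overridden signatures are taken
from the BinFiles class around NiFi 1.13.x and should be checked
against the source for your version, and the class name and merge
logic here are hypothetical:

    import org.apache.nifi.flowfile.FlowFile
    import org.apache.nifi.processor.ProcessContext
    import org.apache.nifi.processor.ProcessSession
    import org.apache.nifi.processor.exception.ProcessException
    import org.apache.nifi.processor.util.bin.Bin
    import org.apache.nifi.processor.util.bin.BinFiles
    import org.apache.nifi.processor.util.bin.BinManager
    import org.apache.nifi.processor.util.bin.BinProcessingResult

    class MergeBlobParts extends BinFiles {

        @Override
        protected FlowFile preprocessFlowFile(ProcessContext context, ProcessSession session, FlowFile flowFile) {
            flowFile  // nothing to adjust before binning
        }

        @Override
        protected String getGroupId(ProcessContext context, FlowFile flowFile, ProcessSession session) {
            flowFile.getAttribute('blobID')  // bin all parts of one blob together
        }

        @Override
        protected void setUpBinManager(BinManager binManager, ProcessContext context) {
            // no extra bin manager configuration in this sketch
        }

        @Override
        protected BinProcessingResult processBin(Bin unmodifiableBin, ProcessContext context) throws ProcessException {
            // Here is where you would sort the bin's flowfiles by blobIndex,
            // strip the suffixes, concatenate the contents into one flowfile
            // and transfer it - MergeContent's processBin shows how to work
            // with the bin's contents and session.
            throw new ProcessException('merge logic not implemented in this sketch')
        }
    }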


Re: advice - avro record fields to attributes

Posted by Richard Beare <ri...@gmail.com>.
I'm hoping to use PartitionRecord as a first step, to create groups
with a common ID. However, I cannot be certain that all the required
rows will be in the same flowfile because of the way input will be
chunked (e.g. some row limit in the initial SQL query to keep size
manageable). PartitionRecord, as far as I can tell, only partitions
within each flowfile - I tested by feeding the results of SplitRecord
into PartitionRecord. Hence I'm thinking PartitionRecord,
UpdateAttribute using fields in each partition, then MergeRecord
(defragment). I may be able to filter the complete partitions around
the merge.


Re: advice - avro record fields to attributes

Posted by Chris Sampson <ch...@naimuri.com>.
PartitionRecord might be what you're after. This will allow you to
analyse fields and separate the records from a flowfile into chunks
containing the same field values; those values will be added as
flowfile attributes.

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.13.2/org.apache.nifi.processors.standard.PartitionRecord/index.html
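
For example, with the columns mentioned in the original post, you
would add one user-defined property per field to group on; each
property name becomes an attribute name and its value is a RecordPath
(the ID and SEQ field names here are assumptions):

    blobID    = /ID
    blobSize  = /PARTS
    blobIndex = /SEQ

Every flowfile PartitionRecord emits then carries blobID, blobSize and
blobIndex attributes describing the records it contains.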


Cheers,

Chris Sampson
