Posted to users@nifi.apache.org by Mika Borner <ni...@my2ndhead.com> on 2017/06/12 19:12:15 UTC

Merging Records

Hi,

what is the best way to merge records? I'm using a GrokReader that 
spits out single JSON records. For efficiency I would like to merge a 
few hundred records into one flowfile. It seems there's no MergeRecord 
processor yet...

Thanks!

Mika>


Re: Merging Records

Posted by Mika Borner <ni...@my2ndhead.com>.
Yes, it worked!

Thanks!

Mika>


On 06/12/2017 10:02 PM, Bryan Bende wrote:
> Mika,
>
> Are you receiving the log messages using the ListenTCP processor?
>
> If so, just wanted to mention that there is a property "Max Batch
> Size" that defaults to 1 and will control how many logical TCP
> messages can be written to a single flow file.
>
> If you increase that to say 1000, then you can send a flow file with
> 1000 log messages to the next record-based processor with the
> GrokReader.
>
> -Bryan


Re: Merging Records

Posted by Bryan Bende <bb...@gmail.com>.
Mika,

Are you receiving the log messages using the ListenTCP processor?

If so, just wanted to mention that there is a property "Max Batch
Size" that defaults to 1 and controls how many logical TCP
messages can be written to a single flow file.

If you increase that to, say, 1000, then you can send a flow file with
1000 log messages to the next record-based processor with the
GrokReader.

-Bryan
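The effect of the batch-size property can be sketched in a few lines of Python (illustrative only, not NiFi's implementation): with a batch size of 1 every message becomes its own flow file, while a larger value groups messages together.

```python
def batch_messages(messages, max_batch_size=1):
    """Group incoming messages into batches of at most max_batch_size,
    mimicking how a 'Max Batch Size' style property frames messages
    into flow files."""
    batches = []
    for i in range(0, len(messages), max_batch_size):
        batches.append(messages[i:i + max_batch_size])
    return batches

msgs = [f"log line {n}" for n in range(2500)]
print(len(batch_messages(msgs, 1)))     # 2500 batches: one flow file per message
print(len(batch_messages(msgs, 1000)))  # 3 batches: 1000 + 1000 + 500 messages
```

The last batch may be smaller than the maximum, just as a final flow file can carry fewer messages than the configured limit.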



Re: Merging Records

Posted by Mark Payne <ma...@hotmail.com>.
Mika,

Understood. The JIRA for this is NIFI-4060 [1]. MergeContent is likely the best option for the short term,
merging with a demarcator of \n (you can press Shift + Enter/Return to insert a newline in the UI), if that
works for your format.

Thanks
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-4060
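The demarcator approach amounts to newline-concatenating the single-record JSON payloads into one NDJSON-style payload, which a record reader can then consume as a stream. A rough Python sketch of the idea (not MergeContent itself):

```python
import json

# Many single-record JSON payloads, as a GrokReader-style flow might emit.
records = [json.dumps({"seq": n, "msg": f"event {n}"}) for n in range(300)]

# Merge with a demarcator of "\n" between records.
merged = "\n".join(records)

# A downstream record reader can recover the individual records
# by reading the merged payload line by line.
recovered = [json.loads(line) for line in merged.splitlines()]
print(len(recovered))  # 300
```

This only works if no record contains a literal newline itself, which is why Mark hedges with "if that works for your format."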



Re: Merging Records

Posted by Mika Borner <ni...@my2ndhead.com>.
Hi Mark

Yes, this makes sense.

In my case, I'm receiving single log events from a TCP input which I 
would like to process further with record processors. This is probably 
an edge case where a record merger would make sense, to make the 
post-processing more efficient.

Good to hear it's already on the radar :-)

Mika>




Re: Merging Records

Posted by Mark Payne <ma...@hotmail.com>.
Hi Mika,

You're correct that there is not yet a MergeRecord processor. It is on my personal radar,
but I've not yet gotten to it. One of the main reasons that I've not prioritized this yet is that
typically in this record-oriented paradigm, you'll see data coming in, in groups and being
processed in groups. MergeContent largely has been useful in cases where we split data
apart (using processors like SplitText, for example), and then merge it back together later.
I don't see this as being quite as prominent when using record readers and writers, as the
readers are designed to handle streams of data instead of individual records as FlowFiles.

That being said, there are certainly cases where MergeRecord still makes sense. For example,
when you're ingesting small payloads or want to batch up to send to something like HDFS, which
prefers larger files, etc. So I'll hopefully have a chance to start working on that this week or next.

In the meantime, the best path forward for you may be to use MergeContent to concatenate a bunch
of data before the processor that is using the Grok Reader. Or, if you are splitting the data up
into individual records yourself, I would recommend not splitting them up at all.

Does this make sense?

Thanks
-Mark
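The point about readers handling streams rather than per-record flow files can be sketched in Python (illustrative only, not NiFi's reader API): one payload carrying many records is parsed record by record, so there is no need to split it apart first.

```python
import io
import json

# One payload carrying many newline-demarcated records.
payload = "\n".join(json.dumps({"id": n}) for n in range(100))

def read_records(stream):
    """Yield records one at a time from a newline-demarcated stream,
    the way a record reader consumes a single multi-record payload."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

count = sum(1 for _ in read_records(io.StringIO(payload)))
print(count)  # 100 records parsed from a single payload
```

Splitting the payload into 100 one-record flow files and processing each separately would produce the same records, just with far more framework overhead per record.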

