Posted to users@nifi.apache.org by James McMahon <js...@gmail.com> on 2018/08/17 17:47:56 UTC

Creating an attribute

I have flowfiles with data payloads that represent small strings of text
(messages consumed from AMQP queues). I want to create an attribute that
holds the entire payload for downstream use. How can I capture the entire
data payload of a flowfile in a new attribute on the flowfile? Thank you in
advance for your help. -Jim

Re: Creating an attribute

Posted by James McMahon <js...@gmail.com>.
This sounds like just what I need. Thank you very much Matt. I'll dig in
and give this a try. Thanks again to each of you guys who responded.
Cheers, Jim

On Fri, Aug 17, 2018 at 4:24 PM, Matt Burgess <ma...@apache.org> wrote:

> Jim,
>
> You can use UpdateRecord for this, your input schema would have "last"
> and "first" in it (and I think you can have an optional "myKey" field
> so you can use the same schema for the writer), and the output schema
> would have all three fields in it. Then you'd set the Replacement
> Value Strategy to "Literal Value" and add a user-defined property in
> UpdateRecord called "/myKey" set to "${myKey}". This will take the
> value from the attribute myKey and put it at the root of each record
> in a field called myKey.  Since this is JSON, you could do the same
> with JoltTransformJSON, with a Default spec setting "myKey":
> "${myKey}". Not sure which is faster in this case, since there appears
> to be a single record.
>
> This also works if there are multiple records in the flow file, as
> long as the myKey field is to have the same value for all records
> (since there is only one myKey attribute value for the whole flow
> file).  If there are multiple records and they each need a different
> value, you have a "lookup" use case on your hands, where you'd want to
> match some value against some lookup service, and it would fill in
> that field from the value supplied by the lookup service (you'd use
> LookupRecord with a LookupService for this). Or if all else fails,
> there is the Split pattern if you truly do want/need to process one
> JSON object at a time.
>
> Regards,
> Matt

Re: Creating an attribute

Posted by James McMahon <js...@gmail.com>.
Thank you Matt. Understood. Thanks again for taking the time to reply to my
questions. -Jim

On Mon, Aug 20, 2018 at 9:13 AM, Matt Burgess <ma...@apache.org> wrote:

> Jim,
>
> If you know all the possible fields that can occur, you can create a
> schema that contains the three mandatory fields and include all the
> others as "optional", this is done by setting the type of the field to
> ["null", <data type>]. This can even be done for the lookup field so
> you can inherit the record schema in the record writer (so you don't
> have to add it by hand in the writer).
>
> If you won't know all the fields, UpdateRecord doesn't currently alter
> the schema to add the field(s), although there is a Jira to cover the
> improvement [1].
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-5524

Re: Creating an attribute

Posted by Matt Burgess <ma...@apache.org>.
Jim,

If you know all the possible fields that can occur, you can create a
schema that contains the three mandatory fields and include all the
others as "optional", this is done by setting the type of the field to
["null", <data type>]. This can even be done for the lookup field so
you can inherit the record schema in the record writer (so you don't
have to add it by hand in the writer).

If you won't know all the fields, UpdateRecord doesn't currently alter
the schema to add the field(s), although there is a Jira to cover the
improvement [1].

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-5524
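
A minimal sketch of the "optional field" idea above, written in Python only to generate the schema JSON. The record name "message" and the optional field "middle" are hypothetical, and "last" and "first" stand in for the mandatory fields, which the thread does not name. The "default": null entries are an assumption based on the usual Avro convention for optional fields, not something stated in the thread.

```python
import json


def optional(name: str, avro_type: str) -> dict:
    """Describe an optional field: a union of "null" and the real type."""
    return {"name": name, "type": ["null", avro_type], "default": None}


# Hypothetical schema: mandatory fields use a plain type, all others are optional.
schema = {
    "type": "record",
    "name": "message",
    "fields": [
        {"name": "last", "type": "string"},
        {"name": "first", "type": "string"},
        optional("middle", "string"),
        optional("myKey", "string"),
    ],
}

print(json.dumps(schema, indent=2))
```

Because myKey is declared optional here, the same schema could plausibly serve both the reader and the writer, which is the point made about inheriting the record schema.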

On Sat, Aug 18, 2018 at 7:43 AM James McMahon <js...@gmail.com> wrote:
>
> I do have a follow-up question. In my example I have oversimplified the structure. In my production space I have two complicating factors: the number of fields can vary, and only three fields are mandatory and so must be there. And the field order can vary: the messages posted to the queue that we consume from have no requirement to enforce the order of the fields. All I know is that I will have my three guaranteed fields. Can UpdateRecord still be used, referencing the three fields explicitly, telling it to put my new field(s) after one of those wherever it may be in the object, and indicating it should then include all other keys/values in the object?

Re: Creating an attribute

Posted by James McMahon <js...@gmail.com>.
I do have a follow-up question. In my example I have oversimplified the
structure. In my production space I have two complicating factors: the
number of fields can vary, and only three fields are mandatory and so must
be there. And the field order can vary: the messages posted to the queue
that we consume from have no requirement to enforce the order of the
fields. All I know is that I will have my three guaranteed fields. Can
UpdateRecord still be used, referencing the three fields explicitly,
telling it to put my new field(s) after one of those wherever it may be
in the object, and indicating it should then include all other keys/values
in the object?

On Fri, Aug 17, 2018 at 4:24 PM, Matt Burgess <ma...@apache.org> wrote:

> Jim,
>
> You can use UpdateRecord for this, your input schema would have "last"
> and "first" in it (and I think you can have an optional "myKey" field
> so you can use the same schema for the writer), and the output schema
> would have all three fields in it. Then you'd set the Replacement
> Value Strategy to "Literal Value" and add a user-defined property in
> UpdateRecord called "/myKey" set to "${myKey}". This will take the
> value from the attribute myKey and put it at the root of each record
> in a field called myKey.  Since this is JSON, you could do the same
> with JoltTransformJSON, with a Default spec setting "myKey":
> "${myKey}". Not sure which is faster in this case, since there appears
> to be a single record.
>
> This also works if there are multiple records in the flow file, as
> long as the myKey field is to have the same value for all records
> (since there is only one myKey attribute value for the whole flow
> file).  If there are multiple records and they each need a different
> value, you have a "lookup" use case on your hands, where you'd want to
> match some value against some lookup service, and it would fill in
> that field from the value supplied by the lookup service (you'd use
> LookupRecord with a LookupService for this). Or if all else fails,
> there is the Split pattern if you truly do want/need to process one
> JSON object at a time.
>
> Regards,
> Matt

Re: Creating an attribute

Posted by Matt Burgess <ma...@apache.org>.
Jim,

You can use UpdateRecord for this, your input schema would have "last"
and "first" in it (and I think you can have an optional "myKey" field
so you can use the same schema for the writer), and the output schema
would have all three fields in it. Then you'd set the Replacement
Value Strategy to "Literal Value" and add a user-defined property in
UpdateRecord called "/myKey" set to "${myKey}". This will take the
value from the attribute myKey and put it at the root of each record
in a field called myKey.  Since this is JSON, you could do the same
with JoltTransformJSON, with a Default spec setting "myKey":
"${myKey}". Not sure which is faster in this case, since there appears
to be a single record.
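
A minimal sketch, in plain Python rather than NiFi, of the end result described above: the value of the myKey attribute ends up as a myKey field at the root of each record. The function name and the attributes dictionary are hypothetical stand-ins. The sketch mimics the outcome only, not the internals of UpdateRecord or JoltTransformJSON, and it leaves an existing myKey untouched, in the spirit of a "default".

```python
import json


def add_default_field(payload: str, key: str, value: str) -> str:
    """Add key=value to a JSON object, or to each object in a JSON array,
    only where the key is not already present."""
    data = json.loads(payload)
    records = data if isinstance(data, list) else [data]
    for record in records:
        record.setdefault(key, value)
    return json.dumps(data, separators=(",", ":"))


# Stand-in for the flow file attribute myKey (value taken from the thread).
attributes = {"myKey": "123abc"}
content = '{"last":"manson","first":"marilyn"}'

print(add_default_field(content, "myKey", attributes["myKey"]))
# prints {"last":"manson","first":"marilyn","myKey":"123abc"}
```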

This also works if there are multiple records in the flow file, as
long as the myKey field is to have the same value for all records
(since there is only one myKey attribute value for the whole flow
file).  If there are multiple records and they each need a different
value, you have a "lookup" use case on your hands, where you'd want to
match some value against some lookup service, and it would fill in that
field from the value supplied by the lookup service (you'd use
LookupRecord with a LookupService for this). Or if all else fails,
there is the Split pattern if you truly do want/need to process one
JSON object at a time.
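
The "lookup" case can be sketched the same way, again in plain Python and only to show the shape of the result. The lookup table, its keys, the second record, and the choice of "last" as the matching field are all hypothetical; in NiFi the table would be supplied by a lookup service rather than a dictionary.

```python
import json

# Hypothetical lookup table: maps a record's "last" value to its myKey.
LOOKUP_TABLE = {"manson": "123abc", "cooper": "456def"}


def enrich_records(payload: str, lookup: dict) -> str:
    """Set myKey on each record from the lookup; None if there is no match."""
    records = json.loads(payload)
    for record in records:
        record["myKey"] = lookup.get(record["last"])
    return json.dumps(records, separators=(",", ":"))


content = '[{"last":"manson","first":"marilyn"},{"last":"cooper","first":"alice"}]'
print(enrich_records(content, LOOKUP_TABLE))
```

Unlike the single-attribute case, each record here can receive a different value, which is why one flow file attribute is not enough.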

Regards,
Matt

On Fri, Aug 17, 2018 at 4:06 PM James McMahon <js...@gmail.com> wrote:
>
> I do appreciate your point, Tim and Lee. What if I do this instead: append select attributes to my data payload. Would that minimize the impact on RAM? Can I do that?
>
> More specifically, my data payload is a string representation of a JSON object, like so:
> {"last":"manson","first":"marilyn"}
> and I have an attribute named myKey that contains the value "123abc"
>
> Is there a processor that allows me to wind up with this string representation of JSON:
> {"last":"manson","first":"marilyn", "myKey":"123abc"}
>
> If I could do that, I could avoid loading the entire data payload into an attribute, and manipulate them in a python script called by ExecuteScript. I know how to do that, I don't know how to do the above with native processors.
> Thanks in advance for your help.

Re: Creating an attribute

Posted by James McMahon <js...@gmail.com>.
I do appreciate your point, Tim and Lee. What if I do this instead: append
select attributes to my data payload. Would that minimize the impact on
RAM? Can I do that?

More specifically, my data payload is a string representation of a JSON
object, like so:
{"last":"manson","first":"marilyn"}
and I have an attribute named myKey that contains the value "123abc"

Is there a processor that allows me to wind up with this string
representation of JSON:
{"last":"manson","first":"marilyn", "myKey":"123abc"}

If I could do that, I could avoid loading the entire data payload into an
attribute, and manipulate them in a python script called by ExecuteScript.
I know how to do that, I don't know how to do the above with native
processors.
Thanks in advance for your help.

On Fri, Aug 17, 2018 at 2:02 PM, Lee Laim <le...@gmail.com> wrote:

> Jim,
> I think the ExtractText processor with a large enough Maximum Capture
> Group Length (default: 1024) will do that. Though, I share Tim’s
> concerns when you scale up.
> Thanks,
> Lee

Re: Creating an attribute

Posted by Lee Laim <le...@gmail.com>.
Jim,
I think the ExtractText processor with a large enough Maximum Capture Group Length (default: 1024) will do that. Though, I share Tim’s concerns when you scale up.
Thanks,
Lee
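
A rough Python emulation of that approach, to show why the capture group length matters. The function name, the attribute name "payload", and the regular expression are assumptions chosen for illustration; ExtractText itself is configured through processor properties, not code. The sketch assumes a captured value longer than the limit is truncated to it.

```python
import re


def extract_payload(content: str, max_capture_group_length: int = 1024) -> dict:
    """Capture the whole content in group 1 and cap it at the given length."""
    match = re.search(r"(?s)(^.*$)", content)  # (?s) lets "." match newlines
    if match is None:
        return {}
    return {"payload": match.group(1)[:max_capture_group_length]}


attributes = extract_payload('{"last":"manson","first":"marilyn"}')
print(attributes["payload"])

# A payload longer than the limit is cut off rather than captured in full.
print(len(extract_payload("x" * 2000)["payload"]))  # prints 1024
```

Raising the limit captures larger payloads, at the cost of holding that much more data in the attribute.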


> On Aug 17, 2018, at 11:52 AM, Timothy Tschampel <ti...@vivacehealthsolutions.com> wrote:
> 
> 
> This may not be applicable to your use case depending on message volume / # of attributes; but I would avoid putting payloads into attributes for scalability reasons (especially RAM usage).

Re: Creating an attribute

Posted by Timothy Tschampel <ti...@vivacehealthsolutions.com>.
This may not be applicable to your use case depending on message volume / # of attributes; but I would avoid putting payloads into attributes for scalability reasons (especially RAM usage).


> On Aug 17, 2018, at 10:47 AM, James McMahon <js...@gmail.com> wrote:
> 
> I have flowfiles with data payloads that represent small strings of text (messages consumed from AMQP queues). I want to create an attribute that holds the entire payload for downstream use. How can I capture the entire data payload of a flowfile in a new attribute on the flowfile? Thank you in advance for your help. -Jim