Posted to users@nifi.apache.org by Charlie Frasure <ch...@gmail.com> on 2017/09/19 15:29:31 UTC

AttributesToJSON

I have a data flow that takes delimited input using GetFile, extracts some
of that into attributes, converts the attributes to a JSON object,
reformats the JSON using the Jolt transformer, and then does additional
processing before using PutFile to move the original file based on the
dataflow result.  I have to work around NiFi to make the last step happen.

I am setting AttributesToJSON's Destination to flowfile-content because
the Jolt transformer requires the JSON object to be in the flowfile
content.  There is no "original" relationship out of AttributesToJSON, so
this data would be lost.  Instead I have "Keep Source File" set to true on
the GetFile, and then use PutFile with the filename to grab it later.

This works for the most part, but under heavy data loads we have some
errors trying to process a file more than once.

I think we could resolve this by not keeping the source file, sending a
duplicate of the content down another path and merging later.  I want to
explore the possibility of either 1) having an "original" relationship
whenever the previous flowfile content is being modified or replaced, or 2)
maintaining an "original" flowfile content alongside the working content so
that it is easily available once the processing is complete.

Am I missing a more direct way to process this data?  Other thoughts?

Thanks,
Charlie

Re: AttributesToJSON

Posted by Joe Witt <jo...@gmail.com>.
Ha!  They are nearly as cool as NiFi reading bedtime stories.  You
have a good point.

I was all happy we were about to make your flow far
better/faster/stronger.  Then you threw down with HL7.

We really need to make an HL7RecordReader then the rest of this would
be fast/fun.  Any volunteers?

Thanks
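To make the HL7RecordReader idea concrete: the core parsing such a reader would wrap is modest, since HL7v2 segments are carriage-return separated and fields are pipe separated. A minimal sketch of that tokenizing (not the NiFi Record API itself, and the sample message is invented):

```python
def parse_hl7_segments(message):
    """Split a raw HL7v2 message into (segment_name, fields) pairs.

    Segments are separated by carriage returns and fields by pipes.
    Component (^) and repetition (~) splitting is omitted for brevity.
    """
    records = []
    for segment in message.replace("\n", "\r").split("\r"):
        if not segment.strip():
            continue
        fields = segment.split("|")
        records.append((fields[0], fields[1:]))
    return records

# Invented two-segment ADT message for illustration.
sample = ("MSH|^~\\&|SENDER|FAC|RECEIVER|FAC|20170919||ADT^A01|123|P|2.3\r"
          "PID|1||12345||DOE^JOHN")
for name, fields in parse_hl7_segments(sample):
    print(name, len(fields))
```

A real RecordReader would additionally map each segment's fields onto a record schema, but the segment/field split above is the part every implementation needs first.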


Re: AttributesToJSON

Posted by Charlie Frasure <ch...@gmail.com>.
Thanks Joe,

I'm using the HL7 processor to extract HL7v2 data to attributes, then
mapping the attributes to expected JSON entries.  I am using the Record
reader/writers elsewhere, definitely the best thing that has happened to
NiFi since bedtime stories [1].
So my current flow is:

GetFile (leave original file) ->
ExtractHL7Attributes ->
UpdateAttribute (for light conversions) ->
AttributesToJSON (as flowfile-content) ->
JoltTransformJSON (This could probably be replaced by record readers/writers) ->
InvokeHTTP (call webservice) ->
FetchFile (using filename attribute)

There are some additional exception paths, but this flow works as intended
except when the web service can't keep up with new files.  I have a delay
built into GetFile to account for this, which mostly works, but sometimes
we pull the same file more than once.  I suppose I could also move the file
to an interim folder to prevent multiple reads.
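The interim-folder idea works precisely because a rename within one filesystem is atomic: if two pollers race on the same file, only one rename succeeds and the loser simply skips it. A sketch of that claim step outside NiFi (directory and file names invented):

```python
import os
import tempfile

def claim_file(src_path, interim_dir):
    """Atomically move a file into an interim folder before processing.

    os.rename is atomic within a single POSIX filesystem, so when two
    workers race on the same file, exactly one rename succeeds; the
    loser gets FileNotFoundError and skips the file.
    """
    dest = os.path.join(interim_dir, os.path.basename(src_path))
    try:
        os.rename(src_path, dest)
        return dest        # this worker owns the file now
    except FileNotFoundError:
        return None        # another worker already claimed it

# Demo: the second claim on the same file loses the race.
inbox, interim = tempfile.mkdtemp(), tempfile.mkdtemp()
path = os.path.join(inbox, "message.hl7")
with open(path, "w") as f:
    f.write("MSH|...")
print(claim_file(path, interim))   # a path inside the interim dir
print(claim_file(path, interim))   # None
```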

Thanks,
Charlie


[1]
https://community.hortonworks.com/articles/28380/nifi-ocr-using-apache-nifi-to-read-childrens-books.html



Re: AttributesToJSON

Posted by Joe Witt <jo...@gmail.com>.
Charlie

You'll absolutely want to look at the Record reader/writer
capabilities.  They will help you convert from CSV (or similar) to
JSON without having to go through attributes at all.

Take a look here
https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
where you can see the provenance example for configuration.  If you
want to share a sample line of the delimited data and a sample of the
output JSON, I can share back a template that would help you get
started.

Thanks
Joe
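To illustrate what the record approach buys conceptually, here is a plain-Python sketch of the CSV-to-JSON conversion that a record-based flow (e.g. ConvertRecord with a CSVReader and JsonRecordSetWriter) performs inside NiFi, without routing any data through attributes; the column names are invented:

```python
import csv
import io
import json

def csv_to_json_records(csv_text):
    """Convert header-led delimited text into a JSON array of records,
    roughly what a record reader/writer pair does in one pass."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

# Invented sample row.
sample_csv = "mrn,last,first\n12345,DOE,JOHN\n"
print(csv_to_json_records(sample_csv))
# [{"mrn": "12345", "last": "DOE", "first": "JOHN"}]
```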
