Posted to users@nifi.apache.org by Charlie Frasure <ch...@gmail.com> on 2015/10/26 00:13:08 UTC

ConvertCharacterSet

I'm looking to process many files into common formats.  The source files
come in various character sets, MIME types, and newline terminators.

My thinking for a data flow was along these lines:

GetFile (from many sub directories) ->
ExecuteStreamCommand (file -i) ->
ConvertCharacterSet (from previous command to utf8) ->
ReplaceText (to change any \r\n into \n) ->
PutFile (into a directory structure based on values found in the original
file path and filename)

Additional steps would be added for archiving a copy of the original,
converting xml files, etc.

Attempting to process these with NiFi leaves me confused as to how to
do it within the tool.  If I want to use ConvertCharacterSet, I have to
know the input character set.  I set up an ExecuteStreamCommand to run
file -i ${absolute.path:append($(unknown))}, which returned the expected
values.  I don't see a way to turn those results into input for the
processor, which doesn't accept expression language for that field.

I also considered ConvertCSVToAvro as an interim step but notice the same
issue.  Any suggestions what this dataflow should look like?

Charlie
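[Editor's note: the five-processor flow above can be sketched, purely for illustration, as stand-alone Python outside NiFi. The candidate-encoding list below is an assumption standing in for the `file -i` detection step.]

```python
# Minimal sketch of the flow above: guess a character set, re-encode to
# UTF-8, and normalize \r\n line endings to \n. The candidate list is an
# assumption; the thread uses `file -i` (libmagic) for detection instead.
def to_utf8(data: bytes, candidates=("utf-8", "utf-16", "latin-1")) -> bytes:
    for enc in candidates:
        try:
            text = data.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("no candidate encoding matched")
    # ReplaceText step: collapse Windows line endings.
    return text.replace("\r\n", "\n").encode("utf-8")
```

Note that decoding falls through the candidates in order, so a catch-all encoding like latin-1 (which accepts any byte sequence) has to come last.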

Re: ConvertCharacterSet

Posted by Joe Percivall <jo...@yahoo.com>.
Hey Charlie, 
I just created a ticket for that: https://issues.apache.org/jira/browse/NIFI-1081. I should have it knocked out in the next day or so.
Joe
- - - - - -
Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com


On Wednesday, October 28, 2015 11:46 AM, Charlie Frasure <ch...@gmail.com> wrote:
 I saw the patch you added for NIFI-1077.  Thanks!  Do you plan to add an issue for the ExecuteStreamCommand output, or should I be looking into NIFI-190 that Bryan mentioned?


Re: ConvertCharacterSet

Posted by Charlie Frasure <ch...@gmail.com>.
I saw the patch you added for NIFI-1077.  Thanks!  Do you plan to add an
issue for the ExecuteStreamCommand output, or should I be looking into
NIFI-190 that Bryan mentioned?

On Tue, Oct 27, 2015 at 5:30 PM, Joe Percivall <jo...@yahoo.com>
wrote:


Re: ConvertCharacterSet

Posted by Joe Percivall <jo...@yahoo.com>.
No one responded with concerns about allowing expression language for the input/output character set, so I created a JIRA [1]. This use case should be easy for NiFi, and the flow for it is definitely more of a hack job than it should be.

Does anyone have objections to adding a configuration option to put the output of ExecuteStreamCommand in an attribute instead of the FlowFile contents?
 
[1] https://issues.apache.org/jira/browse/NIFI-1077
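[Editor's note: a hypothetical sketch of the option proposed here, i.e. running a command and storing its stdout as a FlowFile attribute while leaving the content untouched. The flowfile-as-dict shape and function name are made up for illustration; real NiFi processors are Java.]

```python
# Run argv and capture its stdout into an attribute, rather than
# overwriting the flowfile's content (what ExecuteStreamCommand does today).
import subprocess

def run_to_attribute(flowfile, attr, argv):
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    flowfile["attributes"][attr] = result.stdout.strip()
    return flowfile  # content is never touched
```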

 
Joe
- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com




On Tuesday, October 27, 2015 5:15 PM, Charlie Frasure <ch...@gmail.com> wrote:




Re: ConvertCharacterSet

Posted by Charlie Frasure <ch...@gmail.com>.
Thank you both for the replies.  I built a flow that adds the "fragment"
attributes early on, and splits the feed after the ExecuteStream that
identifies the character set.  The character set payload goes through
ExtractText to move it into an attribute and ReplaceText to delete the
contents of the file.  The two streams are then funneled to a MergeContent
using Defragment, which results in the original data with an extra blank
line and the character set attribute attached.

I suppose at this point I could route based on attributes for each
character set or call another ExecuteStream to iconv.  This works, but
seems a bit of a hack job.  Any suggestions for improvement?  Is this an
expected use case for the tool?
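[Editor's note: a stand-alone sketch, not NiFi code, of the step after the merge above. Once the character set lives in an attribute, the conversion itself is a single re-encode; the attribute name "file.charset" is made up for illustration.]

```python
# Re-encode a flowfile's content to UTF-8 using the character set that an
# earlier step (ExtractText in the flow above) stored as an attribute.
def convert_to_utf8(content: bytes, attributes: dict) -> bytes:
    charset = attributes["file.charset"]  # illustrative attribute name
    return content.decode(charset).encode("utf-8")
```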

On Tue, Oct 27, 2015 at 10:45 AM, Bryan Bende <bb...@gmail.com> wrote:


Re: ConvertCharacterSet

Posted by Bryan Bende <bb...@gmail.com>.
One problem with the above flow is that ExecuteStreamCommand will replace
the contents of the FlowFile with the results of the command, so the
FlowFile will have the encoding value and no longer have the original
content.

This could potentially be solved in the future with the "hold file"
processor [1]: the original file is held on one path while the same file
goes to ExecuteStreamCommand; after the encoding is determined, it could
be extracted to an attribute and then trigger the original file for
release, copying over the encoding attribute.

[1] https://issues.apache.org/jira/browse/NIFI-190
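[Editor's note: a hypothetical sketch of the "hold file" idea in NIFI-190: park the original FlowFile under a correlation id, then release it once the encoding result arrives, copying the attribute over. All names here are illustrative only.]

```python
# In-memory hold/release pairing: the original content waits under an id
# while a copy goes through detection; release reunites content and attribute.
held = {}

def hold(corr_id, flowfile):
    held[corr_id] = flowfile

def release(corr_id, charset):
    flowfile = held.pop(corr_id)          # original content is intact
    flowfile["attributes"]["charset"] = charset
    return flowfile
```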



On Tue, Oct 27, 2015 at 10:24 AM, Joe Percivall <jo...@yahoo.com>
wrote:


Re: ConvertCharacterSet

Posted by Joe Percivall <jo...@yahoo.com>.
Hey Charlie,

Sorry no one has followed up with you yet. One way I see around ConvertCharacterSet not supporting expression language is to route on attribute (assuming the character set is extracted to be an attribute) to different ConvertCharacterSet processors depending on the input character set.

That being said, I don't see a reason why the ConvertCharacterSet shouldn't support expression language. If anyone doesn't have objections I'll put in a ticket later today and knock it out real quick.
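[Editor's note: the routing workaround described above, one ConvertCharacterSet per input character set, can be modeled as a dispatch table. The charset list is illustrative.]

```python
# One converter per expected charset, selected by the routed attribute value,
# mirroring RouteOnAttribute feeding parallel ConvertCharacterSet processors.
def make_converter(charset):
    return lambda data: data.decode(charset).encode("utf-8")

routes = {cs: make_converter(cs) for cs in ("iso-8859-1", "windows-1252", "utf-16")}

def route(content: bytes, charset: str) -> bytes:
    return routes[charset](content)
```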
 

Joe
- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com




On Sunday, October 25, 2015 7:13 PM, Charlie Frasure <ch...@gmail.com> wrote:


