Posted to users@nifi.apache.org by James McMahon <js...@gmail.com> on 2017/02/02 22:38:04 UTC

Writing back through a python stream callback when the flowfile content is a mix of character and binary

I have a flowfile that has tagged character information I need to get at
throughout the first few sections of the file. I need to use regex in
python to select some of those values and to transform others. I am using
an ExecuteScript processor to execute my python code. Here is my approach:



= = = = =

class PyStreamCallback(StreamCallback):

    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # What happens to my binary and extreme chars when they get
        # passed through this step?
        stuff = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

        # ... (transform and pick out select content) ...

        # Am I using the wrong functions to put my text chars, my binary
        # and my extreme chars back on the stream as a byte stream? What
        # should I be doing to handle the variety of data?
        outputStream.write(bytearray(stuff.encode('utf-8')))


flowFile = session.get()
if flowFile is not None:
    incoming = flowFile.getAttribute('filename')
    logging.info('about to process file: %s', incoming)
    flowFile = session.write(flowFile, PyStreamCallback())   # line 155 in my code
    session.transfer(flowFile, REL_SUCCESS)
    session.commit()

= = = = =



When my incoming flowfile is all character content - such as tagged xml -
my code works fine. All the flowfiles that also contain some binary data
and/or characters at the extremes such as foreign language characters don’t
work. They error out. I suspect it has to do with the way I am writing back
to the flowfile stream.



Here is the error I am getting:

org.apache.nifi.processor.exception.ProcessException:
javax.script.ScriptException: TypeError: write(): 1st arg can't be coerced
to int, byte[] in <script> at line number 155



How should I handle the write back to the flowfile in cases where I have a
mix of character and binary?


Note: I must do this programmatically. I tried using a combination of
SplitContent and MergeContent, but I have no consistent reliable
demarcation between the regular text characters and the other more
challenging characters that I can split on.

All the examples I've found handle more pure circumstances than mine seems
to be. For example, all text. Or all JSON. I've not yet been able to find
an example that shows me how to write back to the stream for mixed data
situations. Can you help?
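
On the question of what UTF-8 decoding does to the binary portions, a small
stand-alone sketch in plain Python (not the NiFi script itself) may help. It
assumes the decoder substitutes U+FFFD for malformed input; Python's "replace"
error handler is used here as a stand-in for that behavior, and the tag name
and sample bytes are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: a UTF-8 text header followed by raw binary bytes.
header = u"<name>caf\u00e9</name>".encode("utf-8")
binary = b"\x00\xff\xfe\x80"
mixed = header + binary

# Decode as a UTF-8 text read would, replacing malformed sequences.
decoded = mixed.decode("utf-8", "replace")

# Encode back, as the callback above does before writing.
round_tripped = decoded.encode("utf-8")

# The valid text survives the round trip. Each malformed byte, however, has
# become U+FFFD (bytes EF BF BD in UTF-8), so the binary tail is altered.
text_survives = round_tripped.startswith(header)
binary_survives = round_tripped.endswith(binary)
```

Under that assumption, an all-text flowfile comes back unchanged while a mixed
one comes back with its binary bytes rewritten, independent of any error
raised later at the write step.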

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by Matt Burgess <ma...@apache.org>.
Can you share with us a little more information about the
schema/format of your incoming data?  Is there always a tag before a
data item, for example?

Thanks,
Matt

On Thu, Feb 2, 2017 at 8:26 PM, James McMahon <js...@gmail.com> wrote:
> Thank you very much Matt. I would be most interested in any insights you
> gain if you are able to recreate the problem.
>
> If you have a moment, can you offer up a line of code showing how one might
> wrap a call around the byte stream to treat the bytes as a string that can
> be matched against using, for instance, a compiled re pattern? I will
> definitely look more closely at the Oracle docs link you provided. An
> example would help me when I tackle this.  -Jim

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by James McMahon <js...@gmail.com>.
Thank you very much Matt. I would be most interested in any insights you
gain if you are able to recreate the problem.

If you have a moment, can you offer up a line of code showing how one might
wrap a call around the byte stream to treat the bytes as a string that can
be matched against using, for instance, a compiled re pattern? I will
definitely look more closely at the Oracle docs link you provided. An
example would help me when I tackle this.  -Jim
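
One way to treat the bytes as a string for regex matching, sketched in plain
Python under stated assumptions: ISO-8859-1 maps every byte value to exactly
one character, so decoding and re-encoding with it leaves the bytes unchanged.
The <name> tag and the redaction step are hypothetical; the real tags are not
shown in this thread.

```python
import re

# Hypothetical tag name, used only for illustration.
NAME_TAG = re.compile(u"<name>(.*?)</name>", re.DOTALL)


def as_text(raw):
    """View bytes as a string, one character per byte (ISO-8859-1)."""
    return raw.decode("iso-8859-1")


def as_bytes(text):
    """Inverse of as_text: map each character back to its single byte."""
    return text.encode("iso-8859-1")


def redact_name(raw):
    """Replace the tagged value, leaving every other byte untouched."""
    text = NAME_TAG.sub(u"<name>REDACTED</name>", as_text(raw))
    return as_bytes(text)


# All 256 byte values, to check that nothing is lost in the round trip.
every_byte = bytes(bytearray(range(256)))
sample = b"<name>report</name>" + every_byte
result = redact_name(sample)
```

Two caveats with this approach: a multi-byte UTF-8 character in the text shows
up as several separate characters in the decoded string, and any replacement
text must itself stay within U+0000 to U+00FF or the final encode will fail.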

On Thu, Feb 2, 2017 at 6:56 PM, Matt Burgess <ma...@apache.org> wrote:

> James,
>
> If you'd rather work with the inputStream as bytes, you don't need the
> IOUtils.toString() call, and I'm not sure what a UTF-8 charset would
> do to your mixed data.  You can wrap any of the *InputStream
> decorators around the inputStream object, such as DataInputStream [1]
> to read various data types from the underlying bytes in the stream.
> Alternatively you may want to read all the bytes into an array you can
> work with directly via Jython methods instead of using Java I/O.
>
> What's weird about the TypeError is that it looks like it is calling a
> different write() method than I would've expected, I wonder if the
> translation of Jython to Java objects is somehow making the processor
> not be able to match up a method signature.  If the error is not
> occurring in the redacted code block above, I will give this script a
> try, to see if I can reproduce and/or fix the error.
>
> Regards,
> Matt
>
> [1] https://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by Matt Burgess <ma...@apache.org>.
James,

If you'd rather work with the inputStream as bytes, you don't need the
IOUtils.toString() call, and I'm not sure what a UTF-8 charset would
do to your mixed data.  You can wrap any of the *InputStream
decorators around the inputStream object, such as DataInputStream [1]
to read various data types from the underlying bytes in the stream.
Alternatively you may want to read all the bytes into an array you can
work with directly via Jython methods instead of using Java I/O.

What's weird about the TypeError is that it looks like it is calling a
different write() method than I would've expected. I wonder if the
translation of Jython to Java objects is somehow preventing the processor
from matching up a method signature.  If the error is not
occurring in the redacted code block above, I will give this script a
try, to see if I can reproduce and/or fix the error.

Regards,
Matt

[1] https://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html
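
The byte-oriented approach described above can be sketched in plain Python,
with io.BytesIO objects standing in for the flowfile streams. This is an
illustration only, not a tested ExecuteScript script: the ByteCallback class,
the <date> tag and the rewrite rule are all hypothetical, and the Jython
equivalent of the read step is an untested assumption noted in the comments.

```python
import io
import re

# Hypothetical pattern; a bytes pattern matches bytes input with no decoding.
DATE_TAG = re.compile(br"<date>(\d{4})/(\d{2})/(\d{2})</date>")


class ByteCallback(object):
    """Stand-in with the same process() shape as a NiFi StreamCallback."""

    def process(self, input_stream, output_stream):
        # Read the whole content as bytes. Inside ExecuteScript the stream is
        # a Java InputStream, and something like
        # IOUtils.toByteArray(inputStream) would presumably replace read()
        # (untested assumption).
        data = input_stream.read()
        # Rewrite dates as YYYY-MM-DD; bytes outside the match are untouched.
        data = DATE_TAG.sub(br"<date>\1-\2-\3</date>", data)
        output_stream.write(data)


source = io.BytesIO(b"<date>2017/02/02</date>\x00\xff\xfe\x80")
sink = io.BytesIO()
ByteCallback().process(source, sink)
output = sink.getvalue()
```

Because the content is never decoded, the binary bytes after the tag pass
through exactly as they arrived.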


On Thu, Feb 2, 2017 at 6:19 PM, James McMahon <js...@gmail.com> wrote:
> This is very helpful Russell, but in my case each file is a mix of data
> types. So even if I determine that the flowfile is a mix, I'd still have to
> be poised to tackle it in my ExecuteScript script. Good suggestion, though,
> and one I can use in other ways in my workflows.
>
> I do hope someone can tell me what I can do in my callback's write-back to
> handle all of these cases. I'd like to better understand this error I'm
> getting, too.  -Jim

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by James McMahon <js...@gmail.com>.
Another excellent suggestion. I did try this, but the problem I encountered
is that the various data types are intermingled throughout the files, and
in no fixed locations. Splitting turned out to be problematic. This would be
a very good way to solve my problem if the files had some reliable
demarcation point. Thank you again, Russell.

On Thu, Feb 2, 2017 at 7:14 PM, Russell Bateman <ru...@windofkeltia.com>
wrote:

> There is also a *SplitContent* processor. Assuming you can recognize the
> boundaries of the different data types, you can split them up into separate
> flowfiles. Then you *MergeContent* them back together later.

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by Russell Bateman <ru...@windofkeltia.com>.
There is also a /SplitContent/ processor. Assuming you can recognize the 
boundaries of the different data types, you can split them up into 
separate flowfiles. Then you /MergeContent/ them back together later.


On 02/02/2017 04:19 PM, James McMahon wrote:
> This is very helpful Russell, but in my case each file is a mix of
> data types. So even if I determine that the flowfile is a mix, I'd
> still have to be poised to tackle it in my ExecuteScript script. Good
> suggestion, though, and one I can use in other ways in my workflows.
>
> I do hope someone can tell me what I can do in my callback's write-back
> to handle all of these cases. I'd like to better understand this error
> I'm getting, too.  -Jim


Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by James McMahon <js...@gmail.com>.
This is very helpful Russell, but in my case each file is a mix of data
types. So even if I determine that the flowfile is a mix, I'd still have to
be poised to tackle it in my ExecuteScript script. Good suggestion, though,
and one I can use in other ways in my workflows.

I do hope someone can tell me what I can do in my callback's write-back to
handle all of these cases. I'd like to better understand this error I'm
getting, too.
 -Jim

On Thu, Feb 2, 2017 at 6:02 PM, Russell Bateman <ru...@windofkeltia.com>
wrote:

> Could you use *RouteOnContent* to determine what sort of content you're
> dealing with, then branch to different *ExecuteScript* processors rigged
> to different Python scripts?
>
> Hope this comment is helpful.

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by Russell Bateman <ru...@windofkeltia.com>.
Could you use /RouteOnContent/ to determine what sort of content you're 
dealing with, then branch to different /ExecuteScript/ processors rigged 
to different Python scripts?

Hope this comment is helpful.


On 02/02/2017 03:38 PM, James McMahon wrote:
>
> I have a flowfile that has tagged character information I need to get 
> at throughout the first few sections of the file. I need to use regex 
> in python to select some of those values and to transform others. I am 
> using an ExecuteScript processor to execute my python code. Here is my 
> approach:
>
> = = = = =
>
> class PyStreamCallback(StreamCallback) :
>
> def __init__ (self) :
>
> def process(self, inputSteam, outputStream) :
>
> stuff = IOUtils.toString(inputStream, StandardCharsets.UTF_8) # what 
> happens to my binary and extreme chars when they get passed through 
> this step?
>
> .
>
> . (transform and pick out select content)
>
> .
>
> outputStream.write(bytearray(stuff.encode(\u2018utf-8\u2019))))     # am I using 
> the wrong functions to put my text chars and my binary and my extreme 
> chars back on the stream as a byte stream? What should I be doing to 
> handle the variety of data?
>
> flowFile = session.get()
>
> if (flowFile!= None)
>
> incoming = flowFile.getAttribute(\u2018filename\u2019)
>
> logging.info <http://logging.info>(\u2018about to process file: %s\u2019, incoming)
>
> flowFile = session.write(flowFile, PyStreamCallback())   # line 155 in 
> my code
>
> session.transfer(flowFile, REL_SUCCESS)
>
> session.commit()
>
> = = = = =
>
> When my incoming flowfile is all character content - such as tagged 
> xml - my code works fine. All the flowfiles that also contain some 
> binary data and/or characters at the extremes such as foreign language 
> characters don’t work. They error out. I suspect it has to do with the 
> way I am writing back to the flowfile stream.
>
> Here is the error I am getting:
>
> Org.apache.nifi.processor.exception.ProcessException: 
> javax.script.ScriptException: TypeError: write(): 1st arg can’t be 
> coerced to int, byte[] in <script> at line number 155
>
> How should I handle the write back to the flowfile in cases where I 
> have a mix of character and binary?
>
> Note: I must do this programmatically. I tried using a combination of 
> SplitContent and MergeContent, but I have no consistent reliable 
> demarcation between the regular text characters and the other more 
> challenging characters that I can split on.
>
> All the examples I've found handle more pure circumstances than mine 
> seems to be. For example, all text. Or all JSON. I've not yet been 
> able to find an example that shows me how to write back to the stream 
> for mixed data situations. Can you help?


Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by James McMahon <js...@gmail.com>.
There are still issues even without bytearray(), Matt. I tried using a
json function in its place to format my data as JSON, and the error still
occurs.

I still have this problem but have implemented a temporary workaround. I
don't think it is a very good one, as it turns out. As one of our previous
collaborators suggested, I use SplitContent and then later, after operating
on just the text data in the header, I use MergeContent to bring the pieces
back together. The problem with this is that there are limits to the number
of "split" flowfiles you can try to bring back at any one time. To raise
that limit to 15-20K, you need to increase a parameter in nifi.properties.
If I bump it up to 20,000, let's say, then as soon as fragment 20,001
appears it rolls the oldest one off to Failure. I can't have this happen. My
flow volume is much too high to throttle it down to this level. While I got
it to work for the time being by restricting my ListFile prior to my
FetchFile, my approach will not scale to my customer's needs.

I hope this makes some modest sense. I am typing all this in here from home
without my NiFi flow and details in front of me. Cheers and thanks again for
any future insights you may have. -Jim Mc.
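
One way to avoid splitting the flowfile at all, sketched here under assumptions, is to run the regex directly on the raw bytes so that only the matched tags change and every other byte passes through untouched. The tag name <name>, the upper-casing transform, and the function name transform are hypothetical; how the bytes are read from and written back to the NiFi streams is left out.

```python
import re

# Hypothetical tag; the thread does not say which tags are selected or changed.
NAME_TAG = re.compile(br"<name>(.*?)</name>", re.DOTALL)


def transform(raw):
    """Upper-case the ASCII letters inside each <name> tag; leave all other bytes as they are."""
    return NAME_TAG.sub(
        lambda m: b"<name>" + m.group(1).upper() + b"</name>", raw
    )


# Text header followed by bytes that are not valid UTF-8.
binary_tail = b"\x00\xff\xfe\x80"
sample = b"<name>report</name>\n" + binary_tail
result = transform(sample)
```

Because the content is never decoded to a character string, no charset decision is needed and there are no fragments to merge afterwards.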

On Fri, Feb 3, 2017 at 10:39 PM, Matt Burgess <ma...@apache.org> wrote:

> James,
>
> I haven't had a chance to dig into this yet, but one thing I noticed
> about your script was an issue identified by Bryan Rosander (NiFi
> committer and all-around good guy :) as the probable cause of the
> TypeError, namely the calling of bytearray() after encode() (the
> latter of which already returns a byte array) [1]. Does removing the
> call to bytearray() fix your script, or are there still issues with
> decoding the input stream?
>
> Regards,
> Matt
>
> [1] https://community.hortonworks.com/questions/81291/nifi-executescript-processor-error-using-string-in.html
>
>

Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Posted by Matt Burgess <ma...@apache.org>.
James,

I haven't had a chance to dig into this yet, but one thing I noticed
about your script was an issue identified by Bryan Rosander (NiFi
committer and all-around good guy :) as the probable cause of the
TypeError, namely the calling of bytearray() after encode() (the
latter of which already returns a byte array) [1]. Does removing the
call to bytearray() fix your script, or are there still issues with
decoding the input stream?

Regards,
Matt

[1] https://community.hortonworks.com/questions/81291/nifi-executescript-processor-error-using-string-in.html
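
On the decoding question, the following plain-Python sketch illustrates what can happen to mixed content. It assumes the decoder substitutes U+FFFD for malformed input, which is Python's "replace" mode and, to my understanding, also the default behaviour of Java's stream readers; the sample bytes are made up.

```python
# Valid UTF-8 text ("café" inside a tag) followed by bytes that are not valid UTF-8.
mixed = b"<title>caf\xc3\xa9</title>" + b"\xff\xd8\x00\x80"

# Decoding as UTF-8 with replacement turns each malformed byte into U+FFFD,
# so re-encoding cannot reproduce the original bytes.
via_utf8 = mixed.decode("utf-8", "replace").encode("utf-8")

# ISO-8859-1 maps every byte value to exactly one code point,
# so decoding and re-encoding returns the original bytes.
via_latin1 = mixed.decode("iso-8859-1").encode("iso-8859-1")

utf8_round_trip_ok = (via_utf8 == mixed)
latin1_round_trip_ok = (via_latin1 == mixed)
```

Under these assumptions, the valid "foreign language" characters survive a UTF-8 round trip, but the binary portion does not, whereas a single-byte charset such as ISO-8859-1 keeps every byte at the cost of showing multi-byte characters as separate code points while the text is being edited.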


On Thu, Feb 2, 2017 at 5:38 PM, James McMahon <js...@gmail.com> wrote:
> I have a flowfile that has tagged character information I need to get at
> throughout the first few sections of the file. I need to use regex in python
> to select some of those values and to transform others. I am using an
> ExecuteScript processor to execute my python code. Here is my approach:
>
>
>
> = = = = =
>
> class PyStreamCallback(StreamCallback) :
>
>    def __init__ (self) :
>
>    def process(self, inputSteam, outputStream) :
>
>       stuff = IOUtils.toString(inputStream, StandardCharsets.UTF_8)  # what
> happens to my binary and extreme chars when they get passed through this
> step?
>
>      .
>
>      . (transform and pick out select content)
>
>      .
>
>      outputStream.write(bytearray(stuff.encode(‘utf-8’))))     # am I using
> the wrong functions to put my text chars and my binary and my extreme chars
> back on the stream as a byte stream? What should I be doing to handle the
> variety of data?
>
>
>
> flowFile = session.get()
>
> if (flowFile!= None)
>
>    incoming = flowFile.getAttribute(‘filename’)
>
>    logging.info(‘about to process file: %s’, incoming)
>
>    flowFile = session.write(flowFile, PyStreamCallback())   # line 155 in my
> code
>
>    session.transfer(flowFile, REL_SUCCESS)
>
>    session.commit()
>
>
>
> = = = = =
>
>
>
> When my incoming flowfile is all character content - such as tagged xml - my
> code works fine. All the flowfiles that also contain some binary data and/or
> characters at the extremes such as foreign language characters don’t work.
> They error out. I suspect it has to do with the way I am writing back to the
> flowfile stream.
>
>
>
> Here is the error I am getting:
>
> Org.apache.nifi.processor.exception.ProcessException:
> javax.script.ScriptException: TypeError: write(): 1st arg can’t be coerced
> to int, byte[] in <script> at line number 155
>
>
>
> How should I handle the write back to the flowfile in cases where I have a
> mix of character and binary?
>
>
>
> Note: I must do this programmatically. I tried using a combination of
> SplitContent and MergeContent, but I have no consistent reliable demarcation
> between the regular text characters and the other more challenging
> characters that I can split on.
>
> All the examples I've found handle more pure circumstances than mine seems
> to be. For example, all text. Or all JSON. I've not yet been able to find an
> example that shows me how to write back to the stream for mixed data
> situations. Can you help?