Posted to user@storm.apache.org by churly lin <ch...@gmail.com> on 2014/01/08 08:24:56 UTC

questions about multilang bolt's STDIN&STDOUT

Hi all,

I am trying to write a topology with a KafkaSpout and a
ShellBolt (implemented in Python).
According to the
Multilang-protocol<https://github.com/nathanmarz/storm/wiki/Multilang-protocol>,
multilang uses JSON messages over stdin/stdout to communicate with the
subprocess. Specifically, *both ends of this protocol use a line-reading
mechanism.* Does that mean that, in multilang, we cannot emit a message as a
byte array? If it does not, how do I read a byte-array tuple in a Python bolt?
The JSON read by the Python bolt looks like this:

{
        "command": "emit",
        // The id for the tuple. Leave this out for an unreliable emit. The id can
        // be a string or a number.
        "id": "1231231",
        // The id of the stream this tuple was emitted to. Leave this empty to emit to default stream.
        "stream": "1",
        // If doing an emit direct, indicate the task to send the tuple to
        "task": 9,
        // All the values in this tuple
        "tuple": ["field1", 2, 3]
}

This example shows that the "tuple" can hold a string ("field1") and numbers (2,
3). Could it also hold a byte array?
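
For illustration, here is a minimal sketch of what a line-reading receiver might look like. It assumes each message is terminated by a line containing only "end", as the Multilang protocol page describes; read_message and fake_stdin are hypothetical names and not part of storm.py. The // comments from the example above are left out because they are not valid JSON. Since messages are framed by lines and parsed as JSON text, raw bytes would first have to be turned into a JSON-compatible value.

```python
import io
import json


def read_message(stream):
    """Collect lines until a line that is exactly "end", then parse the JSON."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            raise EOFError("stream closed before the 'end' marker")
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


# Simulated stdin: one message spread over two lines, then the "end" line.
fake_stdin = io.StringIO(
    '{"command": "emit", "id": "1231231", "stream": "1",\n'
    ' "task": 9, "tuple": ["field1", 2, 3]}\n'
    "end\n"
)

msg = read_message(fake_stdin)
print(msg["tuple"])  # ['field1', 2, 3]
```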

Re: questions about multilang bolt's STDIN&STDOUT

Posted by churly lin <ch...@gmail.com>.
Thank you Verardi!
Sorry that my poor English made my question ambiguous. Your answer
is very clear. Now I know that byte arrays can be used as "values" for
the field "tuple".
One more question. In my project, I emit byte-array tuples in KafkaSpout
like this:
    // tup holds the message payload as a byte array
    List<Object> tup =
        _spoutConfig.scheme.deserialize(Utils.toByteArray(toEmit.msg.payload()));
    collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));
But when I tried to read the tuple in the ShellBolt, readMsgs() threw an exception.
I wrote the emitted JSON message to a file, and it looks like this:
*{"id":"7617035644022584549","stream":"default","comp":"KafkaSpout","tuple":[[B@6d695bcc],"task":1}*
It looks very weird to me. What is the *[[B@6d695bcc]*? Is it the byte
array's object address? Can it be read by Python?
Going further, if I insist on emitting byte arrays in KafkaSpout, what
should I do to read the messages in the Python bolt?

Thanks again.
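
As a small check of the question above, the sketch below feeds the logged line to Python's json module. It relies on one fact about Java: the default toString() of a byte[] is "[B@" followed by a hex hash code, so the token identifies the array object and contains none of its bytes. The variable names are hypothetical, and whether this is the exact exception raised inside the bolt is an assumption.

```python
import json

# The line captured in the message above. "[B@6d695bcc" is Java's default
# string form of a byte[] object, not the array's contents.
logged_line = ('{"id":"7617035644022584549","stream":"default",'
               '"comp":"KafkaSpout","tuple":[[B@6d695bcc],"task":1}')

try:
    json.loads(logged_line)
    parsed_ok = True
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    parsed_ok = False
    print("not valid JSON:", exc)
```

The line is rejected by the parser, and the original payload cannot be recovered from it, so the bytes would need to be converted to something JSON can represent before they are emitted.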



Re: questions about multilang bolt's STDIN&STDOUT

Posted by Ruhollah Farchtchi <ru...@gmail.com>.
I have had to do this for image data and, per Antonio’s suggestion, I am encoding my byte arrays to base64 and decoding them on the other side. I’m using the Clojure DSL and I’ve found it to be fairly performant (we have more optimizing to do on our image processing side).
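
A rough Python sketch of the base64 approach follows (the setup described above uses the Clojure DSL, so this is only an illustration). encode_field and decode_field are hypothetical helper names, and the byte string is a stand-in for real image data.

```python
import base64
import json


def encode_field(raw: bytes) -> str:
    """Turn raw bytes into an ASCII string that JSON can carry."""
    return base64.b64encode(raw).decode("ascii")


def decode_field(text: str) -> bytes:
    """Recover the original bytes from a base64 string field."""
    return base64.b64decode(text)


# Stand-in for image data: 3072 arbitrary bytes.
image_bytes = bytes(range(256)) * 12

field = encode_field(image_bytes)
line = json.dumps({"command": "emit", "tuple": [field]})

# The serialized message contains no raw newline, so it suits line reading.
single_line = "\n" not in line

restored = decode_field(json.loads(line)["tuple"][0])
print(len(image_bytes), len(field))  # 3072 4096
```

Base64 emits 4 characters for every 3 input bytes, so the encoded field is about one third larger than the raw data.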

Ruhollah Farchtchi
ruhollah.farchtchi@gmail.com





Re: questions about multilang bolt's STDIN&STDOUT

Posted by Antonio Verardi <an...@yelp.com>.
Hi,

I use the multilang interface for Python extensively. JSON is the way
you serialize things for communication. It adds a fair amount of
overhead, but it is a reasonable design choice for a multilang
interface.

If your question is: can I read byte array messages from a bolt (made up by
command, id, stream, task and tuple), the answer is "that's not that easy,
you should implement something in order to do that".

If your question is: can I serialize byte arrays in JSON with Python and
use them as "values" for the field "tuple", the answer is: "yes, even
though JSON always produces string objects"
[http://docs.python.org/3.3/library/json.html#basic-usage]. You may want to
modify storm.py in order to do that, or simply encode and decode your data
within your own bolt; it depends on your needs.

This is something I found just by googling about encoding binary data in JSON:
http://bytes.com/topic/python/answers/681314-simplejson-pack-binary-data
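
One way the "modify the serialization" option could look, as a sketch only: the json module's default and object_hook parameters let bytes be wrapped and unwrapped at the serialization boundary. This is written for Python 3; the "__b64__" marker key and both function names are made up for illustration and are not part of storm.py.

```python
import base64
import json

_TAG = "__b64__"  # hypothetical marker key, not defined by Storm


def encode_bytes(obj):
    """json.dumps `default` hook: wrap bytes as a tagged base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return {_TAG: base64.b64encode(bytes(obj)).decode("ascii")}
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def decode_bytes(dct):
    """json.loads `object_hook`: unwrap tagged values back into bytes."""
    if set(dct) == {_TAG}:
        return base64.b64decode(dct[_TAG])
    return dct


message = {"command": "emit", "tuple": ["field1", 2, b"\x89PNG\r\n"]}

try:
    json.dumps(message)  # plain dumps rejects bytes in Python 3
    plain_ok = True
except TypeError:
    plain_ok = False

wire = json.dumps(message, default=encode_bytes)
back = json.loads(wire, object_hook=decode_bytes)
```

The trade-off is that both ends must agree on the marker key, whereas encoding inside your own bolt keeps the wire format as plain strings.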

I hope it was what you were looking for,
Antonio Uccio Verardi



