Posted to user@storm.apache.org by praveen reddy <pr...@gmail.com> on 2016/07/02 01:46:08 UTC

storm usage and design question

Hi All,

I am new to Storm and Kafka and am working on a POC.

My requirement is to get a message from Kafka in JSON format, have a spout read that message, a first bolt convert the JSON message to a different format such as CSV, and a second bolt save it to Hadoop.

I came up with an initial design where I use a KafkaSpout to read the Kafka topic, a bolt to convert the message to a CSV file, and a next bolt to save it in Hadoop.

I have the following questions:
Can the first bolt, which converts the message to a CSV file, omit it? The file would be saved on disk. Can a file which is saved on disk be omitted?
How does the second bolt read the file which is saved on disk by the first bolt?
Do we need to serialize messages omitted by the spout and/or bolt?

Sorry if the questions sound silly; this is my first topology and I have minimal knowledge of Storm.

If you can think of a proper design for implementing my requirement, please let me know.

Thanks in advance.

-Praveen

Re: storm usage and design question

Posted by Satish Duggana <sa...@gmail.com>.
Hi,
You can follow the link below for instructions on using the HDFS bolt.

http://storm.apache.org/releases/1.0.0/storm-hdfs.html
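
In case a concrete example helps, here is a minimal sketch along the lines of that page (the namenode URL, output path, and component names are placeholders, not taken from your topology):

    import org.apache.storm.hdfs.bolt.HdfsBolt;
    import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
    import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
    import org.apache.storm.hdfs.bolt.format.FileNameFormat;
    import org.apache.storm.hdfs.bolt.format.RecordFormat;
    import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
    import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
    import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

    // write comma-delimited records, sync to HDFS every 1000 tuples,
    // and roll to a new file once the current one reaches 5 MB
    RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter(",");
    SyncPolicy syncPolicy = new CountSyncPolicy(1000);
    FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);
    FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/storm/output/");

    HdfsBolt hdfsBolt = new HdfsBolt()
            .withFsUrl("hdfs://namenode-host:8020")
            .withFileNameFormat(fileNameFormat)
            .withRecordFormat(format)
            .withRotationPolicy(rotationPolicy)
            .withSyncPolicy(syncPolicy);

    builder.setBolt("Savebolt", hdfsBolt).shuffleGrouping("TransformBolt");

The HdfsBolt writes out the fields of the tuples it receives, so the upstream bolt should emit each CSV line (or the individual columns) as tuple values rather than writing a file to local disk first.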

Thanks,
Satish.



Re: storm usage and design question

Posted by praveen reddy <on...@gmail.com>.
Thanks for the response. Can you please help me with how to emit CSV data using a bolt? I was able to read the JSON data from Kafka and convert the data into a Java object. I created a utility class to convert the Java object into a CSV file. Now I want to write that CSV file (which I stored on disk) onto HDFS using a bolt. Any link to documentation on how to do this would be helpful; I searched Google but couldn't find relevant info.


Re: storm usage and design question

Posted by Harsha Chintalapani <st...@harsha.io>.
“Bolts can emit data even without having to write to disk (I think there’s
a 2MB limit to the size of that data that can be emitted, because Thrift
can’t handle more than that)."
There is no such limit. Between workers Storm uses Netty channels, and communication between components inside the same JVM goes through the disruptor queue.
If you need to increase the size of the Netty buffers, take a look at the Netty configs in storm.yaml. We recommend going with the defaults.
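For reference, the main setting there is storm.messaging.netty.buffer_size, which (if I remember the defaults correctly) is 5242880 bytes, i.e. 5 MB, by default.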
Thanks,
Harsha


Re: storm usage and design question

Posted by Nathan Leung <nc...@gmail.com>.
Double check how you are pushing data into Kafka. You are probably pushing
one line at a time.
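To add a little detail: kafka-console-producer.sh treats every line you type or paste as a separate message, so a pretty-printed JSON document pasted at the prompt becomes one message per line. If you send the JSON as a single line, the spout should see it as one tuple. A rough sketch, with placeholder broker and topic names and assuming jq is available to compact the JSON:

    jq -c . record.json | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-topic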

Re: storm usage and design question

Posted by Navin Ipe <na...@searchlighthealth.com>.
I haven't worked with Kafka, so perhaps someone else here would be able to
help you with it.
What I could suggest though, is to search for how to emit more than one
sentence using the Kafka spout.

If you still can emit only one sentence, then I'd recommend not using a
separate SaveBolt. Instead, use FieldsGrouping where you group tuples based
on the name of the CSV file, and emit sentences to TransformBolt. When
TransformBolt completes receiving all tuples from a CSV, it can save to
HDFS.

If you still want to use a separate TransformBolt and SaveBolt, then use
fields grouping as I mentioned above when emitting to both bolts. This way,
you can have multiple spouts which read from multiple files, and whatever
they emit will go only to specific bolts.
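
A minimal sketch of the wiring I mean, assuming your spout (or an upstream bolt) declares and emits a field carrying the CSV file name ("csv-file-name" below is just a placeholder):

    import org.apache.storm.tuple.Fields;

    builder.setBolt("TransformBolt", new TransformationBolt(), 4)
           .fieldsGrouping("kafka-spout", new Fields("csv-file-name"));
    builder.setBolt("SaveBolt", new SaveBolt(), 4)
           .fieldsGrouping("TransformBolt", new Fields("csv-file-name"));

With fields grouping, every tuple that carries the same file name lands on the same task, so one task sees all the lines of a given CSV before writing it out.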




-- 
Regards,
Navin

Re: storm usage and design question

Posted by praveen reddy <on...@gmail.com>.
Want to add a bit more:
I am posting the JSON data using the kafka-console-producer.sh script, by copying
the JSON data and pasting it on the console.


Re: storm usage and design question

Posted by praveen reddy <on...@gmail.com>.
Thanks Navin for the response; I was using my mobile so I couldn't see the typos. Here
is my requirement. This is my first POC on Kafka/Storm, so please help me
if I can design it in a better way.

I need to read JSON data from Kafka, then convert the JSON data to a CSV
file and save it on HDFS.

This is how I did the initial design, and I am having a lot of issues.

        builder.setSpout("kafka-spout", new KafkaSpout(kafkaSpoutConfig));
        builder.setBolt("TransformBolt", new TransformationBolt()).shuffleGrouping("kafka-spout");
        builder.setBolt("Savebolt", new SaveBolt()).shuffleGrouping("TransformBolt");

The KafkaSpout reads the data from the Kafka topic, the TransformationBolt converts
the JSON to a CSV file, and the SaveBolt saves the CSV file.

The KafkaSpout was able to read data from the Kafka topic. What I was expecting
from the spout was to get the complete JSON data, but I am getting one line at a time
from the JSON data I sent to the topic.

Here is my transform bolt:
    @Override
    public void execute(Tuple input) {
        String sentence = input.getString(0);
        collector.emit(new Values(sentence));
        System.out.println("emitted " + sentence);
    }

I was expecting getString(0) to return the complete JSON data, but I am getting
only one line at a time.

I am also not sure how to emit the CSV file so that the SaveBolt would save it.

Can you please let me know how to get the complete JSON data in a single tuple
rather than line by line, and how to emit a CSV file from a bolt? And if you
can help me design this better, it would be really helpful.



Re: storm usage and design question

Posted by Navin Ipe <na...@searchlighthealth.com>.
Dear Praveen,

The questions aren't silly, but it is rather tough to understand what you
are trying to convey. When you say "omit", do you mean "emit"?
Bolts can emit data even without having to write to disk (I think there's a
2MB limit to the size of that data that can be emitted, because Thrift
can't handle more than that).
If you want one bolt to write to disk and then want another bolt to read
from disk, then that's also possible.
The first bolt can just send the second bolt whatever information is
necessary to read the file.
As far as I know, basic datatypes will automatically get serialized. If
you have a more complex class, then serialize it with Serializable.
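
A small illustration of what I mean (MyRecord below is only a placeholder class; Storm serializes tuple fields with Kryo, so another option is to register the class explicitly):

    import org.apache.storm.Config;

    Config conf = new Config();
    // either make MyRecord implement java.io.Serializable,
    // or register it with Storm's Kryo serialization:
    conf.registerSerialization(MyRecord.class);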

If you could re-phrase your question and make it clearer, people here would
be able to help you better.






-- 
Regards,
Navin