Posted to dev@samza.apache.org by Nirmal Kumar <ni...@impetus.co.in> on 2013/10/17 15:39:13 UTC

Writing a simple KafkaProducer in Samza

Hi All,

I was referring to the hello-samza project and was able to run it successfully.
I was able to run all the jobs and also wrote a consumer task to listen to the kafka.wikipedia-stats topic.

I now want to write a Samza job that acts as a KafkaProducer and continuously publishes simple string messages to a topic.
Just like the WikipediaFeedStreamTask that reads Wikipedia events and publishes them to a topic.
I am not sure what value task.inputs should have in the config properties file.
The way I think of it is like a Java program publishing string messages to a Kafka topic.
How can I write such a Samza job?

Any pointers would be of great help.

Later on I want to read the same messages from a consumer, like WikipediaParserStreamTask does.
Referring to the hello-samza project, I was able to write a consumer task that reads messages from the topic (kafka.wikipedia-stats) by simply setting
task.class=samza.examples.wikipedia.task.TestConsumer
task.inputs=kafka.wikipedia-stats


Thanks,
-Nirmal

________________________________






NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.

Re: Writing a simple KafkaProducer in Samza

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Nirmal,

Your method of comparison makes sense.

For Samza, you just need to set up a kafka system in the configuration to
point to your brokers (you can take a look at the wikipedia-parser.properties
file for an example).
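
For reference, a minimal sketch of such a system definition (not copied from
hello-samza; the host names and ports are placeholders you would replace with
your own) might look something like this in the job's .properties file:

  # Hypothetical sketch: declare a system named "kafka" backed by Samza's Kafka implementation.
  systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
  # Where the consumer finds ZooKeeper and the producer finds the brokers (placeholders).
  systems.kafka.consumer.zookeeper.connect=localhost:2181/
  systems.kafka.producer.metadata.broker.list=localhost:9092

Any stream you then read from or send to as "kafka.<topic>" goes through this
system.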

As far as playing with state, the best place to look right now is the
hello-stateful-world.samza file in Samza's code base. This is a job config
for a job that uses state management. This feature is generally most
useful when you're doing joins (ad view stream + ad click stream), sorting
(sort messages by time over a 5 minute window, and emit top 10), or
aggregation (count page views by member id).
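
For the aggregation case, a rough sketch of what the task side could look like
is below. It assumes a key-value store named "page-view-counts" has been
declared under stores.* in the job config, and the class and key names are
made up for illustration:

  // Hypothetical example, loosely following hello-samza's task layout.
  package samza.examples.wikipedia.task;

  import org.apache.samza.config.Config;
  import org.apache.samza.storage.kv.KeyValueStore;
  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.task.InitableTask;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskContext;
  import org.apache.samza.task.TaskCoordinator;

  public class PageViewCounterTask implements StreamTask, InitableTask {
    private KeyValueStore<String, Integer> store;

    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
      // "page-view-counts" must match a stores.page-view-counts.* entry in the config.
      store = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
    }

    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
      // Assumes the message key carries the member id of the viewer.
      String memberId = (String) envelope.getKey();
      Integer count = store.get(memberId);
      store.put(memberId, count == null ? 1 : count + 1);
    }
  }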

For partitioning and parallelism, you just need to make sure that your
input streams have a partition count > 1 (either by setting the Kafka
broker default, or by manually creating topics with a partition count >
1), and then set yarn.container.count > 1 in your Samza job's config, as
well.
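
In config terms, a sketch of the relevant knobs (the values are placeholders)
would be something like:

  # Kafka broker default (server.properties), so new topics get several partitions:
  num.partitions=4
  # Samza job config: run more than one container so the partitions are split across them:
  yarn.container.count=2

Each container then processes its own subset of the input partitions in
parallel.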

Cheers,
Chris

On 10/21/13 12:33 AM, "Nirmal Kumar" <ni...@impetus.co.in> wrote:

>Hi Chris,
>
>Thanks a lot for the information.
>
>I am comparing Storm + Kafka 0.8 vs Samza.
>
>As part of the initial use case I am using the Kafka API to publish messages
>to the Kafka broker, say 40k, 50k...
>Then using the Storm spout I am consuming these messages.
>On the Samza side I am using a similar Kafka Consumer that consumes
>messages from Kafka.
>
>Is this use case fine for comparing Storm + Kafka 0.8 vs Samza, or do
>you think there are other things I need to consider?
>
>Also going forward I am planning to find some use cases where I can
>compare Samza's features like State Management, Partitioning and
>Parallelism, etc.
>Any pointers towards these specific use cases?
>
>Thanks,
>-Nirmal
>
>-----Original Message-----
>From: Chris Riccomini [mailto:criccomini@linkedin.com]
>Sent: Thursday, October 17, 2013 10:42 PM
>To: dev@samza.incubator.apache.org
>Subject: Re: Writing a simple KafkaProducer in Samza
>
>Hey Nirmal,
>
>Glad to hear that the hello-samza project is working for you.
>
>If I understand you correctly, I believe that you're saying you want to
>send messages to a Kafka topic, right? As you've pointed out, you can
>send messages to Kafka through Samza. You can also send messages to Kafka
>directly using the Kafka producer API. Which one you use depends on what
>your code is doing.
>
>With Samza, typically a task is sending messages to Kafka as a reaction
>to some other event (which triggers the process method). For example, in
>the Wiki example, we send a message to Kafka whenever an update happens
>on Wikipedia (via the IRC channel). In this example, we had to write a
>Wikipedia consumer for Samza, which implements the SystemConsumer API in
>Samza. This implementation reads messages from the Wikimedia IRC channel.
>You can see this implementation here:
>
>
>https://github.com/linkedin/hello-samza/blob/master/samza-wikipedia/src/main/java/samza/examples/wikipedia/system/WikipediaConsumer.java
>
>If you use Samza, you MUST have at least one task.input defined, which
>feeds messages to your StreamTask. Out of the box, Samza comes with a
>KafkaSystemConsumer implementation. The hello-samza project comes with
>the WikipediaSystemConsumer implementation. If you want to react to
>messages from another system, or feed, you'd have to implement this
>interface (and hopefully contribute it back :).
>
>The alternative approach would be to just send messages directly to Kafka
>using the Kafka API. This approach is more appropriate in cases that
>don't fit well with Samza's processing model (e.g. you can't easily
>implement the SystemConsumer API, you need to guarantee deployment on a
>specific host all the time, etc). For example, if you wanted to read
>syslog messages on a specific host, and send them to Kafka, it probably
>makes more sense to just write a simple Java main() method that creates a
>Kafka producer, polls syslog periodically, and calls producer.send()
>whenever a new message appears in the syslog.
>
>If you can be more specific about what you're doing, I can probably
>provide better advice.
>
>Cheers,
>Chris
>
>On 10/17/13 6:39 AM, "Nirmal Kumar" <ni...@impetus.co.in> wrote:
>
>>Hi All,
>>
>>I was referring to the hello-samza project and was able to run it
>>successfully.
>>I was able to run all the jobs and also wrote a consumer task to listen
>>to the kafka.wikipedia-stats topic.
>>
>>I now want to write a Samza job that acts as a KafkaProducer and
>>continuously publishes simple string messages to a topic.
>>Just like the WikipediaFeedStreamTask that reads Wikipedia events and
>>publishes them to a topic.
>>I am not sure what value task.inputs should have in the config properties
>>file.
>>The way I think of it is like a Java program publishing string messages to
>>a Kafka topic.
>>How can I write such a Samza job?
>>
>>Any pointers would be of great help.
>>
>>Later on I want to read the same messages from a consumer, like
>>WikipediaParserStreamTask does.
>>Referring to the hello-samza project, I was able to write a consumer task
>>that reads messages from the topic (kafka.wikipedia-stats) by simply setting
>>task.class=samza.examples.wikipedia.task.TestConsumer
>>task.inputs=kafka.wikipedia-stats
>>
>>
>>Thanks,
>>-Nirmal


RE: Writing a simple KafkaProducer in Samza

Posted by Nirmal Kumar <ni...@impetus.co.in>.
Hi Chris,

Thanks a lot for the information.

I am comparing Storm + Kafka 0.8 vs Samza.

As part of the initial use case I am using the Kafka API to publish messages to the Kafka broker, say 40k, 50k...
Then using the Storm spout I am consuming these messages.
On the Samza side I am using a similar Kafka Consumer that consumes messages from Kafka.

Is this use case fine for comparing Storm + Kafka 0.8 vs Samza, or do you think there are other things I need to consider?

Also going forward I am planning to find some use cases where I can compare Samza's features like State Management, Partitioning and Parallelism, etc.
Any pointers towards these specific use cases?

Thanks,
-Nirmal

-----Original Message-----
From: Chris Riccomini [mailto:criccomini@linkedin.com]
Sent: Thursday, October 17, 2013 10:42 PM
To: dev@samza.incubator.apache.org
Subject: Re: Writing a simple KafkaProducer in Samza

Hey Nirmal,

Glad to hear that the hello-samza project is working for you.

If I understand you correctly, I believe that you're saying you want to send messages to a Kafka topic, right? As you've pointed out, you can send messages to Kafka through Samza. You can also send messages to Kafka directly using the Kafka producer API. Which one you use depends on what your code is doing.

With Samza, typically a task is sending messages to Kafka as a reaction to some other event (which triggers the process method). For example, in the Wiki example, we send a message to Kafka whenever an update happens on Wikipedia (via the IRC channel). In this example, we had to write a Wikipedia consumer for Samza, which implements the SystemConsumer API in Samza. This implementation reads messages from the Wikimedia IRC channel.
You can see this implementation here:


https://github.com/linkedin/hello-samza/blob/master/samza-wikipedia/src/main/java/samza/examples/wikipedia/system/WikipediaConsumer.java

If you use Samza, you MUST have at least one task.input defined, which feeds messages to your StreamTask. Out of the box, Samza comes with a KafkaSystemConsumer implementation. The hello-samza project comes with the WikipediaSystemConsumer implementation. If you want to react to messages from another system, or feed, you'd have to implement this interface (and hopefully contribute it back :).

The alternative approach would be to just send messages directly to Kafka using the Kafka API. This approach is more appropriate in cases that don't fit well with Samza's processing model (e.g. you can't easily implement the SystemConsumer API, you need to guarantee deployment on a specific host all the time, etc). For example, if you wanted to read syslog messages on a specific host, and send them to Kafka, it probably makes more sense to just write a simple Java main() method that creates a Kafka producer, polls syslog periodically, and calls producer.send() whenever a new message appears in the syslog.

If you can be more specific about what you're doing, I can probably provide better advice.

Cheers,
Chris

On 10/17/13 6:39 AM, "Nirmal Kumar" <ni...@impetus.co.in> wrote:

>Hi All,
>
>I was referring to the hello-samza project and was able to run it
>successfully.
>I was able to run all the jobs and also wrote a consumer task to listen
>to the kafka.wikipedia-stats topic.
>
>I now want to write a Samza job that acts as a KafkaProducer and
>continuously publishes simple string messages to a topic.
>Just like the WikipediaFeedStreamTask that reads Wikipedia events and
>publishes them to a topic.
>I am not sure what value task.inputs should have in the config properties
>file.
>The way I think of it is like a Java program publishing string messages to
>a Kafka topic.
>How can I write such a Samza job?
>
>Any pointers would be of great help.
>
>Later on I want to read the same messages from a consumer, like
>WikipediaParserStreamTask does.
>Referring to the hello-samza project, I was able to write a consumer task
>that reads messages from the topic (kafka.wikipedia-stats) by simply setting
>task.class=samza.examples.wikipedia.task.TestConsumer
>task.inputs=kafka.wikipedia-stats
>
>
>Thanks,
>-Nirmal



Re: Writing a simple KafkaProducer in Samza

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Nirmal,

Glad to hear that the hello-samza project is working for you.

If I understand you correctly, I believe that you're saying you want to
send messages to a Kafka topic, right? As you've pointed out, you can send
messages to Kafka through Samza. You can also send messages to Kafka
directly using the Kafka producer API. Which one you use depends on what
your code is doing.

With Samza, typically a task is sending messages to Kafka as a reaction to
some other event (which triggers the process method). For example, in the
Wiki example, we send a message to Kafka whenever an update happens on
Wikipedia (via the IRC channel). In this example, we had to write a
Wikipedia consumer for Samza, which implements the SystemConsumer API in
Samza. This implementation reads messages from the Wikimedia IRC channel.
You can see this implementation here:

  
https://github.com/linkedin/hello-samza/blob/master/samza-wikipedia/src/main/java/samza/examples/wikipedia/system/WikipediaConsumer.java

If you use Samza, you MUST have at least one task.input defined, which
feeds messages to your StreamTask. Out of the box, Samza comes with a
KafkaSystemConsumer implementation. The hello-samza project comes with the
WikipediaSystemConsumer implementation. If you want to react to messages
from another system, or feed, you'd have to implement this interface (and
hopefully contribute it back :).
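
To make that concrete, here is a rough sketch (not from hello-samza; the
class, system, and topic names are all made up for illustration) of a
StreamTask that takes whatever string messages arrive on its input stream and
sends them on to a Kafka topic through the collector:

  // Hypothetical task: forwards incoming string messages to a Kafka output topic.
  package samza.examples.wikipedia.task;

  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.system.OutgoingMessageEnvelope;
  import org.apache.samza.system.SystemStream;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskCoordinator;

  public class StringForwarderStreamTask implements StreamTask {
    // "kafka" must match a configured system; "string-output" is a placeholder topic name.
    private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "string-output");

    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
      String message = (String) envelope.getMessage();
      collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, message));
    }
  }

The job config would then name this class in task.class and list at least one
input stream in task.inputs, just like the TestConsumer example earlier in the
thread.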

The alternative approach would be to just send messages directly to Kafka
using the Kafka API. This approach is more appropriate in cases that don't
fit well with Samza's processing model (e.g. you can't easily implement
the SystemConsumer API, you need to guarantee deployment on a specific
host all the time, etc). For example, if you wanted to read syslog
messages on a specific host, and send them to Kafka, it probably makes
more sense to just write a simple Java main() method that creates a Kafka
producer, polls syslog periodically, and calls producer.send() whenever a
new message appears in the syslog.
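
For that plain-Kafka route, a bare-bones producer against the 0.8 API might
look roughly like the sketch below (the broker address and topic name are
placeholders, and the actual syslog polling is left out):

  // Hypothetical standalone producer, not part of any Samza job.
  import java.util.Properties;

  import kafka.javaapi.producer.Producer;
  import kafka.producer.KeyedMessage;
  import kafka.producer.ProducerConfig;

  public class SyslogToKafka {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put("metadata.broker.list", "localhost:9092");             // placeholder broker list
      props.put("serializer.class", "kafka.serializer.StringEncoder"); // send plain strings

      Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));

      // A real version would poll syslog in a loop; this just sends a single message.
      producer.send(new KeyedMessage<String, String>("syslog-topic", "hello from syslog"));
      producer.close();
    }
  }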

If you can be more specific about what you're doing, I can probably
provide better advice.

Cheers,
Chris

On 10/17/13 6:39 AM, "Nirmal Kumar" <ni...@impetus.co.in> wrote:

>Hi All,
>
>I was referring to the hello-samza project and was able to run it
>successfully.
>I was able to run all the jobs and also wrote a consumer task to listen
>to the kafka.wikipedia-stats topic.
>
>I now want to write a Samza job that acts as a KafkaProducer and
>continuously publishes simple string messages to a topic.
>Just like the WikipediaFeedStreamTask that reads Wikipedia events and
>publishes them to a topic.
>I am not sure what value task.inputs should have in the config properties
>file.
>The way I think of it is like a Java program publishing string messages to
>a Kafka topic.
>How can I write such a Samza job?
>
>Any pointers would be of great help.
>
>Later on I want to read the same messages from a consumer, like
>WikipediaParserStreamTask does.
>Referring to the hello-samza project, I was able to write a consumer task
>that reads messages from the topic (kafka.wikipedia-stats) by simply setting
>task.class=samza.examples.wikipedia.task.TestConsumer
>task.inputs=kafka.wikipedia-stats
>
>
>Thanks,
>-Nirmal