Posted to users@kafka.apache.org by Murtaza Doctor <mu...@richrelevance.com> on 2012/06/27 07:07:32 UTC

Kafka - Avro Encoder

Hello Folks,

We are currently evaluating Kafka and had a few questions around the
Encoder functionality.
Our data is in Avro format, and we wish to send it to the broker in this
format as well and eventually write it to HDFS. As documented, we realize
that we need a custom Encoder to create the Message object.

Questions we had:
- Is there any sample code around this, since it is probably a common
use-case? That is, is there a CustomAvroEncoder we can use out of the
box, or any chance one could be open-sourced?
- In terms of internals: are we converting Avro into a byte stream,
creating a Message object, and then writing to the queue? Does this incur
any overhead in your opinion?
- Any best practices around this, or how would others approach this problem?

If there is any value we would definitely like to see this added to the
FAQs or even part of some sample code.

Thanks,
murtaza



Re: Hadoop Consumer

Posted by Min <mi...@gmail.com>.
ConsumerConfig is in Kafka's main trunk.

As I used the same package namespace, kafka.consumer (admittedly not a
good approach), I didn't have to import it explicitly.

The kafka jar is not in the Maven repository, so you might have to
install it into your local Maven repository:

> mvn install:install-file -Dfile=kafka-0.7.0.jar -DgroupId=kafka -DartifactId=kafka -Dversion=0.7.0 -Dpackaging=jar

Thanks
Min

2012/7/13 Murtaza Doctor <mu...@richrelevance.com>:
> Hello Min,
>
> In your github project source code are you missing the ConsumerConfig
> class? I was trying to download and play with the source code.
>
> Thanks,
> murtaza
>
> On 7/3/12 6:29 PM, "Min" <mi...@gmail.com> wrote:
>
>>I've created another hadoop consumer which uses zookeeper.
>>
>>https://github.com/miniway/kafka-hadoop-consumer
>>
>>With a hadoop OutputFormatter, I could add new files to the existing
>>target directory.
>>Hope this would help.
>>
>>Thanks
>>Min
>>
>>2012/7/4 Murtaza Doctor <mu...@richrelevance.com>:
>>> +1 This surely sounds interesting.
>>>
>>> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>>>
>>>>Hmm that's surprising. I didn't know about that...!
>>>>
>>>>I wonder if it's a new feature... Judging from your email, I assume
>>>>you're
>>>>using CDH? What version?
>>>>
>>>>Interesting :) ...
>>>>
>>>>--
>>>>Felix
>>>>
>>>>
>>>>
>>>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>>>>Casey.Sybrandy@six3systems.com> wrote:
>>>>
>>>>> >> - Is there a version of consumer which appends to an existing file
>>>>>on
>>>>> HDFS
>>>>> >> until it reaches a specific size?
>>>>> >>
>>>>> >
>>>>> >No there isn't, as far as I know. Potential solutions to this would
>>>>>be:
>>>>> >
>>>>> >   1. Leave the data in the broker long enough for it to reach the
>>>>>size
>>>>> you
>>>>> >   want. Running the SimpleKafkaETLJob at those intervals would give
>>>>>you
>>>>> the
>>>>> >   file size you want. This is the simplest thing to do, but the
>>>>>drawback
>>>>> is
>>>>> >   that your data in HDFS will be less real-time.
>>>>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>>>>roll
>>>>> up
>>>>> >   / compact your small files into one bigger file. You would need to
>>>>> come up
>>>>> >   with the hadoop job that does the roll up, or find one somewhere.
>>>>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>>>> makes
>>>>> >   use of hadoop append instead...
>>>>> >
>>>>> >Also, you may be interested to take a look at these
>>>>> >scripts<
>>>>>
>>>>>http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
>>>>> >I
>>>>> >posted a while ago. If you follow the links in this post, you can get
>>>>> >more details about how the scripts work and why it was necessary to
>>>>>do
>>>>>the
>>>>> >things it does... or you can just use them without reading. They
>>>>>should
>>>>> >work pretty much out of the box...
>>>>>
>>>>> Where I work, we discovered that you can keep a file in HDFS open and
>>>>> still run MapReduce jobs against the data in that file.  What you do
>>>>>is
>>>>>you
>>>>> flush the data periodically (every record for us), but you don't close
>>>>>the
>>>>> file right away.  This allows us to have data files that contain 24
>>>>>hours
>>>>> worth of data, but not have to close the file to run the jobs or to
>>>>> schedule the jobs for after the file is closed.  You can also check
>>>>>the
>>>>> file size periodically and rotate the files based on size.  We use
>>>>>Avro
>>>>> files, but sequence files should work too according to Cloudera.
>>>>>
>>>>> It's a great compromise for when you want the latest and greatest
>>>>>data,
>>>>> but don't want to have to wait until all of the files are closed to
>>>>>get
>>>>>it.
>>>>>
>>>>> Casey
>>>
>

Re: Hadoop Consumer

Posted by Murtaza Doctor <mu...@richrelevance.com>.
Hello Min,

In your github project source code are you missing the ConsumerConfig
class? I was trying to download and play with the source code.

Thanks,
murtaza

On 7/3/12 6:29 PM, "Min" <mi...@gmail.com> wrote:

>I've created another hadoop consumer which uses zookeeper.
>
>https://github.com/miniway/kafka-hadoop-consumer
>
>With a hadoop OutputFormatter, I could add new files to the existing
>target directory.
>Hope this would help.
>
>Thanks
>Min
>
>2012/7/4 Murtaza Doctor <mu...@richrelevance.com>:
>> +1 This surely sounds interesting.
>>
>> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>>
>>>Hmm that's surprising. I didn't know about that...!
>>>
>>>I wonder if it's a new feature... Judging from your email, I assume
>>>you're
>>>using CDH? What version?
>>>
>>>Interesting :) ...
>>>
>>>--
>>>Felix
>>>
>>>
>>>
>>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>>>Casey.Sybrandy@six3systems.com> wrote:
>>>
>>>> >> - Is there a version of consumer which appends to an existing file
>>>>on
>>>> HDFS
>>>> >> until it reaches a specific size?
>>>> >>
>>>> >
>>>> >No there isn't, as far as I know. Potential solutions to this would
>>>>be:
>>>> >
>>>> >   1. Leave the data in the broker long enough for it to reach the
>>>>size
>>>> you
>>>> >   want. Running the SimpleKafkaETLJob at those intervals would give
>>>>you
>>>> the
>>>> >   file size you want. This is the simplest thing to do, but the
>>>>drawback
>>>> is
>>>> >   that your data in HDFS will be less real-time.
>>>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>>>roll
>>>> up
>>>> >   / compact your small files into one bigger file. You would need to
>>>> come up
>>>> >   with the hadoop job that does the roll up, or find one somewhere.
>>>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>>> makes
>>>> >   use of hadoop append instead...
>>>> >
>>>> >Also, you may be interested to take a look at these
>>>> >scripts<
>>>>
>>>>http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
>>>> >I
>>>> >posted a while ago. If you follow the links in this post, you can get
>>>> >more details about how the scripts work and why it was necessary to
>>>>do
>>>>the
>>>> >things it does... or you can just use them without reading. They
>>>>should
>>>> >work pretty much out of the box...
>>>>
>>>> Where I work, we discovered that you can keep a file in HDFS open and
>>>> still run MapReduce jobs against the data in that file.  What you do
>>>>is
>>>>you
>>>> flush the data periodically (every record for us), but you don't close
>>>>the
>>>> file right away.  This allows us to have data files that contain 24
>>>>hours
>>>> worth of data, but not have to close the file to run the jobs or to
>>>> schedule the jobs for after the file is closed.  You can also check
>>>>the
>>>> file size periodically and rotate the files based on size.  We use
>>>>Avro
>>>> files, but sequence files should work too according to Cloudera.
>>>>
>>>> It's a great compromise for when you want the latest and greatest
>>>>data,
>>>> but don't want to have to wait until all of the files are closed to
>>>>get
>>>>it.
>>>>
>>>> Casey
>>


Re: Hadoop Consumer

Posted by Min <mi...@gmail.com>.
I've created another Hadoop consumer which uses ZooKeeper.

https://github.com/miniway/kafka-hadoop-consumer

With a Hadoop OutputFormat, I could add new files to the existing
target directory.
Hope this helps.

Thanks
Min

2012/7/4 Murtaza Doctor <mu...@richrelevance.com>:
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
>>Hmm that's surprising. I didn't know about that...!
>>
>>I wonder if it's a new feature... Judging from your email, I assume you're
>>using CDH? What version?
>>
>>Interesting :) ...
>>
>>--
>>Felix
>>
>>
>>
>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>>Casey.Sybrandy@six3systems.com> wrote:
>>
>>> >> - Is there a version of consumer which appends to an existing file on
>>> HDFS
>>> >> until it reaches a specific size?
>>> >>
>>> >
>>> >No there isn't, as far as I know. Potential solutions to this would be:
>>> >
>>> >   1. Leave the data in the broker long enough for it to reach the size
>>> you
>>> >   want. Running the SimpleKafkaETLJob at those intervals would give
>>>you
>>> the
>>> >   file size you want. This is the simplest thing to do, but the
>>>drawback
>>> is
>>> >   that your data in HDFS will be less real-time.
>>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>>roll
>>> up
>>> >   / compact your small files into one bigger file. You would need to
>>> come up
>>> >   with the hadoop job that does the roll up, or find one somewhere.
>>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>> makes
>>> >   use of hadoop append instead...
>>> >
>>> >Also, you may be interested to take a look at these
>>> >scripts<
>>>
>>>http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
>>> >I
>>> >posted a while ago. If you follow the links in this post, you can get
>>> >more details about how the scripts work and why it was necessary to do
>>>the
>>> >things it does... or you can just use them without reading. They should
>>> >work pretty much out of the box...
>>>
>>> Where I work, we discovered that you can keep a file in HDFS open and
>>> still run MapReduce jobs against the data in that file.  What you do is
>>>you
>>> flush the data periodically (every record for us), but you don't close
>>>the
>>> file right away.  This allows us to have data files that contain 24
>>>hours
>>> worth of data, but not have to close the file to run the jobs or to
>>> schedule the jobs for after the file is closed.  You can also check the
>>> file size periodically and rotate the files based on size.  We use Avro
>>> files, but sequence files should work too according to Cloudera.
>>>
>>> It's a great compromise for when you want the latest and greatest data,
>>> but don't want to have to wait until all of the files are closed to get
>>>it.
>>>
>>> Casey
>

RE: Hadoop Consumer

Posted by Grégoire Seux <g....@criteo.com>.
Thanks a lot Min, this is indeed very useful. 

-- 
Greg

-----Original Message-----
From: Felix GV [mailto:felix@mate1inc.com] 
Sent: Wednesday, July 4, 2012 18:19
To: kafka-users@incubator.apache.org
Subject: Re: Hadoop Consumer

Thanks for the info, that's interesting :) ...

And thanks for the link Min :) Having a hadoop consumer that manages the offsets with ZK is cool :) ...

--
Felix



On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey < Casey.Sybrandy@six3systems.com> wrote:

> We're using CDH3 update 2 or 3.  I don't know how much the version 
> matters, so it may work on plain-old Hadoop.
> _____________________
> From: Murtaza Doctor [murtaza@richrelevance.com]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
> >Hmm that's surprising. I didn't know about that...!
> >
> >I wonder if it's a new feature... Judging from your email, I assume 
> >you're using CDH? What version?
> >
> >Interesting :) ...
> >
> >--
> >Felix
> >
> >
> >
> >On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey < 
> >Casey.Sybrandy@six3systems.com> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing 
> >> >> file on
> >> HDFS
> >> >> until it reaches a specific size?
> >> >>
> >> >
> >> >No there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >   1. Leave the data in the broker long enough for it to reach the 
> >> > size
> >> you
> >> >   want. Running the SimpleKafkaETLJob at those intervals would 
> >> > give
> >>you
> >> the
> >> >   file size you want. This is the simplest thing to do, but the
> >>drawback
> >> is
> >> >   that your data in HDFS will be less real-time.
> >> >   2. Run the SimpleKafkaETLJob as frequently as you want, and 
> >> > then
> >>roll
> >> up
> >> >   / compact your small files into one bigger file. You would need 
> >> > to
> >> come up
> >> >   with the hadoop job that does the roll up, or find one somewhere.
> >> >   3. Don't use the SimpleKafkaETLJob at all and write a new job 
> >> > that
> >> makes
> >> >   use of hadoop append instead...
> >> >
> >> >Also, you may be interested to take a look at these scripts<
> >>
> >>
> http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
> >> >I
> >> >posted a while ago. If you follow the links in this post, you can 
> >> >get more details about how the scripts work and why it was 
> >> >necessary to do
> >>the
> >> >things it does... or you can just use them without reading. They 
> >> >should work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open 
> >>and  still run MapReduce jobs against the data in that file.  What 
> >>you do is you  flush the data periodically (every record for us), 
> >>but you don't close the  file right away.  This allows us to have 
> >>data files that contain 24 hours  worth of data, but not have to 
> >>close the file to run the jobs or to  schedule the jobs for after 
> >>the file is closed.  You can also check the  file size periodically 
> >>and rotate the files based on size.  We use Avro  files, but 
> >>sequence files should work too according to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest 
> >>data,  but don't want to have to wait until all of the files are 
> >>closed to get it.
> >>
> >> Casey
>
>

Re: Hadoop Consumer

Posted by Felix GV <fe...@mate1inc.com>.
Thanks for the info, that's interesting :) ...

And thanks for the link Min :) Having a hadoop consumer that manages the
offsets with ZK is cool :) ...

--
Felix



On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey <
Casey.Sybrandy@six3systems.com> wrote:

> We're using CDH3 update 2 or 3.  I don't know how much the version
> matters, so it may work on plain-old Hadoop.
> _____________________
> From: Murtaza Doctor [murtaza@richrelevance.com]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
> >Hmm that's surprising. I didn't know about that...!
> >
> >I wonder if it's a new feature... Judging from your email, I assume you're
> >using CDH? What version?
> >
> >Interesting :) ...
> >
> >--
> >Felix
> >
> >
> >
> >On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
> >Casey.Sybrandy@six3systems.com> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing file on
> >> HDFS
> >> >> until it reaches a specific size?
> >> >>
> >> >
> >> >No there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >   1. Leave the data in the broker long enough for it to reach the size
> >> you
> >> >   want. Running the SimpleKafkaETLJob at those intervals would give
> >>you
> >> the
> >> >   file size you want. This is the simplest thing to do, but the
> >>drawback
> >> is
> >> >   that your data in HDFS will be less real-time.
> >> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
> >>roll
> >> up
> >> >   / compact your small files into one bigger file. You would need to
> >> come up
> >> >   with the hadoop job that does the roll up, or find one somewhere.
> >> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >> makes
> >> >   use of hadoop append instead...
> >> >
> >> >Also, you may be interested to take a look at these
> >> >scripts<
> >>
> >>
> http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
> >> >I
> >> >posted a while ago. If you follow the links in this post, you can get
> >> >more details about how the scripts work and why it was necessary to do
> >>the
> >> >things it does... or you can just use them without reading. They should
> >> >work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open and
> >> still run MapReduce jobs against the data in that file.  What you do is
> >>you
> >> flush the data periodically (every record for us), but you don't close
> >>the
> >> file right away.  This allows us to have data files that contain 24
> >>hours
> >> worth of data, but not have to close the file to run the jobs or to
> >> schedule the jobs for after the file is closed.  You can also check the
> >> file size periodically and rotate the files based on size.  We use Avro
> >> files, but sequence files should work too according to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest data,
> >> but don't want to have to wait until all of the files are closed to get
> >>it.
> >>
> >> Casey
>
>

RE: Hadoop Consumer

Posted by "Sybrandy, Casey" <Ca...@Six3Systems.com>.
We're using CDH3 update 2 or 3.  I don't know how much the version matters, so it may work on plain-old Hadoop.
_____________________
From: Murtaza Doctor [murtaza@richrelevance.com]
Sent: Tuesday, July 03, 2012 1:56 PM
To: kafka-users@incubator.apache.org
Subject: Re: Hadoop Consumer

+1 This surely sounds interesting.

On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:

>Hmm that's surprising. I didn't know about that...!
>
>I wonder if it's a new feature... Judging from your email, I assume you're
>using CDH? What version?
>
>Interesting :) ...
>
>--
>Felix
>
>
>
>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>Casey.Sybrandy@six3systems.com> wrote:
>
>> >> - Is there a version of consumer which appends to an existing file on
>> HDFS
>> >> until it reaches a specific size?
>> >>
>> >
>> >No there isn't, as far as I know. Potential solutions to this would be:
>> >
>> >   1. Leave the data in the broker long enough for it to reach the size
>> you
>> >   want. Running the SimpleKafkaETLJob at those intervals would give
>>you
>> the
>> >   file size you want. This is the simplest thing to do, but the
>>drawback
>> is
>> >   that your data in HDFS will be less real-time.
>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>roll
>> up
>> >   / compact your small files into one bigger file. You would need to
>> come up
>> >   with the hadoop job that does the roll up, or find one somewhere.
>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>> makes
>> >   use of hadoop append instead...
>> >
>> >Also, you may be interested to take a look at these
>> >scripts<
>>
>>http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
>> >I
>> >posted a while ago. If you follow the links in this post, you can get
>> >more details about how the scripts work and why it was necessary to do
>>the
>> >things it does... or you can just use them without reading. They should
>> >work pretty much out of the box...
>>
>> Where I work, we discovered that you can keep a file in HDFS open and
>> still run MapReduce jobs against the data in that file.  What you do is
>>you
>> flush the data periodically (every record for us), but you don't close
>>the
>> file right away.  This allows us to have data files that contain 24
>>hours
>> worth of data, but not have to close the file to run the jobs or to
>> schedule the jobs for after the file is closed.  You can also check the
>> file size periodically and rotate the files based on size.  We use Avro
>> files, but sequence files should work too according to Cloudera.
>>
>> It's a great compromise for when you want the latest and greatest data,
>> but don't want to have to wait until all of the files are closed to get
>>it.
>>
>> Casey


Re: Hadoop Consumer

Posted by Murtaza Doctor <mu...@richrelevance.com>.
+1 This surely sounds interesting.

On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:

>Hmm that's surprising. I didn't know about that...!
>
>I wonder if it's a new feature... Judging from your email, I assume you're
>using CDH? What version?
>
>Interesting :) ...
>
>--
>Felix
>
>
>
>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>Casey.Sybrandy@six3systems.com> wrote:
>
>> >> - Is there a version of consumer which appends to an existing file on
>> HDFS
>> >> until it reaches a specific size?
>> >>
>> >
>> >No there isn't, as far as I know. Potential solutions to this would be:
>> >
>> >   1. Leave the data in the broker long enough for it to reach the size
>> you
>> >   want. Running the SimpleKafkaETLJob at those intervals would give
>>you
>> the
>> >   file size you want. This is the simplest thing to do, but the
>>drawback
>> is
>> >   that your data in HDFS will be less real-time.
>> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>roll
>> up
>> >   / compact your small files into one bigger file. You would need to
>> come up
>> >   with the hadoop job that does the roll up, or find one somewhere.
>> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>> makes
>> >   use of hadoop append instead...
>> >
>> >Also, you may be interested to take a look at these
>> >scripts<
>> 
>>http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
>> >I
>> >posted a while ago. If you follow the links in this post, you can get
>> >more details about how the scripts work and why it was necessary to do
>>the
>> >things it does... or you can just use them without reading. They should
>> >work pretty much out of the box...
>>
>> Where I work, we discovered that you can keep a file in HDFS open and
>> still run MapReduce jobs against the data in that file.  What you do is
>>you
>> flush the data periodically (every record for us), but you don't close
>>the
>> file right away.  This allows us to have data files that contain 24
>>hours
>> worth of data, but not have to close the file to run the jobs or to
>> schedule the jobs for after the file is closed.  You can also check the
>> file size periodically and rotate the files based on size.  We use Avro
>> files, but sequence files should work too according to Cloudera.
>>
>> It's a great compromise for when you want the latest and greatest data,
>> but don't want to have to wait until all of the files are closed to get
>>it.
>>
>> Casey


Re: Hadoop Consumer

Posted by Felix GV <fe...@mate1inc.com>.
Hmm that's surprising. I didn't know about that...!

I wonder if it's a new feature... Judging from your email, I assume you're
using CDH? What version?

Interesting :) ...

--
Felix



On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
Casey.Sybrandy@six3systems.com> wrote:

> >> - Is there a version of consumer which appends to an existing file on
> HDFS
> >> until it reaches a specific size?
> >>
> >
> >No there isn't, as far as I know. Potential solutions to this would be:
> >
> >   1. Leave the data in the broker long enough for it to reach the size
> you
> >   want. Running the SimpleKafkaETLJob at those intervals would give you
> the
> >   file size you want. This is the simplest thing to do, but the drawback
> is
> >   that your data in HDFS will be less real-time.
> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> up
> >   / compact your small files into one bigger file. You would need to
> come up
> >   with the hadoop job that does the roll up, or find one somewhere.
> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> makes
> >   use of hadoop append instead...
> >
> >Also, you may be interested to take a look at these
> >scripts<
> http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
> >I
> >posted a while ago. If you follow the links in this post, you can get
> >more details about how the scripts work and why it was necessary to do the
> >things it does... or you can just use them without reading. They should
> >work pretty much out of the box...
>
> Where I work, we discovered that you can keep a file in HDFS open and
> still run MapReduce jobs against the data in that file.  What you do is you
> flush the data periodically (every record for us), but you don't close the
> file right away.  This allows us to have data files that contain 24 hours
> worth of data, but not have to close the file to run the jobs or to
> schedule the jobs for after the file is closed.  You can also check the
> file size periodically and rotate the files based on size.  We use Avro
> files, but sequence files should work too according to Cloudera.
>
> It's a great compromise for when you want the latest and greatest data,
> but don't want to have to wait until all of the files are closed to get it.
>
> Casey

RE: Hadoop Consumer

Posted by "Sybrandy, Casey" <Ca...@Six3Systems.com>.
>> - Is there a version of consumer which appends to an existing file on HDFS
>> until it reaches a specific size?
>>
>
>No there isn't, as far as I know. Potential solutions to this would be:
>
>   1. Leave the data in the broker long enough for it to reach the size you
>   want. Running the SimpleKafkaETLJob at those intervals would give you the
>   file size you want. This is the simplest thing to do, but the drawback is
>   that your data in HDFS will be less real-time.
>   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up
>   / compact your small files into one bigger file. You would need to come up
>   with the hadoop job that does the roll up, or find one somewhere.
>   3. Don't use the SimpleKafkaETLJob at all and write a new job that makes
>   use of hadoop append instead...
>
>Also, you may be interested to take a look at these
>scripts<http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>I
>posted a while ago. If you follow the links in this post, you can get
>more details about how the scripts work and why it was necessary to do the
>things it does... or you can just use them without reading. They should
>work pretty much out of the box...

Where I work, we discovered that you can keep a file in HDFS open and still run MapReduce jobs against the data in that file.  What you do is you flush the data periodically (every record for us), but you don't close the file right away.  This allows us to have data files that contain 24 hours worth of data, but not have to close the file to run the jobs or to schedule the jobs for after the file is closed.  You can also check the file size periodically and rotate the files based on size.  We use Avro files, but sequence files should work too according to Cloudera.
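
In case it helps, here is a minimal sketch of that pattern (class, schema and
path names are hypothetical; it assumes Avro's DataFileWriter on top of a
Hadoop FSDataOutputStream, and that your Hadoop version makes flushed bytes
visible to readers via sync()/hflush()):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingAvroHdfsWriter {
    private final FileSystem fs;
    private final Schema schema;
    private final long maxBytes;
    private FSDataOutputStream out;
    private DataFileWriter<GenericRecord> writer;

    public RollingAvroHdfsWriter(Configuration conf, Schema schema, long maxBytes)
            throws IOException {
        this.fs = FileSystem.get(conf);
        this.schema = schema;
        this.maxBytes = maxBytes;
        openNewFile();
    }

    // Open a fresh Avro container file in HDFS and keep it open for appends.
    private void openNewFile() throws IOException {
        Path path = new Path("/data/events/" + System.currentTimeMillis() + ".avro");
        out = fs.create(path);
        writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, out);
    }

    // Append one record, flush it so readers/MapReduce can see it, and
    // rotate to a new file once the open file is big enough.
    public void append(GenericRecord record) throws IOException {
        writer.append(record);
        writer.flush();  // push Avro's buffer down to the HDFS stream
        out.sync();      // sync() on older Hadoop, hflush() on newer releases
        if (out.getPos() >= maxBytes) {
            writer.close();  // flushes and closes this file
            openNewFile();
        }
    }
}

That roughly matches what we do: flush after (nearly) every record, leave the
file open for the jobs, and rotate on size rather than on a fixed schedule.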

It's a great compromise for when you want the latest and greatest data, but don't want to have to wait until all of the files are closed to get it.

Casey

Re: Hadoop Consumer

Posted by Murtaza Doctor <mu...@richrelevance.com>.
>>
>>- We have event data under the topic "foo" written to the kafka
>> Server/Broker in avro format and want to write those events to HDFS.
>>Does
>> the Hadoop consumer expect the data written to HDFS already?
>
>
>No it doesn't expect the data to be written into HDFS already... There
>wouldn't be much point to it, otherwise, no ;) ?
>

Sorry, my note was unclear. I meant: the SimpleKafkaETLJob requires a
sequence file with an offset written to HDFS, and then uses that as a
bookmark to pull the data from the broker?
This file has a checksum, and I was trying to modify the topic in it, which
then of course messes up the checksum. I already have events generated on
my Kafka server, and all I wanted to do was run the SimpleKafkaETLJob to pull
out the data and write it to HDFS. I was trying to fulfill the sequence-file
prerequisite, and that does not seem to work for me.

>
>> Based on the
>> doc looks like the DataGenerator is pulling events from the broker and
>> writing to HDFS. In our case we only wanted to utilize the
>> SimpleKafkaETLJob to write to HDFS.
>
>
>That's what it does. It spawns a (map only) Map Reduce job that pulls in
>parallel from the broker(s) and writes that data into HDFS.
>
>
>> I am surely missing something here?
>>
>
>Maybe...? I don't know. Do tell if anything is not clear still...!

Thanks for confirming; I just want to make sure I got it right.

>
>
>> - Is there a version of consumer which appends to an existing file on
>>HDFS
>> until it reaches a specific size?
>>
>
>No there isn't, as far as I know. Potential solutions to this would be:
>
>   1. Leave the data in the broker long enough for it to reach the size
>you
>   want. Running the SimpleKafkaETLJob at those intervals would give you
>the
>   file size you want. This is the simplest thing to do, but the drawback
>is
>   that your data in HDFS will be less real-time.
>   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>up
>   / compact your small files into one bigger file. You would need to
>come up
>   with the hadoop job that does the roll up, or find one somewhere.
>   3. Don't use the SimpleKafkaETLJob at all and write a new job that
>makes
>   use of hadoop append instead...

These options are very useful. I like option 3 the most :)

>
>Also, you may be interested to take a look at these
>scripts<http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>I
>posted a while ago. If you follow the links in this post, you can get
>more details about how the scripts work and why it was necessary to do the
>things it does... or you can just use them without reading. They should
>work pretty much out of the box...

Will surely give them a spin. Thanks!
>
>>
>> Thanks,
>> murtaza
>>
>>


Re: Hadoop Consumer

Posted by Felix GV <fe...@mate1inc.com>.
Answer inlined...

--
Felix



On Fri, Jun 29, 2012 at 9:24 PM, Murtaza Doctor
<mu...@richrelevance.com>wrote:

> Had a few questions around the Hadoop Consumer.
>
> - We have event data under the topic "foo" written to the kafka
> Server/Broker in avro format and want to write those events to HDFS. Does
> the Hadoop consumer expect the data written to HDFS already?


No it doesn't expect the data to be written into HDFS already... There
wouldn't be much point to it, otherwise, no ;) ?


> Based on the
> doc looks like the DataGenerator is pulling events from the broker and
> writing to HDFS. In our case we only wanted to utilize the
> SimpleKafkaETLJob to write to HDFS.


That's what it does. It spawns a (map only) Map Reduce job that pulls in
parallel from the broker(s) and writes that data into HDFS.


> I am surely missing something here?
>

Maybe...? I don't know. Do tell if anything is not clear still...!


> - Is there a version of consumer which appends to an existing file on HDFS
> until it reaches a specific size?
>

No there isn't, as far as I know. Potential solutions to this would be:

   1. Leave the data in the broker long enough for it to reach the size you
   want. Running the SimpleKafkaETLJob at those intervals would give you the
   file size you want. This is the simplest thing to do, but the drawback is
   that your data in HDFS will be less real-time.
   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up
   / compact your small files into one bigger file. You would need to come up
   with the hadoop job that does the roll up, or find one somewhere.
   3. Don't use the SimpleKafkaETLJob at all and write a new job that makes
   use of hadoop append instead...
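
For the HDFS side of option 3, a very rough sketch of the append-and-rotate
idea (this is not a full Hadoop job, the names are made up, and it assumes
your Hadoop version/config actually allows append, e.g. dfs.support.append):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSink {
    private final FileSystem fs;
    private final Path target;

    public HdfsAppendSink(Configuration conf, Path target) throws IOException {
        this.fs = FileSystem.get(conf);
        this.target = target;
    }

    // Append one batch of Kafka message payloads to the existing HDFS file
    // (creating it on the first call), newline-delimited for simplicity.
    public void appendBatch(Iterable<byte[]> payloads) throws IOException {
        FSDataOutputStream out =
                fs.exists(target) ? fs.append(target) : fs.create(target);
        try {
            for (byte[] payload : payloads) {
                out.write(payload);
                out.write('\n');
            }
        } finally {
            out.close();
        }
    }

    // Let the caller decide when to roll to a new target file.
    public boolean isFull(long maxBytes) throws IOException {
        return fs.exists(target) && fs.getFileStatus(target).getLen() >= maxBytes;
    }
}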

Also, you may be interested to take a look at these
scripts<http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>I
posted a while ago. If you follow the links in this post, you can get
more details about how the scripts work and why it was necessary to do the
things it does... or you can just use them without reading. They should
work pretty much out of the box...

>
> Thanks,
> murtaza
>
>

Hadoop Consumer

Posted by Murtaza Doctor <mu...@richrelevance.com>.
Had a few questions around the Hadoop Consumer.

- We have event data under the topic "foo" written to the Kafka
server/broker in Avro format and want to write those events to HDFS. Does
the Hadoop consumer expect the data to already be written to HDFS? Based on the
doc looks like the DataGenerator is pulling events from the broker and
writing to HDFS. In our case we only wanted to utilize the
SimpleKafkaETLJob to write to HDFS. I am surely missing something here?
- Is there a version of consumer which appends to an existing file on HDFS
until it reaches a specific size?

Thanks,
murtaza


Re: Kafka - Avro Encoder

Posted by Neha Narkhede <ne...@gmail.com>.
Hi Murtaza,

>> - Is there any sample code around this since this is probably a common use-case. I meant is there a CustomAvroEncoder which we can use out of the box or any chance this can also be open-sourced?

The encoding/decoding using Avro is pretty simple. We just use the
BinaryEncoder with the Specific/Generic DatumWriter to write
IndexedRecord objects.
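
A bare-bones sketch of such an encoder (illustrative only; it assumes Kafka
0.7's kafka.serializer.Encoder interface, Avro 1.5+'s EncoderFactory, and
specific records, and the class name is made up):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import kafka.message.Message;
import kafka.serializer.Encoder;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

public class AvroMessageEncoder implements Encoder<IndexedRecord> {

    // Kafka calls this for every record handed to the producer.
    public Message toMessage(IndexedRecord event) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            SpecificDatumWriter<IndexedRecord> writer =
                    new SpecificDatumWriter<IndexedRecord>(event.getSchema());
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
            writer.write(event, encoder);
            encoder.flush();
            return new Message(bytes.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException("Failed to Avro-encode record", e);
        }
    }
}

You would then point the producer at it with something like
serializer.class=com.example.AvroMessageEncoder in the producer properties.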

>> - In terms of internals - are we converting avro into byte stream and creating a Message Object and then writing to the queue, does this incur any overhead in your opinion?

The overhead of serialization I've seen in production is ~0.05 ms per record.

Thanks,
Neha


On Tue, Jun 26, 2012 at 10:07 PM, Murtaza Doctor
<mu...@richrelevance.com> wrote:
> Hello Folks,
>
> We are currently evaluating Kafka and had a few questions around the
> Encoder functionality.
> Our data is in avro format and we wish to send the data to the broker in
> this format as well eventually write to HDFS. As documented, we do realize
> that we need a Custom Encoder to achieve creation of the Message object.
>
> Questions we had:
> - Is there any sample code around this since this is probably a common
> use-case. I meant is there a CustomAvroEncoder which we can use out of the
> box or any chance this can also be open-sourced?
> - In terms of internals - are we converting avro into byte stream and
> creating a Message Object and then writing to the queue, does this incur
> any overhead in your opinion?
> - Any best practices around this or how others would approach this problem?
>
> If there is any value we would definitely like to see this added to the
> FAQs or even part of some sample code.
>
> Thanks,
> murtaza
>
>