Posted to users@kafka.apache.org by Paul Ingles <pa...@forward.co.uk> on 2012/01/24 09:24:14 UTC

Incremental Hadoop + SimpleKafkaETLJob

Hi,

I'm investigating using Kafka and would really appreciate some more experienced opinions on how the pieces work together.

Our application instances are creating Protocol Buffer serialized messages and pushing them to topics in Kafka:

* Web log requests
* Product details viewed
* Search performed
* Email registered
etc...

I would like to be able to perform incremental loads from these topics into HDFS and then into the rest of our batch processing. I have 3 broad questions:

1) How do people trigger the batch loads? Do you just point your SimpleKafkaETLJob input at the offset file output by the previous run? Do you move files between runs of the SimpleKafkaETLJob, i.e. move the part-* files into one place and move the offsets into an input directory ready for the next run?

2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper outputs Long/Text writables and is marked as deprecated (this is in the 0.7 source). Is there an alternative class that should be used, or is the hadoop-consumer being deprecated overall?

3) Given that the SimpleKafkaETLMapper reads bytes in but outputs Text lines, are most people using Kafka for passing text messages around, or JSON data, etc.?

Thanks,
Paul

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Russell Jurney <ru...@gmail.com>.
Anyone have code that does incremental S3?

Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Felix GV <fe...@mate1inc.com>.
Yeah, those shell scripts are basically the continuation of what I was doing
in my last blog posts. I planned to write new blog posts about them but I
never got around to it. Then I saw your message and it gave me the little
kick in the arse I needed to at least gist those things :) ...

Hopefully, it can save you some time :) !

--
Felix

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Paul Ingles <pa...@forward.co.uk>.
Thanks Felix. I found your blog posts before and they really helped me figure out how to get things working, so I'll definitely give the shell scripts a run.

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Felix GV <fe...@mate1inc.com>.
It's ok, we're all busy and open source is essentially volunteer work.

Besides, you guys didn't promise any time frame, as far as I remember, so
technically there is no deadline at which you'll ever "break your promise"
hehe...

Still looking forward to it though :)

--
Felix

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Paul Ingles <pa...@forward.co.uk>.
> In our case, all of our data in Kafka
> is serialized into Avro. We happen to keep Avro when we pull the data into
> Hadoop as well.

Awesome! Do you still publish Avro messages via Kafka? Are you using something else to pull the data out of Kafka and into Hadoop?

Thanks again, really appreciate the insight.

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Richard Park <ri...@gmail.com>.
Let me try to answer the other questions.

For 1: the latest offset files written by the mappers are used as input for
subsequent runs. We output these files to a temp dir, after which an hdfs mv
into a 'completed' directory gives us a pseudo-atomic commit. Subsequent runs
search for the latest completed run. We're careful not to have several jobs
pulling from the same offsets.
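
As a rough sketch of that commit pattern (the paths and layout here are made
up for illustration, not our actual code):

    #!/bin/bash
    # Sketch: pseudo-atomic commit of one ETL run's output in HDFS.
    RUN_ID=$(date +%Y%m%d%H%M%S)
    TMP=/kafka-etl/web_logs/tmp/$RUN_ID
    COMPLETED=/kafka-etl/web_logs/completed

    # Input offsets come from the most recently completed run (directory
    # listings start with 'd' in their permission string).
    LAST_RUN=$(hadoop fs -ls "$COMPLETED" | awk '/^d/ {print $NF}' | sort | tail -1)

    # ... run the ETL job here: read offsets from $LAST_RUN, write the part-*
    # files and the new offset files into $TMP ...

    # An HDFS rename is atomic, so other jobs only ever see fully written runs.
    hadoop fs -mv "$TMP" "$COMPLETED/$RUN_ID"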

2. I believe SimpleKafkaETLMapper was written as an example. I'm unsure why
it's deprecated except that it may be outdated.

3. We don't use SimpleKafkaETLMapper. In our case, all of our data in Kafka
is serialized into Avro. We happen to keep Avro when we pull the data into
Hadoop as well.
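
For example, one quick way to eyeball the Avro data that landed in HDFS (the
file name and avro-tools version here are illustrative):

    # copy one Avro output file out of HDFS and dump it as JSON for inspection
    hadoop fs -copyToLocal /kafka-etl/web_logs/completed/20120124/part-00000.avro .
    java -jar avro-tools-1.6.1.jar tojson part-00000.avro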

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Richard Park <ri...@gmail.com>.
Yeah, sorry about missing the promise to release code.
I'll talk to someone about releasing what we have.

Re: Incremental Hadoop + SimpleKafkaETLJob

Posted by Felix GV <fe...@mate1inc.com>.
Hello :)

For question 1:

The hadoop consumer in the contrib directory has almost everything it needs
to do distributed incremental imports out of the box, but it requires a bit
of hand-holding.

I've created two scripts to automate the process. One of them generates
initial offset files, and the other does incremental hadoop consumption.

I personally use a cron job to periodically call the incremental consumer
script with specific parameters (topic and HDFS output path).

You can find all of the required files in this gist:
https://gist.github.com/1671887
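
For illustration, a cron entry calling the incremental consumer script could
look something like this (the script name, topic, and paths are made up; the
real files are in the gist):

    # every 10 minutes, pull new 'web_logs' messages from Kafka into HDFS
    */10 * * * * /opt/kafka-etl/incremental-consumer.sh web_logs /data/kafka/web_logs >> /var/log/kafka-etl-web_logs.log 2>&1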

The LinkedIn guys promised to release their full Hadoop/Kafka ETL code
eventually, but I think they haven't had time to get around to it yet. When
they do release it, it's probably going to be better than my scripts, but
for now, I think those scripts are the only publicly available way to do
this stuff without writing it yourself.

I don't know about questions 2 and 3.

I hope this helps :) !

--
Felix
