Posted to users@kafka.apache.org by Michal Haris <mi...@visualdna.com> on 2012/10/16 19:34:21 UTC

Our scenario and couple of questions

Hi everyone,

Our current situation (without Kafka):

- We have at the moment 8 event tracker servers that in total are capable
of handling 8,000 HTTP events per second, but a normal daily peak
throughput is about 1,250 messages per second.
- Messages are basically HTTP events, enriched by various Apache modules
and transformations, eventually written into log files.
- Each event is approximately 0.5 KB when packed as JSON.
- These message logs are compressed and shipped every 5 minutes into S3,
where they are used by Hive and other Hadoop jobs.
- Pretty standard.

*My plan is to introduce a Kafka system on top of the existing offline
log processing.*

I have a simulated event stream and have written a Hadoop job similar to
the ETL consumer in trunk, except that I keep the offsets in ZooKeeper
and the output files are partitioned into date directories.
In the first phase I am going to install a Kafka broker on each of the 8
tracker servers, run tail | php producer.php on each of them, and have
the PHP code publish into the local broker node under a single topic. In
total there will be a cluster of 8 Kafka servers with a 3- or 5-node
ZooKeeper ensemble interlaced on the same hardware. This topic is going
to be mirrored into a central Kafka cluster where the Hadoop loader job
will run every 30 minutes or so.
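For illustration, the tracker-side leg of this pipeline could be sketched
roughly as follows (in Python rather than PHP; publish_batch is a
hypothetical stand-in for the real producer call, not an actual API):

```python
# Rough sketch of the tail | producer pipeline (illustrative only).
# publish_batch is a hypothetical callback standing in for the real
# Kafka producer call in producer.php.
import sys

def run_pipeline(lines, publish_batch, batch_size=10):
    """Group incoming JSON event lines into fixed-size batches."""
    batch = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                    # skip blank lines from tail
        batch.append(line)
        if len(batch) == batch_size:
            publish_batch(batch)
            batch = []
    if batch:
        publish_batch(batch)            # flush the final partial batch

# Example: collect batches instead of publishing them
batches = []
run_pipeline(["a\n", "b\n", "c\n"], batches.append, batch_size=2)
# batches is now [["a", "b"], ["c"]]
```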

*Question 1*: If each broker has one topic and one partition, and I want
to implement a partitioned producer (in PHP), I still have 8 partitions
in total, correct?
*Question 2*: In future I may have multiple event tracking clusters which
I want mirrored onto a single topic in the central tracker; is this kind
of mirroring possible with 0.7.x?
*Question 3*: If I want the low-level PHP producer to batch and zip 10
messages like the async Scala/Java producer does, all I have to do is
send a message that is a message set containing all 10 messages, correct?
*Question 4*: This system is quite likely to go into production in the
next few weeks, and I prefer staying with 0.7.x because it's simpler for
non-Java clients, but would you advise me to build on 0.8.x, and why?


Thanks a lot
-- 
Michal Haris
Software Engineer

VisualDNA | 7 Moor Street, London, W1D 5NB
www.visualdna.com | t: +44 (0) 207 734 7033

Re: Our scenario and couple of questions

Posted by Michal Haris <mi...@visualdna.com>.
Oh, one more aspect of the problem: the event stream can potentially be
split into multiple topics, and I have an idea how and with what
partitioning, but since mirroring neither obeys the partitioning nor
supports a partitioner implementation, I have a dilemma. Note that there
will be other topics besides this event stream in the entire system, but
for now only this one is relevant:

   - Is it fine to have a single topic and then have consumers process
   pointlessly many messages only to find the few they are interested in?
   - Or would it make more sense to have one topic for the sake of
   mirroring, and then have a consumer and producer that republishes those
   messages into multiple sub-topics, where messages would appear
   redundantly in several topics, each with a different partitioner?
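The second option could be sketched roughly like this (Python for
illustration; route() and the send callback are hypothetical placeholders,
not a real Kafka client API, and the routing rules are invented examples):

```python
# Hedged sketch of a fan-out republisher: consume the single mirrored
# topic and republish each event into one or more sub-topics. route()
# and send(topic, payload) are illustrative stand-ins, not a real API.
import json

def route(event):
    """Decide which sub-topics an event belongs to (invented rules)."""
    topics = ["events.all"]                    # every event stays in a catch-all
    if event.get("type") == "click":
        topics.append("events.clicks")
    if event.get("campaign"):
        topics.append("events.campaigns")
    return topics

def republish(raw_messages, send):
    """Re-send each raw JSON message to every sub-topic it routes to."""
    for raw in raw_messages:
        event = json.loads(raw)
        for topic in route(event):
            send(topic, raw)        # same payload, possibly several topics
```

Note the trade-off this makes explicit: messages are duplicated across
sub-topics, in exchange for consumers that only read what they need.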

thanks for your help,

Michal,





Re: Our scenario and couple of questions

Posted by Michal Haris <mi...@visualdna.com>.
Great, thanks a lot!





Re: Our scenario and couple of questions

Posted by Neha Narkhede <ne...@gmail.com>.
>> *Question 1*: If each broker has one topic and one partition, and I want
to implement a partitioned producer (in PHP), I still have 8 partitions in
total, correct?

Correct
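In other words, with 8 single-partition brokers the producer can treat
them as 8 logical partitions and pick the target broker by hashing the
partitioning key. A minimal sketch (Python for illustration; the broker
list and function names are hypothetical, not a real Kafka client API):

```python
# Hedged sketch: map a partitioning key to one of 8 logical partitions,
# each hosted on its own broker. Hostnames are invented placeholders.
import zlib

BROKERS = [f"tracker{i}.example.com:9092" for i in range(8)]

def choose_partition(key, num_partitions=len(BROKERS)):
    """Deterministically map a key to one of the logical partitions."""
    # crc32 is stable across runs, unlike Python's salted hash()
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def broker_for(key):
    """Resolve the broker that hosts the key's partition."""
    return BROKERS[choose_partition(key)]
```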

>> *Question 2*: In future I may have multiple event tracking clusters which
I want mirrored onto a single topic in the central tracker; is this kind of
mirroring possible with 0.7.x?

This is available from 0.7.1 onwards

>> *Question 3*: If I want the low-level PHP producer to batch and zip 10
messages like the async Scala/Java producer does, all I have to do is
send a message that is a message set containing all 10 messages, correct?

Yes, provided you conform to the format of a compressed message -
https://cwiki.apache.org/confluence/display/KAFKA/Compression
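A sketch of how such a batch could be assembled, based on my reading of
the 0.7 wire format described on that wiki page (each message is a 1-byte
magic, a 1-byte attributes field, a 4-byte CRC32 of the payload, then the
payload; in a message set each message is preceded by its 4-byte size;
a compressed wrapper message sets attributes=1 for gzip). Verify the
exact layout against the wiki before relying on this:

```python
# Hedged sketch of a Kafka 0.7-style compressed batch, per my reading of
# the Compression wiki page. Not a verified client implementation.
import gzip
import struct
import zlib

def encode_message(payload, attributes=0):
    """[magic=1][attributes][crc32 of payload][payload]"""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">BBI", 1, attributes, crc) + payload

def encode_message_set(payloads):
    """Concatenate messages, each preceded by its 4-byte size."""
    out = b""
    for p in payloads:
        msg = encode_message(p)
        out += struct.pack(">I", len(msg)) + msg
    return out

def encode_compressed_batch(payloads):
    """Wrap a gzipped message set in a single message with attributes=1."""
    inner = gzip.compress(encode_message_set(payloads))
    return encode_message(inner, attributes=1)
```

So "batch & zip 10 messages" would be one call to
encode_compressed_batch() with the 10 payloads, sent as a single message.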

>> *Question 4*: This system is quite likely to go into production in the
next few weeks, and I prefer staying with 0.7.x because it's simpler for
non-Java clients, but would you advise me to build on 0.8.x, and why?

Recommend staying on 0.7.x since it is stable. If your requirements
include message replication, durability and guaranteed delivery,
you might want to wait until 0.8 is released. The wire protocol has
changed considerably in 0.8.

Thanks,
Neha
