Posted to users@kafka.apache.org by Eric Tschetter <ec...@gmail.com> on 2012/04/03 03:05:59 UTC

Embedding a broker into a producer?

I'm setting up an HTTP endpoint that just takes a posted object and
shoves it into Kafka.  I'm imagining this as basically an embedded
broker in my producer and am wondering if there's a way to emit
messages directly into the broker without actually setting up a
Producer object?  Or, is it just going to be simpler and more
supported for me if I actually set up the separate objects and have
them talk via whatever mechanism they end up talking via?

--Eric
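
The hand-off Eric describes — an HTTP POST body forwarded into Kafka — can be sketched with the producer hidden behind a plain callable, so either wiring (embedded broker or a separate Producer object) could sit behind it. `handle_post` and the `send` callable are illustrative assumptions here, not Kafka APIs:

```python
# Sketch of the HTTP-endpoint-to-producer hand-off.  `send` is a
# hypothetical stand-in for whatever producer mechanism ends up being
# used; in real code it might wrap a producer send call for some topic.

def handle_post(body: bytes, send) -> int:
    """Forward one POSTed payload to the producer; return an HTTP status."""
    if not body:
        return 400      # nothing to publish
    send(body)          # hand off to the producer mechanism
    return 202          # accepted for asynchronous delivery

# Any callable works as the producer stand-in:
received = []
handle_post(b'{"event": "click"}', received.append)
```

Keeping the transport behind one callable means the endpoint code does not change if the answer below ("use the Producer api") is the one taken.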

Re: Embedding a broker into a producer?

Posted by Edward Smith <es...@stardotstar.org>.
Niek,

  Thanks for the response.  I agree with your assessments.

  It's a beautiful thing that I don't have to commit to this decision
early in the build process.  I can start it one way and just switch to
the other, as it is all a matter of configuration and not code.

  I'll let you know if we learn anything interesting along the way.

Ed

On Thu, Apr 12, 2012 at 5:38 PM, Niek Sanders <ni...@gmail.com> wrote:
>> What do you see as the downside to running a Kafka broker and letting
>> it write your files locally?
>>
>
> Here is every downside I can come up with.  Some are trivial, but I've
> tried to play devil's advocate.
>
> 1) Additional memory/processing footprint, as you mentioned.
>
> 2) Additional network usage by Kafka consumers hitting producer box.
>
> 3) Loss of either producer elasticity or message retention.  One of
> the awesome features of Kafka is being able to hold on to a long history
> of messages and replay as needed.  But if the producer machines host
> my brokers, I can no longer scale down the number of producer machines
> as the system load drops--a loss of elasticity.  This ties to another
> group discussion about decommissioning brokers.
>
> 4) Security... some producers live on web-accessible machines.  The
> fewer open ports accepting incoming connections, the better.  Not a
> huge issue with good firewall rules, but still something to ponder.
>
> 5) Losing that data duplication has downsides too.  Having multiple copies
> of data does violate DRY, but it can also add robustness.  If the
> hard disk on either the producer or the broker dies, you still have
> the data lying around.  Since the data is not getting changed on
> either the broker or any producer log files, you shouldn't have
> the syncing issues normally associated with DRY violations.
>
>
> - Niek

Re: Embedding a broker into a producer?

Posted by Niek Sanders <ni...@gmail.com>.
> What do you see as the downside to running a Kafka broker and letting
> it write your files locally?
>

Here is every downside I can come up with.  Some are trivial, but I've
tried to play devil's advocate.

1) Additional memory/processing footprint, as you mentioned.

2) Additional network usage by Kafka consumers hitting producer box.

3) Loss of either producer elasticity or message retention.  One of
the awesome features of Kafka is being able to hold on to a long history
of messages and replay as needed.  But if the producer machines host
my brokers, I can no longer scale down the number of producer machines
as the system load drops--a loss of elasticity.  This ties to another
group discussion about decommissioning brokers.

4) Security... some producers live on web-accessible machines.  The
fewer open ports accepting incoming connections, the better.  Not a
huge issue with good firewall rules, but still something to ponder.

5) Losing that data duplication has downsides too.  Having multiple copies
of data does violate DRY, but it can also add robustness.  If the
hard disk on either the producer or the broker dies, you still have
the data lying around.  Since the data is not getting changed on
either the broker or any producer log files, you shouldn't have
the syncing issues normally associated with DRY violations.


- Niek

Re: Embedding a broker into a producer?

Posted by Edward Smith <es...@stardotstar.org>.
Niek,

  Thanks for sharing your architecture.  We are in a similar boat, as
our current data stream is written to files first, and then the Kafka
producer can read/transmit those.

What do you see as the downside to running a Kafka broker and letting
it write your files locally?

I'm new to kafka, so just exploring ideas here:

producer-side broker downsides:
1.  Heavier memory/processing footprint than just a producer

producer-side broker upsides:
1.  eliminates the middle man: you essentially have peer-to-peer
operation between producers and consumers, with ZK as the coordinator.
For me, this is big, since I don't have to worry about High
Availability (HA) for the brokers.
2.  eliminates duplicating data on disk at both the producer and the broker.
3.  Data has been demultiplexed into its topics when it is on disk at
the producer/broker.  This means that I can purge data based on
per-topic policies (our data arrives multiplexed and has to be split
into topics; we also run out of storage during a network outage).

In the research I've been doing, this is the model proposed by the 0mq
(zeromq) folks, I think.  It's just that all of the wiring is already
written in Kafka.

Ed


On Thu, Apr 12, 2012 at 12:33 PM, Niek Sanders <ni...@gmail.com> wrote:
> Dealing with network/broker outage on the producer side is also
> something that I've been trying to solve.
>
> Having a hook for the producer to dump to a local file would probably
> be the simplest solution.  In the event of a prolonged outage, this
> file could be replayed once availability is restored.
>
> The current approach I've been taking:
> 1) My bridge code between my data source and the Kafka producer writes
> everything to a local log file.  When this bridge starts up, it
> generates a unique 8-character alphanumeric string.  For each log
> entry it writes to the local file, it prefixes the entry with both the
> alphanumeric string and a log line number (0,1,2,3,...).  The data
> already has timestamps coming with it.
> 2) In the event of a network outage or Kafka being unable to keep up
> with the producer, I simply drop the Kafka messages.  I never allow my
> data source to be blocked waiting on the Kafka producer/broker.
> 3) For given time ranges, my consumers track all the alphanumeric
> identifiers that they consumed and the maximum complete sequence
> number that they have seen.
>
> So I can manually go back to producers and replay any lost data.
> (Whether it was never sent because of network outage or if it died
> with a broker hardware failure).
>
> I basically go to the producer machine (which I track in the Kafka
> message body) and say: for time A to time B, I received data for these
> identifiers and max sequence numbers (najeh2wh, 12312), (ji3njdKL,
> 71).  Replay anything that I'm missing.
>
> I use random identifier strings because it saves me from having to
> persist the number of log lines my producer has generated.
> (Robustness against producer failure).
>
> - Niek
>
> On Thu, Apr 12, 2012 at 7:12 AM, Edward Smith <es...@stardotstar.org> wrote:
>> Jun/Eric,
>>
>> [snip]
>>
>>  However, we have a requirement to support HA.  If I stick with the
>> approach above, I have to worry about replicating/mirroring the
>> queues, which always gets sticky.  We have to handle the case where a
>> producer loses network connectivity, and so must be able to queue
>> locally at the producer, which, I believe, either means putting the Kafka
>> broker there or continuing to use some 'homebrew' local queue.  With
>> brokers on the same node as producers, consumers only have to HA the
>> results of their processing and I don't have to HA the queues.
>>
>>  Any thoughts or feedback from the group is welcome.
>>
>> Ed
>>

Re: Embedding a broker into a producer?

Posted by Niek Sanders <ni...@gmail.com>.
Dealing with network/broker outage on the producer side is also
something that I've been trying to solve.

Having a hook for the producer to dump to a local file would probably
be the simplest solution.  In the event of a prolonged outage, this
file could be replayed once availability is restored.
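
Such a hook might look roughly like the following sketch: spool undeliverable messages to a local file, then replay and drain the file once the broker is reachable again. All names and the one-message-per-line format are assumptions for illustration; nothing here is an existing Kafka feature:

```python
import os

def spool(path, message):
    """Append one undeliverable message to a local spool file, one per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")

def replay(path, send):
    """Re-send every spooled message once the broker is back; drain the file."""
    if not os.path.exists(path):
        return 0                       # nothing was spooled
    with open(path, encoding="utf-8") as f:
        messages = [line.rstrip("\n") for line in f]
    for m in messages:
        send(m)                        # the real producer send in practice
    os.remove(path)                    # spool fully drained
    return len(messages)
```

A real implementation would also have to handle messages containing newlines and a crash between re-sending and deleting the spool (which produces duplicates), which is part of the complexity Jun mentions later in this thread.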

The current approach I've been taking:
1) My bridge code between my data source and the Kafka producer writes
everything to a local log file.  When this bridge starts up, it
generates a unique 8-character alphanumeric string.  For each log
entry it writes to the local file, it prefixes the entry with both the
alphanumeric string and a log line number (0,1,2,3,...).  The data
already has timestamps coming with it.
2) In the event of a network outage or Kafka being unable to keep up
with the producer, I simply drop the Kafka messages.  I never allow my
data source to be blocked waiting on the Kafka producer/broker.
3) For given time ranges, my consumers track all the alphanumeric
identifiers that they consumed and the maximum complete sequence
number that they have seen.

So I can manually go back to producers and replay any lost data.
(Whether it was never sent because of network outage or if it died
with a broker hardware failure).

I basically go to the producer machine (which I track in the Kafka
message body) and say: for time A to time B, I received data for these
identifiers and max sequence numbers (najeh2wh, 12312), (ji3njdKL,
71).  Replay anything that I'm missing.

I use random identifier strings because it saves me from having to
persist the number of log lines my producer has generated.
(Robustness against producer failure).
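
The bookkeeping in steps 1-3 can be sketched as follows. The function names (`new_run_id`, `tag_lines`, `max_complete_seq`) are hypothetical, and "maximum complete sequence number" is read as the highest n such that all of 0..n were seen:

```python
import random
import string

def new_run_id(rng=random):
    """Unique-ish 8-character alphanumeric id, generated at bridge startup."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(8))

def tag_lines(run_id, entries):
    """Prefix each log entry with the run id and a line number 0,1,2,..."""
    return ["%s %d %s" % (run_id, n, e) for n, e in enumerate(entries)]

def max_complete_seq(seen):
    """Highest n such that every sequence number 0..n was consumed;
    -1 if even line 0 is missing."""
    n = 0
    while n in seen:
        n += 1
    return n - 1
```

A consumer reporting (najeh2wh, 12312) is then claiming that lines 0 through 12312 of that run all arrived; anything beyond that may need replaying from the producer's local file.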

- Niek

On Thu, Apr 12, 2012 at 7:12 AM, Edward Smith <es...@stardotstar.org> wrote:
> Jun/Eric,
>
> [snip]
>
>  However, we have a requirement to support HA.  If I stick with the
> approach above, I have to worry about replicating/mirroring the
> queues, which always gets sticky.  We have to handle the case where a
> producer loses network connectivity, and so must be able to queue
> locally at the producer, which, I believe, either means putting the Kafka
> broker there or continuing to use some 'homebrew' local queue.  With
> brokers on the same node as producers, consumers only have to HA the
> results of their processing and I don't have to HA the queues.
>
>  Any thoughts or feedback from the group is welcome.
>
> Ed
>

Re: Embedding a broker into a producer?

Posted by Jun Rao <ju...@gmail.com>.
Ed,

We also thought about having a local log in the producer in case the producer
can't send data to the brokers. It's doable. However, it adds a bit of
complexity in the code and in operations (since now producers have to
worry about storage, and typically there are many more producers than
brokers).

Thanks,

Jun

On Thu, Apr 12, 2012 at 7:12 AM, Edward Smith <es...@stardotstar.org> wrote:

> Jun/Eric,
>
>  Just to add my two cents:  I am starting a new project, and starting
> with Kafka.  Current architecture writes data to files on the
> producing hosts.  Then a homebrew queuing system reads the files and
> passes them up to a consumer.  Producer/Consumer pairing is all done
> manually, there is no load balancing.  Fault tolerance is handled by
> having the producer send to 2 consumers and duplicating the
> processing, and then ignoring the duplicate results.
>
>  My initial approach will be to run a Kafka cluster and use a
> producer on the producing nodes to read the files from disk and send
> them up to the cluster, and then have consumers subscribe to the
> topics, etc.  This seems like the 'normal' approach.
>
>  However, we have a requirement to support HA.  If I stick with the
> approach above, I have to worry about replicating/mirroring the
> queues, which always gets sticky.  We have to handle the case where a
> producer loses network connectivity, and so must be able to queue
> locally at the producer, which, I believe, either means putting the Kafka
> broker there or continuing to use some 'homebrew' local queue.  With
> brokers on the same node as producers, consumers only have to HA the
> results of their processing and I don't have to HA the queues.
>
>  Any thoughts or feedback from the group is welcome.
>
> Ed
>
> On Tue, Apr 3, 2012 at 2:30 PM, Jun Rao <ju...@gmail.com> wrote:
> > There is currently no plan for doing that. However, if you think this
> > is a useful feature, please create a jira so that we can track it.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Apr 3, 2012 at 10:17 AM, Eric Tschetter <ec...@gmail.com> wrote:
> >
> >> Ok, I can do that (that's actually how our current stuff works as
> >> well), I was just hoping to maybe remove the need to tell my producer
> >> to connect to localhost so that it can talk to some other part of the
> >> code running in the same process.
> >>
> >> Do you think you will ever have a Producer object implemented in terms
> >> of a KafkaServer object?  Or, if that were to exist would you be
> >> willing to take on the maintenance of it as part of the public API?
> >>
> >> --Eric
> >>
> >>
> >> On Tue, Apr 3, 2012 at 8:04 AM, Jun Rao <ju...@gmail.com> wrote:
> >> > Eric,
> >> >
> >> > Try using the Producer api. Internal apis are subject to change in the
> >> > future and are not officially supported.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Mon, Apr 2, 2012 at 6:05 PM, Eric Tschetter <ec...@gmail.com> wrote:
> >> >
> >> >> I'm setting up an HTTP endpoint that just takes a posted object and
> >> >> shoves it into Kafka.  I'm imagining this as basically an embedded
> >> >> broker in my producer and am wondering if there's a way to emit
> >> >> messages directly into the broker without actually setting up a
> >> >> Producer object?  Or, is it just going to be simpler and more
> >> >> supported for me if I actually set up the separate objects and have
> >> >> them talk via whatever mechanism they end up talking via?
> >> >>
> >> >> --Eric
> >> >>
> >>
>

Re: Embedding a broker into a producer?

Posted by Edward Smith <es...@stardotstar.org>.
Jun/Eric,

  Just to add my two cents:  I am starting a new project, and starting
with Kafka.  Current architecture writes data to files on the
producing hosts.  Then a homebrew queuing system reads the files and
passes them up to a consumer.  Producer/Consumer pairing is all done
manually, there is no load balancing.  Fault tolerance is handled by
having the producer send to 2 consumers and duplicating the
processing, and then ignoring the duplicate results.

  My initial approach will be to run a Kafka cluster and use a
producer on the producing nodes to read the files from disk and send
them up to the cluster, and then have consumers subscribe to the
topics, etc.  This seems like the 'normal' approach.

  However, we have a requirement to support HA.  If I stick with the
approach above, I have to worry about replicating/mirroring the
queues, which always gets sticky.  We have to handle the case where a
producer loses network connectivity, and so must be able to queue
locally at the producer, which, I believe, either means putting the Kafka
broker there or continuing to use some 'homebrew' local queue.  With
brokers on the same node as producers, consumers only have to HA the
results of their processing and I don't have to HA the queues.

  Any thoughts or feedback from the group is welcome.

Ed

On Tue, Apr 3, 2012 at 2:30 PM, Jun Rao <ju...@gmail.com> wrote:
> There is currently no plan for doing that. However, if you think this is a
> useful feature, please create a jira so that we can track it.
>
> Thanks,
>
> Jun
>
> On Tue, Apr 3, 2012 at 10:17 AM, Eric Tschetter <ec...@gmail.com> wrote:
>
>> Ok, I can do that (that's actually how our current stuff works as
>> well), I was just hoping to maybe remove the need to tell my producer
>> to connect to localhost so that it can talk to some other part of the
>> code running in the same process.
>>
>> Do you think you will ever have a Producer object implemented in terms
>> of a KafkaServer object?  Or, if that were to exist would you be
>> willing to take on the maintenance of it as part of the public API?
>>
>> --Eric
>>
>>
>> On Tue, Apr 3, 2012 at 8:04 AM, Jun Rao <ju...@gmail.com> wrote:
>> > Eric,
>> >
>> > Try using the Producer api. Internal apis are subject to change in the
>> > future and are not officially supported.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Mon, Apr 2, 2012 at 6:05 PM, Eric Tschetter <ec...@gmail.com> wrote:
>> >
>> >> I'm setting up an HTTP endpoint that just takes a posted object and
>> >> shoves it into Kafka.  I'm imagining this as basically an embedded
>> >> broker in my producer and am wondering if there's a way to emit
>> >> messages directly into the broker without actually setting up a
>> >> Producer object?  Or, is it just going to be simpler and more
>> >> supported for me if I actually set up the separate objects and have
>> >> them talk via whatever mechanism they end up talking via?
>> >>
>> >> --Eric
>> >>
>>

Re: Embedding a broker into a producer?

Posted by Jun Rao <ju...@gmail.com>.
There is currently no plan for doing that. However, if you think this is a
useful feature, please create a jira so that we can track it.

Thanks,

Jun

On Tue, Apr 3, 2012 at 10:17 AM, Eric Tschetter <ec...@gmail.com> wrote:

> Ok, I can do that (that's actually how our current stuff works as
> well), I was just hoping to maybe remove the need to tell my producer
> to connect to localhost so that it can talk to some other part of the
> code running in the same process.
>
> Do you think you will ever have a Producer object implemented in terms
> of a KafkaServer object?  Or, if that were to exist would you be
> willing to take on the maintenance of it as part of the public API?
>
> --Eric
>
>
> On Tue, Apr 3, 2012 at 8:04 AM, Jun Rao <ju...@gmail.com> wrote:
> > Eric,
> >
> > Try using the Producer api. Internal apis are subject to change in the
> > future and are not officially supported.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Apr 2, 2012 at 6:05 PM, Eric Tschetter <ec...@gmail.com> wrote:
> >
> >> I'm setting up an HTTP endpoint that just takes a posted object and
> >> shoves it into Kafka.  I'm imagining this as basically an embedded
> >> broker in my producer and am wondering if there's a way to emit
> >> messages directly into the broker without actually setting up a
> >> Producer object?  Or, is it just going to be simpler and more
> >> supported for me if I actually set up the separate objects and have
> >> them talk via whatever mechanism they end up talking via?
> >>
> >> --Eric
> >>
>

Re: Embedding a broker into a producer?

Posted by Eric Tschetter <ec...@gmail.com>.
Ok, I can do that (that's actually how our current stuff works as
well), I was just hoping to maybe remove the need to tell my producer
to connect to localhost so that it can talk to some other part of the
code running in the same process.

Do you think you will ever have a Producer object implemented in terms
of a KafkaServer object?  Or, if that were to exist would you be
willing to take on the maintenance of it as part of the public API?

--Eric


On Tue, Apr 3, 2012 at 8:04 AM, Jun Rao <ju...@gmail.com> wrote:
> Eric,
>
> Try using the Producer api. Internal apis are subject to change in the
> future and are not officially supported.
>
> Thanks,
>
> Jun
>
> On Mon, Apr 2, 2012 at 6:05 PM, Eric Tschetter <ec...@gmail.com> wrote:
>
>> I'm setting up an HTTP endpoint that just takes a posted object and
>> shoves it into Kafka.  I'm imagining this as basically an embedded
>> broker in my producer and am wondering if there's a way to emit
>> messages directly into the broker without actually setting up a
>> Producer object?  Or, is it just going to be simpler and more
>> supported for me if I actually set up the separate objects and have
>> them talk via whatever mechanism they end up talking via?
>>
>> --Eric
>>

Re: Embedding a broker into a producer?

Posted by Jun Rao <ju...@gmail.com>.
Eric,

Try using the Producer api. Internal apis are subject to change in the
future and are not officially supported.

Thanks,

Jun

On Mon, Apr 2, 2012 at 6:05 PM, Eric Tschetter <ec...@gmail.com> wrote:

> I'm setting up an HTTP endpoint that just takes a posted object and
> shoves it into Kafka.  I'm imagining this as basically an embedded
> broker in my producer and am wondering if there's a way to emit
> messages directly into the broker without actually setting up a
> Producer object?  Or, is it just going to be simpler and more
> supported for me if I actually set up the separate objects and have
> them talk via whatever mechanism they end up talking via?
>
> --Eric
>