Posted to users@kafka.apache.org by The Real Preacher <pr...@gmail.com> on 2021/01/19 12:19:28 UTC

Are Kafka and Kafka Streams the right tools for my case?

I'm new to Kafka and will be grateful for any advice. We are updating a 
legacy application and moving it from IBM MQ to something different.


The application currently does the following:

  * Reads batch XML messages (up to 5 MB)
  * Parses them into something meaningful
  * Processes the data, manually parallelizing this step across parts
    of the batch; this involves calls to an external legacy API that
    result in DB changes
  * Sends several kinds of email notifications
  * Sends a reply to another queue
  * Writes input messages to disk


We are considering Kafka with Kafka Streams because it would let us:

  * Scale processing easily
  * Have messages stored persistently out of the box
  * Get partitioning, replication, and fault tolerance built in
  * Use Confluent Schema Registry to move to schema-on-write
  * Reuse the cluster for service-to-service communication between
    other applications as well


But I have some concerns.


We are thinking about splitting those huge messages logically and 
putting them into Kafka that way, since from what I understand, Kafka 
is not a huge fan of big messages. Splitting would also let us 
parallelize processing on a per-partition basis.
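Roughly, the splitting I have in mind looks like this (a hypothetical Python sketch; the element name "order", the batch-id attribute, and the topic layout are made up for illustration, and the actual Kafka producer calls are omitted — in production this would be Java against the producer API):

```python
# Hypothetical sketch: split one large batch XML into per-item records
# that all share the batch id as their Kafka key.
import hashlib
import xml.etree.ElementTree as ET

def split_batch(batch_xml: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs, one per item in the batch.

    Every record carries the batch id as its key, so all chunks of one
    batch land on the same partition (preserving order), while different
    batches hash to different partitions and are processed in parallel.
    """
    root = ET.fromstring(batch_xml)
    batch_id = root.get("id", "unknown")
    return [(batch_id, ET.tostring(item, encoding="unicode"))
            for item in root.findall("order")]

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner (which actually uses
    # murmur2); shown only to illustrate that equal keys always map to
    # the same partition.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_partitions

batch = '<batch id="B-42"><order>a</order><order>b</order></batch>'
records = split_batch(batch)
```

Each (key, value) pair would then be sent with a plain producer `send`, and the key alone decides the partition.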


After that, we would use Kafka Streams for the actual processing, and 
further on for aggregating the batch responses back together using a 
state store, as well as for pushing some messages to other topics 
(e.g. for sending emails).
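The aggregation step I have in mind would work roughly like this (a plain-Python stand-in where a dict plays the role of the Streams state store; in real Kafka Streams this would be a KeyValueStore updated inside a processor, and the "expected" count is an assumption about how we would tag each chunk):

```python
# Hypothetical sketch of merging per-chunk results back into one batch
# response. A dict stands in for a Kafka Streams state store; the
# expected-chunk count would in practice travel with each chunk.
store: dict[str, dict] = {}

def on_chunk_result(batch_id: str, expected: int, result: str):
    """Accumulate chunk results; emit the merged response once complete."""
    entry = store.setdefault(batch_id, {"expected": expected, "results": []})
    entry["results"].append(result)
    if len(entry["results"]) == entry["expected"]:
        del store[batch_id]               # clean up state once emitted
        return "".join(entry["results"])  # would go to the reply topic
    return None                           # still waiting for more chunks

assert on_chunk_result("B-42", 2, "<part1/>") is None
merged = on_chunk_result("B-42", 2, "<part2/>")
```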


But is it a good idea to do the actual processing in Kafka Streams at 
all, given that it involves external API calls?


I'm also not sure how best to handle the case where this external API 
is down for any reason. That means a temporary failure for the current 
message and all subsequent ones. Is there any way to stop Kafka 
Streams processing for some time? I can see that there are pause and 
resume methods on the Consumer API; can they be used somehow from Streams?


Or is it better to use a regular Kafka consumer here, possibly adding 
Streams as a next step to merge those batch messages back together? 
That sounds like an overcomplication.


Is Kafka a good tool for these purposes at all?

Cheers,
TRP


Re: Are Kafka and Kafka Streams right tools for my case?

Posted by Guozhang Wang <wa...@gmail.com>.
Hello,

I have observed several production use cases similar to yours built on
Kafka / Kafka Streams. That being said, your concerns are valid:

* Big messages: 5MB is indeed large, but not extreme for Kafka. If it
were a single message of hundreds of MBs or over a GB, that would be a
different story (and you would need to consider chunking it). You
would still need to make sure Kafka is configured correctly, with
`message.max.bytes` on the broker and the corresponding producer and
consumer size settings, to handle them; you may also need to tune your
clients' network sockets (buffer sizes, etc.) for optimal networking
performance.
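For reference, the size limits live in several places (the config names below are the actual Kafka settings; the 10 MB values are just examples to adjust for your payloads):

```properties
# Broker (server.properties): largest record batch the broker accepts
message.max.bytes=10485760

# Producer: max request size must be at least as large
max.request.size=10485760

# Consumer: max bytes fetched per partition per request
max.partition.fetch.bytes=10485760
```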

* External calls in Streams: if the external service may be unavailable,
your implementation should have a timeout path that either drops the
record or retries it later (some implementations put it into a retry
queue, itself stored as a Kafka topic, and read from that topic later
to retry). Also note that Kafka Streams relies on the consumer to poll
records, and if that `poll` call is not triggered in time because the
external API calls take too long, you need to configure
`max.poll.interval.ms` to be long enough. Another caveat is that Kafka
Streams currently has no async-processing capability: if a single
record takes too long in an external call (or even a local IO call),
it blocks all records behind it. So if this processing bottleneck is a
common case for you, you would probably need to write a custom
processor that makes the external calls asynchronously yourself. In
the future we do plan to support async processing in Kafka Streams.
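The retry-queue idea can be sketched like this (a pure-Python stand-in with an in-memory deque; a real implementation would produce failed records to a dedicated retry topic, and the delay value here is an arbitrary example):

```python
# Hypothetical sketch of the retry-topic pattern: on failure, defer the
# record with a not-before timestamp instead of blocking the stream.
import time
from collections import deque

RETRY_DELAY = 30.0  # seconds; tune to the external API's recovery time

retry_queue: deque = deque()  # stand-in for a Kafka retry topic

def process(record, call_external_api):
    try:
        call_external_api(record)
    except ConnectionError:
        # Defer instead of blocking: re-read after the delay elapses.
        retry_queue.append((time.monotonic() + RETRY_DELAY, record))

def drain_retries(call_external_api, now=None):
    """Re-process any deferred records whose delay has elapsed."""
    now = time.monotonic() if now is None else now
    while retry_queue and retry_queue[0][0] <= now:
        _, record = retry_queue.popleft()
        process(record, call_external_api)
```

This keeps healthy records flowing while the external API is down; only the deferred ones wait out the backoff.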



Guozhang




On Tue, Jan 19, 2021 at 8:44 AM The Real Preacher <pr...@gmail.com> wrote:

> I'm new to Kafka and will be grateful for any advice We are updating a
> legacy application together with moving it from IBM MQ to something
> different.
>
>
> Application currently does the following:
>
>   * Reads batch XML messages (up to 5 MB)
>   * Parses it to something meaningful
>   * Processes data parallelizing this procedure somehow manually for
>     parts of the batch. Involves some external legacy API calls
>     resulting in DB changes
>   * Sends several kinds of email notifications
>   * Sends some reply to some other queue
>   * input messages are profiled to disk
>
>
> We are considering using Kafka with Kafka Streams as it is nice to
>
>   * Scale processing easily
>   * Have messages persistently stored out of the box
>   * Built-in partitioning, replication, and fault-tolerance
>   * Confluent Schema Registry to let us move to schema-on-write
>   * Can be used for service-to-service communication for other
>     applications as well
>
>
> But I have some concerns.
>
>
> We are thinking about splitting those huge messages logically and
> putting them to Kafka this way, as from how I understand it - Kafka is
> not a huge fan of big messages. Also it will let us parallelize
> processing on partition basis.
>
>
> After that use Kafka Streams for actual processing and further on for
> aggregating some batch responses back using state store. Also to push
> some messages to some other topics (e.g. for sending emails)
>
>
> But I wonder if it is a good idea to do actual processing in Kafka
> Streams at all, as it involves some external API calls?
>
>
> Also I'm not sure what is the best way to handle the cases when this
> external API is down for any reason. It means temporary failure for
> current and all the subsequent messages. Is there any way to stop Kafka
> Stream processing for some time? I can see that there are Pause and
> Resume methods on the Consumer API, can they be utilized somehow in
> Streams?
>
>
> Is it better to use a regular Kafka consumer here, possibly adding
> Streams as a next step to merge those batch messages together? Sounds
> like an overcomplication
>
>
> Is Kafka a good tool for these purposes at all?
>
> Cheers,
> TRP
>
>

-- 
-- Guozhang