Posted to jira@kafka.apache.org by "Richard Yu (JIRA)" <ji...@apache.org> on 2018/12/17 01:57:00 UTC

[jira] [Commented] (KAFKA-7432) API Method on Kafka Streams for processing chunks/batches of data

    [ https://issues.apache.org/jira/browse/KAFKA-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722635#comment-16722635 ] 

Richard Yu commented on KAFKA-7432:
-----------------------------------

Hi, just want to point out something here.

What Kafka Streams currently supports is continuous (record-at-a-time) processing, which Spark Structured Streaming itself only recently added as an alternative mode. In contrast, this ticket suggests microbatch processing, in which records are grouped into batches before being handled. In some stream-processing circles, continuous processing is considered the better approach, and microbatching the older technique.

I am not sure we need to implement this particular option, especially since end-to-end latency for microbatching is higher than for continuous processing.

> API Method on Kafka Streams for processing chunks/batches of data
> -----------------------------------------------------------------
>
>                 Key: KAFKA-7432
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7432
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: sam
>            Priority: Major
>
> For many situations in Big Data it is preferable to work with a small buffer of records at a go, rather than one record at a time.
> The natural example is calling some external API that supports batching for efficiency.
> How can we do this in Kafka Streams? I cannot find anything in the API that looks like what I want.
> So far I have:
> {{builder.stream[String, String]("my-input-topic").mapValues(externalApiCall).to("my-output-topic")}}
> What I want is:
> {{builder.stream[String, String]("my-input-topic").batched(chunkSize = 2000).map(externalBatchedApiCall).to("my-output-topic")}}
> In Scala and Akka Streams the function is called {{grouped}} or {{batch}}. In Spark Structured Streaming we can do {{mapPartitions.map(_.grouped(2000).map(externalBatchedApiCall))}}.
>  
>  
> https://stackoverflow.com/questions/52366623/how-to-process-data-in-chunks-batches-with-kafka-streams
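For reference, the size-based buffering the ticket asks for can already be approximated in user code today. The sketch below is a minimal, hypothetical illustration of that pattern in plain Java; the class name RecordBuffer and its methods are illustrative only and are not part of the Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: accumulate records and hand them off in chunks,
// mimicking the proposed batched(chunkSize = N) semantics in user code.
final class RecordBuffer<V> {
    private final int chunkSize;
    private final Consumer<List<V>> onFlush;   // e.g. the batched external API call
    private final List<V> buf = new ArrayList<>();

    RecordBuffer(int chunkSize, Consumer<List<V>> onFlush) {
        this.chunkSize = chunkSize;
        this.onFlush = onFlush;
    }

    // Add one record; emit a full chunk once chunkSize records have accumulated.
    void add(V value) {
        buf.add(value);
        if (buf.size() >= chunkSize) {
            flush();
        }
    }

    // Emit whatever is currently buffered (e.g. on commit or on a timer)
    // and reset, so a partial batch does not wait indefinitely.
    void flush() {
        if (!buf.isEmpty()) {
            onFlush.accept(new ArrayList<>(buf));
            buf.clear();
        }
    }
}
```

In a Kafka Streams topology one would call add() per incoming record inside a Processor/Transformer and trigger flush() from a scheduled punctuation, trading per-record latency for larger external calls.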



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)