Posted to dev@crunch.apache.org by "Andrew Olson (JIRA)" <ji...@apache.org> on 2019/02/26 23:02:00 UTC
[jira] [Created] (CRUNCH-680) Kafka Source should split very large partitions
Andrew Olson created CRUNCH-680:
-----------------------------------
Summary: Kafka Source should split very large partitions
Key: CRUNCH-680
URL: https://issues.apache.org/jira/browse/CRUNCH-680
Project: Crunch
Issue Type: Improvement
Components: IO
Reporter: Andrew Olson
If a single Kafka partition has a very large number of messages, the map task for that partition can take a long time to run, which can lead to timeout problems. We should limit the number of messages assigned to each split so that the workload is more evenly balanced across map tasks.
Based on our testing, 5 million messages should be a generally reasonable default for the maximum split size, with a configuration property (org.apache.crunch.kafka.split.max) provided to optionally override that value.
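
To illustrate the proposal, here is a minimal sketch of how a partition's offset range could be divided into splits of bounded size. It is not the actual Crunch change: the class OffsetRangeSplitter and its methods are hypothetical, java.util.Properties stands in for the job configuration, and the difference between two offsets is assumed to approximate the message count (which may overestimate on compacted topics). Only the property name and the 5 million default come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Crunch.
class OffsetRangeSplitter {
    // Property name and default value as stated in the issue description.
    static final String MAX_SPLIT_PROPERTY = "org.apache.crunch.kafka.split.max";
    static final long DEFAULT_MAX_SPLIT = 5_000_000L;

    // Reads the maximum split size, falling back to the default when unset.
    // Properties is an assumption standing in for the real job configuration.
    static long maxSplitSize(Properties props) {
        String value = props.getProperty(MAX_SPLIT_PROPERTY);
        return value == null ? DEFAULT_MAX_SPLIT : Long.parseLong(value.trim());
    }

    // Divides [startOffset, endOffset) into consecutive ranges, each covering
    // at most maxSize offsets. Each long[] holds {start, end} with end exclusive.
    static List<long[]> split(long startOffset, long endOffset, long maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long current = startOffset;
        while (current < endOffset) {
            // Compare the remaining length first to avoid overflow in current + maxSize.
            long next = (endOffset - current > maxSize) ? current + maxSize : endOffset;
            ranges.add(new long[] {current, next});
            current = next;
        }
        return ranges;
    }
}
```

With the default limit, a partition holding offsets 0 through 11,999,999 would be divided into three ranges: [0, 5000000), [5000000, 10000000), and [10000000, 12000000), each of which could then be handed to its own map task.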
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)