Posted to dev@crunch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/03/01 16:41:00 UTC

[jira] [Work logged] (CRUNCH-680) Kafka Source should split very large partitions

     [ https://issues.apache.org/jira/browse/CRUNCH-680?focusedWorklogId=206520&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-206520 ]

ASF GitHub Bot logged work on CRUNCH-680:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Mar/19 16:40
            Start Date: 01/Mar/19 16:40
    Worklog Time Spent: 10m 
      Work Description: mkwhitacre commented on pull request #21: CRUNCH-680: Kafka Source should split very large partitions
URL: https://github.com/apache/crunch/pull/21
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 206520)
    Time Spent: 20m  (was: 10m)

> Kafka Source should split very large partitions
> -----------------------------------------------
>
>                 Key: CRUNCH-680
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-680
>             Project: Crunch
>          Issue Type: Improvement
>          Components: IO
>            Reporter: Andrew Olson
>            Assignee: Micah Whitacre
>            Priority: Minor
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a single Kafka partition has a very large number of messages, the map task for that partition can take a long time to run, which can lead to timeouts. We should limit the number of messages assigned to each split so that the workload is more evenly balanced.
> Based on our testing, 5 million messages should be a reasonable default for the maximum split size, with a configuration property (org.apache.crunch.kafka.split.max) provided to override that value.
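
The splitting described above could look roughly like the sketch below. It is not the code from the pull request: the class OffsetSplitter and its methods are hypothetical names, a plain Map stands in for the job configuration, and the sketch assumes that the number of offsets in a range is a good enough proxy for the number of messages. Only the property name and the 5 million default come from the issue description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the actual Crunch implementation.
final class OffsetSplitter {
    // Property name and default value are taken from the issue description.
    static final String MAX_SPLIT_PROPERTY = "org.apache.crunch.kafka.split.max";
    static final long DEFAULT_MAX_SPLIT = 5_000_000L;

    private OffsetSplitter() {}

    // Resolve the maximum split size. A plain Map stands in for the job
    // configuration here (an assumption made to keep the sketch self-contained).
    static long maxSplitSize(Map<String, String> conf) {
        String value = conf.get(MAX_SPLIT_PROPERTY);
        return value == null ? DEFAULT_MAX_SPLIT : Long.parseLong(value.trim());
    }

    // Split the half-open offset range [start, end) of one partition into
    // consecutive ranges that each cover at most maxSize offsets.
    static List<long[]> split(long start, long end, long maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long cursor = start;
        while (cursor < end) {
            // Compare against the remaining length to avoid overflow in cursor + maxSize.
            long next = (end - cursor > maxSize) ? cursor + maxSize : end;
            ranges.add(new long[] {cursor, next});
            cursor = next;
        }
        return ranges;
    }
}
```

With the default limit, a partition covering offsets 0 to 12,000,000 would be divided into three ranges (two of 5,000,000 offsets and one of 2,000,000), each of which could be handled by its own map task. An empty range produces no splits.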



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)