Posted to issues@flink.apache.org by StephanEwen <gi...@git.apache.org> on 2015/08/17 00:23:08 UTC

[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/1028

    [FLINK-2386] Rework Kafka consumer for Flink

    This is a reworked and extended version of #996. It also builds on top of #1017.
    
    It improves the Kafka consumer, fixes bugs, and introduces pluggable *fetcher* and *offset committer* abstractions so that the consumer works across Kafka versions from 0.8.1 to the upcoming 0.8.3.
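
    As a rough illustration of what such pluggable abstractions could look like, here is a minimal sketch in Java. The interface names (`Fetcher`, `OffsetCommitter`) and their signatures are assumptions made for this sketch, not necessarily the exact interfaces in this pull request.

        // Illustrative sketch only: version-independent fetching and offset committing,
        // so that different Kafka client implementations can be plugged in underneath.
        import java.util.List;
        import java.util.Map;

        /** Pulls records from a set of Kafka partitions, independent of the Kafka client version. */
        interface Fetcher extends Runnable {

            /** Assigns the partitions this fetcher is responsible for. */
            void setPartitionsToRead(List<Integer> partitions);

            /** Positions a partition at a specific offset, e.g. when restoring from a checkpoint. */
            void seek(int partition, long offset);

            /** Stops fetching and releases all connections. */
            void close() throws Exception;
        }

        /** Commits offsets (e.g. to ZooKeeper or the broker) so that progress is externally visible. */
        interface OffsetCommitter {

            /** Commits the given partition -> offset mapping. */
            void commit(Map<Integer, Long> offsetsToCommit) throws Exception;

            void close() throws Exception;
        }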
    
    ## Functionality
    
      - The Kafka consumer properly preserves partitioning across failures/restarts (see the offset checkpointing sketch after this list).
      - Pluggable fetchers / committers for interoperability across multiple Kafka versions
        - Fetcher based on the low level consumer API
        - Fetcher based on the upcoming new consumer API (backported and included in the Flink Kafka consumer).
      - Proper cancelability
      - The test coverage is vastly improved.
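
    The following is a minimal sketch of how per-partition offsets can be carried through Flink's checkpoints so that partitioning and read positions survive a restart. The `Checkpointed` interface below is a stand-in modeled after Flink's checkpointing interface of that time; the surrounding class and method names are illustrative, not the exact code in this pull request.

        // Illustrative sketch only: the last emitted offset per partition is part of
        // every checkpoint, so a restart resumes reading exactly where it left off.
        import java.io.Serializable;
        import java.util.HashMap;

        // Stand-in for Flink's Checkpointed interface (snapshot/restore of operator state).
        interface Checkpointed<T extends Serializable> {
            T snapshotState(long checkpointId, long checkpointTimestamp) throws Exception;
            void restoreState(T state) throws Exception;
        }

        class PartitionOffsetState implements Checkpointed<HashMap<Integer, Long>> {

            // partition -> offset of the last record handed downstream
            private final HashMap<Integer, Long> offsets = new HashMap<>();

            void recordEmitted(int partition, long offset) {
                offsets.put(partition, offset);
            }

            @Override
            public HashMap<Integer, Long> snapshotState(long checkpointId, long checkpointTimestamp) {
                return new HashMap<>(offsets);   // this copy becomes part of the checkpoint
            }

            @Override
            public void restoreState(HashMap<Integer, Long> restored) {
                offsets.putAll(restored);        // the fetcher is then seeked to these offsets
            }
        }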
    
    ## Tests
    
    This pull request includes a set of thorough new tests for the Kafka consumer:
    
      - Preservation of partitioning and exactly-once behavior for
        - 1:1 source to Kafka partition mapping
        - 1:n source to Kafka partition mapping
        - n:1 source to Kafka partition mapping
      - Broker failure
      - Cancelling
        - After immediate failures during deployment
        - While waiting to read from a partition
        - While currently reading from a partition
      - Commit notifications for checkpoints
      - Large records (30 MB)
      - Alignment of offsets with what is committed into ZooKeeper
      - Concurrent producer/consumer programs
    
    ## Limitations
    
    The code based on the low-level consumer seems to work well.
    
    The high-level consumer does not work with very large records. It looks like a problem in the backported Kafka code, but this is not 100% certain.
    
    ## Debug Code
    
    This pull request includes some debug code in the `BarrierBuffer` that I intend to remove. It is there to track possible corner-case problems in the checkpoint barrier handling.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink kafka

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1028.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1028
    
----
commit beed1d499b8c876330fac73da324235dd69a4e91
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-14T21:32:35Z

    [FLINK-2462] [streaming] Major cleanup of operator structure for exception handling and code simplification
    
      - The exceptions are no longer logged by the operators themselves.
        Operators perform only cleanup in reaction to exceptions.
        Exceptions are reported only to the root Task object, which knows whether this is the first
        failure-causing exception (the root cause), a subsequent exception, or whether the task was
        actually canceled already. In the latter case, exceptions are ignored, because many
        cancellations lead to meaningless exceptions.
    
      - More exceptions in signatures, less wrapping where not needed
    
      - Core resource acquisition/release logic is in one streaming task, reducing code duplication
    
      - Guaranteed cleanup of output buffer and input buffer resources (formerly missed when other exceptions were encountered)
    
      - Fix mixup in instantiation of source contexts in the stream source task
    
      - Auto watermark generators correctly shut down their interval scheduler
    
      - Improve use of generics, got rid of many raw types
    
    This closes #1017

commit 57caed89279253ba3be7d69144cb762aac98f1f5
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-16T14:52:16Z

    [tests] Reinforce StateCheckpoinedITCase to make sure actual checkpointing has happened before a failure.

commit 2568d8dfa97e8a33ce63b39e2b39a01584781568
Author: Robert Metzger <rm...@apache.org>
Date:   2015-07-20T19:39:46Z

    [FLINK-2386] [kafka connector] Add new Kafka Consumer for Flink
    
    This closes #996

commit 1a420413381a5d660742e7da576cb3cae5f0a613
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-11T12:21:33Z

    [streaming] Clean up de-/serialization schema, add TypeInformationSerializationSchema prominently, add tests.

commit 32674145503ce36ab16bf7641892fdbe4f4a1045
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-11T14:48:26Z

    [FLINK-2386] [kafka connector] Add comments to all backported kafka sources and move them to 'org.apache.flink.kafka_backport'

commit 4d4cb1cf79dc656261284bc3e044c27898a4152c
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-11T20:21:53Z

    [FLINK-2386] [kafka connector] Refactor, cleanup, and fix kafka consumers

----



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-131847425
  
    I like this idea a lot. The backported code is not very stable anyway...



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-131843222
  
    How about dropping the backported Kafka code and relying completely on our own implementation against the SimpleConsumer API?
    We would need to implement the `KafkaConsumer.partitionsFor()` method ourselves, but I think that's doable.
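
    For illustration, here is a rough sketch of how partition discovery could be done directly against the 0.8 low-level API, i.e. without the backported `KafkaConsumer.partitionsFor()`. The class names come from the Kafka 0.8 `kafka.javaapi` package; the helper itself (`PartitionDiscovery`) is hypothetical and not code from this pull request.

        // Hypothetical helper: look up the partitions of a topic via a metadata
        // request against one bootstrap broker, using the low-level SimpleConsumer.
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;
        import kafka.javaapi.PartitionMetadata;
        import kafka.javaapi.TopicMetadata;
        import kafka.javaapi.TopicMetadataRequest;
        import kafka.javaapi.consumer.SimpleConsumer;

        public class PartitionDiscovery {

            /** Returns the partition ids of the given topic, as reported by the bootstrap broker. */
            public static List<Integer> partitionsFor(String topic, String brokerHost, int brokerPort) {
                SimpleConsumer consumer =
                        new SimpleConsumer(brokerHost, brokerPort, 30000, 64 * 1024, "partition-lookup");
                try {
                    TopicMetadataRequest request =
                            new TopicMetadataRequest(Collections.singletonList(topic));
                    List<Integer> partitions = new ArrayList<>();
                    for (TopicMetadata tm : consumer.send(request).topicsMetadata()) {
                        for (PartitionMetadata pm : tm.partitionsMetadata()) {
                            partitions.add(pm.partitionId());
                        }
                    }
                    return partitions;
                } finally {
                    consumer.close();
                }
            }
        }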



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-132256558
  
    I've opened a pull request with the code removed against your branch: https://github.com/StephanEwen/incubator-flink/pull/14



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-133475027
  
    Subsumed by #1039 



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-133470378
  
    Yes, I think we can close this issue.



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen closed the pull request at:

    https://github.com/apache/flink/pull/1028



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-133463879
  
    Since it has been rebased into #1039 already, could this one be closed and the review done on that one instead?



[GitHub] flink pull request: [FLINK-2386] Rework Kafka consumer for Flink

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on the pull request:

    https://github.com/apache/flink/pull/1028#issuecomment-133463429
  
    So does #1039 depend on this one?

