You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/06/01 16:15:59 UTC

[jira] [Commented] (BEAM-321) Hash encoded keys in Flink batch mode

    [ https://issues.apache.org/jira/browse/BEAM-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310556#comment-15310556 ] 

ASF GitHub Bot commented on BEAM-321:
-------------------------------------

GitHub user aljoscha opened a pull request:

    https://github.com/apache/incubator-beam/pull/409

    [BEAM-321] Fix Flink Comparators

    `KvCoderComparator` and `CoderComparator` were hashing the key directly
    while doing comparisons on the encoded form. This lead to
    inconsistencies in GroupByKey results with large numbers of elements per
    key.
    
    This changes the comparators to hash on the encoded form and also adds a
    test case to verify correct behavior. 
    
    I made the test a `RunnableOnService` test that verifies that runners use the encoded form of a value for hashing/comparison. In the test, I use a bogus key that returns garbage from `equals()` and `hashCode()`. The result of the test is correct if only the encoded form is used for hashing/comparison.
    
    R: @tgroh for the test, do you think there's a better way of doing it?
    CC: @tgroh the newly introduced test hangs on the `InProcessPipelineRunner`, see `InProcessGroupByKeyTest` that I added for quickly checking this
    CC: @francesperry is this new test in line with the Beam model

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aljoscha/incubator-beam flink/fix-comparator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/409.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #409
    
----
commit 8b269b542032c887ea149c50245de8d07a94fd51
Author: Aljoscha Krettek <al...@gmail.com>
Date:   2016-06-01T09:56:18Z

    [BEAM-321] Fix Flink Comparators
    
    KvCoderComparator and CoderComparator were hashing the key directly
    while doing comparisons on the encoded form. This lead to
    inconsistencies in GroupByKey results with large numbers of elements per
    key.
    
    This changes the comparators to hash on the encoded form and also adds a
    test case to verify correct behavior.

commit 6866fd740af07a214ef5a9be39983260878ecb8d
Author: Aljoscha Krettek <al...@gmail.com>
Date:   2016-06-01T09:58:48Z

    Fix typo in KvCoderComperator, is now KvCoderComparator

commit 6334b87ef3d2a90c540cf89aadab013dcded1483
Author: Aljoscha Krettek <al...@gmail.com>
Date:   2016-06-01T15:54:37Z

    move to RunnableOnService test

commit 85defbc67323e10a87835d2514f45ad238e47842
Author: Aljoscha Krettek <al...@gmail.com>
Date:   2016-06-01T16:14:51Z

    make test class package private

----


> Hash encoded keys in Flink batch mode
> -------------------------------------
>
>                 Key: BEAM-321
>                 URL: https://issues.apache.org/jira/browse/BEAM-321
>             Project: Beam
>          Issue Type: Sub-task
>          Components: runner-flink
>    Affects Versions: 0.1.0-incubating
>            Reporter: Maximilian Michels
>            Assignee: Aljoscha Krettek
>             Fix For: 0.1.0-incubating
>
>
> Right now, hashing of keys happens on the value itself not on the encoded representation. This is at odds with the Beam specification and can lead to incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)