You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2018/03/29 12:08:00 UTC
[jira] [Resolved] (FLINK-9031) DataSet Job result changes when adding rebalance after union

     [ https://issues.apache.org/jira/browse/FLINK-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Ewen resolved FLINK-9031.
---------------------------------
       Resolution: Fixed
         Assignee: Fabian Hueske
    Fix Version/s: 1.4.3
                   1.5.0

Fixed in
  - 1.4.3 via 492dcb1765e86f1e7d66ee08db714992e7a31f4e
  - 1.5.0 via ef2038755f3af8cc356727fee3f0f28c3cfffc64
  - 1.6.0 via 87ff6eb86e7130309aab8e4ff322f0fdd2cdc020

> DataSet Job result changes when adding rebalance after union
> ------------------------------------------------------------
>
>                 Key: FLINK-9031
>                 URL: https://issues.apache.org/jira/browse/FLINK-9031
>             Project: Flink
>          Issue Type: Bug
>          Components: DataSet API, Local Runtime, Optimizer
>    Affects Versions: 1.3.1
>            Reporter: Fabian Hueske
>            Assignee: Fabian Hueske
>            Priority: Critical
>             Fix For: 1.5.0, 1.4.3
>
>         Attachments: Person.java, RunAll.java, newplan.txt, oldplan.txt
>
>
> A user [reported this issue on the user mailing list|https://lists.apache.org/thread.html/075f1a487b044079b5d61f199439cb77dd4174bd425bcb3327ed7dfc@%3Cuser.flink.apache.org%3E].
> {quote}I am using Flink 1.3.1 and I have found a strange behavior on running the following logic:
>  # Read data from file and store into DataSet<POJO>
>  # Split dataset in two, by checking if "field1" of POJOs is empty or not, so that the first dataset contains only elements with non empty "field1", and the second dataset will contain the other elements.
>  # Each dataset is then grouped by, one by "field1" and other by another field, and subsequently reduced.
>  # The 2 datasets are merged together by union.
>  # The final dataset is written as json.
> What I was expected, from output, was to find only one element with a specific value of "field1" because:
>  # Reducing the first dataset grouped by "field1" should generate only one element with a specific value of "field1".
>  # The second dataset should contain only elements with empty "field1".
>  # Making an union of them should not duplicate any record.
> This does not happen. When i read the generated jsons i see some duplicate (non empty) values of "field1".
>  Strangely this does not happen when the union between the two datasets is not computed. In this case the first dataset produces elements only with distinct values of "field1", while second dataset produces only records with empty field "value1".
> {quote}
> The user has not enable object reuse.
> Later he reports that the problem disappears when he injects a rebalance() after a union resolves the problem. I had a look at the execution plans for both cases (attached to this issue) but could not identify a problem.
> Hence I assume, this might be an issue with the runtime code but we need to look deeper into this. The user also provided an example program consisting of two classes which are attached to the issue as well.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)