You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Fabian Hueske (JIRA)" <ji...@apache.org> on 2018/03/21 09:51:00 UTC

[jira] [Commented] (FLINK-9031) DataSet Job result changes when adding rebalance after union

    [ https://issues.apache.org/jira/browse/FLINK-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407683#comment-16407683 ] 

Fabian Hueske commented on FLINK-9031:
--------------------------------------

Adding relevant information from the mail thread:

[~StephanEwen] suggested
{quote}
To diagnose that, can you please check the following:
   - Change the Person data type to be immutable (final fields, no setters, set fields in constructor instead). Does that make the problem go away?
   - Change the Person data type to not be a POJO by adding a dummy fields that is never used, but does not have a getter/setter. 
Does that make the problem go away?
If either of that is the case, it must be a mutability bug somewhere in either accidental object reuse or accidental serializer sharing.
{quote}
 
Making the Person object immutable solved the problem.

> DataSet Job result changes when adding rebalance after union
> ------------------------------------------------------------
>
>                 Key: FLINK-9031
>                 URL: https://issues.apache.org/jira/browse/FLINK-9031
>             Project: Flink
>          Issue Type: Bug
>          Components: DataSet API, Local Runtime, Optimizer
>    Affects Versions: 1.3.1
>            Reporter: Fabian Hueske
>            Priority: Critical
>         Attachments: Person.java, RunAll.java, newplan.txt, oldplan.txt
>
>
> A user [reported this issue on the user mailing list|https://lists.apache.org/thread.html/075f1a487b044079b5d61f199439cb77dd4174bd425bcb3327ed7dfc@%3Cuser.flink.apache.org%3E].
> {quote}I am using Flink 1.3.1 and I have found a strange behavior on running the following logic:
>  # Read data from file and store into DataSet<POJO>
>  # Split dataset in two, by checking if "field1" of POJOs is empty or not, so that the first dataset contains only elements with non empty "field1", and the second dataset will contain the other elements.
>  # Each dataset is then grouped by, one by "field1" and other by another field, and subsequently reduced.
>  # The 2 datasets are merged together by union.
>  # The final dataset is written as json.
> What I was expected, from output, was to find only one element with a specific value of "field1" because:
>  # Reducing the first dataset grouped by "field1" should generate only one element with a specific value of "field1".
>  # The second dataset should contain only elements with empty "field1".
>  # Making an union of them should not duplicate any record.
> This does not happen. When i read the generated jsons i see some duplicate (non empty) values of "field1".
>  Strangely this does not happen when the union between the two datasets is not computed. In this case the first dataset produces elements only with distinct values of "field1", while second dataset produces only records with empty field "value1".
> {quote}
> The user has not enable object reuse.
> Later he reports that the problem disappears when he injects a rebalance() after a union resolves the problem. I had a look at the execution plans for both cases (attached to this issue) but could not identify a problem.
> Hence I assume, this might be an issue with the runtime code but we need to look deeper into this. The user also provided an example program consisting of two classes which are attached to the issue as well.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)