You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by XuPingyong <gi...@git.apache.org> on 2017/08/11 06:33:44 UTC

[GitHub] flink pull request #4525: [FLINK-7423] Always reuse an instance to get eleme...

GitHub user XuPingyong opened a pull request:

    https://github.com/apache/flink/pull/4525

    [FLINK-7423] Always reuse an instance to get elements from the inputFormat

    ## What is the purpose of the change
    
    This pull request fix a bug about getting elements from the inputFormat in InputFormatSourceFunction.java
    
    
    ## Verifying this change
    
    This change added tests and can be verified as follows:
    
      - *Added test in InputFormatSourceFunctionTest*
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/XuPingyong/flink FLINK-7423

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4525
    
----
commit ddca0f9d258eed4bc69bf931aebc1bbde385c799
Author: pingyong.xpy <pi...@alibaba-inc.com>
Date:   2017-08-11T06:09:12Z

    [FLINK-7423] Always reuse an instance to get elements from the inputFormat

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    `MutableObjectIterator#next` allows/requires a `null` result: "@return The object or <code>null</code> if the iterator is exhausted.". The documentation could be improved, e.g. in `MergeIterator` the returned object is not necessarily the immediately passed argument.
    
    `DataSourceTask` looks to be the only `InputFormat` consumer re-passing a reuse object.
    
    It seems the reason to allow returning `null` is handling, for example, a bad record at the end of a file, such that `reachedEnd` would have returned `false` but there is no record to return.
    
    Object reuse is always optional so is not an issue for immutable types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #4525: [FLINK-7423] Always reuse an instance to get eleme...

Posted by XuPingyong <gi...@git.apache.org>.
Github user XuPingyong closed the pull request at:

    https://github.com/apache/flink/pull/4525


---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    Hi @XuPingyong thanks for submitting this PR. I'm not clear on why the `null` check was added in FLINK-4075 and `ContinuousFileProcessingTest` is not running locally for me. When calling `InputFormat#nextRecord` we can only reuse the returned object. The object passed to `nextRecord` may be reused internally by the `InputFormat` so we cannot simply pass in the same object. It would be better to move `OUT nextElement = serializer.createInstance();` into the outer loop.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    @StephanEwen why are `null` values permitted if not in the contract of the input formats?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    I think `null` values are not really permitted at the moment, but I think the InputFormats do not explicitly forbid them. That's why the logic is not "run until returns null", but "run until reachedEnd()".
    
    The change here implements a mixed contract like "run until reachedEnd() or returns null", which is more complicated to document, enforce, test, and understand for users.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    As I understand Flink does not allow `null` records since what does it mean to partition or window a null record? `null` fields are sometimes allowed but `null` records are meaningless.
    
    The caller of `InputFormat#nextRecord` cannot reuse an object which has already been passed to `nextRecord` as a reusable object as this may result in an object being "used" twice. That's why object reuse always replaces the passed object with the returned object, which may be the same but could be different.
    
    @kl0u could you offer some insight into FLINK-4075 adding the null check with break to `InputFormatSourceFunction#run`? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by XuPingyong <gi...@git.apache.org>.
Github user XuPingyong commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    Thanks @greghogan. 
    
          I think  the InputFormat can return anything, may be null or other immutable objects that cannot be passed to nextRecord.
         And I am also not clear why the object passed to nextRecord may be reused internally by the InputFormat. The param of nextRecord named reuse means it may be re-passed to this call again?
    
        What do you think?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    The logic is currently not correct with the contract of the input formats. A return value of null is not an "end of split" indicator.
    
    Also, the description mentions that this adds a test, which I cannot find in the diff...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by XuPingyong <gi...@git.apache.org>.
Github user XuPingyong commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    Thanks @StephanEwen ,  so I think `null` values can be returned from `InputFormat#nextRecord` anytime, even not end. 
    
    Can `null` values be passed to `InputFormat#nextRecord`,  or checkNull(if `null`, give a new object) before passed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by XuPingyong <gi...@git.apache.org>.
Github user XuPingyong commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    @greghogan, if the object passed to nextRecord may be reused internally by the InputFormat, do the similar cases need to  be re-considered?
    
    In `DataSourceTask.java`:
                  
                  OT reuse = serializer.createInstance();
    
    		// as long as there is data to read
    		while (!this.taskCanceled && !format.reachedEnd()) {
    		    OT returned;
    		    if ((returned = format.nextRecord(reuse)) != null) {
    			output.collect(returned);
    		    }
    		}
    
    And in many batch drivers:
                  
                  final MutableObjectIterator<T> in = taskContext.getInput(0);
                  T value = serializer.createInstance();
    	          while (running && (value = in.next(value)) != null) {
                      .......
                  } 
    
    
    In my opinion:
         1.  `Null` records are meaningless, but `null` is meaningful for input or format which means the end. If a user only call `InputFormat#nextRecord` without `InputFormat#reachedEnd`, only `null` can be returned. 
         2.  The returned object of `InputFormat#nextRecord` should not need to be considered that it may be passed again. If a immutable object is returned, an exception will be thrown  when it is reused again in `InputFormat#nextRecord`.
    
    @greghogan, could you please offer some cases that the object passed to nextRecord can be reused internally by the InputFormat?  Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4525: [FLINK-7423] Always reuse an instance to get elements fro...

Posted by XuPingyong <gi...@git.apache.org>.
Github user XuPingyong commented on the issue:

    https://github.com/apache/flink/pull/4525
  
    Thanks @greghogan , `MergeIterator` is really a special case that it would be better to pass the returned object to  `MutableObjectIterator#next` again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---