You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Greg Hogan (JIRA)" <ji...@apache.org> on 2016/02/09 21:08:18 UTC

[jira] [Comment Edited] (FLINK-3333) Documentation about object reuse should be improved

    [ https://issues.apache.org/jira/browse/FLINK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139151#comment-15139151 ] 

Greg Hogan edited comment on FLINK-3333 at 2/9/16 8:07 PM:
-----------------------------------------------------------

Apache Flink programs can be written and configured to reduce the number of object allocations for better performance. User defined functions (like map() or groupReduce()) process many millions or billions of input and output values. Enabling object reuse and processing mutable objects improves performance by lowering demand on the CPU cache and Java garbage collector.

Object reuse is disabled by default, with user defined functions generally getting new objects on each call (or through an iterator). In this case it is safe to store references to the objects inside the function (for example, in a List).

<'storing values in a list' example>

Apache Flink will chain functions to improve performance when sorting is preserved and the parallelism unchanged. The chainable operators are Map, FlatMap, Reduce on full DataSet, GroupCombine on a Grouped DataSet, or a GroupReduce where the user supplied a RichGroupReduceFunction with a combine method. Objects are passed without copying _even when object reuse is disabled_.

In the chaining case, the functions in the chain are receiving the same object instances. So the the second map() function is receiving the objects the first map() is returning. This behavior can lead to errors when the first map() function keeps a list of all objects and the second mapper is modifying objects. In that case, the user has to manually create copies of the objects before putting them into the list.

<chainable example>

<discussion of copyable values>

<copyablevalue example>

There is a switch at the ExecutionConfig which allows users to enable the object reuse mode (enableObjectReuse()). For mutable types, Flink will reuse object instances. In practice that means that a user function will always receive the same object instance (with its fields set to new values). The object reuse mode will lead to better performance because fewer objects are created, but the user has to manually take care of what they are doing with the object references.

<object reuse example>


was (Author: greghogan):
Apache Flink programs can be written and configured to reduce the number of object allocations for better performance. User defined functions (like map() or groupReduce()) process many millions or billions of input and output values. Enabling object reuse and processing mutable objects improves performance by lowering demand on the CPU cache and Java garbage collector.

Object reuse is disabled by default, with user defined functions generally getting new objects on each call (or through an iterator). In this case it is safe to store references to the objects inside the function (for example, in a List).

<'storing values in a list' example>

Apache Flink will chain functions to improve performance when sorting is preserved and the parallelism unchanged. The chainable operators are Map, FlatMap, Reduce on full DataSet, GroupCombine on a Grouped DataSet, or a GroupReduce where the user supplied a RichGroupReduceFunction with a combine method). Objects are passed without copying _even when object reuse is disabled_.

In the chaining case, the functions in the chain are receiving the same object instances. So the the second map() function is receiving the objects the first map() is returning. This behavior can lead to errors when the first map() function keeps a list of all objects and the second mapper is modifying objects. In that case, the user has to manually create copies of the objects before putting them into the list.

<chainable example>

<discussion of copyable values>

<copyablevalue example>

There is a switch at the ExecutionConfig which allows users to enable the object reuse mode (enableObjectReuse()). For mutable types, Flink will reuse object instances. In practice that means that a user function will always receive the same object instance (with its fields set to new values). The object reuse mode will lead to better performance because fewer objects are created, but the user has to manually take care of what they are doing with the object references.

<object reuse example>

> Documentation about object reuse should be improved
> ---------------------------------------------------
>
>                 Key: FLINK-3333
>                 URL: https://issues.apache.org/jira/browse/FLINK-3333
>             Project: Flink
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.0.0
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> The documentation about object reuse \[1\] has several problems, see \[2\].
> \[1\] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
> \[2\] https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg/edit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)