Posted to issues@flink.apache.org by "Chesnay Schepler (JIRA)" <ji...@apache.org> on 2015/08/11 16:49:46 UTC
[jira] [Comment Edited] (FLINK-2501) [py] Remove the need to specify types for transformations

    [ https://issues.apache.org/jira/browse/FLINK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680732#comment-14680732 ] 

Chesnay Schepler edited comment on FLINK-2501 at 8/11/15 2:49 PM:
------------------------------------------------------------------

hmmmmm.........................

One question that still remains is how we would tell the Java API what size the emitted tuples have. This is the primary reason I didn't list such a solution.

(3) has really nice things going for it: you save bandwidth as you don't send separate keys around; you save computation power by not having to extract keys; you reduce complexity since you don't have to alter the program plan or discard/hide the keys.

But unless a solution for the above issue is brought up (2) seems like the way to go. Unless I'm misunderstanding something.

Regarding your other points:
Projection is the only operation I see right now that wouldn't need a special implementation with (3), as it allows access to individual fields.

To skip the sort implementation one could modify (2) to work on a tuple of keys. So for grouped operations, instead of a Tuple2<byte[], byte[]> you would work on a Tuple2<TupleX<byte[],..>, byte[]>. This would make sorts equivalent for (2) and (3).

And yes, the binary data would contain type information.
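
A minimal sketch of the two record layouts discussed for (2), written in Python purely for illustration. The helper names (encode, to_single_key, to_key_tuple) are hypothetical and not part of the Python API, and repr() is only a stand-in for the real serializer. It shows the difference between packing all keys into one byte array and keeping each key field as its own byte array inside a tuple, which presumably is what lets the Java side address individual key fields when sorting.

```python
def encode(field):
    # Stand-in for the real serializer (an assumption for this sketch).
    return repr(field).encode("utf-8")


def to_single_key(record, key_positions):
    """Layout (2) as proposed: Tuple2<byte[], byte[]>.

    All key fields are packed into one byte string, so the individual
    key fields are no longer separately addressable.
    """
    key = b"".join(encode(record[pos]) for pos in key_positions)
    return key, encode(record)


def to_key_tuple(record, key_positions):
    """Modified layout: Tuple2<TupleX<byte[],..>, byte[]>.

    Each key field stays a separate byte string inside a tuple.
    """
    keys = tuple(encode(record[pos]) for pos in key_positions)
    return keys, encode(record)


record = (7, "flink", 3.5)
single = to_single_key(record, (0, 1))
nested = to_key_tuple(record, (0, 1))
```

Note that byte-wise ordering of serialized fields only matches the ordering of the original values if the serializer is order-preserving; the stand-in encoder here makes no such guarantee.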



was (Author: zentol):
hmmmmm.........................

One question that still remains is how we would tell the Java API whether a UDF emits a basic type (which would just be a byte[]) or an arbitrarily nested tuple. This is the primary reason I didn't list such a solution.

(3) has really nice things going for it: you save bandwidth as you don't send separate keys around; you save computation power by not having to extract keys; you reduce complexity since you don't have to alter the program plan or discard/hide the keys.

But unless a solution for the above issue is brought up (2) seems like the way to go. Unless I'm misunderstanding something.

Regarding your other points:
Projection is the only operation I see right now that wouldn't need a special implementation with (3), as it allows access to individual fields.

To skip the sort implementation one could modify (2) to work on a tuple of keys. So for grouped operations, instead of a Tuple2<byte[], byte[]> you would work on a Tuple2<TupleX<byte[],..>, byte[]>. This would make sorts equivalent for (2) and (3).

And yes, the binary data would contain type information.


> [py] Remove the need to specify types for transformations
> ---------------------------------------------------------
>
>                 Key: FLINK-2501
>                 URL: https://issues.apache.org/jira/browse/FLINK-2501
>             Project: Flink
>          Issue Type: Improvement
>          Components: Python API
>            Reporter: Chesnay Schepler
>
> Currently, users of the Python API have to provide type arguments when using a UDF, like so:
> {code}
> d1.map(Mapper(), (INT, STRING))
> {code}
> Instead, it would be really convenient to be able to do this:
> {code}
> d1.map(Mapper())
> {code}
> The intention behind this issue is convenience, and it's also not really pythonic to specify types.
> Before I go into possible solutions, let me summarize how these type arguments are currently used, and how types are handled in general:
> The type argument passed is actually an object of the type it represents, as INT is a constant int value, whereas STRING is a constant string value. You could as well write the following and it would still work.
> {code}
> d1.map(Mapper(), (1, "ImNotATypInfo"))
> {code}
> This object is transmitted to the java side during the plan binding (and is now an actual Tuple2<Integer, String>), then passed to the type extractor, and the resulting TypeInformation saved in the java counterpart of the udf, which all implement the ResultTypeQueryable interface. 
> The TypeInformation object is only used by the Java API, python never touches it. Instead, at runtime, the serializers used between python and java check the classes of the values passed and are thus generated dynamically.
> This means that, if a UDF does not pass the type it claims to pass, the Python API won't complain, but the underlying Java API will when its serializers fail.
> Now let's talk solutions.
> In discussions on the mailing list, pretty much 2 proposals were made:
> # Add a way to disable/circumvent type checks during the plan phase in the Java API and generate serializers dynamically.
> # Have objects always in serialized form on the java side, stored in a single bytearray or Tuple2 containing a key/value pair.
> These proposals vary wildly in the changes necessary to the system:
> # "How can we change the Java API to support this?"
> This proposal would hardly change the way the Python API works, or even touch the related source code. It mostly deals with the Java API. Since I'm not too familiar with the plan processing life-cycle on the Java side, I can't assess which classes would have to be changed.
> # "How can we make this work within the limits of the Java API?"
> This proposal is the exact opposite: it changes nothing in the Java API. Instead, the following issues would have to be solved:
> * Alter the plan to extract keys before keyed operations, while hiding these keys from the UDF. This is exactly how KeySelectors (will) work, and as such is generally solved. In fact, this solution would make a few things easier in regards to KeySelectors.
> * Rework all operations that currently rely on Java API functions that need deserialized data, for example Projections or the upcoming Aggregations.
> This generally means implementing them in Python, or with special Java UDFs (they could de-/serialize data within the UDF call, or work on serialized data).
> * Change (De)Serializers accordingly
> * Implement a reliable sorting mechanism on the Python side that does not consume all available memory
> Personally I prefer the second option, as
> # it does not modify the Java API; it works within its well-tested limits
> # Plan changes are similar to issues that are already worked on (KeySelectors)
> # Sorting implementation was necessary anyway (for chained reducers)
> # having data in serialized form was a performance-related consideration already
> While the first option could work, and would most likely require less work, I feel like many of the things required for option 2 will be implemented eventually anyway.
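
The issue description above notes that the serializers between Python and Java are chosen at runtime from the class of each value, and the comment adds that the binary data would contain type information. A minimal sketch of that idea in Python 3 follows. The type tags, the wire layout and the function names are assumptions made for this example only; they do not describe the actual format used by the Python API.

```python
import struct

# Hypothetical type tags; the real wire format is not given in the issue.
TYPE_INT = b"\x01"
TYPE_STRING = b"\x02"
TYPE_TUPLE = b"\x03"


def serialize(value):
    """Pick a serializer from the runtime class of the value."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        return TYPE_INT + struct.pack(">q", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return TYPE_STRING + struct.pack(">i", len(data)) + data
    if isinstance(value, tuple):
        # The arity is written into the data, so the tuple size travels
        # with each record instead of being declared up front.
        body = b"".join(serialize(field) for field in value)
        return TYPE_TUPLE + struct.pack(">B", len(value)) + body
    raise TypeError("unsupported type: %s" % type(value).__name__)


def deserialize(buf, offset=0):
    """Inverse of serialize; returns (value, next_offset)."""
    tag = buf[offset:offset + 1]
    offset += 1
    if tag == TYPE_INT:
        return struct.unpack_from(">q", buf, offset)[0], offset + 8
    if tag == TYPE_STRING:
        (length,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        return buf[offset:offset + length].decode("utf-8"), offset + length
    if tag == TYPE_TUPLE:
        (arity,) = struct.unpack_from(">B", buf, offset)
        offset += 1
        fields = []
        for _ in range(arity):
            field, offset = deserialize(buf, offset)
            fields.append(field)
        return tuple(fields), offset
    raise ValueError("unknown type tag: %r" % tag)
```

With such a self-describing format, a call like d1.map(Mapper()) would not need a type argument for the Python-to-Java transfer itself; how the Java API learns the result type during the plan phase is the open question discussed in this issue.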



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)