Posted to dev@flink.apache.org by Gyula Fóra <gy...@gmail.com> on 2015/07/30 21:17:02 UTC

Types in the Python API

Hey!

Could anyone briefly tell me what exactly the reason is that we force
users in the Python API to declare types for operators?

I don't really understand how this works in different systems, but I am just
curious why Flink requires types and why Spark, for instance, doesn't.

If you could give me some pointers to read, that would also be fine :)

Thank you,
Gyula

Re: Types in the Python API

Posted by Aljoscha Krettek <al...@apache.org>.
I don't know yet. :D

Maybe the sorting will have to be delegated to Python. I don't think it's
possible to always get a meaningful order when sorting only on the
serialized bytes. It should, however, work for grouping.
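
As a quick sketch of why that is (illustrative only; the fixed-width
encoding and the values are made up): a deterministic serializer maps equal
values to equal bytes, so grouping on the serialized form is safe, but the
raw byte order of, e.g., signed integers does not match their numeric order.

    import struct

    # Deterministic fixed-width encoding for signed 32-bit ints (big-endian).
    def encode(x: int) -> bytes:
        return struct.pack(">i", x)

    blobs = [encode(v) for v in [3, -1, 2, -1, 0]]

    # Equal values always yield equal bytes, so grouping on the blobs works.
    assert encode(-1) == encode(-1)

    # But sorting the raw bytes does not reproduce the numeric order:
    # two's-complement negatives start with 0xff and sort after positives.
    print([struct.unpack(">i", b)[0] for b in sorted(blobs)])
    # -> [0, 2, 3, -1, -1]  (grouped correctly, ordered "wrong")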

On Fri, 31 Jul 2015 at 10:31 Chesnay Schepler <c....@web.de> wrote:

> if it's just a single array, how would you define group/sort keys?

Re: Types in the Python API

Posted by Gyula Fóra <gy...@gmail.com>.
In any case, thank you guys for the exhaustive discussion :D

Aljoscha Krettek <al...@apache.org> wrote (on 31 Jul 2015, Fri, 13:52):

> Yes, I wouldn't deal with that now; that's orthogonal to the Types issue.

Re: Types in the Python API

Posted by Aljoscha Krettek <al...@apache.org>.
Yes, I wouldn't deal with that now; that's orthogonal to the Types issue.

On Fri, 31 Jul 2015 at 12:09 Chesnay Schepler <c....@web.de> wrote:

> I feel like we drifted away from the original topic a bit, but alright.
>
> I don't consider it a pity that we created a proprietary protocol. We know
> exactly how it works and what it is capable of. It is also tailored exactly
> to our use case, in contrast to general-purpose libraries. If we ever
> decide that the current implementation is lacking, we can always look for
> better alternatives and swap stuff out fairly easily. Bonus points for
> being able to swap out only part of the system, since there is a clear
> distinction between *what* (how the data is serialized) and
> *how* (tcp/mmap) data is exchanged, something that, in my opinion, is far
> too often bundled together.
>
> On the other hand, let's assume we had gone with one of these magic
> libraries from the start. If you then noticed it lacked something (let's
> say it's too slow) and couldn't find a different library without these
> faults, you would be so screwed. You would have to re-implement these magic
> libraries, with all their supported features, without those faults, or
> otherwise break a lot of user programs that were built upon them.
>
> The current implementation was the safer approach, imo. It has its faults,
> and did provide me with some very nerve-wracking afternoons, but I'd feel
> really uncomfortable relying on some library that I have no control over
> for the most performance-impacting component.

Re: Types in the Python API

Posted by Chesnay Schepler <c....@web.de>.
I feel like we drifted away from the original topic a bit, but alright.

I don't consider it a pity that we created a proprietary protocol. We know
exactly how it works and what it is capable of. It is also tailored exactly
to our use case, in contrast to general-purpose libraries. If we ever
decide that the current implementation is lacking, we can always look for
better alternatives and swap stuff out fairly easily. Bonus points for
being able to swap out only part of the system, since there is a clear
distinction between *what* (how the data is serialized) and
*how* (tcp/mmap) data is exchanged, something that, in my opinion, is far
too often bundled together.

On the other hand, let's assume we had gone with one of these magic
libraries from the start. If you then noticed it lacked something (let's
say it's too slow) and couldn't find a different library without these
faults, you would be so screwed. You would have to re-implement these magic
libraries, with all their supported features, without those faults, or
otherwise break a lot of user programs that were built upon them.

The current implementation was the safer approach, imo. It has its faults,
and did provide me with some very nerve-wracking afternoons, but I'd feel
really uncomfortable relying on some library that I have no control over
for the most performance-impacting component.
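
To sketch that what/how split (hypothetical interface and class names,
invented for illustration; not the actual implementation):

    from abc import ABC, abstractmethod

    # *What*: turning records into bytes (hypothetical interface).
    class Serializer(ABC):
        @abstractmethod
        def dumps(self, record) -> bytes: ...

    # *How*: moving bytes between the processes (hypothetical interface).
    class Transport(ABC):
        @abstractmethod
        def send(self, payload: bytes) -> None: ...

    class TcpTransport(Transport):
        def __init__(self, sock):
            self.sock = sock
        def send(self, payload: bytes) -> None:
            # Length-prefix so the receiver knows where a record ends.
            self.sock.sendall(len(payload).to_bytes(4, "big") + payload)

    class MmapTransport(Transport):
        def __init__(self, buf):
            self.buf = buf  # e.g. an mmap.mmap region shared with the JVM
            self.pos = 0
        def send(self, payload: bytes) -> None:
            self.buf[self.pos:self.pos + len(payload)] = payload
            self.pos += len(payload)

    # Any Serializer output can ride on either Transport, so the *what*
    # and the *how* can be swapped out independently of each other.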

On 31.07.2015 11:18, Maximilian Michels wrote:
> py4j looks really nice, and the communication works both ways. There is
> also another Python-to-Java communication library called javabridge. I
> think it is a pity we chose to implement a proprietary protocol for the
> network communication of the Python API. This could have been abstracted
> more nicely, and we have already seen that you can run into problems if
> you implement that yourself.
>
> Serializers could be created dynamically if Python passed its dynamically
> determined types to Java at runtime. Then everything should work on the
> Java side.

Re: Types in the Python API

Posted by Maximilian Michels <mx...@apache.org>.
py4j looks really nice, and the communication works both ways. There is
also another Python-to-Java communication library called javabridge. I
think it is a pity we chose to implement a proprietary protocol for the
network communication of the Python API. This could have been abstracted
more nicely, and we have already seen that you can run into problems if you
implement that yourself.

Serializers could be created dynamically if Python passed its dynamically
determined types to Java at runtime. Then everything should work on the
Java side.
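
For example (a sketch; the one-letter tag alphabet is invented): Python
could inspect a sample record at runtime and ship a small type descriptor
to Java, which then instantiates matching serializers.

    # Derive a type descriptor from a sample record at runtime.
    def describe(value) -> str:
        if isinstance(value, bool):  return "b"   # check bool before int
        if isinstance(value, int):   return "i"
        if isinstance(value, float): return "f"
        if isinstance(value, str):   return "s"
        if isinstance(value, tuple):
            return "(" + "".join(describe(f) for f in value) + ")"
        raise TypeError(f"unsupported type: {type(value)!r}")

    print(describe((1, "foo", (2.0, True))))  # -> "(is(fb))"
    # This descriptor would be sent to the Java side once, before the data,
    # so Java could set up the right TypeInformation/serializers.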

On Fri, Jul 31, 2015 at 11:01 AM, Till Rohrmann <tr...@apache.org>
wrote:

> Zeppelin uses py4j [1] to transfer data between a Python process and a JVM.
> That way they can run a Python interpreter and a Java interpreter and
> easily share state between them. Spark also uses py4j as a bridge between
> Java and Python. However, I don't know for what exactly, and I also don't
> know what the performance penalty of py4j is. But programming is a lot of
> fun with it :-)
>
> Cheers,
> Till
>
> [1] https://www.py4j.org/

Re: Types in the Python API

Posted by Till Rohrmann <tr...@apache.org>.
Zeppelin uses py4j [1] to transfer data between a Python process and a JVM.
That way they can run a Python interpreter and a Java interpreter and
easily share state between them. Spark also uses py4j as a bridge between
Java and Python. However, I don't know for what exactly, and I also don't
know what the performance penalty of py4j is. But programming is a lot of
fun with it :-)

Cheers,
Till

[1] https://www.py4j.org/
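
For reference, the basic py4j pattern looks roughly like this (it assumes a
py4j GatewayServer is already running inside the JVM, as in the py4j
documentation):

    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()  # connect to the GatewayServer in the JVM

    # Instantiate a JVM object and call its methods as if they were Python.
    random = gateway.jvm.java.util.Random()
    print(random.nextInt(10))

    # Every call crosses the process boundary over a local socket, which is
    # where a per-call performance penalty would come from.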

On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <se...@apache.org> wrote:

> I think, in short: Spark never worried about types; the data is just
> something arbitrary to it.
>
> Flink worries about types for its memory management.
>
> Aljoscha's suggestion is a good one: have a PythonTypeInfo that is dynamic.
>
> Till also found a pretty nice way to connect Python and Java in his
> Zeppelin-based demo at the meetup.

Re: Types in the Python API

Posted by Stephan Ewen <se...@apache.org>.
I think, in short: Spark never worried about types; the data is just
something arbitrary to it.

Flink worries about types for its memory management.

Aljoscha's suggestion is a good one: have a PythonTypeInfo that is dynamic.

Till also found a pretty nice way to connect Python and Java in his
Zeppelin-based demo at the meetup.
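
Rough intuition for the memory-management point (a toy sketch, not Flink
code): when a record's serialized width is known from its type, records can
be packed back to back into preallocated memory segments and compared there
without deserializing them.

    import struct

    # Toy "memory segment": fixed-width records packed into one buffer.
    RECORD_FMT = ">iq"                        # int key, long payload
    RECORD_SIZE = struct.calcsize(RECORD_FMT) # 12 bytes per record
    segment = bytearray(RECORD_SIZE * 1024)

    def write(slot: int, key: int, payload: int) -> None:
        struct.pack_into(RECORD_FMT, segment, slot * RECORD_SIZE, key, payload)

    def key_bytes(slot: int) -> bytes:
        off = slot * RECORD_SIZE
        return bytes(segment[off:off + 4])    # the key, still serialized

    write(0, 42, 7)
    write(1, 42, 9)
    assert key_bytes(0) == key_bytes(1)       # grouping without deserializing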

On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c....@web.de>
wrote:

> if it's just a single array, how would you define group/sort keys?

Re: Types in the Python API

Posted by Chesnay Schepler <c....@web.de>.
if it's just a single array, how would you define group/sort keys?

On 31.07.2015 07:03, Aljoscha Krettek wrote:
> I think the Python part would then just serialize all the tuple fields into
> one big byte array, and all the key fields into another array, so that the
> Java side can do comparisons on the whole "key blob".
>
> Maybe it's overly simplistic, but it might work. :D


Re: Types in the Python API

Posted by Aljoscha Krettek <al...@apache.org>.
I think the Python part would then just serialize all the tuple fields into
one big byte array, and all the key fields into another array, so that the
Java side can do comparisons on the whole "key blob".

Maybe it's overly simplistic, but it might work. :D
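
A sketch of that two-blob layout (the concrete encoding is made up; any
deterministic one would do):

    import struct

    def encode_str(s: str) -> bytes:
        b = s.encode("utf-8")
        return struct.pack(">i", len(b)) + b  # length-prefixed, deterministic

    # Record (user_id, name, score), grouped on fields 0 and 1.
    record = (7, "alice", 0.5)

    # One big blob holding every field of the tuple...
    data_blob = (struct.pack(">i", record[0]) + encode_str(record[1])
                 + struct.pack(">d", record[2]))

    # ...and a second blob holding only the key fields, so the Java side
    # can hash and compare the "key blob" as one opaque byte string.
    key_blob = struct.pack(">i", record[0]) + encode_str(record[1])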

On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c....@web.de> wrote:

> I can see this working for basic types, but am unsure how it would work
> with Tuples. Wouldn't the Java API still need to know the arity to set up
> serializers?

Re: Types in the Python API

Posted by Chesnay Schepler <c....@web.de>.
I can see this working for basic types, but am unsure how it would work
with Tuples. Wouldn't the Java API still need to know the arity to set up
serializers?
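
One conceivable way around the arity question (a sketch; the tag scheme is
invented for illustration) would be a self-describing encoding, so the Java
side only ever needs a serializer for byte[]:

    import struct

    # Self-describing tuple encoding: arity prefix plus a tag per field.
    def pack_tuple(t: tuple) -> bytes:
        parts = [struct.pack(">i", len(t))]       # arity travels in the blob
        for field in t:
            if isinstance(field, int):
                parts.append(b"I" + struct.pack(">q", field))
            elif isinstance(field, str):
                b = field.encode("utf-8")
                parts.append(b"S" + struct.pack(">i", len(b)) + b)
            else:
                raise TypeError(f"unsupported field type: {type(field)!r}")
        return b"".join(parts)

    blob = pack_tuple((1, "foo", 2))  # Java needs no arity up front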

On 30.07.2015 23:02, Aljoscha Krettek wrote:
> I believe it should be possible to create a special PythonTypeInfo where
> the Python side is responsible for serializing data to a byte array; to the
> Java side it is just a byte array, and all the comparisons are also
> performed on these byte arrays. I think partitioning and sorting should
> still work, since the sorting is (in most cases) only used to group the
> elements for a groupBy(). If a proper sort order were required, it would
> have to be established on the Python side.


Re: Types in the Python API

Posted by Aljoscha Krettek <al...@apache.org>.
I believe it should be possible to create a special PythonTypeInfo where
the Python side is responsible for serializing data to a byte array; to the
Java side it is just a byte array, and all the comparisons are also
performed on these byte arrays. I think partitioning and sorting should
still work, since the sorting is (in most cases) only used to group the
elements for a groupBy(). If a proper sort order were required, it would
have to be established on the Python side.
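
To make the partitioning argument concrete (a sketch; the hash and the
encoding are arbitrary choices): as long as the Python-side serialization
is deterministic, hashing the opaque bytes routes equal keys to the same
partition.

    import hashlib
    import struct

    def encode_key(key: int) -> bytes:
        return struct.pack(">q", key)   # deterministic: equal key, equal bytes

    def partition(blob: bytes, n: int) -> int:
        # The Java side sees only a byte[]; hashing it suffices for groupBy().
        return int.from_bytes(hashlib.md5(blob).digest()[:4], "big") % n

    assert partition(encode_key(42), 8) == partition(encode_key(42), 8)
    # Equal keys always land in the same partition, even though the actual
    # type of the data is invisible to the Java side.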

On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c....@web.de> wrote:

> To be perfectly honest, I never really managed to work my way through
> Spark's Python API; it's a whole bunch of magic to me, and not even the
> general structure is understandable.
>
> With "pure python", do you mean doing everything in Python, as in just
> having serialized data on the Java side?
>
> I believe the way to do this with Flink is to add a switch that
> a) disables all type checks
> b) creates serializers dynamically at runtime.
>
> a) should be fairly straightforward; b), on the other hand....
>
> By the way, the Python API itself doesn't require the type information; it
> already does the b) part.

Re: Types in the Python API

Posted by Chesnay Schepler <c....@web.de>.
To be perfectly honest, I never really managed to work my way through
Spark's Python API; it's a whole bunch of magic to me, and not even the
general structure is understandable.

With "pure python", do you mean doing everything in Python, as in just
having serialized data on the Java side?

I believe the way to do this with Flink is to add a switch that
a) disables all type checks
b) creates serializers dynamically at runtime.

a) should be fairly straightforward; b), on the other hand....

By the way, the Python API itself doesn't require the type information; it
already does the b) part.
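
A sketch of the b) idea (hypothetical helper names; illustration only): the
serializer is looked up from the record actually seen at runtime, not from
a declared operator type.

    import struct

    # Hypothetical runtime serializer registry: chosen per record type
    # at runtime rather than from a user-declared operator type.
    SERIALIZERS = {
        int:   lambda v: b"I" + struct.pack(">q", v),
        float: lambda v: b"F" + struct.pack(">d", v),
        str:   lambda v: b"S" + v.encode("utf-8"),
    }

    def serialize(record) -> bytes:
        try:
            return SERIALIZERS[type(record)](record)
        except KeyError:
            raise TypeError(f"no serializer for {type(record)!r}") from None

    serialize(3)        # works without any declared operator type
    serialize("three")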

On 30.07.2015 22:11, Gyula Fóra wrote:
> That I understand, but could you please tell me how this is done
> differently in Spark, for instance?
>
> What would we need to change to make this work with pure Python (as it
> seems to be possible)? This probably has large performance implications,
> though.
>
> Gyula


Re: Types in the Python API

Posted by Gyula Fóra <gy...@gmail.com>.
That I understand, but could you please tell me how this is done
differently in Spark, for instance?

What would we need to change to make this work with pure Python (as it
seems to be possible)? This probably has large performance implications,
though.

Gyula

Chesnay Schepler <c....@web.de> wrote (on 30 Jul 2015, Thu, 22:04):

> Because it still goes through the Java API, which requires some kind of
> type information. Imagine a Java API program where you omit all generic
> types; it just wouldn't work as of now.

Re: Types in the Python API

Posted by Chesnay Schepler <c....@web.de>.
Because it still goes through the Java API, which requires some kind of
type information. Imagine a Java API program where you omit all generic
types; it just wouldn't work as of now.

On 30.07.2015 21:17, Gyula Fóra wrote:
> Hey!
>
> Could anyone briefly tell me what exactly the reason is that we force
> users in the Python API to declare types for operators?
>
> I don't really understand how this works in different systems, but I am just
> curious why Flink requires types and why Spark, for instance, doesn't.
>
> If you could give me some pointers to read, that would also be fine :)
>
> Thank you,
> Gyula
>