Posted to user@flink.apache.org by "Sharipov, Rinat" <r....@cleverdata.ru> on 2020/10/09 17:17:57 UTC

[PyFlink] register udf functions with different versions of the same library in the same job

Hi mates !

I've just read an amazing article
<https://medium.com/@Alibaba_Cloud/the-flink-ecosystem-a-quick-start-to-pyflink-6ad09560bf50>
about PyFlink and I'm absolutely delighted.
I have some questions about UDF registration: it seems that it's
possible to specify the list of libraries that should be used to evaluate
UDF functions.
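
If I read the docs right, it would look something like this (just my own
sketch, assuming a recent PyFlink version; the file names are placeholders):

from pyflink.table import EnvironmentSettings, TableEnvironment

# Sketch assuming a recent PyFlink version; exact setup calls vary.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Ship a requirements.txt so the Python workers install these libraries
# before evaluating UDFs ("requirements.txt" is a placeholder path).
t_env.set_python_requirements("requirements.txt")

# Plain Python modules with UDF code can be shipped as files too
# ("my_udfs.py" is a placeholder).
t_env.add_python_file("my_udfs.py")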

As far as I understand, each UDF runs in a separate process that is
managed by Beam (but I'm not sure I got that right).
Does that mean I can register multiple UDFs with different versions of the
same library, or, even better, with different Python environments, and they
won't clash?

A few words about the task that I'm trying to solve: I would like to build
a recommendation pipeline that accumulates features as a table and makes
recommendations using models from the MLflow registry. Since I don't want
to limit data analysts in which libraries they can use, the best solution
for me is to assemble the environment from a conda descriptor and register
a UDF function.
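
Concretely, I imagine packing the conda environment into an archive (e.g.
with conda-pack) and pointing the job at it, roughly like this (only a
sketch, the names are made up):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Ship the packed conda environment; it is extracted under "env"
# on the workers ("analysts_env.zip" is a made-up archive name).
t_env.add_python_archive("analysts_env.zip", "env")

# Run the Python workers with the interpreter from that environment.
t_env.get_config().set_python_executable("env/bin/python")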

Kubernetes and Kubeflow are not an option for us yet, so we are trying to
include models into existing pipelines.

thx !

Re: [PyFlink] register udf functions with different versions of the same library in the same job

Posted by "Sharipov, Rinat" <r....@cleverdata.ru>.
Hi Xingbo! Thanks a lot for such a detailed reply, it is very useful.


Re: [PyFlink] register udf functions with different versions of the same library in the same job

Posted by Xingbo Huang <hx...@gmail.com>.
Hi,
I will do my best to provide PyFlink-related answers; I hope this helps you.

>>>  each udf function is a separate process, that is managed by Beam (but
I'm not sure I got it right).

Strictly speaking, it is not true that every UDF runs in a different Python
process. For example, in an expression such as udf1(udf2(a)), the two Python
functions udf1 and udf2 run in the same Python process; you can think of
them as being fused into a single wrapped Python function udf1(udf2(a)). In
fact, in most cases we put multiple Python UDFs together like this to
improve performance.
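
A minimal sketch of what I mean (API details vary a bit across PyFlink
versions):

from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def udf2(a):
    return a + 1

@udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def udf1(a):
    return a * 2

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t = t_env.from_elements([(1,), (2,)], ['a'])

# udf1(udf2(a)) is planned as one chained Python operator, so both
# functions execute in the same Python worker process.
result = t.select(udf1(udf2(t.a)))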

>>> Does it mean that I can register multiple udf functions with different
versions of the same library or what would be even better with different
python environments and they won't clash

Currently, all nodes of a PyFlink job use the same Python environment path,
so there is no way to make each UDF use a different Python execution
environment. You may need to use multiple jobs to achieve this effect.
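
For example, you could keep one job script and submit it once per
environment, passing the packed environment as an argument (just a sketch;
the archive name and paths are placeholders):

import sys

from pyflink.table import EnvironmentSettings, TableEnvironment

# Submitted once per environment, e.g.:
#   flink run -py job.py model_a_env.zip
# where the archive is a packed conda/virtualenv containing the library
# versions this particular job needs (placeholder names).
env_archive = sys.argv[1]

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.add_python_archive(env_archive, "env")
t_env.get_config().set_python_executable("env/bin/python")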

Best,
Xingbo
