Posted to dev@spark.apache.org by Allison Wang <al...@databricks.com.INVALID> on 2023/08/30 22:26:54 UTC

[DISCUSS] SPIP: Python Stored Procedures

Hi all,

I would like to start a discussion on "Python Stored Procedures".

This proposal aims to extend Spark SQL by introducing support for stored
procedures, starting with Python as the procedural language. This will
enable users to run complex logic using Python within their SQL workflows
and save these routines in catalogs like HMS for future use.
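
To make this concrete, the workflow could look roughly like the sketch
below. The exact syntax and catalog behavior are what the SPIP defines; the
CREATE PROCEDURE / CALL statements and names here are purely illustrative
and are not taken from the proposal.

    # Illustrative sketch only -- the CREATE PROCEDURE / CALL syntax below
    # is hypothetical and not part of the SPIP text.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Define a procedure whose body is Python and persist it in a catalog.
    spark.sql("""
        CREATE PROCEDURE my_catalog.etl.clean_events()
        LANGUAGE PYTHON
        AS $$
        def run(session):
            df = session.table("raw_events").dropDuplicates(["event_id"])
            df.write.mode("overwrite").saveAsTable("clean_events")
        $$
    """)

    # Invoke it later, possibly from another session or user.
    spark.sql("CALL my_catalog.etl.clean_events()")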

*SPIP*:
https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
*JIRA*: https://issues.apache.org/jira/browse/SPARK-45023

Looking forward to your feedback!

Thanks,
Allison

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Mich Talebzadeh <mi...@gmail.com>.
These are my initial thoughts:

As usual, your mileage may vary. Depending on the use case, introducing
support for stored procedures (SPs) in Spark SQL with Python as the
procedural language has both advantages and drawbacks.

*Pros*

   - Can potentially provide more flexibility and capability in SQL
   workflows. Python code can be integrated seamlessly with SQL, enabling
   a wider range of tasks to be performed directly within Spark SQL.
   - SPs, as usual, encourage more modular and reusable code. Users can
   build their own libraries of stored procedures, which are compiled once
   and reused thereafter (a current-practice sketch follows after these
   lists).
   - With SPs, one can potentially perform advanced analytics in Spark SQL
   through Python packages.
   - Restricted access and enhanced security: sensitive code can be hidden
   inside an SP and made accessible only through it.
   - Build your own catalog of procedures and enhance it over time.

*Cons*

   - Performance implications due to the need to serialize and deserialize
   data between Spark and Python, especially for large datasets.
   - Additional resource utilisation.
   - Error handling will require more thought.
   - Compatibility across different versions of Spark and Python libraries.
   - Client-side and server-side Python compatibility.
   - If the underlying table schema changes, the SP will often be
   invalidated and have to be recompiled.
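
To illustrate the reuse point with today's tooling: the closest current
equivalent is to keep the logic in a shared Python function and register it
in every session. The sketch below uses only the existing
spark.udf.register API; the function, table, and column names are made up
for illustration.

    # Current practice (not a stored procedure): shared logic lives in a
    # Python module and must be re-registered in every Spark session.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    def normalise_country(code: str) -> str:
        # Hypothetical shared routine reused by several workflows today.
        return (code or "").strip().upper()

    spark = SparkSession.builder.getOrCreate()
    spark.udf.register("normalise_country", normalise_country, StringType())

    # Each row crosses the JVM/Python boundary, which is the serialization
    # overhead mentioned in the cons above.
    spark.sql("SELECT normalise_country(country) FROM customers").show()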

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Thu, 31 Aug 2023 at 09:45, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Thanks Allison!
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 31 Aug 2023 at 01:26, Allison Wang <al...@databricks.com>
> wrote:
>
>> Hi Mich,
>>
>> I've updated the permissions on the document. Please feel free to leave
>> comments.
>> Thanks,
>> Allison
>>
>> On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Great. Please allow edit access on SPIP or ability to comment.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 30 Aug 2023 at 23:29, Allison Wang
>>> <al...@databricks.com.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to start a discussion on “Python Stored Procedures".
>>>>
>>>> This proposal aims to extend Spark SQL by introducing support for
>>>> stored procedures, starting with Python as the procedural language. This
>>>> will enable users to run complex logic using Python within their SQL
>>>> workflows and save these routines in catalogs like HMS for future use.
>>>>
>>>> *SPIP*:
>>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>>>
>>>> Looking forward to your feedback!
>>>>
>>>> Thanks,
>>>> Allison
>>>>
>>>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Allison!

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 31 Aug 2023 at 01:26, Allison Wang <al...@databricks.com>
wrote:

> Hi Mich,
>
> I've updated the permissions on the document. Please feel free to leave
> comments.
> Thanks,
> Allison
>
> On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Great. Please allow edit access on SPIP or ability to comment.
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 30 Aug 2023 at 23:29, Allison Wang
>> <al...@databricks.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I would like to start a discussion on “Python Stored Procedures".
>>>
>>> This proposal aims to extend Spark SQL by introducing support for stored
>>> procedures, starting with Python as the procedural language. This will
>>> enable users to run complex logic using Python within their SQL workflows
>>> and save these routines in catalogs like HMS for future use.
>>>
>>> *SPIP*:
>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>>
>>> Looking forward to your feedback!
>>>
>>> Thanks,
>>> Allison
>>>
>>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Allison Wang <al...@databricks.com.INVALID>.
Hi Mich,

I've updated the permissions on the document. Please feel free to leave
comments.
Thanks,
Allison

On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> Great. Please allow edit access on SPIP or ability to comment.
>
> Thanks
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 30 Aug 2023 at 23:29, Allison Wang
> <al...@databricks.com.invalid> wrote:
>
>> Hi all,
>>
>> I would like to start a discussion on “Python Stored Procedures".
>>
>> This proposal aims to extend Spark SQL by introducing support for stored
>> procedures, starting with Python as the procedural language. This will
>> enable users to run complex logic using Python within their SQL workflows
>> and save these routines in catalogs like HMS for future use.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Allison
>>
>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Hyukjin Kwon <gu...@apache.org>.
+1, we should have this. A lot of other projects and DBMSes have this too,
and we currently don't have a way to handle them within Apache Spark.

Disclaimer: I am the shepherd of this SPIP.

On Thu, 31 Aug 2023 at 09:31, Allison Wang
<al...@databricks.com.invalid> wrote:

> Hi Mich,
>
> I've updated the permissions on the document. Please feel free to leave
> comments.
> Thanks,
> Allison
>
> On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Great. Please allow edit access on SPIP or ability to comment.
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 30 Aug 2023 at 23:29, Allison Wang
>> <al...@databricks.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I would like to start a discussion on “Python Stored Procedures".
>>>
>>> This proposal aims to extend Spark SQL by introducing support for stored
>>> procedures, starting with Python as the procedural language. This will
>>> enable users to run complex logic using Python within their SQL workflows
>>> and save these routines in catalogs like HMS for future use.
>>>
>>> *SPIP*:
>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>>
>>> Looking forward to your feedback!
>>>
>>> Thanks,
>>> Allison
>>>
>>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Allison Wang <al...@databricks.com.INVALID>.
Hi Mich,

I've updated the permissions on the document. Please feel free to leave
comments.
Thanks,
Allison

On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> Great. Please allow edit access on SPIP or ability to comment.
>
> Thanks
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 30 Aug 2023 at 23:29, Allison Wang
> <al...@databricks.com.invalid> wrote:
>
>> Hi all,
>>
>> I would like to start a discussion on “Python Stored Procedures".
>>
>> This proposal aims to extend Spark SQL by introducing support for stored
>> procedures, starting with Python as the procedural language. This will
>> enable users to run complex logic using Python within their SQL workflows
>> and save these routines in catalogs like HMS for future use.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Allison
>>
>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Great. Please allow edit access on SPIP or ability to comment.

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 30 Aug 2023 at 23:29, Allison Wang
<al...@databricks.com.invalid> wrote:

> Hi all,
>
> I would like to start a discussion on “Python Stored Procedures".
>
> This proposal aims to extend Spark SQL by introducing support for stored
> procedures, starting with Python as the procedural language. This will
> enable users to run complex logic using Python within their SQL workflows
> and save these routines in catalogs like HMS for future use.
>
> *SPIP*:
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>
> Looking forward to your feedback!
>
> Thanks,
> Allison
>
>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Allison for your explanation.

   1. As a matter of interest, what does "sessionCatalog.resolveProcedure"
   do? Does it recompile the stored procedure (SP)?
   2. If the SP references an underlying table and that table's schema
   changes, then by definition the SP's compiled plan will be invalidated.
   3. When using sessionCatalog.createProcedure, we should add optional
   syntax for creating the SP with a recompile option, to allow a new
   execution plan to be generated that reflects the current state of the
   metadata.
   4. Since an SP is compiled once and used many times, we ought to provide
   an API to recompile existing SPs.


Regards,

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 6 Sept 2023 at 00:38, Allison Wang <al...@databricks.com>
wrote:

> Hi Mich,
>
> Thank you for your comments! I've left some comments on the SPIP, but
> let's continue the discussion here.
>
> You've highlighted the potential advantages of Python stored procedures,
> and I'd like to emphasize two important aspects:
>
>    1. *Versatility*: Integrating Python into SQL provides remarkable
>    versatility to the SQL workflow. By leveraging Spark Connect, it's even
>    possible to execute Spark queries within a Python stored procedure.
>    2. *Reusability*: Stored procedures, once saved in the catalog (e.g.,
>    HMS), can be reused across various users and sessions.
>
> This initiative will also pave the way for supporting other procedural
> languages in the future.
>
> Regarding the cons you mentioned, I'd like to shed some light on the
> potential implementation of Pyhton stored procedures. The plan is to
> leverage the existing Python UDF implementation. I.e the Python stored
> procedural logic will be executed inside a Python worker. As @Sean Owen
> mentioned, many of the challenges are shared with the current way of
> executing Python logic in Spark, whether for UDFs/UDTFs or Python stored
> procedures. We should think more about them, esp regarding error handling.
>
> For storage options, regardless of the chosen storage solution, we need to
> expose these APIs for stored procedures to integrate with Spark:
>
>    - sessionCatalog.createProcedure:  create a new stored procedure
>    - sessionCatalog.dropProcedure: drop a stored procedure
>    - sessionCatalog.resolveProcedure: resolve a stored procedure given
>    the identifier
>
> Stored procedures are similar to functions, and we can leverage HMS
> function interface to support storing stored procedures (by serializing
> them into strings and placing them into the resource field of the
> CatalogFunction). We could also make these APIs compatible with other
> storage systems in the future, whether they are 3rd party or native storage
> solutions, but for the short term, HMS remains a decent option.
>
> I'd appreciate your thoughts on this, and I am more than willing to delve
> deeper or clarify any aspect :)
>
> Thanks,
> Allison
>
> On Sat, Sep 2, 2023 at 8:27 AM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>>
>> I have noticed an worthy discussion in the SPIP comments regarding the
>> definition of "stored procedure" in the context of Spark, and I believe it
>> is an important point to address.
>>
>> To provide some historical context, Sybase
>> <https://www.referenceforbusiness.com/history2/49/Sybase-Inc.html>, a
>> relational database vendor (which later co-licensed their code to Microsoft
>> for SQL Server), introduced the concept of stored procedures while
>> positioning themselves as a client-server company. During this period, they
>> were in competition with Oracle, particularly in the realm of front-office
>> trading systems. The introduction of stored procedures, stored on the
>> server-side within the database, allowed Sybase to modularize frequently
>> used code. This move significantly reduced network overhead and latency.
>> Stored procedures were first introduced in the mid-1980s and proved to be a
>> profitable innovation. It is important to note that they had a robust
>> database to rely on during this process.
>>
>> Now, as we contemplate the implementation of stored procedures in Spark,
>> we must think strategically about where these procedures will be stored and
>> how they will be reused. Some colleagues have suggested using HMS (Derby)
>> by default, but it is worth noting that HMS is inherently single-threaded.
>> If we intend to leverage stored procedures extensively, Should we consider
>> establishing "a native" storage solution? This approach not only aligns
>> with good architectural practices but also has the potential for broader
>> applications beyond Spark. While empowering users to choose their preferred
>> database for this purpose might sound appealing, it may not be the most
>> realistic or practical approach. This discussion highlights the importance
>> of clarifying terminologies and establishing a solid foundation for this
>> feature.
>>
>> HTH
>>
>> Mich Talebzadeh,
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>>
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 31 Aug 2023 at 18:19, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>>> I concur with the view point raised by @Sean Owen
>>>
>>> While this might introduce some challenges related to compatibility and
>>> environment issues, it is not fundamentally different from how the users
>>> currently import and use common code in Python. The main difference is that
>>> now this shared code would be stored as stored procedures in the catalog of
>>> user choice -> probably Hive Metastore
>>>
>>> HTH
>>>
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising from
>>> such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 31 Aug 2023 at 16:41, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> I think you're talking past Hyukjin here.
>>>>
>>>> I think the response is: none of that is managed by Pyspark now, and
>>>> this proposal does not change that. Your current interpreter and
>>>> environment is used to execute the stored procedure, which is just Python
>>>> code. It's on you to bring an environment that runs the code correctly.
>>>> This is just the same as how running any python code works now.
>>>>
>>>> I think you have exactly the same problems with UDFs now, and that's
>>>> all a real problem, just not something Spark has ever tried to solve for
>>>> you. Think of this as exactly like: I have a bit of python code I import as
>>>> a function and share across many python workloads. Just, now that chunk is
>>>> stored as a 'stored procedure'.
>>>>
>>>> I agree this raises the same problem in new ways - now, you are storing
>>>> and sharing a chunk of code across many workloads. There is more potential
>>>> for compatibility and environment problems, as all of that is simply punted
>>>> to the end workloads. But, it's not different from importing common code
>>>> and the world doesn't fall apart.
>>>>
>>>> On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin <kx...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>> Which Python version will run that stored procedure?
>>>>>>
>>>>>> All Python versions supported in PySpark
>>>>>>
>>>>>
>>>>> Where in stored procedure defines the exact python version which will
>>>>> run the code? That was the question.
>>>>>
>>>>>
>>>>>> How to manage external dependencies?
>>>>>>
>>>>>> Existing way we have
>>>>>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>>>>>> .
>>>>>> In fact, this will use the external dependencies within your Python
>>>>>> interpreter so you can use all existing conda or venvs.
>>>>>>
>>>>> Current proposal solves this issue nohow (the stored code doesn't
>>>>> provide any manifest about its dependencies and what is required to run
>>>>> it). So feels like it's better to stay with UDF since they are under
>>>>> control and their behaviour is predictable. Did I miss something?
>>>>>
>>>>> How to test it via a common CI process?
>>>>>>
>>>>>> Existing way of PySpark unittests, see
>>>>>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>>>>>
>>>>> Sorry, but this wouldn't work since stored procedure thing requires
>>>>> some specific definition and this code will not be stored as regular python
>>>>> code. Do you have any examples how to test stored python procedures as a
>>>>> unit e.g. without spark?
>>>>>
>>>>> How to manage versions and do upgrades? Migrations?
>>>>>>
>>>>>> This is a new feature so no migration is needed. We will keep the
>>>>>> compatibility according to the sember we follow.
>>>>>>
>>>>> Question was not about spark, but about stored procedures itself. Any
>>>>> guidelines which will not copy flaws of other systems?
>>>>>
>>>>> Current Python UDF solution handles these problems in a good way since
>>>>>> they delegate them to project level.
>>>>>>
>>>>>> Current UDF solution cannot handle stored procedures because UDF is
>>>>>> on the worker side. This is Driver side.
>>>>>>
>>>>> How so? Currently it works and we never faced such issue. May be you
>>>>> should have the same Python code also on the driver side? But such trivial
>>>>> idea doesn't require new feature on Spark since you already have to ship
>>>>> that code somehow.
>>>>>
>>>>> --
>>>>> ,,,^..^,,,
>>>>>
>>>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Allison Wang <al...@databricks.com.INVALID>.
Hi Mich,

Thank you for your comments! I've left some comments on the SPIP, but let's
continue the discussion here.

You've highlighted the potential advantages of Python stored procedures,
and I'd like to emphasize two important aspects:

   1. *Versatility*: Integrating Python into SQL provides remarkable
   versatility to the SQL workflow. By leveraging Spark Connect, it's even
   possible to execute Spark queries within a Python stored procedure (see
   the sketch below).
   2. *Reusability*: Stored procedures, once saved in the catalog (e.g.,
   HMS), can be reused across various users and sessions.

This initiative will also pave the way for supporting other procedural
languages in the future.
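
On the versatility point, the rough shape such a procedure body could take
with Spark Connect is sketched below. Only the SparkSession.builder.remote
pattern is existing PySpark API; the procedure wrapper, table names, and
connect URL are illustrative assumptions.

    # Existing Spark Connect usage: Python code obtains a remote
    # SparkSession and runs queries. A stored procedure body could follow
    # the same pattern (the surrounding procedure machinery is hypothetical
    # and not shown).
    from pyspark.sql import SparkSession

    def refresh_daily_summary(connect_url: str) -> None:
        # Hypothetical procedure body; connect_url points at the cluster's
        # Spark Connect endpoint.
        spark = SparkSession.builder.remote(connect_url).getOrCreate()
        spark.sql("""
            INSERT OVERWRITE TABLE daily_summary
            SELECT dt, count(*) AS events FROM raw_events GROUP BY dt
        """)

    refresh_daily_summary("sc://localhost:15002")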

Regarding the cons you mentioned, I'd like to shed some light on the
potential implementation of Python stored procedures. The plan is to
leverage the existing Python UDF implementation, i.e., the Python stored
procedure logic will be executed inside a Python worker. As @Sean Owen
mentioned, many of the challenges are shared with the current way of
executing Python logic in Spark, whether for UDFs/UDTFs or Python stored
procedures. We should think more about them, especially regarding error
handling.

For storage options, regardless of the chosen storage solution, we need to
expose these APIs for stored procedures to integrate with Spark:

   - sessionCatalog.createProcedure:  create a new stored procedure
   - sessionCatalog.dropProcedure: drop a stored procedure
   - sessionCatalog.resolveProcedure: resolve a stored procedure given the
   identifier

Stored procedures are similar to functions, and we can leverage the HMS
function interface to store them (by serializing them into strings and
placing them in the resource field of the CatalogFunction). We could also
make these APIs compatible with other storage systems in the future,
whether third-party or native storage solutions, but for the short term,
HMS remains a decent option.
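
As a rough illustration of the "serialize into strings" idea only (this is
not the SPIP's actual design, and the real catalog plumbing lives on the
JVM side), a Python definition can be pickled into a string of the kind
that could sit in a catalog resource field:

    # Illustration of serializing a procedure body to a string; the
    # standalone cloudpickle package is used here (PySpark vendors an
    # equivalent to ship Python functions).
    import base64
    import cloudpickle

    def my_procedure(session):
        session.sql("REFRESH TABLE clean_events")

    # Serialize the procedure body into a string suitable for a catalog
    # entry, e.g. a CatalogFunction resource.
    payload = base64.b64encode(cloudpickle.dumps(my_procedure)).decode("ascii")

    # On resolution, the procedure would be deserialized and executed in a
    # Python worker, much like a Python UDF is today.
    restored = cloudpickle.loads(base64.b64decode(payload))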

I'd appreciate your thoughts on this, and I am more than willing to delve
deeper or clarify any aspect :)

Thanks,
Allison

On Sat, Sep 2, 2023 at 8:27 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

>
> I have noticed an worthy discussion in the SPIP comments regarding the
> definition of "stored procedure" in the context of Spark, and I believe it
> is an important point to address.
>
> To provide some historical context, Sybase
> <https://www.referenceforbusiness.com/history2/49/Sybase-Inc.html>, a
> relational database vendor (which later co-licensed their code to Microsoft
> for SQL Server), introduced the concept of stored procedures while
> positioning themselves as a client-server company. During this period, they
> were in competition with Oracle, particularly in the realm of front-office
> trading systems. The introduction of stored procedures, stored on the
> server-side within the database, allowed Sybase to modularize frequently
> used code. This move significantly reduced network overhead and latency.
> Stored procedures were first introduced in the mid-1980s and proved to be a
> profitable innovation. It is important to note that they had a robust
> database to rely on during this process.
>
> Now, as we contemplate the implementation of stored procedures in Spark,
> we must think strategically about where these procedures will be stored and
> how they will be reused. Some colleagues have suggested using HMS (Derby)
> by default, but it is worth noting that HMS is inherently single-threaded.
> If we intend to leverage stored procedures extensively, Should we consider
> establishing "a native" storage solution? This approach not only aligns
> with good architectural practices but also has the potential for broader
> applications beyond Spark. While empowering users to choose their preferred
> database for this purpose might sound appealing, it may not be the most
> realistic or practical approach. This discussion highlights the importance
> of clarifying terminologies and establishing a solid foundation for this
> feature.
>
> HTH
>
> Mich Talebzadeh,
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 31 Aug 2023 at 18:19, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> I concur with the view point raised by @Sean Owen
>>
>> While this might introduce some challenges related to compatibility and
>> environment issues, it is not fundamentally different from how the users
>> currently import and use common code in Python. The main difference is that
>> now this shared code would be stored as stored procedures in the catalog of
>> user choice -> probably Hive Metastore
>>
>> HTH
>>
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>>
>>
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 31 Aug 2023 at 16:41, Sean Owen <sr...@gmail.com> wrote:
>>
>>> I think you're talking past Hyukjin here.
>>>
>>> I think the response is: none of that is managed by Pyspark now, and
>>> this proposal does not change that. Your current interpreter and
>>> environment is used to execute the stored procedure, which is just Python
>>> code. It's on you to bring an environment that runs the code correctly.
>>> This is just the same as how running any python code works now.
>>>
>>> I think you have exactly the same problems with UDFs now, and that's all
>>> a real problem, just not something Spark has ever tried to solve for you.
>>> Think of this as exactly like: I have a bit of python code I import as a
>>> function and share across many python workloads. Just, now that chunk is
>>> stored as a 'stored procedure'.
>>>
>>> I agree this raises the same problem in new ways - now, you are storing
>>> and sharing a chunk of code across many workloads. There is more potential
>>> for compatibility and environment problems, as all of that is simply punted
>>> to the end workloads. But, it's not different from importing common code
>>> and the world doesn't fall apart.
>>>
>>> On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin <kx...@apache.org>
>>> wrote:
>>>
>>>>
>>>> Which Python version will run that stored procedure?
>>>>>
>>>>> All Python versions supported in PySpark
>>>>>
>>>>
>>>> Where in stored procedure defines the exact python version which will
>>>> run the code? That was the question.
>>>>
>>>>
>>>>> How to manage external dependencies?
>>>>>
>>>>> Existing way we have
>>>>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>>>>> .
>>>>> In fact, this will use the external dependencies within your Python
>>>>> interpreter so you can use all existing conda or venvs.
>>>>>
>>>> Current proposal solves this issue nohow (the stored code doesn't
>>>> provide any manifest about its dependencies and what is required to run
>>>> it). So feels like it's better to stay with UDF since they are under
>>>> control and their behaviour is predictable. Did I miss something?
>>>>
>>>> How to test it via a common CI process?
>>>>>
>>>>> Existing way of PySpark unittests, see
>>>>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>>>>
>>>> Sorry, but this wouldn't work since stored procedure thing requires
>>>> some specific definition and this code will not be stored as regular python
>>>> code. Do you have any examples how to test stored python procedures as a
>>>> unit e.g. without spark?
>>>>
>>>> How to manage versions and do upgrades? Migrations?
>>>>>
>>>>> This is a new feature so no migration is needed. We will keep the
>>>>> compatibility according to the sember we follow.
>>>>>
>>>> Question was not about spark, but about stored procedures itself. Any
>>>> guidelines which will not copy flaws of other systems?
>>>>
>>>> Current Python UDF solution handles these problems in a good way since
>>>>> they delegate them to project level.
>>>>>
>>>>> Current UDF solution cannot handle stored procedures because UDF is on
>>>>> the worker side. This is Driver side.
>>>>>
>>>> How so? Currently it works and we never faced such issue. May be you
>>>> should have the same Python code also on the driver side? But such trivial
>>>> idea doesn't require new feature on Spark since you already have to ship
>>>> that code somehow.
>>>>
>>>> --
>>>> ,,,^..^,,,
>>>>
>>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Mich Talebzadeh <mi...@gmail.com>.
I have noticed a worthy discussion in the SPIP comments regarding the
definition of "stored procedure" in the context of Spark, and I believe it
is an important point to address.

To provide some historical context, Sybase
<https://www.referenceforbusiness.com/history2/49/Sybase-Inc.html>, a
relational database vendor (which later co-licensed their code to Microsoft
for SQL Server), introduced the concept of stored procedures while
positioning themselves as a client-server company. During this period, they
were in competition with Oracle, particularly in the realm of front-office
trading systems. The introduction of stored procedures, stored on the
server-side within the database, allowed Sybase to modularize frequently
used code. This move significantly reduced network overhead and latency.
Stored procedures were first introduced in the mid-1980s and proved to be a
profitable innovation. It is important to note that they had a robust
database to rely on during this process.

Now, as we contemplate the implementation of stored procedures in Spark, we
must think strategically about where these procedures will be stored and
how they will be reused. Some colleagues have suggested using HMS (Derby)
by default, but it is worth noting that HMS is inherently single-threaded.
If we intend to leverage stored procedures extensively, should we consider
establishing "a native" storage solution? This approach not only aligns
with good architectural practices but also has the potential for broader
applications beyond Spark. While empowering users to choose their preferred
database for this purpose might sound appealing, it may not be the most
realistic or practical approach. This discussion highlights the importance
of clarifying terminologies and establishing a solid foundation for this
feature.

HTH

Mich Talebzadeh,

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Thu, 31 Aug 2023 at 18:19, Mich Talebzadeh <mi...@gmail.com>
wrote:

> I concur with the view point raised by @Sean Owen
>
> While this might introduce some challenges related to compatibility and
> environment issues, it is not fundamentally different from how the users
> currently import and use common code in Python. The main difference is that
> now this shared code would be stored as stored procedures in the catalog of
> user choice -> probably Hive Metastore
>
> HTH
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 31 Aug 2023 at 16:41, Sean Owen <sr...@gmail.com> wrote:
>
>> I think you're talking past Hyukjin here.
>>
>> I think the response is: none of that is managed by Pyspark now, and this
>> proposal does not change that. Your current interpreter and environment is
>> used to execute the stored procedure, which is just Python code. It's on
>> you to bring an environment that runs the code correctly. This is just the
>> same as how running any python code works now.
>>
>> I think you have exactly the same problems with UDFs now, and that's all
>> a real problem, just not something Spark has ever tried to solve for you.
>> Think of this as exactly like: I have a bit of python code I import as a
>> function and share across many python workloads. Just, now that chunk is
>> stored as a 'stored procedure'.
>>
>> I agree this raises the same problem in new ways - now, you are storing
>> and sharing a chunk of code across many workloads. There is more potential
>> for compatibility and environment problems, as all of that is simply punted
>> to the end workloads. But, it's not different from importing common code
>> and the world doesn't fall apart.
>>
>> On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin <kx...@apache.org>
>> wrote:
>>
>>>
>>> Which Python version will run that stored procedure?
>>>>
>>>> All Python versions supported in PySpark
>>>>
>>>
>>> Where in stored procedure defines the exact python version which will
>>> run the code? That was the question.
>>>
>>>
>>>> How to manage external dependencies?
>>>>
>>>> Existing way we have
>>>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>>>> .
>>>> In fact, this will use the external dependencies within your Python
>>>> interpreter so you can use all existing conda or venvs.
>>>>
>>> Current proposal solves this issue nohow (the stored code doesn't
>>> provide any manifest about its dependencies and what is required to run
>>> it). So feels like it's better to stay with UDF since they are under
>>> control and their behaviour is predictable. Did I miss something?
>>>
>>> How to test it via a common CI process?
>>>>
>>>> Existing way of PySpark unittests, see
>>>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>>>
>>> Sorry, but this wouldn't work since stored procedure thing requires some
>>> specific definition and this code will not be stored as regular python
>>> code. Do you have any examples how to test stored python procedures as a
>>> unit e.g. without spark?
>>>
>>> How to manage versions and do upgrades? Migrations?
>>>>
>>>> This is a new feature so no migration is needed. We will keep the
>>>> compatibility according to the sember we follow.
>>>>
>>> Question was not about spark, but about stored procedures itself. Any
>>> guidelines which will not copy flaws of other systems?
>>>
>>> Current Python UDF solution handles these problems in a good way since
>>>> they delegate them to project level.
>>>>
>>>> Current UDF solution cannot handle stored procedures because UDF is on
>>>> the worker side. This is Driver side.
>>>>
>>> How so? Currently it works and we never faced such issue. May be you
>>> should have the same Python code also on the driver side? But such trivial
>>> idea doesn't require new feature on Spark since you already have to ship
>>> that code somehow.
>>>
>>> --
>>> ,,,^..^,,,
>>>
>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Mich Talebzadeh <mi...@gmail.com>.
I concur with the viewpoint raised by @Sean Owen

While this might introduce some challenges related to compatibility and
environment issues, it is not fundamentally different from how users
currently import and use common code in Python. The main difference is that
this shared code would now be stored as stored procedures in the catalog of
the user's choice, probably the Hive Metastore.

HTH







Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 31 Aug 2023 at 16:41, Sean Owen <sr...@gmail.com> wrote:

> I think you're talking past Hyukjin here.
>
> I think the response is: none of that is managed by Pyspark now, and this
> proposal does not change that. Your current interpreter and environment is
> used to execute the stored procedure, which is just Python code. It's on
> you to bring an environment that runs the code correctly. This is just the
> same as how running any python code works now.
>
> I think you have exactly the same problems with UDFs now, and that's all a
> real problem, just not something Spark has ever tried to solve for you.
> Think of this as exactly like: I have a bit of python code I import as a
> function and share across many python workloads. Just, now that chunk is
> stored as a 'stored procedure'.
>
> I agree this raises the same problem in new ways - now, you are storing
> and sharing a chunk of code across many workloads. There is more potential
> for compatibility and environment problems, as all of that is simply punted
> to the end workloads. But, it's not different from importing common code
> and the world doesn't fall apart.
>
> On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin <kx...@apache.org>
> wrote:
>
>>
>> Which Python version will run that stored procedure?
>>>
>>> All Python versions supported in PySpark
>>>
>>
>> Where in stored procedure defines the exact python version which will run
>> the code? That was the question.
>>
>>
>>> How to manage external dependencies?
>>>
>>> Existing way we have
>>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>>> .
>>> In fact, this will use the external dependencies within your Python
>>> interpreter so you can use all existing conda or venvs.
>>>
>> Current proposal solves this issue nohow (the stored code doesn't provide
>> any manifest about its dependencies and what is required to run it). So
>> feels like it's better to stay with UDF since they are under control and
>> their behaviour is predictable. Did I miss something?
>>
>> How to test it via a common CI process?
>>>
>>> Existing way of PySpark unittests, see
>>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>>
>> Sorry, but this wouldn't work since stored procedure thing requires some
>> specific definition and this code will not be stored as regular python
>> code. Do you have any examples how to test stored python procedures as a
>> unit e.g. without spark?
>>
>> How to manage versions and do upgrades? Migrations?
>>>
>>> This is a new feature so no migration is needed. We will keep the
>>> compatibility according to the sember we follow.
>>>
>> Question was not about spark, but about stored procedures itself. Any
>> guidelines which will not copy flaws of other systems?
>>
>> Current Python UDF solution handles these problems in a good way since
>>> they delegate them to project level.
>>>
>>> Current UDF solution cannot handle stored procedures because UDF is on
>>> the worker side. This is Driver side.
>>>
>> How so? Currently it works and we never faced such issue. May be you
>> should have the same Python code also on the driver side? But such trivial
>> idea doesn't require new feature on Spark since you already have to ship
>> that code somehow.
>>
>> --
>> ,,,^..^,,,
>>
>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Sean Owen <sr...@gmail.com>.
I think you're talking past Hyukjin here.

I think the response is: none of that is managed by Pyspark now, and this
proposal does not change that. Your current interpreter and environment is
used to execute the stored procedure, which is just Python code. It's on
you to bring an environment that runs the code correctly. This is just the
same as how running any python code works now.

I think you have exactly the same problems with UDFs now, and that's all a
real problem, just not something Spark has ever tried to solve for you.
Think of this as exactly like: I have a bit of python code I import as a
function and share across many python workloads. Just, now that chunk is
stored as a 'stored procedure'.

I agree this raises the same problem in new ways - now, you are storing and
sharing a chunk of code across many workloads. There is more potential for
compatibility and environment problems, as all of that is simply punted to
the end workloads. But, it's not different from importing common code and
the world doesn't fall apart.
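
Concretely, the current pattern described above looks like the sketch
below: shared Python code imported by several PySpark jobs, with the
interpreter and environment left to whoever runs the job. The module and
function names are made up for illustration.

    # shared_etl.py -- a bit of Python code imported and reused across many
    # PySpark workloads today (illustrative names only).
    from pyspark.sql import DataFrame

    def deduplicate_events(df: DataFrame) -> DataFrame:
        # Shared routine used by several jobs.
        return df.dropDuplicates(["event_id"])

    # In each job (job_a.py, job_b.py, ...), the function is imported and
    # applied; the environment is whatever that job brings:
    #
    #     from shared_etl import deduplicate_events
    #     cleaned = deduplicate_events(spark.table("raw_events"))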

On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin <kx...@apache.org> wrote:

>
> Which Python version will run that stored procedure?
>>
>> All Python versions supported in PySpark
>>
>
> Where in stored procedure defines the exact python version which will run
> the code? That was the question.
>
>
>> How to manage external dependencies?
>>
>> Existing way we have
>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>> .
>> In fact, this will use the external dependencies within your Python
>> interpreter so you can use all existing conda or venvs.
>>
> Current proposal solves this issue nohow (the stored code doesn't provide
> any manifest about its dependencies and what is required to run it). So
> feels like it's better to stay with UDF since they are under control and
> their behaviour is predictable. Did I miss something?
>
> How to test it via a common CI process?
>>
>> Existing way of PySpark unittests, see
>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>
> Sorry, but this wouldn't work since stored procedure thing requires some
> specific definition and this code will not be stored as regular python
> code. Do you have any examples how to test stored python procedures as a
> unit e.g. without spark?
>
> How to manage versions and do upgrades? Migrations?
>>
>> This is a new feature so no migration is needed. We will keep the
>> compatibility according to the sember we follow.
>>
> Question was not about spark, but about stored procedures itself. Any
> guidelines which will not copy flaws of other systems?
>
> Current Python UDF solution handles these problems in a good way since
>> they delegate them to project level.
>>
>> Current UDF solution cannot handle stored procedures because UDF is on
>> the worker side. This is Driver side.
>>
> How so? Currently it works and we never faced such issue. May be you
> should have the same Python code also on the driver side? But such trivial
> idea doesn't require new feature on Spark since you already have to ship
> that code somehow.
>
> --
> ,,,^..^,,,
>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Alexander Shorin <kx...@apache.org>.
> Which Python version will run that stored procedure?
>
> All Python versions supported in PySpark
>

Where in the stored procedure is the exact Python version that will run
the code defined? That was the question.


> How to manage external dependencies?
>
> Existing way we have
> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
> .
> In fact, this will use the external dependencies within your Python
> interpreter so you can use all existing conda or venvs.
>
The current proposal does not solve this issue at all (the stored code
doesn't provide any manifest describing its dependencies or what is
required to run it). So it feels like it's better to stay with UDFs, since
they are under the project's control and their behaviour is predictable.
Did I miss something?

> How to test it via a common CI process?
>
> Existing way of PySpark unittests, see
> https://github.com/apache/spark/tree/master/python/pyspark/tests
>
Sorry, but this wouldn't work, since a stored procedure requires some
specific definition and this code will not be stored as regular Python
code. Do you have any examples of how to test stored Python procedures as
a unit, e.g. without Spark?

> How to manage versions and do upgrades? Migrations?
>
> This is a new feature so no migration is needed. We will keep the
> compatibility according to the sember we follow.
>
The question was not about Spark, but about stored procedures themselves.
Are there any guidelines that will not copy the flaws of other systems?

> Current Python UDF solution handles these problems in a good way since they
> delegate them to project level.
>
> Current UDF solution cannot handle stored procedures because UDF is on the
> worker side. This is Driver side.
>
How so? Currently it works and we have never faced such an issue. Maybe
you should have the same Python code on the driver side as well? But such
a trivial idea doesn't require a new feature in Spark, since you already
have to ship that code somehow.

--
,,,^..^,,,

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Hyukjin Kwon <gu...@apache.org>.
> Which Python version will run that stored procedure?

All Python versions supported in PySpark.

> How to manage external dependencies?

The existing way we have:
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
In fact, this will use the external dependencies within your Python
interpreter, so you can use all existing conda or venv environments.
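
For example, the packaging guide's conda-pack approach ships a
self-contained environment with the application, and a procedure's
dependencies would ride along in that same interpreter (the archive name
below is just an example):

    # Ship a conda environment with the application, per the PySpark
    # packaging guide; Python code, including procedure logic, runs in
    # this interpreter.
    # Beforehand, on the client: conda pack -f -o pyspark_conda_env.tar.gz
    import os
    from pyspark.sql import SparkSession

    os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
    spark = (
        SparkSession.builder
        .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
        .getOrCreate()
    )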

> How to test it via a common CI process?

The existing way of PySpark unit tests; see
https://github.com/apache/spark/tree/master/python/pyspark/tests
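
And since the procedure body is ordinary Python, its pure-Python parts can
also be unit tested without a cluster; a minimal sketch with illustrative
names:

    # Plain unittest for the Python logic that would back a procedure; no
    # Spark cluster is needed for the pure-Python parts.
    import unittest

    def normalise_country(code: str) -> str:
        return (code or "").strip().upper()

    class NormaliseCountryTest(unittest.TestCase):
        def test_strips_and_uppercases(self):
            self.assertEqual(normalise_country(" gb "), "GB")

        def test_handles_none(self):
            self.assertEqual(normalise_country(None), "")

    if __name__ == "__main__":
        unittest.main()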

> How to manage versions and do upgrades? Migrations?

This is a new feature, so no migration is needed. We will keep
compatibility according to the semver we follow.

> Current Python UDF solution handles these problems in a good way since
> they delegate them to project level.

The current UDF solution cannot handle stored procedures, because a UDF
runs on the worker side whereas this runs on the driver side.

In my opinion, the concerns raised here look orthogonal to the stored
procedure feature itself.
Let me know if this does not address your concern.

On Thu, 31 Aug 2023 at 12:49, Alexander Shorin <kx...@apache.org> wrote:

> -1
>
> Great idea to ignore the experience of others and copy bad practices back
> for nothing.
>
> If you are familiar with Python ecosystem then you should answer the
> questions:
> 1. Which Python version will run that stored procedure?
> 2. How to manage external dependencies?
> 3. How to test it via a common CI process?
> 4. How to manage versions and do upgrades? Migrations?
>
> Current Python UDF solution handles these problems in a good way since
> they delegate them to project level.
>
> --
> ,,,^..^,,,
>
>
> On Thu, Aug 31, 2023 at 1:29 AM Allison Wang
> <al...@databricks.com.invalid> wrote:
>
>> Hi all,
>>
>> I would like to start a discussion on “Python Stored Procedures".
>>
>> This proposal aims to extend Spark SQL by introducing support for stored
>> procedures, starting with Python as the procedural language. This will
>> enable users to run complex logic using Python within their SQL workflows
>> and save these routines in catalogs like HMS for future use.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Allison
>>
>>

Re: [DISCUSS] SPIP: Python Stored Procedures

Posted by Alexander Shorin <kx...@apache.org>.
-1

Great idea to ignore the experience of others and copy bad practices back
for nothing.

If you are familiar with the Python ecosystem, then you should answer
these questions:
1. Which Python version will run that stored procedure?
2. How to manage external dependencies?
3. How to test it via a common CI process?
4. How to manage versions and do upgrades? Migrations?

The current Python UDF solution handles these problems in a good way,
since it delegates them to the project level.

--
,,,^..^,,,


On Thu, Aug 31, 2023 at 1:29 AM Allison Wang
<al...@databricks.com.invalid> wrote:

> Hi all,
>
> I would like to start a discussion on “Python Stored Procedures".
>
> This proposal aims to extend Spark SQL by introducing support for stored
> procedures, starting with Python as the procedural language. This will
> enable users to run complex logic using Python within their SQL workflows
> and save these routines in catalogs like HMS for future use.
>
> *SPIP*:
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>
> Looking forward to your feedback!
>
> Thanks,
> Allison
>
>