Posted to issues@spark.apache.org by "Alexander Gorokhov (JIRA)" <ji...@apache.org> on 2018/06/20 13:21:00 UTC

[jira] [Commented] (SPARK-17333) Make pyspark interface friendly with static analysis

    [ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518126#comment-16518126 ] 

Alexander Gorokhov commented on SPARK-17333:
--------------------------------------------

Hi everyone

It has been almost a year since the last comment on this issue.

Are there any updates on this? 

The reason I am asking is that I would like to see static typing support in pyspark, and I am ready to implement it and provide a pull request. After some analysis, I think this should be implemented as .pyi stub files, since they are supported both by type checking tools such as mypy and by PyCharm, whereas docstring type annotations are not going to be supported by mypy at all, as Guido van Rossum mentioned on a similar ticket in mypy: [https://github.com/python/mypy/issues/612#issuecomment-223467302] 
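
To give a concrete idea of what I have in mind (a rough sketch only; the exact file layout, the Union types and the chosen signatures below are my assumptions, not a final design), a stub for pyspark.sql.functions could contain entries like:

# functions.pyi -- hypothetical stub placed next to pyspark/sql/functions.py
from typing import Union
from pyspark.sql.column import Column

# Annotations only; the implementations stay in functions.py untouched
def max(col: Union[Column, str]) -> Column: ...
def min(col: Union[Column, str]) -> Column: ...
def col(name: str) -> Column: ...

With such stubs on the type checker's search path (for mypy, e.g. via MYPYPATH or the mypy_path setting), mypy and PyCharm would know that functions like max return a Column, so chained calls get completion and obvious misuse is flagged, without touching the programmatically generated implementations.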

> Make pyspark interface friendly with static analysis
> ----------------------------------------------------
>
>                 Key: SPARK-17333
>                 URL: https://issues.apache.org/jira/browse/SPARK-17333
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Assaf Mendelson
>            Priority: Trivial
>
> Static analysis tools, such as those commonly used by IDEs for auto completion and error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically, such as the max function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning that we chain many actions (e.g. df.filter().groupby().agg()....) and since python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it, we can either change the interface or provide interface enhancements.
> Changing the interface means defining (when possible) all functions directly, i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py and then generating the functions programmatically by using _create_function, creating each function directly:
> def max(col):
>     """
>     docstring
>     """
>     return _create_function("max", "docstring")(col)
> Second, we can add type hints to all functions, either as defined in PEP 484 or using PyCharm's legacy type hinting (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>     """
>     Does a max.
>     :type col: Column
>     :rtype: Column
>     """
> This would provide a wide range of support, as these kinds of hints, while old, are pretty common.
> A second option is to use PEP 484 stub files (pyi files), which use the PEP 3107 annotation syntax, to define the interfaces.
> In this case we might have a functions.pyi file which would contain something like:
> def max(col: Column) -> Column:
>     """
>     Aggregate function: returns the maximum value of the expression in a group.
>     """
>     ...
> This has the advantage of easier-to-understand types and of not touching the code itself (only adding supporting files), but it has the disadvantage of being managed separately (i.e. a greater chance of making a mistake) and of requiring some configuration in the IDE/static analysis tool instead of working out of the box.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org