Posted to dev@spark.apache.org by Maciej Szymkiewicz <ms...@gmail.com> on 2017/05/14 21:44:17 UTC

[PYTHON] PySpark typing hints

Hi everyone,

For the last few months I've been working on static type annotations for
PySpark. For those of you, who are not familiar with the idea, typing
hints have been introduced by PEP 484
(https://www.python.org/dev/peps/pep-0484/) and further extended with
PEP 526 (https://www.python.org/dev/peps/pep-0526/) with the main goal
of providing information required for static analysis. Right now there are a
few tools which support typing hints, including Mypy
(https://github.com/python/mypy) and PyCharm
(https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html). 
Type hints can be added using function annotations
(https://www.python.org/dev/peps/pep-3107/, Python 3 only), docstrings,
or source independent stub files
(https://www.python.org/dev/peps/pep-0484/#stub-files). Typing is
optional, gradual and has no runtime impact.
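
As a minimal sketch (the module and function names here are made up for
illustration, not taken from the actual stubs), the same signature can
be declared inline or in a stub:

    # module.py - inline annotations (Python 3-only syntax):
    from typing import Dict, List

    def word_lengths(words: List[str]) -> Dict[str, int]:
        return {w: len(w) for w in words}

    # module.pyi - the same signature in a stub file; module.py itself
    # stays untouched, so even Python 2 sources can be covered this way:
    from typing import Dict, List

    def word_lengths(words: List[str]) -> Dict[str, int]: ...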

At this moment I've annotated the majority of the API, including most
of pyspark.sql and pyspark.ml. The project is still rough around the
edges, and may produce both false positives and false negatives, but I
think it has become mature enough to be useful in practice.

The current version is compatible only with Python 3, but it is
possible, with some limitations, to backport it to Python 2 (though it
is not on my todo list).

There are a number of possible benefits for PySpark users and developers:

  * Static analysis can detect a number of common mistakes and prevent
    runtime failures. Generic self is still fairly limited, so it is
    more useful with DataFrames, Structured Streaming and ML than with
    RDDs or DStreams.
  * Annotations can be used for documenting complex signatures
    (https://git.io/v95JN), including dependencies between arguments
    and return values (https://git.io/v95JA); see the sketch after
    this list.
  * Detecting possible bugs in Spark itself (SPARK-20631).
  * Showing API inconsistencies.
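
To make the second point above concrete, here is a stub-style sketch of
how @overload can capture a dependency between arguments and the return
type. The function lookup is hypothetical, not part of the actual stubs:

    from typing import Optional, TypeVar, Union, overload

    T = TypeVar("T")

    # Stub-style (.pyi) declarations, so the bodies are omitted.
    # Without a default the result may be None; with a default it is
    # never None, but may have the default's type instead.
    @overload
    def lookup(key: str) -> Optional[str]: ...
    @overload
    def lookup(key: str, default: T) -> Union[str, T]: ...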

Roadmap

  * Update the project to reflect Spark 2.2.
  * Refine existing annotations.

If there is enough interest I am happy to contribute this back to
Spark or submit it to Typeshed (https://github.com/python/typeshed;
this would require formal ASF approval, and since Typeshed doesn't
provide versioning, it is probably not the best option in our case).

Further information:

  * https://github.com/zero323/pyspark-stubs - GitHub repository

  * https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017
    - interesting presentation by Marco Bonzanini

-- 
Best,
Maciej


Re: [PYTHON] PySpark typing hints

Posted by Maciej Szymkiewicz <ms...@gmail.com>.

On 05/23/2017 02:45 PM, Mendelson, Assaf wrote:
>
> You are correct,
>
> I actually did not look too deeply into it until now, as I noticed
> you mentioned it is compatible with Python 3 only, and I saw on
> GitHub that mypy or pytype is required.
>
> Because of that I made my suggestions with Python 2 in mind.
>
> Looking into it more deeply, I am wondering what is not supported?
> Are you talking about limitations in testing?
>

Since type checkers (unlike annotations) are not standardized, this
varies between projects and versions. For MyPy quite a lot has changed
since I started annotating Spark.

A few months ago I wouldn't even have bothered looking at the list of
issues; today (as mentioned in the other message) we could remove the
metaclasses and pass both Python 2 and Python 3 checks.

The other part is the typing module itself, as well as function
annotations (outside docstrings). But this is not a problem with stub
files.
>
> If I understand correctly, then one can use this without any issues
> in PyCharm (and other IDEs supporting type hinting) even when
> developing for Python 2.
>

This strictly depends on the type checker. I didn't follow the
development closely, but I got the impression that a lot changed, for
example, between PyCharm 2016.3 and 2017.1. I think the important
point is that lack of support doesn't break anything.
>
> In addition, the tests can check the existing PySpark; they just
> have to be run with a compatible package (e.g. mypy).
>
> Meaning that porting to Python 2 would provide only a small advantage
> beyond the immediate ones (IDE usage and testing for most cases).
>
> Am I missing something?
>
> Thanks,
>
>               Assaf.
>
> From: Maciej Szymkiewicz [mailto:mszymkiewicz@gmail.com]
> Sent: Tuesday, May 23, 2017 3:27 PM
> To: Mendelson, Assaf
> Subject: Re: [PYTHON] PySpark typing hints
>
> On 05/23/2017 01:12 PM, assaf.mendelson wrote:
>
>     That said, if we make a decision on the way to handle it then I
>     believe it would be a good idea to start even with the bare
>     minimum and continue to add to it (and thereby make it so many
>     people can contribute). The code I added on GitHub was basically
>     the things I needed.
>
> I already have almost full coverage of the API, excluding some exotic
> parts of the legacy streaming, so starting with the bare minimum is
> not really required.
>
>     The advantage of the first is that it is part of the code, which
>     means it is easier to keep updated. The main issue is that
>     supporting auto-generated code (as is the case for most functions)
>     can be a little awkward, and it relates to a separate issue:
>     PyCharm marks most of the functions as errors (i.e.
>     pyspark.sql.functions.XXX is marked as not there…)
>
> Comment-based annotations are not suitable for complex signatures
> with multi-version support.
>
> Also, there is no support for overloading, so it is not possible to
> capture relationships between arguments, or between arguments and the
> return type.
>
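
For reference, the comment syntax mentioned above (PEP 484 type
comments) is sketched below; lookup is a made-up example, not something
from the stubs:

    # Type comments work even in Python 2 sources, but allow only a
    # single signature per function:
    def lookup(key, default=None):
        # type: (str, object) -> object
        ...

    # There is no comment-based counterpart of @overload, so the
    # relationship between `default` and the return type is lost.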

-- 
Maciej Szymkiewicz


RE: [PYTHON] PySpark typing hints

Posted by "assaf.mendelson" <as...@rsa.com>.
Actually there is, at least for PyCharm. I actually opened a JIRA on it (https://issues.apache.org/jira/browse/SPARK-17333). It describes two ways of doing it (I also made a GitHub stub at: https://github.com/assafmendelson/ExamplePysparkAnnotation). Unfortunately, I never found the time to follow through.
That said, if we make a decision on the way to handle it then I believe it would be a good idea to start even with the bare minimum and continue to add to it (and thereby make it so many people can contribute). The code I added on GitHub was basically the things I needed.

To summarize, there are two main ways of doing it (at least in PyCharm):

1. Give the hints as part of the docstring for the function.

2. Create files with the signatures only and mark them for PyCharm to use.

The advantage of the first is that it is part of the code, which means it is easier to keep updated. The main issue is that supporting auto-generated code (as is the case for most functions) can be a little awkward, and it relates to a separate issue: PyCharm marks most of the functions as errors (i.e. pyspark.sql.functions.XXX is marked as not there…)

The advantage of the second is that it is completely separate, so messing around with it cannot harm the main code. The disadvantages are that we would need to maintain it manually, and that to use it in PyCharm one needs to add the files to the path (in PyCharm this means marking them as source).

Lastly, I only tested these two solutions in PyCharm; I am not sure how well they are supported in other IDEs.
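
For illustration, the docstring variant that PyCharm understands uses
reST-style fields; add_months here is a made-up helper, just to show
the shape:

    def add_months(date, months):
        """Return ``date`` shifted by the given number of months.

        :type date: datetime.date
        :type months: int
        :rtype: datetime.date
        """
        ...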


Thanks,
              Assaf.

From: rxin [via Apache Spark Developers List] [mailto:ml+s1001551n21611h30@n3.nabble.com]
Sent: Tuesday, May 23, 2017 1:10 PM
To: Mendelson, Assaf
Subject: Re: [PYTHON] PySpark typing hints

Seems useful to do. Is there a way to do this so it doesn't break Python 2.x?


On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz <[hidden email]> wrote:

> [...]


Re: [PYTHON] PySpark typing hints

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
It doesn't break anything at all. You can take the stub files as-is,
put them into the PySpark root, and as long as users are not interested
in type checking, it won't have any runtime impact.

Surprisingly, the current MyPy build (mypy==0.511) reports only one
incompatibility with Python 2 (dynamic metaclasses), which could be
resolved without significant loss of functionality.
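
As a made-up illustration (the exact diagnostic depends on the mypy
version and on the annotations in the stubs), code like this is run by
plain Python exactly as before, while a checker pointed at the stubs
catches the mistake statically:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)

    # Plain Python would only fail here at runtime; with the stubs on
    # its search path, mypy reports something like:
    #   Argument 1 to "limit" of "DataFrame" has incompatible type
    #   "str"; expected "int"
    df.limit("5")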

On 05/23/2017 12:08 PM, Reynold Xin wrote:
> Seems useful to do. Is there a way to do this so it doesn't break
> Python 2.x?
>
> On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz
> <mszymkiewicz@gmail.com> wrote:
>
>     [...]
>

-- 
Maciej Szymkiewicz


Re: [PYTHON] PySpark typing hints

Posted by Reynold Xin <rx...@databricks.com>.
Seems useful to do. Is there a way to do this so it doesn't break Python
2.x?


On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz
<mszymkiewicz@gmail.com> wrote:

> [...]
>