Posted to issues@spark.apache.org by "Xiao Li (Jira)" <ji...@apache.org> on 2022/11/08 05:47:00 UTC

[jira] [Updated] (SPARK-32082) Project Zen: Improving Python usability

     [ https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-32082:
----------------------------
    Description: 
The importance of Python and PySpark has grown rapidly in the last few years. PySpark downloads reached [more than 1.3 million _every week_|https://pypistats.org/packages/pyspark], counting PyPI _alone_. Nevertheless, PySpark is still not very Pythonic: for example, it exposes many raw JVM error messages, and the API documentation is poorly organized.

This epic ticket aims to improve usability in PySpark and make it more Pythonic. Specifically, this JIRA targets the four areas below, each with examples:
 * Being Pythonic
 ** Pandas UDF enhancements and type hints
 ** Avoid dynamic function definitions, for example in {{functions.py}}, which IDEs cannot detect.
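The dynamic-definition concern above can be sketched in plain Python. This is a hypothetical illustration, not Spark's actual code: functions registered through a dict and {{globals()}} exist only at runtime, so IDEs, autocompletion, and type checkers cannot discover them, while an explicit {{def}} is fully visible to tooling (the direction SPARK-32084 moved toward).

```python
# Hypothetical sketch contrasting dict-driven function creation
# with explicit definitions visible to static tooling.
_name_to_doc = {
    "sqrt": "Computes the square root of the column.",
    "exp": "Computes the exponential of the column.",
}

def _create_function(name: str, doc: str):
    def _(col: str) -> str:
        return f"{name}({col})"  # stand-in for the real JVM call
    _.__name__ = name
    _.__doc__ = doc
    return _

# Dynamic style: `sqrt` and `exp` appear only at runtime, so
# IDEs and type checkers cannot see them.
for _fn_name, _fn_doc in _name_to_doc.items():
    globals()[_fn_name] = _create_function(_fn_name, _fn_doc)

# Explicit style: a real `def` with a docstring and type hints
# is discoverable by autocompletion and static analysis.
def sqrt_explicit(col: str) -> str:
    """Computes the square root of the column."""
    return f"sqrt({col})"
```

Both styles behave identically at runtime; the difference is entirely in what tooling can see before the code runs.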

 * Better and easier usability in PySpark
 ** User-facing error message and warnings
 ** Documentation
 ** User guide
 ** Better examples and API documentation, e.g. [Koalas|https://koalas.readthedocs.io/en/latest/] and [pandas|https://pandas.pydata.org/docs/]

 * Better interoperability with other Python libraries
 ** Visualization and plotting
 ** Potentially better interface by leveraging Arrow
 ** Compatibility with other libraries such as NumPy universal functions or pandas possibly by leveraging Koalas

 * PyPI Installation
 ** PySpark with Hadoop 3 support on PyPI
 ** Better error handling
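The Hadoop-3 distribution mentioned above (SPARK-32017) is selected at install time via an environment variable. A sketch of the mechanism as documented in the PySpark installation guide (variable name and values hedged as an assumption, available since Spark 3.2):

```shell
# Install the PyPI distribution built against Hadoop 3.
PYSPARK_HADOOP_VERSION=3 pip install pyspark

# Or install without bundled Hadoop and point PySpark at an
# existing Hadoop installation instead.
PYSPARK_HADOOP_VERSION=without pip install pyspark
```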

 
||Ticket||Summary||Status||Assignee||
|SPARK-31382|Show a better error message for different python and pip installation mistake|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-31849|Improve Python exception messages to be more Pythonic|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-31851|Redesign PySpark documentation|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32017|Make Pyspark Hadoop 3.2+ Variant available in PyPI|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32084|Replace dictionary-based function definitions to proper functions in functions.py|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32085|Migrate to NumPy documentation style|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32161|Hide JVM traceback for SparkUpgradeException|{color:#006644}RESOLVED{color}|[Pralabh Kumar|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=pralabhkumar]|
|SPARK-32185|User Guide - Monitoring|{color:#006644}RESOLVED{color}|[Abhijeet Prasad|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=a7prasad]|
|SPARK-32195|Standardize warning types and messages|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32204|Binder Integration|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32681|PySpark type hints support|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32686|Un-deprecate inferring DataFrame schema from list of dictionaries|{color:#006644}RESOLVED{color}|[Nicholas Chammas|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nchammas]|
|SPARK-33247|Improve examples and scenarios in docstrings|{color:#006644}RESOLVED{color}|_Unassigned_|
|SPARK-33407|Simplify the exception message from Python UDFs|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-33530|Support --archives option natively|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-34629|Python type hints improvement|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-34849|SPIP: Support pandas API layer on PySpark|{color:#006644}RESOLVED{color}|[Haejoon Lee|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=itholic]|
|SPARK-34885|Port/integrate Koalas documentation into PySpark|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-35337|pandas API on Spark: Separate basic operations into data type based structures|{color:#006644}RESOLVED{color}|[Xinrong Meng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=XinrongM]|
|SPARK-35419|Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-35464|pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes.|{color:#006644}RESOLVED{color}|[Takuya Ueshin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ueshin]|
|SPARK-35805|API auditing in Pandas API on Spark|{color:#006644}RESOLVED{color}|[Haejoon Lee|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=itholic]|


> Project Zen: Improving Python usability
> ---------------------------------------
>
>                 Key: SPARK-32082
>                 URL: https://issues.apache.org/jira/browse/SPARK-32082
>             Project: Spark
>          Issue Type: Epic
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Critical
>             Fix For: 3.4.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org