Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/03 07:45:00 UTC

[jira] [Resolved] (SPARK-29317) Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan

     [ https://issues.apache.org/jira/browse/SPARK-29317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-29317.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25989
[https://github.com/apache/spark/pull/25989]

> Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29317
>                 URL: https://issues.apache.org/jira/browse/SPARK-29317
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In SPARK-27463, some refactoring was done, and two common abstract base classes were introduced:
> 1. {{BaseArrowPythonRunner}}
> Before:
> {code}
> └── BasePythonRunner
>     ├── ArrowPythonRunner
>     ├── CoGroupedArrowPythonRunner
>     ├── PythonRunner
>     └── PythonUDFRunner
> {code}
> After:
> {code}
> BasePythonRunner
> ├── BaseArrowPythonRunner
> │   ├── ArrowPythonRunner
> │   └── CoGroupedArrowPythonRunner
> ├── PythonRunner
> └── PythonUDFRunner
> {code}
> The problem is that the R code path is meant to mirror the Python side:
> {code}
> └── BaseRRunner
>     ├── ArrowRRunner
>     └── RRunner
> {code}
> I would like to keep the hierarchies matched and decouple the rest for now. Ideally, we should deduplicate both code paths. The internal implementations are also intentionally similar.
> 2. {{BasePandasGroupExec}}
> Before:
> {code}
> ├── FlatMapGroupsInPandasExec
> └── FlatMapCoGroupsInPandasExec
> {code}
> After:
> {code}
> └── BasePandasGroupExec
>     ├── FlatMapGroupsInPandasExec
>     └── FlatMapCoGroupsInPandasExec
> {code}
> The problem is that R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs.
> {{FlatMapGroupsInRWithArrowExec}} <> {{FlatMapGroupsInPandasExec}}
> {{MapPartitionsInRWithArrowExec}} <> {{ArrowEvalPythonExec}}
> To prepare for deduplication here as well, it might be better to avoid changing the hierarchy on the Python side alone and instead just decouple it.
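
For illustration only, here is a minimal Python sketch of what "decoupling without an intermediate base class" could look like for the runners described above. The class names are borrowed from the trees in the description, but the method bodies, the `read_arrow_output` helper, and the string payloads are hypothetical stand-ins; this is not Spark's actual (Scala) implementation.

```python
# Toy stand-ins for the class names in the description; not Spark's real code.


def read_arrow_output(payload):
    """Hypothetical shared helper holding the Arrow-specific logic."""
    return f"arrow({payload})"


class BasePythonRunner:
    """Single common base; every runner extends it directly."""

    def run(self, batches):
        return [self.handle(batch) for batch in batches]

    def handle(self, batch):
        raise NotImplementedError


class ArrowPythonRunner(BasePythonRunner):
    def handle(self, batch):
        # Reuses the helper instead of inheriting from an Arrow base class.
        return read_arrow_output(batch)


class CoGroupedArrowPythonRunner(BasePythonRunner):
    def handle(self, batch):
        # A cogrouped batch is modeled here as a (left, right) pair.
        left, right = batch
        return read_arrow_output(f"{left}+{right}")


class PythonUDFRunner(BasePythonRunner):
    def handle(self, batch):
        return f"pickle({batch})"


if __name__ == "__main__":
    print(ArrowPythonRunner().run(["a", "b"]))
    print(CoGroupedArrowPythonRunner().run([("l", "r")]))
    print(PythonUDFRunner().run(["x"]))
```

In this sketch the hierarchy stays flat, as in the "Before" tree, while the shared Arrow logic lives in one place. The same shape would apply to an R-side hierarchy, which is what would make later deduplication easier under these assumptions.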


