Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/03 07:45:00 UTC
[jira] [Resolved] (SPARK-29317) Avoid inheritance hierarchy in
pandas CoGroup arrow runner and its plan
[ https://issues.apache.org/jira/browse/SPARK-29317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-29317.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 25989
[https://github.com/apache/spark/pull/25989]
> Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan
> -----------------------------------------------------------------------
>
> Key: SPARK-29317
> URL: https://issues.apache.org/jira/browse/SPARK-29317
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-27463, some refactoring was done. Two common abstract base classes were introduced:
> 1. {{BaseArrowPythonRunner}}
> Before:
> {code}
> └── BasePythonRunner
>     ├── ArrowPythonRunner
>     ├── CoGroupedArrowPythonRunner
>     ├── PythonRunner
>     └── PythonUDFRunner
> {code}
> After:
> {code}
> BasePythonRunner
> ├── BaseArrowPythonRunner
> │   ├── ArrowPythonRunner
> │   └── CoGroupedArrowPythonRunner
> ├── PythonRunner
> └── PythonUDFRunner
> {code}
> The problem is that the R code path is meant to mirror the Python side, and it has no such intermediate base class:
> {code}
> └── BaseRRunner
>     ├── ArrowRRunner
>     └── RRunner
> {code}
> I would like to keep the two hierarchies matched and decouple the rest for now. Ideally, we should deduplicate both code paths. The internal implementations are also intentionally similar.
> 2. {{BasePandasGroupExec}}
> Before:
> {code}
> ├── FlatMapGroupsInPandasExec
> └── FlatMapCoGroupsInPandasExec
> {code}
> After:
> {code}
> └── BasePandasGroupExec
>     ├── FlatMapGroupsInPandasExec
>     └── FlatMapCoGroupsInPandasExec
> {code}
> The problem is that R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs.
> {{FlatMapGroupsInRWithArrowExec}} <> {{FlatMapGroupsInPandasExec}}
> {{MapPartitionsInRWithArrowExec}} <> {{ArrowEvalPythonExec}}
> To prepare for deduplication here as well, it might be better to avoid changing the hierarchy on the Python side alone, and instead just decouple the shared code.
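
As a rough illustration of the "decouple rather than add a base class" idea described above, here is a minimal sketch in Python. All names in it (BaseRunner, ArrowRunner, CoGroupedArrowRunner, read_arrow_output) are hypothetical stand-ins, not Spark's actual Scala classes, and this is not necessarily how pull request 25989 implements the change. It only shows that shared logic can live in a standalone helper so that both runners stay direct children of the common base, which keeps the hierarchy flat.

```python
from abc import ABC, abstractmethod


class BaseRunner(ABC):
    """Hypothetical stand-in for a common runner base class."""

    @abstractmethod
    def write_input(self, data):
        """Turn the caller's data into a payload for the worker."""

    @abstractmethod
    def read_output(self, payload):
        """Turn the worker's payload back into results."""

    def run(self, data):
        # Template method: write the input, then read the output back.
        return self.read_output(self.write_input(data))


def read_arrow_output(payload):
    """Shared output handling, kept outside the class hierarchy.

    Assumption: tagging each item as a "batch" stands in for whatever
    common Arrow decoding the real runners would share.
    """
    return [("batch", item) for item in payload]


class ArrowRunner(BaseRunner):
    """Direct child of BaseRunner; reuses the helper by delegation."""

    def write_input(self, data):
        return list(data)

    def read_output(self, payload):
        return read_arrow_output(payload)


class CoGroupedArrowRunner(BaseRunner):
    """Also a direct child; differs only in how the input is written."""

    def write_input(self, data):
        left, right = data
        # Pair up the two sides of the cogroup.
        return list(zip(left, right))

    def read_output(self, payload):
        return read_arrow_output(payload)
```

With this shape, adding or removing the shared helper does not change the parent of either runner, so a parallel hierarchy on another code path can keep the same layout.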