Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/07/29 10:24:00 UTC

[jira] [Commented] (SPARK-32479) Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table

    [ https://issues.apache.org/jira/browse/SPARK-32479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167112#comment-17167112 ] 

Apache Spark commented on SPARK-32479:
--------------------------------------

User 'liangz1' has created a pull request for this issue:
https://github.com/apache/spark/pull/29284

> Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-32479
>                 URL: https://issues.apache.org/jira/browse/SPARK-32479
>             Project: Spark
>          Issue Type: Story
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Liang Zhang
>            Assignee: Liang Zhang
>            Priority: Major
>
> h1. Problem:
> In [https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418], the slicing logic may produce fewer partitions than specified.
> h1. Example:
> Assume:
> {noformat}
> length = 100 -> [0, 1, ..., 99]
> num_slices = 99 = self.sparkContext.defaultParallelism{noformat}
> Old method:
> step = math.ceil(length / num_slices) = 2
>  start = i * step, end = (i + 1) * step:
>  output: [0,1] [2,3] [4,5] ... [98,99] -> 50 slices != num_slices
>  
> h1. Solution:
> We can use logic similar to that in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L125]:
> {code:python}
> # replace conversion.py#L418
> # replace conversion.py#L418; use range() since Spark 3.x supports Python 3 only
> pdf_slices = (pdf.iloc[i * length // num_slices: (i + 1) * length // num_slices] for i in range(num_slices))
> {code}
> New method:
>  start = i * length // num_slices, end = (i + 1) * length // num_slices:
>  output: [0] [1] [2] ... [98,99] -> 99 slices
>  
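The two slicing schemes quoted above can be compared with a small standalone sketch. This is a hypothetical illustration using a plain list in place of the pandas dataframe, not the actual PySpark code:

```python
import math

length = 100
num_slices = 99
data = list(range(length))

# Old method: ceil-based step; the data is exhausted after 50 slices,
# so the remaining 49 slices come out empty.
step = math.ceil(length / num_slices)  # step = 2
old_slices = [data[i * step:(i + 1) * step] for i in range(num_slices)]
old_slices = [s for s in old_slices if s]  # keep only non-empty slices
print(len(old_slices))  # 50, fewer than num_slices

# New method (same scheme as ParallelCollectionRDD): integer-division
# endpoints always yield exactly num_slices contiguous slices.
new_slices = [data[i * length // num_slices:(i + 1) * length // num_slices]
              for i in range(num_slices)]
print(len(new_slices))  # 99 == num_slices
```

Note that the new endpoints partition the full range with no gaps or overlaps, since consecutive slices share the boundary `(i + 1) * length // num_slices`.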



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
