Posted to issues@spark.apache.org by "Greg Bowyer (JIRA)" <ji...@apache.org> on 2016/06/10 04:04:20 UTC
[jira] [Updated] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Bowyer updated SPARK-15861:
--------------------------------
Description:
Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is fed a plain subroutine (one that uses `return` rather than `yield`). For instance, let's say we have the following:
{code}
import numpy as np

rows = range(25)
rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
rdd = sc.parallelize(rows)  # sc is an existing SparkContext

def to_np(data):
    return np.array(list(data))

rdd.mapPartitions(to_np).collect()
...
[array([0, 1, 2, 3, 4]),
array([5, 6, 7, 8, 9]),
array([10, 11, 12, 13, 14]),
array([15, 16, 17, 18, 19]),
array([20, 21, 22, 23, 24])]
rdd.mapPartitions(to_np, preservesPartitioning=True).collect()
...
[array([0, 1, 2, 3, 4]),
array([5, 6, 7, 8, 9]),
array([10, 11, 12, 13, 14]),
array([15, 16, 17, 18, 19]),
array([20, 21, 22, 23, 24])]
{code}
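The behaviour above can be modeled without Spark: mapPartitions applies the function to each partition's iterator and then chains the returned iterables into one output stream, so a sequence handed back via `return` is consumed element by element. A minimal sketch (the helper `map_partitions` and the sample data are hypothetical, for illustration only):

```python
from itertools import chain

def map_partitions(partitions, f):
    # Rough model of RDD.mapPartitions: apply f to each partition's
    # iterator, then chain the returned iterables into one stream.
    return list(chain.from_iterable(f(iter(p)) for p in partitions))

partitions = [[0, 1, 2], [3, 4, 5]]

def returns_partition(data):
    # Plain subroutine: the returned list is flattened element-wise,
    # so each input element passes through individually.
    return list(data)

def yields_partition(data):
    # Generator: yields one item per partition, as mapPartitions intends.
    yield list(data)

print(map_partitions(partitions, returns_partition))  # [0, 1, 2, 3, 4, 5]
print(map_partitions(partitions, yields_partition))   # [[0, 1, 2], [3, 4, 5]]
```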
This effectively makes a provided function that returns its result behave as if the end user had called {code}rdd.map{code}. Perhaps a check should be added using {code}inspect.isgeneratorfunction{code}?
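As a sketch of the proposed check (hypothetical, not actual PySpark code), {code}inspect.isgeneratorfunction{code} does distinguish the two styles, though it would miss an ordinary function that merely returns an iterator:

```python
import inspect

def subroutine(it):
    return list(it)            # plain function

def generator(it):
    yield from it              # generator function

def returns_iterator(it):
    # Returns an iterator but is NOT a generator function,
    # so this case would slip past the proposed check.
    return iter(list(it))

print(inspect.isgeneratorfunction(subroutine))        # False
print(inspect.isgeneratorfunction(generator))         # True
print(inspect.isgeneratorfunction(returns_iterator))  # False
```

A more complete check might also inspect the returned value at runtime, but the simple function-level test would already catch the common mistake.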
> pyspark mapPartitions with none generator functions / functors
> --------------------------------------------------------------
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.1
> Reporter: Greg Bowyer
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org