Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:24:06 UTC

[jira] [Updated] (SPARK-6822) lapplyPartition passes empty list to function

     [ https://issues.apache.org/jira/browse/SPARK-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-6822:
--------------------------------
    Labels: bulk-closed  (was: )

> lapplyPartition passes empty list to function
> ---------------------------------------------
>
>                 Key: SPARK-6822
>                 URL: https://issues.apache.org/jira/browse/SPARK-6822
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.4.0
>            Reporter: Shivaram Venkataraman
>            Priority: Major
>              Labels: bulk-closed
>
> I have an RDD containing two elements, as expected and as shown by a collect. When I call lapplyPartition on it with a function that prints its arguments to stderr, the function is called three times: the first two times with the expected arguments, and the third time with an empty list as the argument. I was wondering whether that's a bug or whether there are conditions under which that's possible. I apologize that I don't have a simple test case ready yet; I ran into this potential bug while developing a separate package, plyrmr. If you are willing to install it, the test case is very simple. The RDD that triggers this problem is the result of a join, but I couldn't replicate the problem using a plain vanilla join.
> Example from Antonio on the SparkR JIRA: I don't have time to try any harder to reproduce this without plyrmr, but for the record, this is the example:
> {code}
> library(plyrmr)
> plyrmr.options(backend = "spark")
> df1 = mtcars[1:4,]
> df2 = mtcars[3:6,]
> w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
> {code}
> The gapply is implemented with a lapplyPartition in most cases, the merge with a join, and the as.data.frame with a collect. The join has an arbitrary argument of 4 partitions; if I turn that down to 2L, the problem disappears. I will check in a version with a workaround in place, but a debugging statement will leave a record in stderr whenever the workaround kicks in, so that we can track it.
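> A workaround along the lines described above could be sketched as follows (this is a hypothetical illustration, not the actual plyrmr patch; the wrapper name `guard.empty.partition` is invented). The idea is to wrap the per-partition function so an empty partition is short-circuited, with a message to stderr so each occurrence leaves a record:
> {code}
> # Hypothetical sketch: wrap a per-partition function so that an
> # empty partition is returned unchanged instead of being passed on,
> # logging to stderr (via message()) whenever the guard fires.
> guard.empty.partition = function(f) {
>   function(part) {
>     if (length(part) == 0) {
>       message("workaround: lapplyPartition called on empty partition")
>       return(list())
>     }
>     f(part)
>   }
> }
> # usage: lapplyPartition(rdd, guard.empty.partition(identity))
> {code}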



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org