Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/05/06 08:53:15 UTC

[jira] [Updated] (SPARK-3095) [PySpark] Speed up RDD.count()

     [ https://issues.apache.org/jira/browse/SPARK-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-3095:
-----------------------------
    Target Version/s:   (was: 1.2.0)

> [PySpark] Speed up RDD.count()
> ------------------------------
>
>                 Key: SPARK-3095
>                 URL: https://issues.apache.org/jira/browse/SPARK-3095
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Minor
>
> RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a PipelinedRDD.
> If the JavaRDD is serialized in batch mode, it is possible to skip deserializing every chunk except the last one, because each chunk but the last holds exactly the same number of elements. In some special cases the chunks are re-ordered, and then this shortcut does not apply.
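The counting shortcut described above can be sketched as follows. This is a minimal illustration, not the actual PySpark implementation: it assumes a fixed batch size, uses pickle to stand in for PySpark's batched serializer, and the helper names (serialize_batched, fast_count) are hypothetical.

```python
import pickle

def serialize_batched(items, batch_size):
    # Mimic batch-mode serialization (hypothetical helper): each chunk
    # is a pickled list of batch_size elements; only the last chunk
    # may be shorter.
    return [pickle.dumps(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]

def fast_count(chunks, batch_size):
    # Count without deserializing every chunk: all chunks except the
    # last are known to contain exactly batch_size elements, so only
    # the last chunk needs to be unpickled.
    if not chunks:
        return 0
    last = pickle.loads(chunks[-1])
    return (len(chunks) - 1) * batch_size + len(last)

chunks = serialize_batched(list(range(1050)), 100)
print(fast_count(chunks, 100))  # 1050
```

As the issue notes, this only works when chunk order is preserved; if chunks can be re-ordered, a short chunk may no longer be last and the arithmetic breaks.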



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org