Posted to issues@spark.apache.org by "Han Altae-Tran (JIRA)" <ji...@apache.org> on 2019/02/17 06:39:00 UTC

[jira] [Created] (SPARK-26906) Pyspark RDD Replication Not Working

Han Altae-Tran created SPARK-26906:
--------------------------------------

             Summary: Pyspark RDD Replication Not Working
                 Key: SPARK-26906
                 URL: https://issues.apache.org/jira/browse/SPARK-26906
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Web UI
    Affects Versions: 2.3.2
         Environment: I am using Google Cloud's Dataproc version [1.3.19-deb9 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018] (Spark 2.3.2 and Hadoop 2.9.0) on Debian 9 with Python 3.7. The PySpark shell is started using pyspark --num-executors 100
            Reporter: Han Altae-Tran


PySpark RDD replication doesn't seem to be working properly. Even in a simple example, the UI reports only 1x replication, despite the RDD being persisted with the 2x-replicated storage level DISK_ONLY_2:
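{code:python}
import pyspark

# sc is the SparkContext provided by the PySpark shell
rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # shell prints: PythonRDD[1] at RDD at PythonRDD.scala:52

# Trigger a job so that the RDD is materialized and persisted
mapped.count()
{code}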
 

resulting in the following:

!image-2019-02-17-01-33-08-551.png!

 

Interestingly, if you catch the UI page at just the right time, you see that it starts off 2x replicated:

 

!image-2019-02-17-01-35-37-034.png!

 

but it drops back to 1x replicated once the RDD is fully materialized. This is likely not a UI bug, because the cached partitions page also shows only 1x replication:

 

!image-2019-02-17-01-36-55-418.png!

 

This could be the result of some type of replication optimization, but it is undesirable for users who want a specific level of replication for fault tolerance.
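
For anyone trying to reproduce this without watching the UI pages, the reported storage level can also be read from Spark's monitoring REST API. The following is only a rough sketch, not something taken from the run above: the helper names replication_factor and reported_replication are made up for illustration, and it assumes that the storage/rdd endpoint is reachable from the driver and that each entry carries a storageLevel string of the form "Disk Serialized 2x Replicated".

```python
import json
import re
from urllib.request import urlopen


def replication_factor(description):
    """Extract N from a description such as 'Disk Serialized 2x Replicated'."""
    match = re.search(r"(\d+)x Replicated", description)
    if match is None:
        raise ValueError("no replication factor in %r" % description)
    return int(match.group(1))


def reported_replication(ui_url, app_id):
    """Return {rdd_id: replication factor} as reported by the REST API.

    Assumption: <ui_url>/api/v1/applications/<app_id>/storage/rdd is
    reachable and returns a JSON list with 'id' and 'storageLevel' fields.
    """
    url = "%s/api/v1/applications/%s/storage/rdd" % (ui_url.rstrip("/"), app_id)
    with urlopen(url) as response:
        rdds = json.load(response)
    return {r["id"]: replication_factor(r["storageLevel"]) for r in rdds}
```

In the PySpark shell this could be called as reported_replication(sc.uiWebUrl, sc.applicationId) after mapped.count() finishes, and the result compared with the requested value from mapped.getStorageLevel().replication. If the behavior described above holds, the requested value would be 2 while the reported value would be 1.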


