Posted to commits@spark.apache.org by sr...@apache.org on 2020/05/12 13:29:13 UTC

[spark] branch master updated: [MINOR][DOCS] Mention lack of RDD order preservation after deserialization

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 59d9099  [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
59d9099 is described below

commit 59d90997a52f78450fefbc96beba1d731b3678a1
Author: Antonin Delpeuch <an...@delpeuch.eu>
AuthorDate: Tue May 12 08:27:43 2020 -0500

    [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
    
    ### What changes were proposed in this pull request?
    
    This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)).
    
    I added two sentences about this in the RDD Programming Guide.
    
    The issue was discussed on the dev mailing list:
    http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
    
    ### Why are the changes needed?
    
    Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD whose elements are in the same order.
    
    This is unfortunately not the case at the moment, and there is no agreed-upon way to fix it in Spark itself (see PR #4204, which attempted a fix). Users should be aware of this.
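
    To make the caveat concrete, here is a minimal, hypothetical sketch (not part of this patch; the app name, local master, and `/tmp` paths are assumptions for illustration) of the round trip, plus one way to restore order explicitly by saving an index alongside each element:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: app name, local master, and /tmp paths are
    // illustrative assumptions, not part of this patch.
    object OrderRoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("order-demo").setMaster("local[4]"))

        val rdd = sc.parallelize(1 to 1000, numSlices = 8)
        rdd.saveAsTextFile("/tmp/order-demo") // one part-* file per partition

        // The reloaded RDD's partition order depends on how the filesystem
        // lists the part-* files, so element order may differ from `rdd`.
        // reloaded.collect() is NOT guaranteed to equal (1 to 1000).
        val reloaded = sc.textFile("/tmp/order-demo").map(_.toInt)

        // Workaround: save an explicit index and sort on it after reloading.
        rdd.zipWithIndex()
          .map { case (v, i) => s"$i,$v" }
          .saveAsTextFile("/tmp/order-demo-indexed")
        val restored = sc.textFile("/tmp/order-demo-indexed")
          .map(_.split(','))
          .map { case Array(i, v) => (i.toLong, v.toInt) }
          .sortBy(_._1)
          .values
        println(restored.collect().sameElements(1 to 1000)) // true: order restored

        sc.stop()
      }
    }
    ```

    Sorting by the saved index makes the ordering explicit instead of relying on filesystem listing order.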
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, two new sentences in the documentation.
    
    ### How was this patch tested?
    
    By checking that the documentation looks good.
    
    Closes #28465 from wetneb/SPARK-5300-docs.
    
    Authored-by: Antonin Delpeuch <an...@delpeuch.eu>
    Signed-off-by: Sean Owen <sr...@gmail.com>
---
 docs/rdd-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index ba99007..70bfefc 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -360,7 +360,7 @@ Some notes on reading files with Spark:
 
 * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
 
-* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
+* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partition, elements are ordered according to their order in the underlying file.
 
 * The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
 


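As a hypothetical illustration of the sentence added in the diff above (the app name, local master, and paths are assumptions, not from the patch), tagging each line with its partition index makes the filesystem-dependent mapping of files to partitions visible; the sketch also shows the optional second argument of `textFile` requesting more partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch; assumes a few small *.txt files already exist
// under /tmp/demo. App name, master, and paths are illustrative only.
object PartitionOrderProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("glob-demo").setMaster("local[2]"))

    // textFile accepts directories, compressed files, and wildcards.
    val lines = sc.textFile("/tmp/demo/*.txt")

    // Tag each line with its partition index to observe how the
    // filesystem's listing order mapped files to partitions.
    lines.mapPartitionsWithIndex { (idx, it) =>
      it.map(line => s"partition $idx: $line")
    }.collect().foreach(println)

    // The optional second argument requests at least this many partitions.
    val finer = sc.textFile("/tmp/demo/*.txt", minPartitions = 10)
    println(s"partitions: ${finer.getNumPartitions}")

    sc.stop()
  }
}
```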