Posted to dev@spark.apache.org by Antonin Delpeuch <li...@antonin.delpeuch.eu> on 2020/04/07 14:25:53 UTC

Re: RDD order guarantees

Hi,

Sorry to dig up this thread, but this bug is still present.

The fix proposed in this thread (creating a new FileSystem implementation
which sorts listed files) was rejected, with the suggestion that it is the
FileInputFormat's responsibility to sort the file names if preserving
partition order is desired:
https://github.com/apache/spark/pull/4204

Given that Spark RDDs are supposed to preserve the order of the collections
they represent, this would still deserve to be fixed in Spark, I think. As a
user, I expect that if I use saveAsTextFile and then load the resulting file
with sparkContext.textFile, I obtain a dataset in the same order.
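
To make the expectation concrete, here is a minimal sketch that uses only the
Java standard library (no Spark or Hadoop). It assumes part files named
part-00000, part-00001, ... and shows that the order of the loaded lines
depends entirely on the order in which the files are listed. The class and
method names (PartFileOrder, writeParts, readInListedOrder) are hypothetical
and are not part of Spark.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class PartFileOrder {

    // Writes three hypothetical part files and returns their paths in name order.
    static List<Path> writeParts(Path dir) throws IOException {
        String[][] contents = {{"a", "b"}, {"c", "d"}, {"e", "f"}};
        List<Path> parts = new ArrayList<>();
        for (int i = 0; i < contents.length; i++) {
            Path part = dir.resolve(String.format("part-%05d", i));
            Files.write(part, Arrays.asList(contents[i]), StandardCharsets.UTF_8);
            parts.add(part);
        }
        return parts;
    }

    // Concatenates the lines of the files, in whatever order the listing has.
    static List<String> readInListedOrder(List<Path> listing) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : listing) {
            lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rdd-order");
        List<Path> byName = writeParts(dir);

        // Assumption: the file system returns the files in some other order,
        // which an unordered listing is allowed to do.
        List<Path> listed = Arrays.asList(byName.get(2), byName.get(0), byName.get(1));
        System.out.println(readInListedOrder(listed)); // [e, f, a, b, c, d]

        // Sorting the listing by file name restores the written order.
        List<Path> sorted = new ArrayList<>(listed);
        sorted.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));
        System.out.println(readInListedOrder(sorted)); // [a, b, c, d, e, f]
    }
}
```

The zero-padded part numbers are what make a plain sort by file name
sufficient here.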

Because Spark uses the FileInputFormats exposed by Hadoop, that would mean
either patching Hadoop so that it sorts file names directly (which is likely
to fail, since Hadoop might not care about the ordering in general), or
creating subclasses of all Hadoop formats used in Spark, adding the required
sorting to the listStatus method. This strikes me as less elegant than
implementing a new FileSystem as suggested by Reynold, though.

Another way to "fix" this would be to mention in the docs that order is not
preserved in this scenario, which would hopefully spare others bad surprises
(just like the caveat we already have about nondeterministic order after
shuffles).

I would be happy to try submitting a fix for this, if there is a consensus
around the correct course of action.

Cheers,
Antonin




---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: RDD order guarantees

Posted by "Antonin Delpeuch (lists)" <li...@antonin.delpeuch.eu>.
Thanks a lot for the reply, Steve!

If you don't see a way to fix this in Spark itself, then I will try to
improve the docs.

Antonin

On 06/05/2020 17:19, Steve Loughran wrote:
> [...]




Re: RDD order guarantees

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch <li...@antonin.delpeuch.eu>
wrote:

> Hi,
>
> Sorry to dig out this thread but this bug is still present.
>
> The fix proposed in this thread (creating a new FileSystem implementation
> which sorts listed files) was rejected, with the suggestion that it is the
> FileInputFormat's responsibility to sort the file names if preserving
> partition order is desired:
> https://github.com/apache/spark/pull/4204
>
> Given that Spark RDDs are supposed to preserve the order of the collections
> they represent, this would still deserve to be fixed in Spark, I think. As a
> user, I expect that if I use saveAsTextFile and then load the resulting file
> with sparkContext.textFile, I obtain a dataset in the same order.
>
> Because Spark uses the FileInputFormats exposed by Hadoop, that would mean
> either patching Hadoop for it to sort file names directly (which is likely
> going to fail since Hadoop might not care about the ordering in general),


I don't see any guarantees in Hadoop about the order of listLocatedStatus,
and for the local FS you get what the OS gives you.

What isn't easy is taking an entire listing and sorting it, not when it is
potentially millions of entries. That issue is why the newer FS list APIs
all return a RemoteIterator<>: values are paged incrementally, which reduces
the payload of single RPC messages between the HDFS client and the namenode
and allows paged/incremental listings against object stores. You can't
provide incremental pages of results *and sort those results at the same
time*.
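
The tension between paging and sorting can be sketched with a plain
java.util.Iterator. This is only a stand-in, not Hadoop's RemoteIterator, and
the PagedListing class with its pagesFetched counter is hypothetical. A
consumer can use the first entry after one page has been fetched, whereas a
sorted result needs every page first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class PagedListing implements Iterator<String> {
    private final List<List<String>> pages; // stand-in for the remote store
    private int nextPage = 0;
    private Iterator<String> current = Collections.emptyIterator();
    int pagesFetched = 0; // counts simulated round trips

    PagedListing(List<List<String>> pages) {
        this.pages = pages;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page only once the current one is exhausted.
        while (!current.hasNext() && nextPage < pages.size()) {
            current = pages.get(nextPage++).iterator();
            pagesFetched++;
        }
        return current.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    // Sorting needs every entry, so the whole listing is drained first.
    static List<String> drainAndSort(PagedListing listing) {
        List<String> all = new ArrayList<>();
        while (listing.hasNext()) {
            all.add(listing.next());
        }
        Collections.sort(all);
        return all;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("part-00002", "part-00000"),
                Arrays.asList("part-00003", "part-00001"));

        PagedListing incremental = new PagedListing(pages);
        String first = incremental.next();
        System.out.println(first + " after " + incremental.pagesFetched + " page(s)");
        // part-00002 after 1 page(s)

        PagedListing sorted = new PagedListing(pages);
        List<String> all = drainAndSort(sorted);
        System.out.println(all + " after " + sorted.pagesFetched + " page(s)");
        // [part-00000, part-00001, part-00002, part-00003] after 2 page(s)
    }
}
```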

Which, given they're my problem, means I wouldn't be happy with adding
"sort all listings" as a new restriction on FS semantics.



> or
> create subclasses of all Hadoop formats used in Spark, adding the required
> sorting to the listStatus method. This strikes me as less elegant than
> implementing a new FileSystem as suggested by Reynold, though.
>
>
Again, you've got some scale issues to deal with, but as FileInputFormat
builds a list, it's already in trouble if you point it at a sufficiently
large directory tree.

The best thing to do would be to add entries to a TreeMap during the
recursive tree walk and then serve them up in order from there, with no need
to do a sort at the end.
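
A minimal sketch of that idea, using java.nio.file on a local directory rather
than the Hadoop FileSystem API. The SortedTreeWalk class and its methods are
hypothetical, and keying the map by the path relative to the root is an
assumption about what "ordered" should mean.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class SortedTreeWalk {

    // Recursively walks dir, adding each non-directory entry to the map as it
    // is found. The TreeMap keeps its keys ordered, so no final sort is needed.
    static void walk(Path root, Path dir, TreeMap<String, Path> files) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    walk(root, entry, files);
                } else {
                    files.put(root.relativize(entry).toString(), entry);
                }
            }
        }
    }

    // Returns all files under root, ordered by their path relative to root.
    static List<Path> listSorted(Path root) throws IOException {
        TreeMap<String, Path> files = new TreeMap<>();
        walk(root, root, files);
        return new ArrayList<>(files.values());
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("sorted-walk");
        Files.createFile(root.resolve("part-00002"));
        Files.createFile(root.resolve("part-00000"));
        Files.createDirectories(root.resolve("sub"));
        Files.createFile(root.resolve("sub").resolve("part-00001"));

        for (Path file : listSorted(root)) {
            System.out.println(root.relativize(file));
        }
    }
}
```

Note that the map still holds every entry in memory, so the scale concern
about very large directory trees applies to this sketch as well.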

But trying to subclass all Hadoop formats is itself troublesome. If you go
that way, make it an optional interface, and/or talk to the mapreduce
project about actually providing a base implementation.



> Another way to "fix" this would be to mention in the docs that order is not
> preserved in this scenario, which could hopefully avoid bad surprises to
> others (just like we already have a caveat about nondeterminism of order
> after shuffles).
>
> I would be happy to try submitting a fix for this, if there is a consensus
> around the correct course of action.
>

Even if it's not the final desired goal, it's a correct description of the
current state of the application ...