Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/10 00:30:56 UTC

[GitHub] [spark] HyukjinKwon edited a comment on issue #26809: [SPARK-30185][SQL] Implement Dataset.tail API

URL: https://github.com/apache/spark/pull/26809#issuecomment-563503647
 
 
   > How much is this different from sorting in reverse and head()? in comparison this looks like it has to traverse the whole data set?
   
   At least it can drop records on the executor side, and it doesn't require a sort.
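   The executor-side trimming can be sketched outside of Spark. This is a simplified model of the idea, not Spark's actual implementation; the function name and the list-of-lists representation of partitions are mine:

```python
def tail(partitions, n):
    """Return the last n records in partition order, without sorting.

    Each executor only needs to retain the last n records of its own
    partition; the driver then takes the last n of the concatenation.
    """
    if n <= 0:
        return []
    # Executor side: drop everything except the last n records per partition.
    trimmed = [p[-n:] for p in partitions]
    # Driver side: concatenate in partition order and take the final n.
    merged = [rec for p in trimmed for rec in p]
    return merged[-n:]

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(tail(partitions, 3))  # -> [7, 8, 9]
```

   The point is that no comparison between records is ever made, so no sort (and no shuffle for the sort) is needed; only partition order matters.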
   
   > once the shuffle is involved, without ordering there should be no outstanding difference with head() as we don't guarantee ordering anyway, and with ordering the semantic would be same as sort with reverse order + head().
   
   Yes, I think this is a good point. With ordering, it's just a different way of doing the same thing. Without ordering, it's designed to follow the natural order, which is not guaranteed in many cases in Spark.
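   To make the with-ordering equivalence concrete, here is a plain Python illustration of the semantic point (not Spark code): with an explicit ordering, `tail(n)` after an ascending sort returns the same rows as `head(n)` after a descending sort, just in reversed order.

```python
data = [5, 1, 4, 2, 3]
n = 2

# Ascending sort, then take the last n (i.e. sort + tail(n)).
sorted_tail = sorted(data)[-n:]
# Descending sort, then take the first n (i.e. reverse sort + head(n)).
reverse_head = sorted(data, reverse=True)[:n]

print(sorted_tail)                   # -> [4, 5]
print(list(reversed(reverse_head)))  # -> [4, 5]
```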
   
   One clear use case might be reading from an external data source. If I am not wrong, Hadoop RDD (which most external data sources use) respects the natural order, so the `spark.read.format("xml").load().tail(5)` case will work.
   Another case is a local collection, where, if I am not wrong, the natural order is also preserved.
   I am sure there are more such cases which I should identify.
   
   FWIW, Spark used to (unofficially) respect the natural order, but IIRC that broke after we started consolidating small partitions into bigger ones.
