Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/12/30 16:08:00 UTC
[jira] [Resolved] (SPARK-30185) Implement Dataset.tail API
[ https://issues.apache.org/jira/browse/SPARK-30185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-30185.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 26809
[https://github.com/apache/spark/pull/26809]
> Implement Dataset.tail API
> --------------------------
>
> Key: SPARK-30185
> URL: https://issues.apache.org/jira/browse/SPARK-30185
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.0.0
>
>
> I would like to propose an API called DataFrame.tail.
> *Background & Motivation*
> Many other systems support taking data from the end; for instance, pandas[1] and
> Python[2][3]. Scala collection APIs also have head and tail.
> Spark, on the other hand, only provides a way to take data from the start
> (e.g., DataFrame.head). A tail API has been requested multiple times on the Spark
> user mailing list[4], Stack Overflow[5][6], JIRA[7], and in third-party projects such as
> Koalas[8].
> It seems we're missing a non-trivial use case in Spark, and this motivated me to propose
> this API.
> *Proposal*
> I would like to propose an API on DataFrame called tail that collects rows from the
> end, in contrast to head.
> Namely, as below:
> {code:java}
> scala> spark.range(10).head(5)
> res1: Array[Long] = Array(0, 1, 2, 3, 4)
> scala> spark.range(10).tail(5)
> res2: Array[Long] = Array(5, 6, 7, 8, 9){code}
> Implementation details will be similar to head's, but reversed:
> 1. Run a job against the last partition and collect rows. If this is enough, return as is.
> 2. If this is not enough, calculate the number of additional partitions to select based upon
> ‘spark.sql.limit.scaleUpFactor’.
> 3. Run more jobs against that many more partitions (in a reversed order compared to head).
> 4. Go to 2.
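> The scale-up loop above can be sketched outside Spark by modeling a dataset as a
> sequence of in-memory partitions. Note this is an illustrative assumption, not
> Spark's actual implementation: the TailSketch object and its tail helper are made
> up for this sketch, and the default factor of 4 mirrors the documented default of
> spark.sql.limit.scaleUpFactor.

```scala
// Hypothetical sketch (not Spark's actual code): collect the last `n` rows by
// scanning partitions from the end and growing the number of partitions scanned
// by `scaleUpFactor` until enough rows have been found.
object TailSketch {
  def tail[T](partitions: Seq[Seq[T]], n: Int, scaleUpFactor: Int = 4): Seq[T] = {
    val total = partitions.length
    var scanned = 1                                       // step 1: last partition only
    var rows: Seq[T] = partitions.takeRight(scanned).flatten
    while (rows.length < n && scanned < total) {
      scanned = math.min(total, scanned * scaleUpFactor)  // step 2: scale up the scan
      rows = partitions.takeRight(scanned).flatten        // step 3: rescan, in reverse
    }                                                     // step 4: loop back to 2
    rows.takeRight(n)                                     // keep only the last n rows
  }
}
```

> For example, TailSketch.tail(Seq(Seq(0, 1, 2), Seq(3, 4, 5), Seq(6, 7, 8, 9)), 5)
> scans the last partition, finds only 4 rows, scales up to all partitions, and
> returns Seq(5, 6, 7, 8, 9).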
> [1] [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail]
> [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line]
> [3] [https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python]
> [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html]
> [5] [https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index]
> [6] [https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe]
> [7] https://issues.apache.org/jira/browse/SPARK-26433
> [8] [https://github.com/databricks/koalas/issues/343]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)