Posted to issues@spark.apache.org by "Asoka Diggs (JIRA)" <ji...@apache.org> on 2015/09/28 21:12:05 UTC
[jira] [Comment Edited] (SPARK-10782) Duplicate examples for
drop_duplicates and DropDuplicates
[ https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933792#comment-14933792 ]
Asoka Diggs edited comment on SPARK-10782 at 9/28/15 7:11 PM:
--------------------------------------------------------------
A reasonable-sounding request, but I'm not familiar with the acronym (PR), and this is my first time reporting an issue. I will try to be more specific, and may need a pointer to remedial education :)
EDIT: PR = Pull Request. I found the documentation about Contributing to Spark and will puzzle my way through.
The change I propose is to the documentation for drop_duplicates only (I tested locally that it works in my Spark 1.5.0 pyspark instance):
<OLD line>
df.dropDuplicates().show()
<NEW line>
df.drop_duplicates().show()
A larger philosophical question - based on the documentation, it appears that there are 3 implementations of the equivalent of SQL's DISTINCT clause: distinct(), dropDuplicates(), and drop_duplicates(). The latter two support a column list to work on, but are otherwise the same as distinct(). It seems that ideally, all three of these are really 1 implementation behind the scenes, with the other two listed as aliases.
Hopefully this only requires a second update to the documentation (listing the 3 methods as aliases of each other). In the worst case, it becomes a suggestion that the 3 implementations be merged into 1 and the documentation updated to indicate that they are aliases.
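To illustrate what "1 implementation behind the scenes, with the other two listed as aliases" could look like, here is a minimal pure-Python sketch. TinyFrame is a hypothetical stand-in invented for this example (a list of dict rows); it is not Spark's DataFrame and makes no claim about how Spark actually implements these methods.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a list of dict rows. Not Spark code."""

    def __init__(self, rows):
        self.rows = list(rows)

    def dropDuplicates(self, subset=None):
        """Keep the first row seen for each distinct combination of `subset` columns.

        With subset=None, all columns are compared.
        """
        seen = set()
        kept = []
        for row in self.rows:
            cols = subset if subset is not None else sorted(row)
            key = tuple(row[c] for c in cols)
            if key not in seen:
                seen.add(key)
                kept.append(row)
        return TinyFrame(kept)

    # Alias: the same function object under a second name.
    drop_duplicates = dropDuplicates

    def distinct(self):
        """Same as dropDuplicates() with no column list."""
        return self.dropDuplicates()


df = TinyFrame([
    {"name": "Alice", "age": 5},
    {"name": "Alice", "age": 5},
    {"name": "Alice", "age": 10},
])

print(len(df.distinct().rows))                  # 2
print(len(df.drop_duplicates(["name"]).rows))   # 1
print(TinyFrame.drop_duplicates is TinyFrame.dropDuplicates)  # True
```

With this arrangement only dropDuplicates carries logic, so the documentation for the other two names can simply point at it.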
> Duplicate examples for drop_duplicates and DropDuplicates
> ---------------------------------------------------------
>
> Key: SPARK-10782
> URL: https://issues.apache.org/jira/browse/SPARK-10782
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.5.0
> Reporter: Asoka Diggs
> Priority: Trivial
>
> In the documentation for pyspark.sql, the source code examples for dropDuplicates and drop_duplicates are identical. It appears that the example for dropDuplicates was copy/pasted for drop_duplicates and not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates