Posted to issues@spark.apache.org by "Asoka Diggs (JIRA)" <ji...@apache.org> on 2015/09/28 21:12:05 UTC

[jira] [Comment Edited] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

    [ https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933792#comment-14933792 ] 

Asoka Diggs edited comment on SPARK-10782 at 9/28/15 7:11 PM:
--------------------------------------------------------------

A reasonable-sounding request, but I'm not familiar with the acronym (PR), and this is my first time dipping my toe into reporting an issue.  I will try to be more specific, and may need a pointer to remedial education :)

EDIT: PR = Pull Request.  I found the documentation about Contributing to Spark and will puzzle my way through.


The change I propose is in the documentation for drop_duplicates only (tested locally in my Spark 1.5.0 pyspark instance):
<OLD line>
df.dropDuplicates().show()

<NEW line>
df.drop_duplicates().show()


A larger philosophical question - based on the documentation, it appears that there are 3 implementations of the equivalent of SQL's DISTINCT clause:  distinct(), dropDuplicates(), and drop_duplicates().  The latter two support a column list to work on, but are otherwise the same as distinct().  It seems that ideally, all three of these are really 1 implementation behind the scenes, with the other two listed as aliases.

This is hopefully a second update to the documentation (listing the 3 methods as aliases of each other).  In the worst case, this becomes a suggestion that the 3 implementations get merged into 1, and the documentation updated to indicate these are aliases.
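The "one implementation behind the scenes, with the other two listed as aliases" idea can be sketched in plain Python. This is a hypothetical toy class, not pyspark's actual source: the point is only that a second method name can be a plain attribute alias of the first, so both names share one implementation, and distinct() is the no-argument case.

```python
class Frame:
    """Toy stand-in for a DataFrame (hypothetical; not pyspark's implementation)."""

    def __init__(self, rows):
        # rows: list of dicts, one dict per row
        self.rows = rows

    def dropDuplicates(self, subset=None):
        # Keep the first occurrence of each key. When subset is given,
        # only those columns define the key, mirroring dropDuplicates(['col']).
        seen, out = set(), []
        for row in self.rows:
            key = tuple(row[c] for c in subset) if subset else tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
        return Frame(out)

    # One implementation, two names: drop_duplicates is just an alias.
    drop_duplicates = dropDuplicates

    def distinct(self):
        # distinct() takes no column list; it is dropDuplicates() over all columns.
        return self.dropDuplicates()


rows = [
    {"name": "Alice", "age": 5},
    {"name": "Alice", "age": 5},
    {"name": "Alice", "age": 10},
]
df = Frame(rows)
print(len(df.distinct().rows))                  # 2 (exact duplicate row removed)
print(len(df.drop_duplicates(["name"]).rows))   # 1 (deduplicated on name only)
```

Under this pattern the documentation for all three methods could simply point at the one shared implementation, which is the second update suggested below.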



> Duplicate examples for drop_duplicates and DropDuplicates
> ---------------------------------------------------------
>
>                 Key: SPARK-10782
>                 URL: https://issues.apache.org/jira/browse/SPARK-10782
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.5.0
>            Reporter: Asoka Diggs
>            Priority: Trivial
>
> In documentation for pyspark.sql, the source code examples for DropDuplicates and drop_duplicates are identical with each other.  It appears that the example for DropDuplicates was copy/pasted for drop_duplicates and not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates



