Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/10/30 09:38:27 UTC

[jira] [Commented] (SPARK-10962) DataFrame "except" method...

    [ https://issues.apache.org/jira/browse/SPARK-10962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982160#comment-14982160 ] 

Herman van Hovell commented on SPARK-10962:
-------------------------------------------

Do you want to know which rows have duplicates? If so, you can also use a window function for this. For example:
{noformat}
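// Run in spark-shell (Spark 1.5), where sqlContext is predefined.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Create a dataset that contains duplicate (grp1, grp2) rows.
val duplicateDf = sqlContext.range(1 << 20)
  .select(
    ($"id" - ($"id" % 2)).as("grp1"),
    ($"id" - ($"id" % 3)).as("grp2"))

// Count unique records.
duplicateDf.distinct.count // res1: Long = 699051

// De-duplicate using window functions: number the rows within each
// (grp1, grp2) group and keep only the first one. The "count" column
// holds the group size, so count > 1 marks a row that had duplicates.
// Note: rowNumber() was renamed to row_number() in Spark 1.6.
val window = Window.partitionBy($"grp1", $"grp2").orderBy($"grp1", $"grp2")
val deDuplicatedDf = duplicateDf
  .select(
    $"*",
    rowNumber().over(window).as("selector"),
    count(lit(1)).over(window).as("count"))
  .filter($"selector" === lit(1))

// Count unique records with the window function.
deDuplicatedDf.count // res2: Long = 699051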
{noformat}
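
If the goal is the duplicated rows themselves rather than the de-duplicated set, the same per-group count can be used as a filter. The following is only a sketch under the same assumptions as above (spark-shell on Spark 1.5 with a predefined sqlContext that supports window functions); the names byKey, withGroupSizeDf, duplicatedRowsDf, duplicatedKeysDf and the "groupSize" column are made up for this illustration and do not come from the issue.
{noformat}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Same example dataset as above.
val duplicateDf = sqlContext.range(1 << 20)
  .select(
    ($"id" - ($"id" % 2)).as("grp1"),
    ($"id" - ($"id" % 3)).as("grp2"))

// No ordering is needed when only the size of each group matters.
val byKey = Window.partitionBy($"grp1", $"grp2")

// Attach the group size to every row (hypothetical column name "groupSize").
val withGroupSizeDf = duplicateDf
  .select($"*", count(lit(1)).over(byKey).as("groupSize"))

// All rows that belong to a group with more than one member.
val duplicatedRowsDf = withGroupSizeDf.filter($"groupSize" > 1)

// One row per duplicated (grp1, grp2) key.
val duplicatedKeysDf = duplicatedRowsDf.select($"grp1", $"grp2").distinct

duplicatedRowsDf.count // 699050 for this dataset
duplicatedKeysDf.count // 349525 for this dataset
{noformat}
This avoids "except" entirely: every row is tagged with the size of its group in a single pass over the partitioned data, and the filter keeps the groups that occur more than once.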





> DataFrame "except" method...
> ----------------------------
>
>                 Key: SPARK-10962
>                 URL: https://issues.apache.org/jira/browse/SPARK-10962
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Abhijit Deb
>            Priority: Critical
>
> We are trying to find the duplicates in a DataFrame. We first get the unique rows and then try to get the duplicates using "except". Getting the unique rows is quite fast, but getting the duplicates using "except" is tremendously slow. What is the best way to get the duplicates? Getting just the unique rows is not sufficient in most use cases.


