Posted to issues@spark.apache.org by "Abhijit Deb (JIRA)" <ji...@apache.org> on 2015/10/06 23:37:27 UTC

[jira] [Updated] (SPARK-10962) DataFrame "except" method...

     [ https://issues.apache.org/jira/browse/SPARK-10962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhijit Deb updated SPARK-10962:
--------------------------------
    Affects Version/s: 1.5.0
             Priority: Critical  (was: Major)
          Description: We are trying to find the duplicate rows in a DataFrame. We first get the unique rows, then try to derive the duplicates using "except". Computing the uniques is quite fast, but getting the duplicates using "except" is tremendously slow. What is the best way to get the duplicates? Getting just the uniques is not sufficient in most use cases.
          Component/s: SQL
              Summary: DataFrame "except" method...  (was: DataFrame "except)

> DataFrame "except" method...
> ----------------------------
>
>                 Key: SPARK-10962
>                 URL: https://issues.apache.org/jira/browse/SPARK-10962
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Abhijit Deb
>            Priority: Critical
>
> We are trying to find the duplicate rows in a DataFrame. We first get the unique rows, then try to derive the duplicates using "except". Computing the uniques is quite fast, but getting the duplicates using "except" is tremendously slow. What is the best way to get the duplicates? Getting just the uniques is not sufficient in most use cases.
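
Editor's note: the thread does not resolve the question, but one common workaround (not from the original report) is to replace the distinct-then-except pattern with a single group-and-count aggregation, which avoids the expensive anti-join that `except` performs. In Spark this is roughly `df.groupBy(df.columns.head, df.columns.tail: _*).count().filter("count > 1")`. The same group-and-count logic is sketched below with plain Python collections for illustration; the rows and column values are made up:

```python
from collections import Counter

def find_duplicates(rows):
    """Return the distinct rows that appear more than once.

    Stand-in for the Spark pattern
    df.groupBy(cols).count().filter("count > 1"):
    one counting pass instead of distinct + except.
    """
    counts = Counter(rows)  # hashable rows -> occurrence count
    return [row for row, n in counts.items() if n > 1]

# Hypothetical sample data: (id, value) tuples with two duplicated rows.
rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b"), (2, "b")]
print(find_duplicates(rows))  # -> [(1, 'a'), (2, 'b')]
```

The key difference from the reporter's approach is that a single aggregation produces the duplicates directly, rather than materializing the unique rows and then subtracting them from the full dataset.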



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org