
Posted to dev@datafu.apache.org by "Eyal Allweil (Jira)" <ji...@apache.org> on 2022/05/16 09:34:00 UTC

[jira] [Commented] (DATAFU-159) Add diff functionality to datafu-spark

    [ https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537423#comment-17537423 ] 

Eyal Allweil commented on DATAFU-159:
-------------------------------------

I've discovered the [spark-extension|https://github.com/G-Research/spark-extension] library, which contains a [diff|https://github.com/G-Research/spark-extension/blob/master/DIFF.md] method that seems to do exactly this. The only caveat is that this library is provided for Spark 3.x, whereas DataFu is still on Spark 2.x.

In light of this, I'm inclined to close this issue. Does anyone disagree? I suppose we could also copy the code (with attribution) so that people on the Spark 2.x line could use it until they upgrade.
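
To illustrate what a key-based diff adds over _except_, here is a minimal sketch in plain Python. It is not Spark code and not the spark-extension API: the function name `diff_rows`, the status labels, and the sample rows are all made up for this example. A set difference of the same inputs would only return the rows for ids 2 and 3, without saying whether each one was changed or is missing.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
DiffEntry = Tuple[str, Any, Optional[Row], Optional[Row]]


def diff_rows(left: List[Row], right: List[Row], key: str) -> List[DiffEntry]:
    """Classify every key found in either input.

    Hypothetical labels used here:
      "unchanged"  - key in both inputs, rows are equal
      "changed"    - key in both inputs, rows differ
      "left_only"  - key only in the left input
      "right_only" - key only in the right input

    Assumes key values are unique within each input and mutually sortable.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    result: List[DiffEntry] = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l_row = left_by_key.get(k)
        r_row = right_by_key.get(k)
        if l_row is None:
            status = "right_only"
        elif r_row is None:
            status = "left_only"
        elif l_row == r_row:
            status = "unchanged"
        else:
            status = "changed"
        result.append((status, k, l_row, r_row))
    return result


# Sample data invented for this example.
expected = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 3, "value": "c"},
]
actual = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "B"},
    {"id": 4, "value": "d"},
]

for status, k, l_row, r_row in diff_rows(expected, actual, "id"):
    print(status, k, l_row, r_row)
```

Keeping both sides of each row in the output is what makes a regression easy to read: a changed row shows the expected and actual values next to each other.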

> Add diff functionality to datafu-spark
> --------------------------------------
>
>                 Key: DATAFU-159
>                 URL: https://issues.apache.org/jira/browse/DATAFU-159
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Priority: Major
>
> A useful feature when examining results is the ability to clearly understand the differences between two datasets - for example, when comparing expected and actual results in regression tests.
> Spark provides the _except_ functionality, but this is often not enough - for example, see [this question on Stack Overflow.|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala]
> Datafu-pig had a macro for doing this, and this could be a useful addition to datafu-spark.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)