Posted to dev@datafu.apache.org by "Eyal Allweil (JIRA)" <ji...@apache.org> on 2017/09/12 12:50:01 UTC
[jira] [Created] (DATAFU-129) New macro - dedup
Eyal Allweil created DATAFU-129:
-----------------------------------
Summary: New macro - dedup
Key: DATAFU-129
URL: https://issues.apache.org/jira/browse/DATAFU-129
Project: DataFu
Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil
A macro to dedup (de-duplicate) a table, based on one or more keys and an ordering field (typically a date-updated field), keeping the most recent record for each key.
One thing to consider: the implementation relies on the ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test dependencies so that the test can run. While I feel that anyone using Pig typically has PiggyBank on the classpath, this might not always be true. Do we have an alternative? (Maybe adding it to the jarjar?)
The macro's definition looks as follows:
DEFINE dedup(relation, row_key, order_field) returns out {
relation - relation to dedup
row_key - field(s) for group by
order_field - the field for ordering (to find the most recent record)
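To make the intended semantics concrete, here is a minimal Python sketch of what the three parameters above describe. It is not the macro's Pig implementation (which is not shown in this message); the function name dedup_rows, the dict-based row representation, and the tie-breaking rule (the first row seen wins when two rows share the same order_field value) are assumptions made purely for illustration.

```python
def dedup_rows(relation, row_key, order_field):
    """Keep one row per key: the row with the largest order_field value.

    relation    - iterable of dicts (hypothetical stand-in for a Pig relation)
    row_key     - list of field names to group by
    order_field - field name used for ordering (e.g. a date-updated field)
    """
    latest = {}
    for row in relation:
        key = tuple(row[k] for k in row_key)
        # Assumption: on ties, the first row encountered is kept.
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"id": 1, "name": "a", "updated": "2017-09-01"},
    {"id": 1, "name": "b", "updated": "2017-09-10"},
    {"id": 2, "name": "c", "updated": "2017-09-05"},
]
deduped = dedup_rows(rows, ["id"], "updated")
```

In this sketch, the two rows with id 1 collapse into the one with the later "updated" value, and the single row with id 2 passes through unchanged.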