Posted to issues@spark.apache.org by "Ohad Raviv (JIRA)" <ji...@apache.org> on 2016/09/25 06:39:20 UTC

[jira] [Created] (SPARK-17662) Dedup UDAF

Ohad Raviv created SPARK-17662:
----------------------------------

             Summary: Dedup UDAF
                 Key: SPARK-17662
                 URL: https://issues.apache.org/jira/browse/SPARK-17662
             Project: Spark
          Issue Type: New Feature
            Reporter: Ohad Raviv


We have a common use case of deduplicating a table by creation order.
For example, we have an event log of user actions, in which a user marks his favorite category from time to time.
In our analytics we would like to know only the user's latest favorite category.
The data:
user_id    action_type     value    date
123        fav category    1        2016-02-01
123        fav category    4        2016-02-02
123        fav category    8        2016-02-03
123        fav category    2        2016-02-04

We would like to keep only the most recent update, as determined by the date column (here, the row with value 2 dated 2016-02-04).

We could of course do it in SQL:
select *
from (
  select *, row_number() over (partition by user_id, action_type order by date desc) as rnum
  from tbl
) t
where rnum = 1;

But then, I believe, the query can't be optimized on the map side: all the data gets shuffled to the reducers instead of being partially aggregated on the map side first.
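
To make the map-side versus reduce-side distinction concrete, here is a minimal plain-Python sketch of partial aggregation for "latest row per key". It is not Spark code and not the UDAF mentioned in this issue; the names latest_per_key and dedup, and the split of the sample rows into two partitions, are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[int, str]            # (user_id, action_type)
Row = Tuple[int, str, int, str]  # (user_id, action_type, value, date)


def latest_per_key(rows: Iterable[Row]) -> Dict[Key, Row]:
    """Keep only the row with the greatest date for each key."""
    best: Dict[Key, Row] = {}
    for row in rows:
        key = (row[0], row[1])
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in best or row[3] > best[key][3]:
            best[key] = row
    return best


def dedup(partitions: List[List[Row]]) -> List[Row]:
    # "Map side": reduce each partition to at most one row per key.
    partials = [latest_per_key(p) for p in partitions]
    # "Shuffle": only the partial winners move, not every input row.
    shuffled = [row for partial in partials for row in partial.values()]
    # "Reduce side": merge the partial winners into the final answer.
    return sorted(latest_per_key(shuffled).values())


# Hypothetical split of the sample data into two partitions.
partitions = [
    [(123, "fav category", 1, "2016-02-01"),
     (123, "fav category", 4, "2016-02-02")],
    [(123, "fav category", 8, "2016-02-03"),
     (123, "fav category", 2, "2016-02-04")],
]
result = dedup(partitions)
```

In this sketch only one row per key per partition crosses the simulated shuffle, whereas a row_number() window has to see every row of a key in one place before it can rank them.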

We have written a UDAF for this, but it brings other issues, such as blocking predicate push-down for columns.

Do you have any ideas for a proper solution?



