Posted to issues@spark.apache.org by "Ohad Raviv (JIRA)" <ji...@apache.org> on 2016/09/25 06:39:20 UTC
[jira] [Created] (SPARK-17662) Dedup UDAF
Ohad Raviv created SPARK-17662:
----------------------------------
Summary: Dedup UDAF
Key: SPARK-17662
URL: https://issues.apache.org/jira/browse/SPARK-17662
Project: Spark
Issue Type: New Feature
Reporter: Ohad Raviv
We have a common use case of deduping a table by creation order.
For example, we have an event log of user actions. A user marks his favorite category from time to time.
In our analytics we would like to know only the user's last favorite category.
The data:
user_id  action_type   value  date
123      fav category  1      2016-02-01
123      fav category  4      2016-02-02
123      fav category  8      2016-02-03
123      fav category  2      2016-02-04
We would like to get only the last update, as determined by the date column.
We could of course do it in SQL:
select * from (
  select *,
         row_number() over (partition by user_id, action_type order by date desc) as rnum
  from tbl)
where rnum = 1;
However, I believe this cannot be optimized on the map side: all the data gets shuffled to the reducers instead of being partially aggregated on the map side first.
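To make the desired behavior concrete, here is a minimal sketch in plain Python (not Spark code) of map-side partial aggregation for this dedup. The function names `latest_per_key` and `dedup`, and the way the example rows are split into two partitions, are assumptions made for illustration only; they are not part of any Spark API.

```python
from typing import Dict, Iterable, List, Tuple

# Row layout follows the example table: (user_id, action_type, value, date).
Row = Tuple[int, str, int, str]
Key = Tuple[int, str]


def latest_per_key(rows: Iterable[Row]) -> Dict[Key, Row]:
    """Keep only the row with the greatest date for each (user_id, action_type)."""
    latest: Dict[Key, Row] = {}
    for row in rows:
        key = (row[0], row[1])
        # ISO-8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in latest or row[3] > latest[key][3]:
            latest[key] = row
    return latest


def dedup(partitions: List[List[Row]]) -> List[Row]:
    # "Map side": each partition is reduced independently to one row per key.
    partials = [latest_per_key(p) for p in partitions]
    # "Reduce side": only the partial winners are merged, not every input row.
    merged = latest_per_key(
        row for partial in partials for row in partial.values()
    )
    return sorted(merged.values())


# Hypothetical split of the example data into two partitions.
partitions = [
    [(123, "fav category", 1, "2016-02-01"), (123, "fav category", 4, "2016-02-02")],
    [(123, "fav category", 8, "2016-02-03"), (123, "fav category", 2, "2016-02-04")],
]
result = dedup(partitions)
print(result)  # [(123, 'fav category', 2, '2016-02-04')]
```

In this sketch each partition forwards at most one row per key to the merge step, which is the saving the window-function query above is suspected of missing.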
We have written a UDAF for this, but that raises other issues, such as blocking predicate push-down for columns.
Do you have any idea for a proper solution?