Posted to user@spark.apache.org by Venkat Subramanian <vs...@gmail.com> on 2014/07/30 20:32:18 UTC

Partitioner to process data in the same order for each key

I have a data file that I need to process using Spark. The file has multiple
events for different users, and I need to process the events for each user in
the order they appear in the file.

User 1: Event 1
User 2: Event 1
User 1: Event 2
User 3: Event 1
User 2: Event 2
User 3: Event 2

etc..

I want to make sure that User 1's events, User 2's events, and User 3's
events are each processed in the same sequential order they arrive (I don't
care whether order is maintained across users). That is, process User 1's
Event 1, Event 2, Event 3 in order on one Spark node, and do the same for
User 2, and so on.

We used the user ID as the key (it is unique) and expected the default
partitioner, the hash partitioner, to always send user i's events to one
particular node (so that the order of events for that user is maintained).
In practice that was not happening: we saw events for the same user being
distributed to different nodes. So either my understanding of the hash
partitioner is incorrect, or we are making some mistake.
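To make my understanding concrete, here is a rough pure-Python model (not
Spark code) of what I believe a hash partitioner does: partition index =
hash(key) mod numPartitions, so every record with the same key should land
in the same partition. The event data below is made up to mirror the example
above.

```python
def hash_partition(key, num_partitions):
    """Model of HashPartitioner.getPartition: same key -> same partition.

    Python's % already returns a non-negative result for a positive
    modulus, which mirrors Spark's nonNegativeMod behaviour.
    """
    return hash(key) % num_partitions


# Hypothetical (userId, event) pairs in file order.
events = [("user1", "e1"), ("user2", "e1"), ("user1", "e2"),
          ("user3", "e1"), ("user2", "e2"), ("user3", "e2")]

# Group records by their computed partition, keeping arrival order.
partitions = {}
for user, event in events:
    partitions.setdefault(hash_partition(user, 4), []).append((user, event))

# If this model is right, all of a given user's events sit in exactly one
# partition, in their original file order.
for p, recs in sorted(partitions.items()):
    print(p, recs)
```

If this model is correct, I don't see why the same user's events would end
up on different nodes.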

Is there any standard partitioner that Spark supports that we can use if the
hash partitioner is not the right one for this use case? Or do we need to
write our own partitioner? If so, could someone give pseudocode for this use
case to help us?
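Here is a rough sketch of the custom-partitioner idea as I currently picture
it, written as a PySpark-style partition function (the four-partition count
and the "userN" key format are assumptions; in Scala this would instead be a
subclass of org.apache.spark.Partitioner overriding numPartitions and
getPartition):

```python
import hashlib


def user_partitioner(user_id, num_partitions=4):
    """Deterministically map a user ID to a partition index.

    A stable digest is used instead of Python's built-in hash() so the
    mapping does not depend on per-process hash randomization and is the
    same on every executor.
    """
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# With a live SparkContext, my understanding is this would plug in
# roughly as (untested sketch; parse() is a hypothetical line parser):
#   keyed = rdd.map(parse)                       # -> (userId, event) pairs
#   partitioned = keyed.partitionBy(4, user_partitioner)
# after which each user's events sit in one partition.
```

The essential property we need is just that the same user ID always maps to
the same partition, so one node sees all of that user's events.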

Regards,

Venkat

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Partioner-to-process-data-in-the-same-order-for-each-key-tp10977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.