Posted to user@spark.apache.org by Nikos Viorres <nv...@gmail.com> on 2015/03/18 07:09:32 UTC

updateStateByKey performance / API

Hi all,

We are having a few issues with the performance of the updateStateByKey
operation in Spark Streaming (1.2.1 at the moment), and any advice would be
greatly appreciated. Specifically, on each tick of the system (which is set
to 10 seconds) we need to update a state tuple where the key is the user_id
and the value is an object holding some state about the user. The problem is
that with Kryo serialization and 5M users this gets so slow that we have to
increase the batch interval to more than 10 seconds in order not to fall
behind.
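
For reference, our update step looks roughly like the following minimal
sketch (UserState and all names here are simplified stand-ins, not our
real code):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class UserState(lastActionCode: Int, actionCount: Long)

    val conf = new SparkConf()
      .setAppName("UserStateJob")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second ticks
    ssc.checkpoint("...") // our actual checkpoint directory

    // updateStateByKey invokes this for EVERY key on every batch,
    // including users with no new actions (newActions is empty for them)
    def updateUser(newActions: Seq[Int],
                   state: Option[UserState]): Option[UserState] = {
      val prev = state.getOrElse(UserState(0, 0L))
      if (newActions.isEmpty)
        Some(prev) // untouched state is still iterated and re-serialized
      else
        Some(UserState(newActions.last, prev.actionCount + newActions.size))
    }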
The input for the streaming job is a Kafka stream that consists of key-value
pairs of user_ids and action codes; we join this with our checkpointed state
and update the state.
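
Continuing the sketch above, the input side looks more or less like this
(using the receiver-based Kafka API from spark-streaming-kafka; the ZK
address, group and topic names are made up):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // (String, String) pairs of user_id -> action code from Kafka
    val kafkaStream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "consumer-group", Map("user-actions" -> 4))

    val actions = kafkaStream.map { case (userId, code) =>
      (userId, code.toInt)
    }

    // the "join" with the checkpointed state happens inside
    // updateStateByKey, which cogroups the new batch against the
    // previous state RDD
    val userStates = actions.updateStateByKey(updateUser _)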
I understand that the reason for iterating over the whole state set is to
allow evicting items or updating every key's state for time-dependent
computations, but neither applies to our situation, and it hurts performance
badly.
Is there a possibility of adding, in a future release, an extra call in the
API for updating only a specific subset of keys?
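To be concrete, something along these lines is what I have in mind (purely
hypothetical, this method does not exist in the current API):

    import org.apache.spark.streaming.dstream.DStream

    // hypothetical extra call: run updateFunc only for keys that appear
    // in the current batch and leave all other state entries untouched
    def updateStateByActiveKey[S](
        updateFunc: (Seq[Int], Option[S]) => Option[S]): DStream[(String, S)] =
      ???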

p.s. I will try setting the DStream to a non-serialized storage level as
soon as possible, but then I am worried about GC pressure and checkpointing
performance.
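Concretely, the experiment would be the following (assuming my understanding
is right that state DStreams default to MEMORY_ONLY_SER):

    import org.apache.spark.storage.StorageLevel

    // keep state as deserialized objects instead of Kryo-serialized
    // bytes; avoids per-batch (de)serialization, but leaves many
    // long-lived objects for the GC, and the full state is still
    // written out at checkpoints
    userStates.persist(StorageLevel.MEMORY_ONLY)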