Posted to user@spark.apache.org by nehalsyed <ne...@cable.comcast.com> on 2015/08/21 00:47:52 UTC

Kafka Spark Partition Mapping

I have data in Kafka topic-partitions and I am reading it from Spark like this:

    JavaPairInputDStream<String, String> directKafkaStream =
        KafkaUtils.createDirectStream(streamingContext,
            [key class], [value class], [key decoder class], [value decoder class],
            [map of Kafka parameters], [set of topics to consume]);

I want messages from a given Kafka partition to always land on the same machine in the Spark RDD, so I can cache some decoration data locally and later reuse it with other messages that belong to the same key. Can anyone tell me how I can achieve this?

Thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Partition-Mapping-tp24372.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Kafka Spark Partition Mapping

Posted by Cody Koeninger <co...@koeninger.org>.
If your cache doesn't change during operation, you can just create it once
then broadcast it to all workers.

Otherwise, use redis / memcache / whatever.
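A minimal sketch of the broadcast-once approach in plain Java. The decoration map and keys are hypothetical; the sc.broadcast call shown in the comment is the Spark API this pattern maps onto:

```java
import java.util.Map;

public class BroadcastOnce {
    public static void main(String[] args) {
        // Build the decoration map once on the driver side.
        // (Made-up keys/values, standing in for the real decoration data.)
        Map<String, String> decoration = Map.of(
            "user-42", "premium",
            "user-7", "trial");

        // In Spark this map would be shipped once per executor with:
        //   Broadcast<Map<String, String>> b = jsc.broadcast(decoration);
        // and every task would read it via b.value() -- no per-message DB hit,
        // and no dependence on which machine a Kafka partition lands on.
        System.out.println(decorate("user-42|login", decoration));
    }

    // Appends the decoration for the message's key (key = text before '|').
    static String decorate(String message, Map<String, String> decoration) {
        String key = message.split("\\|")[0];
        return message + "|" + decoration.getOrDefault(key, "unknown");
    }
}
```

Because every executor holds the full map, this sidesteps the partition-to-machine question entirely, at the cost of the map having to fit in executor memory.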


Re: Kafka Spark Partition Mapping

Posted by "Syed, Nehal (Contractor)" <Ne...@cable.comcast.com>.
Dear Cody,
Thanks for your response. I am trying to do decoration, meaning that when a message comes from Kafka (partitioned by key) into Spark, I want to add more fields/data to it.
How do people normally do this in Spark? If it were you, how would you decorate a message without hitting the database for every message?

Our current strategy is that decoration data comes from a local in-memory cache (Guava LoadingCache) and/or from a SQL DB when it is not in the cache. If we take this approach, we want the cached decoration data to be available locally to the RDDs most of the time.
Our Kafka and Spark run on separate machines, and that's why I want a Kafka partition to go to the same Spark RDD partition most of the time, so I can utilize the cached decoration data.
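The cache-or-DB strategy above can be sketched with the JDK alone. Here ConcurrentHashMap.computeIfAbsent stands in for Guava's LoadingCache, and loadFromDb is a hypothetical placeholder for the SQL lookup:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DecorationCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    int dbHits = 0;  // counts simulated DB lookups, to show the cache working

    // Hypothetical stand-in for the SQL lookup performed on a cache miss.
    private String loadFromDb(String key) {
        dbHits++;
        return "decoration-for-" + key;
    }

    // Guava's LoadingCache.get(key) behaves much like this: load on miss,
    // serve from memory on every later hit.
    String get(String key) {
        return cache.computeIfAbsent(key, this::loadFromDb);
    }

    public static void main(String[] args) {
        DecorationCache cache = new DecorationCache();
        cache.get("user-42");  // miss: hits the "DB"
        cache.get("user-42");  // hit: served from memory
        System.out.println("db hits: " + cache.dbHits);  // prints "db hits: 1"
    }
}
```

The catch, as discussed in this thread, is that such a cache only pays off if messages for a given key keep arriving on the executor that already holds the entry.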

Do you think that if I create a JdbcRDD for the decoration data and join it with the JavaPairReceiverInputDStream, the joined data will always stay where the JdbcRDD lives?

Nehal



Re: Kafka Spark Partition Mapping

Posted by Cody Koeninger <co...@koeninger.org>.
In general you cannot guarantee which node an RDD partition will be processed on.

The preferred location for a KafkaRDD partition is the Kafka leader for that partition, if Kafka and Spark are deployed on the same machines. If you want to try to override that behavior, the method to override is getPreferredLocations.

But even in that case, location preferences are just a scheduler hint; the RDD can still be scheduled elsewhere. You can turn up spark.locality.wait to a very high value to decrease the likelihood.
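As a concrete illustration of that last point, the locality wait can be raised when submitting the job. The 30s value below is illustrative, not a recommendation (the default is 3s), and the application jar and arguments are whatever your job normally uses:

```
spark-submit \
  --conf spark.locality.wait=30s \
  ... (application jar and arguments)
```

The same setting can also be applied in code via SparkConf before the context is created.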


