You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Samuel Zhou <zh...@gmail.com> on 2016/05/15 22:24:46 UTC

Kafka stream message sampling

Hi,

I was trying to use filter to sampling a Kafka direct stream, and the
filter function just take 1 messages from 10 by using hashcode % 10 == 0,
but the number of events of input for each batch didn't shrink to 10% of
original traffic. So I want to ask if there are any way to shrink the batch
size by a sampling function to save the traffic from Kafka?

Thanks!
Samuel

Re: Kafka stream message sampling

Posted by Samuel Zhou <zh...@gmail.com>.
Hi, Mich,

I created the Kafka DStream with following Java code:

sparkConf = new SparkConf().setAppName(this.getClass().getSimpleName() + ",
topic: " + topics);

jssc = new JavaStreamingContext(sparkConf, Durations.seconds(batchInterval
));

HashSet<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));

HashMap<String, String> kafkaParams = new HashMap<>();

kafkaParams.put("metadata.broker.list", brokers);

dStream = KafkaUtils.createDirectStream(jssc, byte[].class, byte[].class,
DefaultDecoder.class, DefaultDecoder.class, kafkaParams, topicsSet);

Do you know if there a way to do sampling for a stream when creating it?

Thanks,

Samuel

On Mon, May 16, 2016 at 12:54 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> wrote:

> Hi Samuel,
>
> How do you create your RDD based on Kakfa direct stream?
>
> Do you have your code snippet?
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 May 2016 at 23:24, Samuel Zhou <zh...@gmail.com> wrote:
>
>> Hi,
>>
>> I was trying to use filter to sampling a Kafka direct stream, and the
>> filter function just take 1 messages from 10 by using hashcode % 10 == 0,
>> but the number of events of input for each batch didn't shrink to 10% of
>> original traffic. So I want to ask if there are any way to shrink the batch
>> size by a sampling function to save the traffic from Kafka?
>>
>> Thanks!
>> Samuel
>>
>
>

Re: Kafka stream message sampling

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Samuel,

How do you create your RDD based on Kakfa direct stream?

Do you have your code snippet?

HTH





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 15 May 2016 at 23:24, Samuel Zhou <zh...@gmail.com> wrote:

> Hi,
>
> I was trying to use filter to sampling a Kafka direct stream, and the
> filter function just take 1 messages from 10 by using hashcode % 10 == 0,
> but the number of events of input for each batch didn't shrink to 10% of
> original traffic. So I want to ask if there are any way to shrink the batch
> size by a sampling function to save the traffic from Kafka?
>
> Thanks!
> Samuel
>