Posted to user@spark.apache.org by Jeff Nadler <jn...@srcginc.com> on 2016/09/09 23:41:41 UTC

Streaming Backpressure with Multiple Streams

This may be a fairly esoteric setup, but I'm seeing some bad behavior when
backpressure is combined with a receiver-based Kafka stream and a direct
stream in the same app.

Here's the scenario:
We consume one Kafka topic using the reliable receiver (4 receivers, with
the results unioned). In the same app, we consume another Kafka topic using
a direct stream.

This may seem strange, but it's necessary in my application to work around
another problem: the max rate is set globally in SparkConf. IMO it would be
more flexible if we could set the max rate for each stream independently.
Since the direct stream uses a different config parameter for its max rate,
we get the desired result.
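
For readers following along, here is a minimal sketch of the settings
involved. It assumes the "max rate" parameters referred to above are the
standard Spark properties spark.streaming.receiver.maxRate (receiver-based
streams) and spark.streaming.kafka.maxRatePerPartition (direct stream), and
that backpressure is switched on with spark.streaming.backpressure.enabled.
The post itself does not name the keys. The helper functions are
hypothetical, the receiver cap of 2000 is an illustrative value, and only
the 10,000 per-partition cap comes from this thread.

```python
# Illustrative values: only the 10,000 per-partition cap is from the thread;
# the receiver cap is an assumed placeholder.
RECEIVER_MAX_RATE = 2000                # records/sec per receiver (assumed)
DIRECT_MAX_RATE_PER_PARTITION = 10000   # records/sec per Kafka partition


def streaming_rate_conf(receiver_max_rate, direct_max_rate_per_partition,
                        backpressure=True):
    """Return Spark properties that cap the two stream types separately."""
    return {
        # Applies to every receiver-based stream in the application.
        "spark.streaming.receiver.maxRate": str(receiver_max_rate),
        # Applies only to Kafka direct streams, per partition.
        "spark.streaming.kafka.maxRatePerPartition":
            str(direct_max_rate_per_partition),
        # Single global switch; there is no per-stream setting.
        "spark.streaming.backpressure.enabled": str(backpressure).lower(),
    }


def to_submit_args(conf):
    """Flatten the properties into spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    conf = streaming_rate_conf(RECEIVER_MAX_RATE,
                               DIRECT_MAX_RATE_PER_PARTITION)
    print(" ".join(to_submit_args(conf)))
```

The same key/value pairs could equally be set on the SparkConf object in
the application. Either way, the two caps are independent of each other,
while the backpressure switch is shared by both streams.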

A bit hacky, I know.

Anyway, we recently turned on backpressure. It works as expected for the
receiver-based stream. The direct stream starts out at the max rate (as
expected) on the first batch, then ratchets its consumption down until it
is eventually consuming 1 record / second / partition.

This happens even though there's no scheduling delay, and the
receiver-based stream does not appear to be throttled.

Has anyone seen anything like this?

Thanks!

Jeff Nadler
Aerohive Networks

Re: Streaming Backpressure with Multiple Streams

Posted by Jeff Nadler <jn...@srcginc.com>.
So, as you may have suspected, it only happens with the combination:

Direct Stream only + backpressure = works as expected

4x Receiver on Topic A + Direct Stream on Topic B + backpressure = the
direct stream is throttled even in the absence of scheduling delay

This is using Spark 1.5.0 on CDH.

After it's been running for several minutes, "Input Metadata" shows that
the direct stream is consuming 1 record / partition / sec. I have the max
rate set at 10,000 records / partition / sec.

I'll file a bug today unless someone has other ideas.

Thanks!

Jeff


On Fri, Sep 9, 2016 at 5:54 PM, Jeff Nadler <jn...@srcginc.com> wrote:

> Yes I'll test that next.
>
> On Sep 9, 2016 5:36 PM, "Cody Koeninger" <co...@koeninger.org> wrote:
>
>> Does the same thing happen if you're only using direct stream plus back
>> pressure, not the receiver stream?

Re: Streaming Backpressure with Multiple Streams

Posted by Jeff Nadler <jn...@srcginc.com>.
Yes I'll test that next.

On Sep 9, 2016 5:36 PM, "Cody Koeninger" <co...@koeninger.org> wrote:

> Does the same thing happen if you're only using direct stream plus back
> pressure, not the receiver stream?

Re: Streaming Backpressure with Multiple Streams

Posted by Cody Koeninger <co...@koeninger.org>.
Does the same thing happen if you're only using direct stream plus back
pressure, not the receiver stream?
