Posted to user@spark.apache.org by Jeff Nadler <jn...@srcginc.com> on 2016/09/09 23:41:41 UTC
Streaming Backpressure with Multiple Streams
This may be a fairly esoteric setup, but I'm seeing some bad behavior with
backpressure when a receiver-based Kafka stream and a direct Kafka stream
run in the same application.
Here's the scenario:
We consume one Kafka topic using the reliable receiver (4 receivers, with
the results unioned). In the same app, we consume another Kafka topic using
a direct stream.
This may seem strange, but it's necessary in my application to work around
another problem: the receiver max rate is set globally in SparkConf. IMO it
would be more flexible if we could set a max rate for each stream
independently. Since the direct stream uses a different config parameter
for its max rate, we get the desired result.
A bit hacky, I know.
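For readers following along, here is a minimal sketch of the settings this workaround appears to rely on. The three property names are standard Spark Streaming settings; the receiver rate of 2000 is a hypothetical value chosen only for illustration, and 10000 matches the direct-stream figure reported later in this thread. The helper `to_submit_args` is also hypothetical and simply renders the properties as `spark-submit --conf` arguments.

```python
# Spark Streaming properties relevant to this thread.
# Values marked "hypothetical" are illustrative and do not come from the thread.
settings = {
    # Turns on backpressure-based rate control for the application.
    "spark.streaming.backpressure.enabled": "true",
    # Cap for receiver-based streams, in records/sec per receiver (hypothetical value).
    "spark.streaming.receiver.maxRate": "2000",
    # Cap for direct Kafka streams, in records/sec per Kafka partition.
    "spark.streaming.kafka.maxRatePerPartition": "10000",
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(settings)
print(" ".join(submit_args))
```

Because the two caps are separate properties, the receiver-based stream and the direct stream can be limited to different rates within one application, which is the effect described above.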
Anyway, we recently turned on backpressure. It works as expected for the
receiver-based stream. The direct stream starts out at the max rate (as
expected) on the first batch, then ratchets consumption down until it is
eventually consuming 1 record / partition / sec.
This happens even though there's no scheduling delay, and the
receiver-based stream does not appear to be throttled.
Has anyone seen anything like this?
Thanks!
Jeff Nadler
Aerohive Networks
Re: Streaming Backpressure with Multiple Streams
Posted by Jeff Nadler <jn...@srcginc.com>.
As you may have suspected, it only happens with the combination:
- Direct Stream only + backpressure = works as expected
- 4x Receiver on Topic A + Direct Stream on Topic B + backpressure = the
  direct stream is throttled even in the absence of scheduling delay
This is using Spark 1.5.0 on CDH.
After it has been running for several minutes, "Input Metadata" shows the
direct stream consuming 1 record / partition / sec, even though I have the
max rate set at 10,000 records / partition / sec.
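To put the gap in per-batch terms, here is a small worked calculation. It assumes the direct stream's per-batch ceiling is the per-partition rate multiplied by the partition count and the batch length in seconds. The partition count (8) and batch interval (5 seconds) are hypothetical, since neither is given in this thread.

```python
def max_records_per_batch(rate_per_partition, partitions, batch_seconds):
    """Upper bound on records a direct stream reads in one batch."""
    return rate_per_partition * partitions * batch_seconds


PARTITIONS = 8      # hypothetical: the topic's partition count is not stated
BATCH_SECONDS = 5   # hypothetical: the batch interval is not stated

# Ceiling implied by the configured 10,000 records / partition / sec.
configured = max_records_per_batch(10_000, PARTITIONS, BATCH_SECONDS)
# What the throttled rate of 1 record / partition / sec amounts to per batch.
observed = max_records_per_batch(1, PARTITIONS, BATCH_SECONDS)

print(configured, observed)  # 400000 40
```

Whatever the real partition count and batch interval are, the ratio between the configured ceiling and the observed rate is the same factor of 10,000.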
I'll file a bug today unless someone has any ideas.
Thanks!
Jeff
On Fri, Sep 9, 2016 at 5:54 PM, Jeff Nadler <jn...@srcginc.com> wrote:
> Yes I'll test that next.
>
> On Sep 9, 2016 5:36 PM, "Cody Koeninger" <co...@koeninger.org> wrote:
>
>> Does the same thing happen if you're only using direct stream plus back
>> pressure, not the receiver stream?
Re: Streaming Backpressure with Multiple Streams
Posted by Jeff Nadler <jn...@srcginc.com>.
Yes, I'll test that next.
On Sep 9, 2016 5:36 PM, "Cody Koeninger" <co...@koeninger.org> wrote:
> Does the same thing happen if you're only using direct stream plus back
> pressure, not the receiver stream?
Re: Streaming Backpressure with Multiple Streams
Posted by Cody Koeninger <co...@koeninger.org>.
Does the same thing happen if you're using only the direct stream plus
backpressure, without the receiver stream?