Posted to users@kafka.apache.org by Dmitry Minkovsky <dm...@gmail.com> on 2019/01/03 12:34:47 UTC

Re: High end-to-end latency with processing.guarantee=exactly_once

Hi Matthias,

I get these errors even on reprocessing, when data is flowing full throttle
through the system. Can you help me understand how to tune this behavior,
if possible? I appreciate that it's aggressive, but it seems to be so
extremely aggressive that I get these errors constantly. Just how much data
do you need to have flowing before they go away?

Thank you,
Dmitry

On Thu, Dec 20, 2018 at 7:22 AM Matthias J. Sax <ma...@confluent.io>
wrote:

> The problem is repartition topics:
>
> Kafka Streams considers those topics transient and purges consumed
> data aggressively (cf. https://issues.apache.org/jira/browse/KAFKA-6150),
> resulting in lost producer state for those topics :(
>
>
> -Matthias
>
> On 12/20/18 3:18 AM, Dmitry Minkovsky wrote:
> > Also, I have read through that issue and KIP-360 to the extent my
> > knowledge allows, and I don't understand why I get this error constantly
> > when exactly-once is enabled. The KIP says:
> >
> >> Idempotent/transactional semantics depend on the broker retaining state
> > for each active producer id (e.g. epoch and sequence number). When the
> > broker loses that state–due to segment deletion or a call to
> > DeleteRecords–then additional produce requests will result in the
> > UNKNOWN_PRODUCER_ID error.
> >
> > How much throughput do you need before this goes away? It seems like this
> > happens for me on every call... with the calls seconds apart.
> >
> > On Wed, Dec 19, 2018 at 9:12 PM Dmitry Minkovsky <dm...@gmail.com>
> > wrote:
> >
> >> Hello 王美功,
> >>
> >> I am using 2.1.0. And I think you hit the nail on the head: my
> >> application is low throughput, and I am seeing UNKNOWN_PRODUCER_ID all
> >> the time with exactly-once enabled. I've googled this before but couldn't
> >> identify the cause. Thank you!
> >>
> >> Setting retry.backoff.ms to 5 brought the latency down from 1.3s to
> >> 750ms. That's tolerable for me, but I am wondering: when KAFKA-7190 is
> >> fixed, will the latency drop further?
> >>
> >> Dmitry
> >>
> >> On Wed, Dec 19, 2018 at 8:38 PM meigong.wang <me...@okcoin.com>
> >> wrote:
> >>
> >>> Which version are you using? This bug
> >>> (https://issues.apache.org/jira/browse/KAFKA-7190) may increase the
> >>> latency of your application; try reducing retry.backoff.ms, whose
> >>> default value is 100 ms.
> >>>
> >>>
> >>> 王美功
> >>>
> >>>
> >>> Original Message
> >>> From: Dmitry Minkovsky dminkovsky@gmail.com
> >>> To: users users@kafka.apache.org
> >>> Sent: Thursday, December 20, 2018, 09:25
> >>> Subject: High end-to-end latency with processing.guarantee=exactly_once
> >>>
> >>>
> >>> I have a process that spans several Kafka Streams applications. With the
> >>> streams commit interval and producer linger both set to 5ms, when
> >>> exactly-once delivery is disabled, this process takes ~250ms. With
> >>> exactly-once enabled, the same process takes anywhere from 800-1200ms.
> >>> In "Enabling Exactly-Once in Kafka Streams"
> >>> (https://www.confluent.io/blog/enabling-exactly-kafka-streams/), Guozhang
> >>> writes: "In Kafka Streams, because a new transaction is created whenever
> >>> commit is called, the average transaction size is determined by the commit
> >>> interval: with the same incoming traffic, a shorter commit interval will
> >>> result in smaller transactions. In practice users should therefore tune the
> >>> commit.interval.ms setting when exactly-once is enabled to make a good
> >>> trade-off between throughput versus end-to-end processing latency." But I
> >>> am not seeing much of a difference when I tune commit.interval.ms with
> >>> exactly-once enabled. `through()` and `.to()/.stream()` take 100-250ms even
> >>> with commit.interval.ms set to 5ms. Do these latency differences sound
> >>> right? Is something off? Thank you, Dmitry
> >>
> >>
> >
>
>
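
For concreteness, here is a minimal, hypothetical sketch of the configuration
and topology shape discussed in the quoted thread above: exactly-once
processing with a short commit interval, the producer linger override, the
retry.backoff.ms workaround for KAFKA-7190, and the `through()` /
`.to()`+`.stream()` hops whose latency is being measured. The application id,
broker address, and topic names are placeholders, and the 5 ms values are
simply the ones mentioned in the discussion, not recommendations.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class EosLatencyExample {

        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder application id and broker address.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-latency-example");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Exactly-once processing: each commit is a transaction, and downstream
            // consumers only see committed data, so every topic hop adds roughly one
            // commit interval of end-to-end latency.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5L);

            // Producer overrides from the thread: a short linger, plus a short retry
            // backoff to soften the UNKNOWN_PRODUCER_ID retry delay (KAFKA-7190).
            props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 5);
            props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 5);

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("input-topic");

            // The two equivalent "hops" from the question: through() writes to and
            // re-reads an intermediate topic (Kafka 2.1 API), while to() followed by
            // stream() does the same thing explicitly. Each hop's output becomes
            // visible to the next stage only after the producing transaction commits.
            KStream<String, String> hop1 = source.through("intermediate-topic-a");
            hop1.to("intermediate-topic-b");
            KStream<String, String> hop2 = builder.stream("intermediate-topic-b");
            hop2.to("output-topic");

            new KafkaStreams(builder.build(), props).start();
        }
    }

Stateful operations downstream of a key change (aggregations, joins) also
create internal "*-repartition" topics, and those are the topics Kafka Streams
purges aggressively per KAFKA-6150, which is why the UNKNOWN_PRODUCER_ID
warnings in the logs later in this thread point at a "...-repartition" partition.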

Re: High end-to-end latency with processing.guarantee=exactly_once

Posted by Dmitry Minkovsky <dm...@gmail.com>.
Also, occasionally I see errors like the following. Are they related to this issue?

[2018-12-30 18:21:00,552] ERROR
(org.apache.kafka.streams.processor.internals.RecordCollectorImpl:131) task
[0_4] Error sending record (key <ByteString@5c858701 size=16> value id:
"\227\236\022L\205\356\375\373\373\304\241\301n\250H\367" timestamp
1546194060180) to topic user-graphql-read-service-clients due to
org.apache.kafka.common.errors.UnknownProducerIdException: This exception
is raised by the broker if it could not locate the producer metadata
associated with the producerId in question. This could happen if, for
instance, the producer's records were deleted because their retention time
had elapsed. Once the last records of the producerId are removed, the
producer's metadata is removed from the broker, and future appends by the
producer will return this exception.; No more records will be sent and no
more offsets will be recorded for this task.

[2018-12-30 18:21:00,561] WARN
(org.apache.kafka.clients.producer.internals.Sender:596) [Producer
clientId=user-graphql-read-service-6e0e7ce4-fd83-4e21-a46a-df084ca64d06-StreamThread-1-0_4-producer,
transactionalId=user-graphql-read-service-0_4] Got error produce response
with correlation id 223 on topic-partition
user-graphql-read-service-KTABLE-AGGREGATE-STATE-STORE-0000000029-repartition-0,
retrying (2147483646 attempts left). Error: UNKNOWN_PRODUCER_ID

[2018-12-30 18:21:00,572] ERROR
(org.apache.kafka.streams.processor.internals.AssignedStreamsTasks:359)
stream-thread
[user-graphql-read-service-6e0e7ce4-fd83-4e21-a46a-df084ca64d06-StreamThread-1]
Failed to commit stream task 0_4 due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_4] Abort sending
since an error caught with a previous record (key <ByteString@5c858701
size=16> value id:
"\227\236\022L\205\356\375\373\373\304\241\301n\250H\367" timestamp
1546194060180) to topic user-graphql-read-service-clients due to
org.apache.kafka.common.errors.UnknownProducerIdException: This exception
is raised by the broker if it could not locate the producer metadata
associated with the producerId in question. This could happen if, for
instance, the producer's records were deleted because their retention time
had elapsed. Once the last records of the producerId are removed, the
producer's metadata is removed from the broker, and future appends by the
producer will return this exception.
        at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:133)
        at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:50)
        at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:192)
        at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1288)
        at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:230)
        at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:196)
        at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:715)
        at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:683)
        at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:633)
        at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:555)
        at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
        at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:784)
        at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
        at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:557)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:309)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:233)
        at java.lang.Thread.run(Thread.java:748)



