Posted to dev@samza.apache.org by Jeremiah Adams <JA...@helixeducation.com.INVALID> on 2020/03/11 21:38:56 UTC

Re: Got Error Produce Response with Correlation Id.

Can anyone take a look at the message below? We are trying to gauge our risk before moving forward.


Jeremiah Adams
Software Engineer
www.helixeducation.com

________________________________________
From: Jeremiah Adams <JA...@helixeducation.com.INVALID>
Sent: Wednesday, March 4, 2020 2:28 PM
To: dev@samza.apache.org
Subject: Got Error Produce Response with Correlation Id.

Hello devs,


I've got a warning showing up in the logs while testing our new Confluent Cloud config.  Can anyone tell me how concerned I should be about this warning? Is there a setting to control timeouts?


Also, logs stop at that point, so I can't tell if the "metadata update" completed.



2020-03-04 21:17:51 Sender [WARN] [Producer clientId=kafka_producer-application_submission-1] Got error produce response with correlation id 144 on topic-partition __samza_checkpoint_ver_1_for_application-submission_1-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
2020-03-04 21:17:51 Sender [WARN] [Producer clientId=kafka_producer-application_submission-1] Received invalid metadata error in produce request on partition __samza_checkpoint_ver_1_for_application-submission_1-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now


Jeremiah Adams
Software Engineer

Re: Got Error Produce Response with Correlation Id.

Posted by Yi Pan <ni...@gmail.com>.
Cool! Sounds good to me! Happy to help!

-Yi

Re: Got Error Produce Response with Correlation Id.

Posted by Jeremiah Adams <JA...@helixeducation.com.INVALID>.
Yes, I explicitly commit via code for this job in an effort to ensure only-once processing.

Thanks for taking the time to look into our concerns. 

Jeremiah Adams
Software Engineer
www.helixeducation.com

Re: Got Error Produce Response with Correlation Id.

Posted by Yi Pan <ni...@gmail.com>.
Hi, Jeremiah,

From what you have answered, this looks to me like a transient error
(probably a timeout caused by transient network errors, as you mentioned),
and your job was able to retry, recover, and make progress.

Just one thing to confirm: I saw that you configured task.commit.ms=-1, and
you mentioned that your checkpointed offset metric DOES increment over
time. Are you calling commit in your user code?

Thanks!

-Yi
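
For context on the question above: task.commit.ms sets the interval for Samza's timer-driven checkpoints (60000 ms by default), and the question implies that a value of -1 turns that timer off. In that case the checkpointed offsets would only advance if the task requests commits itself, for example by calling TaskCoordinator.commit(...) from its process method. A sketch of the relevant settings, using the values from the configuration posted in this thread; the comments are an interpretation of the exchange, not text from the Samza documentation:

```
# Checkpoints are written to a Kafka topic; in the WARN lines this is
# __samza_checkpoint_ver_1_for_application-submission_1 (partition 0).
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Assumed meaning of -1: no periodic commit, so the task code decides
# when a checkpoint is written.
task.commit.ms=-1
```

Read this way, the WARN lines concern the producer that writes those explicit checkpoints, which is consistent with the offsets continuing to advance once the retry succeeds.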

Re: Got Error Produce Response with Correlation Id.

Posted by Jeremiah Adams <JA...@helixeducation.com.INVALID>.
> Do you see the Samza job hanging after that?
The job does not hang.

> Are the checkpointed offset metrics incrementing in this case?
We do get incremented offsets.

> I'm not clear on your claim that "logs stop at that point". Are no logs
> written after the WARN lines?
My apologies for the confusion - I see no lag messages related to the warning. I see all of our normal processing logs. I'm assuming this means the retry worked.

> What's your Samza configuration?
job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
job.coordinator.replication.factor=1
job.default.system=kafka
systems.kafka.producer.bootstrap.servers=<removed>.confluent.cloud:9092
task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
systems.kafka.producer.ssl.endpoint.identification.algorithm=https
systems.kafka.producer.sasl.mechanism=PLAIN
systems.kafka.producer.request.timeout.ms=20000
systems.kafka.producer.retry.backoff.ms=500
systems.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
systems.kafka.producer.security.protocol=SASL_SSL
systems.kafka.consumer.ssl.endpoint.identification.algorithm=https
systems.kafka.consumer.sasl.mechanism=PLAIN
systems.kafka.consumer.request.timeout.ms=20000
systems.kafka.consumer.retry.backoff.ms=500
systems.kafka.consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
systems.kafka.consumer.security.protocol=SASL_SSL
processor.id=0

# checkpointing
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=3
task.commit.ms=-1

> Is the Samza container still running after you see those WARN logs?
Yes.


I am thinking this is a timeout issue. We've never seen the issue before. The warning first appeared after we started testing Confluent Cloud's Kafka offering. We had no issues when running our own Kafka clusters in AWS.


Jeremiah Adams
Software Engineer
www.helixeducation.com

Re: Got Error Produce Response with Correlation Id.

Posted by Yi Pan <ni...@gmail.com>.
Hi, Jeremiah,

Sorry for the late reply. This WARN message indicates that the producer
failed to flush to the checkpoint topic and will retry. Do you see the
Samza job hanging after that? Are the checkpointed offset metrics
incrementing in this case? I'm not clear on your claim that "logs stop at
that point". Are no logs written after the WARN lines? What's your Samza
configuration? Is the Samza container still running after you see those
WARN logs?

Thanks!

-Yi
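
On the original question about a timeout setting: Samza documents that any property under systems.<system-name>.producer.* is handed to the Kafka producer, so the producer's own timeouts can be set there. Note that 2147483646 in the WARN line is Integer.MAX_VALUE minus one, so the retry count is effectively unbounded and the time limits are what bound a stalled send. A minimal sketch, assuming the system is named kafka as in this thread and that the bundled Kafka client is 2.1 or newer (delivery.timeout.ms does not exist before that); the values are illustrative, not recommendations:

```
# How long the producer waits for a broker response to a single request
# before treating it as failed and retrying.
systems.kafka.producer.request.timeout.ms=30000
# Pause between retry attempts.
systems.kafka.producer.retry.backoff.ms=500
# Upper bound on the total time to deliver one record, retries included
# (assumes Kafka client 2.1+; Kafka's default is 120000).
systems.kafka.producer.delivery.timeout.ms=120000
```

Kafka requires delivery.timeout.ms to be at least request.timeout.ms plus linger.ms, so the two are best raised together.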
