You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Abhishek Ravi (JIRA)" <ji...@apache.org> on 2018/09/05 17:12:00 UTC

[jira] [Comment Edited] (DRILL-6625) Intermittent failures in Kafka unit tests

    [ https://issues.apache.org/jira/browse/DRILL-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604682#comment-16604682 ] 

Abhishek Ravi edited comment on DRILL-6625 at 9/5/18 5:11 PM:
--------------------------------------------------------------

Had some time to dig further into this issue. [~ben-zvi] - mentioned that the issues are seen when running on Mac and they seem to be intermittent failures.
h3. h3. Issue 1 - "{{The server disconnected before a response was received"}}

In {{KafkaMessageGenerator}}, the following properties are set when creating a producer.
{code:java}
producerProperties.put(ProducerConfig.RETRIES_CONFIG, 0);
producerProperties.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
producerProperties.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 1000);

{code}
Consider slower systems or heavily loaded system, where the acknowledgement for the message produced did not arrive in 1000ms - since {{RETRIES_CONFIG}} is set to 0, produce will fail with {{org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.}}

Although, the unit tests never failed in my environment, I was able to reproduce this error by reducing the value for {{REQUEST_TIMEOUT_MS_CONFIG}} to as low as *50 ms*.
{noformat}
23:19:55.136 [main] ERROR o.a.d.e.s.k.KafkaMessageGenerator - org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94) ~[kafka-clients-0.11.0.1.jar:na]
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64) ~[kafka-clients-0.11.0.1.jar:na]
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29) ~[kafka-clients-0.11.0.1.jar:na]
 at org.apache.drill.exec.store.kafka.KafkaMessageGenerator.populateJsonMsgIntoKafka(KafkaMessageGenerator.java:126) ~[test-classes/:na]
 at org.apache.drill.exec.store.kafka.TestKafkaSuit.initKafka(TestKafkaSuit.java:88) [test-classes/:na]

{noformat}
 
h3. h3. ISSUE 2 - {{Failed to fetch messages within 200 milliseconds}}

Again, this issue may occur in slower systems or systems under heavy load where a consumer poll did not return messages within {{Poll Timeout}} set.

For unit tests, {{KAFKA_POLL_TIMEOUT}} is set to *200 ms*.
{code:java}
testNoResult(String.format("alter session set `%s` = %d", ExecConstants.KAFKA_POLL_TIMEOUT, 200));

{code}
If a poll does not return a message within this time then following exception is thrown.

{{org.apache.drill.exec.rpc.RpcException: org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Failed to fetch messages within 10 milliseconds. Consider increasing the value of the property : store.kafka.poll.timeout}}

 

I was able to reproduce this issue by reducing {{KAFKA_POLL_TIMEOUT}} value to as low as *50 ms.*

 
h3. h3. Solution
 * The value for producer {{REQUEST_TIMEOUT_MS_CONFIG}} should be increased (to say 10s) and similarly the value for consumer {{KAFKA_POLL_TIMEOUT}} should also be increased.
 * We should increase producer {{RETRIES_CONFIG}} to allow retries. We can eliminate the possibility of having duplicate messages by using *{{Idempotent Producer}}*{{.}}

 

[~akumarb2010], [~kam_iitkgp], [~kkhatua] - any other suggestions?


was (Author: aravi5):
Had some time to dig further into this issue. [~ben-zvi] - mentioned that the issues are seen when running on Mac and they seem to be intermittent failures.
h3. h3. Issue 1 - "{{The server disconnected before a response was received"}}

In {{KafkaMessageGenerator}}, the following properties are set when creating a producer.

{code}

producerProperties.put(ProducerConfig.RETRIES_CONFIG, 0);
producerProperties.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
producerProperties.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 1000);

{code}

Consider slower systems or heavily loaded system, where the acknowledgement for the message produced did not arrive in 1000ms - since {{RETRIES_CONFIG}} is set to 0, produce will fail with {{org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.}}

Although, the unit tests never failed in my environment, I was able to reproduce this error by reducing the value for {{REQUEST_TIMEOUT_MS_CONFIG}} to as low as *50 ms*.

{noformat}

23:19:55.136 [main] ERROR o.a.d.e.s.k.KafkaMessageGenerator - org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94) ~[kafka-clients-0.11.0.1.jar:na]
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64) ~[kafka-clients-0.11.0.1.jar:na]
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29) ~[kafka-clients-0.11.0.1.jar:na]
 at org.apache.drill.exec.store.kafka.KafkaMessageGenerator.populateJsonMsgIntoKafka(KafkaMessageGenerator.java:126) ~[test-classes/:na]
 at org.apache.drill.exec.store.kafka.TestKafkaSuit.initKafka(TestKafkaSuit.java:88) [test-classes/:na]

{noformat}

 
h3. h3. ISSUE 2 - {{Failed to fetch messages within 200 milliseconds}}

Again, this issue may occur in slower systems or systems under heavy load where a consumer poll did not return messages within {{Poll Timeout}} set.

For unit tests, {{KAFKA_POLL_TIMEOUT}} is set to *200 ms*.

{code}

testNoResult(String.format("alter session set `%s` = %d", ExecConstants.KAFKA_POLL_TIMEOUT, 200));

{code}

If a poll does not return a message within this time then following exception is thrown.

{{org.apache.drill.exec.rpc.RpcException: org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Failed to fetch messages within 10 milliseconds. Consider increasing the value of the property : store.kafka.poll.timeout}}

 

I was able to reproduce this issue by reducing {{KAFKA_POLL_TIMEOUT}} value to as low as *50 ms.*

 
h3. h3. Solution
 * The value for producer {{REQUEST_TIMEOUT_MS_CONFIG}} should be increased (to say 10s) and similarly the value for consumer {{KAFKA_POLL_TIMEOUT}} should also be increased.
 * We should increase producer {{RETRIES_CONFIG}} to allow retries. We can eliminate the possibility of having duplicate messages by using *{{Idempotent Producer}}*{{.}}

 

[~akumarb2010], [~kkhatua] - any thoughts?

> Intermittent failures in Kafka unit tests
> -----------------------------------------
>
>                 Key: DRILL-6625
>                 URL: https://issues.apache.org/jira/browse/DRILL-6625
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Other
>    Affects Versions: 1.13.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Abhishek Ravi
>            Priority: Major
>             Fix For: 1.15.0
>
>
> The following failures have been seen (consistently on my Mac, or occasionally on Jenkins) when running the unit tests, in the Kafka test suit. After the failure, maven hangs for a long time.
>  Cost was 0.0 (instead of 26.0) :
> {code:java}
> Running org.apache.drill.exec.store.kafka.KafkaFilterPushdownTest
> 16:46:57.748 [main] ERROR org.apache.drill.TestReporter - Test Failed (d: -65.3 KiB(73.6 KiB), h: -573.5 MiB(379.5 MiB), nh: 1.2 MiB(117.1 MiB)): testPushdownWithOr(org.apache.drill.exec.store.kafka.KafkaFilterPushdownTest)
> java.lang.AssertionError: Unable to find expected string     "kafkaScanSpec" : {
>       "topicName" : "drill-pushdown-topic"
>     },
>     "cost" : 26.0 in plan: {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ {
>       "kind" : "STRING",
>       "accessibleScopes" : "ALL",
>       "name" : "store.kafka.record.reader",
>       "string_val" : "org.apache.drill.exec.store.kafka.decoders.JsonMessageReader",
>       "scope" : "SESSION"
>     }, {
>       "kind" : "LONG",
>       "accessibleScopes" : "ALL",
>       "name" : "planner.width.max_per_node",
>       "num_val" : 2,
>       "scope" : "SESSION"
>     }, {
>       "kind" : "BOOLEAN",
>       "accessibleScopes" : "ALL",
>       "name" : "exec.errors.verbose",
>       "bool_val" : true,
>       "scope" : "SESSION"
>     }, {
>       "kind" : "LONG",
>       "accessibleScopes" : "ALL",
>       "name" : "store.kafka.poll.timeout",
>       "num_val" : 200,
>       "scope" : "SESSION"
>     } ],
>     "queue" : 0,
>     "hasResourcePlan" : false,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "kafka-scan",
>     "@id" : 6,
>     "userName" : "",
>     "kafkaStoragePluginConfig" : {
>       "type" : "kafka",
>       "kafkaConsumerProps" : {
>         "bootstrap.servers" : "127.0.0.1:63751",
>         "group.id" : "drill-test-consumer"
>       },
>       "enabled" : true
>     },
>     "columns" : [ "`**`" ],
>     "kafkaScanSpec" : {
>       "topicName" : "drill-pushdown-topic"
>     },
>     "cost" : 0.0
>   }, {
> {code}
> Or occasionally:
> {code}
> -------------------------------------------------------
>  T E S T S
> -------------------------------------------------------
> 11:52:57.571 [main] ERROR o.a.d.e.s.k.KafkaMessageGenerator - org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)