Posted to issues@kudu.apache.org by "Alexey Serbin (Jira)" <ji...@apache.org> on 2022/10/16 03:00:00 UTC

[jira] [Commented] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster

    [ https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618191#comment-17618191 ] 

Alexey Serbin commented on KUDU-3169:
-------------------------------------

The following comment explains how the issue might be worked around.

It seems the connection from the Kudu Java client might be closed by the server side due to inactivity: the timeout for an idle connection is controlled by the {{\-\-rpc_default_keepalive_time_ms}} flag. So, if there's been no activity on the connection established to a tablet server (for a scan or write operation) for longer than that timeout, the server may close the connection even if the scanner object is kept alive in accordance with the setting of the {{\-\-scanner_ttl_ms}} flag.

As a workaround for the described feature/bug in the Kudu Java client, either call KeepAlive for the corresponding scanner at least once every scanner_ttl_ms, or set both the {{\-\-scanner_ttl_ms}} and the {{\-\-rpc_default_keepalive_time_ms}} flags to a high enough value.
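
To illustrate the first option, here is a minimal sketch of a helper that issues a keep-alive from the row-processing loop once enough time has passed since the previous one. The class {{ScannerKeepAlive}}, the {{KeepAliveCall}} interface and the {{intervalFor}} helper are hypothetical names introduced only for this example and are not part of the Kudu client. Using half of the smaller of the two timeouts as the interval is an assumption chosen to leave some safety margin, not a documented recommendation.

```java
import java.util.function.LongSupplier;

/**
 * Sketch: sends a scanner keep-alive from the row-processing loop whenever
 * at least intervalMs has passed since the previous one.
 * All names here are hypothetical; none of them come from the Kudu client.
 */
final class ScannerKeepAlive {

  /** Hypothetical stand-in for the scanner's keep-alive call. */
  @FunctionalInterface
  interface KeepAliveCall {
    void keepAlive() throws Exception;
  }

  private final KeepAliveCall call;
  private final long intervalMs;
  private final LongSupplier clockMs;
  private long lastCallMs;

  ScannerKeepAlive(KeepAliveCall call, long intervalMs, LongSupplier clockMs) {
    if (intervalMs <= 0) {
      throw new IllegalArgumentException("intervalMs must be positive");
    }
    this.call = call;
    this.intervalMs = intervalMs;
    this.clockMs = clockMs;
    // Treat construction time as the moment of the last activity.
    this.lastCallMs = clockMs.getAsLong();
  }

  /**
   * Assumption: half of the smaller server-side timeout is used as the
   * interval, to leave a margin on a loaded cluster.
   */
  static long intervalFor(long scannerTtlMs, long rpcKeepaliveMs) {
    if (scannerTtlMs <= 0 || rpcKeepaliveMs <= 0) {
      throw new IllegalArgumentException("timeouts must be positive");
    }
    return Math.max(1L, Math.min(scannerTtlMs, rpcKeepaliveMs) / 2);
  }

  /**
   * Call between rows or batches. Returns true if a keep-alive was sent.
   * A failed keep-alive is rethrown unchecked so the caller notices it.
   */
  boolean maybeKeepAlive() {
    long now = clockMs.getAsLong();
    if (now - lastCallMs < intervalMs) {
      return false;
    }
    try {
      call.keepAlive();
    } catch (Exception e) {
      throw new IllegalStateException("scanner keep-alive failed", e);
    }
    lastCallMs = now;
    return true;
  }
}
```

Assuming the client version in use provides a {{keepAlive()}} method on the scanner, the helper could be created with {{scanner::keepAlive}} as the call and {{System::currentTimeMillis}} as the clock, with {{maybeKeepAlive()}} invoked after every processed row or batch. Calling it from the same thread that iterates over the rows avoids sharing the scanner between threads.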

> kudu java client throws scanner expired error while processing large scan on  High-load cluster
> -----------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3169
>                 URL: https://issues.apache.org/jira/browse/KUDU-3169
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, java
>    Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1
>            Reporter: mintao
>            Priority: Major
>              Labels: scalability, stability
>
> A user submitted a Spark task to scan a Kudu table with a large number of records. After just a few minutes the job failed after 4 attempts; each attempt failed with the error:
> {code:java}
> org.apache.kudu.client.NonRecoverableException: Scanner 4e34e6f821be42b889022ec681e235cc not found (it may have expired)
> org.apache.kudu.client.NonRecoverableException: Scanner 4e34e6f821be42b889022ec681e235cc not found (it may have expired)
>     at org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
>     at org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402)
>     at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57)
>     at org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>     Suppressed: org.apache.kudu.client.KuduException$OriginalException: Original asynchronous stack trace
>         at org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341)
>         at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263)
>         at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59)
>         at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152)
>         at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148)
>         at org.apache.kudu.client.Connection.messageReceived(Connection.java:391)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at org.apache.kudu.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>         at org.apache.kudu.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>         ... 3 more
> {code}
> Each task ran for only about 19 seconds and then threw the "scanner not found" error, while the tserver uses the default scanner_ttl_ms (60s). In the tserver log, we found that the scanner mentioned in the client log expired only after the Spark job had failed, and that another tserver received the scan request specifying that scannerId.
> It seems AsyncKuduScanner in the Kudu Java client will choose a random server when retrying scanNextRows, even though the AsyncKuduScanner already has a scannerId.


