Posted to dev@lucene.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2016/04/29 09:55:12 UTC
[jira] [Commented] (SOLR-9050) IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again
[ https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263708#comment-15263708 ]
Timothy Potter commented on SOLR-9050:
--------------------------------------
What's not clear to me is how we're getting into the cleanup method. I'm not seeing multiple retries to download the file after the first WARN is logged, and there are about 4 minutes between the SocketTimeoutException and the "Unable to download" ERROR.
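For context, the behavior one would expect from a retry path is to resume the fetch from the last good offset after a read timeout, rather than abandoning the file. A minimal sketch of that idea (all names here are hypothetical and illustrative, not the actual IndexFetcher API):

```java
import java.net.SocketTimeoutException;

public class ResumableFetch {
    // Illustrative retry budget, not a Solr setting.
    static final int MAX_RETRIES = 3;

    interface Fetcher {
        // Fetches a chunk starting at offset; returns the new offset,
        // or throws if the socket read times out.
        long fetchFrom(long offset) throws SocketTimeoutException;
    }

    // Retries from the last good offset; only gives up after
    // MAX_RETRIES consecutive timeouts with no forward progress.
    static long fetchWithRetry(Fetcher f, long totalBytes)
            throws SocketTimeoutException {
        long offset = 0;
        int attempts = 0;
        while (offset < totalBytes) {
            try {
                offset = f.fetchFrom(offset);
                attempts = 0; // progress resets the retry budget
            } catch (SocketTimeoutException e) {
                if (++attempts > MAX_RETRIES) {
                    throw e; // repeated stalls: surface the failure
                }
            }
        }
        return offset;
    }
}
```

The point of the sketch is that a single transient timeout should cost one retry from the current offset, not a full re-download of the index.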
> IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again
> -------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-9050
> URL: https://issues.apache.org/jira/browse/SOLR-9050
> Project: Solr
> Issue Type: Bug
> Components: replication (java)
> Affects Versions: 5.3.1
> Reporter: Timothy Potter
> Assignee: Timothy Potter
>
> I'm seeing a problem where reading a large file from the leader (in SolrCloud mode) during index replication leads to a SocketTimeoutException:
> {code}
> 2016-04-28 16:22:23.568 WARN (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:150)
> at java.net.SocketInputStream.read(SocketInputStream.java:121)
> at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
> at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
> at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
> at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
> at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
> at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
> at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
> at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
> at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
> at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
> at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
> at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> {code}
> and this leads to the following error in cleanup:
> {code}
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Unable to download _405k.cfs completely. Downloaded 7314866176!=9990844536
> at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
> at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
> at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
> at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
> at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
> at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
> at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> {code}
> So a simple read timeout exception leads to re-downloading the whole index again, and again, and again ...
> It also looks like any exception raised in fetchPackets would be squelched if an exception is raised in cleanup (which is called in the finally block).
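The squelching mentioned above is standard Java finally-block behavior: if both the try body and the finally block throw, the finally block's exception propagates and the original is lost. A minimal sketch (method and message names are simulated, not the actual IndexFetcher code), along with one possible fix using Throwable.addSuppressed:

```java
public class FinallyMasking {
    static void fetchPackets() {
        // simulates the SocketTimeoutException from the stack trace
        throw new RuntimeException("Read timed out");
    }

    static void cleanup() {
        // simulates the SolrException thrown when the byte counts mismatch
        throw new RuntimeException("Unable to download file completely");
    }

    // Mirrors the fetchFile() shape: the finally-block exception wins,
    // and the original timeout is silently discarded.
    static String fetchFileMasked() {
        try {
            try {
                fetchPackets();
            } finally {
                cleanup();
            }
        } catch (RuntimeException e) {
            return e.getMessage(); // only the cleanup message survives
        }
        return null;
    }

    // One possible fix: attach the cleanup failure as a suppressed
    // exception so the primary cause still propagates.
    static String fetchFilePreserved() {
        try {
            try {
                fetchPackets();
            } catch (RuntimeException primary) {
                try {
                    cleanup();
                } catch (RuntimeException c) {
                    primary.addSuppressed(c);
                }
                throw primary;
            }
        } catch (RuntimeException e) {
            return e.getMessage(); // original timeout message is kept
        }
        return null;
    }
}
```

With the masked variant, the log shows only the "Unable to download" error seen above; with suppression, the read timeout would remain the reported cause and the cleanup failure would still be visible as a suppressed exception.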
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org