You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/11/01 10:18:31 UTC

RE: trunk is unable to replicate between nodes ( Unable to download ... completely)

Hi,

What we're seeing is two replica's that are, for whatever reason, not completely in sync and unable to recover. While one node is in the `recovering state` we can fool it to be in active state by simply reloading the core on the machine on which it is in a bad state. Some arbitrary time later it will return to recovering.

Below is the communication between and the exceptions on the nodes in order:


Recovering machine starts the recovery process:

2012-11-01 08:55:13,533 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting Replication Recovery. core=shard_i
2012-11-01 08:55:13,534 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-01 08:55:13,732 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)
2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/admin/ping params={} hits=1 status=0 QTime=10 
2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/admin/ping params={} status=0 QTime=10 
2012-11-01 08:55:14,700 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr path=/admin/ping params={} hits=1 status=0 QTime=5 
2012-11-01 08:55:14,701 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr path=/admin/ping params={} status=0 QTime=6 
2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_i
2012-11-01 08:55:15,546 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://rot09.nl.idx.openindex.io:8080/solr/shard_i/. core=shard_i
2012-11-01 08:55:15,546 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-01 08:55:15,557 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :  No value set for 'pollInterval'. Timer Task not started.
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation: 166
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation: 3
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication process
2012-11-01 08:55:15,568 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files in latest index in master: 324
2012-11-01 08:55:15,569 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568 lockFactory=org.apache.lucene.store.NativeFSLockFactory@6f8288d2; maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple'
2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return new directory for /opt/solr/cores/shard_i/data/index.20121101085515568 forceNew:false
2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data
2012-11-01 08:55:15,572 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true


Right now the leader is transmitting packets as we expect:


2012-11-01 08:55:15,567 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/replication params={command=filelist&generation=166&wt=javabin&q
t=/replication&version=2} status=0 QTime=4 
2012-11-01 08:55:15,576 INFO [solr.core.SolrCore] - [http-8080-exec-3] - : [shard_i] webapp=/solr path=/replication params={file=_2b9_nrm.cfs&command=filecontent&checks
um=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,583 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr path=/replication params={file=_2t8_Lucene41_0.doc&command=filecontent
&checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,589 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_i] webapp=/solr path=/replication params={file=_2sc.tvd&command=filecontent&checksum=t
rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,598 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_i] webapp=/solr path=/replication params={file=_2qk.tvd&command=filecontent&checksum=t
rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,603 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_i] webapp=/solr path=/replication params={file=_2ii_Lucene41_0.pos&command=filecontent
&checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 


Until the leader cannot read the index and therefore not complete the upload:


2012-11-01 08:55:15,685 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Exception while writing response for params: file=_2ii_Lucene41_0.pos&command=filecontent&checksum=true&generation=166&qt=/replication&wt=filestream
java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/shard_i/data/index.20121030152231942/_2ii_Lucene41_0.pos")
        at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-11-01 08:55:15,690 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr path=/replication params={file=_2ii_Lucene41_0.pos&command=filecontent
&checksum=true&offset=1048576&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,691 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
        at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-11-01 08:55:15,692 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
        at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


The recovering node isn't very happy and drops the replication and waits...


2012-11-01 08:55:15,686 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching packets 
java.io.EOFException
        at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
        at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
        at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-01 08:55:15,692 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index
2012-11-01 08:55:15,692 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2012-11-01 08:55:15,693 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index.20121101085515568
2012-11-01 08:55:15,695 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_i/data/index
2012-11-01 08:55:15,695 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _2ii_Lucene41_0.pos completely. Downloaded 1048576!=1501300
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
        at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)

2012-11-01 08:55:15,695 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)

2012-11-01 08:55:15,696 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-01 08:55:15,696 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (8) core=shard_i
2012-11-01 08:55:15,696 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 512.0 seconds before trying to recover again (9)


It seems to me the SolrCloud part is actually running and behaving as it should but the index file on the leader is corrupt, and that should not happen even on power loss. Any hints?


Thanks
Markus
 
 
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Wed 31-Oct-2012 14:14
> To: solr-user@lucene.apache.org; Markus Jelsma <ma...@openindex.io>
> Subject: RE: trunk is unable to replicate between nodes ( Unable to download ... completely)
> 
> Ah, we're also seeing Solr lookup an unexisting directory:
> 
> 2012-10-30 16:32:26,578 ERROR [handler.admin.CoreAdminHandler] - [http-8080-exec-2] - : IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_a/data/index' does not exist
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:220)
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
>         at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
>         at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
> 
> Instead of data/index it should be looking for data/index.20121030152324761/, which actually does exist.
> 
>  
>  
> -----Original message-----
> > From:Markus Jelsma <ma...@openindex.io>
> > Sent: Tue 30-Oct-2012 17:30
> > To: solr-user@lucene.apache.org
> > Subject: trunk is unable to replicate between nodes ( Unable to download ... completely)
> > 
> > Hi,
> > 
> > We're testing again with today's trunk and using the new Lucene 4.1 format by default. When nodes are not restarted things are kind of stable but restarting nodes leads to a lot of mayhem. It seems we can get the cluster back up and running by clearing ZK and restarting everything (another issue) but replication becomes impossible for some nodes leading to a continuous state of failing recovery etc.
> > 
> > Here are some excepts from the logs:
> > 
> > 2012-10-30 16:12:39,674 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exe
> > c-5] - : null:java.lang.IndexOutOfBoundsException
> >         at java.nio.Buffer.checkBounds(Buffer.java:530)
> >         at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
> >         at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferInde
> > xInput.java:91)
> >         at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(
> > ReplicationHandler.java:1065)
> >         at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> > 
> > 
> > 2012-10-30 16:10:32,220 ERROR [solr.handler.ReplicationHandler] - [RecoveryThrea
> > d] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download
> >  _x.fdt completely. Downloaded 13631488!=13843504
> >         at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapP
> > uller.java:1237)
> >         at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(Sna
> > pPuller.java:1118)
> >         at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java
> > :716)
> >         at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
> >         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
> >         at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
> > 
> > 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - [http-8080-exec
> > -3] - : Exception while writing response for params: file=_p_Lucene41_0.doc&comm
> > and=filecontent&checksum=true&generation=6&qt=/replication&wt=filestream
> > java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/openindex_h/data/index.20121030152234973/_p_Lucene41_0.doc")
> >         at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
> >         at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
> >         at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> > 
> > 
> > Needless to say i'm puzzled so i'm wondering if anyone has seen this before or have some hints that might help digg further.
> > 
> > Thanks,
> > Markus
> > 
>