Posted to solr-user@lucene.apache.org by Vikas Mehra <vi...@gmail.com> on 2017/09/25 13:25:44 UTC

Problem with live Solr cloud (6.6) backup using collection API

The cluster has 1 ZooKeeper node and 3 Solr nodes. There is only one
collection, with 3 shards. Data is continuously indexed using the SolrJ
API. The system runs on AWS and I am taking backups on EFS (Elastic File
System).
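
For context, the indexing side is essentially a loop of adds with
periodic commits, along these lines (a minimal SolrJ sketch; the
ZooKeeper address, collection, and field names are illustrative, not my
exact code):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ContinuousIndexer {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1.example.com:2181").build()) {
                client.setDefaultCollection("t1cloud3");
                for (long i = 0; ; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Long.toString(i));
                    doc.addField("body_t", "document " + i);
                    client.add(doc);
                    if (i % 1000 == 0) {
                        // periodic hard commit; segment files keep changing
                        // underneath any backup taken while this runs
                        client.commit();
                    }
                }
            }
        }
    }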

Observed behavior:
If indexing is not in progress and I take a backup of the cluster using
the Collections API, the backup succeeds and restore works as expected.
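
The backup and restore calls are the standard Collections API BACKUP
and RESTORE actions, roughly like this in SolrJ (a sketch; the backup
name and EFS mount point are illustrative):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class BackupRestore {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1.example.com:2181").build()) {
                // BACKUP: each shard writes its index files plus collection
                // metadata under the shared location (EFS in my case); the
                // location must be visible to every Solr node.
                CollectionAdminRequest.backupCollection("t1cloud3", "nightly")
                    .setLocation("/mnt/efs/solr-backups")
                    .process(client);
                // RESTORE: builds a new collection from the backup directory.
                CollectionAdminRequest
                    .restoreCollection("t1cloud3_restored", "nightly")
                    .setLocation("/mnt/efs/solr-backups")
                    .process(client);
            }
        }
    }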

snapshotscli.sh works as expected: if I first take a snapshot of the
index while indexing is in progress and then take a backup, there is no
error during restore.

However, I get an error most of the time if I try to restore the
collection from a backup taken with the Collections API while indexing
was still in progress. The error is always a missing segment, and I can
see that the segment it's trying to read during restore does not exist
in the backup shard directory.
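
For anyone wanting to check the same thing, the files a backed-up
commit references can be listed and compared against the backup shard
directory with a small Lucene snippet like this (an illustrative
sketch, not my exact tooling):

    import java.nio.file.Paths;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.FSDirectory;

    public class ListCommitFiles {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a backed-up shard index directory
            try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
                // read the newest segments_N and print every file it
                // references; anything not present in the directory is
                // what the restore will fail on
                SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
                for (String file : infos.files(true)) {
                    System.out.println(file);
                }
            }
        }
    }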

Also, is there a way to take a snapshot of a whole SolrCloud collection
using the Collections API? The user guide only documents taking a
snapshot of a single core.
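
For reference, the core-level snapshot I mean is, if I read the guide
correctly, the Core Admin CREATESNAPSHOT call, which from SolrJ looks
roughly like this (a sketch; the core name and commit name are
illustrative):

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CoreSnapshot {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr").build()) {
                // pins the core's current commit under a name so its
                // segment files are retained even while indexing continues
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("action", "CREATESNAPSHOT");
                params.set("core", "t1cloud3_shard2_replica0");
                params.set("commitName", "snap.20170908");
                new GenericSolrRequest(SolrRequest.METHOD.GET,
                    "/admin/cores", params).process(client);
            }
        }
    }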

2017-09-08 19:47:22.592 WARN  (parallelCoreAdminExecutor-5-thread-8-processing-n:ec2-34-201-149-27.compute-1.amazonaws.com:8983_solr t1cloudbackuponefs-r2187461299681393 RESTORECORE) [   ] o.a.s.h.RestoreCore Could not switch to restored index. Rolling back to the current index
org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/var/solr/data/t1cloud3_shard2_replica0/data/restore.20170908194722131/segments_y")))
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:930)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:118)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:93)
    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:248)
    at org.apache.solr.update.DefaultSolrCoreState.changeWriter(DefaultSolrCoreState.java:211)
    at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:220)
    at org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:726)
    at org.apache.solr.handler.RestoreCore.doRestore(RestoreCore.java:108)
    at org.apache.solr.handler.admin.RestoreCoreOp.execute(RestoreCoreOp.java:65)
    at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)
    at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
    at org.apache.solr.handler.admin.CoreAdminHandler.lambda$handleRequestBody$0(CoreAdminHandler.java:182)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.NoSuchFileException: /var/solr/data/t1cloud3_shard2_replica0/data/restore.20170908194722131/_4m.si
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
    at java.nio.channels.FileChannel.open(FileChannel.java:287)
    at java.nio.channels.FileChannel.open(FileChannel.java:335)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
    at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:192)
    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
    at org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:288)
    ... 17 more

Re: Problem with live Solr cloud (6.6) backup using collection API

Posted by sw90000 <sw...@gmail.com>.
I am having the same problem when I try to restore a backed-up index.




Re: Problem with live Solr cloud (6.6) backup using collection API

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/25/2017 7:25 AM, Vikas Mehra wrote:
> The cluster has 1 ZooKeeper node and 3 Solr nodes. There is only one
> collection, with 3 shards. Data is continuously indexed using the SolrJ
> API. The system runs on AWS and I am taking backups on EFS (Elastic
> File System).
>
> Observed behavior:
> If indexing is not in progress and I take a backup of the cluster using
> the Collections API, the backup succeeds and restore works as expected.
>
> snapshotscli.sh works as expected: if I first take a snapshot of the
> index while indexing is in progress and then take a backup, there is no
> error during restore.

I was completely unaware of the snapshotscli.sh script.  I just found
where it was added to Solr:

https://issues.apache.org/jira/browse/SOLR-9688

> However, I get an error most of the time if I try to restore the
> collection from a backup taken with the Collections API while indexing
> was still in progress. The error is always a missing segment, and I can
> see that the segment it's trying to read during restore does not exist
> in the backup shard directory.

My best guess: When you manually create a snapshot, the BACKUP feature
in the Collections API finds that snapshot and backs it up.  When you
don't create a snapshot, perhaps it only copies from the live index,
which can change if there is indexing underway.

When there are no snapshots, I think that the BACKUP feature should
create one, then delete it once the backup is done.  Or it could use a
Lucene feature called a commit point to ensure that files cannot
disappear during the backup, and delete that when the backup is done.
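
To illustrate the commit point idea: Lucene's SnapshotDeletionPolicy
can pin a commit so that the files it references cannot be deleted by
merges or later commits until the snapshot is released.  A standalone
Lucene sketch (my illustration, not Solr's actual backup code):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    public class PinnedCommitBackup {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
            SnapshotDeletionPolicy policy = new SnapshotDeletionPolicy(
                new KeepOnlyLastCommitDeletionPolicy());
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer())
                .setIndexDeletionPolicy(policy);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                writer.commit();                        // ensure a commit exists
                IndexCommit commit = policy.snapshot(); // pin the commit point
                try {
                    // copy exactly these files; they cannot disappear while
                    // the snapshot is held, even if indexing continues
                    for (String file : commit.getFileNames()) {
                        System.out.println("copy: " + file);
                    }
                } finally {
                    policy.release(commit);             // allow deletion again
                    writer.deleteUnusedFiles();
                }
            }
        }
    }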

I've always found it hard to decipher the code for the Collections API.
I can never figure out exactly where the work is being done.  I've
poked around a bit, but I cannot see where the BACKUP action is
actually handled.  The code is very difficult to follow, so I have no
idea whether I've even found the right code to look at.

Thanks,
Shawn