Posted to solr-user@lucene.apache.org by Rob Nicholls <ro...@hotmail.com> on 2011/10/25 17:32:12 UTC

Replication issues with multiple Slaves

Hey guys,

We have a Master (1 server) and 2 Slaves (2 servers) set up and running replication across multiple cores.

However, the replication appears to behave sporadically and often fails when left to replicate automatically via poll. More often than not a replicate will fail after the slave has finished pulling down the segment files, because it cannot find a particular file, giving errors such as:

Oct 25, 2011 10:00:17 AM org.apache.solr.handler.SnapPuller copyAFile
SEVERE: Unable to move index file from: D:\web\solr\collection\data\index.20111025100000\_3u.tii to: D:\web\solr\Collection\data\index\_3u.tiiTrying to do a copy

SEVERE: Unable to copy index file from: D:\web\solr\collection\data\index.20111025100000\_3s.fdt to: D:\web\solr\Collection\data\index\_3s.fdt
java.io.FileNotFoundException: D:\web\solr\collection\data\index.20111025100000\_3s.fdt (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at org.apache.solr.common.util.FileUtils.copyFile(FileUtils.java:47)
    at org.apache.solr.handler.SnapPuller.copyAFile(SnapPuller.java:585)
    at org.apache.solr.handler.SnapPuller.copyIndexFiles(SnapPuller.java:621)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:317)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:267)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(Unknown Source)
    at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

For these files, I checked the master, and they did indeed exist.

Both slave machines are configured the same, with the same replication settings and a 60-minute poll interval.
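
For context, a 60-minute poll interval on a slave is normally expressed in solrconfig.xml along these lines. This is only a sketch based on the stock ReplicationHandler configuration, not the actual file from this setup; the masterUrl host, port, and core name are placeholders.

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- Placeholder URL: substitute the real master host, port, and core -->
    <str name="masterUrl">http://master-host:8080/solr/collection/replication</str>
    <!-- HH:mm:ss; the slave polls the master for a newer index on this schedule -->
    <str name="pollInterval">01:00:00</str>
  </lst>
</requestHandler>
```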

Is it perhaps because both slave machines are trying to pull down files at the same time? (and the other has a lock on the file, thus it gets skipped maybe?)

Note: If I manually force replication on each slave, one at a time, the replication always seems to work fine.



Is there any obvious explanation or oddities I should be aware of that may cause this?

Thanks,
Rob

RE: Replication issues with multiple Slaves

Posted by Rob Nicholls <ro...@hotmail.com>.

Thanks... Yes, and no. 

The main thing is, after the replicate failed below, I checked the master
and the files that it complains about below (and several others) did
exist... which is where I'm stumped about what is causing the issue (I have
added the maxCommits setting you mention below already).

I'll retest to confirm that there is only a single commit happening in this
scenario, and it's not some weird oddity to do with Windows just being an
arrse with file and path capitalization.


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: 25 October 2011 20:51
To: solr-user@lucene.apache.org
Cc: Jaeger, Jay - DOT
Subject: Re: Replication issues with multiple Slaves

Are you frequently adding and deleting documents and committing those
mutations? Then it might try to download a file that doesn't exist anymore.
If that is the case, try increasing:

<str name="maxCommitsToKeep"></str>

> I noted that in these messages the left hand side is lower case 
> collection, but the right hand side is upper case Collection.  
> Assuming you did a cut/paste, could you have a core name mismatch 
> between a master and a slave somehow?
> 
> Otherwise (shudder):  could you be doing a commit while the 
> replication is in progress, causing files to shift about on it?  I'd 
> have expected (perhaps naively) solr to have some sort of lock to 
> prevent such a problem.  But if there is no internal lock, that would 
> be a serious matter (and could happen to us, too, down the road).
> 
> JRJ
> 

Re: Replication issues with multiple Slaves

Posted by Markus Jelsma <ma...@openindex.io>.

Are you frequently adding and deleting documents and committing those
mutations? Then it might try to download a file that doesn't exist anymore. If
that is the case, try increasing:

<str name="maxCommitsToKeep"></str>
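
For reference, this setting belongs to the index deletion policy in the master's solrconfig.xml. A minimal sketch, assuming the stock SolrDeletionPolicy; the values shown are illustrative and are not ones recommended anywhere in this thread:

```xml
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Number of most recent commit points to retain (illustrative value) -->
  <str name="maxCommitsToKeep">3</str>
  <!-- Number of optimized commit points to retain -->
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>
```

Keeping more commit points means the files belonging to an older commit stay on disk longer, at the cost of extra disk space on the master.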

> I noted that in these messages the left hand side is lower case collection,
> but the right hand side is upper case Collection.  Assuming you did a
> cut/paste, could you have a core name mismatch between a master and a
> slave somehow?
> 
> Otherwise (shudder):  could you be doing a commit while the replication is
> in progress, causing files to shift about on it?  I'd have expected
> (perhaps naively) solr to have some sort of lock to prevent such a
> problem.  But if there is no internal lock, that would be a serious matter
> (and could happen to us, too, down the road).
> 
> JRJ
> 

RE: Replication issues with multiple Slaves

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.

Thanks for that information.  It was most useful.  

Does anyone know:  when this happens does the slave continue using its old index, and then try again at the next time interval?  (I sure hope so).

JRJ

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Tuesday, October 25, 2011 3:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Replication issues with multiple Slaves


> 1) Hmm, maybe, didn't notice that... but I'd be very confused why it works
> occasionally, and manual replication (through Solr Admin) always works ok
> in that case?
> 2) This was my initial thought, it was happening on one core (multiple
> commits while replication in progress), but I noticed it happening on
> another core (the one mentioned below) which only had 1 commit and a single
> generation (11 > 12) change to replicate.
> 
> 
> I too hoped and presumed that the Master is being Locked while replication
> is copying files... can anyone confirm this? We are using the native Lock
> type on a Windows/Tomcat server.

Replication does not lock the index from being written to.

> 
> Is anyone aware of any reason why the replication skips files, or fails to
> copy/find files other than because of presumably a commit or optimize
> re-chunking the segments and deleting them on the Master?

Slaves receive a list of files to download. Files further down the list may
disappear before the slave gets a chance to download them. By keeping older
commits we were able to work around this issue.


Re: Replication issues with multiple Slaves

Posted by Markus Jelsma <ma...@openindex.io>.

> 1) Hmm, maybe, didn't notice that... but I'd be very confused why it works
> occasionally, and manual replication (through Solr Admin) always works ok
> in that case?
> 2) This was my initial thought, it was happening on one core (multiple
> commits while replication in progress), but I noticed it happening on
> another core (the one mentioned below) which only had 1 commit and a single
> generation (11 > 12) change to replicate.
> 
> 
> I too hoped and presumed that the Master is being Locked while replication
> is copying files... can anyone confirm this? We are using the native Lock
> type on a Windows/Tomcat server.

Replication does not lock the index from being written to.

> 
> Is anyone aware of any reason why the replication skips files, or fails to
> copy/find files other than because of presumably a commit or optimize
> re-chunking the segments and deleting them on the Master?

Slaves receive a list of files to download. Files further down the list may
disappear before the slave gets a chance to download them. By keeping older
commits we were able to work around this issue.
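
A related knob, not discussed in this thread, is commitReserveDuration in the master section of the ReplicationHandler. The Solr replication documentation describes it as something to raise when commits are frequent and transfers are slow. A sketch assuming the standard master configuration; the duration below is an illustrative value, not a tested recommendation:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Make a new index version available to slaves after each commit -->
    <str name="replicateAfter">commit</str>
    <!-- HH:mm:ss; the default is 00:00:10 (illustrative value shown) -->
    <str name="commitReserveDuration">00:01:00</str>
  </lst>
</requestHandler>
```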


RE: Replication issues with multiple Slaves

Posted by Rob Nicholls <ro...@hotmail.com>.

1) Hmm, maybe, didn't notice that... but I'd be very confused why it works
occasionally, and manual replication (through Solr Admin) always works ok in
that case?
2) This was my initial thought, it was happening on one core (multiple
commits while replication in progress), but I noticed it happening on
another core (the one mentioned below) which only had 1 commit and a single
generation (11 > 12) change to replicate. 


I too hoped and presumed that the Master is being Locked while replication
is copying files... can anyone confirm this? We are using the native Lock
type on a Windows/Tomcat server.

Is anyone aware of any reason why the replication skips files, or fails to
copy/find files other than because of presumably a commit or optimize
re-chunking the segments and deleting them on the Master?

-----Original Message-----
From: Jaeger, Jay - DOT [mailto:Jay.Jaeger@dot.wi.gov] 
Sent: 25 October 2011 20:48
To: solr-user@lucene.apache.org
Subject: RE: Replication issues with multiple Slaves

I noted that in these messages the left hand side is lower case collection,
but the right hand side is upper case Collection.  Assuming you did a
cut/paste, could you have a core name mismatch between a master and a slave
somehow?

Otherwise (shudder):  could you be doing a commit while the replication is
in progress, causing files to shift about on it?  I'd have expected (perhaps
naively) solr to have some sort of lock to prevent such a problem.  But if
there is no internal lock, that would be a serious matter (and could happen
to us, too, down the road).

JRJ


RE: Replication issues with multiple Slaves

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.

I noted that in these messages the left hand side is lower case collection, but the right hand side is upper case Collection.  Assuming you did a cut/paste, could you have a core name mismatch between a master and a slave somehow?

Otherwise (shudder):  could you be doing a commit while the replication is in progress, causing files to shift about on it?  I'd have expected (perhaps naively) solr to have some sort of lock to prevent such a problem.  But if there is no internal lock, that would be a serious matter (and could happen to us, too, down the road).

JRJ


Replication issues with multiple Slaves

Posted by Rob Nicholls <ro...@hotmail.com>.

Hey all,

We have a Master (1 server) and 2 Slaves (2 servers) set up and running replication across multiple cores.

However, the replication appears to behave sporadically and often fails when left to replicate automatically via poll. More often than not a replicate will fail after the slave has finished pulling down the segment files, because it cannot find a particular file, giving errors such as:

Oct 25, 2011 10:00:17 AM org.apache.solr.handler.SnapPuller copyAFile
SEVERE: Unable to move index file from: D:\web\solr\collection\data\index.20111025100000\_3u.tii to: D:\web\solr\Collection\data\index\_3u.tiiTrying to do a copy

SEVERE: Unable to copy index file from: D:\web\solr\collection\data\index.20111025100000\_3s.fdt to: D:\web\solr\Collection\data\index\_3s.fdt
java.io.FileNotFoundException: D:\web\solr\collection\data\index.20111025100000\_3s.fdt (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at org.apache.solr.common.util.FileUtils.copyFile(FileUtils.java:47)
    at org.apache.solr.handler.SnapPuller.copyAFile(SnapPuller.java:585)
    at org.apache.solr.handler.SnapPuller.copyIndexFiles(SnapPuller.java:621)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:317)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:267)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(Unknown Source)
    at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

For these files, I checked the master, and they did indeed exist.

Both slave machines are configured the same, with the same replication settings and a 60-minute poll interval. We are using Solr 3.1.

Is it perhaps because both slave machines are trying to pull down files at the same time? (and the other has a lock on the file, thus it gets skipped maybe?)

Note: If I manually force replication on each slave, one at a time, the replication always seems to work fine.




Is there any obvious explanation or oddities I should be aware of that may cause this?

Thanks,
Rob