Posted to solr-user@lucene.apache.org by Neal Ensor <ne...@gmail.com> on 2013/03/07 16:57:06 UTC

SOLR 4.2 (snapshot, March 1, 2013): replication not "finishing"?

I'm seeing intermittent replication problems with my current
arrangement: one master and three slaves, all running the same SOLR
version and WAR file deployment.

When I update the master, replication starts on all three slaves, but
it never seems to "finish". In the data/ folders I get an empty
index.timestamp folder, and the replication admin page shows the slave
"stuck" pulling a file (no progress shown, just a constant refresh).
The index never switches over and always reports itself as
out-of-date. Abort requests are ignored, both from the admin console
and from a curl "abortfetch" request sent to the slaves. The searcher
is still responsive and I can query it, but of course the master's
changes are not there.

If I kill my container (Tomcat 6) and start it back up, the
replication has magically "finished" and the slave is up-to-date. This
leads me to believe that something isn't finalizing the switch-over
and opening a new searcher on the index: it looks like the "main"
index is actually being updated, but a close/open (or reopen) is not
happening.

The relevant part of the thread dump in that state seems to be:

snapPuller-7-thread-1 (12)
java.util.concurrent.FutureTask$Sync@55ccfb48
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
java.util.concurrent.FutureTask.get(FutureTask.java:83)
org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:655)
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:466)
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:281)
org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:662)
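Reading the trace bottom-up, the scheduled snapPuller thread runs ReplicationHandler.doFetch, then SnapPuller.fetchLatestIndex, then SnapPuller.doCommit, which calls the untimed FutureTask.get() and parks there. In other words, the fetch is waiting for some other submitted task to complete, and an untimed get() waits forever if that task never does. The Java sketch below reproduces only that general pattern; it is not Solr's code. The never-completing task (a CountDownLatch that is never released) and the class name StuckFutureDemo are hypothetical stand-ins, since the trace does not show what the awaited task itself is blocked on. It uses a timed get() so the wait expires instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckFutureDemo {

    // Returns true if waiting on a never-completing task timed out,
    // i.e. an untimed get() would have blocked forever at this point.
    static boolean waitTimesOut(long timeoutMillis) {
        // Hypothetical stand-in for whatever work doCommit is waiting on:
        // the latch is never counted down, so this task never finishes.
        CountDownLatch neverReleased = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Object> pending = executor.submit(() -> {
                neverReleased.await();
                return null;
            });
            try {
                // An untimed pending.get() would park here indefinitely,
                // analogous to the FutureTask.get frame in the thread dump.
                pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            // Interrupts the blocked task so its worker thread can exit.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + waitTimesOut(200));
    }
}
```

Running main prints "timed out: true". Replacing the timed call with a plain pending.get() would leave the calling thread parked indefinitely, much like the snapPuller thread above (on newer JDKs the internal frames are named differently, for example FutureTask.awaitDone instead of FutureTask$Sync.innerGet).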

The Tomcat containers are all identical; each slave runs alone in its
own container on a separate machine. SOLR reports these versions:

solr-spec: 4.2.0.2013.03.01.10.10.50
solr-impl: 4.2-SNAPSHOT 1451604 - ensorn - 2013-03-01 10:10:50
lucene-spec: 4.2-SNAPSHOT
lucene-impl: 4.2-SNAPSHOT 1451604 - ensorn - 2013-03-01 10:02:53

Any help would be appreciated; this is getting very frustrating. To
add to the puzzle, I have set up a new "slave" on my work PC (a Mac),
and it replicates FLAWLESSLY with the same setup. The only difference
is that the slaves on the servers are on a SAN array (could locking be
causing the heartburn?). Any pointers would be great. This is becoming
a real pain to work with, even with fairly infrequent replications.

Thanks!

Neal Ensor
nensor@gmail.com