Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2013/04/02 22:52:10 UTC

Solr 4.2 Cloud Replication Replica has higher version than Master?

I am currently looking at moving our Solr cluster to 4.2 and noticed a
strange issue while testing today.  Specifically, the replica has a higher
version than the master, which is causing the index not to replicate.
Because of this the replica has fewer documents than the master.  What
could cause this, and how can I resolve it short of taking the index down
and scp'ing the right version in?

MASTER:
Last Modified: about an hour ago
Num Docs: 164880
Max Doc: 164880
Deleted Docs: 0
Version: 2387
Segment Count: 23

REPLICA:
Last Modified: about an hour ago
Num Docs: 164773
Max Doc: 164773
Deleted Docs: 0
Version: 3001
Segment Count: 30

In the replica's log it says this:

INFO: Creating new http client,
config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync

INFO: PeerSync: core=dsc-shard5-core2
url=http://10.38.33.17:7577/solr START replicas=[
http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions

INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions

INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our
versions are newer. ourLowThreshold=1431233788792274944
otherHigh=1431233789440294912

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync

INFO: PeerSync: core=dsc-shard5-core2
url=http://10.38.33.17:7577/solr DONE. sync succeeded


which again seems to indicate that it thinks it has a newer version of the
index, so it aborts the sync.  This happened while 10 threads were each
indexing 10,000 items into a 6-shard (1 replica each) cluster.  Any thoughts
on this or what I should look for would be appreciated.
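
For reference, one way to confirm how far a replica has drifted is to query
each core directly (distrib=false keeps the request from fanning out) and,
if the /replication handler is enabled in solrconfig.xml as it is in the
stock example, compare index versions as well.  A sketch using the hosts and
core names from the log above:

curl 'http://10.38.33.16:7575/solr/dsc-shard5-core1/select?q=*:*&rows=0&wt=json&distrib=false'
curl 'http://10.38.33.17:7577/solr/dsc-shard5-core2/select?q=*:*&rows=0&wt=json&distrib=false'
curl 'http://10.38.33.16:7575/solr/dsc-shard5-core1/replication?command=indexversion&wt=json'

Comparing numFound from the first two calls (and the leader's indexversion
from the third) shows which core is behind without relying on the admin UI
numbers above.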

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Sorry, that should say none of the _* files were present, not one....


On Wed, Apr 3, 2013 at 10:16 PM, Jamie Johnson <je...@gmail.com> wrote:

> I have since removed the files but when I had looked there was an index
> directory, the only files I remember being there were the segments, one of
> the _* files were present.  I'll watch it to see if it happens again but it
> happened on 2 of the shards while heavy indexing.
>
>
> On Wed, Apr 3, 2013 at 10:13 PM, Mark Miller <ma...@gmail.com>wrote:
>
>> Is that file still there when you look? Not being able to find an index
>> file is not a common error I've seen recently.
>>
>> Do those replicas have an index directory or when you look on disk, is it
>> an index.timestamp directory?
>>
>> - Mark
>>
>> On Apr 3, 2013, at 10:01 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>> > so something is still not right.  Things were going ok, but I'm seeing
>> this
>> > in the logs of several of the replicas
>> >
>> > SEVERE: Unable to create core: dsc-shard3-core1
>> > org.apache.solr.common.SolrException: Error opening new searcher
>> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:822)
>> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
>> >        at
>> > org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:967)
>> >        at
>> > org.apache.solr.core.CoreContainer.create(CoreContainer.java:1049)
>> >        at
>> org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
>> >        at
>> org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
>> >        at
>> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >        at
>> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >        at
>> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >        at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >        at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >        at java.lang.Thread.run(Thread.java:662)
>> > Caused by: org.apache.solr.common.SolrException: Error opening new
>> searcher
>> >        at
>> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1435)
>> >        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1547)
>> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
>> >        ... 13 more
>> > Caused by: org.apache.solr.common.SolrException: Error opening Reader
>> >        at
>> >
>> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172)
>> >        at
>> >
>> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:183)
>> >        at
>> >
>> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:179)
>> >        at
>> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1411)
>> >        ... 15 more
>> > Caused by: java.io.FileNotFoundException:
>> > /cce2/solr/data/dsc-shard3-core1/index/_13x.si (No such file or
>> directory)
>> >        at java.io.RandomAccessFile.open(Native Method)
>> >        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
>> >        at
>> > org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
>> >        at
>> >
>> org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
>> >        at
>> >
>> org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:50)
>> >        at
>> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:301)
>> >        at
>> >
>> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
>> >        at
>> >
>> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
>> >        at
>> >
>> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
>> >        at
>> > org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
>> >        at
>> >
>> org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
>> >        at
>> >
>> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:169)
>> >        ... 18 more
>> >
>> >
>> >
>> > On Wed, Apr 3, 2013 at 8:54 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >
>> >> Thanks I will try that.
>> >>
>> >>
>> >> On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>
>> >>>
>> >>>
>> >>> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <je...@gmail.com> wrote:
>> >>>
>> >>>> I am not using the concurrent low pause garbage collector, I could
>> look
>> >>> at
>> >>>> switching, I'm assuming you're talking about adding
>> >>> -XX:+UseConcMarkSweepGC
>> >>>> correct?
>> >>>
>> >>> Right - if you don't do that, the default is almost always the
>> throughput
>> >>> collector (I've only seen OSX buck this trend when apple handled
>> java).
>> >>> That means stop the world garbage collections, so with larger heaps,
>> that
>> >>> can be a fair amount of time that no threads can run. It's not that
>> great
>> >>> for something as interactive as search generally is anyway, but it's
>> always
>> >>> not that great when added to heavy load and a 15 sec session timeout
>> >>> between solr and zk.
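
For reference, a sketch of the two changes discussed here, layered onto the
start command Jamie posts later in the thread: -XX:+UseConcMarkSweepGC (with
-XX:+UseParNewGC) switches to the CMS collector, and -DzkClientTimeout=30000
raises the ZooKeeper session timeout, assuming solr.xml reads it from a
system property as the stock ${zkClientTimeout:15000} example does.

java -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -DzkClientTimeout=30000 \
  -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 \
  -Dcollection.configName=solr-conf -Dcollection=collection1 \
  -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar
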
>> >>>
>> >>>
>> >>> The below is odd - a replica node is waiting for the leader to see it
>> as
>> >>> recovering and live - live means it has created an ephemeral node for
>> that
>> >>> Solr corecontainer in zk - it's very strange if that didn't happen,
>> unless
>> >>> this happened during shutdown or something.
>> >>>
>> >>>>
>> >>>> I also just had a shard go down and am seeing this in the log
>> >>>>
>> >>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
>> >>> state
>> >>>> down for 10.38.33.17:7576_solr but I still do not see the requested
>> >>> state.
>> >>>> I see state: recovering live:false
>> >>>>       at
>> >>>>
>> >>>
>> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
>> >>>>       at
>> >>>>
>> >>>
>> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
>> >>>>       at
>> >>>>
>> >>>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>       at
>> >>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
>> >>>>       at
>> >>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
>> >>>>       at
>> >>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>> >>>>       at
>> >>>>
>> >>>
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>> >>>>       at
>> >>>>
>> >>>
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>> >>>>       at
>> >>>>
>> >>>
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>> >>>>       at
>> >>>>
>> >>>
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>> >>>>       at
>> >>>>
>> >>>
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>> >>>>
>> >>>> Nothing other than this in the log jumps out as interesting though.
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <ma...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>>> This shouldn't be a problem though, if things are working as they
>> are
>> >>>>> supposed to. Another node should simply take over as the overseer
>> and
>> >>>>> continue processing the work queue. It's just best if you configure
>> so
>> >>> that
>> >>>>> session timeouts don't happen unless a node is really down. On the
>> >>> other
>> >>>>> hand, it's nicer to detect that faster. Your tradeoff to make.
>> >>>>>
>> >>>>> - Mark
>> >>>>>
>> >>>>> On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>>> Yeah. Are you using the concurrent low pause garbage collector?
>> >>>>>>
>> >>>>>> This means the overseer wasn't able to communicate with zk for 15
>> >>>>> seconds - due to load or gc or whatever. If you can't resolve the
>> root
>> >>>>> cause of that, or the load just won't allow for it, next best thing
>> >>> you can
>> >>>>> do is raise it to 30 seconds.
>> >>>>>>
>> >>>>>> - Mark
>> >>>>>>
>> >>>>>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>>>>>
>> >>>>>>> I am occasionally seeing this in the log, is this just a timeout
>> >>> issue?
>> >>>>>>> Should I be increasing the zk client timeout?
>> >>>>>>>
>> >>>>>>> WARNING: Overseer cannot talk to ZK
>> >>>>>>> Apr 3, 2013 11:14:25 PM
>> >>>>>>> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
>> >>>>>>> INFO: Watcher fired on path: null state: Expired type None
>> >>>>>>> Apr 3, 2013 11:14:25 PM
>> >>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater
>> >>>>>>> run
>> >>>>>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
>> >>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> >>>>>>> KeeperErrorCode = Session expired for /overseer/queue
>> >>>>>>>     at
>> >>>>>>>
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>> >>>>>>>     at
>> >>>>>>>
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> >>>>>>>     at
>> >>> org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>> >>>>>>>     at
>> >>>>>>>
>> >>> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>> >>>>>>>     at
>> >>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>> >>>>>>>     at java.lang.Thread.run(Thread.java:662)
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com>
>> >>>>> wrote:
>> >>>>>>>
>> >>>>>>>> just an update, I'm at 1M records now with no issues.  This looks
>> >>>>>>>> promising as to the cause of my issues, thanks for the help.  Is
>> the
>> >>>>>>>> routing method with numShards documented anywhere?  I know
>> >>> numShards is
>> >>>>>>>> documented but I didn't know that the routing changed if you
>> don't
>> >>>>> specify
>> >>>>>>>> it.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <jej2003@gmail.com
>> >
>> >>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> with these changes things are looking good, I'm up to 600,000
>> >>>>> documents
>> >>>>>>>>> without any issues as of right now.  I'll keep going and add
>> more
>> >>> to
>> >>>>> see if
>> >>>>>>>>> I find anything.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <
>> jej2003@gmail.com>
>> >>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> ok, so that's not a deal breaker for me.  I just changed it to
>> >>> match
>> >>>>> the
>> >>>>>>>>>> shards that are auto created and it looks like things are
>> happy.
>> >>>>> I'll go
>> >>>>>>>>>> ahead and try my test to see if I can get things out of sync.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <
>> >>> markrmiller@gmail.com
>> >>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> I had thought you could - but looking at the code recently, I
>> >>> don't
>> >>>>>>>>>>> think you can anymore. I think that's a technical limitation
>> more
>> >>>>> than
>> >>>>>>>>>>> anything though. When these changes were made, I think support
>> >>> for
>> >>>>> that was
>> >>>>>>>>>>> simply not added at the time.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I'm not sure exactly how straightforward it would be, but it
>> >>> seems
>> >>>>>>>>>>> doable - as it is, the overseer will preallocate shards when
>> >>> first
>> >>>>> creating
>> >>>>>>>>>>> the collection - that's when they get named shard(n). There
>> would
>> >>>>> have to
>> >>>>>>>>>>> be logic to replace shard(n) with the custom shard name when
>> the
>> >>>>> core
>> >>>>>>>>>>> actually registers.
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com>
>> >>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> answered my own question, it now says compositeId.  What is
>> >>>>>>>>>>> problematic
>> >>>>>>>>>>>> though is that in addition to my shards (which are say
>> >>>>> jamie-shard1)
>> >>>>>>>>>>> I see
>> >>>>>>>>>>>> the solr created shards (shard1).  I assume that these were
>> >>> created
>> >>>>>>>>>>> because
>> >>>>>>>>>>>> of the numShards param.  Is there no way to specify the
>> names of
>> >>>>> these
>> >>>>>>>>>>>> shards?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <
>> >>> jej2003@gmail.com>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> ah interesting....so I need to specify num shards, blow out
>> zk
>> >>> and
>> >>>>>>>>>>> then
>> >>>>>>>>>>>>> try this again to see if things work properly now.  What is
>> >>> really
>> >>>>>>>>>>> strange
>> >>>>>>>>>>>>> is that for the most part things have worked right and on
>> >>> 4.2.1 I
>> >>>>>>>>>>> have
>> >>>>>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I
>> will
>> >>>>>>>>>>> specify num
>> >>>>>>>>>>>>> shards clear out zk and begin again.  If this works properly
>> >>> what
>> >>>>>>>>>>> should
>> >>>>>>>>>>>>> the router type be?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
>> >>>>> markrmiller@gmail.com>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> If you don't specify numShards after 4.1, you get an
>> implicit
>> >>> doc
>> >>>>>>>>>>> router
>> >>>>>>>>>>>>>> and it's up to you to distribute updates. In the past,
>> >>>>> partitioning
>> >>>>>>>>>>> was
>> >>>>>>>>>>>>>> done on the fly - but for shard splitting and perhaps other
>> >>>>>>>>>>> features, we
>> >>>>>>>>>>>>>> now divvy up the hash range up front based on numShards and
>> >>> store
>> >>>>>>>>>>> it in
>> >>>>>>>>>>>>>> ZooKeeper. No numShards is now how you take complete
>> control
>> >>> of
>> >>>>>>>>>>> updates
>> >>>>>>>>>>>>>> yourself.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <
>> jej2003@gmail.com>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> The router says "implicit".  I did start from a blank zk
>> >>> state
>> >>>>> but
>> >>>>>>>>>>>>>> perhaps
>> >>>>>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from
>> >>> the
>> >>>>>>>>>>>>>>> clusterstate.json is shown below.  What is the process
>> that
>> >>>>> should
>> >>>>>>>>>>> be
>> >>>>>>>>>>>>>> done
>> >>>>>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I
>> listed
>> >>>>>>>>>>> above?  My
>> >>>>>>>>>>>>>>> process right now is run those ZkCLI commands and then
>> start
>> >>>>> solr
>> >>>>>>>>>>> on
>> >>>>>>>>>>>>>> all of
>> >>>>>>>>>>>>>>> the instances with a command like this
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>> >>>>>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>> >>>>>>>>>>>>>> -Dcollection.configName=solr-conf
>> >>>>>>>>>>>>>>> -Dcollection=collection1
>> >>>>>>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>> >>>>>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I feel like maybe I'm missing a step.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> "shard5":{
>> >>>>>>>>>>>>>>>   "state":"active",
>> >>>>>>>>>>>>>>>   "replicas":{
>> >>>>>>>>>>>>>>>     "10.38.33.16:7575_solr_shard5-core1":{
>> >>>>>>>>>>>>>>>       "shard":"shard5",
>> >>>>>>>>>>>>>>>       "state":"active",
>> >>>>>>>>>>>>>>>       "core":"shard5-core1",
>> >>>>>>>>>>>>>>>       "collection":"collection1",
>> >>>>>>>>>>>>>>>       "node_name":"10.38.33.16:7575_solr",
>> >>>>>>>>>>>>>>>       "base_url":"http://10.38.33.16:7575/solr",
>> >>>>>>>>>>>>>>>       "leader":"true"},
>> >>>>>>>>>>>>>>>     "10.38.33.17:7577_solr_shard5-core2":{
>> >>>>>>>>>>>>>>>       "shard":"shard5",
>> >>>>>>>>>>>>>>>       "state":"recovering",
>> >>>>>>>>>>>>>>>       "core":"shard5-core2",
>> >>>>>>>>>>>>>>>       "collection":"collection1",
>> >>>>>>>>>>>>>>>       "node_name":"10.38.33.17:7577_solr",
>> >>>>>>>>>>>>>>>       "base_url":"http://10.38.33.17:7577/solr"}}}
>> >>>>>>>>>>>>>>>
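
For reference, a sketch of the change implied by the discussion above:
against a clean ZooKeeper state, pass numShards the first time the
collection is created so the compositeId router and per-shard hash ranges
get written to clusterstate.json.  This assumes solr.xml does not already
set numShards on the core; the property only matters when the collection is
first created.

java -server -DnumShards=6 -DcoreName=shard5-core1 \
  -Dsolr.data.dir=/solr/data/shard5-core1 \
  -Dcollection.configName=solr-conf -Dcollection=collection1 \
  -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar

Dropping the custom -Dshard property lets Solr assign the auto-generated
shard names (shard1..shard6), which is what ends up being done further up
the thread.
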
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
>> >>>>> markrmiller@gmail.com
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> It should be part of your clusterstate.json. Some users
>> have
>> >>>>>>>>>>> reported
>> >>>>>>>>>>>>>>>> trouble upgrading a previous zk install when this change
>> >>> came.
>> >>>>> I
>> >>>>>>>>>>>>>>>> recommended manually updating the clusterstate.json to
>> have
>> >>> the
>> >>>>>>>>>>> right
>> >>>>>>>>>>>>>> info,
>> >>>>>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to
>> >>> start
>> >>>>>>>>>>> from a
>> >>>>>>>>>>>>>> clean
>> >>>>>>>>>>>>>>>> zk state.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> If you don't have that range information, I think there
>> >>> will be
>> >>>>>>>>>>>>>> trouble.
>> >>>>>>>>>>>>>>>> Do you have an router type defined in the
>> clusterstate.json?
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <
>> >>> jej2003@gmail.com>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it
>> in
>> >>> the
>> >>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do
>> >>> when I
>> >>>>>>>>>>> start
>> >>>>>>>>>>>>>> from
>> >>>>>>>>>>>>>>>>> scratch is the following
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
>> >>>>>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What
>> >>>>> should
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>> steps
>> >>>>>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased
>> snapshot
>> >>> of
>> >>>>> 4.0
>> >>>>>>>>>>> so I
>> >>>>>>>>>>>>>>>> never
>> >>>>>>>>>>>>>>>>> did that previously either so perhaps I did create the
>> >>>>>>>>>>> collection in
>> >>>>>>>>>>>>>> one
>> >>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>> my steps to get this working but have forgotten it along
>> >>> the
>> >>>>> way.
>> >>>>>>>>>>>>>>>>>
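
For reference, a sketch of those two ZkCLI steps with full arguments (the
config directory path is a placeholder; the same flags work whether ZkCLI is
run via the cloud-scripts wrapper or a plain java -classpath invocation of
org.apache.solr.cloud.ZkCLI):

ZkCLI -cmd upconfig -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -confdir /path/to/solr-conf -confname solr-conf
ZkCLI -cmd linkconfig -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -collection collection1 -confname solr-conf

The collection itself is then created the first time a core registers with
-Dcollection=collection1, which is where the numShards property discussed
above comes in.
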
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>> >>>>>>>>>>> markrmiller@gmail.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are
>> >>> assigned up
>> >>>>>>>>>>> front
>> >>>>>>>>>>>>>>>> when a
>> >>>>>>>>>>>>>>>>>> collection is created - each shard gets a range, which
>> is
>> >>>>>>>>>>> stored in
>> >>>>>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the
>> same
>> >>> id
>> >>>>> on
>> >>>>>>>>>>>>>>>> different
>> >>>>>>>>>>>>>>>>>> shards - something very odd going on.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you
>> >>> reproduce.
>> >>>>>>>>>>> Ideally
>> >>>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>>> can capture it in a test case.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <
>> >>> jej2003@gmail.com
>> >>>>>>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with
>> the
>> >>>>>>>>>>> parameter
>> >>>>>>>>>>>>>> set I
>> >>>>>>>>>>>>>>>>>> am
>> >>>>>>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it
>> on
>> >>>>> 4.2.0
>> >>>>>>>>>>> by
>> >>>>>>>>>>>>>>>>>> indexing
>> >>>>>>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I
>> get
>> >>> to
>> >>>>>>>>>>> 400,000
>> >>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>>>> so.
>> >>>>>>>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same
>> >>> behavior
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>> >>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Since I don't have that many items in my index I
>> >>> exported
>> >>>>> all
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> keys
>> >>>>>>>>>>>>>>>>>>>> for each shard and wrote a simple java program that
>> >>> checks
>> >>>>> for
>> >>>>>>>>>>>>>>>>>> duplicates.
>> >>>>>>>>>>>>>>>>>>>> I found some duplicate keys on different shards, a
>> grep
>> >>> of
>> >>>>> the
>> >>>>>>>>>>>>>> files
>> >>>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>>>> the keys found does indicate that they made it to the
>> >>> wrong
>> >>>>>>>>>>> places.
>> >>>>>>>>>>>>>>>> If
>> >>>>>>>>>>>>>>>>>> you
>> >>>>>>>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and
>> >>> shard
>> >>>>> 5.
>> >>>>>>>>>>> Is
>> >>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>>>> possible that the hash is being calculated taking
>> into
>> >>>>>>>>>>> account only
>> >>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the
>> >>> numShards
>> >>>>>>>>>>> param @
>> >>>>>>>>>>>>>>>>>> startup
>> >>>>>>>>>>>>>>>>>>>> so could this be what is happening?
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> >>>>>>>>>>>>>>>>>>>> shard1-core1:0
>> >>>>>>>>>>>>>>>>>>>> shard1-core2:0
>> >>>>>>>>>>>>>>>>>>>> shard2-core1:0
>> >>>>>>>>>>>>>>>>>>>> shard2-core2:0
>> >>>>>>>>>>>>>>>>>>>> shard3-core1:1
>> >>>>>>>>>>>>>>>>>>>> shard3-core2:1
>> >>>>>>>>>>>>>>>>>>>> shard4-core1:0
>> >>>>>>>>>>>>>>>>>>>> shard4-core2:0
>> >>>>>>>>>>>>>>>>>>>> shard5-core1:1
>> >>>>>>>>>>>>>>>>>>>> shard5-core2:1
>> >>>>>>>>>>>>>>>>>>>> shard6-core1:0
>> >>>>>>>>>>>>>>>>>>>> shard6-core2:0
>> >>>>>>>>>>>>>>>>>>>>
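
For reference, the same cross-shard check as a one-liner over the exported
key files, assuming one key per line and looking at only one replica per
shard (a key is expected to appear in both replicas of the same shard):

sort shard?-core1 | uniq -d

Any key it prints appears on more than one shard (or more than once within
one, which should not happen for a uniqueKey field).
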
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>> >>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I
>> just
>> >>>>>>>>>>> indexed
>> >>>>>>>>>>>>>>>> 300,000
>> >>>>>>>>>>>>>>>>>>>>> items, and some how 300,020 ended up in the index.
>>  I
>> >>>>> thought
>> >>>>>>>>>>>>>>>> perhaps I
>> >>>>>>>>>>>>>>>>>>>>> messed something up so I started the indexing again
>> and
>> >>>>>>>>>>> indexed
>> >>>>>>>>>>>>>>>> another
>> >>>>>>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good
>> way to
>> >>>>> find
>> >>>>>>>>>>>>>>>> possibile
>> >>>>>>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id
>> field)
>> >>>>> but
>> >>>>>>>>>>> that
>> >>>>>>>>>>>>>>>> didn't
>> >>>>>>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
>> >>>>>>>>>>>>>>>>>>>>>
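
For reference, a facet query that should surface only the duplicated ids;
the host, core, and field names are taken from this thread, facet.mincount=2
hides every value that occurs once, and facet.limit=-1 over a uniqueKey
field is expensive, so this is for debugging only:

curl 'http://10.38.33.16:7575/solr/shard5-core1/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1&wt=json'
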
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>> >>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things
>> to
>> >>> go
>> >>>>>>>>>>> again.
>> >>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>> am
>> >>>>>>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the
>> >>>>> problem on
>> >>>>>>>>>>>>>> 4.2.0
>> >>>>>>>>>>>>>>>>>> and then
>> >>>>>>>>>>>>>>>>>>>>>> I'll try on 4.2.1
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>> >>>>>>>>>>>>>> markrmiller@gmail.com
>> >>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we
>> need to
>> >>>>> get
>> >>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>> bottom
>> >>>>>>>>>>>>>>>>>>>>>>> of it.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>> >>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Mark
>> >>>>>>>>>>>>>>>>>>>>>>>> It's there a particular jira issue that you think
>> >>> may
>> >>>>>>>>>>> address
>> >>>>>>>>>>>>>>>> this?
>> >>>>>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>>>>>>> read
>> >>>>>>>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped
>> >>> out
>> >>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
>> >>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it
>> did
>> >>>>>>>>>>> nothing.  I
>> >>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>>>>> clear
>> >>>>>>>>>>>>>>>>>>>>>>>>> the index and try4.2.1. I will save off the logs
>> >>> and
>> >>>>> see
>> >>>>>>>>>>> if
>> >>>>>>>>>>>>>> there
>> >>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>>>> anything else odd
>> >>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
>> >>>>>>>>>>> markrmiller@gmail.com>
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have
>> >>> said.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be
>> >>> best
>> >>>>> to
>> >>>>>>>>>>> start
>> >>>>>>>>>>>>>>>>>>>>>>> tracking in
>> >>>>>>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back
>> >>>>> again.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we
>> really
>> >>>>> need
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> get
>> >>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's
>> >>>>> fixed in
>> >>>>>>>>>>>>>> 4.2.1
>> >>>>>>>>>>>>>>>>>>>>>>> (spreading
>> >>>>>>>>>>>>>>>>>>>>>>>>>> to mirrors now).
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>> >>>>>>>>>>> jej2003@gmail.com
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is
>> >>> there
>> >>>>>>>>>>> anything
>> >>>>>>>>>>>>>>>> else
>> >>>>>>>>>>>>>>>>>>>>>>> that I
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?
>> >>> I'd
>> >>>>> be
>> >>>>>>>>>>> happy
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>> troll
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information
>> is
>> >>>>>>>>>>> needed, just
>> >>>>>>>>>>>>>>>> let
>> >>>>>>>>>>>>>>>>>> me
>> >>>>>>>>>>>>>>>>>>>>>>>>>> know.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to
>> >>> fix
>> >>>>>>>>>>> this.
>> >>>>>>>>>>>>>> Is it
>> >>>>>>>>>>>>>>>>>>>>>>>>>> required to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let
>> solr
>> >>>>> resync
>> >>>>>>>>>>>>>> things?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson
>> <
>> >>>>>>>>>>>>>>>> jej2003@gmail.com
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having
>> issues
>> >>>>>>>>>>> with...
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>> >>>>>>>>>>> org.apache.solr.common.SolrException
>> >>>>>>>>>>>>>>>> log
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>> >>>>>>>>>>>>>>>>>>>>>>>>>> :
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Server at
>> >>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>> >>>>>>>>>>>>>>>>>>>>>>> non
>> >>>>>>>>>>>>>>>>>>>>>>>>>> ok
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie
>> Johnson <
>> >>>>>>>>>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>> >>>>>>>>>>>>>> org.apache.solr.common.SolrException
>> >>>>>>>>>>>>>>>> log
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE:
>> org.apache.solr.common.SolrException:
>> >>>>>>>>>>> ClusterState
>> >>>>>>>>>>>>>>>> says
>> >>>>>>>>>>>>>>>>>>>>>>> we are
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie
>> Johnson <
>> >>>>>>>>>>>>>>>>>> jej2003@gmail.com
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some
>> >>> point
>> >>>>>>>>>>> there
>> >>>>>>>>>>>>>> were
>> >>>>>>>>>>>>>>>>>>>>>>> shards
>> >>>>>>>>>>>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is
>> >>>>> below.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>> >>>>>>>>>>>>>>>> state:SyncConnected
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes,
>> has
>> >>>>>>>>>>> occurred -
>> >>>>>>>>>>>>>>>>>>>>>>>>>> updating... (live
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> process
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the
>> >>> leader.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active,
>> it's
>> >>>>> okay
>> >>>>>>>>>>> to be
>> >>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> leader.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and
>> sync
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark
>> Miller <
>> >>>>>>>>>>>>>>>>>>>>>>> markrmiller@gmail.com
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are
>> thinking
>> >>> of
>> >>>>>>>>>>> apply
>> >>>>>>>>>>>>>> here.
>> >>>>>>>>>>>>>>>>>>>>>>> Peersync
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at
>> version
>> >>>>>>>>>>> numbers for
>> >>>>>>>>>>>>>>>>>>>>>>> updates in
>> >>>>>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last
>> 100 of
>> >>>>> them
>> >>>>>>>>>>> on
>> >>>>>>>>>>>>>>>> leader
>> >>>>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>> replica.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica
>> seems to
>> >>>>> have
>> >>>>>>>>>>>>>> versions
>> >>>>>>>>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>>>>>>>> the leader
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for
>> any
>> >>>>>>>>>>> interesting
>> >>>>>>>>>>>>>>>>>>>>>>> exceptions?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy
>> >>> indexing?
>> >>>>>>>>>>> Did
>> >>>>>>>>>>>>>> any zk
>> >>>>>>>>>>>>>>>>>>>>>>> session
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson
>> <
>> >>>>>>>>>>>>>>>> jej2003@gmail.com
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr
>> >>>>> cluster
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> 4.2
>> >>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>> noticed a
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.
>> >>>>> Specifically
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> replica
>> >>>>>>>>>>>>>>>>>>>>>>> has a
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> higher
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version than the master which is causing
>> the
>> >>>>>>>>>>> index to
>> >>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>>>>> replicate.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer
>> >>> documents
>> >>>>>>>>>>> than
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> master.
>> >>>>>>>>>>>>>>>>>>>>>>>>>> What
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it
>> >>>>> short of
>> >>>>>>>>>>>>>> taking
>> >>>>>>>>>>>>>>>>>>>>>>> down the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>
>> >>>
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>> >>>>>>>>>>> org.apache.solr.update.PeerSync
>> >>>>>>>>>>>>>>>> sync
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=
>> >>> http://10.38.33.17:7577/solrSTARTreplicas=[
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>> http://10.38.33.16:7575/solr/dsc-shard5-core1/
>> >>>>> ]
>> >>>>>>>>>>>>>>>>>> nUpdates=100
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>> >>>>>>>>>>> org.apache.solr.update.PeerSync
>> >>>>>>>>>>>>>>>>>>>>>>>>>> handleVersions
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> url=
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
>> >>>>>>>>>>>>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>> >>>>>>>>>>> org.apache.solr.update.PeerSync
>> >>>>>>>>>>>>>>>>>>>>>>>>>> handleVersions
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> url=
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> versions are newer.
>> >>>>>>>>>>> ourLowThreshold=1431233788792274944
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>> >>>>>>>>>>> org.apache.solr.update.PeerSync
>> >>>>>>>>>>>>>>>> sync
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE.
>> sync
>> >>>>>>>>>>> succeeded
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it
>> thinks it
>> >>>>> has a
>> >>>>>>>>>>>>>> newer
>> >>>>>>>>>>>>>>>>>>>>>>> version of
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while
>> >>>>> having 10
>> >>>>>>>>>>>>>> threads
>> >>>>>>>>>>>>>>>>>>>>>>> indexing
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 10,000
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica
>> each)
>> >>>>>>>>>>> cluster.
>> >>>>>>>>>>>>>> Any
>> >>>>>>>>>>>>>>>>>>>>>>> thoughts
>> >>>>>>>>>>>>>>>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or what I should look for would be
>> >>> appreciated.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> >>
>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
I have since removed the files but when I had looked there was an index
directory, the only files I remember being there were the segments, one of
the _* files were present.  I'll watch it to see if it happens again but it
happened on 2 of the shards while heavy indexing.
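
For reference, a quick way to answer the index vs. index.<timestamp>
question quoted below, using the data directory from the stack trace in this
thread (when an index.<timestamp> directory is in use, index.properties in
the data dir names it):

ls -ld /cce2/solr/data/dsc-shard3-core1/index*
cat /cce2/solr/data/dsc-shard3-core1/index.properties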


On Wed, Apr 3, 2013 at 10:13 PM, Mark Miller <ma...@gmail.com> wrote:

> Is that file still there when you look? Not being able to find an index
> file is not a common error I've seen recently.
>
> Do those replicas have an index directory or when you look on disk, is it
> an index.timestamp directory?
>
> - Mark
>
> On Apr 3, 2013, at 10:01 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > so something is still not right.  Things were going ok, but I'm seeing
> this
> > in the logs of several of the replicas
> >
> > SEVERE: Unable to create core: dsc-shard3-core1
> > org.apache.solr.common.SolrException: Error opening new searcher
> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:822)
> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
> >        at
> > org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:967)
> >        at
> > org.apache.solr.core.CoreContainer.create(CoreContainer.java:1049)
> >        at
> org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
> >        at
> org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
> >        at
> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >        at
> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.solr.common.SolrException: Error opening new
> searcher
> >        at
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1435)
> >        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1547)
> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
> >        ... 13 more
> > Caused by: org.apache.solr.common.SolrException: Error opening Reader
> >        at
> >
> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172)
> >        at
> >
> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:183)
> >        at
> >
> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:179)
> >        at
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1411)
> >        ... 15 more
> > Caused by: java.io.FileNotFoundException:
> > /cce2/solr/data/dsc-shard3-core1/index/_13x.si (No such file or
> directory)
> >        at java.io.RandomAccessFile.open(Native Method)
> >        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
> >        at
> > org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
> >        at
> >
> org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
> >        at
> >
> org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:50)
> >        at
> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:301)
> >        at
> >
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
> >        at
> >
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
> >        at
> >
> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
> >        at
> > org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
> >        at
> >
> org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
> >        at
> >
> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:169)
> >        ... 18 more
> >
> >
> >
> > On Wed, Apr 3, 2013 at 8:54 PM, Jamie Johnson <je...@gmail.com> wrote:
> >
> >> Thanks I will try that.
> >>
> >>
> >> On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>
> >>>
> >>>
> >>> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>>
> >>>> I am not using the concurrent low pause garbage collector, I could
> look
> >>> at
> >>>> switching, I'm assuming you're talking about adding
> >>> -XX:+UseConcMarkSweepGC
> >>>> correct?
> >>>
> >>> Right - if you don't do that, the default is almost always the
> throughput
> >>> collector (I've only seen OSX buck this trend when apple handled java).
> >>> That means stop the world garbage collections, so with larger heaps,
> that
> >>> can be a fair amount of time that no threads can run. It's not that
> great
> >>> for something as interactive as search generally is anyway, but it's
> always
> >>> not that great when added to heavy load and a 15 sec session timeout
> >>> between solr and zk.
> >>>
> >>>
> >>> The below is odd - a replica node is waiting for the leader to see it
> as
> >>> recovering and live - live means it has created an ephemeral node for
> that
> >>> Solr corecontainer in zk - it's very strange if that didn't happen,
> unless
> >>> this happened during shutdown or something.
> >>>
> >>>>
> >>>> I also just had a shard go down and am seeing this in the log
> >>>>
> >>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
> >>> state
> >>>> down for 10.38.33.17:7576_solr but I still do not see the requested
> >>> state.
> >>>> I see state: recovering live:false
> >>>>       at
> >>>>
> >>>
> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
> >>>>       at
> >>>>
> >>>
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
> >>>>       at
> >>>>
> >>>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>       at
> >>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
> >>>>       at
> >>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
> >>>>       at
> >>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
> >>>>       at
> >>>>
> >>>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >>>>       at
> >>>>
> >>>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >>>>       at
> >>>>
> >>>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> >>>>       at
> >>>>
> >>>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> >>>>       at
> >>>>
> >>>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> >>>>
> >>>> Nothing other than this in the log jumps out as interesting though.
> >>>>
> >>>>
> >>>> On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <ma...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> This shouldn't be a problem though, if things are working as they are
> >>>>> supposed to. Another node should simply take over as the overseer and
> >>>>> continue processing the work queue. It's just best if you configure
> so
> >>> that
> >>>>> session timeouts don't happen unless a node is really down. On the
> >>> other
> >>>>> hand, it's nicer to detect that faster. Your tradeoff to make.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Yeah. Are you using the concurrent low pause garbage collector?
> >>>>>>
> >>>>>> This means the overseer wasn't able to communicate with zk for 15
> >>>>> seconds - due to load or gc or whatever. If you can't resolve the
> root
> >>>>> cause of that, or the load just won't allow for it, next best thing
> >>> you can
> >>>>> do is raise it to 30 seconds.
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> I am occasionally seeing this in the log, is this just a timeout
> >>> issue?
> >>>>>>> Should I be increasing the zk client timeout?
> >>>>>>>
> >>>>>>> WARNING: Overseer cannot talk to ZK
> >>>>>>> Apr 3, 2013 11:14:25 PM
> >>>>>>> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
> >>>>>>> INFO: Watcher fired on path: null state: Expired type None
> >>>>>>> Apr 3, 2013 11:14:25 PM
> >>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater
> >>>>>>> run
> >>>>>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
> >>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>> KeeperErrorCode = Session expired for /overseer/queue
> >>>>>>>     at
> >>>>>>>
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>>>>     at
> >>>>>>>
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>>>>>     at
> >>> org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
> >>>>>>>     at
> >>>>>>>
> >>> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
> >>>>>>>     at
> >>>>>>>
> >>>>>
> >>>
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
> >>>>>>>     at java.lang.Thread.run(Thread.java:662)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> just an update, I'm at 1M records now with no issues.  This looks
> >>>>>>>> promising as to the cause of my issues, thanks for the help.  Is
> the
> >>>>>>>> routing method with numShards documented anywhere?  I know
> >>> numShards is
> >>>>>>>> documented but I didn't know that the routing changed if you don't
> >>>>> specify
> >>>>>>>> it.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> with these changes things are looking good, I'm up to 600,000
> >>>>> documents
> >>>>>>>>> without any issues as of right now.  I'll keep going and add more
> >>> to
> >>>>> see if
> >>>>>>>>> I find anything.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> ok, so that's not a deal breaker for me.  I just changed it to
> >>> match
> >>>>> the
> >>>>>>>>>> shards that are auto created and it looks like things are happy.
> >>>>> I'll go
> >>>>>>>>>> ahead and try my test to see if I can get things out of sync.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <
> >>> markrmiller@gmail.com
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I had thought you could - but looking at the code recently, I
> >>> don't
> >>>>>>>>>>> think you can anymore. I think that's a technical limitation
> more
> >>>>> than
> >>>>>>>>>>> anything though. When these changes were made, I think support
> >>> for
> >>>>> that was
> >>>>>>>>>>> simply not added at the time.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not sure exactly how straightforward it would be, but it
> >>> seems
> >>>>>>>>>>> doable - as it is, the overseer will preallocate shards when
> >>> first
> >>>>> creating
> >>>>>>>>>>> the collection - that's when they get named shard(n). There
> would
> >>>>> have to
> >>>>>>>>>>> be logic to replace shard(n) with the custom shard name when
> the
> >>>>> core
> >>>>>>>>>>> actually registers.
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> answered my own question, it now says compositeId.  What is
> >>>>>>>>>>> problematic
> >>>>>>>>>>>> though is that in addition to my shards (which are say
> >>>>> jamie-shard1)
> >>>>>>>>>>> I see
> >>>>>>>>>>>> the solr created shards (shard1).  I assume that these were
> >>> created
> >>>>>>>>>>> because
> >>>>>>>>>>>> of the numShards param.  Is there no way to specify the names
> of
> >>>>> these
> >>>>>>>>>>>> shards?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <
> >>> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> ah interesting....so I need to specify num shards, blow out
> zk
> >>> and
> >>>>>>>>>>> then
> >>>>>>>>>>>>> try this again to see if things work properly now.  What is
> >>> really
> >>>>>>>>>>> strange
> >>>>>>>>>>>>> is that for the most part things have worked right and on
> >>> 4.2.1 I
> >>>>>>>>>>> have
> >>>>>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I
> will
> >>>>>>>>>>> specify num
> >>>>>>>>>>>>> shards clear out zk and begin again.  If this works properly
> >>> what
> >>>>>>>>>>> should
> >>>>>>>>>>>>> the router type be?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
> >>>>> markrmiller@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> If you don't specify numShards after 4.1, you get an
> implicit
> >>> doc
> >>>>>>>>>>> router
> >>>>>>>>>>>>>> and it's up to you to distribute updates. In the past,
> >>>>> partitioning
> >>>>>>>>>>> was
> >>>>>>>>>>>>>> done on the fly - but for shard splitting and perhaps other
> >>>>>>>>>>> features, we
> >>>>>>>>>>>>>> now divvy up the hash range up front based on numShards and
> >>> store
> >>>>>>>>>>> it in
> >>>>>>>>>>>>>> ZooKeeper. No numShards is now how you take complete control
> >>> of
> >>>>>>>>>>> updates
> >>>>>>>>>>>>>> yourself.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - Mark
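
When numShards is supplied, the collection entry in clusterstate.json carries the router type
and a hash range per shard, roughly along these lines (the range values below are invented
for illustration):

  "collection1":{
    "router":"compositeId",
    "shards":{
      "shard1":{
        "range":"80000000-d554ffff",
        "state":"active",
        "replicas":{ ... }},
      ...}}

A quick way to inspect it is the ZooKeeper CLI, e.g.:

  zkCli.sh -server so-zoo1:2181 get /clusterstate.json
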
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <
> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The router says "implicit".  I did start from a blank zk
> >>> state
> >>>>> but
> >>>>>>>>>>>>>> perhaps
> >>>>>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from
> >>> the
> >>>>>>>>>>>>>>> clusterstate.json is shown below.  What is the process that
> >>>>> should
> >>>>>>>>>>> be
> >>>>>>>>>>>>>> done
> >>>>>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I
> listed
> >>>>>>>>>>> above?  My
> >>>>>>>>>>>>>>> process right now is run those ZkCLI commands and then
> start
> >>>>> solr
> >>>>>>>>>>> on
> >>>>>>>>>>>>>> all of
> >>>>>>>>>>>>>>> the instances with a command like this
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
> >>>>>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
> >>>>>>>>>>>>>> -Dcollection.configName=solr-conf
> >>>>>>>>>>>>>>> -Dcollection=collection1
> >>>>>>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
> >>>>>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I feel like maybe I'm missing a step.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "shard5":{
> >>>>>>>>>>>>>>>   "state":"active",
> >>>>>>>>>>>>>>>   "replicas":{
> >>>>>>>>>>>>>>>     "10.38.33.16:7575_solr_shard5-core1":{
> >>>>>>>>>>>>>>>       "shard":"shard5",
> >>>>>>>>>>>>>>>       "state":"active",
> >>>>>>>>>>>>>>>       "core":"shard5-core1",
> >>>>>>>>>>>>>>>       "collection":"collection1",
> >>>>>>>>>>>>>>>       "node_name":"10.38.33.16:7575_solr",
> >>>>>>>>>>>>>>>       "base_url":"http://10.38.33.16:7575/solr",
> >>>>>>>>>>>>>>>       "leader":"true"},
> >>>>>>>>>>>>>>>     "10.38.33.17:7577_solr_shard5-core2":{
> >>>>>>>>>>>>>>>       "shard":"shard5",
> >>>>>>>>>>>>>>>       "state":"recovering",
> >>>>>>>>>>>>>>>       "core":"shard5-core2",
> >>>>>>>>>>>>>>>       "collection":"collection1",
> >>>>>>>>>>>>>>>       "node_name":"10.38.33.17:7577_solr",
> >>>>>>>>>>>>>>>       "base_url":"http://10.38.33.17:7577/solr"}}}
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
> >>>>> markrmiller@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It should be part of your clusterstate.json. Some users
> have
> >>>>>>>>>>> reported
> >>>>>>>>>>>>>>>> trouble upgrading a previous zk install when this change
> >>> came.
> >>>>> I
> >>>>>>>>>>>>>>>> recommended manually updating the clusterstate.json to
> have
> >>> the
> >>>>>>>>>>> right
> >>>>>>>>>>>>>> info,
> >>>>>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to
> >>> start
> >>>>>>>>>>> from a
> >>>>>>>>>>>>>> clean
> >>>>>>>>>>>>>>>> zk state.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If you don't have that range information, I think there
> >>> will be
> >>>>>>>>>>>>>> trouble.
> >>>>>>>>>>>>>>>> Do you have a router type defined in the
> clusterstate.json?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <
> >>> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it
> in
> >>> the
> >>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do
> >>> when I
> >>>>>>>>>>> start
> >>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>> scratch is the following
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
> >>>>>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What
> >>>>> should
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> steps
> >>>>>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot
> >>> of
> >>>>> 4.0
> >>>>>>>>>>> so I
> >>>>>>>>>>>>>>>> never
> >>>>>>>>>>>>>>>>> did that previously either so perhaps I did create the
> >>>>>>>>>>> collection in
> >>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> my steps to get this working but have forgotten it along
> >>> the
> >>>>> way.
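
The step that is easy to miss with this kind of manual bootstrap is an explicit collection
creation with numShards, which is what gets the hash ranges written to ZooKeeper. With the
Collections API that would look something like the following (host, shard count and
replicationFactor are placeholders matching the setup described in this thread):

  curl 'http://10.38.33.16:7575/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2&collection.configName=solr-conf'

Passing -DnumShards=6 when the very first core starts up should have the same effect of
assigning the ranges up front.
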
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
> >>>>>>>>>>> markrmiller@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are
> >>> assigned up
> >>>>>>>>>>> front
> >>>>>>>>>>>>>>>> when a
> >>>>>>>>>>>>>>>>>> collection is created - each shard gets a range, which
> is
> >>>>>>>>>>> stored in
> >>>>>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the
> same
> >>> id
> >>>>> on
> >>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>> shards - something very odd going on.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you
> >>> reproduce.
> >>>>>>>>>>> Ideally
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>> can capture it in a test case.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <
> >>> jej2003@gmail.com
> >>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
> >>>>>>>>>>> parameter
> >>>>>>>>>>>>>> set I
> >>>>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it
> on
> >>>>> 4.2.0
> >>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> indexing
> >>>>>>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I
> get
> >>> to
> >>>>>>>>>>> 400,000
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>> so.
> >>>>>>>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same
> >>> behavior
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Since I don't have that many items in my index I
> >>> exported
> >>>>> all
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> keys
> >>>>>>>>>>>>>>>>>>>> for each shard and wrote a simple java program that
> >>> checks
> >>>>> for
> >>>>>>>>>>>>>>>>>> duplicates.
> >>>>>>>>>>>>>>>>>>>> I found some duplicate keys on different shards, a
> grep
> >>> of
> >>>>> the
> >>>>>>>>>>>>>> files
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>> the keys found does indicate that they made it to the
> >>> wrong
> >>>>>>>>>>> places.
> >>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and
> >>> shard
> >>>>> 5.
> >>>>>>>>>>> Is
> >>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>> possible that the hash is being calculated taking into
> >>>>>>>>>>> account only
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the
> >>> numShards
> >>>>>>>>>>> param @
> >>>>>>>>>>>>>>>>>> startup
> >>>>>>>>>>>>>>>>>>>> so could this be what is happening?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >>>>>>>>>>>>>>>>>>>> shard1-core1:0
> >>>>>>>>>>>>>>>>>>>> shard1-core2:0
> >>>>>>>>>>>>>>>>>>>> shard2-core1:0
> >>>>>>>>>>>>>>>>>>>> shard2-core2:0
> >>>>>>>>>>>>>>>>>>>> shard3-core1:1
> >>>>>>>>>>>>>>>>>>>> shard3-core2:1
> >>>>>>>>>>>>>>>>>>>> shard4-core1:0
> >>>>>>>>>>>>>>>>>>>> shard4-core2:0
> >>>>>>>>>>>>>>>>>>>> shard5-core1:1
> >>>>>>>>>>>>>>>>>>>> shard5-core2:1
> >>>>>>>>>>>>>>>>>>>> shard6-core1:0
> >>>>>>>>>>>>>>>>>>>> shard6-core2:0
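
For a quick duplicate check over exported key files like these (one key per line), sorting
the leader cores together and printing repeated lines does the same job as a custom program:

  sort shard1-core1 shard2-core1 shard3-core1 shard4-core1 shard5-core1 shard6-core1 | uniq -d

Only one core per shard is included so that normal leader/replica copies are not reported as
duplicates.
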
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I
> just
> >>>>>>>>>>> indexed
> >>>>>>>>>>>>>>>> 300,000
> >>>>>>>>>>>>>>>>>>>>> items, and somehow 300,020 ended up in the index.  I
> >>>>> thought
> >>>>>>>>>>>>>>>> perhaps I
> >>>>>>>>>>>>>>>>>>>>> messed something up so I started the indexing again
> and
> >>>>>>>>>>> indexed
> >>>>>>>>>>>>>>>> another
> >>>>>>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way
> to
> >>>>> find
> >>>>>>>>>>>>>> possible
> >>>>>>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id
> field)
> >>>>> but
> >>>>>>>>>>> that
> >>>>>>>>>>>>>>>> didn't
> >>>>>>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things
> to
> >>> go
> >>>>>>>>>>> again.
> >>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>> am
> >>>>>>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the
> >>>>> problem on
> >>>>>>>>>>>>>> 4.2.0
> >>>>>>>>>>>>>>>>>> and then
> >>>>>>>>>>>>>>>>>>>>>> I'll try on 4.2.1
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
> >>>>>>>>>>>>>> markrmiller@gmail.com
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> No, not that I know of, which is why I say we need
> to
> >>>>> get
> >>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>> bottom
> >>>>>>>>>>>>>>>>>>>>>>> of it.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>>>>>>>>>> Is there a particular jira issue that you think
> >>> may
> >>>>>>>>>>> address
> >>>>>>>>>>>>>>>> this?
> >>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>>> read
> >>>>>>>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped
> >>> out
> >>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
> >>>>>>>>>>> nothing.  I
> >>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>>>> clear
> >>>>>>>>>>>>>>>>>>>>>>>>> the index and try 4.2.1. I will save off the logs
> >>> and
> >>>>> see
> >>>>>>>>>>> if
> >>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>> anything else odd
> >>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
> >>>>>>>>>>> markrmiller@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have
> >>> said.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be
> >>> best
> >>>>> to
> >>>>>>>>>>> start
> >>>>>>>>>>>>>>>>>>>>>>> tracking in
> >>>>>>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back
> >>>>> again.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we
> really
> >>>>> need
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> get
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's
> >>>>> fixed in
> >>>>>>>>>>>>>> 4.2.1
> >>>>>>>>>>>>>>>>>>>>>>> (spreading
> >>>>>>>>>>>>>>>>>>>>>>>>>> to mirrors now).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is
> >>> there
> >>>>>>>>>>> anything
> >>>>>>>>>>>>>>>> else
> >>>>>>>>>>>>>>>>>>>>>>> that I
> >>>>>>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?
> >>> I'd
> >>>>> be
> >>>>>>>>>>> happy
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> troll
> >>>>>>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
> >>>>>>>>>>> needed, just
> >>>>>>>>>>>>>>>> let
> >>>>>>>>>>>>>>>>>> me
> >>>>>>>>>>>>>>>>>>>>>>>>>> know.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to
> >>> fix
> >>>>>>>>>>> this.
> >>>>>>>>>>>>>> Is it
> >>>>>>>>>>>>>>>>>>>>>>>>>> required to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr
> >>>>> resync
> >>>>>>>>>>>>>> things?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
> >>>>>>>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having
> issues
> >>>>>>>>>>> with...
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
> >>>>>>>>>>> org.apache.solr.common.SolrException
> >>>>>>>>>>>>>>>> log
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
> >>>>>>>>>>>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Server at
> >>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
> >>>>>>>>>>>>>>>>>>>>>>> non
> >>>>>>>>>>>>>>>>>>>>>>>>>> ok
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson
> <
> >>>>>>>>>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
> >>>>>>>>>>>>>> org.apache.solr.common.SolrException
> >>>>>>>>>>>>>>>> log
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
> >>>>>>>>>>> ClusterState
> >>>>>>>>>>>>>>>> says
> >>>>>>>>>>>>>>>>>>>>>>> we are
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie
> Johnson <
> >>>>>>>>>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some
> >>> point
> >>>>>>>>>>> there
> >>>>>>>>>>>>>> were
> >>>>>>>>>>>>>>>>>>>>>>> shards
> >>>>>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is
> >>>>> below.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
> >>>>>>>>>>>>>>>> state:SyncConnected
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes,
> has
> >>>>>>>>>>> occurred -
> >>>>>>>>>>>>>>>>>>>>>>>>>> updating... (live
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> process
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the
> >>> leader.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active,
> it's
> >>>>> okay
> >>>>>>>>>>> to be
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> leader.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller
> <
> >>>>>>>>>>>>>>>>>>>>>>> markrmiller@gmail.com
> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking
> >>> of
> >>>>>>>>>>> apply
> >>>>>>>>>>>>>> here.
> >>>>>>>>>>>>>>>>>>>>>>> Peersync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version
> >>>>>>>>>>> numbers for
> >>>>>>>>>>>>>>>>>>>>>>> updates in
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100
> of
> >>>>> them
> >>>>>>>>>>> on
> >>>>>>>>>>>>>>>> leader
> >>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>> replica.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems
> to
> >>>>> have
> >>>>>>>>>>>>>> versions
> >>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>>> the leader
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any
> >>>>>>>>>>> interesting
> >>>>>>>>>>>>>>>>>>>>>>> exceptions?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy
> >>> indexing?
> >>>>>>>>>>> Did
> >>>>>>>>>>>>>> any zk
> >>>>>>>>>>>>>>>>>>>>>>> session
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
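
A quick way to check both of those questions is to grep each node's log for the messages that
appear elsewhere in this thread, for example:

  grep -E 'Session expired|SessionExpiredException|runLeaderProcess|Overseer cannot talk to ZK' solr.log

The log file name depends on the logging setup; the point is to look for session expirations
and leader elections around the time of the heavy indexing.
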
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
> >>>>>>>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr
> >>>>> cluster
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> 4.2
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>> noticed a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.
> >>>>> Specifically
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>> replica
> >>>>>>>>>>>>>>>>>>>>>>> has a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> higher
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version than the master which is causing
> the
> >>>>>>>>>>> index to
> >>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>> replicate.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer
> >>> documents
> >>>>>>>>>>> than
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> master.
> >>>>>>>>>>>>>>>>>>>>>>>>>> What
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it
> >>>>> short of
> >>>>>>>>>>>>>> taking
> >>>>>>>>>>>>>>>>>>>>>>> down the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>
> >>>
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>>>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>>>>>>> sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=
> >>> http://10.38.33.17:7577/solrSTARTreplicas=[
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>> http://10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>> ]
> >>>>>>>>>>>>>>>>>> nUpdates=100
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>>>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>>>>>>>>>>>>>>>>> handleVersions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
> >>>>>>>>>>>>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>>>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>>>>>>>>>>>>>>>>> handleVersions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> versions are newer.
> >>>>>>>>>>> ourLowThreshold=1431233788792274944
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>>>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>>>>>>> sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE.
> sync
> >>>>>>>>>>> succeeded
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks
> it
> >>>>> has a
> >>>>>>>>>>>>>> newer
> >>>>>>>>>>>>>>>>>>>>>>> version of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while
> >>>>> having 10
> >>>>>>>>>>>>>> threads
> >>>>>>>>>>>>>>>>>>>>>>> indexing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 10,000
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica
> each)
> >>>>>>>>>>> cluster.
> >>>>>>>>>>>>>> Any
> >>>>>>>>>>>>>>>>>>>>>>> thoughts
> >>>>>>>>>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or what I should look for would be
> >>> appreciated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
Is that file still there when you look? Not being able to find an index file is not a common error I've seen recently.

Do those replicas have an index directory or when you look on disk, is it an index.timestamp directory?

- Mark
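
Checking that on disk is a directory listing of the core's data dir; after a full replication
recovery there is typically an index.<timestamp> directory plus an index.properties file
naming the one in use. Using the path from the stack trace below as an example:

  ls /cce2/solr/data/dsc-shard3-core1/
  cat /cce2/solr/data/dsc-shard3-core1/index.properties
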

On Apr 3, 2013, at 10:01 PM, Jamie Johnson <je...@gmail.com> wrote:

> so something is still not right.  Things were going ok, but I'm seeing this
> in the logs of several of the replicas
> 
> SEVERE: Unable to create core: dsc-shard3-core1
> org.apache.solr.common.SolrException: Error opening new searcher
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:822)
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
>        at
> org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:967)
>        at
> org.apache.solr.core.CoreContainer.create(CoreContainer.java:1049)
>        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
>        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
>        at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>        at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1435)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1547)
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
>        ... 13 more
> Caused by: org.apache.solr.common.SolrException: Error opening Reader
>        at
> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172)
>        at
> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:183)
>        at
> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:179)
>        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1411)
>        ... 15 more
> Caused by: java.io.FileNotFoundException:
> /cce2/solr/data/dsc-shard3-core1/index/_13x.si (No such file or directory)
>        at java.io.RandomAccessFile.open(Native Method)
>        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
>        at
> org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
>        at
> org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
>        at
> org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:50)
>        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:301)
>        at
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
>        at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
>        at
> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
>        at
> org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
>        at
> org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
>        at
> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:169)
>        ... 18 more
> 
> 
> 
> On Wed, Apr 3, 2013 at 8:54 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> Thanks I will try that.
>> 
>> 
>> On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <ma...@gmail.com> wrote:
>> 
>>> 
>>> 
>>> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> I am not using the concurrent low pause garbage collector, I could look
>>> at
>>>> switching, I'm assuming you're talking about adding
>>> -XX:+UseConcMarkSweepGC
>>>> correct?
>>> 
>>> Right - if you don't do that, the default is almost always the throughput
>>> collector (I've only seen OSX buck this trend when apple handled java).
>>> That means stop the world garbage collections, so with larger heaps, that
>>> can be a fair amount of time that no threads can run. It's not that great
>>> for something as interactive as search generally is anyway, but it's always
>>> not that great when added to heavy load and a 15 sec session timeout
>>> between solr and zk.
>>> 
>>> 
>>> The below is odd - a replica node is waiting for the leader to see it as
>>> recovering and live - live means it has created an ephemeral node for that
>>> Solr corecontainer in zk - it's very strange if that didn't happen, unless
>>> this happened during shutdown or something.
>>> 
>>>> 
>>>> I also just had a shard go down and am seeing this in the log
>>>> 
>>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
>>> state
>>>> down for 10.38.33.17:7576_solr but I still do not see the requested
>>> state.
>>>> I see state: recovering live:false
>>>>       at
>>>> 
>>> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
>>>>       at
>>>> 
>>> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
>>>>       at
>>>> 
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>       at
>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
>>>>       at
>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
>>>>       at
>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>>>>       at
>>>> 
>>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>>>>       at
>>>> 
>>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>>>>       at
>>>> 
>>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>>>>       at
>>>> 
>>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>>>>       at
>>>> 
>>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>>>> 
>>>> Nothing other than this in the log jumps out as interesting though.
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>> 
>>>>> This shouldn't be a problem though, if things are working as they are
>>>>> supposed to. Another node should simply take over as the overseer and
>>>>> continue processing the work queue. It's just best if you configure so
>>> that
>>>>> session timeouts don't happen unless a node is really down. On the
>>> other
>>>>> hand, it's nicer to detect that faster. Your tradeoff to make.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>> 
>>>>>> Yeah. Are you using the concurrent low pause garbage collector?
>>>>>> 
>>>>>> This means the overseer wasn't able to communicate with zk for 15
>>>>> seconds - due to load or gc or whatever. If you can't resolve the root
>>>>> cause of that, or the load just won't allow for it, next best thing
>>> you can
>>>>> do is raise it to 30 seconds.
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>> 
>>>>>>> I am occasionally seeing this in the log, is this just a timeout
>>> issue?
>>>>>>> Should I be increasing the zk client timeout?
>>>>>>> 
>>>>>>> WARNING: Overseer cannot talk to ZK
>>>>>>> Apr 3, 2013 11:14:25 PM
>>>>>>> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
>>>>>>> INFO: Watcher fired on path: null state: Expired type None
>>>>>>> Apr 3, 2013 11:14:25 PM
>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater
>>>>>>> run
>>>>>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>> KeeperErrorCode = Session expired for /overseer/queue
>>>>>>>     at
>>>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>>>>     at
>>>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>>     at
>>> org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>>>>>>>     at
>>>>>>> 
>>> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>>>>>>>     at
>>>>>>> 
>>>>> 
>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>>>>>>>     at java.lang.Thread.run(Thread.java:662)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> just an update, I'm at 1M records now with no issues.  This looks
>>>>>>>> promising as to the cause of my issues, thanks for the help.  Is the
>>>>>>>> routing method with numShards documented anywhere?  I know
>>> numShards is
>>>>>>>> documented but I didn't know that the routing changed if you don't
>>>>> specify
>>>>>>>> it.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> with these changes things are looking good, I'm up to 600,000
>>>>> documents
>>>>>>>>> without any issues as of right now.  I'll keep going and add more
>>> to
>>>>> see if
>>>>>>>>> I find anything.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> ok, so that's not a deal breaker for me.  I just changed it to
>>> match
>>>>> the
>>>>>>>>>> shards that are auto created and it looks like things are happy.
>>>>> I'll go
>>>>>>>>>> ahead and try my test to see if I can get things out of sync.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <
>>> markrmiller@gmail.com
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I had thought you could - but looking at the code recently, I
>>> don't
>>>>>>>>>>> think you can anymore. I think that's a technical limitation more
>>>>> than
>>>>>>>>>>> anything though. When these changes were made, I think support
>>> for
>>>>> that was
>>>>>>>>>>> simply not added at the time.
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure exactly how straightforward it would be, but it
>>> seems
>>>>>>>>>>> doable - as it is, the overseer will preallocate shards when
>>> first
>>>>> creating
>>>>>>>>>>> the collection - that's when they get named shard(n). There would
>>>>> have to
>>>>>>>>>>> be logic to replace shard(n) with the custom shard name when the
>>>>> core
>>>>>>>>>>> actually registers.
>>>>>>>>>>> 
>>>>>>>>>>> - Mark
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> answered my own question, it now says compositeId.  What is
>>>>>>>>>>> problematic
>>>>>>>>>>>> though is that in addition to my shards (which are say
>>>>> jamie-shard1)
>>>>>>>>>>> I see
>>>>>>>>>>>> the solr created shards (shard1).  I assume that these were
>>> created
>>>>>>>>>>> because
>>>>>>>>>>>> of the numShards param.  Is there no way to specify the names of
>>>>> these
>>>>>>>>>>>> shards?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <
>>> jej2003@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> ah interesting....so I need to specify num shards, blow out zk
>>> and
>>>>>>>>>>> then
>>>>>>>>>>>>> try this again to see if things work properly now.  What is
>>> really
>>>>>>>>>>> strange
>>>>>>>>>>>>> is that for the most part things have worked right and on
>>> 4.2.1 I
>>>>>>>>>>> have
>>>>>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
>>>>>>>>>>> specify num
>>>>>>>>>>>>> shards clear out zk and begin again.  If this works properly
>>> what
>>>>>>>>>>> should
>>>>>>>>>>>>> the router type be?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
>>>>> markrmiller@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If you don't specify numShards after 4.1, you get an implicit
>>> doc
>>>>>>>>>>> router
>>>>>>>>>>>>>> and it's up to you to distribute updates. In the past,
>>>>> partitioning
>>>>>>>>>>> was
>>>>>>>>>>>>>> done on the fly - but for shard splitting and perhaps other
>>>>>>>>>>> features, we
>>>>>>>>>>>>>> now divvy up the hash range up front based on numShards and
>>> store
>>>>>>>>>>> it in
>>>>>>>>>>>>>> ZooKeeper. No numShards is now how you take complete control
>>> of
>>>>>>>>>>> updates
>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The router says "implicit".  I did start from a blank zk
>>> state
>>>>> but
>>>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from
>>> the
>>>>>>>>>>>>>>> clusterstate.json is shown below.  What is the process that
>>>>> should
>>>>>>>>>>> be
>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
>>>>>>>>>>> above?  My
>>>>>>>>>>>>>>> process right now is run those ZkCLI commands and then start
>>>>> solr
>>>>>>>>>>> on
>>>>>>>>>>>>>> all of
>>>>>>>>>>>>>>> the instances with a command like this
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>>>>>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>>>>>>>>>>>>>> -Dcollection.configName=solr-conf
>>>>>>>>>>>>>>> -Dcollection=collection1
>>>>>>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>>>>>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I feel like maybe I'm missing a step.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> "shard5":{
>>>>>>>>>>>>>>>   "state":"active",
>>>>>>>>>>>>>>>   "replicas":{
>>>>>>>>>>>>>>>     "10.38.33.16:7575_solr_shard5-core1":{
>>>>>>>>>>>>>>>       "shard":"shard5",
>>>>>>>>>>>>>>>       "state":"active",
>>>>>>>>>>>>>>>       "core":"shard5-core1",
>>>>>>>>>>>>>>>       "collection":"collection1",
>>>>>>>>>>>>>>>       "node_name":"10.38.33.16:7575_solr",
>>>>>>>>>>>>>>>       "base_url":"http://10.38.33.16:7575/solr",
>>>>>>>>>>>>>>>       "leader":"true"},
>>>>>>>>>>>>>>>     "10.38.33.17:7577_solr_shard5-core2":{
>>>>>>>>>>>>>>>       "shard":"shard5",
>>>>>>>>>>>>>>>       "state":"recovering",
>>>>>>>>>>>>>>>       "core":"shard5-core2",
>>>>>>>>>>>>>>>       "collection":"collection1",
>>>>>>>>>>>>>>>       "node_name":"10.38.33.17:7577_solr",
>>>>>>>>>>>>>>>       "base_url":"http://10.38.33.17:7577/solr"}}}
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
>>>>> markrmiller@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It should be part of your clusterstate.json. Some users have
>>>>>>>>>>> reported
>>>>>>>>>>>>>>>> trouble upgrading a previous zk install when this change
>>> came.
>>>>> I
>>>>>>>>>>>>>>>> recommended manually updating the clusterstate.json to have
>>> the
>>>>>>>>>>> right
>>>>>>>>>>>>>> info,
>>>>>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to
>>> start
>>>>>>>>>>> from a
>>>>>>>>>>>>>> clean
>>>>>>>>>>>>>>>> zk state.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If you don't have that range information, I think there
>>> will be
>>>>>>>>>>>>>> trouble.
>>>>>>>>>>>>>>>> Do you have a router type defined in the clusterstate.json?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <
>>> jej2003@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in
>>> the
>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do
>>> when I
>>>>>>>>>>> start
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> scratch is the following
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
>>>>>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What
>>>>> should
>>>>>>>>>>> the
>>>>>>>>>>>>>> steps
>>>>>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot
>>> of
>>>>> 4.0
>>>>>>>>>>> so I
>>>>>>>>>>>>>>>> never
>>>>>>>>>>>>>>>>> did that previously either so perhaps I did create the
>>>>>>>>>>> collection in
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> my steps to get this working but have forgotten it along
>>> the
>>>>> way.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>>>>>>>>>>> markrmiller@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are
>>> assigned up
>>>>>>>>>>> front
>>>>>>>>>>>>>>>> when a
>>>>>>>>>>>>>>>>>> collection is created - each shard gets a range, which is
>>>>>>>>>>> stored in
>>>>>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the same
>>> id
>>>>> on
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>> shards - something very odd going on.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you
>>> reproduce.
>>>>>>>>>>> Ideally
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> can capture it in a test case.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <
>>> jej2003@gmail.com
>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
>>>>>>>>>>> parameter
>>>>>>>>>>>>>> set I
>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on
>>>>> 4.2.0
>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get
>>> to
>>>>>>>>>>> 400,000
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> so.
>>>>>>>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same
>>> behavior
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Since I don't have that many items in my index I
>>> exported
>>>>> all
>>>>>>>>>>> of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> keys
>>>>>>>>>>>>>>>>>>>> for each shard and wrote a simple java program that
>>> checks
>>>>> for
>>>>>>>>>>>>>>>>>> duplicates.
>>>>>>>>>>>>>>>>>>>> I found some duplicate keys on different shards, a grep
>>> of
>>>>> the
>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> the keys found does indicate that they made it to the
>>> wrong
>>>>>>>>>>> places.
>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and
>>> shard
>>>>> 5.
>>>>>>>>>>> Is
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> possible that the hash is being calculated taking into
>>>>>>>>>>> account only
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the
>>> numShards
>>>>>>>>>>> param @
>>>>>>>>>>>>>>>>>> startup
>>>>>>>>>>>>>>>>>>>> so could this be what is happening?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>>>>>>>>>>>>>>>>>> shard1-core1:0
>>>>>>>>>>>>>>>>>>>> shard1-core2:0
>>>>>>>>>>>>>>>>>>>> shard2-core1:0
>>>>>>>>>>>>>>>>>>>> shard2-core2:0
>>>>>>>>>>>>>>>>>>>> shard3-core1:1
>>>>>>>>>>>>>>>>>>>> shard3-core2:1
>>>>>>>>>>>>>>>>>>>> shard4-core1:0
>>>>>>>>>>>>>>>>>>>> shard4-core2:0
>>>>>>>>>>>>>>>>>>>> shard5-core1:1
>>>>>>>>>>>>>>>>>>>> shard5-core2:1
>>>>>>>>>>>>>>>>>>>> shard6-core1:0
>>>>>>>>>>>>>>>>>>>> shard6-core2:0
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just
>>>>>>>>>>> indexed
>>>>>>>>>>>>>>>> 300,000
>>>>>>>>>>>>>>>>>>>>> items, and somehow 300,020 ended up in the index.  I
>>>>> thought
>>>>>>>>>>>>>>>> perhaps I
>>>>>>>>>>>>>>>>>>>>> messed something up so I started the indexing again and
>>>>>>>>>>> indexed
>>>>>>>>>>>>>>>> another
>>>>>>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to
>>>>> find
>>>>>>>>>>>>>> possible
>>>>>>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field)
>>>>> but
>>>>>>>>>>> that
>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to
>>> go
>>>>>>>>>>> again.
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the
>>>>> problem on
>>>>>>>>>>>>>> 4.2.0
>>>>>>>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>>>>>>>> I'll try on 4.2.1
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>>>>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we need to
>>>>> get
>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>> bottom
>>>>>>>>>>>>>>>>>>>>>>> of it.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>>>>>>> Is there a particular jira issue that you think
>>> may
>>>>>>>>>>> address
>>>>>>>>>>>>>>>> this?
>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>> read
>>>>>>>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped
>>> out
>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
>>>>>>>>>>> nothing.  I
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>> clear
>>>>>>>>>>>>>>>>>>>>>>>>> the index and try 4.2.1. I will save off the logs
>>> and
>>>>> see
>>>>>>>>>>> if
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>> anything else odd
>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
>>>>>>>>>>> markrmiller@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have
>>> said.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be
>>> best
>>>>> to
>>>>>>>>>>> start
>>>>>>>>>>>>>>>>>>>>>>> tracking in
>>>>>>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back
>>>>> again.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really
>>>>> need
>>>>>>>>>>> to
>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's
>>>>> fixed in
>>>>>>>>>>>>>> 4.2.1
>>>>>>>>>>>>>>>>>>>>>>> (spreading
>>>>>>>>>>>>>>>>>>>>>>>>>> to mirrors now).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is
>>> there
>>>>>>>>>>> anything
>>>>>>>>>>>>>>>> else
>>>>>>>>>>>>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?
>>> I'd
>>>>> be
>>>>>>>>>>> happy
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> troll
>>>>>>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
>>>>>>>>>>> needed, just
>>>>>>>>>>>>>>>> let
>>>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to
>>> fix
>>>>>>>>>>> this.
>>>>>>>>>>>>>> Is it
>>>>>>>>>>>>>>>>>>>>>>>>>> required to
>>>>>>>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr
>>>>> resync
>>>>>>>>>>>>>> things?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>>>>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues
>>>>>>>>>>> with...
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>>>>>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Server at
>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>>>>>>>>>>>>>>>>>>>>>>> non
>>>>>>>>>>>>>>>>>>>>>>>>>> ok
>>>>>>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>>>>>>>>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>>>>>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
>>>>>>>>>>> ClusterState
>>>>>>>>>>>>>>>> says
>>>>>>>>>>>>>>>>>>>>>>> we are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>>>>>>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some
>>> point
>>>>>>>>>>> there
>>>>>>>>>>>>>> were
>>>>>>>>>>>>>>>>>>>>>>> shards
>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is
>>>>> below.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>>>>>>>>>>>>>>>> state:SyncConnected
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has
>>>>>>>>>>> occurred -
>>>>>>>>>>>>>>>>>>>>>>>>>> updating... (live
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the
>>> leader.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's
>>>>> okay
>>>>>>>>>>> to be
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>>>>>>>>>>>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking
>>> of
>>>>>>>>>>> apply
>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>>>>>> Peersync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version
>>>>>>>>>>> numbers for
>>>>>>>>>>>>>>>>>>>>>>> updates in
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of
>>>>> them
>>>>>>>>>>> on
>>>>>>>>>>>>>>>> leader
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> replica.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to
>>>>> have
>>>>>>>>>>>>>> versions
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>> the leader
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any
>>>>>>>>>>> interesting
>>>>>>>>>>>>>>>>>>>>>>> exceptions?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy
>>> indexing?
>>>>>>>>>>> Did
>>>>>>>>>>>>>> any zk
>>>>>>>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>>>>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr
>>>>> cluster
>>>>>>>>>>> to
>>>>>>>>>>>>>> 4.2
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> noticed a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.
>>>>> Specifically
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> replica
>>>>>>>>>>>>>>>>>>>>>>> has a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> higher
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version than the master which is causing the
>>>>>>>>>>> index to
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>> replicate.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer
>>> documents
>>>>>>>>>>> than
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> master.
>>>>>>>>>>>>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it
>>>>> short of
>>>>>>>>>>>>>> taking
>>>>>>>>>>>>>>>>>>>>>>> down the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> 
>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>> sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=
>>> http://10.38.33.17:7577/solrSTARTreplicas=[
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>> ]
>>>>>>>>>>>>>>>>>> nUpdates=100
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
>>>>>>>>>>>>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> versions are newer.
>>>>>>>>>>> ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>> sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync
>>>>>>>>>>> succeeded
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it
>>>>> has a
>>>>>>>>>>>>>> newer
>>>>>>>>>>>>>>>>>>>>>>> version of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while
>>>>> having 10
>>>>>>>>>>>>>> threads
>>>>>>>>>>>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 10,000
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each)
>>>>>>>>>>> cluster.
>>>>>>>>>>>>>> Any
>>>>>>>>>>>>>>>>>>>>>>> thoughts
>>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or what I should look for would be
>>> appreciated.


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
so something is still not right.  Things were going ok, but I'm seeing this
in the logs of several of the replicas

SEVERE: Unable to create core: dsc-shard3-core1
org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:822)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
        at
org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:967)
        at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:1049)
        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1435)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1547)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
        ... 13 more
Caused by: org.apache.solr.common.SolrException: Error opening Reader
        at
org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172)
        at
org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:183)
        at
org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:179)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1411)
        ... 15 more
Caused by: java.io.FileNotFoundException:
/cce2/solr/data/dsc-shard3-core1/index/_13x.si (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
        at
org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
        at
org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
        at
org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:50)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:301)
        at
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
        at
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
        at
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
        at
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
        at
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
        at
org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:169)
        ... 18 more
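
One way to confirm from the shell whether that index really is missing segment
files is Lucene's CheckIndex tool - a sketch, assuming the lucene-core jar that
ships with Solr 4.2 is on the classpath, and best run against a copy of the
directory or with the core unloaded:

java -cp lucene-core-4.2.0.jar org.apache.lucene.index.CheckIndex \
     /cce2/solr/data/dsc-shard3-core1/index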



On Wed, Apr 3, 2013 at 8:54 PM, Jamie Johnson <je...@gmail.com> wrote:

> Thanks I will try that.
>
>
> On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <ma...@gmail.com> wrote:
>
>>
>>
>> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>> > I am not using the concurrent low pause garbage collector. I could look at
>> > switching; I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC,
>> > correct?
>>
>> Right - if you don't do that, the default is almost always the throughput
>> collector (I've only seen OS X buck this trend, back when Apple handled Java).
>> That means stop-the-world garbage collections, so with larger heaps that can
>> be a fair amount of time during which no threads can run. That's not great for
>> something as interactive as search anyway, and it's especially bad when added
>> to heavy load and a 15 sec session timeout between solr and zk.
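
A concrete sketch of the two knobs being discussed here - switching to CMS and
raising the zk session timeout - using the same style of start command that
appears later in this thread. The -DzkClientTimeout property assumes the stock
4.x solr.xml, which defaults it to 15000 ms:

java -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -DzkClientTimeout=30000 \
     -Dshard=shard5 -DcoreName=shard5-core1 \
     -Dsolr.data.dir=/solr/data/shard5-core1 \
     -Dcollection.configName=solr-conf -Dcollection=collection1 \
     -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
     -Djetty.port=7575 -DhostPort=7575 -jar start.jar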
>>
>>
>> The below is odd - a replica node is waiting for the leader to see it as
>> recovering and live - live means it has created an ephemeral node for that
>> Solr corecontainer in zk - it's very strange if that didn't happen, unless
>> this happened during shutdown or something.
>>
>> >
>> > I also just had a shard go down and am seeing this in the log
>> >
>> > SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
>> state
>> > down for 10.38.33.17:7576_solr but I still do not see the requested
>> state.
>> > I see state: recovering live:false
>> >        at
>> >
>> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
>> >        at
>> >
>> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
>> >        at
>> >
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >        at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
>> >        at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
>> >        at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>> >        at
>> >
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>> >        at
>> >
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>> >        at
>> >
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>> >        at
>> >
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>> >        at
>> >
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>> >
>> > Nothing other than this in the log jumps out as interesting though.
>> >
>> >
>> > On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >
>> >> This shouldn't be a problem though, if things are working as they are
>> >> supposed to. Another node should simply take over as the overseer and
>> >> continue processing the work queue. It's just best if you configure so
>> that
>> >> session timeouts don't happen unless a node is really down. On the
>> other
>> >> hand, it's nicer to detect that faster. Your tradeoff to make.
>> >>
>> >> - Mark
>> >>
>> >> On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com> wrote:
>> >>
>> >>> Yeah. Are you using the concurrent low pause garbage collector?
>> >>>
>> >>> This means the overseer wasn't able to communicate with zk for 15
>> >> seconds - due to load or gc or whatever. If you can't resolve the root
>> >> cause of that, or the load just won't allow for it, next best thing
>> you can
>> >> do is raise it to 30 seconds.
>> >>>
>> >>> - Mark
>> >>>
>> >>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com> wrote:
>> >>>
>> >>>> I am occasionally seeing this in the log, is this just a timeout
>> issue?
>> >>>> Should I be increasing the zk client timeout?
>> >>>>
>> >>>> WARNING: Overseer cannot talk to ZK
>> >>>> Apr 3, 2013 11:14:25 PM
>> >>>> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
>> >>>> INFO: Watcher fired on path: null state: Expired type None
>> >>>> Apr 3, 2013 11:14:25 PM
>> >> org.apache.solr.cloud.Overseer$ClusterStateUpdater
>> >>>> run
>> >>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
>> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> >>>> KeeperErrorCode = Session expired for /overseer/queue
>> >>>>      at
>> >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>> >>>>      at
>> >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> >>>>      at
>> org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>> >>>>      at
>> >>>>
>> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>> >>>>      at
>> >>>>
>> >>
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>> >>>>      at java.lang.Thread.run(Thread.java:662)
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com>
>> >> wrote:
>> >>>>
>> >>>>> just an update, I'm at 1M records now with no issues.  This looks
>> >>>>> promising as to the cause of my issues, thanks for the help.  Is the
>> >>>>> routing method with numShards documented anywhere?  I know
>> numShards is
>> >>>>> documented but I didn't know that the routing changed if you don't
>> >> specify
>> >>>>> it.
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com>
>> >> wrote:
>> >>>>>
>> >>>>>> with these changes things are looking good, I'm up to 600,000
>> >> documents
>> >>>>>> without any issues as of right now.  I'll keep going and add more
>> to
>> >> see if
>> >>>>>> I find anything.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com>
>> >> wrote:
>> >>>>>>
>> >>>>>>> ok, so that's not a deal breaker for me.  I just changed it to
>> match
>> >> the
>> >>>>>>> shards that are auto created and it looks like things are happy.
>> >> I'll go
>> >>>>>>> ahead and try my test to see if I can get things out of sync.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <
>> markrmiller@gmail.com
>> >>> wrote:
>> >>>>>>>
>> >>>>>>>> I had thought you could - but looking at the code recently, I
>> don't
>> >>>>>>>> think you can anymore. I think that's a technical limitation more
>> >> than
>> >>>>>>>> anything though. When these changes were made, I think support
>> for
>> >> that was
>> >>>>>>>> simply not added at the time.
>> >>>>>>>>
>> >>>>>>>> I'm not sure exactly how straightforward it would be, but it
>> seems
>> >>>>>>>> doable - as it is, the overseer will preallocate shards when
>> first
>> >> creating
>> >>>>>>>> the collection - that's when they get named shard(n). There would
>> >> have to
>> >>>>>>>> be logic to replace shard(n) with the custom shard name when the
>> >> core
>> >>>>>>>> actually registers.
>> >>>>>>>>
>> >>>>>>>> - Mark
>> >>>>>>>>
>> >>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com>
>> >> wrote:
>> >>>>>>>>
>> >>>>>>>>> answered my own question, it now says compositeId.  What is
>> >>>>>>>> problematic
>> >>>>>>>>> though is that in addition to my shards (which are say
>> >> jamie-shard1)
>> >>>>>>>> I see
>> >>>>>>>>> the solr created shards (shard1).  I assume that these were
>> created
>> >>>>>>>> because
>> >>>>>>>>> of the numShards param.  Is there no way to specify the names of
>> >> these
>> >>>>>>>>> shards?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <
>> jej2003@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> ah interesting....so I need to specify num shards, blow out zk
>> and
>> >>>>>>>> then
>> >>>>>>>>>> try this again to see if things work properly now.  What is
>> really
>> >>>>>>>> strange
>> >>>>>>>>>> is that for the most part things have worked right and on
>> 4.2.1 I
>> >>>>>>>> have
>> >>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
>> >>>>>>>> specify num
>> >>>>>>>>>> shards clear out zk and begin again.  If this works properly
>> what
>> >>>>>>>> should
>> >>>>>>>>>> the router type be?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
>> >> markrmiller@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> If you don't specify numShards after 4.1, you get an implicit
>> doc
>> >>>>>>>> router
>> >>>>>>>>>>> and it's up to you to distribute updates. In the past,
>> >> partitioning
>> >>>>>>>> was
>> >>>>>>>>>>> done on the fly - but for shard splitting and perhaps other
>> >>>>>>>> features, we
>> >>>>>>>>>>> now divvy up the hash range up front based on numShards and
>> store
>> >>>>>>>> it in
>> >>>>>>>>>>> ZooKeeper. No numShards is now how you take complete control
>> of
>> >>>>>>>> updates
>> >>>>>>>>>>> yourself.
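
For context, the numShards value only matters the first time the collection is
created. With the style of startup used elsewhere in this thread, getting the
compositeId router would look roughly like the sketch below on the first node;
the explicit -Dshard assignment is left out because the pre-allocated
shard1..shard6 names are what numShards sets up:

java -server -DnumShards=6 -DcoreName=shard1-core1 \
     -Dsolr.data.dir=/solr/data/shard1-core1 \
     -Dcollection.configName=solr-conf -Dcollection=collection1 \
     -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
     -Djetty.port=7575 -DhostPort=7575 -jar start.jar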
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> The router says "implicit".  I did start from a blank zk
>> state
>> >> but
>> >>>>>>>>>>> perhaps
>> >>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from
>> the
>> >>>>>>>>>>>> clusterstate.json is shown below.  What is the process that
>> >> should
>> >>>>>>>> be
>> >>>>>>>>>>> done
>> >>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
>> >>>>>>>> above?  My
>> >>>>>>>>>>>> process right now is run those ZkCLI commands and then start
>> >> solr
>> >>>>>>>> on
>> >>>>>>>>>>> all of
>> >>>>>>>>>>>> the instances with a command like this
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>> >>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>> >>>>>>>>>>> -Dcollection.configName=solr-conf
>> >>>>>>>>>>>> -Dcollection=collection1
>> >>>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>> >>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I feel like maybe I'm missing a step.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> "shard5":{
>> >>>>>>>>>>>>    "state":"active",
>> >>>>>>>>>>>>    "replicas":{
>> >>>>>>>>>>>>      "10.38.33.16:7575_solr_shard5-core1":{
>> >>>>>>>>>>>>        "shard":"shard5",
>> >>>>>>>>>>>>        "state":"active",
>> >>>>>>>>>>>>        "core":"shard5-core1",
>> >>>>>>>>>>>>        "collection":"collection1",
>> >>>>>>>>>>>>        "node_name":"10.38.33.16:7575_solr",
>> >>>>>>>>>>>>        "base_url":"http://10.38.33.16:7575/solr",
>> >>>>>>>>>>>>        "leader":"true"},
>> >>>>>>>>>>>>      "10.38.33.17:7577_solr_shard5-core2":{
>> >>>>>>>>>>>>        "shard":"shard5",
>> >>>>>>>>>>>>        "state":"recovering",
>> >>>>>>>>>>>>        "core":"shard5-core2",
>> >>>>>>>>>>>>        "collection":"collection1",
>> >>>>>>>>>>>>        "node_name":"10.38.33.17:7577_solr",
>> >>>>>>>>>>>>        "base_url":"http://10.38.33.17:7577/solr"}}}
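
For contrast with the entry above, a collection created with numShards carries a
hash range on each shard and a compositeId router at the collection level in
clusterstate.json - roughly like this sketch (the range value is illustrative,
not taken from this cluster):

"shard5":{
    "range":"2aaa0000-554fffff",
    "state":"active",
    "replicas":{ ... }},
 ...
"router":"compositeId"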
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
>> >> markrmiller@gmail.com
>> >>>>>>>>>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> It should be part of your clusterstate.json. Some users have
>> >>>>>>>> reported
>> >>>>>>>>>>>>> trouble upgrading a previous zk install when this change
>> came.
>> >> I
>> >>>>>>>>>>>>> recommended manually updating the clusterstate.json to have
>> the
>> >>>>>>>> right
>> >>>>>>>>>>> info,
>> >>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to
>> start
>> >>>>>>>> from a
>> >>>>>>>>>>> clean
>> >>>>>>>>>>>>> zk state.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> If you don't have that range information, I think there
>> will be
>> >>>>>>>>>>> trouble.
>> >>>>>>>>>>>>> Do you have a router type defined in the clusterstate.json?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <
>> jej2003@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in
>> the
>> >>>>>>>> cluster
>> >>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do
>> when I
>> >>>>>>>> start
>> >>>>>>>>>>> from
>> >>>>>>>>>>>>>> scratch is the following
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
>> >>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
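
Spelled out, those two ZkCLI steps usually look something like the lines below -
a sketch, where the classpath and local conf directory are placeholders, while
the zkhost, confname and collection values match the ones used elsewhere in this
thread:

java -classpath "example/solr-webapp/webapp/WEB-INF/lib/*" \
     org.apache.solr.cloud.ZkCLI -cmd upconfig \
     -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
     -confdir /path/to/solr-conf/conf -confname solr-conf

java -classpath "example/solr-webapp/webapp/WEB-INF/lib/*" \
     org.apache.solr.cloud.ZkCLI -cmd linkconfig \
     -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
     -collection collection1 -confname solr-conf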
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What
>> >> should
>> >>>>>>>> the
>> >>>>>>>>>>> steps
>> >>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot
>> of
>> >> 4.0
>> >>>>>>>> so I
>> >>>>>>>>>>>>> never
>> >>>>>>>>>>>>>> did that previously either so perhaps I did create the
>> >>>>>>>> collection in
>> >>>>>>>>>>> one
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>>> my steps to get this working but have forgotten it along
>> the
>> >> way.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>> >>>>>>>> markrmiller@gmail.com>
>> >>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are
>> assigned up
>> >>>>>>>> front
>> >>>>>>>>>>>>> when a
>> >>>>>>>>>>>>>>> collection is created - each shard gets a range, which is
>> >>>>>>>> stored in
>> >>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the same
>> id
>> >> on
>> >>>>>>>>>>>>> different
>> >>>>>>>>>>>>>>> shards - something very odd going on.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you
>> reproduce.
>> >>>>>>>> Ideally
>> >>>>>>>>>>> we
>> >>>>>>>>>>>>>>> can capture it in a test case.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <
>> jej2003@gmail.com
>> >>>
>> >>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
>> >>>>>>>> parameter
>> >>>>>>>>>>> set I
>> >>>>>>>>>>>>>>> am
>> >>>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on
>> >> 4.2.0
>> >>>>>>>> by
>> >>>>>>>>>>>>>>> indexing
>> >>>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get
>> to
>> >>>>>>>> 400,000
>> >>>>>>>>>>> or
>> >>>>>>>>>>>>>>> so.
>> >>>>>>>>>>>>>>>> I will try this on 4.2.1 to see if I see the same
>> behavior
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>> >>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Since I don't have that many items in my index I
>> exported
>> >> all
>> >>>>>>>> of
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>>> keys
>> >>>>>>>>>>>>>>>>> for each shard and wrote a simple java program that
>> checks
>> >> for
>> >>>>>>>>>>>>>>> duplicates.
>> >>>>>>>>>>>>>>>>> I found some duplicate keys on different shards, a grep
>> of
>> >> the
>> >>>>>>>>>>> files
>> >>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>> the keys found does indicate that they made it to the
>> wrong
>> >>>>>>>> places.
>> >>>>>>>>>>>>> If
>> >>>>>>>>>>>>>>> you
>> >>>>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and
>> shard
>> >> 5.
>> >>>>>>>> Is
>> >>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>> possible that the hash is being calculated taking into
>> >>>>>>>> account only
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the
>> numShards
>> >>>>>>>> param @
>> >>>>>>>>>>>>>>> startup
>> >>>>>>>>>>>>>>>>> so could this be what is happening?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> >>>>>>>>>>>>>>>>> shard1-core1:0
>> >>>>>>>>>>>>>>>>> shard1-core2:0
>> >>>>>>>>>>>>>>>>> shard2-core1:0
>> >>>>>>>>>>>>>>>>> shard2-core2:0
>> >>>>>>>>>>>>>>>>> shard3-core1:1
>> >>>>>>>>>>>>>>>>> shard3-core2:1
>> >>>>>>>>>>>>>>>>> shard4-core1:0
>> >>>>>>>>>>>>>>>>> shard4-core2:0
>> >>>>>>>>>>>>>>>>> shard5-core1:1
>> >>>>>>>>>>>>>>>>> shard5-core2:1
>> >>>>>>>>>>>>>>>>> shard6-core1:0
>> >>>>>>>>>>>>>>>>> shard6-core2:0
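
For reference, a minimal sketch of that sort of offline check - the class name
and arguments here are hypothetical, not Jamie's actual program; it expects one
exported key file per shard, one key per line, passed as arguments:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FindDuplicateKeys {
    public static void main(String[] args) throws IOException {
        // key -> name of the first shard file it was seen in
        Map<String, String> firstSeenIn = new HashMap<String, String>();
        for (String fileName : args) {          // e.g. shard1-core1 shard2-core1 ...
            BufferedReader in = new BufferedReader(new FileReader(fileName));
            try {
                String key;
                while ((key = in.readLine()) != null) {
                    key = key.trim();
                    if (key.length() == 0) {
                        continue;
                    }
                    String first = firstSeenIn.get(key);
                    if (first == null) {
                        firstSeenIn.put(key, fileName);
                    } else if (!first.equals(fileName)) {
                        // same key exported from two different shards
                        System.out.println(key + " is in both " + first + " and " + fileName);
                    }
                }
            } finally {
                in.close();
            }
        }
    }
}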
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>> >>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just
>> >>>>>>>> indexed
>> >>>>>>>>>>>>> 300,000
>> >>>>>>>>>>>>>>>>>> items, and somehow 300,020 ended up in the index.  I
>> >> thought
>> >>>>>>>>>>>>> perhaps I
>> >>>>>>>>>>>>>>>>>> messed something up so I started the indexing again and
>> >>>>>>>> indexed
>> >>>>>>>>>>>>> another
>> >>>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to
>> >> find
>> >>>>>>>>>>>>> possible
>> >>>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field)
>> >> but
>> >>>>>>>> that
>> >>>>>>>>>>>>> didn't
>> >>>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
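
A hedged aside on why the facet came back clean: with a unique key field almost
every per-shard count is 1, so at the default facet.limit the per-shard
candidate lists are essentially arbitrary and a key that is doubled across
shards can easily be missed. Asking for an unbounded list with a minimum count
makes the check explicit (expensive, but workable at a few hundred thousand
docs) - roughly:

http://10.38.33.16:7575/solr/shard5-core1/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1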
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>> >>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to
>> go
>> >>>>>>>> again.
>> >>>>>>>>>>> I
>> >>>>>>>>>>>>> am
>> >>>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the
>> >> problem on
>> >>>>>>>>>>> 4.2.0
>> >>>>>>>>>>>>>>> and then
>> >>>>>>>>>>>>>>>>>>> I'll try on 4.2.1
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>> >>>>>>>>>>> markrmiller@gmail.com
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we need to
>> >> get
>> >>>>>>>> to the
>> >>>>>>>>>>>>>>> bottom
>> >>>>>>>>>>>>>>>>>>>> of it.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>> >>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Mark
>> >>>>>>>>>>>>>>>>>>>>> Is there a particular jira issue that you think
>> may
>> >>>>>>>> address
>> >>>>>>>>>>>>> this?
>> >>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>>>> read
>> >>>>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped
>> out
>> >>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
>> >>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
>> >>>>>>>> nothing.  I
>> >>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>>>> clear
>> >>>>>>>>>>>>>>>>>>>>>> the index and try 4.2.1. I will save off the logs
>> and
>> >> see
>> >>>>>>>> if
>> >>>>>>>>>>> there
>> >>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>> anything else odd
>> >>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
>> >>>>>>>> markrmiller@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have
>> said.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be
>> best
>> >> to
>> >>>>>>>> start
>> >>>>>>>>>>>>>>>>>>>> tracking in
>> >>>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back
>> >> again.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really
>> >> need
>> >>>>>>>> to
>> >>>>>>>>>>> get
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's
>> >> fixed in
>> >>>>>>>>>>> 4.2.1
>> >>>>>>>>>>>>>>>>>>>> (spreading
>> >>>>>>>>>>>>>>>>>>>>>>> to mirrors now).
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>> >>>>>>>> jej2003@gmail.com
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is
>> there
>> >>>>>>>> anything
>> >>>>>>>>>>>>> else
>> >>>>>>>>>>>>>>>>>>>> that I
>> >>>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?
>>  I'd
>> >> be
>> >>>>>>>> happy
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>> troll
>> >>>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
>> >>>>>>>> needed, just
>> >>>>>>>>>>>>> let
>> >>>>>>>>>>>>>>> me
>> >>>>>>>>>>>>>>>>>>>>>>> know.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to
>> fix
>> >>>>>>>> this.
>> >>>>>>>>>>> Is it
>> >>>>>>>>>>>>>>>>>>>>>>> required to
>> >>>>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr
>> >> resync
>> >>>>>>>>>>> things?
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>> >>>>>>>>>>>>> jej2003@gmail.com
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues
>> >>>>>>>> with...
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>> >>>>>>>> org.apache.solr.common.SolrException
>> >>>>>>>>>>>>> log
>> >>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>> >>>>>>>>>>>>>>>>>>>>>>> :
>> >>>>>>>>>>>>>>>>>>>>>>>>> Server at
>> >>>>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>> >>>>>>>>>>>>>>>>>>>> non
>> >>>>>>>>>>>>>>>>>>>>>>> ok
>> >>>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>
>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>
>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >>>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>> >>>>>>>>>>>>>>> jej2003@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>> >>>>>>>>>>> org.apache.solr.common.SolrException
>> >>>>>>>>>>>>> log
>> >>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
>> >>>>>>>> ClusterState
>> >>>>>>>>>>>>> says
>> >>>>>>>>>>>>>>>>>>>> we are
>> >>>>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>>>>>>>>>>>>>>>>>>>>>>> at
>> >>>>>>>>>>>>>>>>>>>>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Thanks I will try that.


On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <ma...@gmail.com> wrote:

>
>
> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > I am not using the concurrent low pause garbage collector, I could look at
> > switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC
> > correct?
>
> Right - if you don't do that, the default is almost always the throughput
> collector (I've only seen OSX buck this trend when apple handled java).
> That means stop the world garbage collections, so with larger heaps, that
> can be a fair amount of time that no threads can run. It's not that great
> for something as interactive as search generally is anyway, but it's always
> not that great when added to heavy load and a 15 sec session timeout
> between solr and zk.
>
>
> The below is odd - a replica node is waiting for the leader to see it as
> recovering and live - live means it has created an ephemeral node for that
> Solr corecontainer in zk - it's very strange if that didn't happen, unless
> this happened during shutdown or something.
>
> >
> > I also just had a shard go down and am seeing this in the log
> >
> > SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
> > down for 10.38.33.17:7576_solr but I still do not see the requested state.
> > I see state: recovering live:false
> >        at
> >
> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
> >        at
> >
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
> >        at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
> >        at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >        at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >        at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> >        at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> >        at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> >
> > Nothing other than this in the log jumps out as interesting though.

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.

On Apr 3, 2013, at 8:17 PM, Jamie Johnson <je...@gmail.com> wrote:

> I am not using the concurrent low pause garbage collector, I could look at
> switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC
> correct?

Right - if you don't do that, the default is almost always the throughput collector (I've only seen OS X buck this trend, back when Apple handled Java). That means stop-the-world garbage collections, so with larger heaps a single collection can be a fair amount of time during which no threads can run. That's not great for something as interactive as search in any case, and it's especially bad when combined with heavy load and a 15 sec session timeout between Solr and ZooKeeper.
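
For illustration only (not a command captured from this cluster): the startup line quoted further down in this thread, with the CMS collector turned on and the ZooKeeper client timeout raised to 30 seconds, would look roughly like this - heap sizes are placeholders, and -DzkClientTimeout only takes effect if solr.xml reads that system property (the stock 4.x example solr.xml does):

java -server -Xms2g -Xmx2g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled \
  -DzkClientTimeout=30000 \
  -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 \
  -Dcollection.configName=solr-conf -Dcollection=collection1 \
  -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar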


The below is odd - a replica node is waiting for the leader to see it as recovering and live. "Live" means it has created an ephemeral node for that Solr CoreContainer in ZooKeeper - it's very strange if that didn't happen, unless it happened during shutdown or something.

> 
> I also just had a shard go down and am seeing this in the log
> 
> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
> down for 10.38.33.17:7576_solr but I still do not see the requested state.
> I see state: recovering live:false
>        at
> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
>        at
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
>        at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>        at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>        at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>        at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>        at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>        at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> 
> Nothing other than this in the log jumps out as interesting though.
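
As a quick way to check the "live" part described above, the ephemeral nodes can be listed directly with the ZooKeeper CLI (run from the ZooKeeper bin directory) - the host below is just the first zkHost entry used elsewhere in this thread:

  ./zkCli.sh -server so-zoo1:2181
  [zk: so-zoo1:2181(CONNECTED) 0] ls /live_nodes
  [zk: so-zoo1:2181(CONNECTED) 1] get /clusterstate.json

Each running Solr instance should show up under /live_nodes with a name like 10.38.33.17:7576_solr; a node that is up but missing from that list would line up with the session expiry in the Overseer log further down.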
> 
> 
> On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <ma...@gmail.com> wrote:
> 
>> This shouldn't be a problem though, if things are working as they are
>> supposed to. Another node should simply take over as the overseer and
>> continue processing the work queue. It's just best if you configure so that
>> session timeouts don't happen unless a node is really down. On the other
>> hand, it's nicer to detect that faster. Your tradeoff to make.
>> 
>> - Mark
>> 
>> On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com> wrote:
>> 
>>> Yeah. Are you using the concurrent low pause garbage collector?
>>> 
>>> This means the overseer wasn't able to communicate with zk for 15
>> seconds - due to load or gc or whatever. If you can't resolve the root
>> cause of that, or the load just won't allow for it, next best thing you can
>> do is raise it to 30 seconds.
>>> 
>>> - Mark
>>> 
>>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> I am occasionally seeing this in the log, is this just a timeout issue?
>>>> Should I be increasing the zk client timeout?
>>>> 
>>>> WARNING: Overseer cannot talk to ZK
>>>> Apr 3, 2013 11:14:25 PM
>>>> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
>>>> INFO: Watcher fired on path: null state: Expired type None
>>>> Apr 3, 2013 11:14:25 PM
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater
>>>> run
>>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /overseer/queue
>>>>      at
>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>      at
>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>      at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>>>>      at
>>>> 
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>>>>      at
>>>> 
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>>>>      at
>>>> 
>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>>>>      at
>>>> 
>> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>>>>      at
>>>> 
>> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>>>>      at
>>>> 
>> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>>>>      at
>>>> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>>>>      at
>>>> 
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>>>>      at java.lang.Thread.run(Thread.java:662)
>>>> 
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>> 
>>>>> just an update, I'm at 1M records now with no issues.  This looks
>>>>> promising as to the cause of my issues, thanks for the help.  Is the
>>>>> routing method with numShards documented anywhere?  I know numShards is
>>>>> documented but I didn't know that the routing changed if you don't
>> specify
>>>>> it.
>>>>> 
>>>>> 
>>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>>> 
>>>>>> with these changes things are looking good, I'm up to 600,000
>> documents
>>>>>> without any issues as of right now.  I'll keep going and add more to
>> see if
>>>>>> I find anything.
>>>>>> 
>>>>>> 
>>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>>>> 
>>>>>>> ok, so that's not a deal breaker for me.  I just changed it to match
>> the
>>>>>>> shards that are auto created and it looks like things are happy.
>> I'll go
>>>>>>> ahead and try my test to see if I can get things out of sync.
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <markrmiller@gmail.com
>>> wrote:
>>>>>>> 
>>>>>>>> I had thought you could - but looking at the code recently, I don't
>>>>>>>> think you can anymore. I think that's a technical limitation more
>> than
>>>>>>>> anything though. When these changes were made, I think support for
>> that was
>>>>>>>> simply not added at the time.
>>>>>>>> 
>>>>>>>> I'm not sure exactly how straightforward it would be, but it seems
>>>>>>>> doable - as it is, the overseer will preallocate shards when first
>> creating
>>>>>>>> the collection - that's when they get named shard(n). There would
>> have to
>>>>>>>> be logic to replace shard(n) with the custom shard name when the
>> core
>>>>>>>> actually registers.
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>>>>>> 
>>>>>>>>> answered my own question, it now says compositeId.  What is
>>>>>>>> problematic
>>>>>>>>> though is that in addition to my shards (which are say
>> jamie-shard1)
>>>>>>>> I see
>>>>>>>>> the solr created shards (shard1).  I assume that these were created
>>>>>>>> because
>>>>>>>>> of the numShards param.  Is there no way to specify the names of
>> these
>>>>>>>>> shards?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> ah interesting....so I need to specify num shards, blow out zk and
>>>>>>>> then
>>>>>>>>>> try this again to see if things work properly now.  What is really
>>>>>>>> strange
>>>>>>>>>> is that for the most part things have worked right and on 4.2.1 I
>>>>>>>> have
>>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
>>>>>>>> specify num
>>>>>>>>>> shards clear out zk and begin again.  If this works properly what
>>>>>>>> should
>>>>>>>>>> the router type be?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
>> markrmiller@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> If you don't specify numShards after 4.1, you get an implicit doc
>>>>>>>> router
>>>>>>>>>>> and it's up to you to distribute updates. In the past,
>> partitioning
>>>>>>>> was
>>>>>>>>>>> done on the fly - but for shard splitting and perhaps other
>>>>>>>> features, we
>>>>>>>>>>> now divvy up the hash range up front based on numShards and store
>>>>>>>> it in
>>>>>>>>>>> ZooKeeper. No numShards is now how you take complete control of
>>>>>>>> updates
>>>>>>>>>>> yourself.
>>>>>>>>>>> 
>>>>>>>>>>> - Mark
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> The router says "implicit".  I did start from a blank zk state
>> but
>>>>>>>>>>> perhaps
>>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from the
>>>>>>>>>>>> clusterstate.json is shown below.  What is the process that
>> should
>>>>>>>> be
>>>>>>>>>>> done
>>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
>>>>>>>> above?  My
>>>>>>>>>>>> process right now is run those ZkCLI commands and then start
>> solr
>>>>>>>> on
>>>>>>>>>>> all of
>>>>>>>>>>>> the instances with a command like this
>>>>>>>>>>>> 
>>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>>>>>>>>>>> -Dcollection.configName=solr-conf
>>>>>>>>>>>> -Dcollection=collection1
>>>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>>>>>>>>>> 
>>>>>>>>>>>> I feel like maybe I'm missing a step.
>>>>>>>>>>>> 
>>>>>>>>>>>> "shard5":{
>>>>>>>>>>>>    "state":"active",
>>>>>>>>>>>>    "replicas":{
>>>>>>>>>>>>      "10.38.33.16:7575_solr_shard5-core1":{
>>>>>>>>>>>>        "shard":"shard5",
>>>>>>>>>>>>        "state":"active",
>>>>>>>>>>>>        "core":"shard5-core1",
>>>>>>>>>>>>        "collection":"collection1",
>>>>>>>>>>>>        "node_name":"10.38.33.16:7575_solr",
>>>>>>>>>>>>        "base_url":"http://10.38.33.16:7575/solr",
>>>>>>>>>>>>        "leader":"true"},
>>>>>>>>>>>>      "10.38.33.17:7577_solr_shard5-core2":{
>>>>>>>>>>>>        "shard":"shard5",
>>>>>>>>>>>>        "state":"recovering",
>>>>>>>>>>>>        "core":"shard5-core2",
>>>>>>>>>>>>        "collection":"collection1",
>>>>>>>>>>>>        "node_name":"10.38.33.17:7577_solr",
>>>>>>>>>>>>        "base_url":"http://10.38.33.17:7577/solr"}}}
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
>> markrmiller@gmail.com
>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> It should be part of your clusterstate.json. Some users have
>>>>>>>> reported
>>>>>>>>>>>>> trouble upgrading a previous zk install when this change came.
>> I
>>>>>>>>>>>>> recommended manually updating the clusterstate.json to have the
>>>>>>>> right
>>>>>>>>>>> info,
>>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to start
>>>>>>>> from a
>>>>>>>>>>> clean
>>>>>>>>>>>>> zk state.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If you don't have that range information, I think there will be
>>>>>>>>>>> trouble.
>>>>>>>>>>>>> Do you have an router type defined in the clusterstate.json?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in the
>>>>>>>> cluster
>>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do when I
>>>>>>>> start
>>>>>>>>>>> from
>>>>>>>>>>>>>> scratch is the following
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
>>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What
>> should
>>>>>>>> the
>>>>>>>>>>> steps
>>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot of
>> 4.0
>>>>>>>> so I
>>>>>>>>>>>>> never
>>>>>>>>>>>>>> did that previously either so perhaps I did create the
>>>>>>>> collection in
>>>>>>>>>>> one
>>>>>>>>>>>>> of
>>>>>>>>>>>>>> my steps to get this working but have forgotten it along the
>> way.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>>>>>>>> markrmiller@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up
>>>>>>>> front
>>>>>>>>>>>>> when a
>>>>>>>>>>>>>>> collection is created - each shard gets a range, which is
>>>>>>>> stored in
>>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the same id
>> on
>>>>>>>>>>>>> different
>>>>>>>>>>>>>>> shards - something very odd going on.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you reproduce.
>>>>>>>> Ideally
>>>>>>>>>>> we
>>>>>>>>>>>>>>> can capture it in a test case.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2003@gmail.com
>>> 
>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
>>>>>>>> parameter
>>>>>>>>>>> set I
>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on
>> 4.2.0
>>>>>>>> by
>>>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to
>>>>>>>> 400,000
>>>>>>>>>>> or
>>>>>>>>>>>>>>> so.
>>>>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same behavior
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Since I don't have that many items in my index I exported
>> all
>>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>>>>>> keys
>>>>>>>>>>>>>>>>> for each shard and wrote a simple java program that checks
>> for
>>>>>>>>>>>>>>> duplicates.
>>>>>>>>>>>>>>>>> I found some duplicate keys on different shards, a grep of
>> the
>>>>>>>>>>> files
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> the keys found does indicate that they made it to the wrong
>>>>>>>> places.
>>>>>>>>>>>>> If
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and shard
>> 5.
>>>>>>>> Is
>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> possible that the hash is being calculated taking into
>>>>>>>> account only
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the numShards
>>>>>>>> param @
>>>>>>>>>>>>>>> startup
>>>>>>>>>>>>>>>>> so could this be what is happening?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>>>>>>>>>>>>>>> shard1-core1:0
>>>>>>>>>>>>>>>>> shard1-core2:0
>>>>>>>>>>>>>>>>> shard2-core1:0
>>>>>>>>>>>>>>>>> shard2-core2:0
>>>>>>>>>>>>>>>>> shard3-core1:1
>>>>>>>>>>>>>>>>> shard3-core2:1
>>>>>>>>>>>>>>>>> shard4-core1:0
>>>>>>>>>>>>>>>>> shard4-core2:0
>>>>>>>>>>>>>>>>> shard5-core1:1
>>>>>>>>>>>>>>>>> shard5-core2:1
>>>>>>>>>>>>>>>>> shard6-core1:0
>>>>>>>>>>>>>>>>> shard6-core2:0
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just
>>>>>>>> indexed
>>>>>>>>>>>>> 300,000
>>>>>>>>>>>>>>>>>> items, and some how 300,020 ended up in the index.  I
>> thought
>>>>>>>>>>>>> perhaps I
>>>>>>>>>>>>>>>>>> messed something up so I started the indexing again and
>>>>>>>> indexed
>>>>>>>>>>>>> another
>>>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to
>> find
>>>>>>>>>>>>> possibile
>>>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field)
>> but
>>>>>>>> that
>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go
>>>>>>>> again.
>>>>>>>>>>> I
>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the
>> problem on
>>>>>>>>>>> 4.2.0
>>>>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>>>>> I'll try on 4.2.1
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we need to
>> get
>>>>>>>> to the
>>>>>>>>>>>>>>> bottom
>>>>>>>>>>>>>>>>>>>> of it.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>>>> It's there a particular jira issue that you think may
>>>>>>>> address
>>>>>>>>>>>>> this?
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> read
>>>>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped out
>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
>>>>>>>> nothing.  I
>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> clear
>>>>>>>>>>>>>>>>>>>>>> the index and try4.2.1. I will save off the logs and
>> see
>>>>>>>> if
>>>>>>>>>>> there
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> anything else odd
>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
>>>>>>>> markrmiller@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best
>> to
>>>>>>>> start
>>>>>>>>>>>>>>>>>>>> tracking in
>>>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back
>> again.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really
>> need
>>>>>>>> to
>>>>>>>>>>> get
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's
>> fixed in
>>>>>>>>>>> 4.2.1
>>>>>>>>>>>>>>>>>>>> (spreading
>>>>>>>>>>>>>>>>>>>>>>> to mirrors now).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there
>>>>>>>> anything
>>>>>>>>>>>>> else
>>>>>>>>>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd
>> be
>>>>>>>> happy
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> troll
>>>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
>>>>>>>> needed, just
>>>>>>>>>>>>> let
>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix
>>>>>>>> this.
>>>>>>>>>>> Is it
>>>>>>>>>>>>>>>>>>>>>>> required to
>>>>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr
>> resync
>>>>>>>>>>> things?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues
>>>>>>>> with...
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>> Server at
>>>>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>>>>>>>>>>>>>>>>>>>> non
>>>>>>>>>>>>>>>>>>>>>>> ok
>>>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>>>>>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
>>>>>>>> ClusterState
>>>>>>>>>>>>> says
>>>>>>>>>>>>>>>>>>>> we are
>>>>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> 
>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>>>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point
>>>>>>>> there
>>>>>>>>>>> were
>>>>>>>>>>>>>>>>>>>> shards
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is
>> below.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>>>>>>>>>>>>> state:SyncConnected
>>>>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has
>>>>>>>> occurred -
>>>>>>>>>>>>>>>>>>>>>>> updating... (live
>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's
>> okay
>>>>>>>> to be
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>>>>>>>>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of
>>>>>>>> apply
>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>>> Peersync
>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version
>>>>>>>> numbers for
>>>>>>>>>>>>>>>>>>>> updates in
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of
>> them
>>>>>>>> on
>>>>>>>>>>>>> leader
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> replica.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to
>> have
>>>>>>>>>>> versions
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>> the leader
>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any
>>>>>>>> interesting
>>>>>>>>>>>>>>>>>>>> exceptions?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing?
>>>>>>>> Did
>>>>>>>>>>> any zk
>>>>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr
>> cluster
>>>>>>>> to
>>>>>>>>>>> 4.2
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> noticed a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.
>> Specifically
>>>>>>>> the
>>>>>>>>>>>>> replica
>>>>>>>>>>>>>>>>>>>> has a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> higher
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version than the master which is causing the
>>>>>>>> index to
>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>> replicate.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents
>>>>>>>> than
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> master.
>>>>>>>>>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it
>> short of
>>>>>>>>>>> taking
>>>>>>>>>>>>>>>>>>>> down the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> index
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>> sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTARTreplicas=[
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/
>> ]
>>>>>>>>>>>>>>> nUpdates=100
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
>>>>>>>>>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> versions are newer.
>>>>>>>> ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>> sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync
>>>>>>>> succeeded
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it
>> has a
>>>>>>>>>>> newer
>>>>>>>>>>>>>>>>>>>> version of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while
>> having 10
>>>>>>>>>>> threads
>>>>>>>>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 10,000
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each)
>>>>>>>> cluster.
>>>>>>>>>>> Any
>>>>>>>>>>>>>>>>>>>> thoughts
>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 
>> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
I am not using the concurrent low pause garbage collector, but I could look
at switching. I'm assuming you're talking about adding
-XX:+UseConcMarkSweepGC, correct?
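
For what it's worth, here is roughly what I'd expect the start command from
earlier in this thread to look like with CMS turned on. The extra
-XX:+UseParNewGC is just my assumption of what usually gets paired with CMS,
so correct me if that's not what you meant:

java -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -Dshard=shard5 -DcoreName=shard5-core1 \
  -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf \
  -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar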

I also just had a shard go down, and I'm seeing this in the log:

SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
down for 10.38.33.17:7576_solr but I still do not see the requested state.
I see state: recovering live:false
        at
org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
        at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
        at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

Nothing other than this in the log jumps out as interesting though.
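
In case it helps, this is how I've been sanity checking what state actually
got published to ZooKeeper (plain ZooKeeper CLI against one of the zk hosts
from my start command, nothing Solr-specific):

zkCli.sh -server so-zoo1:2181 get /clusterstate.json
zkCli.sh -server so-zoo1:2181 ls /live_nodes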


On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <ma...@gmail.com> wrote:

> This shouldn't be a problem though, if things are working as they are
> supposed to. Another node should simply take over as the overseer and
> continue processing the work queue. It's just best if you configure so that
> session timeouts don't happen unless a node is really down. On the other
> hand, it's nicer to detect that faster. Your tradeoff to make.
>
> - Mark
>
> On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com> wrote:
>
> > Yeah. Are you using the concurrent low pause garbage collector?
> >
> > This means the overseer wasn't able to communicate with zk for 15
> seconds - due to load or gc or whatever. If you can't resolve the root
> cause of that, or the load just won't allow for it, next best thing you can
> do is raise it to 30 seconds.
> >
> > - Mark
> >
> > On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com> wrote:
> >
> >> I am occasionally seeing this in the log, is this just a timeout issue?
> >> Should I be increasing the zk client timeout?
> >>
> >> WARNING: Overseer cannot talk to ZK
> >> Apr 3, 2013 11:14:25 PM
> >> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
> >> INFO: Watcher fired on path: null state: Expired type None
> >> Apr 3, 2013 11:14:25 PM
> org.apache.solr.cloud.Overseer$ClusterStateUpdater
> >> run
> >> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> KeeperErrorCode = Session expired for /overseer/queue
> >>       at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>       at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>       at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
> >>       at
> >>
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
> >>       at
> >>
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
> >>       at
> >>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
> >>       at
> >>
> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
> >>       at
> >>
> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
> >>       at
> >>
> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
> >>       at
> >> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
> >>       at
> >>
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
> >>       at java.lang.Thread.run(Thread.java:662)
> >>
> >>
> >>
> >> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>
> >>> just an update, I'm at 1M records now with no issues.  This looks
> >>> promising as to the cause of my issues, thanks for the help.  Is the
> >>> routing method with numShards documented anywhere?  I know numShards is
> >>> documented but I didn't know that the routing changed if you don't
> specify
> >>> it.
> >>>
> >>>
> >>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>
> >>>> with these changes things are looking good, I'm up to 600,000
> documents
> >>>> without any issues as of right now.  I'll keep going and add more to
> see if
> >>>> I find anything.
> >>>>
> >>>>
> >>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>
> >>>>> ok, so that's not a deal breaker for me.  I just changed it to match
> the
> >>>>> shards that are auto created and it looks like things are happy.
>  I'll go
> >>>>> ahead and try my test to see if I can get things out of sync.
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <markrmiller@gmail.com
> >wrote:
> >>>>>
> >>>>>> I had thought you could - but looking at the code recently, I don't
> >>>>>> think you can anymore. I think that's a technical limitation more
> than
> >>>>>> anything though. When these changes were made, I think support for
> that was
> >>>>>> simply not added at the time.
> >>>>>>
> >>>>>> I'm not sure exactly how straightforward it would be, but it seems
> >>>>>> doable - as it is, the overseer will preallocate shards when first
> creating
> >>>>>> the collection - that's when they get named shard(n). There would
> have to
> >>>>>> be logic to replace shard(n) with the custom shard name when the
> core
> >>>>>> actually registers.
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> answered my own question, it now says compositeId.  What is
> >>>>>> problematic
> >>>>>>> though is that in addition to my shards (which are say
> jamie-shard1)
> >>>>>> I see
> >>>>>>> the solr created shards (shard1).  I assume that these were created
> >>>>>> because
> >>>>>>> of the numShards param.  Is there no way to specify the names of
> these
> >>>>>>> shards?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <je...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> ah interesting....so I need to specify num shards, blow out zk and
> >>>>>> then
> >>>>>>>> try this again to see if things work properly now.  What is really
> >>>>>> strange
> >>>>>>>> is that for the most part things have worked right and on 4.2.1 I
> >>>>>> have
> >>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
> >>>>>> specify num
> >>>>>>>> shards clear out zk and begin again.  If this works properly what
> >>>>>> should
> >>>>>>>> the router type be?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
> markrmiller@gmail.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> If you don't specify numShards after 4.1, you get an implicit doc
> >>>>>> router
> >>>>>>>>> and it's up to you to distribute updates. In the past,
> partitioning
> >>>>>> was
> >>>>>>>>> done on the fly - but for shard splitting and perhaps other
> >>>>>> features, we
> >>>>>>>>> now divvy up the hash range up front based on numShards and store
> >>>>>> it in
> >>>>>>>>> ZooKeeper. No numShards is now how you take complete control of
> >>>>>> updates
> >>>>>>>>> yourself.
> >>>>>>>>>
> >>>>>>>>> - Mark
> >>>>>>>>>
> >>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The router says "implicit".  I did start from a blank zk state
> but
> >>>>>>>>> perhaps
> >>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from the
> >>>>>>>>>> clusterstate.json is shown below.  What is the process that
> should
> >>>>>> be
> >>>>>>>>> done
> >>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
> >>>>>> above?  My
> >>>>>>>>>> process right now is run those ZkCLI commands and then start
> solr
> >>>>>> on
> >>>>>>>>> all of
> >>>>>>>>>> the instances with a command like this
> >>>>>>>>>>
> >>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
> >>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
> >>>>>>>>> -Dcollection.configName=solr-conf
> >>>>>>>>>> -Dcollection=collection1
> >>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
> >>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
> >>>>>>>>>>
> >>>>>>>>>> I feel like maybe I'm missing a step.
> >>>>>>>>>>
> >>>>>>>>>> "shard5":{
> >>>>>>>>>>     "state":"active",
> >>>>>>>>>>     "replicas":{
> >>>>>>>>>>       "10.38.33.16:7575_solr_shard5-core1":{
> >>>>>>>>>>         "shard":"shard5",
> >>>>>>>>>>         "state":"active",
> >>>>>>>>>>         "core":"shard5-core1",
> >>>>>>>>>>         "collection":"collection1",
> >>>>>>>>>>         "node_name":"10.38.33.16:7575_solr",
> >>>>>>>>>>         "base_url":"http://10.38.33.16:7575/solr",
> >>>>>>>>>>         "leader":"true"},
> >>>>>>>>>>       "10.38.33.17:7577_solr_shard5-core2":{
> >>>>>>>>>>         "shard":"shard5",
> >>>>>>>>>>         "state":"recovering",
> >>>>>>>>>>         "core":"shard5-core2",
> >>>>>>>>>>         "collection":"collection1",
> >>>>>>>>>>         "node_name":"10.38.33.17:7577_solr",
> >>>>>>>>>>         "base_url":"http://10.38.33.17:7577/solr"}}}
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
> markrmiller@gmail.com
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> It should be part of your clusterstate.json. Some users have
> >>>>>> reported
> >>>>>>>>>>> trouble upgrading a previous zk install when this change came.
> I
> >>>>>>>>>>> recommended manually updating the clusterstate.json to have the
> >>>>>> right
> >>>>>>>>> info,
> >>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to start
> >>>>>> from a
> >>>>>>>>> clean
> >>>>>>>>>>> zk state.
> >>>>>>>>>>>
> >>>>>>>>>>> If you don't have that range information, I think there will be
> >>>>>>>>> trouble.
> >>>>>>>>>>> Do you have an router type defined in the clusterstate.json?
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in the
> >>>>>> cluster
> >>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Perhaps something with my process is broken.  What I do when I
> >>>>>> start
> >>>>>>>>> from
> >>>>>>>>>>>> scratch is the following
> >>>>>>>>>>>>
> >>>>>>>>>>>> ZkCLI -cmd upconfig ...
> >>>>>>>>>>>> ZkCLI -cmd linkconfig ....
> >>>>>>>>>>>>
> >>>>>>>>>>>> but I don't ever explicitly create the collection.  What
> should
> >>>>>> the
> >>>>>>>>> steps
> >>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot of
> 4.0
> >>>>>> so I
> >>>>>>>>>>> never
> >>>>>>>>>>>> did that previously either so perhaps I did create the
> >>>>>> collection in
> >>>>>>>>> one
> >>>>>>>>>>> of
> >>>>>>>>>>>> my steps to get this working but have forgotten it along the
> way.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
> >>>>>> markrmiller@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up
> >>>>>> front
> >>>>>>>>>>> when a
> >>>>>>>>>>>>> collection is created - each shard gets a range, which is
> >>>>>> stored in
> >>>>>>>>>>>>> zookeeper. You should not be able to end up with the same id
> on
> >>>>>>>>>>> different
> >>>>>>>>>>>>> shards - something very odd going on.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hopefully I'll have some time to try and help you reproduce.
> >>>>>> Ideally
> >>>>>>>>> we
> >>>>>>>>>>>>> can capture it in a test case.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
> >>>>>> parameter
> >>>>>>>>> set I
> >>>>>>>>>>>>> am
> >>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on
> 4.2.0
> >>>>>> by
> >>>>>>>>>>>>> indexing
> >>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to
> >>>>>> 400,000
> >>>>>>>>> or
> >>>>>>>>>>>>> so.
> >>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same behavior
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
> >>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Since I don't have that many items in my index I exported
> all
> >>>>>> of
> >>>>>>>>> the
> >>>>>>>>>>>>> keys
> >>>>>>>>>>>>>>> for each shard and wrote a simple java program that checks
> for
> >>>>>>>>>>>>> duplicates.
> >>>>>>>>>>>>>>> I found some duplicate keys on different shards, a grep of
> the
> >>>>>>>>> files
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>> the keys found does indicate that they made it to the wrong
> >>>>>> places.
> >>>>>>>>>>> If
> >>>>>>>>>>>>> you
> >>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and shard
> 5.
> >>>>>> Is
> >>>>>>>>> it
> >>>>>>>>>>>>>>> possible that the hash is being calculated taking into
> >>>>>> account only
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the numShards
> >>>>>> param @
> >>>>>>>>>>>>> startup
> >>>>>>>>>>>>>>> so could this be what is happening?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >>>>>>>>>>>>>>> shard1-core1:0
> >>>>>>>>>>>>>>> shard1-core2:0
> >>>>>>>>>>>>>>> shard2-core1:0
> >>>>>>>>>>>>>>> shard2-core2:0
> >>>>>>>>>>>>>>> shard3-core1:1
> >>>>>>>>>>>>>>> shard3-core2:1
> >>>>>>>>>>>>>>> shard4-core1:0
> >>>>>>>>>>>>>>> shard4-core2:0
> >>>>>>>>>>>>>>> shard5-core1:1
> >>>>>>>>>>>>>>> shard5-core2:1
> >>>>>>>>>>>>>>> shard6-core1:0
> >>>>>>>>>>>>>>> shard6-core2:0
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
> >>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just
> >>>>>> indexed
> >>>>>>>>>>> 300,000
> >>>>>>>>>>>>>>>> items, and some how 300,020 ended up in the index.  I
> thought
> >>>>>>>>>>> perhaps I
> >>>>>>>>>>>>>>>> messed something up so I started the indexing again and
> >>>>>> indexed
> >>>>>>>>>>> another
> >>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to
> find
> >>>>>>>>>>> possibile
> >>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field)
> but
> >>>>>> that
> >>>>>>>>>>> didn't
> >>>>>>>>>>>>>>>> give me anything with more than a count of 1.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
> >>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go
> >>>>>> again.
> >>>>>>>>> I
> >>>>>>>>>>> am
> >>>>>>>>>>>>>>>>> going to clear the index and try to replicate the
> problem on
> >>>>>>>>> 4.2.0
> >>>>>>>>>>>>> and then
> >>>>>>>>>>>>>>>>> I'll try on 4.2.1
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
> >>>>>>>>> markrmiller@gmail.com
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we need to
> get
> >>>>>> to the
> >>>>>>>>>>>>> bottom
> >>>>>>>>>>>>>>>>>> of it.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
> >>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>>>>> It's there a particular jira issue that you think may
> >>>>>> address
> >>>>>>>>>>> this?
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>> read
> >>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped out
> >>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
> >>>>>> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
> >>>>>> nothing.  I
> >>>>>>>>> can
> >>>>>>>>>>>>>>>>>> clear
> >>>>>>>>>>>>>>>>>>>> the index and try4.2.1. I will save off the logs and
> see
> >>>>>> if
> >>>>>>>>> there
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> anything else odd
> >>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
> >>>>>> markrmiller@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best
> to
> >>>>>> start
> >>>>>>>>>>>>>>>>>> tracking in
> >>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back
> again.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really
> need
> >>>>>> to
> >>>>>>>>> get
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's
> fixed in
> >>>>>>>>> 4.2.1
> >>>>>>>>>>>>>>>>>> (spreading
> >>>>>>>>>>>>>>>>>>>>> to mirrors now).
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
> >>>>>> jej2003@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there
> >>>>>> anything
> >>>>>>>>>>> else
> >>>>>>>>>>>>>>>>>> that I
> >>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd
> be
> >>>>>> happy
> >>>>>>>>> to
> >>>>>>>>>>>>>>>>>> troll
> >>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
> >>>>>> needed, just
> >>>>>>>>>>> let
> >>>>>>>>>>>>> me
> >>>>>>>>>>>>>>>>>>>>> know.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix
> >>>>>> this.
> >>>>>>>>> Is it
> >>>>>>>>>>>>>>>>>>>>> required to
> >>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr
> resync
> >>>>>>>>> things?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues
> >>>>>> with...
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
> >>>>>> org.apache.solr.common.SolrException
> >>>>>>>>>>> log
> >>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
> >>>>>>>>>>>>>>>>>>>>> :
> >>>>>>>>>>>>>>>>>>>>>>> Server at
> >>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
> >>>>>>>>>>>>>>>>>> non
> >>>>>>>>>>>>>>>>>>>>> ok
> >>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
> >>>>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
> >>>>>>>>> org.apache.solr.common.SolrException
> >>>>>>>>>>> log
> >>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
> >>>>>> ClusterState
> >>>>>>>>>>> says
> >>>>>>>>>>>>>>>>>> we are
> >>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
> >>>>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point
> >>>>>> there
> >>>>>>>>> were
> >>>>>>>>>>>>>>>>>> shards
> >>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is
> below.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
> >>>>>>>>>>> state:SyncConnected
> >>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has
> >>>>>> occurred -
> >>>>>>>>>>>>>>>>>>>>> updating... (live
> >>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
> >>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
> >>>>>>>>>>>>>>>>>>>>>>>>> process
> >>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's
> okay
> >>>>>> to be
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> leader.
> >>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
> >>>>>>>>>>>>>>>>>> markrmiller@gmail.com
> >>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of
> >>>>>> apply
> >>>>>>>>> here.
> >>>>>>>>>>>>>>>>>> Peersync
> >>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version
> >>>>>> numbers for
> >>>>>>>>>>>>>>>>>> updates in
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of
> them
> >>>>>> on
> >>>>>>>>>>> leader
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> replica.
> >>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to
> have
> >>>>>>>>> versions
> >>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>> the leader
> >>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any
> >>>>>> interesting
> >>>>>>>>>>>>>>>>>> exceptions?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing?
> >>>>>> Did
> >>>>>>>>> any zk
> >>>>>>>>>>>>>>>>>> session
> >>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr
> cluster
> >>>>>> to
> >>>>>>>>> 4.2
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> noticed a
> >>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.
>  Specifically
> >>>>>> the
> >>>>>>>>>>> replica
> >>>>>>>>>>>>>>>>>> has a
> >>>>>>>>>>>>>>>>>>>>>>>>>> higher
> >>>>>>>>>>>>>>>>>>>>>>>>>>> version than the master which is causing the
> >>>>>> index to
> >>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>> replicate.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents
> >>>>>> than
> >>>>>>>>> the
> >>>>>>>>>>>>>>>>>> master.
> >>>>>>>>>>>>>>>>>>>>> What
> >>>>>>>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it
> short of
> >>>>>>>>> taking
> >>>>>>>>>>>>>>>>>> down the
> >>>>>>>>>>>>>>>>>>>>>>>>>> index
> >>>>>>>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>> sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTARTreplicas=[
> >>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/
> ]
> >>>>>>>>>>>>> nUpdates=100
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>>>>>>>>>>>> handleVersions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
> >>>>>>>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>>>>>>>>>>>> handleVersions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
> >>>>>>>>>>>>>>>>>>>>>>>>>>> versions are newer.
> >>>>>> ourLowThreshold=1431233788792274944
> >>>>>>>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
> >>>>>> org.apache.solr.update.PeerSync
> >>>>>>>>>>> sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync
> >>>>>> succeeded
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it
> has a
> >>>>>>>>> newer
> >>>>>>>>>>>>>>>>>> version of
> >>>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while
> having 10
> >>>>>>>>> threads
> >>>>>>>>>>>>>>>>>> indexing
> >>>>>>>>>>>>>>>>>>>>>>>>>> 10,000
> >>>>>>>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each)
> >>>>>> cluster.
> >>>>>>>>> Any
> >>>>>>>>>>>>>>>>>> thoughts
> >>>>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure things so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect a down node faster. Your tradeoff to make.
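
For the raise itself it should just be the zkClientTimeout setting. Assuming the stock solr.xml, which substitutes ${zkClientTimeout:15000} on the cores element, adding something like the below to your existing start command should do it, but double check that your solr.xml actually wires that property through.

java -server -DzkClientTimeout=30000 <rest of your existing flags> -jar start.jar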

- Mark

On Apr 3, 2013, at 7:46 PM, Mark Miller <ma...@gmail.com> wrote:

> Yeah. Are you using the concurrent low pause garbage collector?
> 
> This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds.
> 
> - Mark
> 
> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> I am occasionally seeing this in the log, is this just a timeout issue?
>> Should I be increasing the zk client timeout?
>> 
>> WARNING: Overseer cannot talk to ZK
>> Apr 3, 2013 11:14:25 PM
>> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
>> INFO: Watcher fired on path: null state: Expired type None
>> Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
>> run
>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /overseer/queue
>>       at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>       at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>       at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>>       at
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>>       at
>> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>>       at
>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>>       at
>> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>>       at
>> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>>       at
>> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>>       at
>> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>>       at
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>>       at java.lang.Thread.run(Thread.java:662)
>> 
>> 
>> 
>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> just an update, I'm at 1M records now with no issues.  This looks
>>> promising as to the cause of my issues, thanks for the help.  Is the
>>> routing method with numShards documented anywhere?  I know numShards is
>>> documented but I didn't know that the routing changed if you don't specify
>>> it.
>>> 
>>> 
>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> with these changes things are looking good, I'm up to 600,000 documents
>>>> without any issues as of right now.  I'll keep going and add more to see if
>>>> I find anything.
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>> 
>>>>> ok, so that's not a deal breaker for me.  I just changed it to match the
>>>>> shards that are auto created and it looks like things are happy.  I'll go
>>>>> ahead and try my test to see if I can get things out of sync.
>>>>> 
>>>>> 
>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <ma...@gmail.com>wrote:
>>>>> 
>>>>>> I had thought you could - but looking at the code recently, I don't
>>>>>> think you can anymore. I think that's a technical limitation more than
>>>>>> anything though. When these changes were made, I think support for that was
>>>>>> simply not added at the time.
>>>>>> 
>>>>>> I'm not sure exactly how straightforward it would be, but it seems
>>>>>> doable - as it is, the overseer will preallocate shards when first creating
>>>>>> the collection - that's when they get named shard(n). There would have to
>>>>>> be logic to replace shard(n) with the custom shard name when the core
>>>>>> actually registers.
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>> 
>>>>>>> answered my own question, it now says compositeId.  What is
>>>>>> problematic
>>>>>>> though is that in addition to my shards (which are say jamie-shard1)
>>>>>> I see
>>>>>>> the solr created shards (shard1).  I assume that these were created
>>>>>> because
>>>>>>> of the numShards param.  Is there no way to specify the names of these
>>>>>>> shards?
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <je...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> ah interesting....so I need to specify num shards, blow out zk and
>>>>>> then
>>>>>>>> try this again to see if things work properly now.  What is really
>>>>>> strange
>>>>>>>> is that for the most part things have worked right and on 4.2.1 I
>>>>>> have
>>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
>>>>>> specify num
>>>>>>>> shards clear out zk and begin again.  If this works properly what
>>>>>> should
>>>>>>>> the router type be?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <ma...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> If you don't specify numShards after 4.1, you get an implicit doc
>>>>>> router
>>>>>>>>> and it's up to you to distribute updates. In the past, partitioning
>>>>>> was
>>>>>>>>> done on the fly - but for shard splitting and perhaps other
>>>>>> features, we
>>>>>>>>> now divvy up the hash range up front based on numShards and store
>>>>>> it in
>>>>>>>>> ZooKeeper. No numShards is now how you take complete control of
>>>>>> updates
>>>>>>>>> yourself.
>>>>>>>>> 
>>>>>>>>> - Mark
>>>>>>>>> 
>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> The router says "implicit".  I did start from a blank zk state but
>>>>>>>>> perhaps
>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from the
>>>>>>>>>> clusterstate.json is shown below.  What is the process that should
>>>>>> be
>>>>>>>>> done
>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
>>>>>> above?  My
>>>>>>>>>> process right now is run those ZkCLI commands and then start solr
>>>>>> on
>>>>>>>>> all of
>>>>>>>>>> the instances with a command like this
>>>>>>>>>> 
>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>>>>>>>>> -Dcollection.configName=solr-conf
>>>>>>>>>> -Dcollection=collection1
>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>>>>>>>> 
>>>>>>>>>> I feel like maybe I'm missing a step.
>>>>>>>>>> 
>>>>>>>>>> "shard5":{
>>>>>>>>>>     "state":"active",
>>>>>>>>>>     "replicas":{
>>>>>>>>>>       "10.38.33.16:7575_solr_shard5-core1":{
>>>>>>>>>>         "shard":"shard5",
>>>>>>>>>>         "state":"active",
>>>>>>>>>>         "core":"shard5-core1",
>>>>>>>>>>         "collection":"collection1",
>>>>>>>>>>         "node_name":"10.38.33.16:7575_solr",
>>>>>>>>>>         "base_url":"http://10.38.33.16:7575/solr",
>>>>>>>>>>         "leader":"true"},
>>>>>>>>>>       "10.38.33.17:7577_solr_shard5-core2":{
>>>>>>>>>>         "shard":"shard5",
>>>>>>>>>>         "state":"recovering",
>>>>>>>>>>         "core":"shard5-core2",
>>>>>>>>>>         "collection":"collection1",
>>>>>>>>>>         "node_name":"10.38.33.17:7577_solr",
>>>>>>>>>>         "base_url":"http://10.38.33.17:7577/solr"}}}
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <markrmiller@gmail.com
>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> It should be part of your clusterstate.json. Some users have
>>>>>> reported
>>>>>>>>>>> trouble upgrading a previous zk install when this change came. I
>>>>>>>>>>> recommended manually updating the clusterstate.json to have the
>>>>>> right
>>>>>>>>> info,
>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to start
>>>>>> from a
>>>>>>>>> clean
>>>>>>>>>>> zk state.
>>>>>>>>>>> 
>>>>>>>>>>> If you don't have that range information, I think there will be
>>>>>>>>> trouble.
>>>>>>>>>>> Do you have an router type defined in the clusterstate.json?
>>>>>>>>>>> 
>>>>>>>>>>> - Mark
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in the
>>>>>> cluster
>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>>>>>>>>>>>> 
>>>>>>>>>>>> Perhaps something with my process is broken.  What I do when I
>>>>>> start
>>>>>>>>> from
>>>>>>>>>>>> scratch is the following
>>>>>>>>>>>> 
>>>>>>>>>>>> ZkCLI -cmd upconfig ...
>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
>>>>>>>>>>>> 
>>>>>>>>>>>> but I don't ever explicitly create the collection.  What should
>>>>>> the
>>>>>>>>> steps
>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot of 4.0
>>>>>> so I
>>>>>>>>>>> never
>>>>>>>>>>>> did that previously either so perhaps I did create the
>>>>>> collection in
>>>>>>>>> one
>>>>>>>>>>> of
>>>>>>>>>>>> my steps to get this working but have forgotten it along the way.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>>>>>> markrmiller@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up
>>>>>> front
>>>>>>>>>>> when a
>>>>>>>>>>>>> collection is created - each shard gets a range, which is
>>>>>> stored in
>>>>>>>>>>>>> zookeeper. You should not be able to end up with the same id on
>>>>>>>>>>> different
>>>>>>>>>>>>> shards - something very odd going on.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hopefully I'll have some time to try and help you reproduce.
>>>>>> Ideally
>>>>>>>>> we
>>>>>>>>>>>>> can capture it in a test case.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
>>>>>> parameter
>>>>>>>>> set I
>>>>>>>>>>>>> am
>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0
>>>>>> by
>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to
>>>>>> 400,000
>>>>>>>>> or
>>>>>>>>>>>>> so.
>>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same behavior
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Since I don't have that many items in my index I exported all
>>>>>> of
>>>>>>>>> the
>>>>>>>>>>>>> keys
>>>>>>>>>>>>>>> for each shard and wrote a simple java program that checks for
>>>>>>>>>>>>> duplicates.
>>>>>>>>>>>>>>> I found some duplicate keys on different shards, a grep of the
>>>>>>>>> files
>>>>>>>>>>> for
>>>>>>>>>>>>>>> the keys found does indicate that they made it to the wrong
>>>>>> places.
>>>>>>>>>>> If
>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and shard 5.
>>>>>> Is
>>>>>>>>> it
>>>>>>>>>>>>>>> possible that the hash is being calculated taking into
>>>>>> account only
>>>>>>>>>>> the
>>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the numShards
>>>>>> param @
>>>>>>>>>>>>> startup
>>>>>>>>>>>>>>> so could this be what is happening?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>>>>>>>>>>>>> shard1-core1:0
>>>>>>>>>>>>>>> shard1-core2:0
>>>>>>>>>>>>>>> shard2-core1:0
>>>>>>>>>>>>>>> shard2-core2:0
>>>>>>>>>>>>>>> shard3-core1:1
>>>>>>>>>>>>>>> shard3-core2:1
>>>>>>>>>>>>>>> shard4-core1:0
>>>>>>>>>>>>>>> shard4-core2:0
>>>>>>>>>>>>>>> shard5-core1:1
>>>>>>>>>>>>>>> shard5-core2:1
>>>>>>>>>>>>>>> shard6-core1:0
>>>>>>>>>>>>>>> shard6-core2:0
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just
>>>>>> indexed
>>>>>>>>>>> 300,000
>>>>>>>>>>>>>>>> items, and some how 300,020 ended up in the index.  I thought
>>>>>>>>>>> perhaps I
>>>>>>>>>>>>>>>> messed something up so I started the indexing again and
>>>>>> indexed
>>>>>>>>>>> another
>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find
>>>>>>>>>>> possibile
>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field) but
>>>>>> that
>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go
>>>>>> again.
>>>>>>>>> I
>>>>>>>>>>> am
>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the problem on
>>>>>>>>> 4.2.0
>>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>>> I'll try on 4.2.1
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we need to get
>>>>>> to the
>>>>>>>>>>>>> bottom
>>>>>>>>>>>>>>>>>> of it.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>> It's there a particular jira issue that you think may
>>>>>> address
>>>>>>>>>>> this?
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> read
>>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped out
>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
>>>>>> jej2003@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
>>>>>> nothing.  I
>>>>>>>>> can
>>>>>>>>>>>>>>>>>> clear
>>>>>>>>>>>>>>>>>>>> the index and try4.2.1. I will save off the logs and see
>>>>>> if
>>>>>>>>> there
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> anything else odd
>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
>>>>>> markrmiller@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to
>>>>>> start
>>>>>>>>>>>>>>>>>> tracking in
>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need
>>>>>> to
>>>>>>>>> get
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in
>>>>>>>>> 4.2.1
>>>>>>>>>>>>>>>>>> (spreading
>>>>>>>>>>>>>>>>>>>>> to mirrors now).
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>>>>>> jej2003@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there
>>>>>> anything
>>>>>>>>>>> else
>>>>>>>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be
>>>>>> happy
>>>>>>>>> to
>>>>>>>>>>>>>>>>>> troll
>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
>>>>>> needed, just
>>>>>>>>>>> let
>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix
>>>>>> this.
>>>>>>>>> Is it
>>>>>>>>>>>>>>>>>>>>> required to
>>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr resync
>>>>>>>>> things?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues
>>>>>> with...
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>> Server at
>>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>>>>>>>>>>>>>>>>>> non
>>>>>>>>>>>>>>>>>>>>> ok
>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>>>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
>>>>>> ClusterState
>>>>>>>>>>> says
>>>>>>>>>>>>>>>>>> we are
>>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point
>>>>>> there
>>>>>>>>> were
>>>>>>>>>>>>>>>>>> shards
>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is below.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>>>>>>>>>>> state:SyncConnected
>>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has
>>>>>> occurred -
>>>>>>>>>>>>>>>>>>>>> updating... (live
>>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay
>>>>>> to be
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>>>>>>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of
>>>>>> apply
>>>>>>>>> here.
>>>>>>>>>>>>>>>>>> Peersync
>>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version
>>>>>> numbers for
>>>>>>>>>>>>>>>>>> updates in
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of them
>>>>>> on
>>>>>>>>>>> leader
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> replica.
>>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to have
>>>>>>>>> versions
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> the leader
>>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any
>>>>>> interesting
>>>>>>>>>>>>>>>>>> exceptions?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing?
>>>>>> Did
>>>>>>>>> any zk
>>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster
>>>>>> to
>>>>>>>>> 4.2
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> noticed a
>>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.  Specifically
>>>>>> the
>>>>>>>>>>> replica
>>>>>>>>>>>>>>>>>> has a
>>>>>>>>>>>>>>>>>>>>>>>>>> higher
>>>>>>>>>>>>>>>>>>>>>>>>>>> version than the master which is causing the
>>>>>> index to
>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>> replicate.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents
>>>>>> than
>>>>>>>>> the
>>>>>>>>>>>>>>>>>> master.
>>>>>>>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it short of
>>>>>>>>> taking
>>>>>>>>>>>>>>>>>> down the
>>>>>>>>>>>>>>>>>>>>>>>>>> index
>>>>>>>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>> sync
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/]
>>>>>>>>>>>>> nUpdates=100
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
>>>>>>>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>>>>>>>>>>>>>>>>>>>>> versions are newer.
>>>>>> ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM
>>>>>> org.apache.solr.update.PeerSync
>>>>>>>>>>> sync
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync
>>>>>> succeeded
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a
>>>>>>>>> newer
>>>>>>>>>>>>>>>>>> version of
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while having 10
>>>>>>>>> threads
>>>>>>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>>>>>>>>>>>> 10,000
>>>>>>>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each)
>>>>>> cluster.
>>>>>>>>> Any
>>>>>>>>>>>>>>>>>> thoughts
>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
Yeah. Are you using the concurrent low pause garbage collector?
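
If not, a fairly typical CMS setup for a Solr node on the JVMs of this era looks something like the sketch below - the heap size and occupancy fraction here are placeholders to tune for your boxes, not recommendations:

  java -server -Xms4g -Xmx4g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -XX:+CMSParallelRemarkEnabled \
       -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
       ... -jar start.jar

Long stop-the-world pauses from the default collector on a large heap are one of the easier ways to blow right past the zk session timeout under heavy indexing.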

This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, the next best thing you can do is raise the zk client timeout to 30 seconds.

- Mark

On Apr 3, 2013, at 7:41 PM, Jamie Johnson <je...@gmail.com> wrote:

> I am occasionally seeing this in the log. Is this just a timeout issue?
> Should I be increasing the zk client timeout?
> 
> WARNING: Overseer cannot talk to ZK
> Apr 3, 2013 11:14:25 PM
> org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
> INFO: Watcher fired on path: null state: Expired type None
> Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
> run
> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer/queue
>        at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>        at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>        at
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>        at
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>        at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>        at
> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>        at
> org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>        at
> org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>        at
> org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>        at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>        at java.lang.Thread.run(Thread.java:662)
> 
> 
> 
> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> just an update, I'm at 1M records now with no issues.  This looks
>> promising as to the cause of my issues, thanks for the help.  Is the
>> routing method with numShards documented anywhere?  I know numShards is
>> documented but I didn't know that the routing changed if you don't specify
>> it.
>> 
>> 
>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> with these changes things are looking good, I'm up to 600,000 documents
>>> without any issues as of right now.  I'll keep going and add more to see if
>>> I find anything.
>>> 
>>> 
>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> ok, so that's not a deal breaker for me.  I just changed it to match the
>>>> shards that are auto created and it looks like things are happy.  I'll go
>>>> ahead and try my test to see if I can get things out of sync.
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <ma...@gmail.com>wrote:
>>>> 
>>>>> I had thought you could - but looking at the code recently, I don't
>>>>> think you can anymore. I think that's a technical limitation more than
>>>>> anything though. When these changes were made, I think support for that was
>>>>> simply not added at the time.
>>>>> 
>>>>> I'm not sure exactly how straightforward it would be, but it seems
>>>>> doable - as it is, the overseer will preallocate shards when first creating
>>>>> the collection - that's when they get named shard(n). There would have to
>>>>> be logic to replace shard(n) with the custom shard name when the core
>>>>> actually registers.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> 
>>>>>> answered my own question, it now says compositeId.  What is
>>>>> problematic
>>>>>> though is that in addition to my shards (which are say jamie-shard1)
>>>>> I see
>>>>>> the solr created shards (shard1).  I assume that these were created
>>>>> because
>>>>>> of the numShards param.  Is there no way to specify the names of these
>>>>>> shards?
>>>>>> 
>>>>>> 
>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> ah interesting....so I need to specify num shards, blow out zk and
>>>>> then
>>>>>>> try this again to see if things work properly now.  What is really
>>>>> strange
>>>>>>> is that for the most part things have worked right and on 4.2.1 I
>>>>> have
>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
>>>>> specify num
>>>>>>> shards clear out zk and begin again.  If this works properly what
>>>>> should
>>>>>>> the router type be?
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> If you don't specify numShards after 4.1, you get an implicit doc
>>>>> router
>>>>>>>> and it's up to you to distribute updates. In the past, partitioning
>>>>> was
>>>>>>>> done on the fly - but for shard splitting and perhaps other
>>>>> features, we
>>>>>>>> now divvy up the hash range up front based on numShards and store
>>>>> it in
>>>>>>>> ZooKeeper. No numShards is now how you take complete control of
>>>>> updates
>>>>>>>> yourself.
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> The router says "implicit".  I did start from a blank zk state but
>>>>>>>> perhaps
>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from the
>>>>>>>>> clusterstate.json is shown below.  What is the process that should
>>>>> be
>>>>>>>> done
>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
>>>>> above?  My
>>>>>>>>> process right now is run those ZkCLI commands and then start solr
>>>>> on
>>>>>>>> all of
>>>>>>>>> the instances with a command like this
>>>>>>>>> 
>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>>>>>>>> -Dcollection.configName=solr-conf
>>>>>>>>> -Dcollection=collection1
>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>>>>>>> 
>>>>>>>>> I feel like maybe I'm missing a step.
>>>>>>>>> 
>>>>>>>>> "shard5":{
>>>>>>>>>      "state":"active",
>>>>>>>>>      "replicas":{
>>>>>>>>>        "10.38.33.16:7575_solr_shard5-core1":{
>>>>>>>>>          "shard":"shard5",
>>>>>>>>>          "state":"active",
>>>>>>>>>          "core":"shard5-core1",
>>>>>>>>>          "collection":"collection1",
>>>>>>>>>          "node_name":"10.38.33.16:7575_solr",
>>>>>>>>>          "base_url":"http://10.38.33.16:7575/solr",
>>>>>>>>>          "leader":"true"},
>>>>>>>>>        "10.38.33.17:7577_solr_shard5-core2":{
>>>>>>>>>          "shard":"shard5",
>>>>>>>>>          "state":"recovering",
>>>>>>>>>          "core":"shard5-core2",
>>>>>>>>>          "collection":"collection1",
>>>>>>>>>          "node_name":"10.38.33.17:7577_solr",
>>>>>>>>>          "base_url":"http://10.38.33.17:7577/solr"}}}
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <markrmiller@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> It should be part of your clusterstate.json. Some users have
>>>>> reported
>>>>>>>>>> trouble upgrading a previous zk install when this change came. I
>>>>>>>>>> recommended manually updating the clusterstate.json to have the
>>>>> right
>>>>>>>> info,
>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to start
>>>>> from a
>>>>>>>> clean
>>>>>>>>>> zk state.
>>>>>>>>>> 
>>>>>>>>>> If you don't have that range information, I think there will be
>>>>>>>> trouble.
>>>>>>>>>> Do you have an router type defined in the clusterstate.json?
>>>>>>>>>> 
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in the
>>>>> cluster
>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>>>>>>>>>>> 
>>>>>>>>>>> Perhaps something with my process is broken.  What I do when I
>>>>> start
>>>>>>>> from
>>>>>>>>>>> scratch is the following
>>>>>>>>>>> 
>>>>>>>>>>> ZkCLI -cmd upconfig ...
>>>>>>>>>>> ZkCLI -cmd linkconfig ....
>>>>>>>>>>> 
>>>>>>>>>>> but I don't ever explicitly create the collection.  What should
>>>>> the
>>>>>>>> steps
>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot of 4.0
>>>>> so I
>>>>>>>>>> never
>>>>>>>>>>> did that previously either so perhaps I did create the
>>>>> collection in
>>>>>>>> one
>>>>>>>>>> of
>>>>>>>>>>> my steps to get this working but have forgotten it along the way.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>>>>> markrmiller@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up
>>>>> front
>>>>>>>>>> when a
>>>>>>>>>>>> collection is created - each shard gets a range, which is
>>>>> stored in
>>>>>>>>>>>> zookeeper. You should not be able to end up with the same id on
>>>>>>>>>> different
>>>>>>>>>>>> shards - something very odd going on.
>>>>>>>>>>>> 
>>>>>>>>>>>> Hopefully I'll have some time to try and help you reproduce.
>>>>> Ideally
>>>>>>>> we
>>>>>>>>>>>> can capture it in a test case.
>>>>>>>>>>>> 
>>>>>>>>>>>> - Mark
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the
>>>>> parameter
>>>>>>>> set I
>>>>>>>>>>>> am
>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0
>>>>> by
>>>>>>>>>>>> indexing
>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to
>>>>> 400,000
>>>>>>>> or
>>>>>>>>>>>> so.
>>>>>>>>>>>>> I will try this on 4.2.1. to see if I see the same behavior
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>>>>> jej2003@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Since I don't have that many items in my index I exported all
>>>>> of
>>>>>>>> the
>>>>>>>>>>>> keys
>>>>>>>>>>>>>> for each shard and wrote a simple java program that checks for
>>>>>>>>>>>> duplicates.
>>>>>>>>>>>>>> I found some duplicate keys on different shards, a grep of the
>>>>>>>> files
>>>>>>>>>> for
>>>>>>>>>>>>>> the keys found does indicate that they made it to the wrong
>>>>> places.
>>>>>>>>>> If
>>>>>>>>>>>> you
>>>>>>>>>>>>>> notice documents with the same ID are on shard 3 and shard 5.
>>>>> Is
>>>>>>>> it
>>>>>>>>>>>>>> possible that the hash is being calculated taking into
>>>>> account only
>>>>>>>>>> the
>>>>>>>>>>>>>> "live" nodes?  I know that we don't specify the numShards
>>>>> param @
>>>>>>>>>>>> startup
>>>>>>>>>>>>>> so could this be what is happening?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>>>>>>>>>>>> shard1-core1:0
>>>>>>>>>>>>>> shard1-core2:0
>>>>>>>>>>>>>> shard2-core1:0
>>>>>>>>>>>>>> shard2-core2:0
>>>>>>>>>>>>>> shard3-core1:1
>>>>>>>>>>>>>> shard3-core2:1
>>>>>>>>>>>>>> shard4-core1:0
>>>>>>>>>>>>>> shard4-core2:0
>>>>>>>>>>>>>> shard5-core1:1
>>>>>>>>>>>>>> shard5-core2:1
>>>>>>>>>>>>>> shard6-core1:0
>>>>>>>>>>>>>> shard6-core2:0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>>>>> jej2003@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just
>>>>> indexed
>>>>>>>>>> 300,000
>>>>>>>>>>>>>>> items, and some how 300,020 ended up in the index.  I thought
>>>>>>>>>> perhaps I
>>>>>>>>>>>>>>> messed something up so I started the indexing again and
>>>>> indexed
>>>>>>>>>> another
>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find
>>>>>>>>>> possibile
>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field) but
>>>>> that
>>>>>>>>>> didn't
>>>>>>>>>>>>>>> give me anything with more than a count of 1.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>>>>> jej2003@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go
>>>>> again.
>>>>>>>> I
>>>>>>>>>> am
>>>>>>>>>>>>>>>> going to clear the index and try to replicate the problem on
>>>>>>>> 4.2.0
>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>> I'll try on 4.2.1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> No, not that I know if, which is why I say we need to get
>>>>> to the
>>>>>>>>>>>> bottom
>>>>>>>>>>>>>>>>> of it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>>>>> jej2003@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>> It's there a particular jira issue that you think may
>>>>> address
>>>>>>>>>> this?
>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> read
>>>>>>>>>>>>>>>>>> through it quickly but didn't see one that jumped out
>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
>>>>> jej2003@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did
>>>>> nothing.  I
>>>>>>>> can
>>>>>>>>>>>>>>>>> clear
>>>>>>>>>>>>>>>>>>> the index and try4.2.1. I will save off the logs and see
>>>>> if
>>>>>>>> there
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> anything else odd
>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
>>>>> markrmiller@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to
>>>>> start
>>>>>>>>>>>>>>>>> tracking in
>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need
>>>>> to
>>>>>>>> get
>>>>>>>>>> to
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in
>>>>>>>> 4.2.1
>>>>>>>>>>>>>>>>> (spreading
>>>>>>>>>>>>>>>>>>>> to mirrors now).
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>>>>> jej2003@gmail.com
>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there
>>>>> anything
>>>>>>>>>> else
>>>>>>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be
>>>>> happy
>>>>>>>> to
>>>>>>>>>>>>>>>>> troll
>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is
>>>>> needed, just
>>>>>>>>>> let
>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix
>>>>> this.
>>>>>>>> Is it
>>>>>>>>>>>>>>>>>>>> required to
>>>>>>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr resync
>>>>>>>> things?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues
>>>>> with...
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>> org.apache.solr.common.SolrException
>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>> Server at
>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>>>>>>>>>>>>>>>>> non
>>>>>>>>>>>>>>>>>>>> ok
>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>> 
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>>>>>>>>>>>  at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>>>>>>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>>>>>>> org.apache.solr.common.SolrException
>>>>>>>>>> log
>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException:
>>>>> ClusterState
>>>>>>>>>> says
>>>>>>>>>>>>>>>>> we are
>>>>>>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>>>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point
>>>>> there
>>>>>>>> were
>>>>>>>>>>>>>>>>> shards
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is below.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>>>>>>>>>> state:SyncConnected
>>>>>>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has
>>>>> occurred -
>>>>>>>>>>>>>>>>>>>> updating... (live
>>>>>>>>>>>>>>>>>>>>>>>> nodes size: 12)
>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay
>>>>> to be
>>>>>>>>>> the
>>>>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>>>>>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of
>>>>> apply
>>>>>>>> here.
>>>>>>>>>>>>>>>>> Peersync
>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version
>>>>> numbers for
>>>>>>>>>>>>>>>>> updates in
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of them
>>>>> on
>>>>>>>>>> leader
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> replica.
>>>>>>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to have
>>>>>>>> versions
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> the leader
>>>>>>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any
>>>>> interesting
>>>>>>>>>>>>>>>>> exceptions?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing?
>>>>> Did
>>>>>>>> any zk
>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>>>>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster
>>>>> to
>>>>>>>> 4.2
>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> noticed a


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
I am occasionally seeing this in the log. Is this just a timeout issue?
Should I be increasing the zk client timeout?

WARNING: Overseer cannot talk to ZK
Apr 3, 2013 11:14:25 PM
org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
INFO: Watcher fired on path: null state: Expired type None
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
run
WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue
        at
org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
        at
org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
        at
org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
        at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
        at
org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
        at
org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
        at
org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
        at
org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
        at
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
        at java.lang.Thread.run(Thread.java:662)
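
A note on the trace above: a SessionExpiredException means this node went
longer than the ZooKeeper session timeout without heartbeating (long GC pauses
under heavy indexing are the usual culprit), and the "exiting Overseer main
queue loop" warning just means the node dropped its Overseer role when the
session expired; the role is re-elected once a session is re-established.
Raising the timeout helps when the pauses are only slightly over the limit.
A minimal sketch of how that is usually done on 4.x, assuming a stock-style
solr.xml that reads the value from the zkClientTimeout system property
(check your own solr.xml first):

    <!-- solr.xml (4.x format): zkClientTimeout is an attribute on <cores> -->
    <cores adminPath="/admin/cores" host="${host:}" hostPort="${hostPort:}"
           zkClientTimeout="${zkClientTimeout:30000}">
      ...
    </cores>

or, overriding the property on the startup command already used in this thread
(the only new flag here is -DzkClientTimeout):

    java -server -Dshard=shard5 -DcoreName=shard5-core1 \
      -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf \
      -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
      -Djetty.port=7575 -DhostPort=7575 -DzkClientTimeout=30000 -jar start.jar

Keep in mind the ZooKeeper servers cap the negotiated session at
maxSessionTimeout (20x tickTime by default, i.e. 40 seconds with tickTime=2000),
so very large client values also require a zoo.cfg change.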



Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Just an update: I'm at 1M records now with no issues, so it looks like that
was the cause of my problems; thanks for the help. Is the routing behavior
with numShards documented anywhere? I know numShards itself is documented,
but I didn't know that the routing changed if you don't specify it.
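
For reference, a quick sketch of the two usual ways to supply numShards up
front, reusing the ports and config names from earlier in this thread (adjust
shard/core names and replicationFactor to your own layout). Either way,
clusterstate.json should then show the compositeId router and a hash range on
every shard.

Option 1: pass it as a system property when the first core registers, i.e. add
-DnumShards=6 to the startup command:

    java -server -Dshard=shard5 -DcoreName=shard5-core1 -DnumShards=6 \
      -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf \
      -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
      -Djetty.port=7575 -DhostPort=7575 -jar start.jar

Option 2: create the collection explicitly through the Collections API before
starting any indexers (note this creates the cores for you, unlike the manual
per-node core setup above):

    curl 'http://10.38.33.16:7575/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2&collection.configName=solr-conf'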


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
With these changes things are looking good; I'm up to 600,000 documents
without any issues as of right now. I'll keep going and add more to see if
I find anything.


On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <je...@gmail.com> wrote:

> ok, so that's not a deal breaker for me.  I just changed it to match the
> shards that are auto created and it looks like things are happy.  I'll go
> ahead and try my test to see if I can get things out of sync.

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
OK, so that's not a deal breaker for me.  I just changed my shard names to
match the ones that are auto-created and it looks like things are happy.
I'll go ahead and rerun my test to see if I can get things out of sync.
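
For reference, the startup now looks roughly like this.  Treat it as a
sketch of my test setup only; the ports, ZooKeeper hosts and the
numShards value of 6 are specific to this cluster, and numShards only
has an effect the first time the collection is created:

java -server -DnumShards=6 -Dshard=shard5 -DcoreName=shard5-core1
-Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
-Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
-Djetty.port=7575 -DhostPort=7575 -jar start.jar

With numShards set the overseer allocates the hash ranges up front, and
the -Dshard value just has to match one of the auto-created names
(shard1 through shard6).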


On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <ma...@gmail.com> wrote:

> I had thought you could - but looking at the code recently, I don't think
> you can anymore. I think that's a technical limitation more than anything
> though. When these changes were made, I think support for that was simply
> not added at the time.
>
> I'm not sure exactly how straightforward it would be, but it seems doable
> - as it is, the overseer will preallocate shards when first creating the
> collection - that's when they get named shard(n). There would have to be
> logic to replace shard(n) with the custom shard name when the core actually
> registers.
>
> - Mark

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time.

I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers.
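
To make that concrete, a collection bootstrapped with numShards ends up
with entries along these lines in clusterstate.json; the range values
below are illustrative rather than copied from a real cluster:

  "shard1":{
    "range":"80000000-aaa9ffff",
    "state":"active",
    "replicas":{...}},
  ...
  "router":"compositeId"

The range is what the up-front allocation buys you; a custom name would
have to be swapped in for shard(n) while keeping that range intact.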

- Mark

On Apr 3, 2013, at 3:42 PM, Jamie Johnson <je...@gmail.com> wrote:

> answered my own question, it now says compositeId.  What is problematic
> though is that in addition to my shards (which are say jamie-shard1) I see
> the solr created shards (shard1).  I assume that these were created because
> of the numShards param.  Is there no way to specify the names of these
> shards?
> 
> 
> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> ah interesting....so I need to specify num shards, blow out zk and then
>> try this again to see if things work properly now.  What is really strange
>> is that for the most part things have worked right and on 4.2.1 I have
>> 600,000 items indexed with no duplicates.  In any event I will specify num
>> shards clear out zk and begin again.  If this works properly what should
>> the router type be?
>> 
>> 
>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <ma...@gmail.com> wrote:
>> 
>>> If you don't specify numShards after 4.1, you get an implicit doc router
>>> and it's up to you to distribute updates. In the past, partitioning was
>>> done on the fly - but for shard splitting and perhaps other features, we
>>> now divvy up the hash range up front based on numShards and store it in
>>> ZooKeeper. No numShards is now how you take complete control of updates
>>> yourself.
>>> 
>>> - Mark
>>> 
>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> The router says "implicit".  I did start from a blank zk state but
>>> perhaps
>>>> I missed one of the ZkCLI commands?  One of my shards from the
>>>> clusterstate.json is shown below.  What is the process that should be
>>> done
>>>> to bootstrap a cluster other than the ZkCLI commands I listed above?  My
>>>> process right now is run those ZkCLI commands and then start solr on
>>> all of
>>>> the instances with a command like this
>>>> 
>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>>> -Dcollection.configName=solr-conf
>>>> -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>> 
>>>> I feel like maybe I'm missing a step.
>>>> 
>>>> "shard5":{
>>>>       "state":"active",
>>>>       "replicas":{
>>>>         "10.38.33.16:7575_solr_shard5-core1":{
>>>>           "shard":"shard5",
>>>>           "state":"active",
>>>>           "core":"shard5-core1",
>>>>           "collection":"collection1",
>>>>           "node_name":"10.38.33.16:7575_solr",
>>>>           "base_url":"http://10.38.33.16:7575/solr",
>>>>           "leader":"true"},
>>>>         "10.38.33.17:7577_solr_shard5-core2":{
>>>>           "shard":"shard5",
>>>>           "state":"recovering",
>>>>           "core":"shard5-core2",
>>>>           "collection":"collection1",
>>>>           "node_name":"10.38.33.17:7577_solr",
>>>>           "base_url":"http://10.38.33.17:7577/solr"}}}
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>> 
>>>>> It should be part of your clusterstate.json. Some users have reported
>>>>> trouble upgrading a previous zk install when this change came. I
>>>>> recommended manually updating the clusterstate.json to have the right
>>> info,
>>>>> and that seemed to work. Otherwise, I guess you have to start from a
>>> clean
>>>>> zk state.
>>>>> 
>>>>> If you don't have that range information, I think there will be
>>> trouble.
>>>>> Do you have an router type defined in the clusterstate.json?
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> 
>>>>>> Where is this information stored in ZK?  I don't see it in the cluster
>>>>>> state (or perhaps I don't understand it ;) ).
>>>>>> 
>>>>>> Perhaps something with my process is broken.  What I do when I start
>>> from
>>>>>> scratch is the following
>>>>>> 
>>>>>> ZkCLI -cmd upconfig ...
>>>>>> ZkCLI -cmd linkconfig ....
>>>>>> 
>>>>>> but I don't ever explicitly create the collection.  What should the
>>> steps
>>>>>> from scratch be?  I am moving from an unreleased snapshot of 4.0 so I
>>>>> never
>>>>>> did that previously either so perhaps I did create the collection in
>>> one
>>>>> of
>>>>>> my steps to get this working but have forgotten it along the way.
>>>>>> 
>>>>>> 
>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front
>>>>> when a
>>>>>>> collection is created - each shard gets a range, which is stored in
>>>>>>> zookeeper. You should not be able to end up with the same id on
>>>>> different
>>>>>>> shards - something very odd going on.
>>>>>>> 
>>>>>>> Hopefully I'll have some time to try and help you reproduce. Ideally
>>> we
>>>>>>> can capture it in a test case.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> no, my thought was wrong, it appears that even with the parameter
>>> set I
>>>>>>> am
>>>>>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0 by
>>>>>>> indexing
>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to 400,000
>>> or
>>>>>>> so.
>>>>>>>> I will try this on 4.2.1. to see if I see the same behavior
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <je...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Since I don't have that many items in my index I exported all of
>>> the
>>>>>>> keys
>>>>>>>>> for each shard and wrote a simple java program that checks for
>>>>>>> duplicates.
>>>>>>>>> I found some duplicate keys on different shards, a grep of the
>>> files
>>>>> for
>>>>>>>>> the keys found does indicate that they made it to the wrong places.
>>>>> If
>>>>>>> you
>>>>>>>>> notice documents with the same ID are on shard 3 and shard 5.  Is
>>> it
>>>>>>>>> possible that the hash is being calculated taking into account only
>>>>> the
>>>>>>>>> "live" nodes?  I know that we don't specify the numShards param @
>>>>>>> startup
>>>>>>>>> so could this be what is happening?
>>>>>>>>> 
>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>>>>>>> shard1-core1:0
>>>>>>>>> shard1-core2:0
>>>>>>>>> shard2-core1:0
>>>>>>>>> shard2-core2:0
>>>>>>>>> shard3-core1:1
>>>>>>>>> shard3-core2:1
>>>>>>>>> shard4-core1:0
>>>>>>>>> shard4-core2:0
>>>>>>>>> shard5-core1:1
>>>>>>>>> shard5-core2:1
>>>>>>>>> shard6-core1:0
>>>>>>>>> shard6-core2:0
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Something interesting that I'm noticing as well, I just indexed
>>>>> 300,000
>>>>>>>>>> items, and some how 300,020 ended up in the index.  I thought
>>>>> perhaps I
>>>>>>>>>> messed something up so I started the indexing again and indexed
>>>>> another
>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find
>>>>> possibile
>>>>>>>>>> duplicates?  I had tried to facet on key (our id field) but that
>>>>> didn't
>>>>>>>>>> give me anything with more than a count of 1.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go again.
>>> I
>>>>> am
>>>>>>>>>>> going to clear the index and try to replicate the problem on
>>> 4.2.0
>>>>>>> and then
>>>>>>>>>>> I'll try on 4.2.1
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>>> markrmiller@gmail.com
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> No, not that I know if, which is why I say we need to get to the
>>>>>>> bottom
>>>>>>>>>>>> of it.
>>>>>>>>>>>> 
>>>>>>>>>>>> - Mark
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Mark
>>>>>>>>>>>>> It's there a particular jira issue that you think may address
>>>>> this?
>>>>>>> I
>>>>>>>>>>>> read
>>>>>>>>>>>>> through it quickly but didn't see one that jumped out
>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I
>>> can
>>>>>>>>>>>> clear
>>>>>>>>>>>>>> the index and try4.2.1. I will save off the logs and see if
>>> there
>>>>>>> is
>>>>>>>>>>>>>> anything else odd
>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to start
>>>>>>>>>>>> tracking in
>>>>>>>>>>>>>>> a JIRA issue as well.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to
>>> get
>>>>> to
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in
>>> 4.2.1
>>>>>>>>>>>> (spreading
>>>>>>>>>>>>>>> to mirrors now).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2003@gmail.com
>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything
>>>>> else
>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be happy
>>> to
>>>>>>>>>>>> troll
>>>>>>>>>>>>>>>> through the logs further if more information is needed, just
>>>>> let
>>>>>>> me
>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix this.
>>> Is it
>>>>>>>>>>>>>>> required to
>>>>>>>>>>>>>>>> kill the index that is out of sync and let solr resync
>>> things?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>>>>> jej2003@gmail.com
>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException
>>>>> log
>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>> Server at
>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>>>>>>>>>>>> non
>>>>>>>>>>>>>>> ok
>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>   at
>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>   at
>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>>>>>>> jej2003@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>>> org.apache.solr.common.SolrException
>>>>> log
>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState
>>>>> says
>>>>>>>>>>>> we are
>>>>>>>>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>>>>>>   at
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>>>>>>> jej2003@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point there
>>> were
>>>>>>>>>>>> shards
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is below.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>>>>> state:SyncConnected
>>>>>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>>>>>>>>>>>>>>> updating... (live
>>>>>>>>>>>>>>>>>>> nodes size: 12)
>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be
>>>>> the
>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>>>>>>>>>> markrmiller@gmail.com
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply
>>> here.
>>>>>>>>>>>> Peersync
>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version numbers for
>>>>>>>>>>>> updates in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of them on
>>>>> leader
>>>>>>>>>>>> and
>>>>>>>>>>>>>>> replica.
>>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to have
>>> versions
>>>>>>>>>>>> that
>>>>>>>>>>>>>>> the leader
>>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any interesting
>>>>>>>>>>>> exceptions?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did
>>> any zk
>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>>>>> jej2003@gmail.com
>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to
>>> 4.2
>>>>> and
>>>>>>>>>>>>>>> noticed a
>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.  Specifically the
>>>>> replica
>>>>>>>>>>>> has a
>>>>>>>>>>>>>>>>>>>> higher
>>>>>>>>>>>>>>>>>>>>> version than the master which is causing the index to
>>> not
>>>>>>>>>>>>>>> replicate.
>>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents than
>>> the
>>>>>>>>>>>> master.
>>>>>>>>>>>>>>> What
>>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it short of
>>> taking
>>>>>>>>>>>> down the
>>>>>>>>>>>>>>>>>>>> index
>>>>>>>>>>>>>>>>>>>>> and scping the right version in?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>> Version:2387
>>>>>>>>>>>>>>>>>>>>> Segment Count:23
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>> Version:3001
>>>>>>>>>>>>>>>>>>>>> Segment Count:30
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> in the replicas log it says this:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>>> sync
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>>>>>>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/]
>>>>>>> nUpdates=100
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>>>>>>>>>>>>>>> Received 100 versions from
>>>>>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>>>>>>>>>>>>> handleVersions
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>>>>>>>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>>> sync
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a
>>> newer
>>>>>>>>>>>> version of
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while having 10
>>> threads
>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>>>>>> 10,000
>>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.
>>> Any
>>>>>>>>>>>> thoughts
>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Answered my own question: it now says compositeId.  What is problematic,
though, is that in addition to my shards (which are named e.g. jamie-shard1) I
see the Solr-created shards (shard1).  I assume these were created because of
the numShards param.  Is there no way to specify the names of these shards?
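
For reference, a minimal sketch of creating the collection up front through the
Collections API, so the shard names and hash ranges all come from numShards in
one place.  The host/port are the ones already shown in this thread;
numShards=6 and replicationFactor=2 are meant to match the 6-shard,
two-copies-per-shard layout described here, but treat the exact values as
illustrative:

curl "http://10.38.33.16:7575/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2&collection.configName=solr-conf"

A collection created this way gets the default shard1..shard6 names with
pre-assigned hash ranges; whether those names can be customized while still
using numShards is exactly the open question above.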


On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <je...@gmail.com> wrote:

> ah interesting....so I need to specify num shards, blow out zk and then
> try this again to see if things work properly now.  What is really strange
> is that for the most part things have worked right and on 4.2.1 I have
> 600,000 items indexed with no duplicates.  In any event I will specify num
> shards clear out zk and begin again.  If this works properly what should
> the router type be?
>
>
> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> If you don't specify numShards after 4.1, you get an implicit doc router
>> and it's up to you to distribute updates. In the past, partitioning was
>> done on the fly - but for shard splitting and perhaps other features, we
>> now divvy up the hash range up front based on numShards and store it in
>> ZooKeeper. No numShards is now how you take complete control of updates
>> yourself.
>>
>> - Mark
>>
>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>> > The router says "implicit".  I did start from a blank zk state but
>> perhaps
>> > I missed one of the ZkCLI commands?  One of my shards from the
>> > clusterstate.json is shown below.  What is the process that should be
>> done
>> > to bootstrap a cluster other than the ZkCLI commands I listed above?  My
>> > process right now is run those ZkCLI commands and then start solr on
>> all of
>> > the instances with a command like this
>> >
>> > java -server -Dshard=shard5 -DcoreName=shard5-core1
>> > -Dsolr.data.dir=/solr/data/shard5-core1
>> -Dcollection.configName=solr-conf
>> > -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>> > -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>> >
>> > I feel like maybe I'm missing a step.
>> >
>> > "shard5":{
>> >        "state":"active",
>> >        "replicas":{
>> >          "10.38.33.16:7575_solr_shard5-core1":{
>> >            "shard":"shard5",
>> >            "state":"active",
>> >            "core":"shard5-core1",
>> >            "collection":"collection1",
>> >            "node_name":"10.38.33.16:7575_solr",
>> >            "base_url":"http://10.38.33.16:7575/solr",
>> >            "leader":"true"},
>> >          "10.38.33.17:7577_solr_shard5-core2":{
>> >            "shard":"shard5",
>> >            "state":"recovering",
>> >            "core":"shard5-core2",
>> >            "collection":"collection1",
>> >            "node_name":"10.38.33.17:7577_solr",
>> >            "base_url":"http://10.38.33.17:7577/solr"}}}
>> >
>> >
>> > On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >
>> >> It should be part of your clusterstate.json. Some users have reported
>> >> trouble upgrading a previous zk install when this change came. I
>> >> recommended manually updating the clusterstate.json to have the right
>> info,
>> >> and that seemed to work. Otherwise, I guess you have to start from a
>> clean
>> >> zk state.
>> >>
>> >> If you don't have that range information, I think there will be
>> trouble.
>> >> Do you have an router type defined in the clusterstate.json?
>> >>
>> >> - Mark
>> >>
>> >> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com> wrote:
>> >>
>> >>> Where is this information stored in ZK?  I don't see it in the cluster
>> >>> state (or perhaps I don't understand it ;) ).
>> >>>
>> >>> Perhaps something with my process is broken.  What I do when I start
>> from
>> >>> scratch is the following
>> >>>
>> >>> ZkCLI -cmd upconfig ...
>> >>> ZkCLI -cmd linkconfig ....
>> >>>
>> >>> but I don't ever explicitly create the collection.  What should the
>> steps
>> >>> from scratch be?  I am moving from an unreleased snapshot of 4.0 so I
>> >> never
>> >>> did that previously either so perhaps I did create the collection in
>> one
>> >> of
>> >>> my steps to get this working but have forgotten it along the way.
>> >>>
>> >>>
>> >>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <ma...@gmail.com>
>> >> wrote:
>> >>>
>> >>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front
>> >> when a
>> >>>> collection is created - each shard gets a range, which is stored in
>> >>>> zookeeper. You should not be able to end up with the same id on
>> >> different
>> >>>> shards - something very odd going on.
>> >>>>
>> >>>> Hopefully I'll have some time to try and help you reproduce. Ideally
>> we
>> >>>> can capture it in a test case.
>> >>>>
>> >>>> - Mark
>> >>>>
>> >>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com> wrote:
>> >>>>
>> >>>>> no, my thought was wrong, it appears that even with the parameter
>> set I
>> >>>> am
>> >>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0 by
>> >>>> indexing
>> >>>>> 100,000 documents on 10 threads (10,000 each) when I get to 400,000
>> or
>> >>>> so.
>> >>>>> I will try this on 4.2.1. to see if I see the same behavior
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <je...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>> Since I don't have that many items in my index I exported all of
>> the
>> >>>> keys
>> >>>>>> for each shard and wrote a simple java program that checks for
>> >>>> duplicates.
>> >>>>>> I found some duplicate keys on different shards, a grep of the
>> files
>> >> for
>> >>>>>> the keys found does indicate that they made it to the wrong places.
>> >> If
>> >>>> you
>> >>>>>> notice documents with the same ID are on shard 3 and shard 5.  Is
>> it
>> >>>>>> possible that the hash is being calculated taking into account only
>> >> the
>> >>>>>> "live" nodes?  I know that we don't specify the numShards param @
>> >>>> startup
>> >>>>>> so could this be what is happening?
>> >>>>>>
>> >>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> >>>>>> shard1-core1:0
>> >>>>>> shard1-core2:0
>> >>>>>> shard2-core1:0
>> >>>>>> shard2-core2:0
>> >>>>>> shard3-core1:1
>> >>>>>> shard3-core2:1
>> >>>>>> shard4-core1:0
>> >>>>>> shard4-core2:0
>> >>>>>> shard5-core1:1
>> >>>>>> shard5-core2:1
>> >>>>>> shard6-core1:0
>> >>>>>> shard6-core2:0
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com>
>> >>>> wrote:
>> >>>>>>
>> >>>>>>> Something interesting that I'm noticing as well, I just indexed
>> >> 300,000
>> >>>>>>> items, and some how 300,020 ended up in the index.  I thought
>> >> perhaps I
>> >>>>>>> messed something up so I started the indexing again and indexed
>> >> another
>> >>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find
>> >> possibile
>> >>>>>>> duplicates?  I had tried to facet on key (our id field) but that
>> >> didn't
>> >>>>>>> give me anything with more than a count of 1.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com>
>> >>>> wrote:
>> >>>>>>>
>> >>>>>>>> Ok, so clearing the transaction log allowed things to go again.
>>  I
>> >> am
>> >>>>>>>> going to clear the index and try to replicate the problem on
>> 4.2.0
>> >>>> and then
>> >>>>>>>> I'll try on 4.2.1
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>> markrmiller@gmail.com
>> >>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> No, not that I know if, which is why I say we need to get to the
>> >>>> bottom
>> >>>>>>>>> of it.
>> >>>>>>>>>
>> >>>>>>>>> - Mark
>> >>>>>>>>>
>> >>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Mark
>> >>>>>>>>>> It's there a particular jira issue that you think may address
>> >> this?
>> >>>> I
>> >>>>>>>>> read
>> >>>>>>>>>> through it quickly but didn't see one that jumped out
>> >>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com>
>> >> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I
>> can
>> >>>>>>>>> clear
>> >>>>>>>>>>> the index and try4.2.1. I will save off the logs and see if
>> there
>> >>>> is
>> >>>>>>>>>>> anything else odd
>> >>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> It would appear it's a bug given what you have said.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Any other exceptions would be useful. Might be best to start
>> >>>>>>>>> tracking in
>> >>>>>>>>>>>> a JIRA issue as well.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to
>> get
>> >> to
>> >>>>>>>>> the
>> >>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in
>> 4.2.1
>> >>>>>>>>> (spreading
>> >>>>>>>>>>>> to mirrors now).
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2003@gmail.com
>> >
>> >>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything
>> >> else
>> >>>>>>>>> that I
>> >>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be happy
>> to
>> >>>>>>>>> troll
>> >>>>>>>>>>>>> through the logs further if more information is needed, just
>> >> let
>> >>>> me
>> >>>>>>>>>>>> know.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Also what is the most appropriate mechanism to fix this.
>>  Is it
>> >>>>>>>>>>>> required to
>> >>>>>>>>>>>>> kill the index that is out of sync and let solr resync
>> things?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>> >> jej2003@gmail.com
>> >>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> sorry for spamming here....
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException
>> >> log
>> >>>>>>>>>>>>>> SEVERE: shard update error StdNode:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>> >>>>>>>>>>>> :
>> >>>>>>>>>>>>>> Server at
>> >> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
>> >>>>>>>>> non
>> >>>>>>>>>>>> ok
>> >>>>>>>>>>>>>> status:503, message:Service Unavailable
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>    at
>> >> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>    at
>> >> java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >>>>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:662)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>> >>>> jej2003@gmail.com>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> here is another one that looks interesting
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM
>> org.apache.solr.common.SolrException
>> >> log
>> >>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState
>> >> says
>> >>>>>>>>> we are
>> >>>>>>>>>>>>>>> the leader, but locally we don't think so
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>> >>>>>>>>>>>>>>>    at
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>> >>>> jej2003@gmail.com
>> >>>>>>>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Looking at the master it looks like at some point there
>> were
>> >>>>>>>>> shards
>> >>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>> went down.  I am seeing things like what is below.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
>> >> state:SyncConnected
>> >>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>> >>>>>>>>>>>> updating... (live
>> >>>>>>>>>>>>>>>> nodes size: 12)
>> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>> >>>>>>>>>>>>>>>> process
>> >>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>> runLeaderProcess
>> >>>>>>>>>>>>>>>> INFO: Running the leader process.
>> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>> shouldIBeLeader
>> >>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>> shouldIBeLeader
>> >>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be
>> >> the
>> >>>>>>>>> leader.
>> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>>>>>>>>>>> runLeaderProcess
>> >>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>> >>>>>>>>> markrmiller@gmail.com
>> >>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply
>> here.
>> >>>>>>>>> Peersync
>> >>>>>>>>>>>>>>>>> does not look at that - it looks at version numbers for
>> >>>>>>>>> updates in
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of them on
>> >> leader
>> >>>>>>>>> and
>> >>>>>>>>>>>> replica.
>> >>>>>>>>>>>>>>>>> What it's saying is that the replica seems to have
>> versions
>> >>>>>>>>> that
>> >>>>>>>>>>>> the leader
>> >>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any interesting
>> >>>>>>>>> exceptions?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did
>> any zk
>> >>>>>>>>> session
>> >>>>>>>>>>>>>>>>> timeouts occur?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>> >> jej2003@gmail.com
>> >>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to
>> 4.2
>> >> and
>> >>>>>>>>>>>> noticed a
>> >>>>>>>>>>>>>>>>>> strange issue while testing today.  Specifically the
>> >> replica
>> >>>>>>>>> has a
>> >>>>>>>>>>>>>>>>> higher
>> >>>>>>>>>>>>>>>>>> version than the master which is causing the index to
>> not
>> >>>>>>>>>>>> replicate.
>> >>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents than
>> the
>> >>>>>>>>> master.
>> >>>>>>>>>>>> What
>> >>>>>>>>>>>>>>>>>> could cause this and how can I resolve it short of
>> taking
>> >>>>>>>>> down the
>> >>>>>>>>>>>>>>>>> index
>> >>>>>>>>>>>>>>>>>> and scping the right version in?
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> MASTER:
>> >>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>> >>>>>>>>>>>>>>>>>> Num Docs:164880
>> >>>>>>>>>>>>>>>>>> Max Doc:164880
>> >>>>>>>>>>>>>>>>>> Deleted Docs:0
>> >>>>>>>>>>>>>>>>>> Version:2387
>> >>>>>>>>>>>>>>>>>> Segment Count:23
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> REPLICA:
>> >>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>> >>>>>>>>>>>>>>>>>> Num Docs:164773
>> >>>>>>>>>>>>>>>>>> Max Doc:164773
>> >>>>>>>>>>>>>>>>>> Deleted Docs:0
>> >>>>>>>>>>>>>>>>>> Version:3001
>> >>>>>>>>>>>>>>>>>> Segment Count:30
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> in the replicas log it says this:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> INFO: Creating new http client,
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>
>> >>
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> >> sync
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>> >>>>>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/]
>> >>>> nUpdates=100
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> >>>>>>>>>>>> handleVersions
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>> >>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>> >>>>>>>>>>>>>>>>>> Received 100 versions from
>> >>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> >>>>>>>>>>>> handleVersions
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>> >>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>> >>>>>>>>>>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>> >>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> >> sync
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a
>> newer
>> >>>>>>>>> version of
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> index so it aborts.  This happened while having 10
>> threads
>> >>>>>>>>> indexing
>> >>>>>>>>>>>>>>>>> 10,000
>> >>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.
>>  Any
>> >>>>>>>>> thoughts
>> >>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Ah, interesting... so I need to specify numShards, blow out zk, and then try
this again to see if things work properly now.  What is really strange is that
for the most part things have worked right, and on 4.2.1 I have 600,000 items
indexed with no duplicates.  In any event I will specify numShards, clear out
zk, and begin again.  If this works properly, what should the router type be?
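
As a sketch, the startup command quoted earlier in this thread with numShards
added might look like the following; the value 6 just matches this 6-shard
cluster, and dropping the explicit -Dshard=shard5 so that Solr assigns cores to
shards itself is an assumption on my part, not something stated in the thread:

java -server -DnumShards=6 -DcoreName=shard5-core1 \
  -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf \
  -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar

numShards should only matter for the first core that creates the collection;
after that the value lives in ZooKeeper.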


On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <ma...@gmail.com> wrote:

> If you don't specify numShards after 4.1, you get an implicit doc router
> and it's up to you to distribute updates. In the past, partitioning was
> done on the fly - but for shard splitting and perhaps other features, we
> now divvy up the hash range up front based on numShards and store it in
> ZooKeeper. No numShards is now how you take complete control of updates
> yourself.
>
> - Mark
>
> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > The router says "implicit".  I did start from a blank zk state but
> perhaps
> > I missed one of the ZkCLI commands?  One of my shards from the
> > clusterstate.json is shown below.  What is the process that should be
> done
> > to bootstrap a cluster other than the ZkCLI commands I listed above?  My
> > process right now is run those ZkCLI commands and then start solr on all
> of
> > the instances with a command like this
> >
> > java -server -Dshard=shard5 -DcoreName=shard5-core1
> > -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
> > -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
> > -Djetty.port=7575 -DhostPort=7575 -jar start.jar
> >
> > I feel like maybe I'm missing a step.
> >
> > "shard5":{
> >        "state":"active",
> >        "replicas":{
> >          "10.38.33.16:7575_solr_shard5-core1":{
> >            "shard":"shard5",
> >            "state":"active",
> >            "core":"shard5-core1",
> >            "collection":"collection1",
> >            "node_name":"10.38.33.16:7575_solr",
> >            "base_url":"http://10.38.33.16:7575/solr",
> >            "leader":"true"},
> >          "10.38.33.17:7577_solr_shard5-core2":{
> >            "shard":"shard5",
> >            "state":"recovering",
> >            "core":"shard5-core2",
> >            "collection":"collection1",
> >            "node_name":"10.38.33.17:7577_solr",
> >            "base_url":"http://10.38.33.17:7577/solr"}}}
> >
> >
> > On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >
> >> It should be part of your clusterstate.json. Some users have reported
> >> trouble upgrading a previous zk install when this change came. I
> >> recommended manually updating the clusterstate.json to have the right
> info,
> >> and that seemed to work. Otherwise, I guess you have to start from a
> clean
> >> zk state.
> >>
> >> If you don't have that range information, I think there will be trouble.
> >> Do you have an router type defined in the clusterstate.json?
> >>
> >> - Mark
> >>
> >> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>
> >>> Where is this information stored in ZK?  I don't see it in the cluster
> >>> state (or perhaps I don't understand it ;) ).
> >>>
> >>> Perhaps something with my process is broken.  What I do when I start
> from
> >>> scratch is the following
> >>>
> >>> ZkCLI -cmd upconfig ...
> >>> ZkCLI -cmd linkconfig ....
> >>>
> >>> but I don't ever explicitly create the collection.  What should the
> steps
> >>> from scratch be?  I am moving from an unreleased snapshot of 4.0 so I
> >> never
> >>> did that previously either so perhaps I did create the collection in
> one
> >> of
> >>> my steps to get this working but have forgotten it along the way.
> >>>
> >>>
> >>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <ma...@gmail.com>
> >> wrote:
> >>>
> >>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front
> >> when a
> >>>> collection is created - each shard gets a range, which is stored in
> >>>> zookeeper. You should not be able to end up with the same id on
> >> different
> >>>> shards - something very odd going on.
> >>>>
> >>>> Hopefully I'll have some time to try and help you reproduce. Ideally
> we
> >>>> can capture it in a test case.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>>>
> >>>>> no, my thought was wrong, it appears that even with the parameter
> set I
> >>>> am
> >>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0 by
> >>>> indexing
> >>>>> 100,000 documents on 10 threads (10,000 each) when I get to 400,000
> or
> >>>> so.
> >>>>> I will try this on 4.2.1. to see if I see the same behavior
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <je...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Since I don't have that many items in my index I exported all of the
> >>>> keys
> >>>>>> for each shard and wrote a simple java program that checks for
> >>>> duplicates.
> >>>>>> I found some duplicate keys on different shards, a grep of the files
> >> for
> >>>>>> the keys found does indicate that they made it to the wrong places.
> >> If
> >>>> you
> >>>>>> notice documents with the same ID are on shard 3 and shard 5.  Is it
> >>>>>> possible that the hash is being calculated taking into account only
> >> the
> >>>>>> "live" nodes?  I know that we don't specify the numShards param @
> >>>> startup
> >>>>>> so could this be what is happening?
> >>>>>>
> >>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >>>>>> shard1-core1:0
> >>>>>> shard1-core2:0
> >>>>>> shard2-core1:0
> >>>>>> shard2-core2:0
> >>>>>> shard3-core1:1
> >>>>>> shard3-core2:1
> >>>>>> shard4-core1:0
> >>>>>> shard4-core2:0
> >>>>>> shard5-core1:1
> >>>>>> shard5-core2:1
> >>>>>> shard6-core1:0
> >>>>>> shard6-core2:0
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>>> Something interesting that I'm noticing as well, I just indexed
> >> 300,000
> >>>>>>> items, and some how 300,020 ended up in the index.  I thought
> >> perhaps I
> >>>>>>> messed something up so I started the indexing again and indexed
> >> another
> >>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find
> >> possibile
> >>>>>>> duplicates?  I had tried to facet on key (our id field) but that
> >> didn't
> >>>>>>> give me anything with more than a count of 1.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Ok, so clearing the transaction log allowed things to go again.  I
> >> am
> >>>>>>>> going to clear the index and try to replicate the problem on 4.2.0
> >>>> and then
> >>>>>>>> I'll try on 4.2.1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
> markrmiller@gmail.com
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> No, not that I know if, which is why I say we need to get to the
> >>>> bottom
> >>>>>>>>> of it.
> >>>>>>>>>
> >>>>>>>>> - Mark
> >>>>>>>>>
> >>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Mark
> >>>>>>>>>> It's there a particular jira issue that you think may address
> >> this?
> >>>> I
> >>>>>>>>> read
> >>>>>>>>>> through it quickly but didn't see one that jumped out
> >>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com>
> >> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I
> can
> >>>>>>>>> clear
> >>>>>>>>>>> the index and try4.2.1. I will save off the logs and see if
> there
> >>>> is
> >>>>>>>>>>> anything else odd
> >>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any other exceptions would be useful. Might be best to start
> >>>>>>>>> tracking in
> >>>>>>>>>>>> a JIRA issue as well.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get
> >> to
> >>>>>>>>> the
> >>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1
> >>>>>>>>> (spreading
> >>>>>>>>>>>> to mirrors now).
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Mark
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything
> >> else
> >>>>>>>>> that I
> >>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be happy
> to
> >>>>>>>>> troll
> >>>>>>>>>>>>> through the logs further if more information is needed, just
> >> let
> >>>> me
> >>>>>>>>>>>> know.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Also what is the most appropriate mechanism to fix this.  Is
> it
> >>>>>>>>>>>> required to
> >>>>>>>>>>>>> kill the index that is out of sync and let solr resync
> things?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
> >> jej2003@gmail.com
> >>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> sorry for spamming here....
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException
> >> log
> >>>>>>>>>>>>>> SEVERE: shard update error StdNode:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
> >>>>>>>>>>>> :
> >>>>>>>>>>>>>> Server at
> >> http://10.38.33.17:7577/solr/dsc-shard5-core2returned
> >>>>>>>>> non
> >>>>>>>>>>>> ok
> >>>>>>>>>>>>>> status:503, message:Service Unavailable
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>    at
> >> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>    at
> >> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
> >>>> jej2003@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> here is another one that looks interesting
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException
> >> log
> >>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState
> >> says
> >>>>>>>>> we are
> >>>>>>>>>>>>>>> the leader, but locally we don't think so
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>>>>>>    at
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
> >>>> jej2003@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Looking at the master it looks like at some point there
> were
> >>>>>>>>> shards
> >>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> went down.  I am seeing things like what is below.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent
> >> state:SyncConnected
> >>>>>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
> >>>>>>>>>>>> updating... (live
> >>>>>>>>>>>>>>>> nodes size: 12)
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>> org.apache.solr.common.cloud.ZkStateReader$3
> >>>>>>>>>>>>>>>> process
> >>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be
> >> the
> >>>>>>>>> leader.
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
> >>>>>>>>> markrmiller@gmail.com
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply
> here.
> >>>>>>>>> Peersync
> >>>>>>>>>>>>>>>>> does not look at that - it looks at version numbers for
> >>>>>>>>> updates in
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of them on
> >> leader
> >>>>>>>>> and
> >>>>>>>>>>>> replica.
> >>>>>>>>>>>>>>>>> What it's saying is that the replica seems to have
> versions
> >>>>>>>>> that
> >>>>>>>>>>>> the leader
> >>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any interesting
> >>>>>>>>> exceptions?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any
> zk
> >>>>>>>>> session
> >>>>>>>>>>>>>>>>> timeouts occur?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
> >> jej2003@gmail.com
> >>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2
> >> and
> >>>>>>>>>>>> noticed a
> >>>>>>>>>>>>>>>>>> strange issue while testing today.  Specifically the
> >> replica
> >>>>>>>>> has a
> >>>>>>>>>>>>>>>>> higher
> >>>>>>>>>>>>>>>>>> version than the master which is causing the index to
> not
> >>>>>>>>>>>> replicate.
> >>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents than the
> >>>>>>>>> master.
> >>>>>>>>>>>> What
> >>>>>>>>>>>>>>>>>> could cause this and how can I resolve it short of
> taking
> >>>>>>>>> down the
> >>>>>>>>>>>>>>>>> index
> >>>>>>>>>>>>>>>>>> and scping the right version in?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
> >>>>>>>>>>>>>>>>>> Num Docs:164880
> >>>>>>>>>>>>>>>>>> Max Doc:164880
> >>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>> Version:2387
> >>>>>>>>>>>>>>>>>> Segment Count:23
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>>>>>> Num Docs:164773
> >>>>>>>>>>>>>>>>>> Max Doc:164773
> >>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>> Version:3001
> >>>>>>>>>>>>>>>>>> Segment Count:30
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> in the replicas log it says this:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> INFO: Creating new http client,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >> sync
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
> >>>>>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/]
> >>>> nUpdates=100
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >>>>>>>>>>>> handleVersions
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr
> >>>>>>>>>>>>>>>>>> Received 100 versions from
> >>>>>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >>>>>>>>>>>> handleVersions
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
> >>>>>>>>>>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
> >>>>>>>>>>>>>>>>>> otherHigh=1431233789440294912
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >> sync
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a newer
> >>>>>>>>> version of
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> index so it aborts.  This happened while having 10
> threads
> >>>>>>>>> indexing
> >>>>>>>>>>>>>>>>> 10,000
> >>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.
>  Any
> >>>>>>>>> thoughts
> >>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly, but for shard splitting and perhaps other features we now divvy up the hash range up front based on numShards and store it in ZooKeeper. Omitting numShards is now how you take complete control of updates yourself.
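
As a rough sketch (the range values below are made up, not from this cluster),
a collection created with numShards carries the router name and a pre-assigned
hash range per shard in clusterstate.json, along the lines of:

  "router":"compositeId",
  "shards":{
    "shard1":{
      "range":"80000000-aaa9ffff",
      "state":"active",
      "replicas":{ ... }},
    ...}

The shard5 entry quoted earlier has no "range" field at all, which is what the
implicit-router / no-numShards case looks like.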

- Mark

On Apr 3, 2013, at 2:57 PM, Jamie Johnson <je...@gmail.com> wrote:

> The router says "implicit".  I did start from a blank zk state but perhaps
> I missed one of the ZkCLI commands?  One of my shards from the
> clusterstate.json is shown below.  What is the process that should be done
> to bootstrap a cluster other than the ZkCLI commands I listed above?  My
> process right now is run those ZkCLI commands and then start solr on all of
> the instances with a command like this
> 
> java -server -Dshard=shard5 -DcoreName=shard5-core1
> -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
> -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
> 
> I feel like maybe I'm missing a step.
> 
> "shard5":{
>        "state":"active",
>        "replicas":{
>          "10.38.33.16:7575_solr_shard5-core1":{
>            "shard":"shard5",
>            "state":"active",
>            "core":"shard5-core1",
>            "collection":"collection1",
>            "node_name":"10.38.33.16:7575_solr",
>            "base_url":"http://10.38.33.16:7575/solr",
>            "leader":"true"},
>          "10.38.33.17:7577_solr_shard5-core2":{
>            "shard":"shard5",
>            "state":"recovering",
>            "core":"shard5-core2",
>            "collection":"collection1",
>            "node_name":"10.38.33.17:7577_solr",
>            "base_url":"http://10.38.33.17:7577/solr"}}}


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
The router says "implicit".  I did start from a blank zk state but perhaps
I missed one of the ZkCLI commands?  One of my shards from the
clusterstate.json is shown below.  What is the process that should be done
to bootstrap a cluster other than the ZkCLI commands I listed above?  My
process right now is to run those ZkCLI commands and then start solr on all of
the instances with a command like this

java -server -Dshard=shard5 -DcoreName=shard5-core1
-Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
-Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
-Djetty.port=7575 -DhostPort=7575 -jar start.jar

I feel like maybe I'm missing a step.

"shard5":{
        "state":"active",
        "replicas":{
          "10.38.33.16:7575_solr_shard5-core1":{
            "shard":"shard5",
            "state":"active",
            "core":"shard5-core1",
            "collection":"collection1",
            "node_name":"10.38.33.16:7575_solr",
            "base_url":"http://10.38.33.16:7575/solr",
            "leader":"true"},
          "10.38.33.17:7577_solr_shard5-core2":{
            "shard":"shard5",
            "state":"recovering",
            "core":"shard5-core2",
            "collection":"collection1",
            "node_name":"10.38.33.17:7577_solr",
            "base_url":"http://10.38.33.17:7577/solr"}}}


On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <ma...@gmail.com> wrote:

> It should be part of your clusterstate.json. Some users have reported
> trouble upgrading a previous zk install when this change came. I
> recommended manually updating the clusterstate.json to have the right info,
> and that seemed to work. Otherwise, I guess you have to start from a clean
> zk state.
>
> If you don't have that range information, I think there will be trouble.
> Do you have a router type defined in the clusterstate.json?
>
> - Mark

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state.

If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json?
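
A rough sketch of how that manual update can be done with ZooKeeper's own client shell, assuming a stock ZooKeeper install and the zkHost names used earlier in this thread (save the output of get somewhere safe before setting anything):

./zkCli.sh -server so-zoo1:2181
get /clusterstate.json
set /clusterstate.json <the edited JSON, with a router and a range per shard>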

- Mark

On Apr 3, 2013, at 2:24 PM, Jamie Johnson <je...@gmail.com> wrote:

> Where is this information stored in ZK?  I don't see it in the cluster
> state (or perhaps I don't understand it ;) ).
> 
> Perhaps something with my process is broken.  What I do when I start from
> scratch is the following
> 
> ZkCLI -cmd upconfig ...
> ZkCLI -cmd linkconfig ....
> 
> but I don't ever explicitly create the collection.  What should the steps
> from scratch be?  I am moving from an unreleased snapshot of 4.0, so I never
> did that previously either; perhaps I did create the collection in one of
> my steps to get this working but have forgotten it along the way.


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Where is this information stored in ZK?  I don't see it in the cluster
state (or perhaps I don't understand it ;) ).

Perhaps something with my process is broken.  What I do when I start from
scratch is the following

ZkCLI -cmd upconfig ...
ZkCLI -cmd linkconfig ....

but I don't ever explicitly create the collection.  What should the steps
from scratch be?  I am moving from an unreleased snapshot of 4.0, so I never
did that previously either; perhaps I did create the collection in one of
my steps to get this working but have forgotten it along the way.
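
For completeness, a sketch of the usual shape of those two ZkCLI invocations; the classpath and config directory below are placeholders, not taken from this setup:

java -classpath <solr webapp libs> org.apache.solr.cloud.ZkCLI -cmd upconfig \
  -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -confdir /path/to/conf -confname solr-conf
java -classpath <solr webapp libs> org.apache.solr.cloud.ZkCLI -cmd linkconfig \
  -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -collection collection1 -confname solr-conf

Neither command creates the collection or assigns hash ranges; that only happens when the collection itself is created (numShards when the first core registers, or a Collections API CREATE), which appears to be the missing step here.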


On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <ma...@gmail.com> wrote:

> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a
> collection is created - each shard gets a range, which is stored in
> zookeeper. You should not be able to end up with the same id on different
> shards - something very odd going on.
>
> Hopefully I'll have some time to try and help you reproduce. Ideally we
> can capture it in a test case.
>
> - Mark
>
> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > no, my thought was wrong, it appears that even with the parameter set I
> am
> > seeing this behavior.  I've been able to duplicate it on 4.2.0 by
> indexing
> > 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or
> so.
> > I will try this on 4.2.1. to see if I see the same behavior
> >
> >
> > On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >
> >> Since I don't have that many items in my index I exported all of the
> keys
> >> for each shard and wrote a simple java program that checks for
> duplicates.
> >> I found some duplicate keys on different shards, a grep of the files for
> >> the keys found does indicate that they made it to the wrong places.  If
> you
> >> notice documents with the same ID are on shard 3 and shard 5.  Is it
> >> possible that the hash is being calculated taking into account only the
> >> "live" nodes?  I know that we don't specify the numShards param @
> startup
> >> so could this be what is happening?
> >>
> >> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >> shard1-core1:0
> >> shard1-core2:0
> >> shard2-core1:0
> >> shard2-core2:0
> >> shard3-core1:1
> >> shard3-core2:1
> >> shard4-core1:0
> >> shard4-core2:0
> >> shard5-core1:1
> >> shard5-core2:1
> >> shard6-core1:0
> >> shard6-core2:0
> >>
> >>
> >> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>
> >>> Something interesting that I'm noticing as well, I just indexed 300,000
> >>> items, and some how 300,020 ended up in the index.  I thought perhaps I
> >>> messed something up so I started the indexing again and indexed another
> >>> 400,000 and I see 400,064 docs.  Is there a good way to find possibile
> >>> duplicates?  I had tried to facet on key (our id field) but that didn't
> >>> give me anything with more than a count of 1.
> >>>
> >>>
> >>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>
> >>>> Ok, so clearing the transaction log allowed things to go again.  I am
> >>>> going to clear the index and try to replicate the problem on 4.2.0
> and then
> >>>> I'll try on 4.2.1
> >>>>
> >>>>
> >>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmiller@gmail.com
> >wrote:
> >>>>
> >>>>> No, not that I know if, which is why I say we need to get to the
> bottom
> >>>>> of it.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Mark
> >>>>>> It's there a particular jira issue that you think may address this?
> I
> >>>>> read
> >>>>>> through it quickly but didn't see one that jumped out
> >>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I brought the bad one down and back up and it did nothing.  I can
> >>>>> clear
> >>>>>>> the index and try4.2.1. I will save off the logs and see if there
> is
> >>>>>>> anything else odd
> >>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>
> >>>>>>>> Any other exceptions would be useful. Might be best to start
> >>>>> tracking in
> >>>>>>>> a JIRA issue as well.
> >>>>>>>>
> >>>>>>>> To fix, I'd bring the behind node down and back again.
> >>>>>>>>
> >>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to
> >>>>> the
> >>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1
> >>>>> (spreading
> >>>>>>>> to mirrors now).
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything else
> >>>>> that I
> >>>>>>>>> should be looking for here and is this a bug?  I'd be happy to
> >>>>> troll
> >>>>>>>>> through the logs further if more information is needed, just let
> me
> >>>>>>>> know.
> >>>>>>>>>
> >>>>>>>>> Also what is the most appropriate mechanism to fix this.  Is it
> >>>>>>>> required to
> >>>>>>>>> kill the index that is out of sync and let solr resync things?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> sorry for spamming here....
> >>>>>>>>>>
> >>>>>>>>>> shard5-core2 is the instance we're having issues with...
> >>>>>>>>>>
> >>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>> SEVERE: shard update error StdNode:
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
> >>>>>>>> :
> >>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2returned
> >>>>> non
> >>>>>>>> ok
> >>>>>>>>>> status:503, message:Service Unavailable
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>>      at
> >>>>>>>>>>
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>>      at
> >>>>>>>>>>
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>>      at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>>      at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
> jej2003@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> here is another one that looks interesting
> >>>>>>>>>>>
> >>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says
> >>>>> we are
> >>>>>>>>>>> the leader, but locally we don't think so
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>      at
> >>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>>      at
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
> jej2003@gmail.com
> >>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Looking at the master it looks like at some point there were
> >>>>> shards
> >>>>>>>> that
> >>>>>>>>>>>> went down.  I am seeing things like what is below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
> >>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
> >>>>>>>> updating... (live
> >>>>>>>>>>>> nodes size: 12)
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>> org.apache.solr.common.cloud.ZkStateReader$3
> >>>>>>>>>>>> process
> >>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>> shouldIBeLeader
> >>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the
> >>>>> leader.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
> >>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>>> runLeaderProcess
> >>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
> >>>>> markrmiller@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I don't think the versions you are thinking of apply here.
> >>>>> Peersync
> >>>>>>>>>>>>> does not look at that - it looks at version numbers for
> >>>>> updates in
> >>>>>>>> the
> >>>>>>>>>>>>> transaction log - it compares the last 100 of them on leader
> >>>>> and
> >>>>>>>> replica.
> >>>>>>>>>>>>> What it's saying is that the replica seems to have versions
> >>>>> that
> >>>>>>>> the leader
> >>>>>>>>>>>>> does not. Have you scanned the logs for any interesting
> >>>>> exceptions?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk
> >>>>> session
> >>>>>>>>>>>>> timeouts occur?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
> >>>>>>>> noticed a
> >>>>>>>>>>>>>> strange issue while testing today.  Specifically the replica
> >>>>> has a
> >>>>>>>>>>>>> higher
> >>>>>>>>>>>>>> version than the master which is causing the index to not
> >>>>>>>> replicate.
> >>>>>>>>>>>>>> Because of this the replica has fewer documents than the
> >>>>> master.
> >>>>>>>> What
> >>>>>>>>>>>>>> could cause this and how can I resolve it short of taking
> >>>>> down the
> >>>>>>>>>>>>> index
> >>>>>>>>>>>>>> and scping the right version in?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>> Last Modified:about an hour ago
> >>>>>>>>>>>>>> Num Docs:164880
> >>>>>>>>>>>>>> Max Doc:164880
> >>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>> Version:2387
> >>>>>>>>>>>>>> Segment Count:23
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>> Num Docs:164773
> >>>>>>>>>>>>>> Max Doc:164773
> >>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>> Version:3001
> >>>>>>>>>>>>>> Segment Count:30
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> in the replicas log it says this:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: Creating new http client,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>
> >>>>>
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
> >>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/]
> nUpdates=100
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >>>>>>>> handleVersions
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>> http://10.38.33.17:7577/solr
> >>>>>>>>>>>>>> Received 100 versions from
> >>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >>>>>>>> handleVersions
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
> >>>>>>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
> >>>>>>>>>>>>>> otherHigh=1431233789440294912
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> which again seems to point that it thinks it has a newer
> >>>>> version of
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> index so it aborts.  This happened while having 10 threads
> >>>>> indexing
> >>>>>>>>>>>>> 10,000
> >>>>>>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any
> >>>>> thoughts
> >>>>>>>> on
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>> or what I should look for would be appreciated.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in ZooKeeper. You should not be able to end up with the same id on different shards - something very odd is going on.

Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
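
In the meantime, a quick way to double check the range each shard was
assigned is to dump /clusterstate.json straight out of ZooKeeper. A minimal
sketch using the plain ZooKeeper client follows - the ensemble address is a
placeholder, so point it at your own zk host (and chroot, if any):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DumpClusterState {
  public static void main(String[] args) throws Exception {
    // placeholder address - point this at your ZooKeeper ensemble
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) {
      }
    });
    // SolrCloud keeps each collection's shards and their hash ranges here
    byte[] data = zk.getData("/clusterstate.json", false, null);
    System.out.println(new String(data, "UTF-8"));
    zk.close();
  }
}

Each shard's entry there should include the range it owns, so it is worth
checking that the ranges are present and that they don't overlap.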

- Mark

On Apr 3, 2013, at 1:13 PM, Jamie Johnson <je...@gmail.com> wrote:

> no, my thought was wrong, it appears that even with the parameter set I am
> seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing
> 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.
> I will try this on 4.2.1. to see if I see the same behavior
> 
> 
> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> Since I don't have that many items in my index I exported all of the keys
>> for each shard and wrote a simple java program that checks for duplicates.
>> I found some duplicate keys on different shards, a grep of the files for
>> the keys found does indicate that they made it to the wrong places.  If you
>> notice documents with the same ID are on shard 3 and shard 5.  Is it
>> possible that the hash is being calculated taking into account only the
>> "live" nodes?  I know that we don't specify the numShards param @ startup
>> so could this be what is happening?
>> 
>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> shard1-core1:0
>> shard1-core2:0
>> shard2-core1:0
>> shard2-core2:0
>> shard3-core1:1
>> shard3-core2:1
>> shard4-core1:0
>> shard4-core2:0
>> shard5-core1:1
>> shard5-core2:1
>> shard6-core1:0
>> shard6-core2:0
>> 
>> 
>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> Something interesting that I'm noticing as well, I just indexed 300,000
>>> items, and some how 300,020 ended up in the index.  I thought perhaps I
>>> messed something up so I started the indexing again and indexed another
>>> 400,000 and I see 400,064 docs.  Is there a good way to find possibile
>>> duplicates?  I had tried to facet on key (our id field) but that didn't
>>> give me anything with more than a count of 1.
>>> 
>>> 
>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> Ok, so clearing the transaction log allowed things to go again.  I am
>>>> going to clear the index and try to replicate the problem on 4.2.0 and then
>>>> I'll try on 4.2.1
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <ma...@gmail.com>wrote:
>>>> 
>>>>> No, not that I know if, which is why I say we need to get to the bottom
>>>>> of it.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> 
>>>>>> Mark
>>>>>> It's there a particular jira issue that you think may address this? I
>>>>> read
>>>>>> through it quickly but didn't see one that jumped out
>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
>>>>>> 
>>>>>>> I brought the bad one down and back up and it did nothing.  I can
>>>>> clear
>>>>>>> the index and try4.2.1. I will save off the logs and see if there is
>>>>>>> anything else odd
>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>> 
>>>>>>>> Any other exceptions would be useful. Might be best to start
>>>>> tracking in
>>>>>>>> a JIRA issue as well.
>>>>>>>> 
>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>> 
>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to
>>>>> the
>>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>>>>> (spreading
>>>>>>>> to mirrors now).
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything else
>>>>> that I
>>>>>>>>> should be looking for here and is this a bug?  I'd be happy to
>>>>> troll
>>>>>>>>> through the logs further if more information is needed, just let me
>>>>>>>> know.
>>>>>>>>> 
>>>>>>>>> Also what is the most appropriate mechanism to fix this.  Is it
>>>>>>>> required to
>>>>>>>>> kill the index that is out of sync and let solr resync things?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> sorry for spamming here....
>>>>>>>>>> 
>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>> 
>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>> 
>>>>>>>> 
>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>>>>>> :
>>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
>>>>> non
>>>>>>>> ok
>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>      at
>>>>>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>      at
>>>>>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>>>>> 
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>      at
>>>>>>>>>> 
>>>>>>>> 
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>      at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>> 
>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says
>>>>> we are
>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>      at
>>>>>>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>      at
>>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>      at
>>>>>>>>>>> 
>>>>>>>> 
>>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2003@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Looking at the master it looks like at some point there were
>>>>> shards
>>>>>>>> that
>>>>>>>>>>>> went down.  I am seeing things like what is below.
>>>>>>>>>>>> 
>>>>>>>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>>>>>>>> updating... (live
>>>>>>>>>>>> nodes size: 12)
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>>>>>>> process
>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>> shouldIBeLeader
>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the
>>>>> leader.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM
>>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>>> runLeaderProcess
>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>>> markrmiller@gmail.com
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't think the versions you are thinking of apply here.
>>>>> Peersync
>>>>>>>>>>>>> does not look at that - it looks at version numbers for
>>>>> updates in
>>>>>>>> the
>>>>>>>>>>>>> transaction log - it compares the last 100 of them on leader
>>>>> and
>>>>>>>> replica.
>>>>>>>>>>>>> What it's saying is that the replica seems to have versions
>>>>> that
>>>>>>>> the leader
>>>>>>>>>>>>> does not. Have you scanned the logs for any interesting
>>>>> exceptions?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk
>>>>> session
>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>>>>>>>> noticed a
>>>>>>>>>>>>>> strange issue while testing today.  Specifically the replica
>>>>> has a
>>>>>>>>>>>>> higher
>>>>>>>>>>>>>> version than the master which is causing the index to not
>>>>>>>> replicate.
>>>>>>>>>>>>>> Because of this the replica has fewer documents than the
>>>>> master.
>>>>>>>> What
>>>>>>>>>>>>>> could cause this and how can I resolve it short of taking
>>>>> down the
>>>>>>>>>>>>> index
>>>>>>>>>>>>>> and scping the right version in?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>>>>>>> Num Docs:164880
>>>>>>>>>>>>>> Max Doc:164880
>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>> Version:2387
>>>>>>>>>>>>>> Segment Count:23
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>> Num Docs:164773
>>>>>>>>>>>>>> Max Doc:164773
>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>> Version:3001
>>>>>>>>>>>>>> Segment Count:30
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> in the replicas log it says this:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>> 
>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>>>>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>>>>>> handleVersions
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> Received 100 versions from
>>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>>>>>> handleVersions
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> which again seems to point that it thinks it has a newer
>>>>> version of
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> index so it aborts.  This happened while having 10 threads
>>>>> indexing
>>>>>>>>>>>>> 10,000
>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any
>>>>> thoughts
>>>>>>>> on
>>>>>>>>>>>>> this
>>>>>>>>>>>>>> or what I should look for would be appreciated.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
No, my thought was wrong; it appears that even with the parameter set I am
still seeing this behavior.  I've been able to duplicate it on 4.2.0 by
indexing 100,000 documents on 10 threads (10,000 each); the problem shows up
once I get to 400,000 or so.  I will try this on 4.2.1 to see if I see the
same behavior.
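
For reference, the load I'm generating is roughly equivalent to the sketch
below, using SolrJ's CloudSolrServer - the zk host, collection name and the
"text" field are placeholders rather than our actual setup, and "key" is our
unique key field:

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexLoad {
  public static void main(String[] args) throws Exception {
    // placeholders: adjust the zk host and collection to match the cluster
    final CloudSolrServer server = new CloudSolrServer("localhost:2181");
    server.setDefaultCollection("collection1");

    // 10 threads, 10,000 docs each
    ExecutorService pool = Executors.newFixedThreadPool(10);
    for (int t = 0; t < 10; t++) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            for (int i = 0; i < 10000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("key", UUID.randomUUID().toString());
              doc.addField("text", "test document");
              server.add(doc);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    server.commit();
    server.shutdown();
  }
}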


On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <je...@gmail.com> wrote:

> Since I don't have that many items in my index I exported all of the keys
> for each shard and wrote a simple java program that checks for duplicates.
>  I found some duplicate keys on different shards, a grep of the files for
> the keys found does indicate that they made it to the wrong places.  If you
> notice documents with the same ID are on shard 3 and shard 5.  Is it
> possible that the hash is being calculated taking into account only the
> "live" nodes?  I know that we don't specify the numShards param @ startup
> so could this be what is happening?
>
> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> shard1-core1:0
> shard1-core2:0
> shard2-core1:0
> shard2-core2:0
> shard3-core1:1
> shard3-core2:1
> shard4-core1:0
> shard4-core2:0
> shard5-core1:1
> shard5-core2:1
> shard6-core1:0
> shard6-core2:0
>
>
> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com> wrote:
>
>> Something interesting that I'm noticing as well, I just indexed 300,000
>> items, and some how 300,020 ended up in the index.  I thought perhaps I
>> messed something up so I started the indexing again and indexed another
>> 400,000 and I see 400,064 docs.  Is there a good way to find possibile
>> duplicates?  I had tried to facet on key (our id field) but that didn't
>> give me anything with more than a count of 1.
>>
>>
>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com> wrote:
>>
>>> Ok, so clearing the transaction log allowed things to go again.  I am
>>> going to clear the index and try to replicate the problem on 4.2.0 and then
>>> I'll try on 4.2.1
>>>
>>>
>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <ma...@gmail.com>wrote:
>>>
>>>> No, not that I know if, which is why I say we need to get to the bottom
>>>> of it.
>>>>
>>>> - Mark
>>>>
>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>
>>>> > Mark
>>>> > It's there a particular jira issue that you think may address this? I
>>>> read
>>>> > through it quickly but didn't see one that jumped out
>>>> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
>>>> >
>>>> >> I brought the bad one down and back up and it did nothing.  I can
>>>> clear
>>>> >> the index and try4.2.1. I will save off the logs and see if there is
>>>> >> anything else odd
>>>> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
>>>> >>
>>>> >>> It would appear it's a bug given what you have said.
>>>> >>>
>>>> >>> Any other exceptions would be useful. Might be best to start
>>>> tracking in
>>>> >>> a JIRA issue as well.
>>>> >>>
>>>> >>> To fix, I'd bring the behind node down and back again.
>>>> >>>
>>>> >>> Unfortunately, I'm pressed for time, but we really need to get to
>>>> the
>>>> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>>>> (spreading
>>>> >>> to mirrors now).
>>>> >>>
>>>> >>> - Mark
>>>> >>>
>>>> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com>
>>>> wrote:
>>>> >>>
>>>> >>>> Sorry I didn't ask the obvious question.  Is there anything else
>>>> that I
>>>> >>>> should be looking for here and is this a bug?  I'd be happy to
>>>> troll
>>>> >>>> through the logs further if more information is needed, just let me
>>>> >>> know.
>>>> >>>>
>>>> >>>> Also what is the most appropriate mechanism to fix this.  Is it
>>>> >>> required to
>>>> >>>> kill the index that is out of sync and let solr resync things?
>>>> >>>>
>>>> >>>>
>>>> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>>> sorry for spamming here....
>>>> >>>>>
>>>> >>>>> shard5-core2 is the instance we're having issues with...
>>>> >>>>>
>>>> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>> >>>>> SEVERE: shard update error StdNode:
>>>> >>>>>
>>>> >>>
>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>>> >>> :
>>>> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
>>>> non
>>>> >>> ok
>>>> >>>>> status:503, message:Service Unavailable
>>>> >>>>>       at
>>>> >>>>>
>>>> >>>
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>> >>>>>       at
>>>> >>>>>
>>>> >>>
>>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>> >>>>>       at
>>>> >>>>>
>>>> >>>
>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>> >>>>>       at
>>>> >>>>>
>>>> >>>
>>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>> >>>>>       at
>>>> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>> >>>>>       at
>>>> >>>>>
>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>> >>>>>       at
>>>> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>> >>>>>       at
>>>> >>>>>
>>>> >>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>> >>>>>       at
>>>> >>>>>
>>>> >>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>> >>>>>       at java.lang.Thread.run(Thread.java:662)
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>>>> >>> wrote:
>>>> >>>>>
>>>> >>>>>> here is another one that looks interesting
>>>> >>>>>>
>>>> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says
>>>> we are
>>>> >>>>>> the leader, but locally we don't think so
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>> >>>>>>       at
>>>> >>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>> >>>>>>       at
>>>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>> >>>>>>       at
>>>> >>>>>>
>>>> >>>
>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2003@gmail.com
>>>> >
>>>> >>> wrote:
>>>> >>>>>>
>>>> >>>>>>> Looking at the master it looks like at some point there were
>>>> shards
>>>> >>> that
>>>> >>>>>>> went down.  I am seeing things like what is below.
>>>> >>>>>>>
>>>> >>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>>> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>>>> >>> updating... (live
>>>> >>>>>>> nodes size: 12)
>>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>>> org.apache.solr.common.cloud.ZkStateReader$3
>>>> >>>>>>> process
>>>> >>>>>>> INFO: Updating live nodes... (9)
>>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>> >>>>>>> runLeaderProcess
>>>> >>>>>>> INFO: Running the leader process.
>>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>> >>>>>>> shouldIBeLeader
>>>> >>>>>>> INFO: Checking if I should try and be the leader.
>>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>> >>>>>>> shouldIBeLeader
>>>> >>>>>>> INFO: My last published State was Active, it's okay to be the
>>>> leader.
>>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>> >>>>>>> runLeaderProcess
>>>> >>>>>>> INFO: I may be the new leader - try and sync
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>>> markrmiller@gmail.com
>>>> >>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> I don't think the versions you are thinking of apply here.
>>>> Peersync
>>>> >>>>>>>> does not look at that - it looks at version numbers for
>>>> updates in
>>>> >>> the
>>>> >>>>>>>> transaction log - it compares the last 100 of them on leader
>>>> and
>>>> >>> replica.
>>>> >>>>>>>> What it's saying is that the replica seems to have versions
>>>> that
>>>> >>> the leader
>>>> >>>>>>>> does not. Have you scanned the logs for any interesting
>>>> exceptions?
>>>> >>>>>>>>
>>>> >>>>>>>> Did the leader change during the heavy indexing? Did any zk
>>>> session
>>>> >>>>>>>> timeouts occur?
>>>> >>>>>>>>
>>>> >>>>>>>> - Mark
>>>> >>>>>>>>
>>>> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
>>>> >>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>>>> >>> noticed a
>>>> >>>>>>>>> strange issue while testing today.  Specifically the replica
>>>> has a
>>>> >>>>>>>> higher
>>>> >>>>>>>>> version than the master which is causing the index to not
>>>> >>> replicate.
>>>> >>>>>>>>> Because of this the replica has fewer documents than the
>>>> master.
>>>> >>> What
>>>> >>>>>>>>> could cause this and how can I resolve it short of taking
>>>> down the
>>>> >>>>>>>> index
>>>> >>>>>>>>> and scping the right version in?
>>>> >>>>>>>>>
>>>> >>>>>>>>> MASTER:
>>>> >>>>>>>>> Last Modified:about an hour ago
>>>> >>>>>>>>> Num Docs:164880
>>>> >>>>>>>>> Max Doc:164880
>>>> >>>>>>>>> Deleted Docs:0
>>>> >>>>>>>>> Version:2387
>>>> >>>>>>>>> Segment Count:23
>>>> >>>>>>>>>
>>>> >>>>>>>>> REPLICA:
>>>> >>>>>>>>> Last Modified: about an hour ago
>>>> >>>>>>>>> Num Docs:164773
>>>> >>>>>>>>> Max Doc:164773
>>>> >>>>>>>>> Deleted Docs:0
>>>> >>>>>>>>> Version:3001
>>>> >>>>>>>>> Segment Count:30
>>>> >>>>>>>>>
>>>> >>>>>>>>> in the replicas log it says this:
>>>> >>>>>>>>>
>>>> >>>>>>>>> INFO: Creating new http client,
>>>> >>>>>>>>>
>>>> >>>>>>>>
>>>> >>>
>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>> >>>>>>>>>
>>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>> >>>>>>>>>
>>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>> >>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>> >>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>> >>>>>>>>>
>>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>> >>> handleVersions
>>>> >>>>>>>>>
>>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>> >>>>>>>> http://10.38.33.17:7577/solr
>>>> >>>>>>>>> Received 100 versions from
>>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>> >>>>>>>>>
>>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>>> >>> handleVersions
>>>> >>>>>>>>>
>>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>> >>>>>>>> http://10.38.33.17:7577/solr  Our
>>>> >>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>> >>>>>>>>> otherHigh=1431233789440294912
>>>> >>>>>>>>>
>>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>> >>>>>>>>>
>>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>> >>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> which again seems to point that it thinks it has a newer
>>>> version of
>>>> >>>>>>>> the
>>>> >>>>>>>>> index so it aborts.  This happened while having 10 threads
>>>> indexing
>>>> >>>>>>>> 10,000
>>>> >>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any
>>>> thoughts
>>>> >>> on
>>>> >>>>>>>> this
>>>> >>>>>>>>> or what I should look for would be appreciated.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>
>>>> >>>
>>>>
>>>>
>>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Since I don't have that many items in my index, I exported all of the keys
for each shard and wrote a simple Java program that checks for duplicates.
I found some duplicate keys on different shards, and a grep of the files for
those keys confirms that they made it to the wrong places.  As you can see
below, documents with the same ID are on shard 3 and shard 5.  Is it
possible that the hash is being calculated taking into account only the
"live" nodes?  I know that we don't specify the numShards param @ startup,
so could this be what is happening?

grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0
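
In case it's useful, the duplicate check itself was nothing fancy - roughly
along these lines (it assumes a directory of exported key files named after
the cores, one key per line, with the directory passed as the first
argument):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FindCrossShardDuplicates {
  public static void main(String[] args) throws Exception {
    // key -> the set of shards it was found in (core1/core2 collapse to one shard)
    Map<String, Set<String>> keyToShards = new HashMap<String, Set<String>>();
    for (File f : new File(args[0]).listFiles()) {
      String shard = f.getName().split("-")[0];
      BufferedReader r = new BufferedReader(new FileReader(f));
      String key;
      while ((key = r.readLine()) != null) {
        key = key.trim();
        if (key.length() == 0) continue;
        Set<String> shards = keyToShards.get(key);
        if (shards == null) {
          shards = new HashSet<String>();
          keyToShards.put(key, shards);
        }
        shards.add(shard);
      }
      r.close();
    }
    // any key present on more than one shard should not be possible
    for (Map.Entry<String, Set<String>> e : keyToShards.entrySet()) {
      if (e.getValue().size() > 1) {
        System.out.println(e.getKey() + " -> " + e.getValue());
      }
    }
  }
}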


On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <je...@gmail.com> wrote:

> Something interesting that I'm noticing as well, I just indexed 300,000
> items, and some how 300,020 ended up in the index.  I thought perhaps I
> messed something up so I started the indexing again and indexed another
> 400,000 and I see 400,064 docs.  Is there a good way to find possibile
> duplicates?  I had tried to facet on key (our id field) but that didn't
> give me anything with more than a count of 1.
>
>
> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com> wrote:
>
>> Ok, so clearing the transaction log allowed things to go again.  I am
>> going to clear the index and try to replicate the problem on 4.2.0 and then
>> I'll try on 4.2.1
>>
>>
>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <ma...@gmail.com>wrote:
>>
>>> No, not that I know if, which is why I say we need to get to the bottom
>>> of it.
>>>
>>> - Mark
>>>
>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>
>>> > Mark
>>> > It's there a particular jira issue that you think may address this? I
>>> read
>>> > through it quickly but didn't see one that jumped out
>>> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
>>> >
>>> >> I brought the bad one down and back up and it did nothing.  I can
>>> clear
>>> >> the index and try4.2.1. I will save off the logs and see if there is
>>> >> anything else odd
>>> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
>>> >>
>>> >>> It would appear it's a bug given what you have said.
>>> >>>
>>> >>> Any other exceptions would be useful. Might be best to start
>>> tracking in
>>> >>> a JIRA issue as well.
>>> >>>
>>> >>> To fix, I'd bring the behind node down and back again.
>>> >>>
>>> >>> Unfortunately, I'm pressed for time, but we really need to get to the
>>> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>>> (spreading
>>> >>> to mirrors now).
>>> >>>
>>> >>> - Mark
>>> >>>
>>> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> >>>
>>> >>>> Sorry I didn't ask the obvious question.  Is there anything else
>>> that I
>>> >>>> should be looking for here and is this a bug?  I'd be happy to troll
>>> >>>> through the logs further if more information is needed, just let me
>>> >>> know.
>>> >>>>
>>> >>>> Also what is the most appropriate mechanism to fix this.  Is it
>>> >>> required to
>>> >>>> kill the index that is out of sync and let solr resync things?
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>>> sorry for spamming here....
>>> >>>>>
>>> >>>>> shard5-core2 is the instance we're having issues with...
>>> >>>>>
>>> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> >>>>> SEVERE: shard update error StdNode:
>>> >>>>>
>>> >>>
>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>> >>> :
>>> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
>>> non
>>> >>> ok
>>> >>>>> status:503, message:Service Unavailable
>>> >>>>>       at
>>> >>>>>
>>> >>>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>> >>>>>       at
>>> >>>>>
>>> >>>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> >>>>>       at
>>> >>>>>
>>> >>>
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>> >>>>>       at
>>> >>>>>
>>> >>>
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>> >>>>>       at
>>> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >>>>>       at
>>> >>>>>
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>> >>>>>       at
>>> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >>>>>       at
>>> >>>>>
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> >>>>>       at
>>> >>>>>
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> >>>>>       at java.lang.Thread.run(Thread.java:662)
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>>> >>> wrote:
>>> >>>>>
>>> >>>>>> here is another one that looks interesting
>>> >>>>>>
>>> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says
>>> we are
>>> >>>>>> the leader, but locally we don't think so
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>> >>>>>>       at
>>> >>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>> >>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>> >>>>>>       at
>>> >>>>>>
>>> >>>
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
>>> >>> wrote:
>>> >>>>>>
>>> >>>>>>> Looking at the master it looks like at some point there were
>>> shards
>>> >>> that
>>> >>>>>>> went down.  I am seeing things like what is below.
>>> >>>>>>>
>>> >>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>>> >>> updating... (live
>>> >>>>>>> nodes size: 12)
>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>> org.apache.solr.common.cloud.ZkStateReader$3
>>> >>>>>>> process
>>> >>>>>>> INFO: Updating live nodes... (9)
>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>> >>>>>>> runLeaderProcess
>>> >>>>>>> INFO: Running the leader process.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>> >>>>>>> shouldIBeLeader
>>> >>>>>>> INFO: Checking if I should try and be the leader.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>> >>>>>>> shouldIBeLeader
>>> >>>>>>> INFO: My last published State was Active, it's okay to be the
>>> leader.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM
>>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>>> >>>>>>> runLeaderProcess
>>> >>>>>>> INFO: I may be the new leader - try and sync
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>>> markrmiller@gmail.com
>>> >>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> I don't think the versions you are thinking of apply here.
>>> Peersync
>>> >>>>>>>> does not look at that - it looks at version numbers for updates
>>> in
>>> >>> the
>>> >>>>>>>> transaction log - it compares the last 100 of them on leader and
>>> >>> replica.
>>> >>>>>>>> What it's saying is that the replica seems to have versions that
>>> >>> the leader
>>> >>>>>>>> does not. Have you scanned the logs for any interesting
>>> exceptions?
>>> >>>>>>>>
>>> >>>>>>>> Did the leader change during the heavy indexing? Did any zk
>>> session
>>> >>>>>>>> timeouts occur?
>>> >>>>>>>>
>>> >>>>>>>> - Mark
>>> >>>>>>>>
>>> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
>>> >>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>>> >>> noticed a
>>> >>>>>>>>> strange issue while testing today.  Specifically the replica
>>> has a
>>> >>>>>>>> higher
>>> >>>>>>>>> version than the master which is causing the index to not
>>> >>> replicate.
>>> >>>>>>>>> Because of this the replica has fewer documents than the
>>> master.
>>> >>> What
>>> >>>>>>>>> could cause this and how can I resolve it short of taking down
>>> the
>>> >>>>>>>> index
>>> >>>>>>>>> and scping the right version in?
>>> >>>>>>>>>
>>> >>>>>>>>> MASTER:
>>> >>>>>>>>> Last Modified:about an hour ago
>>> >>>>>>>>> Num Docs:164880
>>> >>>>>>>>> Max Doc:164880
>>> >>>>>>>>> Deleted Docs:0
>>> >>>>>>>>> Version:2387
>>> >>>>>>>>> Segment Count:23
>>> >>>>>>>>>
>>> >>>>>>>>> REPLICA:
>>> >>>>>>>>> Last Modified: about an hour ago
>>> >>>>>>>>> Num Docs:164773
>>> >>>>>>>>> Max Doc:164773
>>> >>>>>>>>> Deleted Docs:0
>>> >>>>>>>>> Version:3001
>>> >>>>>>>>> Segment Count:30
>>> >>>>>>>>>
>>> >>>>>>>>> in the replicas log it says this:
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: Creating new http client,
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>
>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>> >>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>> >>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>> >>> handleVersions
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>> >>>>>>>> http://10.38.33.17:7577/solr
>>> >>>>>>>>> Received 100 versions from
>>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>> >>> handleVersions
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>> >>>>>>>> http://10.38.33.17:7577/solr  Our
>>> >>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>> >>>>>>>>> otherHigh=1431233789440294912
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>> >>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> which again seems to point that it thinks it has a newer
>>> version of
>>> >>>>>>>> the
>>> >>>>>>>>> index so it aborts.  This happened while having 10 threads
>>> indexing
>>> >>>>>>>> 10,000
>>> >>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any
>>> thoughts
>>> >>> on
>>> >>>>>>>> this
>>> >>>>>>>>> or what I should look for would be appreciated.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>
>>> >>>
>>>
>>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Something interesting that I'm noticing as well: I just indexed 300,000
items, and somehow 300,020 ended up in the index.  I thought perhaps I
messed something up, so I started the indexing again and indexed another
400,000, and I see 400,064 docs.  Is there a good way to find possible
duplicates?  I had tried to facet on key (our id field) but that didn't
give me anything with more than a count of 1.
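
For reference, I'd expect a facet query along these lines (with facet.limit
unbounded so nothing gets cut off, key being our uniqueKey field, and host
and collection as placeholders) to surface any key with a count above 1:

http://<host>:<port>/solr/<collection>/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1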


On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <je...@gmail.com> wrote:

> Ok, so clearing the transaction log allowed things to go again.  I am
> going to clear the index and try to replicate the problem on 4.2.0 and then
> I'll try on 4.2.1
>
>
> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <ma...@gmail.com> wrote:
>
>> No, not that I know if, which is why I say we need to get to the bottom
>> of it.
>>
>> - Mark
>>
>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>> > Mark
>> > It's there a particular jira issue that you think may address this? I
>> read
>> > through it quickly but didn't see one that jumped out
>> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
>> >
>> >> I brought the bad one down and back up and it did nothing.  I can clear
>> >> the index and try4.2.1. I will save off the logs and see if there is
>> >> anything else odd
>> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
>> >>
>> >>> It would appear it's a bug given what you have said.
>> >>>
>> >>> Any other exceptions would be useful. Might be best to start tracking
>> in
>> >>> a JIRA issue as well.
>> >>>
>> >>> To fix, I'd bring the behind node down and back again.
>> >>>
>> >>> Unfortunately, I'm pressed for time, but we really need to get to the
>> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>> (spreading
>> >>> to mirrors now).
>> >>>
>> >>> - Mark
>> >>>
>> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
>> >>>
>> >>>> Sorry I didn't ask the obvious question.  Is there anything else
>> that I
>> >>>> should be looking for here and is this a bug?  I'd be happy to troll
>> >>>> through the logs further if more information is needed, just let me
>> >>> know.
>> >>>>
>> >>>> Also what is the most appropriate mechanism to fix this.  Is it
>> >>> required to
>> >>>> kill the index that is out of sync and let solr resync things?
>> >>>>
>> >>>>
>> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>>> sorry for spamming here....
>> >>>>>
>> >>>>> shard5-core2 is the instance we're having issues with...
>> >>>>>
>> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> >>>>> SEVERE: shard update error StdNode:
>> >>>>>
>> >>>
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>> >>> :
>> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
>> non
>> >>> ok
>> >>>>> status:503, message:Service Unavailable
>> >>>>>       at
>> >>>>>
>> >>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>> >>>>>       at
>> >>>>>
>> >>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>>>>       at
>> >>>>>
>> >>>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>> >>>>>       at
>> >>>>>
>> >>>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>> >>>>>       at
>> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>       at
>> >>>>>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >>>>>       at
>> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>       at
>> >>>>>
>> >>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >>>>>       at
>> >>>>>
>> >>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >>>>>       at java.lang.Thread.run(Thread.java:662)
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>> >>> wrote:
>> >>>>>
>> >>>>>> here is another one that looks interesting
>> >>>>>>
>> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we
>> are
>> >>>>>> the leader, but locally we don't think so
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> >>>>>>       at
>> >>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>> >>>>>>       at
>> >>>>>>
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
>> >>> wrote:
>> >>>>>>
>> >>>>>>> Looking at the master it looks like at some point there were
>> shards
>> >>> that
>> >>>>>>> went down.  I am seeing things like what is below.
>> >>>>>>>
>> >>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>> >>> updating... (live
>> >>>>>>> nodes size: 12)
>> >>>>>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.common.cloud.ZkStateReader$3
>> >>>>>>> process
>> >>>>>>> INFO: Updating live nodes... (9)
>> >>>>>>> Apr 2, 2013 8:12:52 PM
>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>> runLeaderProcess
>> >>>>>>> INFO: Running the leader process.
>> >>>>>>> Apr 2, 2013 8:12:52 PM
>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>> shouldIBeLeader
>> >>>>>>> INFO: Checking if I should try and be the leader.
>> >>>>>>> Apr 2, 2013 8:12:52 PM
>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>> shouldIBeLeader
>> >>>>>>> INFO: My last published State was Active, it's okay to be the
>> leader.
>> >>>>>>> Apr 2, 2013 8:12:52 PM
>> >>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>>>>> runLeaderProcess
>> >>>>>>> INFO: I may be the new leader - try and sync
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>> markrmiller@gmail.com
>> >>>> wrote:
>> >>>>>>>
>> >>>>>>>> I don't think the versions you are thinking of apply here.
>> Peersync
>> >>>>>>>> does not look at that - it looks at version numbers for updates
>> in
>> >>> the
>> >>>>>>>> transaction log - it compares the last 100 of them on leader and
>> >>> replica.
>> >>>>>>>> What it's saying is that the replica seems to have versions that
>> >>> the leader
>> >>>>>>>> does not. Have you scanned the logs for any interesting
>> exceptions?
>> >>>>>>>>
>> >>>>>>>> Did the leader change during the heavy indexing? Did any zk
>> session
>> >>>>>>>> timeouts occur?
>> >>>>>>>>
>> >>>>>>>> - Mark
>> >>>>>>>>
>> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
>> >>> wrote:
>> >>>>>>>>
>> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>> >>> noticed a
>> >>>>>>>>> strange issue while testing today.  Specifically the replica
>> has a
>> >>>>>>>> higher
>> >>>>>>>>> version than the master which is causing the index to not
>> >>> replicate.
>> >>>>>>>>> Because of this the replica has fewer documents than the master.
>> >>> What
>> >>>>>>>>> could cause this and how can I resolve it short of taking down
>> the
>> >>>>>>>> index
>> >>>>>>>>> and scping the right version in?
>> >>>>>>>>>
>> >>>>>>>>> MASTER:
>> >>>>>>>>> Last Modified:about an hour ago
>> >>>>>>>>> Num Docs:164880
>> >>>>>>>>> Max Doc:164880
>> >>>>>>>>> Deleted Docs:0
>> >>>>>>>>> Version:2387
>> >>>>>>>>> Segment Count:23
>> >>>>>>>>>
>> >>>>>>>>> REPLICA:
>> >>>>>>>>> Last Modified: about an hour ago
>> >>>>>>>>> Num Docs:164773
>> >>>>>>>>> Max Doc:164773
>> >>>>>>>>> Deleted Docs:0
>> >>>>>>>>> Version:3001
>> >>>>>>>>> Segment Count:30
>> >>>>>>>>>
>> >>>>>>>>> in the replicas log it says this:
>> >>>>>>>>>
>> >>>>>>>>> INFO: Creating new http client,
>> >>>>>>>>>
>> >>>>>>>>
>> >>>
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>> >>>>>>>>>
>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >>>>>>>>>
>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>> >>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>> >>>>>>>>>
>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> >>> handleVersions
>> >>>>>>>>>
>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>> >>>>>>>> http://10.38.33.17:7577/solr
>> >>>>>>>>> Received 100 versions from
>> 10.38.33.16:7575/solr/dsc-shard5-core1/
>> >>>>>>>>>
>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> >>> handleVersions
>> >>>>>>>>>
>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>> >>>>>>>> http://10.38.33.17:7577/solr  Our
>> >>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>> >>>>>>>>> otherHigh=1431233789440294912
>> >>>>>>>>>
>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >>>>>>>>>
>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> which again seems to point that it thinks it has a newer
>> version of
>> >>>>>>>> the
>> >>>>>>>>> index so it aborts.  This happened while having 10 threads
>> indexing
>> >>>>>>>> 10,000
>> >>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any
>> thoughts
>> >>> on
>> >>>>>>>> this
>> >>>>>>>>> or what I should look for would be appreciated.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>
>> >>>
>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Ok, so clearing the transaction log allowed things to go again.  I am going
to clear the index and try to reproduce the problem on 4.2.0, and then I'll
try it on 4.2.1.
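
To double check that the leader and replica really line up after that, a
per-core count comparison along these lines should do it (just a SolrJ
sketch; the URLs are the two cores from the logs earlier in the thread, and
distrib=false keeps each query local to that core instead of fanning out):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CompareCoreCounts {
  public static void main(String[] args) throws Exception {
    String[] cores = {
        "http://10.38.33.16:7575/solr/dsc-shard5-core1",  // leader
        "http://10.38.33.17:7577/solr/dsc-shard5-core2"   // replica
    };
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                 // only the count is interesting
    q.set("distrib", "false");    // query this core only, no fan-out
    for (String url : cores) {
      HttpSolrServer server = new HttpSolrServer(url);
      long numFound = server.query(q).getResults().getNumFound();
      System.out.println(url + " numFound=" + numFound);
      server.shutdown();
    }
  }
}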


On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <ma...@gmail.com> wrote:

> No, not that I know if, which is why I say we need to get to the bottom of
> it.
>
> - Mark
>
> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > Mark
> > It's there a particular jira issue that you think may address this? I
> read
> > through it quickly but didn't see one that jumped out
> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
> >
> >> I brought the bad one down and back up and it did nothing.  I can clear
> >> the index and try4.2.1. I will save off the logs and see if there is
> >> anything else odd
> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
> >>
> >>> It would appear it's a bug given what you have said.
> >>>
> >>> Any other exceptions would be useful. Might be best to start tracking
> in
> >>> a JIRA issue as well.
> >>>
> >>> To fix, I'd bring the behind node down and back again.
> >>>
> >>> Unfortunately, I'm pressed for time, but we really need to get to the
> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1
> (spreading
> >>> to mirrors now).
> >>>
> >>> - Mark
> >>>
> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>>
> >>>> Sorry I didn't ask the obvious question.  Is there anything else that
> I
> >>>> should be looking for here and is this a bug?  I'd be happy to troll
> >>>> through the logs further if more information is needed, just let me
> >>> know.
> >>>>
> >>>> Also what is the most appropriate mechanism to fix this.  Is it
> >>> required to
> >>>> kill the index that is out of sync and let solr resync things?
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> sorry for spamming here....
> >>>>>
> >>>>> shard5-core2 is the instance we're having issues with...
> >>>>>
> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>> SEVERE: shard update error StdNode:
> >>>>>
> >>>
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
> >>> :
> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non
> >>> ok
> >>>>> status:503, message:Service Unavailable
> >>>>>       at
> >>>>>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>       at
> >>>>>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>       at
> >>>>>
> >>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>       at
> >>>>>
> >>>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>       at
> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>       at
> >>>>>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>       at
> >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>       at
> >>>>>
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>       at
> >>>>>
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>       at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> here is another one that looks interesting
> >>>>>>
> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we
> are
> >>>>>> the leader, but locally we don't think so
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>       at
> >>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>       at
> >>>>>>
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>>> Looking at the master it looks like at some point there were shards
> >>> that
> >>>>>>> went down.  I am seeing things like what is below.
> >>>>>>>
> >>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
> >>> updating... (live
> >>>>>>> nodes size: 12)
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
> >>>>>>> process
> >>>>>>> INFO: Updating live nodes... (9)
> >>>>>>> Apr 2, 2013 8:12:52 PM
> >>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> runLeaderProcess
> >>>>>>> INFO: Running the leader process.
> >>>>>>> Apr 2, 2013 8:12:52 PM
> >>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> shouldIBeLeader
> >>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>> Apr 2, 2013 8:12:52 PM
> >>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> shouldIBeLeader
> >>>>>>> INFO: My last published State was Active, it's okay to be the
> leader.
> >>>>>>> Apr 2, 2013 8:12:52 PM
> >>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> runLeaderProcess
> >>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com
> >>>> wrote:
> >>>>>>>
> >>>>>>>> I don't think the versions you are thinking of apply here.
> Peersync
> >>>>>>>> does not look at that - it looks at version numbers for updates in
> >>> the
> >>>>>>>> transaction log - it compares the last 100 of them on leader and
> >>> replica.
> >>>>>>>> What it's saying is that the replica seems to have versions that
> >>> the leader
> >>>>>>>> does not. Have you scanned the logs for any interesting
> exceptions?
> >>>>>>>>
> >>>>>>>> Did the leader change during the heavy indexing? Did any zk
> session
> >>>>>>>> timeouts occur?
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
> >>> noticed a
> >>>>>>>>> strange issue while testing today.  Specifically the replica has
> a
> >>>>>>>> higher
> >>>>>>>>> version than the master which is causing the index to not
> >>> replicate.
> >>>>>>>>> Because of this the replica has fewer documents than the master.
> >>> What
> >>>>>>>>> could cause this and how can I resolve it short of taking down
> the
> >>>>>>>> index
> >>>>>>>>> and scping the right version in?
> >>>>>>>>>
> >>>>>>>>> MASTER:
> >>>>>>>>> Last Modified:about an hour ago
> >>>>>>>>> Num Docs:164880
> >>>>>>>>> Max Doc:164880
> >>>>>>>>> Deleted Docs:0
> >>>>>>>>> Version:2387
> >>>>>>>>> Segment Count:23
> >>>>>>>>>
> >>>>>>>>> REPLICA:
> >>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>> Num Docs:164773
> >>>>>>>>> Max Doc:164773
> >>>>>>>>> Deleted Docs:0
> >>>>>>>>> Version:3001
> >>>>>>>>> Segment Count:30
> >>>>>>>>>
> >>>>>>>>> in the replicas log it says this:
> >>>>>>>>>
> >>>>>>>>> INFO: Creating new http client,
> >>>>>>>>>
> >>>>>>>>
> >>>
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
> >>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >>> handleVersions
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>> http://10.38.33.17:7577/solr
> >>>>>>>>> Received 100 versions from
> 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> >>> handleVersions
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>>>>> http://10.38.33.17:7577/solr  Our
> >>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
> >>>>>>>>> otherHigh=1431233789440294912
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> which again seems to point that it thinks it has a newer version
> of
> >>>>>>>> the
> >>>>>>>>> index so it aborts.  This happened while having 10 threads
> indexing
> >>>>>>>> 10,000
> >>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any
> thoughts
> >>> on
> >>>>>>>> this
> >>>>>>>>> or what I should look for would be appreciated.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
No, not that I know of, which is why I say we need to get to the bottom of it.

- Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson <je...@gmail.com> wrote:

> Mark
> It's there a particular jira issue that you think may address this? I read
> through it quickly but didn't see one that jumped out
> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:
> 
>> I brought the bad one down and back up and it did nothing.  I can clear
>> the index and try4.2.1. I will save off the logs and see if there is
>> anything else odd
>> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
>> 
>>> It would appear it's a bug given what you have said.
>>> 
>>> Any other exceptions would be useful. Might be best to start tracking in
>>> a JIRA issue as well.
>>> 
>>> To fix, I'd bring the behind node down and back again.
>>> 
>>> Unfortunately, I'm pressed for time, but we really need to get to the
>>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
>>> to mirrors now).
>>> 
>>> - Mark
>>> 
>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> Sorry I didn't ask the obvious question.  Is there anything else that I
>>>> should be looking for here and is this a bug?  I'd be happy to troll
>>>> through the logs further if more information is needed, just let me
>>> know.
>>>> 
>>>> Also what is the most appropriate mechanism to fix this.  Is it
>>> required to
>>>> kill the index that is out of sync and let solr resync things?
>>>> 
>>>> 
>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>> 
>>>>> sorry for spamming here....
>>>>> 
>>>>> shard5-core2 is the instance we're having issues with...
>>>>> 
>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>> SEVERE: shard update error StdNode:
>>>>> 
>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>>> :
>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non
>>> ok
>>>>> status:503, message:Service Unavailable
>>>>>       at
>>>>> 
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>       at
>>>>> 
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>       at
>>>>> 
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>       at
>>>>> 
>>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>       at
>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>       at
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>       at
>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>       at
>>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>       at
>>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>       at java.lang.Thread.run(Thread.java:662)
>>>>> 
>>>>> 
>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> here is another one that looks interesting
>>>>>> 
>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>>>>> the leader, but locally we don't think so
>>>>>>       at
>>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>       at
>>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>       at
>>>>>> 
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>>> Looking at the master it looks like at some point there were shards
>>> that
>>>>>>> went down.  I am seeing things like what is below.
>>>>>>> 
>>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>>> updating... (live
>>>>>>> nodes size: 12)
>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
>>>>>>> process
>>>>>>> INFO: Updating live nodes... (9)
>>>>>>> Apr 2, 2013 8:12:52 PM
>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> runLeaderProcess
>>>>>>> INFO: Running the leader process.
>>>>>>> Apr 2, 2013 8:12:52 PM
>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> shouldIBeLeader
>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>> Apr 2, 2013 8:12:52 PM
>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> shouldIBeLeader
>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>> Apr 2, 2013 8:12:52 PM
>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> runLeaderProcess
>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com
>>>> wrote:
>>>>>>> 
>>>>>>>> I don't think the versions you are thinking of apply here. Peersync
>>>>>>>> does not look at that - it looks at version numbers for updates in
>>> the
>>>>>>>> transaction log - it compares the last 100 of them on leader and
>>> replica.
>>>>>>>> What it's saying is that the replica seems to have versions that
>>> the leader
>>>>>>>> does not. Have you scanned the logs for any interesting exceptions?
>>>>>>>> 
>>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>>>>>> timeouts occur?
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
>>> wrote:
>>>>>>>> 
>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>>> noticed a
>>>>>>>>> strange issue while testing today.  Specifically the replica has a
>>>>>>>> higher
>>>>>>>>> version than the master which is causing the index to not
>>> replicate.
>>>>>>>>> Because of this the replica has fewer documents than the master.
>>> What
>>>>>>>>> could cause this and how can I resolve it short of taking down the
>>>>>>>> index
>>>>>>>>> and scping the right version in?
>>>>>>>>> 
>>>>>>>>> MASTER:
>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>> Num Docs:164880
>>>>>>>>> Max Doc:164880
>>>>>>>>> Deleted Docs:0
>>>>>>>>> Version:2387
>>>>>>>>> Segment Count:23
>>>>>>>>> 
>>>>>>>>> REPLICA:
>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>> Num Docs:164773
>>>>>>>>> Max Doc:164773
>>>>>>>>> Deleted Docs:0
>>>>>>>>> Version:3001
>>>>>>>>> Segment Count:30
>>>>>>>>> 
>>>>>>>>> in the replicas log it says this:
>>>>>>>>> 
>>>>>>>>> INFO: Creating new http client,
>>>>>>>>> 
>>>>>>>> 
>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>> 
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>> 
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>> 
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>> handleVersions
>>>>>>>>> 
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>> 
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>>> handleVersions
>>>>>>>>> 
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>> 
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>> 
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> which again seems to point that it thinks it has a newer version of
>>>>>>>> the
>>>>>>>>> index so it aborts.  This happened while having 10 threads indexing
>>>>>>>> 10,000
>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts
>>> on
>>>>>>>> this
>>>>>>>>> or what I should look for would be appreciated.
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Mark,
Is there a particular JIRA issue that you think may address this? I read
through it quickly but didn't see one that jumped out.
On Apr 2, 2013 10:07 PM, "Jamie Johnson" <je...@gmail.com> wrote:

> I brought the bad one down and back up and it did nothing.  I can clear
> the index and try4.2.1. I will save off the logs and see if there is
> anything else odd
> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
>
>> It would appear it's a bug given what you have said.
>>
>> Any other exceptions would be useful. Might be best to start tracking in
>> a JIRA issue as well.
>>
>> To fix, I'd bring the behind node down and back again.
>>
>> Unfortunately, I'm pressed for time, but we really need to get to the
>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
>> to mirrors now).
>>
>> - Mark
>>
>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>> > Sorry I didn't ask the obvious question.  Is there anything else that I
>> > should be looking for here and is this a bug?  I'd be happy to troll
>> > through the logs further if more information is needed, just let me
>> know.
>> >
>> > Also what is the most appropriate mechanism to fix this.  Is it
>> required to
>> > kill the index that is out of sync and let solr resync things?
>> >
>> >
>> > On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >
>> >> sorry for spamming here....
>> >>
>> >> shard5-core2 is the instance we're having issues with...
>> >>
>> >> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> >> SEVERE: shard update error StdNode:
>> >>
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>> :
>> >> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non
>> ok
>> >> status:503, message:Service Unavailable
>> >>        at
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>> >>        at
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>        at
>> >>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>> >>        at
>> >>
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>> >>        at
>> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>        at
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >>        at
>> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>        at
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >>        at
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >>        at java.lang.Thread.run(Thread.java:662)
>> >>
>> >>
>> >> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>
>> >>> here is another one that looks interesting
>> >>>
>> >>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> >>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>> >>> the leader, but locally we don't think so
>> >>>        at
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>> >>>        at
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>> >>>        at
>> >>>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>> >>>        at
>> >>>
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>> >>>        at
>> >>>
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> >>>        at
>> >>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>> >>>        at
>> >>>
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> >>>        at
>> >>>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >>>        at
>> >>>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>> >>>        at
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>> >>>        at
>> >>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>>
>> >>>> Looking at the master it looks like at some point there were shards
>> that
>> >>>> went down.  I am seeing things like what is below.
>> >>>>
>> >>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>> >>>> type:NodeChildrenChanged path:/live_nodes, has occurred -
>> updating... (live
>> >>>> nodes size: 12)
>> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
>> >>>> process
>> >>>> INFO: Updating live nodes... (9)
>> >>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>> runLeaderProcess
>> >>>> INFO: Running the leader process.
>> >>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>> shouldIBeLeader
>> >>>> INFO: Checking if I should try and be the leader.
>> >>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>> shouldIBeLeader
>> >>>> INFO: My last published State was Active, it's okay to be the leader.
>> >>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>>> runLeaderProcess
>> >>>> INFO: I may be the new leader - try and sync
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com
>> >wrote:
>> >>>>
>> >>>>> I don't think the versions you are thinking of apply here. Peersync
>> >>>>> does not look at that - it looks at version numbers for updates in
>> the
>> >>>>> transaction log - it compares the last 100 of them on leader and
>> replica.
>> >>>>> What it's saying is that the replica seems to have versions that
>> the leader
>> >>>>> does not. Have you scanned the logs for any interesting exceptions?
>> >>>>>
>> >>>>> Did the leader change during the heavy indexing? Did any zk session
>> >>>>> timeouts occur?
>> >>>>>
>> >>>>> - Mark
>> >>>>>
>> >>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>> noticed a
>> >>>>>> strange issue while testing today.  Specifically the replica has a
>> >>>>> higher
>> >>>>>> version than the master which is causing the index to not
>> replicate.
>> >>>>>> Because of this the replica has fewer documents than the master.
>>  What
>> >>>>>> could cause this and how can I resolve it short of taking down the
>> >>>>> index
>> >>>>>> and scping the right version in?
>> >>>>>>
>> >>>>>> MASTER:
>> >>>>>> Last Modified:about an hour ago
>> >>>>>> Num Docs:164880
>> >>>>>> Max Doc:164880
>> >>>>>> Deleted Docs:0
>> >>>>>> Version:2387
>> >>>>>> Segment Count:23
>> >>>>>>
>> >>>>>> REPLICA:
>> >>>>>> Last Modified: about an hour ago
>> >>>>>> Num Docs:164773
>> >>>>>> Max Doc:164773
>> >>>>>> Deleted Docs:0
>> >>>>>> Version:3001
>> >>>>>> Segment Count:30
>> >>>>>>
>> >>>>>> in the replicas log it says this:
>> >>>>>>
>> >>>>>> INFO: Creating new http client,
>> >>>>>>
>> >>>>>
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>> >>>>>>
>> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >>>>>>
>> >>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>> >>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>> >>>>>>
>> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> handleVersions
>> >>>>>>
>> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>> >>>>> http://10.38.33.17:7577/solr
>> >>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>> >>>>>>
>> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> handleVersions
>> >>>>>>
>> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>> >>>>> http://10.38.33.17:7577/solr  Our
>> >>>>>> versions are newer. ourLowThreshold=1431233788792274944
>> >>>>>> otherHigh=1431233789440294912
>> >>>>>>
>> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >>>>>>
>> >>>>>> INFO: PeerSync: core=dsc-shard5-core2
>> >>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>> >>>>>>
>> >>>>>>
>> >>>>>> which again seems to point that it thinks it has a newer version of
>> >>>>> the
>> >>>>>> index so it aborts.  This happened while having 10 threads indexing
>> >>>>> 10,000
>> >>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts
>> on
>> >>>>> this
>> >>>>>> or what I should look for would be appreciated.
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
Clearing out its tlogs before starting it again may help.
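
Roughly, with the node stopped first, that means wiping the core's
transaction log directory (by default <dataDir>/tlog -- the path below is
only an example) and then starting the node again so it syncs or replicates
from the leader.  A minimal sketch:

import java.io.File;

public class ClearTlog {
  public static void main(String[] args) {
    // The node must be stopped before doing this; otherwise the tlog is in use.
    File tlogDir = new File("/opt/solr/example/solr/dsc-shard5-core2/data/tlog");
    File[] logs = tlogDir.listFiles();
    if (logs != null) {
      for (File f : logs) {
        System.out.println("deleting " + f + " -> " + f.delete());
      }
    }
  }
}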

- Mark

On Apr 2, 2013, at 10:07 PM, Jamie Johnson <je...@gmail.com> wrote:

> I brought the bad one down and back up and it did nothing.  I can clear the
> index and try4.2.1. I will save off the logs and see if there is anything
> else odd
> On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:
> 
>> It would appear it's a bug given what you have said.
>> 
>> Any other exceptions would be useful. Might be best to start tracking in a
>> JIRA issue as well.
>> 
>> To fix, I'd bring the behind node down and back again.
>> 
>> Unfortunately, I'm pressed for time, but we really need to get to the
>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
>> to mirrors now).
>> 
>> - Mark
>> 
>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> Sorry I didn't ask the obvious question.  Is there anything else that I
>>> should be looking for here and is this a bug?  I'd be happy to troll
>>> through the logs further if more information is needed, just let me know.
>>> 
>>> Also what is the most appropriate mechanism to fix this.  Is it required
>> to
>>> kill the index that is out of sync and let solr resync things?
>>> 
>>> 
>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> sorry for spamming here....
>>>> 
>>>> shard5-core2 is the instance we're having issues with...
>>>> 
>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>> SEVERE: shard update error StdNode:
>>>> 
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
>> :
>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>>>> status:503, message:Service Unavailable
>>>>       at
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>       at
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>       at
>>>> 
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>       at
>>>> 
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>       at
>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>       at
>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>       at
>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>       at
>>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>       at
>>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>       at java.lang.Thread.run(Thread.java:662)
>>>> 
>>>> 
>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>> 
>>>>> here is another one that looks interesting
>>>>> 
>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>>>> the leader, but locally we don't think so
>>>>>       at
>>>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>       at
>>>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>       at
>>>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>       at
>>>>> 
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>       at
>>>>> 
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>       at
>>>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>       at
>>>>> 
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>       at
>>>>> 
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>       at
>>>>> 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>       at
>>>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>       at
>>>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
>> wrote:
>>>>> 
>>>>>> Looking at the master it looks like at some point there were shards
>> that
>>>>>> went down.  I am seeing things like what is below.
>>>>>> 
>>>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
>> (live
>>>>>> nodes size: 12)
>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
>>>>>> process
>>>>>> INFO: Updating live nodes... (9)
>>>>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>> runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>> shouldIBeLeader
>>>>>> INFO: Checking if I should try and be the leader.
>>>>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>> shouldIBeLeader
>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>> Apr 2, 2013 8:12:52 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>> runLeaderProcess
>>>>>> INFO: I may be the new leader - try and sync
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com
>>> wrote:
>>>>>> 
>>>>>>> I don't think the versions you are thinking of apply here. Peersync
>>>>>>> does not look at that - it looks at version numbers for updates in
>> the
>>>>>>> transaction log - it compares the last 100 of them on leader and
>> replica.
>>>>>>> What it's saying is that the replica seems to have versions that the
>> leader
>>>>>>> does not. Have you scanned the logs for any interesting exceptions?
>>>>>>> 
>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>>>>> timeouts occur?
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
>> noticed a
>>>>>>>> strange issue while testing today.  Specifically the replica has a
>>>>>>> higher
>>>>>>>> version than the master which is causing the index to not replicate.
>>>>>>>> Because of this the replica has fewer documents than the master.
>> What
>>>>>>>> could cause this and how can I resolve it short of taking down the
>>>>>>> index
>>>>>>>> and scping the right version in?
>>>>>>>> 
>>>>>>>> MASTER:
>>>>>>>> Last Modified:about an hour ago
>>>>>>>> Num Docs:164880
>>>>>>>> Max Doc:164880
>>>>>>>> Deleted Docs:0
>>>>>>>> Version:2387
>>>>>>>> Segment Count:23
>>>>>>>> 
>>>>>>>> REPLICA:
>>>>>>>> Last Modified: about an hour ago
>>>>>>>> Num Docs:164773
>>>>>>>> Max Doc:164773
>>>>>>>> Deleted Docs:0
>>>>>>>> Version:3001
>>>>>>>> Segment Count:30
>>>>>>>> 
>>>>>>>> in the replicas log it says this:
>>>>>>>> 
>>>>>>>> INFO: Creating new http client,
>>>>>>>> 
>>>>>>> 
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>> 
>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>> 
>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>> 
>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> handleVersions
>>>>>>>> 
>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>> http://10.38.33.17:7577/solr
>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>> 
>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
>> handleVersions
>>>>>>>> 
>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>>>> http://10.38.33.17:7577/solr  Our
>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>> otherHigh=1431233789440294912
>>>>>>>> 
>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>> 
>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>>>>>> 
>>>>>>>> 
>>>>>>>> which again seems to point that it thinks it has a newer version of
>>>>>>> the
>>>>>>>> index so it aborts.  This happened while having 10 threads indexing
>>>>>>> 10,000
>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts
>> on
>>>>>>> this
>>>>>>>> or what I should look for would be appreciated.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
I brought the bad one down and back up and it did nothing.  I can clear the
index and try 4.2.1. I will save off the logs and see if there is anything
else odd.
On Apr 2, 2013 9:13 PM, "Mark Miller" <ma...@gmail.com> wrote:

> It would appear it's a bug given what you have said.
>
> Any other exceptions would be useful. Might be best to start tracking in a
> JIRA issue as well.
>
> To fix, I'd bring the behind node down and back again.
>
> Unfortunately, I'm pressed for time, but we really need to get to the
> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
> to mirrors now).
>
> - Mark
>
> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > Sorry I didn't ask the obvious question.  Is there anything else that I
> > should be looking for here and is this a bug?  I'd be happy to troll
> > through the logs further if more information is needed, just let me know.
> >
> > Also what is the most appropriate mechanism to fix this.  Is it required
> to
> > kill the index that is out of sync and let solr resync things?
> >
> >
> > On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com> wrote:
> >
> >> sorry for spamming here....
> >>
> >> shard5-core2 is the instance we're having issues with...
> >>
> >> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >> SEVERE: shard update error StdNode:
> >>
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
> :
> >> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
> >> status:503, message:Service Unavailable
> >>        at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>        at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>        at
> >>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>        at
> >>
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>        at
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>        at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>        at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>        at java.lang.Thread.run(Thread.java:662)
> >>
> >>
> >> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>
> >>> here is another one that looks interesting
> >>>
> >>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
> >>> the leader, but locally we don't think so
> >>>        at
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>        at
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>        at
> >>>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>        at
> >>>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>        at
> >>>
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>        at
> >>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>        at
> >>>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>        at
> >>>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>        at
> >>>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>        at
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>        at
> >>>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>
> >>>
> >>>
> >>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com>
> wrote:
> >>>
> >>>> Looking at the master it looks like at some point there were shards
> that
> >>>> went down.  I am seeing things like what is below.
> >>>>
> >>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
> >>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
> (live
> >>>> nodes size: 12)
> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
> >>>> process
> >>>> INFO: Updating live nodes... (9)
> >>>> Apr 2, 2013 8:12:52 PM
> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>> runLeaderProcess
> >>>> INFO: Running the leader process.
> >>>> Apr 2, 2013 8:12:52 PM
> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>> shouldIBeLeader
> >>>> INFO: Checking if I should try and be the leader.
> >>>> Apr 2, 2013 8:12:52 PM
> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>> shouldIBeLeader
> >>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>> Apr 2, 2013 8:12:52 PM
> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>> runLeaderProcess
> >>>> INFO: I may be the new leader - try and sync
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com
> >wrote:
> >>>>
> >>>>> I don't think the versions you are thinking of apply here. Peersync
> >>>>> does not look at that - it looks at version numbers for updates in
> the
> >>>>> transaction log - it compares the last 100 of them on leader and
> replica.
> >>>>> What it's saying is that the replica seems to have versions that the
> leader
> >>>>> does not. Have you scanned the logs for any interesting exceptions?
> >>>>>
> >>>>> Did the leader change during the heavy indexing? Did any zk session
> >>>>> timeouts occur?
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
> >>>>>
> >>>>>> I am currently looking at moving our Solr cluster to 4.2 and
> noticed a
> >>>>>> strange issue while testing today.  Specifically the replica has a
> >>>>> higher
> >>>>>> version than the master which is causing the index to not replicate.
> >>>>>> Because of this the replica has fewer documents than the master.
>  What
> >>>>>> could cause this and how can I resolve it short of taking down the
> >>>>> index
> >>>>>> and scping the right version in?
> >>>>>>
> >>>>>> MASTER:
> >>>>>> Last Modified:about an hour ago
> >>>>>> Num Docs:164880
> >>>>>> Max Doc:164880
> >>>>>> Deleted Docs:0
> >>>>>> Version:2387
> >>>>>> Segment Count:23
> >>>>>>
> >>>>>> REPLICA:
> >>>>>> Last Modified: about an hour ago
> >>>>>> Num Docs:164773
> >>>>>> Max Doc:164773
> >>>>>> Deleted Docs:0
> >>>>>> Version:3001
> >>>>>> Segment Count:30
> >>>>>>
> >>>>>> in the replicas log it says this:
> >>>>>>
> >>>>>> INFO: Creating new http client,
> >>>>>>
> >>>>>
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
> >>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> handleVersions
> >>>>>>
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>> http://10.38.33.17:7577/solr
> >>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync
> handleVersions
> >>>>>>
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
> >>>>> http://10.38.33.17:7577/solr  Our
> >>>>>> versions are newer. ourLowThreshold=1431233788792274944
> >>>>>> otherHigh=1431233789440294912
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2
> >>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
> >>>>>>
> >>>>>>
> >>>>>> which again seems to point that it thinks it has a newer version of
> >>>>> the
> >>>>>> index so it aborts.  This happened while having 10 threads indexing
> >>>>> 10,000
> >>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts
> on
> >>>>> this
> >>>>>> or what I should look for would be appreciated.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
It would appear it's a bug given what you have said.

Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.

To fix, I'd bring the behind node down and back again.

Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).

- Mark

On Apr 2, 2013, at 7:21 PM, Jamie Johnson <je...@gmail.com> wrote:

> Sorry I didn't ask the obvious question.  Is there anything else that I
> should be looking for here and is this a bug?  I'd be happy to troll
> through the logs further if more information is needed, just let me know.
> 
> Also what is the most appropriate mechanism to fix this.  Is it required to
> kill the index that is out of sync and let solr resync things?
> 
> 
> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com> wrote:
> 
>> sorry for spamming here....
>> 
>> shard5-core2 is the instance we're having issues with...
>> 
>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> SEVERE: shard update error StdNode:
>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>> status:503, message:Service Unavailable
>>        at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>        at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>        at
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>        at
>> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>        at
>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>        at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>        at
>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:662)
>> 
>> 
>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com> wrote:
>> 
>>> here is another one that looks interesting
>>> 
>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>> the leader, but locally we don't think so
>>>        at
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>        at
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>        at
>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>        at
>>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>        at
>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>        at
>>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>        at
>>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>        at
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>        at
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>        at
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>        at
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>> 
>>> 
>>> 
>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> 
>>>> Looking at the master it looks like at some point there were shards that
>>>> went down.  I am seeing things like what is below.
>>>> 
>>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
>>>> nodes size: 12)
>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
>>>> process
>>>> INFO: Updating live nodes... (9)
>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>> runLeaderProcess
>>>> INFO: Running the leader process.
>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>> shouldIBeLeader
>>>> INFO: Checking if I should try and be the leader.
>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>> shouldIBeLeader
>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>> runLeaderProcess
>>>> INFO: I may be the new leader - try and sync
>>>> 
>>>> 
>>>> 
>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <ma...@gmail.com>wrote:
>>>> 
>>>>> I don't think the versions you are thinking of apply here. Peersync
>>>>> does not look at that - it looks at version numbers for updates in the
>>>>> transaction log - it compares the last 100 of them on leader and replica.
>>>>> What it's saying is that the replica seems to have versions that the leader
>>>>> does not. Have you scanned the logs for any interesting exceptions?
>>>>> 
>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>>> timeouts occur?
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> 
>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>>>>> strange issue while testing today.  Specifically the replica has a
>>>>> higher
>>>>>> version than the master which is causing the index to not replicate.
>>>>>> Because of this the replica has fewer documents than the master.  What
>>>>>> could cause this and how can I resolve it short of taking down the
>>>>> index
>>>>>> and scping the right version in?
>>>>>> 
>>>>>> MASTER:
>>>>>> Last Modified:about an hour ago
>>>>>> Num Docs:164880
>>>>>> Max Doc:164880
>>>>>> Deleted Docs:0
>>>>>> Version:2387
>>>>>> Segment Count:23
>>>>>> 
>>>>>> REPLICA:
>>>>>> Last Modified: about an hour ago
>>>>>> Num Docs:164773
>>>>>> Max Doc:164773
>>>>>> Deleted Docs:0
>>>>>> Version:3001
>>>>>> Segment Count:30
>>>>>> 
>>>>>> in the replicas log it says this:
>>>>>> 
>>>>>> INFO: Creating new http client,
>>>>>> 
>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>> 
>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>> 
>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>> url=http://10.38.33.17:7577/solrSTART replicas=[
>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>> 
>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>> 
>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>> http://10.38.33.17:7577/solr
>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>> 
>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>> 
>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=
>>>>> http://10.38.33.17:7577/solr  Our
>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>>>> otherHigh=1431233789440294912
>>>>>> 
>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>> 
>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>> url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>>>> 
>>>>>> 
>>>>>> which again seems to point that it thinks it has a newer version of
>>>>> the
>>>>>> index so it aborts.  This happened while having 10 threads indexing
>>>>> 10,000
>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts on
>>>>> this
>>>>>> or what I should look for would be appreciated.
>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Sorry, I didn't ask the obvious question.  Is there anything else that I
should be looking for here, and is this a bug?  I'd be happy to trawl
through the logs further if more information is needed, just let me know.

Also, what is the most appropriate mechanism to fix this?  Is it required to
kill the out-of-sync index and let Solr resync things?
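
(For what it's worth, here is a hedged SolrJ sketch of one way to nudge a
core back in line without scping index files around.  It assumes the core
admin REQUESTRECOVERY action is appropriate here; the host and core names
are just placeholders for the out-of-sync replica, and whether it actually
helps in this case depends on PeerSync not deciding again that the
replica's versions are newer.)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

public class RequestRecoverySketch {
  public static void main(String[] args) throws Exception {
    // Placeholder base URL and core name for the out-of-sync replica.
    HttpSolrServer server = new HttpSolrServer("http://10.38.33.17:7577/solr");

    // Ask the core to go through recovery (peer sync first, then full
    // replication from its shard leader if the sync fails).
    CoreAdminRequest.RequestRecovery recover = new CoreAdminRequest.RequestRecovery();
    recover.setAction(CoreAdminAction.REQUESTRECOVERY);
    recover.setCoreName("dsc-shard5-core2");
    server.request(recover);

    server.shutdown();
  }
}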


On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <je...@gmail.com> wrote:

> sorry for spamming here....
>
> shard5-core2 is the instance we're having issues with...
>
> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> SEVERE: shard update error StdNode:
> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
> status:503, message:Service Unavailable
>         at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>         at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>         at
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>         at
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
>
>
> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com> wrote:
>
>> here is another one that looks interesting
>>
>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>> the leader, but locally we don't think so
>>         at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>         at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>         at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>         at
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>         at
>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>         at
>> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>         at
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>         at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>         at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>         at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>         at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>
>>
>>
>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>>> Looking at the master it looks like at some point there were shards that
>>> went down.  I am seeing things like what is below.
>>>
>>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
>>> nodes size: 12)
>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
>>> process
>>> INFO: Updating live nodes... (9)
>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> runLeaderProcess
>>> INFO: Running the leader process.
>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> shouldIBeLeader
>>> INFO: Checking if I should try and be the leader.
>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> shouldIBeLeader
>>> INFO: My last published State was Active, it's okay to be the leader.
>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> runLeaderProcess
>>> INFO: I may be the new leader - try and sync
>>>
>>>
>>>
>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <ma...@gmail.com>wrote:
>>>
>>>> I don't think the versions you are thinking of apply here. Peersync
>>>> does not look at that - it looks at version numbers for updates in the
>>>> transaction log - it compares the last 100 of them on leader and replica.
>>>> What it's saying is that the replica seems to have versions that the leader
>>>> does not. Have you scanned the logs for any interesting exceptions?
>>>>
>>>> Did the leader change during the heavy indexing? Did any zk session
>>>> timeouts occur?
>>>>
>>>> - Mark
>>>>
>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>
>>>> > I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>>> > strange issue while testing today.  Specifically the replica has a
>>>> higher
>>>> > version than the master which is causing the index to not replicate.
>>>> > Because of this the replica has fewer documents than the master.  What
>>>> > could cause this and how can I resolve it short of taking down the
>>>> index
>>>> > and scping the right version in?
>>>> >
>>>> > MASTER:
>>>> > Last Modified:about an hour ago
>>>> > Num Docs:164880
>>>> > Max Doc:164880
>>>> > Deleted Docs:0
>>>> > Version:2387
>>>> > Segment Count:23
>>>> >
>>>> > REPLICA:
>>>> > Last Modified: about an hour ago
>>>> > Num Docs:164773
>>>> > Max Doc:164773
>>>> > Deleted Docs:0
>>>> > Version:3001
>>>> > Segment Count:30
>>>> >
>>>> > in the replicas log it says this:
>>>> >
>>>> > INFO: Creating new http client,
>>>> >
>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>> >
>>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>> >
>>>> > INFO: PeerSync: core=dsc-shard5-core2
>>>> > url=http://10.38.33.17:7577/solrSTART replicas=[
>>>> > http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>> >
>>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>> >
>>>> > INFO: PeerSync: core=dsc-shard5-core2 url=
>>>> http://10.38.33.17:7577/solr
>>>> > Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>> >
>>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>> >
>>>> > INFO: PeerSync: core=dsc-shard5-core2 url=
>>>> http://10.38.33.17:7577/solr  Our
>>>> > versions are newer. ourLowThreshold=1431233788792274944
>>>> > otherHigh=1431233789440294912
>>>> >
>>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>> >
>>>> > INFO: PeerSync: core=dsc-shard5-core2
>>>> > url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>>> >
>>>> >
>>>> > which again seems to point that it thinks it has a newer version of
>>>> the
>>>> > index so it aborts.  This happened while having 10 threads indexing
>>>> 10,000
>>>> > items writing to a 6 shard (1 replica each) cluster.  Any thoughts on
>>>> this
>>>> > or what I should look for would be appreciated.
>>>>
>>>>
>>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
sorry for spamming here....

shard5-core2 is the instance we're having issues with...

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode:
http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
status:503, message:Service Unavailable
        at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
        at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
        at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <je...@gmail.com> wrote:

> here is another one that looks interesting
>
> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
> leader, but locally we don't think so
>         at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>         at
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>         at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>         at
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>         at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>         at
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>         at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>         at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>         at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>         at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>
>
>
> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com> wrote:
>
>> Looking at the master it looks like at some point there were shards that
>> went down.  I am seeing things like what is below.
>>
>> NFO: A cluster state change: WatchedEvent state:SyncConnected
>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
>> nodes size: 12)
>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
>> process
>> INFO: Updating live nodes... (9)
>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> INFO: Running the leader process.
>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> shouldIBeLeader
>> INFO: Checking if I should try and be the leader.
>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> shouldIBeLeader
>> INFO: My last published State was Active, it's okay to be the leader.
>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> INFO: I may be the new leader - try and sync
>>
>>
>>
>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <ma...@gmail.com>wrote:
>>
>>> I don't think the versions you are thinking of apply here. Peersync does
>>> not look at that - it looks at version numbers for updates in the
>>> transaction log - it compares the last 100 of them on leader and replica.
>>> What it's saying is that the replica seems to have versions that the leader
>>> does not. Have you scanned the logs for any interesting exceptions?
>>>
>>> Did the leader change during the heavy indexing? Did any zk session
>>> timeouts occur?
>>>
>>> - Mark
>>>
>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>
>>> > I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>> > strange issue while testing today.  Specifically the replica has a
>>> higher
>>> > version than the master which is causing the index to not replicate.
>>> > Because of this the replica has fewer documents than the master.  What
>>> > could cause this and how can I resolve it short of taking down the
>>> index
>>> > and scping the right version in?
>>> >
>>> > MASTER:
>>> > Last Modified:about an hour ago
>>> > Num Docs:164880
>>> > Max Doc:164880
>>> > Deleted Docs:0
>>> > Version:2387
>>> > Segment Count:23
>>> >
>>> > REPLICA:
>>> > Last Modified: about an hour ago
>>> > Num Docs:164773
>>> > Max Doc:164773
>>> > Deleted Docs:0
>>> > Version:3001
>>> > Segment Count:30
>>> >
>>> > in the replicas log it says this:
>>> >
>>> > INFO: Creating new http client,
>>> >
>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>> >
>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >
>>> > INFO: PeerSync: core=dsc-shard5-core2
>>> > url=http://10.38.33.17:7577/solrSTART replicas=[
>>> > http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>> >
>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>> >
>>> > INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> > Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>> >
>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>> >
>>> > INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our
>>> > versions are newer. ourLowThreshold=1431233788792274944
>>> > otherHigh=1431233789440294912
>>> >
>>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >
>>> > INFO: PeerSync: core=dsc-shard5-core2
>>> > url=http://10.38.33.17:7577/solrDONE. sync succeeded
>>> >
>>> >
>>> > which again seems to point that it thinks it has a newer version of the
>>> > index so it aborts.  This happened while having 10 threads indexing
>>> 10,000
>>> > items writing to a 6 shard (1 replica each) cluster.  Any thoughts on
>>> this
>>> > or what I should look for would be appreciated.
>>>
>>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Here is another one that looks interesting:

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
leader, but locally we don't think so
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
        at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
        at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
        at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)



On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <je...@gmail.com> wrote:

> Looking at the master it looks like at some point there were shards that
> went down.  I am seeing things like what is below.
>
> NFO: A cluster state change: WatchedEvent state:SyncConnected
> type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
> nodes size: 12)
> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> INFO: Updating live nodes... (9)
> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> INFO: Running the leader process.
> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
> shouldIBeLeader
> INFO: Checking if I should try and be the leader.
> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
> shouldIBeLeader
> INFO: My last published State was Active, it's okay to be the leader.
> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> INFO: I may be the new leader - try and sync
>
>
>
> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> I don't think the versions you are thinking of apply here. Peersync does
>> not look at that - it looks at version numbers for updates in the
>> transaction log - it compares the last 100 of them on leader and replica.
>> What it's saying is that the replica seems to have versions that the leader
>> does not. Have you scanned the logs for any interesting exceptions?
>>
>> Did the leader change during the heavy indexing? Did any zk session
>> timeouts occur?
>>
>> - Mark
>>
>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
>>
>> > I am currently looking at moving our Solr cluster to 4.2 and noticed a
>> > strange issue while testing today.  Specifically the replica has a
>> higher
>> > version than the master which is causing the index to not replicate.
>> > Because of this the replica has fewer documents than the master.  What
>> > could cause this and how can I resolve it short of taking down the index
>> > and scping the right version in?
>> >
>> > MASTER:
>> > Last Modified:about an hour ago
>> > Num Docs:164880
>> > Max Doc:164880
>> > Deleted Docs:0
>> > Version:2387
>> > Segment Count:23
>> >
>> > REPLICA:
>> > Last Modified: about an hour ago
>> > Num Docs:164773
>> > Max Doc:164773
>> > Deleted Docs:0
>> > Version:3001
>> > Segment Count:30
>> >
>> > in the replicas log it says this:
>> >
>> > INFO: Creating new http client,
>> >
>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>> >
>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >
>> > INFO: PeerSync: core=dsc-shard5-core2
>> > url=http://10.38.33.17:7577/solrSTART replicas=[
>> > http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>> >
>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>> >
>> > INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>> > Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>> >
>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>> >
>> > INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our
>> > versions are newer. ourLowThreshold=1431233788792274944
>> > otherHigh=1431233789440294912
>> >
>> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >
>> > INFO: PeerSync: core=dsc-shard5-core2
>> > url=http://10.38.33.17:7577/solrDONE. sync succeeded
>> >
>> >
>> > which again seems to point that it thinks it has a newer version of the
>> > index so it aborts.  This happened while having 10 threads indexing
>> 10,000
>> > items writing to a 6 shard (1 replica each) cluster.  Any thoughts on
>> this
>> > or what I should look for would be appreciated.
>>
>>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Jamie Johnson <je...@gmail.com>.
Looking at the master's logs, it appears that at some point some shards
went down.  I am seeing entries like the ones below.

INFO: A cluster state change: WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
nodes size: 12)
Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes... (9)
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
shouldIBeLeader
INFO: Checking if I should try and be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: I may be the new leader - try and sync



On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <ma...@gmail.com> wrote:

> I don't think the versions you are thinking of apply here. Peersync does
> not look at that - it looks at version numbers for updates in the
> transaction log - it compares the last 100 of them on leader and replica.
> What it's saying is that the replica seems to have versions that the leader
> does not. Have you scanned the logs for any interesting exceptions?
>
> Did the leader change during the heavy indexing? Did any zk session
> timeouts occur?
>
> - Mark
>
> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:
>
> > I am currently looking at moving our Solr cluster to 4.2 and noticed a
> > strange issue while testing today.  Specifically the replica has a higher
> > version than the master which is causing the index to not replicate.
> > Because of this the replica has fewer documents than the master.  What
> > could cause this and how can I resolve it short of taking down the index
> > and scping the right version in?
> >
> > MASTER:
> > Last Modified:about an hour ago
> > Num Docs:164880
> > Max Doc:164880
> > Deleted Docs:0
> > Version:2387
> > Segment Count:23
> >
> > REPLICA:
> > Last Modified: about an hour ago
> > Num Docs:164773
> > Max Doc:164773
> > Deleted Docs:0
> > Version:3001
> > Segment Count:30
> >
> > in the replicas log it says this:
> >
> > INFO: Creating new http client,
> >
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >
> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >
> > INFO: PeerSync: core=dsc-shard5-core2
> > url=http://10.38.33.17:7577/solrSTART replicas=[
> > http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >
> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >
> > INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> > Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >
> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >
> > INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our
> > versions are newer. ourLowThreshold=1431233788792274944
> > otherHigh=1431233789440294912
> >
> > Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >
> > INFO: PeerSync: core=dsc-shard5-core2
> > url=http://10.38.33.17:7577/solrDONE. sync succeeded
> >
> >
> > which again seems to point that it thinks it has a newer version of the
> > index so it aborts.  This happened while having 10 threads indexing
> 10,000
> > items writing to a 6 shard (1 replica each) cluster.  Any thoughts on
> this
> > or what I should look for would be appreciated.
>
>

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

Posted by Mark Miller <ma...@gmail.com>.
I don't think the versions you are thinking of apply here. PeerSync does not look at that; it looks at the version numbers for updates in the transaction log and compares the last 100 of them on the leader and the replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
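
To make that concrete, here is a heavily simplified sketch of the idea
(my own illustration, not the actual org.apache.solr.update.PeerSync code,
which works with percentile thresholds over the recent-version window and
also handles deletes): each node reports the versions of its most recent
transaction-log updates, and the replica only asks its peer for updates
when the peer's window contains something newer than what the replica
already has.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Toy illustration of the PeerSync comparison described above.
 * NOT the real implementation; class and method names are made up.
 */
public class PeerSyncSketch {

  /**
   * Returns true when the peer's newest recent version is still older than
   * the oldest version in our own recent window, i.e. the peer has nothing
   * we are missing ("our versions are newer"), so no updates are requested.
   */
  static boolean ourVersionsAreNewer(List<Long> ourRecent, List<Long> otherRecent) {
    long ourLowThreshold = Collections.min(ourRecent);  // oldest of our recent updates
    long otherHigh = Collections.max(otherRecent);      // newest of the peer's recent updates
    return otherHigh < ourLowThreshold;
  }

  public static void main(String[] args) {
    // Hypothetical version numbers, just to show the shape of the check.
    List<Long> replicaVersions = Arrays.asList(105L, 106L, 107L, 108L);
    List<Long> leaderVersions  = Arrays.asList(101L, 102L, 103L, 104L);
    System.out.println(ourVersionsAreNewer(replicaVersions, leaderVersions)); // true
  }
}

When that check fires, the sync reports success without fetching anything,
which lines up with the "Our versions are newer ... sync succeeded" entries
in the replica's log even though its index has fewer documents.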

Did the leader change during the heavy indexing? Did any zk session timeouts occur?

- Mark

On Apr 2, 2013, at 4:52 PM, Jamie Johnson <je...@gmail.com> wrote:

> I am currently looking at moving our Solr cluster to 4.2 and noticed a
> strange issue while testing today.  Specifically the replica has a higher
> version than the master which is causing the index to not replicate.
> Because of this the replica has fewer documents than the master.  What
> could cause this and how can I resolve it short of taking down the index
> and scping the right version in?
> 
> MASTER:
> Last Modified:about an hour ago
> Num Docs:164880
> Max Doc:164880
> Deleted Docs:0
> Version:2387
> Segment Count:23
> 
> REPLICA:
> Last Modified: about an hour ago
> Num Docs:164773
> Max Doc:164773
> Deleted Docs:0
> Version:3001
> Segment Count:30
> 
> in the replicas log it says this:
> 
> INFO: Creating new http client,
> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> 
> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> 
> INFO: PeerSync: core=dsc-shard5-core2
> url=http://10.38.33.17:7577/solrSTART replicas=[
> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> 
> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> 
> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> 
> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> 
> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our
> versions are newer. ourLowThreshold=1431233788792274944
> otherHigh=1431233789440294912
> 
> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> 
> INFO: PeerSync: core=dsc-shard5-core2
> url=http://10.38.33.17:7577/solrDONE. sync succeeded
> 
> 
> which again seems to point that it thinks it has a newer version of the
> index so it aborts.  This happened while having 10 threads indexing 10,000
> items writing to a 6 shard (1 replica each) cluster.  Any thoughts on this
> or what I should look for would be appreciated.