Posted to solr-user@lucene.apache.org by "Hoggarth, Gil" <Gi...@bl.uk> on 2013/05/16 13:23:07 UTC

Solr 4.3.0: Shard instances using incorrect data directory on machine boot

Hi all, I hope you can advise a solution to our incorrect data directory
issue.

 

We have 2 physical servers using Solr 4.3.0, each with 24 separate
tomcat instances (RedHat 6.4, java 1.7.0_10-b18, tomcat 7.0.34) with a
solr shard in each. This configuration means that each shard has its own
data directory declared. (Server OS, tomcat and solr, including shards,
created via automated builds.) 

 

That is, for example,

- tomcat instance, /var/local/tomcat/solrshard3/, port 8985

- corresponding solr instance, /usr/local/solrshard3/, with
/usr/local/solrshard3/collection1/conf/solrconfig.xml

- corresponding solr data directory,
/var/local/solrshard3/collection1/data/
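Each shard's data directory is declared in its own solrconfig.xml; for the example layout above, the declaration would look something like this (the path is illustrative):

```xml
<!-- In /usr/local/solrshard3/collection1/conf/solrconfig.xml -->
<dataDir>/var/local/solrshard3/collection1/data</dataDir>
```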

 

We process ~1.5 billion documents, which is why we use 48 shards (24
leaders, 24 replicas). These physical servers are rebooted regularly to
fsck their drives. When rebooted, we always see several (~10-20) shards
failing to start (the UI cloud view shows them as 'Down' or
'Recovering', though they never recover without intervention). There is
no pattern to which shards fail to start - we haven't recorded any that
always or never fail. On inspection, the UI dashboard for these failed
shards displays, for example:

- Host                    Server1

- Instance            /usr/local/solrshard3/collection1

- Data                    /var/local/solrshard6/collection1/data

- Index                  /var/local/solrshard6/collection1/data/index
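One way to rule out a copy/paste slip in the automated builds is a quick script that compares each shard's declared dataDir against the path it should have. Everything below (paths, naming, shard count) follows the convention described above and is only a sketch:

```python
# Sanity check for the automated builds: verify that each shard's
# solrconfig.xml declares its OWN data directory, not another shard's.
# Paths follow the /usr/local/solrshardN -> /var/local/solrshardN naming
# convention above; shard numbers and paths are illustrative.
import re

def expected_datadir(shard: int) -> str:
    return f"/var/local/solrshard{shard}/collection1/data"

def datadir_matches(shard: int, solrconfig_text: str) -> bool:
    """True if the <dataDir> in this solrconfig matches the shard's own path."""
    m = re.search(r"<dataDir>\s*(.*?)\s*</dataDir>", solrconfig_text)
    return m is not None and m.group(1) == expected_datadir(shard)

# In practice each instance's config would be read from disk, e.g.:
#   for shard in range(1, 25):
#       with open(f"/usr/local/solrshard{shard}/collection1/conf/solrconfig.xml") as f:
#           print(shard, datadir_matches(shard, f.read()))
```

A shard 3 config pointing at shard 6's data directory, as in the dashboard output above, would fail this check.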

 

To fix such failed shards, I manually restart the shard leader and its
replicas, which resolves the issue. Of course, I would like to know a
permanent cure for this, not just a remedy.

 

We use a separate ZooKeeper service, spread across 3 virtual machines
within our private network of ~200 servers (physical and virtual).
Network traffic is constant but relatively light across 1Gb bandwidth.

 

Any advice or suggestions greatly appreciated.

Gil

 

Gil Hoggarth

Web Archiving Engineer

The British Library, Boston Spa, West Yorkshire, LS23 7BQ

 


RE: Solr 4.3.0: Shard instances using incorrect data directory on machine boot

Posted by "Hoggarth, Gil" <Gi...@bl.uk>.
Thanks for your response Shawn, very much appreciated.
Gil




RE: Solr 4.3.0: Shard instances using incorrect data directory on machine boot

Posted by Shawn Heisey <so...@elyograg.org>.
> The dataDir is set in each solrconfig.xml; each one has been checked to
> ensure it points to its corresponding location. The error we see is that
> on machine reboot not all of the shards start successfully, and if the
> failed shard was a leader, the replicas can't take its place (presumably
> because the leader's incorrect data directory is inconsistent with their
> own).

Although you can set the dataDir in solrconfig.xml, I would strongly
recommend that you don't.

If you are using the old-style solr.xml (which has cores and core tags)
then set the dataDir in each core tag in solr.xml. This gets read and set
before the core is created, so there's less chance of it getting
scrambled. The solrconfig is read as part of core creation.

If you are using the new style solr.xml (new with 4.3.0) then you'll need
absolute dataDir paths, and they need to go in each core.properties file.
Due to a bug, relative paths won't work as expected. I need to see if I
can make sure the fix makes it into 4.3.1.

If moving dataDir out of solrconfig.xml fixes it, then we probably have a
bug.
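For illustration, the placement described above for the old-style solr.xml would look roughly like this (core name and path are examples only):

```xml
<!-- Old-style solr.xml: dataDir set as an attribute on each core tag -->
<cores adminPath="/admin/cores">
  <core name="collection1"
        instanceDir="collection1"
        dataDir="/var/local/solrshard3/collection1/data"/>
</cores>
```

For the new-style, discovery-based layout, the equivalent is a core.properties file in the core's directory containing an absolute `dataDir=/var/local/solrshard3/collection1/data` line.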

Your ZooKeeper problems might be helped by increasing zkClientTimeout.
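In the old-style solr.xml, zkClientTimeout is an attribute on the cores tag; a sketch of raising it (the 30-second value is only an example, and the property fallback syntax mirrors the stock file):

```xml
<!-- solr.xml: raise the ZooKeeper session timeout (value illustrative) -->
<cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:30000}">
  <core name="collection1" instanceDir="collection1"/>
</cores>
```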

Thanks,
Shawn



RE: Solr 4.3.0: Shard instances using incorrect data directory on machine boot

Posted by "Hoggarth, Gil" <Gi...@bl.uk>.
Thanks for your reply Daniel.

The dataDir is set in each solrconfig.xml; each one has been checked to
ensure it points to its corresponding location. The error we see is that
on machine reboot not all of the shards start successfully, and if the
failed shard was a leader, the replicas can't take its place (presumably
because the leader's incorrect data directory is inconsistent with their
own).

More detail that I can add is that the catalina.out log for failed
shards reports:
May 15, 2013 5:56:02 PM org.apache.catalina.loader.WebappClassLoader
checkThreadLocalMapForLeaks
SEVERE: The web application [/solr] created a ThreadLocal with key of
type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value
[org.apache.solr.schema.DateField$ThreadLocalDateFormat@524e13f6]) and a
value of type
[org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat] (value
[org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a])
but failed to remove it when the web application was stopped. Threads
are going to be renewed over time to try and avoid a probable memory
leak.

This doesn't (to me) relate to the problem, but that doesn't
necessarily mean it isn't related. It's also the only SEVERE entry
reported, and it appears only in the failed shards' catalina.out logs.

Checking the zookeeper logs, we're seeing:
2013-05-16 13:25:46,839 [myid:1] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my id = 1, error =
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
2013-05-16 13:25:46,841 [myid:1] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
2013-05-16 13:25:46,842 [myid:1] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2013-05-16 13:25:46,843 [myid:1] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@688] - Send worker leaving thread

This is, I think, a separate issue, in that it happens immediately after
I restart a ZooKeeper node. (I.e., I see this in one log, restart that
ZooKeeper, and immediately see a similar message in one of the other two
ZooKeeper logs.)



Re: Solr 4.3.0: Shard instances using incorrect data directory on machine boot

Posted by Daniel Collins <da...@gmail.com>.
What actual error do you see in Solr? Is there an exception and if so, can
you post that? As I understand it, dataDir is set from the solrconfig.xml
file, so either your instances are picking up the "wrong" file, or you have
some override which is incorrect. Where do you set solr.data.dir: in the
environment when you start Solr, or in solrconfig?
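(For context: the stock solrconfig.xml typically wires dataDir to that property, so it can be overridden per instance at startup. The path in the flag below is only an example.)

```xml
<!-- Stock solrconfig.xml wiring: dataDir falls back to the property if set -->
<dataDir>${solr.data.dir:}</dataDir>
```

An instance can then override it with e.g. `-Dsolr.data.dir=/var/local/solrshard3/collection1/data` in its tomcat startup options.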

