Posted to solr-user@lucene.apache.org by Mads Tomasgård Bjørgan <mt...@dips.no> on 2016/06/30 08:52:00 UTC

Solr node crashes while indexing - Too many open files

Hello,
We're indexing a large set of files with Solr 6.1.0, running as a SolrCloud coordinated by ZooKeeper 3.4.8.

We have two ensembles, and each cluster runs on its own three VMs (CentOS 7). We first thought the error was caused by CDCR, since we were indexing a large number of documents that had to be replicated to the target cluster. However, we got the same error even after turning off CDCR, which indicates CDCR wasn't the problem after all.

After indexing between 20 000 and 35 000 documents to the source cluster, the file descriptor count reaches 4096 on one of the Solr nodes, and that node crashes. The count grows roughly linearly over time. The remaining two nodes in the cluster are not affected at all, and their logs contain nothing relevant. We found the following errors in the crashing node's log:

2016-06-30 08:23:12.459 ERROR (updateExecutor-2-thread-22-processing-https:////10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
                (...)
2016-06-30 08:23:12.460 ERROR (updateExecutor-2-thread-22-processing-https:////10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
                (...)
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update:
Too many open files
Too many open files
                (...)
2016-06-30 08:23:12.461 INFO  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  webapp=/solr path=/update params={version=2.2} status=-1 QTime=5
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update:
Too many open files
Too many open files
                (....)

2016-06-30 08:23:12.461 WARN  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1
2016-06-30 08:23:38.108 INFO  (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  webapp=/solr path=/select params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=https://10.0.106.115:443/solr/DIPS_shard1_replica1/&rows=10&version=2&q=*:*&NOW=1467275018057&isShard=true&wt=javabin&_=1467275017220} hits=30218 status=0 QTime=1

Running netstat -n -p on the VM that throws the exceptions reveals at least 1 800 TCP connections waiting to be closed (we didn't count them exactly; the netstat output filled the entire PuTTY window with roughly 2 000 lines):
tcp6      70      0 10.0.106.115:34531      10.0.106.114:443        CLOSE_WAIT  21658/java
We're running SolrCloud on port 443, and the IPs belong to the VMs. We also tried raising the ulimit for the machine to 100 000, without any result.
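
For reference, this is roughly how the descriptor usage can be inspected on the affected node; a quick sketch, where 21658 is the Java PID shown in the netstat output above:

# Count the sockets stuck in CLOSE_WAIT for the Solr process
netstat -n -p | grep "21658/java" | grep CLOSE_WAIT | wc -l

# Check the limits the running process actually has (not the shell's ulimit)
grep "open files" /proc/21658/limits

# Count the file descriptors the process currently holds
ls /proc/21658/fd | wc -l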

Greetings,
Mads

Re: Solr node crashes while indexing - Too many open files

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Mads Tomasgård Bjørgan <mt...@dips.no> wrote:

> That's true, but I was hoping there would be another way to solve this issue as it's not considered preferable in our situation.

What you are looking for might be
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-CompoundFileSegments
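
As a minimal sketch, that setting lives in the <indexConfig> section of solrconfig.xml, roughly like this (see the page above for the details in your Solr version):

<indexConfig>
  <!-- Pack each segment into a single compound (.cfs) file to reduce the number of open file handles -->
  <useCompoundFile>true</useCompoundFile>
</indexConfig>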

> Is it normal behavior for Solr to open over 4000 files without closing them properly?

Open, yes. Not closing them properly, no. The number of open file handles should match the number of files in the index folder.
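
A rough way to compare the two on the affected node (the index path assumes the default service install under /var/solr/data; adjust it to your setup):

# Number of files in the index directory of the core from the log above
ls /var/solr/data/DIPS_shard1_replica1/data/index | wc -l

# Handles currently listed for the Solr Java process (21658 is the PID from the netstat
# output in the original mail; the count includes sockets and memory-mapped files)
lsof -p 21658 | wc -l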

- Toke Eskildsen, State and University Library, Denmark

RE: Solr node crashes while indexing - Too many open files

Posted by Mads Tomasgård Bjørgan <mt...@dips.no>.
That's true, but I was hoping there would be another way to solve this issue as it's not considered preferable in our situation.

Is it normal behavior for Solr to open over 4000 files without closing them properly? Is it possible, for example, to adjust the autoCommit settings in solrconfig.xml to force Solr to close the files?
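
For reference, the autoCommit settings in question live in the <updateHandler> section of solrconfig.xml, roughly like this (values purely illustrative):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>        <!-- hard commit after this many documents -->
    <maxTime>15000</maxTime>        <!-- or after this many milliseconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>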

Any help is appreciated :-)

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Thursday, 30 June 2016 11:41
To: solr-user@lucene.apache.org
Subject: RE: Solr node crashes while indexing - Too many open files

Mads, some distributions require different steps for increasing max_open_files. Check how it works for CentOS specifically.

Markus

 
 

RE: Solr node crashes while indexing - Too many open files

Posted by Markus Jelsma <ma...@openindex.io>.
Mads, some distributions require different steps for increasing max_open_files. Check how it works for CentOS specifically.
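
On CentOS 7 that typically means two separate places; a minimal sketch, assuming Solr runs as a systemd-managed service named solr under the solr user (adjust names to your install), and keeping in mind that services started by systemd do not read /etc/security/limits.conf:

# Raise the per-user limit for login sessions (pam_limits)
cat >> /etc/security/limits.conf <<'EOF'
solr  soft  nofile  65536
solr  hard  nofile  65536
EOF

# Services started by systemd need the limit in the unit itself, e.g. via a drop-in
mkdir -p /etc/systemd/system/solr.service.d
cat > /etc/systemd/system/solr.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF

systemctl daemon-reload
systemctl restart solr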

Markus

 
 