Posted to solr-user@lucene.apache.org by "Kommu, Vinodh K." <vk...@dtcc.com> on 2020/07/02 11:16:21 UTC

Time-out errors while indexing (Solr 7.7.1)

Hi,

We are performing QA performance testing on a couple of collections which hold 2 billion and 3.5 billion docs respectively. Indexing happens from a separate client using SolrJ, with 10 threads and a batch size of 1000. For the last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing. As part of troubleshooting, we noticed that when peak disk IO utilization is on the higher side, indexing happens slowly, and when disk IO is constantly near 100%, timeout issues are observed.
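
For reference, below is a simplified sketch of how the indexing client is structured (the data-source calls and names are illustrative placeholders, not our production code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexerSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/collection; the real client points at our SolrCloud cluster.
        SolrClient client = new HttpSolrClient.Builder("https://localhost:1122/solr/TestCollection")
                .withConnectionTimeout(15000)
                .withSocketTimeout(120000)   // the 2-minute client timeout mentioned below
                .build();

        ExecutorService pool = Executors.newFixedThreadPool(10);   // 10 indexing threads
        for (int t = 0; t < 10; t++) {
            pool.submit(() -> {
                try {
                    List<SolrInputDocument> batch = new ArrayList<>(1000);
                    while (moreRecords()) {                  // hypothetical data source
                        batch.add(nextDocument());           // hypothetical record-to-doc conversion
                        if (batch.size() == 1000) {          // batch size 1000
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit();
        client.close();
    }

    private static boolean moreRecords() { return false; }                               // stub
    private static SolrInputDocument nextDocument() { return new SolrInputDocument(); }  // stub
}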

A few questions here:


  1.  Our performance team noticed that read operations are far higher than write operations, around a 100:1 ratio. Is this expected during indexing, or are the Solr nodes doing other operations like syncing?
  2.  ZooKeeper latency is around (min/avg/max: 0/0/2205), as reported by the ZooKeeper stats output (see the example after this list). Can this latency create instability in the ZK or Solr clusters, or impact indexing or searching operations?
  3.  Our client timeout is set to 2 mins; can it be increased further? Would that help or create any other problems?
  4.  When we created an empty collection and loaded the same data file, it loaded fine without any issues, so does having more documents in a collection create such problems?
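
(The latency numbers in question 2 are in the format ZooKeeper's stats command reports; for reference they can be pulled like this - host and port are placeholders, and on ZooKeeper 3.5+ the four-letter commands must be whitelisted first:)

echo srvr | nc zk-host 2181 | grep Latency
# e.g.  Latency min/avg/max: 0/0/2205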

Any suggestions or feedback would be really appreciated.

Solr version - 7.7.1

Time out error snippet:

ERROR (updateExecutor-3-thread-30055-processing-x:TestCollection_shard5_replica_n18 https:////localhost:1122//solr//TestCollection_shard6_replica_n22 r:core_node21 n:localhost:1122_solr c:TestCollection s:shard5) [c:TestCollection s:shard5 r:core_node21 x:TestCollection_shard5_replica_n18] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_212]
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_212]
        at sun.security.ssl.InputRecord.read(InputRecord.java:503) ~[?:1.8.0_212]
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) ~[?:1.8.0_212]
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) ~[?:1.8.0_212]
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[?:1.8.0_212]
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120) ~[solr-core-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:07]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:349) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:183) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.6.jar:3.2.6]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.dt_access$303(ExecutorUtil.java) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
2020-07-01 21:02:58.301 ERROR (updateExecutor-3-thread-30033-processing-x:TestCollection_shard5_replica_n18 https:////localhost:1122//solr//TestCollection_shard5_replica_n6 r:core_node21 n:localhost:1122_solr c:TestCollection s:shard5) [c:TestCollection s:shard5 r:core_node21 x:TestCollection_shard5_replica_n18] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://localhost:1122/solr/TestCollection_shard5_replica_n6: Server Error
request: https://localhost:1122/solr/TestCollection_shard5_replica_n6/update?update.chain=uuid&update.distrib=FROMLEADER&distrib.from=https%3A%2F%2Flocalhost%3A1122%2Fsolr%2FTestCollection_shard5_replica_n18%2F&wt=javabin&version=2
Remote error message: java.util.concurrent.TimeoutException: Idle timeout expired: 600000/600000 ms
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:385) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:183) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.6.jar:3.2.6]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.dt_access$303(ExecutorUtil.java) ~[solr-solrj-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 2019-02-23 02:39:09]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
2020-07-01 21:02:58.322 ERROR (updateExecutor-3-thread-30027-processing-x:TestCollection_shard5_replica_n18 https:////localhost:1122//solr//TestCollection_shard3_replica_n14 r:core_node21 n:localhost:1122_solr c:TestCollection s:shard5) [c:TestCollection s:shard5 r:core_node21 x:TestCollection_shard5_replica_n18] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://localhost:1122/solr/TestCollection_shard3_replica_n14: Server Error


Thanks & Regards,
Vinodh

DTCC DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify us immediately and delete the email and any attachments from your system. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.

RE: Time-out errors while indexing (Solr 7.7.1)

Posted by "Kommu, Vinodh K." <vk...@dtcc.com>.
Does anyone have any thoughts or suggestions on this issue?

Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Thursday, July 2, 2020 4:46 PM
To: solr-user@lucene.apache.org
Subject: Time-out errors while indexing (Solr 7.7.1)


Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Mad have <ma...@gmail.com>.
Thanks a lot for your inputs and suggestions. I was thinking along similar lines: creating another collection of the same type (hot and cold), and moving documents older than a certain age, say 180 days, from the original collection (hot) to the new collection (cold).

Thanks,
Madhava

Sent from my iPhone

> On 4 Jul 2020, at 14:37, Erick Erickson <er...@gmail.com> wrote:
> 
> You need more shards. And, I’m pretty certain, more hardware.
> 
> You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute upper limit of 2B (2^31) docs per shard. I don’t quite know how you’re running at all unless that 13B is a round number. If you keep adding documents, your installation will shortly, at best, stop accepting new documents for indexing. At worst you’ll start seeing weird errors and possibly corrupt indexes and have to re-index everything from scratch.
> 
> You’ve backed yourself into a pretty tight corner here. You either have to re-index to a properly-sized cluster or use SPLITSHARD. This latter will double the index-on-disk size (it creates two child indexes per replica and keeps the old one for safety’s sake, which you have to clean up later). I strongly recommend you stop ingesting more data while you do this.
> 
> You say you have 6 VMs with 2 nodes running on each. If those VMs are co-located with anything else, the physical hardware is going to be stressed. VMs themselves aren’t bad, but somewhere there’s physical hardware that runs it…
> 
> In fact, I urge you to stop ingesting data immediately and address this issue. You have a cluster that’s mis-configured, and you must address that before Bad Things Happen.
> 
> Best,
> Erick
> 
>> On Jul 4, 2020, at 5:09 AM, Mad have <ma...@gmail.com> wrote:
>> 
>> Hi Eric,
>> 
>> There are 6 VMs in total in the Solr cluster, and 2 nodes are running on each VM. The total number of shards is 6, with 3 replicas. I can see the index size is more than 220GB on each node for the collection where we are facing the performance issue.
>> 
>> The more documents we add to the collection, the slower indexing becomes, and I also have the impression that the size of the collection is causing this issue. I would appreciate it if you could suggest a solution for this.
>> 
>> 
>> Regards,
>> Madhava 
>> Sent from my iPhone
>> 
>>>> On 3 Jul 2020, at 23:30, Erick Erickson <er...@gmail.com> wrote:
>>> 
>>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that’s_ a red flag.
>>> 
>>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson <er...@gmail.com> wrote:
>>>> 
>>>> You haven’t said how many _shards_ are present. Nor how many replicas of the collection you’re hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that’s a red flag.
>>>> 
>>>> Short form, though, is yes. Subject to the questions above, this is what I’d be looking at first.
>>>> 
>>>> And, as I said, if you’ve been steadily increasing the total number of documents, you’ll reach a tipping point sometime.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>>> On Jul 3, 2020, at 5:32 PM, Mad have <ma...@gmail.com> wrote:
>>>>> 
>>>>> Hi Eric,
>>>>> 
>>>>> The collection has almost 13 billion documents, each around 5kb in size, and all of the roughly 150 columns are indexed. Do you think the number of documents in the collection is causing this issue? Appreciate your response.
>>>>> 
>>>>> Regards,
>>>>> Madhava 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 3 Jul 2020, at 12:42, Erick Erickson <er...@gmail.com> wrote:
>>>>>> 
>>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>>> just have too much data on too little hardware. Check your
>>>>>> swapping, how much of your I/O is just because Lucene can’t
>>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>>> uses MMapDirectory to hold the index and you may well be
>>>>>> swapping, see:
>>>>>> 
>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
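
(Illustrative commands for the swap/IO check described above; these are standard Linux tools, and the intervals are arbitrary:)

free -h          # swap in use vs. memory left over for the OS page cache
vmstat 5 3       # si/so columns show pages swapped in/out per second
iostat -x 5      # per-device %util; sustained ~100% matches the symptom reported above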
>>>>>> 
>>>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>>>> 
>>>>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”
>>>>>> 
>>>>>> So have you been continually adding more documents to your
>>>>>> collections for more than the 2-3 weeks? If so you may have just
>>>>>> put so much data on the same boxes that you’ve gone over
>>>>>> the capacity of your hardware. As Toke says, adding physical
>>>>>> memory for the OS to use to hold relevant parts of the index may
>>>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>>>> 
>>>>>> All that said, if you’re going to keep adding documents you need to
>>>>>> seriously think about adding new machines and moving some of
>>>>>> your replicas to them.
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
>>>>>>> 
>>>>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>>>>> We are performing QA performance testing on couple of collections
>>>>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>>>>> 
>>>>>>> How many shards?
>>>>>>> 
>>>>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>>>>> 
>>>>>>> Are you saying that there are 100 times more read operations when you
>>>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>>>> might be filled with the data that the writers are flushing.
>>>>>>> 
>>>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>>>> but such massive difference in IO-utilization does indicate that you
>>>>>>> are starved for cache.
>>>>>>> 
>>>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>>>>> check: How many replicas are each physical box handling? If they are
>>>>>>> sharing resources, fewer replicas would probably be better.
>>>>>>> 
>>>>>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>>>>>> more? Would that help or create any other problems?
>>>>>>> 
>>>>>>> It does not hurt the server to increase the client timeout as the
>>>>>>> initiated query will keep running until it is finished, independent of
>>>>>>> whether or not there is a client to receive the result.
>>>>>>> 
>>>>>>> If you want a better max time for query processing, you should look at 
>>>>>>> 
>>>>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>>>>> but due to its inherent limitations it might not help in your
>>>>>>> situation.
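
(Illustrative fragment: timeAllowed can be set per request from SolrJ; 'solrClient' is assumed to be an existing SolrJ client and the 60-second budget is only an example:)

SolrQuery q = new SolrQuery("*:*");
q.setTimeAllowed(60000);                 // milliseconds; Solr may return partial results once exceeded
solrClient.query("TestCollection", q);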
>>>>>>> 
>>>>>>>> 4.  When we created an empty collection and loaded same data file,
>>>>>>>> it loaded fine without any issues so having more documents in a
>>>>>>>> collection would create such problems?
>>>>>>> 
>>>>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>>>>> can see from an earlier post that you were using streaming expressions
>>>>>>> for another collection: This is one of the things that are affected by
>>>>>>> the Solr 7 DocValues issue.
>>>>>>> 
>>>>>>> More info about DocValues and streaming:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>>>> 
>>>>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>>>> 
>>>>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>>>>> collection from scratch should fix it. 
>>>>>>> 
>>>>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>>>>> or you can ensure that there are values defined for all DocValues-
>>>>>>> fields in all your documents.
>>>>>>> 
>>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>>  at java.net.SocketInputStream.socketRead0(Native Method) 
>>>>>>> ...
>>>>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>>>>> timeout expired: 600000/600000 ms
>>>>>>> 
>>>>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>>>>> should be able to change it in solr.xml.
>>>>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>>>>>> 
>>>>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
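
(For reference, in the stock solr.xml these timeouts sit in the <solrcloud> section; the values shown below are the usual defaults and can be overridden via system properties:)

<solrcloud>
  ...
  <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
  <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  ...
</solrcloud>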
>>>>>>> 
>>>>>>> - Toke Eskildsen, Royal Danish Library
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
> 

RE: Time-out errors while indexing (Solr 7.7.1)

Posted by "Kommu, Vinodh K." <vk...@dtcc.com>.
Hi Eric, Toke,

Can you please look at the details shared in my email below and respond with your suggestions/feedback?


Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Monday, July 6, 2020 4:58 PM
To: solr-user@lucene.apache.org
Subject: RE: Time-out errors while indexing (Solr 7.7.1)



RE: Time-out errors while indexing (Solr 7.7.1)

Posted by "Kommu, Vinodh K." <vk...@dtcc.com>.
Thanks Eric & Toke for your response over this.

Just wanted to correct a few things here about the number of docs:

Total number of documents in the entire cluster (all collections) = 6393876826 (6.3B)
Total number of documents in the 2 bigger collections (3749389864 & 1780147848) = 5529537712 (5.5B)
Total number of documents in the remaining collections = 864339114 (864M)

So the collections altogether do not hold 13B docs. The biggest collection in the cluster holds close to 3.7B docs and the second biggest holds close to 1.8B docs, whereas the remaining 20 collections hold only 864M docs, which gives a total of 6.3B docs in the cluster.

On the hardware side, the cluster sits on 6 Solr VMs. Each VM has 170G total memory (with 2 Solr instances running per VM) and 16 vCPUs, and each Solr JVM runs with a 31G heap. The remaining memory is left for the OS disk cache and other OS operations. vm.swappiness on each VM is set to 0, so swap memory should never be used. Each collection is created using the rule-based replica placement API with 6 shards and a replication factor of 3.

One other observation on core placement: as mentioned above, we create collections using rule-based replica placement, i.e. a rule to ensure that no two replicas of the same shard sit on the same VM, using the following command:

curl -s -k user:password "https://localhost:22010/solr/admin/collections?action=CREATE&name=$SOLR_COLLECTION&numShards=${SHARDS_NO?}&replicationFactor=${REPLICATION_FACTOR?}&maxShardsPerNode=${MAX_SHARDS_PER_NODE?}&collection.configName=$SOLR_COLLECTION&rule=shard:*,replica:<2,host:*"

Variable values in the above command:

SOLR_COLLECTION = collection name
SHARDS_NO = 6
REPLICATION_FACTOR = 3
MAX_SHARDS_PER_NODE = total number of replicas / number of VMs; here that is 18/6 = 3 max shards per VM

Ideally this should place 3 cores per VM for each collection, but from the snippet below there are 2, 3 or 4 cores per collection on different VMs. VM2 and VM6 have more cores than the other VMs, so I presume this could be one reason they show more IO operations than the remaining 4 VMs.

That said, I believe Solr also considers other factors, like free disk on each VM, while placing replicas for a new collection, correct? If so, is this replica placement across the VMs fine? If not, what is needed to correct it? Can an additional core of 210G size create noticeably more disk IO? If yes, would moving that additional core from these VMs to a VM with fewer cores make any difference (i.e. ensuring each VM has a maximum of 3 shards)?
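
(If moving a core would help, I assume we could use the Collections API MOVEREPLICA action for that; the replica and target node names below are placeholders, not from our actual cluster:)

curl -s -k user:password "https://localhost:22010/solr/admin/collections?action=MOVEREPLICA&collection=Collection1&replica=core_node99&targetNode=localhost:1122_solr"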



Also, we have been noticing a significant surge in IO operations at the storage level. We are wondering whether a storage IOPS limit could make Solr starve for IO, or whether it is the other way around, i.e. Solr issuing so many read/write operations that the storage IOPS reach their upper limit.

Index size per core on each VM:

VM1:
176G  node1/solr/Collection2_shard5_replica_n30
176G  node2/solr/Collection2_shard2_replica_n24
176G  node2/solr/Collection2_shard3_replica_n2
177G  node1/solr/Collection2_shard6_replica_n10
208G  node1/solr/Collection1_shard5_replica_n18
208G  node2/solr/Collection1_shard2_replica_n1
1.1T  total

VM2:
176G  node2/solr/Collection2_shard4_replica_n16
176G  node2/solr/Collection2_shard6_replica_n34
177G  node1/solr/Collection2_shard5_replica_n6
207G  node2/solr/Collection1_shard6_replica_n10
208G  node1/solr/Collection1_shard1_replica_n32
208G  node2/solr/Collection1_shard5_replica_n30
210G  node1/solr/Collection1_shard3_replica_n14
1.4T  total

VM3:
175G  node2/solr/Collection2_shard2_replica_n12
177G  node1/solr/Collection2_shard1_replica_n20
208G  node1/solr/Collection1_shard1_replica_n8
208G  node2/solr/Collection1_shard2_replica_n12
209G  node1/solr/Collection1_shard4_replica_n28
976G  total

VM4:
176G  node1/solr/Collection2_shard4_replica_n28
177G  node1/solr/Collection2_shard1_replica_n8
207G  node2/solr/Collection1_shard6_replica_n22
208G  node1/solr/Collection1_shard5_replica_n6
210G  node1/solr/Collection1_shard3_replica_n26
975G  total

VM5:
176G  node2/solr/Collection2_shard3_replica_n14
177G  node1/solr/Collection2_shard5_replica_n18
177G  node2/solr/Collection2_shard1_replica_n32
208G  node1/solr/Collection1_shard2_replica_n24
210G  node1/solr/Collection1_shard3_replica_n2
210G  node2/solr/Collection1_shard4_replica_n4
1.2T  total

VM6:
177G  node1/solr/Collection2_shard3_replica_n26
177G  node1/solr/Collection2_shard4_replica_n4
177G  node2/solr/Collection2_shard2_replica_n1
178G  node2/solr/Collection2_shard6_replica_n22
207G  node2/solr/Collection1_shard6_replica_n34
208G  node1/solr/Collection1_shard1_replica_n20
209G  node2/solr/Collection1_shard4_replica_n16
1.3T  total

Thanks & Regards,

Vinodh



-----Original Message-----
From: Erick Erickson <er...@gmail.com>
Sent: Saturday, July 4, 2020 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Time-out errors while indexing (Solr 7.7.1)






Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Erick Erickson <er...@gmail.com>.
You need more shards. And, I’m pretty certain, more hardware.

You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute upper limit of roughly 2B (2^31 - 1) docs per shard. I don’t quite know how you’re running at all unless that 13B is a round number. If you keep adding documents, your installation will shortly, at best, stop accepting new documents for indexing. At worst you’ll start seeing weird errors and possibly corrupt indexes and have to re-index everything from scratch.

You’ve backed yourself into a pretty tight corner here. You either have to re-index to a properly-sized cluster or use SPLITSHARD. The latter will double the index-on-disk size (it creates two child indexes per replica and keeps the old one around for safety’s sake, which you have to clean up later). I strongly recommend you stop ingesting more data while you do this.
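
For reference, a split is one Collections API call per shard; a rough sketch with placeholder host, collection and shard names (running it asynchronously is advisable for indexes this size):

    # one call per shard; "split-shard1" is just an arbitrary async request id
    curl "https://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=TestCollection&shard=shard1&async=split-shard1"
    # poll until the async request reports completed
    curl "https://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-shard1"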

You say you have 6 VMs with 2 nodes running on each. If those VMs are co-located with anything else, the physical hardware is going to be stressed. VMs themselves aren’t bad, but somewhere there’s physical hardware that runs them…

In fact, I urge you to stop ingesting data immediately and address this issue. You have a cluster that’s mis-configured, and you must address that before Bad Things Happen.

Best,
Erick

> On Jul 4, 2020, at 5:09 AM, Mad have <ma...@gmail.com> wrote:
> 
> Hi Eric,
> 
> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. Total number of shards are 6 with 3 replicas. I can see the index size is more than 220GB on each node for the collection where we are facing the performance issue.
> 
> The more documents we add to the collection the indexing become slow and I also have same impression that the size of the collection is creating this issue. Appreciate if you can suggests any solution on this.
> 
> 
> Regards,
> Madhava 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 23:30, Erick Erickson <er...@gmail.com> wrote:
>> 
>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that’s_ a red flag.
>> 
>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson <er...@gmail.com> wrote:
>>> 
>>> You haven’t said how many _shards_ are present. Nor how many replicas of the collection you’re hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that’s a red flag.
>>> 
>>> Short form, though is yes. Subject to the questions above, this is what I’d be looking at first.
>>> 
>>> And, as I said, if you’ve been steadily increasing the total number of documents, you’ll reach a tipping point sometime.
>>> 
>>> Best,
>>> Erick
>>> 
>>>>> On Jul 3, 2020, at 5:32 PM, Mad have <ma...@gmail.com> wrote:
>>>> 
>>>> Hi Eric,
>>>> 
>>>> The collection has almost 13billion documents with each document around 5kb size, all the columns around 150 are the indexed. Do you think that number of documents in the collection causing this issue. Appreciate your response.
>>>> 
>>>> Regards,
>>>> Madhava 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 3 Jul 2020, at 12:42, Erick Erickson <er...@gmail.com> wrote:
>>>>> 
>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>> just have too much data on too little hardware. Check your
>>>>> swapping, how much of your I/O is just because Lucene can’t
>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>> uses MMapDirectory to hold the index and you may well be
>>>>> swapping, see:
>>>>> 
>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>> 
>>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>>> 
>>>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”
>>>>> 
>>>>> So have you been continually adding more documents to your
>>>>> collections for more than the 2-3 weeks? If so you may have just
>>>>> put so much data on the same boxes that you’ve gone over
>>>>> the capacity of your hardware. As Toke says, adding physical
>>>>> memory for the OS to use to hold relevant parts of the index may
>>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>>> 
>>>>> All that said, if you’re going to keep adding document you need to
>>>>> seriously think about adding new machines and moving some of
>>>>> your replicas to them.
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
>>>>>> 
>>>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>>>> We are performing QA performance testing on couple of collections
>>>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>>>> 
>>>>>> How many shards?
>>>>>> 
>>>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>>>> 
>>>>>> Are you saying that there are 100 times more read operations when you
>>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>>> might be filled with the data that the writers are flushing.
>>>>>> 
>>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>>> but such massive difference in IO-utilization does indicate that you
>>>>>> are starved for cache.
>>>>>> 
>>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>>>> check: How many replicas are each physical box handling? If they are
>>>>>> sharing resources, fewer replicas would probably be better.
>>>>>> 
>>>>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>>>>> more? Would that help or create any other problems?
>>>>>> 
>>>>>> It does not hurt the server to increase the client timeout as the
>>>>>> initiated query will keep running until it is finished, independent of
>>>>>> whether or not there is a client to receive the result.
>>>>>> 
>>>>>> If you want a better max time for query processing, you should look at 
>>>>>> 
>>>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>>>> but due to its inherent limitations it might not help in your
>>>>>> situation.
>>>>>> 
>>>>>>> 4.  When we created an empty collection and loaded same data file,
>>>>>>> it loaded fine without any issues so having more documents in a
>>>>>>> collection would create such problems?
>>>>>> 
>>>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>>>> can see from an earlier post that you were using streaming expressions
>>>>>> for another collection: This is one of the things that are affected by
>>>>>> the Solr 7 DocValues issue.
>>>>>> 
>>>>>> More info about DocValues and streaming:
>>>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>>> 
>>>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>>> 
>>>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>>>> collection from scratch should fix it. 
>>>>>> 
>>>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>>>> or you can ensure that there are values defined for all DocValues-
>>>>>> fields in all your documents.
>>>>>> 
>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>   at java.net.SocketInputStream.socketRead0(Native Method) 
>>>>>> ...
>>>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>>>> timeout expired: 600000/600000 ms
>>>>>> 
>>>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>>>> should be able to change it in solr.xml.
>>>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>>>>> 
>>>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>>>>> 
>>>>>> - Toke Eskildsen, Royal Danish Library
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 


Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Mad have <ma...@gmail.com>.
Hi Eric,

There are a total of 6 VMs in the Solr cluster, and 2 nodes are running on each VM. The total number of shards is 6, with 3 replicas. I can see the index size is more than 220GB on each node for the collection where we are facing the performance issue.

The more documents we add to the collection, the slower indexing becomes, and I also have the impression that the size of the collection is causing this issue. I would appreciate it if you can suggest any solution for this.


Regards,
Madhava 
Sent from my iPhone

> On 3 Jul 2020, at 23:30, Erick Erickson <er...@gmail.com> wrote:
> 
> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that’s_ a red flag.
> 
>> On Jul 3, 2020, at 5:53 PM, Erick Erickson <er...@gmail.com> wrote:
>> 
>> You haven’t said how many _shards_ are present. Nor how many replicas of the collection you’re hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that’s a red flag.
>> 
>> Short form, though is yes. Subject to the questions above, this is what I’d be looking at first.
>> 
>> And, as I said, if you’ve been steadily increasing the total number of documents, you’ll reach a tipping point sometime.
>> 
>> Best,
>> Erick
>> 
>>>> On Jul 3, 2020, at 5:32 PM, Mad have <ma...@gmail.com> wrote:
>>> 
>>> Hi Eric,
>>> 
>>> The collection has almost 13billion documents with each document around 5kb size, all the columns around 150 are the indexed. Do you think that number of documents in the collection causing this issue. Appreciate your response.
>>> 
>>> Regards,
>>> Madhava 
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 3 Jul 2020, at 12:42, Erick Erickson <er...@gmail.com> wrote:
>>>> 
>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>> just have too much data on too little hardware. Check your
>>>> swapping, how much of your I/O is just because Lucene can’t
>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>> uses MMapDirectory to hold the index and you may well be
>>>> swapping, see:
>>>> 
>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>> 
>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>> 
>>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”
>>>> 
>>>> So have you been continually adding more documents to your
>>>> collections for more than the 2-3 weeks? If so you may have just
>>>> put so much data on the same boxes that you’ve gone over
>>>> the capacity of your hardware. As Toke says, adding physical
>>>> memory for the OS to use to hold relevant parts of the index may
>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>> 
>>>> All that said, if you’re going to keep adding document you need to
>>>> seriously think about adding new machines and moving some of
>>>> your replicas to them.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
>>>>> 
>>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>>> We are performing QA performance testing on couple of collections
>>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>>> 
>>>>> How many shards?
>>>>> 
>>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>>> 
>>>>> Are you saying that there are 100 times more read operations when you
>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>> might be filled with the data that the writers are flushing.
>>>>> 
>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>> but such massive difference in IO-utilization does indicate that you
>>>>> are starved for cache.
>>>>> 
>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>>> check: How many replicas are each physical box handling? If they are
>>>>> sharing resources, fewer replicas would probably be better.
>>>>> 
>>>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>>>> more? Would that help or create any other problems?
>>>>> 
>>>>> It does not hurt the server to increase the client timeout as the
>>>>> initiated query will keep running until it is finished, independent of
>>>>> whether or not there is a client to receive the result.
>>>>> 
>>>>> If you want a better max time for query processing, you should look at 
>>>>> 
>>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>>> but due to its inherent limitations it might not help in your
>>>>> situation.
>>>>> 
>>>>>> 4.  When we created an empty collection and loaded same data file,
>>>>>> it loaded fine without any issues so having more documents in a
>>>>>> collection would create such problems?
>>>>> 
>>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>>> can see from an earlier post that you were using streaming expressions
>>>>> for another collection: This is one of the things that are affected by
>>>>> the Solr 7 DocValues issue.
>>>>> 
>>>>> More info about DocValues and streaming:
>>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>> 
>>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>> 
>>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>>> collection from scratch should fix it. 
>>>>> 
>>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>>> or you can ensure that there are values defined for all DocValues-
>>>>> fields in all your documents.
>>>>> 
>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>    at java.net.SocketInputStream.socketRead0(Native Method) 
>>>>> ...
>>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>>> timeout expired: 600000/600000 ms
>>>>> 
>>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>>> should be able to change it in solr.xml.
>>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>>>> 
>>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>>>> 
>>>>> - Toke Eskildsen, Royal Danish Library
>>>>> 
>>>>> 
>>>> 
>> 
> 

Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Erick Erickson <er...@gmail.com>.
Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that’s_ a red flag.

> On Jul 3, 2020, at 5:53 PM, Erick Erickson <er...@gmail.com> wrote:
> 
> You haven’t said how many _shards_ are present. Nor how many replicas of the collection you’re hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that’s a red flag.
> 
> Short form, though is yes. Subject to the questions above, this is what I’d be looking at first.
> 
> And, as I said, if you’ve been steadily increasing the total number of documents, you’ll reach a tipping point sometime.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 5:32 PM, Mad have <ma...@gmail.com> wrote:
>> 
>> Hi Eric,
>> 
>> The collection has almost 13billion documents with each document around 5kb size, all the columns around 150 are the indexed. Do you think that number of documents in the collection causing this issue. Appreciate your response.
>> 
>> Regards,
>> Madhava 
>> 
>> Sent from my iPhone
>> 
>>> On 3 Jul 2020, at 12:42, Erick Erickson <er...@gmail.com> wrote:
>>> 
>>> If you’re seeing low CPU utilization at the same time, you probably
>>> just have too much data on too little hardware. Check your
>>> swapping, how much of your I/O is just because Lucene can’t
>>> hold all the parts of the index it needs in memory at once? Lucene
>>> uses MMapDirectory to hold the index and you may well be
>>> swapping, see:
>>> 
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>> 
>>> But my guess is that you’ve just reached a tipping point. You say:
>>> 
>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”
>>> 
>>> So have you been continually adding more documents to your
>>> collections for more than the 2-3 weeks? If so you may have just
>>> put so much data on the same boxes that you’ve gone over
>>> the capacity of your hardware. As Toke says, adding physical
>>> memory for the OS to use to hold relevant parts of the index may
>>> alleviate the problem (again, refer to Uwe’s article for why).
>>> 
>>> All that said, if you’re going to keep adding document you need to
>>> seriously think about adding new machines and moving some of
>>> your replicas to them.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
>>>> 
>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>> We are performing QA performance testing on couple of collections
>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>> 
>>>> How many shards?
>>>> 
>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>> 
>>>> Are you saying that there are 100 times more read operations when you
>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>> might be filled with the data that the writers are flushing.
>>>> 
>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>> but such massive difference in IO-utilization does indicate that you
>>>> are starved for cache.
>>>> 
>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>> check: How many replicas are each physical box handling? If they are
>>>> sharing resources, fewer replicas would probably be better.
>>>> 
>>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>>> more? Would that help or create any other problems?
>>>> 
>>>> It does not hurt the server to increase the client timeout as the
>>>> initiated query will keep running until it is finished, independent of
>>>> whether or not there is a client to receive the result.
>>>> 
>>>> If you want a better max time for query processing, you should look at 
>>>> 
>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>> but due to its inherent limitations it might not help in your
>>>> situation.
>>>> 
>>>>> 4.  When we created an empty collection and loaded same data file,
>>>>> it loaded fine without any issues so having more documents in a
>>>>> collection would create such problems?
>>>> 
>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>> can see from an earlier post that you were using streaming expressions
>>>> for another collection: This is one of the things that are affected by
>>>> the Solr 7 DocValues issue.
>>>> 
>>>> More info about DocValues and streaming:
>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>> 
>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>> 
>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>> collection from scratch should fix it. 
>>>> 
>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>> or you can ensure that there are values defined for all DocValues-
>>>> fields in all your documents.
>>>> 
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>     at java.net.SocketInputStream.socketRead0(Native Method) 
>>>> ...
>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>> timeout expired: 600000/600000 ms
>>>> 
>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>> should be able to change it in solr.xml.
>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>>> 
>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>>> 
>>>> - Toke Eskildsen, Royal Danish Library
>>>> 
>>>> 
>>> 
> 


Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Erick Erickson <er...@gmail.com>.
You haven’t said how many _shards_ are present. Nor how many replicas of the collection you’re hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that’s a red flag.

Short form, though is yes. Subject to the questions above, this is what I’d be looking at first.

And, as I said, if you’ve been steadily increasing the total number of documents, you’ll reach a tipping point sometime.

Best,
Erick

> On Jul 3, 2020, at 5:32 PM, Mad have <ma...@gmail.com> wrote:
> 
> Hi Eric,
> 
> The collection has almost 13billion documents with each document around 5kb size, all the columns around 150 are the indexed. Do you think that number of documents in the collection causing this issue. Appreciate your response.
> 
> Regards,
> Madhava 
> 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 12:42, Erick Erickson <er...@gmail.com> wrote:
>> 
>> If you’re seeing low CPU utilization at the same time, you probably
>> just have too much data on too little hardware. Check your
>> swapping, how much of your I/O is just because Lucene can’t
>> hold all the parts of the index it needs in memory at once? Lucene
>> uses MMapDirectory to hold the index and you may well be
>> swapping, see:
>> 
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> 
>> But my guess is that you’ve just reached a tipping point. You say:
>> 
>> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”
>> 
>> So have you been continually adding more documents to your
>> collections for more than the 2-3 weeks? If so you may have just
>> put so much data on the same boxes that you’ve gone over
>> the capacity of your hardware. As Toke says, adding physical
>> memory for the OS to use to hold relevant parts of the index may
>> alleviate the problem (again, refer to Uwe’s article for why).
>> 
>> All that said, if you’re going to keep adding document you need to
>> seriously think about adding new machines and moving some of
>> your replicas to them.
>> 
>> Best,
>> Erick
>> 
>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
>>> 
>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>> We are performing QA performance testing on couple of collections
>>>> which holds 2 billion and 3.5 billion docs respectively.
>>> 
>>> How many shards?
>>> 
>>>> 1.  Our performance team noticed that read operations are pretty
>>>> more than write operations like 100:1 ratio, is this expected during
>>>> indexing or solr nodes are doing any other operations like syncing?
>>> 
>>> Are you saying that there are 100 times more read operations when you
>>> are indexing? That does not sound too unrealistic as the disk cache
>>> might be filled with the data that the writers are flushing.
>>> 
>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>> but such massive difference in IO-utilization does indicate that you
>>> are starved for cache.
>>> 
>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>> check: How many replicas are each physical box handling? If they are
>>> sharing resources, fewer replicas would probably be better.
>>> 
>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>> more? Would that help or create any other problems?
>>> 
>>> It does not hurt the server to increase the client timeout as the
>>> initiated query will keep running until it is finished, independent of
>>> whether or not there is a client to receive the result.
>>> 
>>> If you want a better max time for query processing, you should look at 
>>> 
>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>> but due to its inherent limitations it might not help in your
>>> situation.
>>> 
>>>> 4.  When we created an empty collection and loaded same data file,
>>>> it loaded fine without any issues so having more documents in a
>>>> collection would create such problems?
>>> 
>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>> leading to excessive IO-activity, which might be what you are seeing. I
>>> can see from an earlier post that you were using streaming expressions
>>> for another collection: This is one of the things that are affected by
>>> the Solr 7 DocValues issue.
>>> 
>>> More info about DocValues and streaming:
>>> https://issues.apache.org/jira/browse/SOLR-13013
>>> 
>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>> 
>>> If this is your problem, upgrading to Solr 8 and indexing the
>>> collection from scratch should fix it. 
>>> 
>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>> or you can ensure that there are values defined for all DocValues-
>>> fields in all your documents.
>>> 
>>>> java.net.SocketTimeoutException: Read timed out
>>>>      at java.net.SocketInputStream.socketRead0(Native Method) 
>>> ...
>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>> timeout expired: 600000/600000 ms
>>> 
>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>> should be able to change it in solr.xml.
>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>> 
>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>> 
>>> - Toke Eskildsen, Royal Danish Library
>>> 
>>> 
>> 


Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Mad have <ma...@gmail.com>.
Hi Eric,

The collection has almost 13 billion documents, each around 5KB in size, and all of the roughly 150 columns are indexed. Do you think the number of documents in the collection is causing this issue? I appreciate your response.

Regards,
Madhava 

Sent from my iPhone

> On 3 Jul 2020, at 12:42, Erick Erickson <er...@gmail.com> wrote:
> 
> If you’re seeing low CPU utilization at the same time, you probably
> just have too much data on too little hardware. Check your
> swapping, how much of your I/O is just because Lucene can’t
> hold all the parts of the index it needs in memory at once? Lucene
> uses MMapDirectory to hold the index and you may well be
> swapping, see:
> 
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> But my guess is that you’ve just reached a tipping point. You say:
> 
> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”
> 
> So have you been continually adding more documents to your
> collections for more than the 2-3 weeks? If so you may have just
> put so much data on the same boxes that you’ve gone over
> the capacity of your hardware. As Toke says, adding physical
> memory for the OS to use to hold relevant parts of the index may
> alleviate the problem (again, refer to Uwe’s article for why).
> 
> All that said, if you’re going to keep adding document you need to
> seriously think about adding new machines and moving some of
> your replicas to them.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
>> 
>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>> We are performing QA performance testing on couple of collections
>>> which holds 2 billion and 3.5 billion docs respectively.
>> 
>> How many shards?
>> 
>>> 1.  Our performance team noticed that read operations are pretty
>>> more than write operations like 100:1 ratio, is this expected during
>>> indexing or solr nodes are doing any other operations like syncing?
>> 
>> Are you saying that there are 100 times more read operations when you
>> are indexing? That does not sound too unrealistic as the disk cache
>> might be filled with the data that the writers are flushing.
>> 
>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>> but such massive difference in IO-utilization does indicate that you
>> are starved for cache.
>> 
>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>> check: How many replicas are each physical box handling? If they are
>> sharing resources, fewer replicas would probably be better.
>> 
>>> 3.  Our client timeout is set to 2mins, can they increase further
>>> more? Would that help or create any other problems?
>> 
>> It does not hurt the server to increase the client timeout as the
>> initiated query will keep running until it is finished, independent of
>> whether or not there is a client to receive the result.
>> 
>> If you want a better max time for query processing, you should look at 
>> 
>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>> but due to its inherent limitations it might not help in your
>> situation.
>> 
>>> 4.  When we created an empty collection and loaded same data file,
>>> it loaded fine without any issues so having more documents in a
>>> collection would create such problems?
>> 
>> Solr 7 does have a problem with sparse DocValues and many documents,
>> leading to excessive IO-activity, which might be what you are seeing. I
>> can see from an earlier post that you were using streaming expressions
>> for another collection: This is one of the things that are affected by
>> the Solr 7 DocValues issue.
>> 
>> More info about DocValues and streaming:
>> https://issues.apache.org/jira/browse/SOLR-13013
>> 
>> Fairly in-depth info on the problem with Solr 7 docValues:
>> https://issues.apache.org/jira/browse/LUCENE-8374
>> 
>> If this is your problem, upgrading to Solr 8 and indexing the
>> collection from scratch should fix it. 
>> 
>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>> or you can ensure that there are values defined for all DocValues-
>> fields in all your documents.
>> 
>>> java.net.SocketTimeoutException: Read timed out
>>>       at java.net.SocketInputStream.socketRead0(Native Method) 
>> ...
>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>> timeout expired: 600000/600000 ms
>> 
>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>> should be able to change it in solr.xml.
>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>> 
>> BUT if an update takes > 10 minutes to be processed, it indicates that
>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>> 
>> - Toke Eskildsen, Royal Danish Library
>> 
>> 
> 

Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Erick Erickson <er...@gmail.com>.
If you’re seeing low CPU utilization at the same time, you probably
just have too much data on too little hardware. Check your
swapping, how much of your I/O is just because Lucene can’t
hold all the parts of the index it needs in memory at once? Lucene
uses MMapDirectory to hold the index and you may well be
swapping, see:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
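
A quick way to check, assuming Linux hosts (standard OS tools, nothing Solr-specific):

    free -h        # how much RAM is left over for the OS page cache
    vmstat 5 5     # non-zero si/so columns mean the box is swapping
    iostat -x 5 5  # sustained ~100% device utilization matches what you describe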

But my guess is that you’ve just reached a tipping point. You say:

"From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing”

So have you been continually adding more documents to your
collections for more than the 2-3 weeks? If so you may have just
put so much data on the same boxes that you’ve gone over
the capacity of your hardware. As Toke says, adding physical
memory for the OS to use to hold relevant parts of the index may
alleviate the problem (again, refer to Uwe’s article for why).

All that said, if you’re going to keep adding documents you need to
seriously think about adding new machines and moving some of
your replicas to them.
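
For reference, moving a replica to a new node is a single Collections API call; a sketch with placeholder collection, replica and node names:

    curl "https://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=TestCollection&replica=core_node21&targetNode=newhost:8983_solr"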

Best,
Erick

> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <to...@kb.dk> wrote:
> 
> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>> We are performing QA performance testing on couple of collections
>> which holds 2 billion and 3.5 billion docs respectively.
> 
> How many shards?
> 
>>  1.  Our performance team noticed that read operations are pretty
>> more than write operations like 100:1 ratio, is this expected during
>> indexing or solr nodes are doing any other operations like syncing?
> 
> Are you saying that there are 100 times more read operations when you
> are indexing? That does not sound too unrealistic as the disk cache
> might be filled with the data that the writers are flushing.
> 
> In that case, more RAM would help. Okay, more RAM nearly always helps,
> but such massive difference in IO-utilization does indicate that you
> are starved for cache.
> 
> I noticed you have at least 18 replicas. That's a lot. Just to sanity
> check: How many replicas are each physical box handling? If they are
> sharing resources, fewer replicas would probably be better.
> 
>>  3.  Our client timeout is set to 2mins, can they increase further
>> more? Would that help or create any other problems?
> 
> It does not hurt the server to increase the client timeout as the
> initiated query will keep running until it is finished, independent of
> whether or not there is a client to receive the result.
> 
> If you want a better max time for query processing, you should look at 
> 
> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
> but due to its inherent limitations it might not help in your
> situation.
> 
>>  4.  When we created an empty collection and loaded same data file,
>> it loaded fine without any issues so having more documents in a
>> collection would create such problems?
> 
> Solr 7 does have a problem with sparse DocValues and many documents,
> leading to excessive IO-activity, which might be what you are seeing. I
> can see from an earlier post that you were using streaming expressions
> for another collection: This is one of the things that are affected by
> the Solr 7 DocValues issue.
> 
> More info about DocValues and streaming:
> https://issues.apache.org/jira/browse/SOLR-13013
> 
> Fairly in-depth info on the problem with Solr 7 docValues:
> https://issues.apache.org/jira/browse/LUCENE-8374
> 
> If this is your problem, upgrading to Solr 8 and indexing the
> collection from scratch should fix it. 
> 
> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
> or you can ensure that there are values defined for all DocValues-
> fields in all your documents.
> 
>> java.net.SocketTimeoutException: Read timed out
>>        at java.net.SocketInputStream.socketRead0(Native Method) 
> ...
>> Remote error message: java.util.concurrent.TimeoutException: Idle
>> timeout expired: 600000/600000 ms
> 
> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
> should be able to change it in solr.xml.
> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
> 
> BUT if an update takes > 10 minutes to be processed, it indicates that
> the cluster is overloaded.  Increasing the timeout is just a band-aid.
> 
> - Toke Eskildsen, Royal Danish Library
> 
> 


Re: Time-out errors while indexing (Solr 7.7.1)

Posted by Toke Eskildsen <to...@kb.dk>.
On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
> We are performing QA performance testing on couple of collections
> which holds 2 billion and 3.5 billion docs respectively.

How many shards?

>   1.  Our performance team noticed that read operations are pretty
> more than write operations like 100:1 ratio, is this expected during
> indexing or solr nodes are doing any other operations like syncing?

Are you saying that there are 100 times more read operations when you
are indexing? That does not sound too unrealistic as the disk cache
might be filled with the data that the writers are flushing.

In that case, more RAM would help. Okay, more RAM nearly always helps,
but such massive difference in IO-utilization does indicate that you
are starved for cache.

I noticed you have at least 18 replicas. That's a lot. Just to sanity
check: How many replicas are each physical box handling? If they are
sharing resources, fewer replicas would probably be better.

>   3.  Our client timeout is set to 2mins, can they increase further
> more? Would that help or create any other problems?

It does not hurt the server to increase the client timeout as the
initiated query will keep running until it is finished, independent of
whether or not there is a client to receive the result.
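
If the 2-minute value is coming from the SolrJ side, it is normally set when the client is built. A minimal sketch, assuming an HttpSolrClient and placeholder values in milliseconds:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    // The socket (read) timeout is the "client timeout" in question;
    // the connection timeout only covers establishing the connection.
    HttpSolrClient client = new HttpSolrClient.Builder("https://localhost:8983/solr")
        .withConnectionTimeout(15000)
        .withSocketTimeout(300000)
        .build();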

If you want a better max time for query processing, you should look at 

https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
 but due to its inherent limitations it might not help in your
situation.
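
From SolrJ it is just another query parameter; a rough sketch (the value is a placeholder, and results may be partial once the limit is hit):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.params.CommonParams;

    SolrQuery query = new SolrQuery("*:*");
    query.set(CommonParams.TIME_ALLOWED, 120000);  // milliseconds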

>   4.  When we created an empty collection and loaded same data file,
> it loaded fine without any issues so having more documents in a
> collection would create such problems?

Solr 7 does have a problem with sparse DocValues and many documents,
leading to excessive IO-activity, which might be what you are seeing. I
can see from an earlier post that you were using streaming expressions
for another collection: This is one of the things that are affected by
the Solr 7 DocValues issue.

More info about DocValues and streaming:
https://issues.apache.org/jira/browse/SOLR-13013

Fairly in-depth info on the problem with Solr 7 docValues:
https://issues.apache.org/jira/browse/LUCENE-8374

If this is your problem, upgrading to Solr 8 and indexing the
collection from scratch should fix it. 

Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
or you can ensure that there are values defined for all DocValues-
fields in all your documents.
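
One way to guarantee the latter is a default value in the schema, so documents that omit the field still get a docValue. A hypothetical example (field name, type and default are placeholders; whether a default is semantically acceptable depends on your data):

    <field name="status_code" type="plong" indexed="true" stored="false"
           docValues="true" default="0"/>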

> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method) 
...
> Remote error message: java.util.concurrent.TimeoutException: Idle
> timeout expired: 600000/600000 ms

There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
should be able to change it in solr.xml.
https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
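
For reference, it lives in the <solrcloud> section of solr.xml; a sketch with placeholder values (20 minutes here), not a recommendation:

    <solr>
      <solrcloud>
        <!-- socket timeout for distributed updates between nodes, in ms -->
        <int name="distribUpdateSoTimeout">1200000</int>
        <int name="distribUpdateConnTimeout">60000</int>
        <!-- other settings unchanged -->
      </solrcloud>
    </solr>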

BUT if an update takes > 10 minutes to be processed, it indicates that
the cluster is overloaded.  Increasing the timeout is just a band-aid.

- Toke Eskildsen, Royal Danish Library