Posted to solr-user@lucene.apache.org by "Awasthi, Shishir" <sh...@baml.com> on 2011/11/02 04:57:58 UTC

RE: large scale indexing issues / single threaded bottleneck

Roman,
How frequently do you update your index? I have a need to do real-time
add/delete of Solr documents at a rate of approximately 20/min.
The total number of documents is in the range of 4 million. Will there
be any performance issues?

Thanks,
Shishir

-----Original Message-----
From: Roman Alekseenkov [mailto:ralekseenkov@gmail.com] 
Sent: Sunday, October 30, 2011 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies.

I think I figured out a partial solution to the problem on Friday
night. Adding a whole bunch of debug statements to the info stream
showed that every document follows the "update document" path instead
of the "add document" path. That means all document IDs end up in the
"pending deletes" queue, and Solr has to rescan its index on every
commit for potential deletions. This is single-threaded and seems to
get progressively slower as the index grows.
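
For reference, here is the difference between the two paths in the
Lucene API (a minimal Lucene 3.x sketch; the "id" field name is
illustrative, not necessarily your schema's unique key):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class IndexingPaths {
        // "add document" path: the document is simply appended; no
        // delete bookkeeping is created.
        static void add(IndexWriter writer, Document doc)
                throws IOException {
            writer.addDocument(doc);
        }

        // "update document" path: a pending delete is queued for the
        // term, and the writer later has to resolve it against the
        // existing index (applyDeletes), which gets more expensive as
        // the index grows.
        static void addOrReplace(IndexWriter writer, String id,
                Document doc) throws IOException {
            writer.updateDocument(new Term("id", id), doc);
        }
    }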

Adding overwrite=false to the URL in the /update handler did NOT help:
my debug statements showed that messages still reach updateDocument()
with a non-null deleteTerm. So I hacked Lucene a little and set
deleteTerm=null at the beginning of updateDocument() as a temporary
workaround, and it no longer calls applyDeletes().
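
Roughly, the temporary hack looks like this (a sketch of the relevant
method in Lucene 3.4's IndexWriter, not the exact diff; it is only
safe when document IDs are known to never repeat, because it silently
turns every update into a pure add):

    public void updateDocument(Term term, Document doc, Analyzer analyzer)
            throws CorruptIndexException, IOException {
        // Temporary hack: with no delete term, nothing is queued in
        // "pending deletes", so applyDeletes() has nothing to scan for.
        term = null;
        // ... rest of the method body unchanged ...
    }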

This gave a 6-8x performance boost, and now we can index about 9
million documents/hour (producing 20 GB of index every hour). Right
now it's at 1 TB of index and growing, without noticeable degradation
of the indexing speed.
This is decent, but the 24-core machine is still barely utilized :)

Now I think it's hitting a merge bottleneck, where all indexing
threads get paused. ConcurrentMergeScheduler with 4 threads is not
helping much. I guess the changes on trunk would definitely help, but
we will likely stay on 3.4.
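
For reference, the scheduler tuning amounts to roughly this in the
Lucene 3.4 API (a sketch with illustrative values; Solr itself wires
the merge scheduler up through solrconfig.xml rather than code):

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    class MergeTuning {
        static IndexWriter open(File indexDir) throws IOException {
            // Allow up to 4 merges to run concurrently.
            ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
            cms.setMaxThreadCount(4);

            IndexWriterConfig iwc = new IndexWriterConfig(
                    Version.LUCENE_34,
                    new StandardAnalyzer(Version.LUCENE_34));
            iwc.setMergeScheduler(cms);
            return new IndexWriter(FSDirectory.open(indexDir), iwc);
        }
    }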

Will dig more into the issue on Monday. Really curious to see why
"overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations.

Roman

RE: large scale indexing issues / single threaded bottleneck

Posted by Roman Alekseenkov <ra...@gmail.com>.
We have a rate of 2K small docs/sec, which translates into 90 GB/day
of index space. You should be fine.

Roman


Awasthi, Shishir wrote:
> [quoted text trimmed; see Shishir's message above]

RE: large scale indexing issues / single threaded bottleneck

Posted by se...@immonet.de.
Hi,

we are currently thinking about performance, too.
I wonder if there are any sites on the net describing what a large
index actually is.

People always talk about huge indexes and heavy commits etc., but I
can't find any stats with real numbers, or any information about the
hardware used.

Maybe an article in the wiki would help.

I expect our index to be about 4 to 5 GB, with 500,000 docs and
80,000 commits a day. Is that considered large, medium or small?

Greets
Sebastian

-----Original Message-----
From: Jaeger, Jay - DOT [mailto:Jay.Jaeger@dot.wi.gov]
Sent: Thursday, November 3, 2011 2:00 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: large scale indexing issues / single threaded bottleneck

[quoted messages trimmed; see the full copies elsewhere in this thread]

RE: large scale indexing issues / single threaded bottleneck

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.
Shishir, we have 35 million "documents" and add about 5,000-10,000
new ones a day, but they are very small: 40 fields with at most a few
terms each, many of them single terms.

You may occasionally see some impact from top-level index merges, but
those should be very infrequent given your stated volumes.

For more concrete advice, you should also provide information on the
size of your documents and your search volume.

JRJ

[quoted messages trimmed; see the full copies earlier in this thread]