Posted to solr-user@lucene.apache.org by kenf_nc <ke...@realestate.com> on 2010/07/16 20:39:58 UTC

indexing best practices

I was curious whether anyone has done work on finding the optimal (or max)
number of client processes for indexing. That is, if I have the ability
to spin up N processes that each construct a POST to add/update a Solr
document, is there a point at which the number of clients posting
simultaneously overloads Solr's ability to keep up with the adds? I know
this is very hardware dependent, but I am looking for ballpark guidelines.
This will be in a Tomcat process running on Windows Server 2008, with 2 Solr
instances: one master and one slave, standard replication.

Related to this, is there a best-practice number of documents to send in a
single POST? (Again, I know it depends on the complexity of the document,
field types, analyzers/tokenizers, etc.)

And finally, what do you find to be the best approach to getting data into
Solr? Assume the technology aspect isn't an issue (except that I don't want to
use EmbeddedSolr) and you just want to get documents added/updated as quickly
as possible: POST, XML or CSV document upload, DataImportHandler, other? I'm
just looking for raw speed, not architectural factors.

So, in a nutshell, all other factors put aside, I'm looking for the best
approach to indexing with pure raw speed as the only criterion.

Thanks,
Ken
-- 
View this message in context: http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p973274.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing best practices

Posted by marship <ma...@126.com>.
Hi. I just noticed that when you add documents to Solr, it helps to turn the auto-commit flag off and, once posting is done, commit and optimize. Then the speed is super fast.

I was using 31 clients to post to 31 Solr cores at the same time. I think if you use 2 clients to post to the same core, the question will be "how fast can your client generate the XML?". In my case, Solr indexes faster than I can create the XML.
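
For readers who want to try this pattern, here is a minimal Python sketch of it, not the poster's actual client. It builds batched <add> messages, posts them without committing, and sends a single <commit/> and <optimize/> at the end. The update URL, the batch size of 1000, and all function names are assumptions for illustration; the sketch also assumes autoCommit is already disabled in solrconfig.xml.

```python
from xml.sax.saxutils import escape, quoteattr
import urllib.request

# Assumed endpoint: adjust host, port and core path for your installation.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(docs):
    """Render a list of {field: value} dicts as one Solr <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_xml(xml, url=UPDATE_URL):
    """POST one XML message to the update handler (needs a running Solr)."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_all(docs, batch_size=1000, send=post_xml):
    """Send every batch without committing, then commit and optimize once."""
    sent = 0
    for batch in batches(docs, batch_size):
        send(build_add_xml(batch))
        sent += 1
    send("<commit/>")
    send("<optimize/>")
    return sent
```

The `send` parameter is injectable so the batching logic can be exercised without a server; the right batch size still has to be found by measurement on your own documents.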


 


Re: indexing best practices

Posted by Lance Norskog <go...@gmail.com>.
"Nomerge" has struck me as somewhat uncontrollable. There is also a
"balanced" merge policy in the trunk, courtesy of LinkedIn.

On Mon, Jul 19, 2010 at 12:43 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hi Ken,
>
> This is all very dependent on your documents, your indexing setup and your hardware. Just as an extreme data point, I'll describe our experience.
>
> We run 5 clients on each of 6 machines to send documents to Solr using the standard http xml process.  Our documents contain about 10 fields, but one field contains OCR for the full text of a book.  The documents are about 700KB in size.
>
> Each client sends solr documents to one of 10 solr shards on a round-robin basis.  We are running 5 shards on each of two dedicated indexing machines each with 144GB of memory and 2 x Quad Core Intel Xeon E5540 2.53GHz processors (Nehalem).  What we generally see is that once the index gets large enough for significant merging, our producers can send documents to solr faster than it can index them.
>
> We suspect that our bottleneck is simply disk I/O for index merging on the Solr build machines. We are currently experimenting with changing the RAM buffer size (ramBufferSizeMB) setting and various merge policies/merge factors to see if we can speed up the Solr end of the indexing process. Since we optimize our index down to two segments, we are also planning to experiment with using the "nomerge" merge policy. I hope to have some results to report on our blog sometime in the next month or so.
>
> Tom Burton-West
> www.hathitrust.org/blogs



-- 
Lance Norskog
goksron@gmail.com

RE: indexing best practices

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Ken,

This is all very dependent on your documents, your indexing setup and your hardware. Just as an extreme data point, I'll describe our experience.  

We run 5 clients on each of 6 machines to send documents to Solr using the standard http xml process.  Our documents contain about 10 fields, but one field contains OCR for the full text of a book.  The documents are about 700KB in size.

Each client sends solr documents to one of 10 solr shards on a round-robin basis.  We are running 5 shards on each of two dedicated indexing machines each with 144GB of memory and 2 x Quad Core Intel Xeon E5540 2.53GHz processors (Nehalem).  What we generally see is that once the index gets large enough for significant merging, our producers can send documents to solr faster than it can index them.
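
As a rough illustration of that layout, the following Python sketch pairs documents with shards in round-robin order and sends them from a small pool of worker threads. It is a simplification under stated assumptions, not our production code: the shard URLs and function names are hypothetical, and a single shared round-robin cycle stands in for each client keeping its own.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

# Hypothetical shard URLs (2 machines x 5 shards); the real ones are not given.
SHARD_URLS = [
    "http://indexer%d:8983/solr/shard%d/update" % (machine, shard)
    for machine in (1, 2)
    for shard in range(5)
]


def assign_round_robin(docs, shards):
    """Pair each document with the next shard in a repeating cycle."""
    return list(zip(cycle(shards), docs))


def run_clients(docs, shards, send, clients=5):
    """Send (shard, doc) pairs from a pool of `clients` worker threads."""
    pairs = assign_round_robin(docs, shards)
    with ThreadPoolExecutor(max_workers=clients) as pool:
        # map() returns results in the same order as the input pairs.
        return list(pool.map(lambda pair: send(*pair), pairs))
```

Here `send(shard_url, doc)` is whatever function actually posts a document; timing `run_clients` with different `clients` values is one way to look for the point of diminishing returns on a given setup.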

We suspect that our bottleneck is simply disk I/O for index merging on the Solr build machines. We are currently experimenting with changing the RAM buffer size (ramBufferSizeMB) setting and various merge policies/merge factors to see if we can speed up the Solr end of the indexing process. Since we optimize our index down to two segments, we are also planning to experiment with using the "nomerge" merge policy. I hope to have some results to report on our blog sometime in the next month or so.

Tom Burton-West
www.hathitrust.org/blogs

-----Original Message-----
From: kenf_nc [mailto:ken.foster@realestate.com] 
Sent: Sunday, July 18, 2010 8:18 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing best practices


No one has done performance analysis? Or has a link to anywhere where it's
been done?

basically fastest way to get documents into Solr. So many options available,
what's the fastest:
1) file import (xml, csv)  vs  DIH  vs POSTing
2) number of concurrent clients   1   vs 10 vs 100 ...is there a diminishing
returns number?

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple html text
fields) that get updated monthly. It currently takes about 20 hours to do a
full import. I would like to cut that down as much as possible.
Thanks,
Ken
-- 
View this message in context: http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing best practices

Posted by Geert-Jan Brits <gb...@gmail.com>.
Have you read:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

In short, there are only guidelines (see links), no definitive answers.
If you have followed the guidelines for improving indexing speed on a single
box and, after testing various settings, indexing is still too slow, you may
want to test this scenario:
1. index to several boxes/shards (using round robin or something).
2. copy all created indexes to one box.
3. use IndexWriter.addIndexes to merge the indexes.

Doing steps 1-3 on SSDs is of course going to boost performance a lot as well
(on large indexes, because small ones may fit entirely in the disk cache).
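
IndexWriter.addIndexes is a Lucene Java API. If you would rather do the merge step through Solr itself, the CoreAdmin mergeindexes action is an alternative I am swapping in here, not the same call. A minimal Python sketch that only builds the request URL follows; the host, port, core name and index paths are placeholders, and it assumes a multi-core setup with the CoreAdmin handler enabled.

```python
from urllib.parse import urlencode

# Assumed CoreAdmin endpoint; host, port and core names are placeholders.
CORE_ADMIN_URL = "http://localhost:8983/solr/admin/cores"


def merge_indexes_url(target_core, index_dirs, base=CORE_ADMIN_URL):
    """Build a CoreAdmin mergeindexes request for the copied index directories."""
    params = [("action", "mergeindexes"), ("core", target_core)]
    # One indexDir parameter per source index copied onto this box.
    params += [("indexDir", path) for path in index_dirs]
    return base + "?" + urlencode(params)


url = merge_indexes_url("core0", ["/data/shard1/index", "/data/shard2/index"])
```

Fetching the resulting URL against a running Solr triggers the merge; a commit on the target core is still needed afterwards before the merged documents become searchable.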
Hope that helps a bit,
Geert-Jan


Re: indexing best practices

Posted by kenf_nc <ke...@realestate.com>.
No one has done performance analysis? Or has a link to anywhere where it's
been done?

Basically, I'm after the fastest way to get documents into Solr. So many
options are available; what's the fastest:
1) file import (xml, csv)  vs  DIH  vs POSTing
2) number of concurrent clients   1   vs 10 vs 100 ...is there a diminishing
returns number?

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple html text
fields) that get updated monthly. It currently takes about 20 hours to do a
full import. I would like to cut that down as much as possible.
Thanks,
Ken
-- 
View this message in context: http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
Sent from the Solr - User mailing list archive at Nabble.com.
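
For a sense of the baseline implied by those figures, here is the arithmetic as a short Python snippet. It assumes both document sets are included in the 20-hour full import, which the post suggests but does not state outright.

```python
small_docs = 16_000_000   # small documents, 8 to 10 fields each
large_docs = 2_500_000    # largish documents, 20 to 30 fields each
import_hours = 20         # reported duration of a full import

# Average throughput across the whole import, in documents per second.
docs_per_second = (small_docs + large_docs) / (import_hours * 3600)
print(round(docs_per_second))  # about 257 documents per second overall
```

Any change to client count, batch size or merge settings can be judged against this average.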