Posted to solr-user@lucene.apache.org by Peyman Faratin <pe...@robustlinks.com> on 2012/03/10 17:09:45 UTC

Faster Solr Indexing

Hi

I am trying to index 12MM docs faster than is currently happening in Solr (using SolrJ). We have identified Solr's add method as the bottleneck (and not commit - which is tuned OK through mergeFactor and maxRamBufferSize and JVM RAM). 

Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming that removing the HTTP overhead would speed things up with embedding) but the difference is marginal. 
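
Roughly, the add loop looks like the sketch below (field names, the URL and the batch size are illustrative only; SolrJ 3.x API assumed):

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {                    // one batch of 1000 docs
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("content", "... body text ...");   // shingledContent is a copyField
            batch.add(doc);
        }
        server.add(batch);   // this add() call is where the ~25 seconds per 1000 docs go
        server.commit();     // commit once per batch, not per document
    }
}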

The docs being indexed have on average 20 fields, mostly indexed but none stored. The major size contributors are two fields:

	- content, and
	- shingledContent (populated using copyField of content).

The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice in Solr Enterprise Search Server - p. 305, section "The Solution: Shingling"). 
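
For illustration, a tiny standalone sketch of what the bigram shingling in shingledContent boils down to (Lucene 3.x contrib analyzers assumed; the real chain is configured in schema.xml and the sample text is made up):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
    public static void main(String[] args) throws Exception {
        TokenStream tokens = new WhitespaceTokenizer(Version.LUCENE_35,
                new StringReader("faster solr indexing"));
        ShingleFilter shingles = new ShingleFilter(tokens, 2, 2);  // bigrams only
        shingles.setOutputUnigrams(false);   // unigrams are served by the content field
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println(term.toString());   // "faster solr", "solr indexing"
        }
        shingles.end();
        shingles.close();
    }
}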

Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these two fields, which halves the indexing time). We've also tried compressed=true but that is not working. 

Any guidance on how to support our application logic (without having to change the schema too much) and speed up indexing (from the current 212 days for 12MM docs) would be much appreciated. 

thank you

Peyman 


Re: Faster Solr Indexing

Posted by Peyman Faratin <pe...@robustlinks.com>.
Hi Erick, Dmitry and Mikhail

Thank you all for your time. I tried all of the suggestions below and am happy to report that indexing speed has improved. There were several confounding problems, including:

- a bank of ~20 regexes that were poorly optimized and compiled anew at each indexing step (see the sketch after this list)
- single-threaded indexing
- not using StreamingUpdateSolrServer
- excessive logging
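
The regex fix, for example, was just hoisting Pattern.compile() out of the per-document path, roughly as in this sketch (the pattern itself is a placeholder, not one of ours):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldCleaner {
    // Compiled once, up front, instead of on every document.
    private static final Pattern WHITESPACE_RUNS = Pattern.compile("\\s{2,}");

    public static String clean(String raw) {
        Matcher m = WHITESPACE_RUNS.matcher(raw);   // reuses the compiled pattern
        return m.replaceAll(" ");
    }
}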

However, the biggest bottleneck was two Lucene searches (across ~9MM docs) at the time of building the Solr document. Indexing sped up after precomputing these values offline.
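
The precomputation itself is nothing fancy - conceptually it is a lookup table built once before the run instead of two searches per document; a hypothetical sketch of the pattern (the key, value and field name are placeholders, not our actual data):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public class PrecomputedValues {
    // Phase 1 (offline, before the indexing run): run the expensive Lucene
    // searches once and store the results keyed however the documents need them.
    private final Map<String, String> valueByKey = new HashMap<String, String>();

    public void precompute(String key, String value) {
        valueByKey.put(key, value);
    }

    // Phase 2 (while building each SolrInputDocument): a map lookup replaces a
    // search across ~9MM docs.
    public void addTo(SolrInputDocument doc, String key) {
        String value = valueByKey.get(key);
        if (value != null) {
            doc.addField("precomputedField", value);   // hypothetical field name
        }
    }
}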

Thank you all for your help. 

best

Peyman 

On Mar 12, 2012, at 10:58 AM, Erick Erickson wrote:

> How have you determined that it's the solr add? By timing the call on the
> SolrJ side or by looking at the machine where Solr is running? This is the
> very first thing you have to answer. You can get a rough idea with any
> simple profiler (say Activity Monitor on a Mac, Task Manager on a Windows
> box). The point is just to see whether the indexer machine is being
> well utilized. I'd guess it's not actually.
> 
> One quick experiment would be to try using StreamingUpdateSolrServer
> (SUSS), which has the capability of having multiple threads
> fire at Solr at once. It is possible that most of your time is spent
> waiting for I/O.
> 
> Once you have that question answered, you can refine. But until you
> know which side of the wire the problem is on, you're flying blind.
> 
> To both Yandong and Peyman:
> These times are quite surprising. Running everything locally on my laptop,
> I'm indexing between 5-7K documents/second. The source is
> the Wikipedia dump.
> 
> I'm particularly surprised by the difference Yandong is seeing based
> on the various analysis chains. The first thing I'd back off is the
> MaxPermSize. 512M is huge for this parameter.
> If you're getting that kind of time differential and your CPU isn't
> pegged, you're probably swapping in which case you need
> to give the processes more memory. I'd just take the MaxPermSize
> out completely as a start.
> 
> Not sure if you've seen this page, something there might help.
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> 
> But throw a profiler at the indexer as a first step, just to see
> where the problem is, CPU or I/O.
> 
> Best
> Erick
> 
> On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin <pe...@robustlinks.com> wrote:
>> Hi
>> 
>> I am trying to index 12MM docs faster than is currently happening in Solr (using solrj). We have identified solr's add method as the bottleneck (and not commit - which is tuned ok through mergeFactor and maxRamBufferSize and jvm ram).
>> 
>> Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming that removing the HTTP overhead would speed things up with embedding) but the difference is marginal.
>> 
>> The docs being indexed are on average 20 fields long, mostly indexed but none stored. The major size contributors are two fields:
>> 
>>        - content, and
>>        - shingledContent (populated using copyField of content).
>> 
>> The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice in Solr Enterprise Search Server - p. 305, section "The Solution: Shingling").
>> 
>> Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these 2 fields resulting in halving the indexing time). We've tried compressed=true also but that is not working.
>> 
>> Any guidance on how to support our application logic (without having to change the schema too much) and speed the indexing speed (from current 212 days for 12MM docs) would be much appreciated.
>> 
>> thank you
>> 
>> Peyman
>> 


Re: Faster Solr Indexing

Posted by Erick Erickson <er...@gmail.com>.
How have you determined that it's the solr add? By timing the call on the
SolrJ side or by looking at the machine where Solr is running? This is the
very first thing you have to answer. You can get a rough idea with any
simple profiler (say Activity Monitor on a Mac, Task Manager on a Windows
box). The point is just to see whether the indexer machine is being
well utilized. I'd guess it's not actually.

One quick experiment would be to try using StreamingUpdateSolrServer
(SUSS), which has the capability of having multiple threads
fire at Solr at once. It is possible that most of your time is spent
waiting for I/O.
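
A minimal sketch of using SUSS (queue size and thread count are just example
values; SolrJ 3.x API assumed):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SussExample {
    public static void main(String[] args) throws Exception {
        // Buffers up to 100 docs and feeds Solr from 4 background threads, so the
        // client keeps building documents while Solr is busy indexing.
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        doc.addField("content", "some body text");
        server.add(doc);

        server.blockUntilFinished();   // wait for the background queue to drain
        server.commit();
    }
}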

Once you have that question answered, you can refine. But until you
know which side of the wire the problem is on, you're flying blind.

To both Yandong and Peyman:
These times are quite surprising. Running everything locally on my laptop,
I'm indexing between 5-7K documents/second. The source is
the Wikipedia dump.

I'm particularly surprised by the difference Yandong is seeing based
on the various analysis chains. The first thing I'd back off is the
MaxPermSize. 512M is huge for this parameter.
If you're getting that kind of time differential and your CPU isn't
pegged, you're probably swapping in which case you need
to give the processes more memory. I'd just take the MaxPermSize
out completely as a start.

Not sure if you've seen this page, something there might help.
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

But throw a profiler at the indexer as a first step, just to see
where the problem is, CPU or I/O.

Best
Erick

On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin <pe...@robustlinks.com> wrote:
> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr (using solrj). We have identified solr's add method as the bottleneck (and not commit - which is tuned ok through mergeFactor and maxRamBufferSize and jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming that removing the HTTP overhead would speed things up with embedding) but the difference is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but none stored. The major size contributors are two fields:
>
>        - content, and
>        - shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice in Solr Enterprise Search Server - p. 305, section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these 2 fields resulting in halving the indexing time). We've tried compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to change the schema too much) and speed the indexing speed (from current 212 days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
>

Re: Faster Solr Indexing

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Dmitry,

If you're going to talk about logging, don't forget to mention that JDK
logging is not really performant, but it is the default for 3.x. Logback is
much faster.

Peyman,
1. Shingles have performance implications; that is, they can cost a lot. Why
aren't term positions and phrase queries enough for you?
2. Some time ago there was a similar thread caused by superfluous
shingling, so it's worth double-checking that you don't produce more than
you really need (Captain Obvious speaking).
3. When I have a performance problem, the first thing I do is run a
profiler or sampler.
4. The way to look inside Lucene indexing is to enable infoStream; you'll
get a lot of info.
5. Are all of your CPU cores utilized? If they aren't, index with multiple
threads; it scales. Post several indexing requests in parallel. Be aware
that DIH doesn't work with multiple threads yet (SOLR-3011).
6. Some time ago I needed huge throughput and hit the classic
producer-consumer trap. The indexing app (a slightly hacked DIH) pulled
data from JDBC while Solr indexing sat idle, then pushed the constructed
documents to Solr synchronously and sat idle itself while Solr consumed
them. As a result, the overall time was the sum of the producing and
consuming times. So I organized an async buffer and reduced the total to
the maximum of those two times. Double check that you end up with the
maximum of producing and consuming, not their sum (see the sketch after
this list). I used perf4j to trace those times.
7. As your data is huge you can try to employ cluster magic: spread your
docs between two Solr instances and then search them in parallel (SolrShards
and SolrCloud are for you; I never did it). If you don't like searching in
parallel, you can copy index shards between boxes to have a full replica on
each box - but I haven't heard of that being supported out of the box.
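
To make points 5 and 6 concrete, here is a rough sketch of a bounded queue
with one producer and several consumer threads posting batches to Solr (the
sizes, field names and URL are placeholders; SolrJ 3.x assumed):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    private static final SolrInputDocument POISON = new SolrInputDocument();

    public static void main(String[] args) throws Exception {
        final SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        final BlockingQueue<SolrInputDocument> queue =
            new ArrayBlockingQueue<SolrInputDocument>(10000);
        final int consumers = 4;   // roughly one per CPU core

        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                    try {
                        while (true) {
                            SolrInputDocument doc = queue.take();
                            if (doc == POISON) break;
                            batch.add(doc);
                            if (batch.size() == 500) {   // post in batches of 500
                                solr.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) solr.add(batch);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        // Producer: pull rows from JDBC (or wherever) and enqueue documents;
        // put() blocks if the consumers fall behind, so neither side sits idle long.
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("content", "... row " + i + " ...");
            queue.put(doc);
        }
        for (int i = 0; i < consumers; i++) queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        solr.commit();
    }
}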

Regards

On Sun, Mar 11, 2012 at 7:27 PM, Dmitry Kan <dm...@gmail.com> wrote:

> one approach we have taken was decreasing the solr logging level for
> the posting session, described here (implemented for 1.4, but should
> be easy to port to 3.x):
>
> http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
>
> On 3/11/12, Yandong Yao <yy...@gmail.com> wrote:
> > I have similar issues by using DIH,
> > and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> > consumes most of the time when indexing 10K rows (each row is about 70K)
> >     -  DIH nextRow takes about 10 seconds totally
> >     -  If index uses whitespace tokenizer and lower case filter, then
> > addDoc() methods takes about 80 seconds
> >     -  If index uses whitespace tokenizer, lower case filer, WDF, then
> > addDoc uses about 112 seconds
> >     -  If index uses whitespace tokenizer, lower case filer, WDF and
> porter
> > stemmer, then addDoc uses about 145 seconds
> >
> > We have more than million rows totally, and am wondering whether i am
> using
> > sth. wrong or is there any way to improve the performance of addDoc()?
> >
> > Thanks very much in advance!
> >
> >
> > Following is the configure:
> > 1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
> > 2) Solr version 3.5
> > 3) solrconfig.xml  (almost copied from solr's  example/solr directory.)
> >
> >   <indexDefaults>
> >
> >     <useCompoundFile>false</useCompoundFile>
> >
> >     <mergeFactor>10</mergeFactor>
> >     <!-- Sets the amount of RAM that may be used by Lucene indexing
> >          for buffering added documents and deletions before they are
> >          flushed to the Directory.  -->
> >     <ramBufferSizeMB>64</ramBufferSizeMB>
> >     <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
> >          Lucene will flush based on whichever limit is hit first.
> >       -->
> >     <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
> >
> >     <maxFieldLength>2147483647</maxFieldLength>
> >     <writeLockTimeout>1000</writeLockTimeout>
> >     <commitLockTimeout>10000</commitLockTimeout>
> >
> >     <lockType>native</lockType>
> >   </indexDefaults>
> >
> > 2012/3/11 Peyman Faratin <pe...@robustlinks.com>
> >
> >> Hi
> >>
> >> I am trying to index 12MM docs faster than is currently happening in
> Solr
> >> (using solrj). We have identified solr's add method as the bottleneck
> (and
> >> not commit - which is tuned ok through mergeFactor and maxRamBufferSize
> >> and
> >> jvm ram).
> >>
> >> Adding 1000 docs is taking approximately 25 seconds. We are making sure
> we
> >> add and commit in batches. And we've tried both CommonsHttpSolrServer
> and
> >> EmbeddedSolrServer (assuming removing http overhead would speed things
> up
> >> with embedding) but the differences is marginal.
> >>
> >> The docs being indexed are on average 20 fields long, mostly indexed but
> >> none stored. The major size contributors are two fields:
> >>
> >>        - content, and
> >>        - shingledContent (populated using copyField of content).
> >>
> >> The length of the content field is (likely) gaussian distributed (few
> >> large docs 50-80K tokens, but majority around 2k tokens). We use
> >> shingledContent to support phrase queries and content for unigram
> queries
> >> (following the advice of Solr Enterprise search server advice - p. 305,
> >> section "The Solution: Shingling").
> >>
> >> Clearly the size of the docs is a contributor to the slow adds
> (confirmed
> >> by removing these 2 fields resulting in halving the indexing time).
> We've
> >> tried compressed=true also but that is not working.
> >>
> >> Any guidance on how to support our application logic (without having to
> >> change the schema too much) and speed the indexing speed (from current
> 212
> >> days for 12MM docs) would be much appreciated.
> >>
> >> thank you
> >>
> >> Peyman
> >>
> >>
> >
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Faster Solr Indexing

Posted by Dmitry Kan <dm...@gmail.com>.
One approach we have taken was decreasing the Solr logging level for
the posting session, described here (implemented for 1.4, but it should
be easy to port to 3.x):

http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
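
If Solr runs in the same JVM as the indexer (embedded, or a custom component
inside Solr), roughly the same effect can be had programmatically with the
default JDK logging, along these lines (the category and levels are just an
example):

import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietSolrLogging {
    // Raise the threshold for Solr's loggers while posting, then restore it.
    public static void quietDuringPosting() {
        Logger.getLogger("org.apache.solr").setLevel(Level.WARNING);   // drop per-add INFO lines
    }

    public static void restoreAfterPosting() {
        Logger.getLogger("org.apache.solr").setLevel(Level.INFO);
    }
}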

On 3/11/12, Yandong Yao <yy...@gmail.com> wrote:
> I have similar issues by using DIH,
> and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> consumes most of the time when indexing 10K rows (each row is about 70K)
>     -  DIH nextRow takes about 10 seconds totally
>     -  If index uses whitespace tokenizer and lower case filter, then
> addDoc() methods takes about 80 seconds
>     -  If index uses whitespace tokenizer, lower case filter, WDF, then
> addDoc uses about 112 seconds
>     -  If index uses whitespace tokenizer, lower case filter, WDF and porter
> stemmer, then addDoc uses about 145 seconds
>
> We have more than million rows totally, and am wondering whether i am using
> sth. wrong or is there any way to improve the performance of addDoc()?
>
> Thanks very much in advance!
>
>
> Following is the configure:
> 1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
> 2) Solr version 3.5
> 3) solrconfig.xml  (almost copied from solr's  example/solr directory.)
>
>   <indexDefaults>
>
>     <useCompoundFile>false</useCompoundFile>
>
>     <mergeFactor>10</mergeFactor>
>     <!-- Sets the amount of RAM that may be used by Lucene indexing
>          for buffering added documents and deletions before they are
>          flushed to the Directory.  -->
>     <ramBufferSizeMB>64</ramBufferSizeMB>
>     <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
>          Lucene will flush based on whichever limit is hit first.
>       -->
>     <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
>
>     <maxFieldLength>2147483647</maxFieldLength>
>     <writeLockTimeout>1000</writeLockTimeout>
>     <commitLockTimeout>10000</commitLockTimeout>
>
>     <lockType>native</lockType>
>   </indexDefaults>
>
> 2012/3/11 Peyman Faratin <pe...@robustlinks.com>
>
>> Hi
>>
>> I am trying to index 12MM docs faster than is currently happening in Solr
>> (using solrj). We have identified solr's add method as the bottleneck (and
>> not commit - which is tuned ok through mergeFactor and maxRamBufferSize
>> and
>> jvm ram).
>>
>> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
>> add and commit in batches. And we've tried both CommonsHttpSolrServer and
>> EmbeddedSolrServer (assuming removing http overhead would speed things up
>> with embedding) but the difference is marginal.
>>
>> The docs being indexed are on average 20 fields long, mostly indexed but
>> none stored. The major size contributors are two fields:
>>
>>        - content, and
>>        - shingledContent (populated using copyField of content).
>>
>> The length of the content field is (likely) gaussian distributed (few
>> large docs 50-80K tokens, but majority around 2k tokens). We use
>> shingledContent to support phrase queries and content for unigram queries
>> (following the advice in Solr Enterprise Search Server - p. 305,
>> section "The Solution: Shingling").
>>
>> Clearly the size of the docs is a contributor to the slow adds (confirmed
>> by removing these 2 fields resulting in halving the indexing time). We've
>> tried compressed=true also but that is not working.
>>
>> Any guidance on how to support our application logic (without having to
>> change the schema too much) and speed the indexing speed (from current 212
>> days for 12MM docs) would be much appreciated.
>>
>> thank you
>>
>> Peyman
>>
>>
>


-- 
Regards,

Dmitry Kan

Re: Faster Solr Indexing

Posted by Yandong Yao <yy...@gmail.com>.
I have similar issues when using DIH;
org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
consumes most of the time when indexing 10K rows (each row is about 70K):
    -  DIH nextRow takes about 10 seconds in total
    -  If the index uses a whitespace tokenizer and lowercase filter, then
addDoc() takes about 80 seconds
    -  If the index uses a whitespace tokenizer, lowercase filter and WDF,
then addDoc takes about 112 seconds
    -  If the index uses a whitespace tokenizer, lowercase filter, WDF and
Porter stemmer, then addDoc takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am
doing something wrong or whether there is any way to improve the performance
of addDoc()?

Thanks very much in advance!


Following is the configuration:
1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from Solr's example/solr directory)

  <indexDefaults>

    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!-- Sets the amount of RAM that may be used by Lucene indexing
         for buffering added documents and deletions before they are
         flushed to the Directory.  -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
         Lucene will flush based on whichever limit is hit first.
      -->
    <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->

    <maxFieldLength>2147483647</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>

    <lockType>native</lockType>
  </indexDefaults>

2012/3/11 Peyman Faratin <pe...@robustlinks.com>

> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr
> (using solrj). We have identified solr's add method as the bottleneck (and
> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
> jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
> add and commit in batches. And we've tried both CommonsHttpSolrServer and
> EmbeddedSolrServer (assuming removing http overhead would speed things up
> with embedding) but the difference is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but
> none stored. The major size contributors are two fields:
>
>        - content, and
>        - shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) gaussian distributed (few
> large docs 50-80K tokens, but majority around 2k tokens). We use
> shingledContent to support phrase queries and content for unigram queries
> (following the advice in Solr Enterprise Search Server - p. 305,
> section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed
> by removing these 2 fields resulting in halving the indexing time). We've
> tried compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to
> change the schema too much) and speed the indexing speed (from current 212
> days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
>
>