Posted to solr-user@lucene.apache.org by "Manepalli, Kalyan" <KA...@orbitz.com> on 2009/07/01 21:42:45 UTC

Tips on speeding the indexing process

Hi,
I have a very general question about indexing. My current app has about 450,000 docs, each around 2k in size, and the total indexing time is around 2 hours.
Now, because of multi-language support, the number of documents is increasing to 2.0 million, and the total indexing time exceeds 6 hours.
I wanted to know if there are any general tips for speeding up the indexing process.

Thanks,
Kalyan Manepalli


Re: Tips on speeding the indexing process

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
StreamingServer adds docs in multiple threads using the same HTTP connection.

Or

you can use the CommonsHttpSolrServer#add(Iterator<SolrInputDocument> docIterator) method.

If you are unhappy with the performance, you can use the BinaryRequestWriter:
http://wiki.apache.org/solr/Solrj#head-ddc28af4033350481a3cbb27bc1d25bffd801af0

If you still need more performance, you can call the add method from multiple threads.
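
The iterator approach above can be sketched in plain Java. This is only an illustration and deliberately has no SolrJ dependency so that it runs on its own: `LazyDocs`, `docs`, and `consume` are hypothetical names, a `Map<String, Object>` stands in for `SolrInputDocument`, and `consume` stands in for whatever pulls from the iterator (in SolrJ, the `add(Iterator<SolrInputDocument>)` call). The point is that each document is built only when it is requested, so the full set never sits in memory.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LazyDocs {

    // Builds each document only when the consumer asks for it, so the full
    // document set never has to be held in memory at once.
    // The "id" and "title" field names are assumptions for this sketch.
    static Iterator<Map<String, Object>> docs(final Iterator<String[]> rows) {
        return new Iterator<Map<String, Object>>() {
            @Override
            public boolean hasNext() {
                return rows.hasNext();
            }

            @Override
            public Map<String, Object> next() {
                String[] row = rows.next();
                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("id", row[0]);
                doc.put("title", row[1]);
                return doc;
            }
        };
    }

    // Hypothetical stand-in for the consumer: pulls documents one by one
    // and returns how many it received.
    static int consume(Iterator<Map<String, Object>> it) {
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "first"},
            new String[] {"2", "second"});
        System.out.println(consume(docs(rows.iterator())) + " docs streamed");
    }
}
```

In a real indexer the row iterator would wrap a database cursor or service call rather than an in-memory list.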



On Thu, Jul 2, 2009 at 3:20 AM, Manepalli, Kalyan <KA...@orbitz.com> wrote:
> By removing both the stopwordfilterFactory and SynonymfilterFactory, the indexing time per doc has reduced drastically to 2 to 5 ms per doc.
> Next I will try out StreamingServer. Any distinct advantages of using StreamingServer
>
> Thanks,
> Kalyan Manepalli
>
> -----Original Message-----
> From: Manepalli, Kalyan [mailto:KALYAN.MANEPALLI@orbitz.com]
> Sent: Wednesday, July 01, 2009 3:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Tips on speeding the indexing process
>
> Regarding the analysis, we do couple of things during indexing. First is use a dictionary text file for stopword filter factory. Secondly we use synonym text file for SynonymfilterFactory. I will test the indexing speed by temporarily removing both of them.
>
> Thanks,
> Kalyan Manepalli
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Wednesday, July 01, 2009 3:31 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Tips on speeding the indexing process
>
>
> Kalyan,
>
> 150/200 ms per 1 document to index seems too long, but it really depends on how much analysis is going on and size of docs.  32 threads seems too high, unless your Solr server really has 32 cores.
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: "Manepalli, Kalyan" <KA...@orbitz.com>
>> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
>> Sent: Wednesday, July 1, 2009 4:21:30 PM
>> Subject: RE: Tips on speeding the indexing process
>>
>> Here are some specs for my indexer.
>> Indexer is custom Java code that reads data from DB and other services builds
>> the solrDocument and submits it using SolrJ via Http. Indexer is doing a bit of
>> work for building the documents. The overhead is around 30 to 40ms. For every
>> document addition solr takes around 150 to 200 ms.
>> I tried the bulk addition approach with 1000 documents at time. But found out
>> that solr just take the same amount of time. I commit and optimize only once at
>> the end. I currently use 32 threads in production environment to get that speed
>> of 2hrs.
>>
>>
>> Thanks,
>> Kalyan Manepalli
>>
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>> Sent: Wednesday, July 01, 2009 3:11 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Tips on speeding the indexing process
>>
>>
>> Kalyan,
>>
>> Using SolrJ?  Use the StreamingServer, it's nice and fast.
>> Alternatively, start multiple indexing threads (match the number of Solr server
>> CPU cores) and index from there.
>> Send batches of docs, not one by one.
>> Don't commit or optimize until you are done.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: "Manepalli, Kalyan"
>> > To: "solr-user@lucene.apache.org"
>> > Sent: Wednesday, July 1, 2009 3:42:45 PM
>> > Subject: Tips on speeding the indexing process
>> >
>> > Hi,
>> > I have a very generic question regarding indexing. In my current
>> > app, I have about 450,000 docs each doc size around 2k. The total indexing
>> > time is around 2hrs.
>> > Now due to multi language support, the number of documents is increasing to
>> > 2.0 million. The total indexing time is exceeding 6 hrs.
>> > I wanted to know if there are any general tips to speedup the indexing
>> > process.
>> >
>> > Thanks,
>> > Kalyan Manepalli
>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

RE: Tips on speeding the indexing process

Posted by "Manepalli, Kalyan" <KA...@orbitz.com>.
By removing both the stopword filter factory and the SynonymFilterFactory, the indexing time per doc has dropped drastically, to 2 to 5 ms per doc.
Next I will try out StreamingServer. Are there any distinct advantages to using StreamingServer?

Thanks,
Kalyan Manepalli
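
As a rough back-of-the-envelope check of what 2 to 5 ms per doc means for the 2.0 million documents mentioned earlier in the thread, here is a small sketch. `IndexTimeEstimate` and `hours` are hypothetical names, and the model is idealized: it assumes work divides evenly across threads and ignores the client-side cost of building documents, commits, and optimizes.

```java
public class IndexTimeEstimate {

    // Idealized wall-clock hours: total per-doc time divided evenly across
    // threads. Real throughput will not scale this cleanly.
    static double hours(long docs, double msPerDoc, int threads) {
        double totalMs = docs * msPerDoc / threads;
        return totalMs / 3_600_000.0; // milliseconds per hour
    }

    public static void main(String[] args) {
        long docs = 2_000_000L;
        System.out.printf("2 ms/doc, 1 thread: %.2f h%n", hours(docs, 2.0, 1));
        System.out.printf("5 ms/doc, 1 thread: %.2f h%n", hours(docs, 5.0, 1));
    }
}
```

Under these assumptions a single thread would need roughly 1.1 to 2.8 hours for 2.0 million docs, before any gains from batching or multiple threads.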



RE: Tips on speeding the indexing process

Posted by "Manepalli, Kalyan" <KA...@orbitz.com>.
Regarding the analysis, we do a couple of things during indexing. First, we use a dictionary text file for the stopword filter factory. Second, we use a synonym text file for the SynonymFilterFactory. I will test the indexing speed by temporarily removing both of them.

Thanks,
Kalyan Manepalli



Re: Tips on speeding the indexing process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Kalyan,

150 to 200 ms to index a single document seems too long, but it really depends on how much analysis is going on and on the size of the docs. 32 threads seems too high, unless your Solr server really has 32 cores.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





RE: Tips on speeding the indexing process

Posted by "Manepalli, Kalyan" <KA...@orbitz.com>.
Here are some specs for my indexer.
The indexer is custom Java code that reads data from a DB and other services, builds the Solr document, and submits it using SolrJ over HTTP. The indexer does a bit of work to build each document; that overhead is around 30 to 40 ms. For every document added, Solr takes around 150 to 200 ms.
I tried the bulk addition approach with 1,000 documents at a time, but found that Solr takes just the same amount of time. I commit and optimize only once, at the end. I currently use 32 threads in the production environment to reach that indexing time of 2 hours.


Thanks,
Kalyan Manepalli



Re: Tips on speeding the indexing process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Kalyan,

Using SolrJ?  Use the StreamingServer, it's nice and fast.
Alternatively, start multiple indexing threads (match the number of Solr server CPU cores) and index from there.
Send batches of docs, not one by one.
Don't commit or optimize until you are done.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
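
The tips above (batches rather than single docs, a bounded number of indexing threads, one commit at the end) can be sketched as follows. This is a minimal illustration, not SolrJ code: `BatchIndexer`, `DocSink`, and `CountingSink` are hypothetical names, and `DocSink` stands in for whatever SolrJ server object the indexer actually uses. The thread count is passed in because it should match the CPU cores of the Solr server, which the client cannot detect by itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchIndexer {

    // Hypothetical stand-in for a SolrJ server object.
    interface DocSink {
        void add(List<String> batch) throws Exception;
        void commit() throws Exception;
    }

    // Demo sink that only counts what it receives; safe to call from many threads.
    static class CountingSink implements DocSink {
        final AtomicInteger docs = new AtomicInteger();
        final AtomicInteger batches = new AtomicInteger();
        final AtomicInteger commits = new AtomicInteger();

        public void add(List<String> batch) {
            docs.addAndGet(batch.size());
            batches.incrementAndGet();
        }

        public void commit() {
            commits.incrementAndGet();
        }
    }

    // Splits the input into consecutive batches of at most batchSize items.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            out.add(new ArrayList<>(items.subList(i, end)));
        }
        return out;
    }

    // Sends batches from a fixed pool of threads, waits for all of them,
    // and commits exactly once at the end.
    static void indexAll(List<String> docs, int batchSize, int threads,
                         final DocSink sink) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (final List<String> batch : partition(docs, batchSize)) {
                // Returning a value makes this a Callable, so add() may throw.
                pending.add(pool.submit(() -> {
                    sink.add(batch);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows any failure from a worker thread
            }
        } finally {
            pool.shutdown();
        }
        sink.commit();
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("doc-" + i);
        }
        CountingSink sink = new CountingSink();
        indexAll(docs, 1000, 4, sink);
        System.out.println(sink.docs.get() + " docs, " + sink.batches.get()
            + " batches, " + sink.commits.get() + " commit(s)");
    }
}
```

The batch size of 1000 and the 4 threads in `main` are arbitrary values for the demo; suitable numbers depend on document size and on the Solr server hardware.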





Re: Tips on speeding the indexing process

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Kalyan,

Tell us about your indexer.  Is it DIH-powered?  Custom Java code,  
perhaps, using SolrJ indexing over HTTP?  Is your indexer doing a lot  
of work itself to preprocess documents before sending to Solr?

	Erik

