Posted to solr-user@lucene.apache.org by William Pierce <ev...@hotmail.com> on 2009/10/11 02:47:50 UTC

Tips on speeding up indexing needed...

Folks:

I have a corpus of approx 6 M documents, each of approx 4 KB.  Currently, 
indexing is set up so that I read documents from a database and issue Solr 
post requests in batches (the batches are sized so that Tomcat's 
maxPostSize, which is set to 2 MB, is not exceeded).  This means that in 
each batch we write approx 600 documents to Solr.  What I am seeing is that 
I can push about 2500 docs per minute, or approx 40 per second.
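
To be concrete, each batch is posted to Solr's XML update handler as a 
single <add> block containing many <doc> elements.  The field names below 
are only illustrative, not my actual schema:

    <add>
      <!-- field names here are placeholders, not the real schema -->
      <doc>
        <field name="id">doc-0001</field>
        <field name="title">First document in the batch</field>
        <field name="body">Roughly 4 KB of text for the first document.</field>
      </doc>
      <doc>
        <field name="id">doc-0002</field>
        <field name="title">Second document in the batch</field>
        <field name="body">Roughly 4 KB of text for the second document.</field>
      </doc>
    </add>

and so on, up to roughly 600 documents per POST.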

I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000 docs/sec 
have been achieved.  Needless to say I am sure that performance numbers vary 
widely and are dependent on the domain, machine configurations, etc.

I am running on Windows Server 2003 with 4 GB of RAM and a dual-core Xeon.

Any tips on what I can do to speed this up?

Thanks,

Bill 


Re: Dynamically compute document scores...

Posted by Chris Hostetter <ho...@fucit.org>.
: References: <4A...@umich.edu>
:     <69...@mail.gmail.com>
:     <4A...@umich.edu>
:     <Pi...@radix.cryptio.net>
:     <4A...@umich.edu> <SN...@phx.gbl>
:      <SN...@phx.gbl>
:     <87...@mail.gmail.com>
:     <SN...@phx.gbl>
: In-Reply-To: <SN...@phx.gbl>
: Subject: Dynamically compute document scores...

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking





-Hoss


Re: Dynamically compute document scores...

Posted by Avlesh Singh <av...@gmail.com>.
Options -

   1. Can you pre-compute your "business logic" score at index time? If yes,
   this value can be stored in a field, and you can use function queries to
   combine it with the relevance score into a value you can sort on (see the
   sketch after this list).
   2. Take a look at -
   http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/Similarity.html.
   Custom similarity implementations can be hooked up into Solr easily.
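
A rough sketch of option 1, assuming the pre-computed value lives in an
indexed field named business_score (the field name and the log() function
below are placeholders, adjust them to your schema and formula).  With the
standard request handler, the boost query parser multiplies a function of
the field into the relevance score:

   q={!boost b=log(business_score)}your query text

With the dismax handler, the bf parameter adds a boost function on top of
the relevance score instead:

   qt=dismax&q=your query text&bf=log(business_score)

(business_score above is a made-up field name.)  For option 2, a custom
Similarity is plugged in globally via schema.xml, e.g.
<similarity class="com.example.MySimilarity"/> (the class name is
hypothetical).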

Cheers
Avlesh

On Tue, Oct 13, 2009 at 9:05 PM, William Pierce <ev...@hotmail.com> wrote:

> Folks:
>
> During query time, I want to dynamically compute a document score as
> follows:
>
>  a) Take the SOLR score for the document -- call it S.
>  b) Lookup the "business logic" score for this document.  Call it L.
>  c) Compute a new score T = func(S, L)
>  d) Return the documents sorted by T.
>
> I have looked at function queries but in my limited/quick review of it,  I
> could not see a quick way of doing this.
>
> Is this possible?
>
> Thanks,
>
> - Bill
>
>
>

Dynamically compute document scores...

Posted by William Pierce <ev...@hotmail.com>.
Folks:

During query time, I want to dynamically compute a document score as 
follows:

   a) Take the SOLR score for the document -- call it S.
   b) Lookup the "business logic" score for this document.  Call it L.
   c) Compute a new score T = func(S, L)
   d) Return the documents sorted by T.

I have looked at function queries, but in my quick review I could not see a 
straightforward way of doing this.

Is this possible?

Thanks,

- Bill
 


Re: Tips on speeding up indexing needed...

Posted by William Pierce <ev...@hotmail.com>.
Thanks, Lance.  I already commit at the end.  I will take a look at the data 
import handler.   Thanks again!

-- Bill

--------------------------------------------------
From: "Lance Norskog" <go...@gmail.com>
Sent: Saturday, October 10, 2009 7:58 PM
To: <so...@lucene.apache.org>
Subject: Re: Tips on speeding up indexing needed...

> A few things off the bat:
> 1) do not commit until the end.
> 2) use the DataImportHandler - it runs inside Solr and reads the
> database. This cuts out the HTTP transfer/XML xlation overheads.
> 3) examine your schema. Some of the text analyzers are quite slow.
>
> Solr tips:
> http://wiki.apache.org/solr/SolrPerformanceFactors
>
> Lucene tips:
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> And, what you don't want to hear: for jobs like this, Solr/Lucene is
> disk-bound. The Windows NTFS file system is much slower than what is
> available for Linux or the Mac, and these numbers are for those
> machines.
>
> Good luck!
>
> Lance Norskog
>
>
> On Sat, Oct 10, 2009 at 5:57 PM, William Pierce <ev...@hotmail.com> 
> wrote:
>> Oh and one more thing...For historical reasons our apps run using msft
>> technologies, so using SolrJ would be next to impossible at the present
>> time....
>>
>> Thanks in advance for your help!
>>
>> -- Bill
>>
>> --------------------------------------------------
>> From: "William Pierce" <ev...@hotmail.com>
>> Sent: Saturday, October 10, 2009 5:47 PM
>> To: <so...@lucene.apache.org>
>> Subject: Tips on speeding up indexing needed...
>>
>>> Folks:
>>>
>>> I have a corpus of approx 6 M documents each of approx 4K bytes.
>>> Currently, the way indexing is set up I read documents from a database 
>>> and
>>> issue solr post requests in batches (batches are set up so that the
>>> maxPostSize of tomcat which is set to 2MB is adhered to).  This means 
>>> that
>>> in each batch we write approx 600 or so documents to SOLR.  What I am 
>>> seeing
>>> is that I am able to push about 2500 docs per minute or approx 40 or so 
>>> per
>>> second.
>>>
>>> I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000
>>> docs/sec have been achieved.  Needless to say I am sure that performance
>>> numbers vary widely and are dependent on the domain, machine 
>>> configurations,
>>> etc.
>>>
>>> I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.
>>>
>>> Any tips on what I can do to speed this up?
>>>
>>> Thanks,
>>>
>>> Bill
>>>
>>
>
>
>
> -- 
> Lance Norskog
> goksron@gmail.com
> 

Re: Tips on speeding up indexing needed...

Posted by Lance Norskog <go...@gmail.com>.
A few things off the bat:
1) Do not commit until the end.
2) Use the DataImportHandler - it runs inside Solr and reads the
database directly. This cuts out the HTTP transfer and XML translation
overheads (a minimal config sketch follows below).
3) Examine your schema. Some of the text analyzers are quite slow.
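
For (2), a bare-bones data-config.xml would look something like the sketch
below.  The JDBC driver, connection URL, credentials, table and column
names are placeholders - swap in your own:

   <dataConfig>
     <!-- driver, url, user, password, table and columns are placeholders -->
     <dataSource type="JdbcDataSource"
                 driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                 url="jdbc:sqlserver://dbhost;databaseName=corpus"
                 user="solr" password="secret"/>
     <document>
       <entity name="doc" query="SELECT id, title, body FROM documents">
         <field column="id"    name="id"/>
         <field column="title" name="title"/>
         <field column="body"  name="body"/>
       </entity>
     </document>
   </dataConfig>

Register the handler in solrconfig.xml:

   <requestHandler name="/dataimport"
                   class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
   </requestHandler>

and kick off the import (it commits at the end by default) with:

   http://localhost:8983/solr/dataimport?command=full-import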

Solr tips:
http://wiki.apache.org/solr/SolrPerformanceFactors

Lucene tips:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

And, what you don't want to hear: for jobs like this, Solr/Lucene is
disk-bound. The Windows NTFS file system is much slower than what is
available for Linux or the Mac, and these numbers are for those
machines.

Good luck!

Lance Norskog


On Sat, Oct 10, 2009 at 5:57 PM, William Pierce <ev...@hotmail.com> wrote:
> Oh and one more thing...For historical reasons our apps run using msft
> technologies, so using SolrJ would be next to impossible at the present
> time....
>
> Thanks in advance for your help!
>
> -- Bill
>
> --------------------------------------------------
> From: "William Pierce" <ev...@hotmail.com>
> Sent: Saturday, October 10, 2009 5:47 PM
> To: <so...@lucene.apache.org>
> Subject: Tips on speeding up indexing needed...
>
>> Folks:
>>
>> I have a corpus of approx 6 M documents each of approx 4K bytes.
>> Currently, the way indexing is set up I read documents from a database and
>> issue solr post requests in batches (batches are set up so that the
>> maxPostSize of tomcat which is set to 2MB is adhered to).  This means that
>> in each batch we write approx 600 or so documents to SOLR.  What I am seeing
>> is that I am able to push about 2500 docs per minute or approx 40 or so per
>> second.
>>
>> I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000
>> docs/sec have been achieved.  Needless to say I am sure that performance
>> numbers vary widely and are dependent on the domain, machine configurations,
>> etc.
>>
>> I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.
>>
>> Any tips on what I can do to speed this up?
>>
>> Thanks,
>>
>> Bill
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Tips on speeding up indexing needed...

Posted by William Pierce <ev...@hotmail.com>.
Oh, and one more thing... For historical reasons our apps run on Microsoft 
technologies, so using SolrJ would be next to impossible at the present 
time.

Thanks in advance for your help!

-- Bill

--------------------------------------------------
From: "William Pierce" <ev...@hotmail.com>
Sent: Saturday, October 10, 2009 5:47 PM
To: <so...@lucene.apache.org>
Subject: Tips on speeding up indexing needed...

> Folks:
>
> I have a corpus of approx 6 M documents each of approx 4K bytes. 
> Currently, the way indexing is set up I read documents from a database and 
> issue solr post requests in batches (batches are set up so that the 
> maxPostSize of tomcat which is set to 2MB is adhered to).  This means that 
> in each batch we write approx 600 or so documents to SOLR.  What I am 
> seeing is that I am able to push about 2500 docs per minute or approx 40 
> or so per second.
>
> I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000 
> docs/sec have been achieved.  Needless to say I am sure that performance 
> numbers vary widely and are dependent on the domain, machine 
> configurations, etc.
>
> I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.
>
> Any tips on what I can do to speed this up?
>
> Thanks,
>
> Bill
> 

Re: Tips on speeding up indexing needed...

Posted by William Pierce <ev...@hotmail.com>.
Oops... my bad!  I didn't realize that by changing the subject line I was 
still "part" of the thread whose subject I changed!

Sorry, folks!  Thanks, Hoss, for pointing this out!

- Bill

--------------------------------------------------
From: "Chris Hostetter" <ho...@fucit.org>
Sent: Tuesday, October 13, 2009 11:07 AM
To: <so...@lucene.apache.org>
Subject: Re: Tips on speeding up indexing needed...

>
> : References: <4A...@umich.edu>
> :     <69...@mail.gmail.com>
> :     <4A...@umich.edu>
> :     <Pi...@radix.cryptio.net>
> :     <4A...@umich.edu>
> : In-Reply-To: <4A...@umich.edu>
> : Subject: Tips on speeding up indexing needed...
>
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/Thread_hijacking
>
>
>
>
>
> -Hoss
>
> 

Re: Tips on speeding up indexing needed...

Posted by Chris Hostetter <ho...@fucit.org>.
: References: <4A...@umich.edu>
:     <69...@mail.gmail.com>
:     <4A...@umich.edu>
:     <Pi...@radix.cryptio.net>
:     <4A...@umich.edu>
: In-Reply-To: <4A...@umich.edu>
: Subject: Tips on speeding up indexing needed...

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking





-Hoss