Posted to java-user@lucene.apache.org by Arjen van der Meijden <ac...@tweakers.net> on 2012/11/13 08:36:14 UTC
Improving search performance for forum search
Hi List,
I'm working on a search engine for our forum using Lucene 4. Since it's a
brand new search engine, I can change it as I see fit.
We have about 1.5M topics in the various subforums and on average 20
replies to each topic (i.e. about 33M in total).
For now, I've opted to index all replies to topics and group the best
reply-matches based on their topic-id and only keep the top X (currently
at most 5 per topic).
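As an illustration of the grouping described above (not the poster's actual code), here is a minimal plain-Java sketch that keeps at most N best-scoring replies per topic using one small min-heap per topic. The class name TopRepliesPerTopic, the Hit type and its fields are hypothetical, and no Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Keeps the best-scoring replies per topic, at most maxPerTopic each. */
final class TopRepliesPerTopic {

    /** One matching reply; the type and its field names are hypothetical. */
    static final class Hit {
        final int topicId;
        final int replyId;
        final float score;

        Hit(int topicId, int replyId, float score) {
            this.topicId = topicId;
            this.replyId = replyId;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE =
            Comparator.comparingDouble((Hit h) -> h.score);

    private final int maxPerTopic;
    // One min-heap per topic, so the weakest kept reply is always on top.
    private final Map<Integer, PriorityQueue<Hit>> groups = new HashMap<>();

    TopRepliesPerTopic(int maxPerTopic) {
        this.maxPerTopic = maxPerTopic;
    }

    /** Offers a hit to its topic's group; returns true if the hit was kept. */
    boolean collect(Hit hit) {
        PriorityQueue<Hit> heap = groups.get(hit.topicId);
        if (heap == null) {
            heap = new PriorityQueue<Hit>(BY_SCORE);
            groups.put(hit.topicId, heap);
        }
        if (heap.size() < maxPerTopic) {
            heap.add(hit);
            return true;
        }
        if (hit.score > heap.peek().score) {
            heap.poll();   // evict the weakest kept reply
            heap.add(hit);
            return true;
        }
        return false;      // the topic already holds maxPerTopic better replies
    }

    /** Returns the kept replies of one topic, best first. */
    List<Hit> best(int topicId) {
        List<Hit> result = new ArrayList<Hit>();
        PriorityQueue<Hit> heap = groups.get(topicId);
        if (heap != null) {
            result.addAll(heap);
        }
        result.sort(BY_SCORE.reversed());
        return result;
    }

    public static void main(String[] args) {
        TopRepliesPerTopic collector = new TopRepliesPerTopic(2);
        collector.collect(new Hit(7, 1, 0.4f));
        collector.collect(new Hit(7, 2, 0.9f));
        collector.collect(new Hit(7, 3, 0.6f)); // evicts reply 1
        for (Hit h : collector.best(7)) {
            System.out.println("reply " + h.replyId + " score " + h.score);
        }
    }
}
```

With a heap per topic, each collected hit costs O(log N) for N kept replies, but every matching reply still has to be visited once.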
This works quite well, but the search time is fairly long. It takes
about 330ms to return results for a single-word query that matches about
45k of the topics. The index is on an SSD in my test machine, and the
330ms is measured after repeated searches and includes several other aspects.
Obviously, with an average of 20 replies per topic, that could actually
be upwards of about 900k actual Documents being matched (I didn't look
at the actual count, but it was probably less).
According to YourKit, about 50% of the time is spent in the Scorer and
Collector. It mainly breaks down into two aspects: my custom scoring
and the fact that my code is set up to retrieve all results and do
further processing. But given the grouping on the topic-id, I doubt I
can actually escape that last part...
To enable customized scoring of the documents, I need access to
per-reply and per-topic meta-data. The per-topic meta-data is stored in
in-memory objects accessible via a HashMap based on the topic's id and
the per-reply meta-data is simply a unix timestamp stored in a binary field.
A fair amount of the time (about 20%, in Reader.document(doc,
StoredFieldVisitor)) is spent retrieving the topicId, replyId and that
timestamp from the Documents. The topicId and replyId are encoded into
a single binary field.
I already use a specialized StoredFieldVisitor that only retrieves those
two binary fields from each document.
So now the questions:
- Can I reduce the overhead of retrieving the document's fields even
further?
-- Should I use a different Codec (perhaps Pulsing or one of the "load
the fielddata in memory"-codecs) to fetch those binary fields?
-- Should I change them to other field types?
-- Should I encode all binary data in a single field, rather than two
fields (i.e. going from 9+8 bytes to 17)?
- Should I use a FieldCache to be able to retrieve the required fields
quicker (and how do you even use a FieldCache??) once they've been read?
- Is there a way to delay or skip part of the scoring, so I can skip
retrieving Documents altogether? This would probably require predicting
that a result is intended for a topic which already has 5 very good
replies, so that seems a bit far-fetched (although it would yield the
most gain).
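One way the last question could be approached, sketched in plain Java under loudly stated assumptions: the final score is the text score multiplied by a boost, the boost never exceeds a known maximum, scores are non-negative, and the topic id of a hit is available cheaply. None of this is confirmed by the thread. PrunedTopicCollector, BoostLookup and maxBoost are hypothetical names, and this is not a Lucene API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Skips the expensive boost lookup for hits that cannot enter a full topic group. */
final class PrunedTopicCollector {

    /** Stand-in for the costly part of custom scoring (hypothetical). */
    interface BoostLookup {
        float boost(int docId);
    }

    private final int maxPerTopic;
    private final float maxBoost;  // assumed upper bound on any boost value
    // Per topic a min-heap of kept scores; the weakest kept score is on top.
    private final Map<Integer, PriorityQueue<Float>> kept = new HashMap<>();
    int expensiveCalls = 0;        // counts how often the lookup actually ran

    PrunedTopicCollector(int maxPerTopic, float maxBoost) {
        this.maxPerTopic = maxPerTopic;
        this.maxBoost = maxBoost;
    }

    /** Returns true if the hit ended up among the kept scores of its topic. */
    boolean collect(int topicId, int docId, float textScore, BoostLookup lookup) {
        PriorityQueue<Float> heap = kept.get(topicId);
        if (heap == null) {
            heap = new PriorityQueue<Float>();
            kept.put(topicId, heap);
        }
        boolean full = heap.size() == maxPerTopic;
        // Even with the largest possible boost this hit cannot beat the weakest kept one.
        if (full && textScore * maxBoost <= heap.peek()) {
            return false;
        }
        expensiveCalls++;
        float score = textScore * lookup.boost(docId);
        if (!full) {
            heap.add(score);
            return true;
        }
        if (score > heap.peek()) {
            heap.poll();
            heap.add(score);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PrunedTopicCollector collector = new PrunedTopicCollector(1, 2.0f);
        collector.collect(1, 0, 1.0f, d -> 2.0f);  // kept, lookup runs
        collector.collect(1, 1, 0.5f, d -> 2.0f);  // pruned, lookup skipped
        System.out.println("expensive lookups: " + collector.expensiveCalls);
    }
}
```

The pruning only helps once a topic's group is full, and only if the bound on the boost is reasonably tight; whether that holds for a real scoring function would have to be measured.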
Any other tips?
Best regards,
Arjen van der Meijden
Tweakers.net B.V.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Improving search performance for forum search
Posted by Arjen van der Meijden <ac...@tweakers.net>.
Thanks Uwe,
I was able to rewrite my code with just a few changes to use
StraightBytesDocValuesField for the field with 9 bytes and a
PackedLongDocValuesField for the timestamps.
The 9 bytes are actually a 1 byte type-identifier, 4 bytes for the topic
id and another 4 bytes for the reply id.
If I generally only need those two 4-byte ints, would you advise me to
go with 2 IntDocValuesFields and a ByteDocValuesField? Or is my approach
with StraightBytesDocValuesField better?
In terms of uniqueness, the single type-byte will be the same for each
document in this particular index (it's used in MultiReader scenarios).
And the topicId will be the same for about 20 docs on average.
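For reference, the 9-byte layout described above (1 type byte, 4-byte topic id, 4-byte reply id) could be packed and unpacked roughly as follows. This is a plain-Java sketch with java.nio.ByteBuffer; the class name ReplyKey and the big-endian byte order are assumptions, since the thread does not say how the bytes are actually written.

```java
import java.nio.ByteBuffer;

/** Packs a type byte, a topic id and a reply id into 9 bytes and back. */
final class ReplyKey {
    static final int LENGTH = 1 + 4 + 4;

    final byte type;
    final int topicId;
    final int replyId;

    ReplyKey(byte type, int topicId, int replyId) {
        this.type = type;
        this.topicId = topicId;
        this.replyId = replyId;
    }

    /** Encodes the key; ByteBuffer writes big-endian by default. */
    byte[] encode() {
        return ByteBuffer.allocate(LENGTH)
                .put(type)
                .putInt(topicId)
                .putInt(replyId)
                .array();
    }

    /** Decodes a key that starts at the given offset of the array. */
    static ReplyKey decode(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, LENGTH);
        byte type = buf.get();
        int topicId = buf.getInt();
        int replyId = buf.getInt();
        return new ReplyKey(type, topicId, replyId);
    }

    /** Reads only the topic id, skipping the leading type byte. */
    static int topicIdAt(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes).getInt(offset + 1);
    }

    public static void main(String[] args) {
        byte[] packed = new ReplyKey((byte) 1, 42, 4711).encode();
        ReplyKey key = ReplyKey.decode(packed, 0);
        System.out.println(packed.length + " bytes, topic " + key.topicId
                + ", reply " + key.replyId);
    }
}
```

With separate integer fields this decoding step would disappear, at the cost of one lookup per field instead of one per document.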
Unfortunately, reindexing takes about 2 days. So I won't be able to do
any benchmarks until tomorrow or the day after.
Best regards,
Arjen
On 13-11-2012 9:37 Uwe Schindler wrote:
> IndexReader.document() is documented to be used only for presenting search results. Fetching the document for every possible hit while scoring is the performance killer (it is funny that your query only takes 300 ms, maybe because of the SSD).
> The correct solution is to use the new field type DocValues, which are similar to stored fields but are stored column-wise (not document-wise) and can be loaded into memory completely. In your CustomScoreQuery, you can use the DocValues (available on AtomicReader) to score your documents.
Re: Improving search performance for forum search
Posted by Arjen van der Meijden <ac...@tweakers.net>.
Hi Uwe,
I forgot to update on this - and since the thread is now a bit old, I
won't rake it up again - but there was indeed a nice performance gain
from the change to DocValues. The total search time went down from the
mentioned 330ms to about 190ms.
I actually have to look at other performance aspects, outside Lucene's
core, to optimize it any further.
So, thanks for the tip,
Best regards,
Arjen
RE: Improving search performance for forum search
Posted by Uwe Schindler <uw...@thetaphi.de>.
IndexReader.document() is documented to be used only for presenting search results. Fetching the document for every possible hit while scoring is the performance killer (it is funny that your query only takes 300 ms, maybe because of the SSD).
The correct solution is to use the new field type DocValues, which are similar to stored fields but are stored column-wise (not document-wise) and can be loaded into memory completely. In your CustomScoreQuery, you can use the DocValues (available on AtomicReader) to score your documents.
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
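To illustrate the column-wise idea from the message above without relying on any Lucene API, the following plain-Java sketch uses arrays indexed by document id as a stand-in for DocValues that are fully loaded in memory. ColumnarReplyData, its methods and the half-life recency formula are assumptions made for the example; the thread does not describe the real scoring function.

```java
/** Plain-Java stand-in for column-wise per-document values (not the Lucene API). */
final class ColumnarReplyData {
    private final int[] topicIds;     // column 1: topic id per document id
    private final long[] timestamps;  // column 2: unix timestamp in seconds per document id

    ColumnarReplyData(int[] topicIds, long[] timestamps) {
        if (topicIds.length != timestamps.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        this.topicIds = topicIds;
        this.timestamps = timestamps;
    }

    /** A per-hit lookup is a single array read, with no stored document decoded. */
    int topicId(int docId) {
        return topicIds[docId];
    }

    long timestamp(int docId) {
        return timestamps[docId];
    }

    /**
     * Hypothetical custom score: the text score is halved for every
     * halfLifeDays that the reply is older than nowSeconds.
     */
    float score(int docId, float textScore, long nowSeconds, double halfLifeDays) {
        double ageDays = Math.max(0L, nowSeconds - timestamps[docId]) / 86_400.0;
        double recency = Math.pow(0.5, ageDays / halfLifeDays);
        return (float) (textScore * recency);
    }

    public static void main(String[] args) {
        long now = 1_352_793_600L;               // some fixed point in time
        long thirtyDaysAgo = now - 30L * 86_400L;
        ColumnarReplyData data = new ColumnarReplyData(
                new int[] {7, 7}, new long[] {now, thirtyDaysAgo});
        System.out.println(data.score(0, 2.0f, now, 30.0));
        System.out.println(data.score(1, 2.0f, now, 30.0));
    }
}
```

The point of the layout is that scoring touches only the two columns it needs, rather than decoding a stored document for every hit.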