Posted to dev@lucenenet.apache.org by Robert Stewart <Ro...@epam.com> on 2011/06/10 21:35:40 UTC

[Lucene.Net] Faceting

I took a brief look at the documentation for faceting in contrib.  I have not looked at the code yet.  Do you think it can work for these requirements:

1) Needs to compute facets for fields with more than one value per document (for instance, a document may have many company names associated with it).
2) Needs to compute facets over any arbitrary query
3) Needs to be fast:
	a) I have 100 million docs distributed in about 10 indexes (10 million docs each) and use parallel distributed search and merge
	b) For some facet fields, we have over 100,000 possible unique values (for example, we have 150,000 possible company values)

In our case, we pre-cache compressed doc sets in memory for each unique facet value.  If the # of docs for a value is < 1/9 the size of the index, we use variable-byte encoding of the doc IDs; otherwise we use a BitArray.
These doc sets are then sorted in descending order by document frequency (so more frequent facets are counted first).
We open new index "snapshots" every couple of minutes and pre-load these facet doc sets into RAM each time a new snapshot is opened in the background.
We use about 32 GB of RAM when fully loaded.
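To make the encoding choice concrete, here is a minimal sketch of that density test (illustrative only; the FacetDocSet class and WriteVInt helper are made-up names, not our production code):

using System.Collections;
using System.Collections.Generic;
using System.IO;

// Sketch: choose a compressed representation for one facet value's doc set.
// Sparse values (fewer than maxDoc/9 matching docs) store delta-encoded doc IDs
// as variable-length bytes; dense values use a BitArray over all maxDoc slots.
public class FacetDocSet
{
    private byte[] vintDeltas;   // used when the value is sparse
    private BitArray bits;       // used when the value is dense

    public FacetDocSet(IList<int> sortedDocIds, int maxDoc)
    {
        if (sortedDocIds.Count < maxDoc / 9)
        {
            MemoryStream buffer = new MemoryStream();
            int previous = 0;
            foreach (int docId in sortedDocIds)
            {
                WriteVInt(buffer, docId - previous);   // gaps compress better than raw IDs
                previous = docId;
            }
            vintDeltas = buffer.ToArray();
        }
        else
        {
            bits = new BitArray(maxDoc);
            foreach (int docId in sortedDocIds)
                bits[docId] = true;
        }
    }

    private static void WriteVInt(Stream output, int value)
    {
        // standard 7-bits-per-byte variable-length encoding of a non-negative int
        while ((value & ~0x7F) != 0)
        {
            output.WriteByte((byte)((value & 0x7F) | 0x80));
            value = (int)((uint)value >> 7);
        }
        output.WriteByte((byte)value);
    }
}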

At search time we gather all the doc IDs matching the search into a BitArray.
Then we enumerate all the facet doc sets in descending order by overall doc frequency, and count how many docs in the search matched each facet.
These facet counts are passed into a priority queue to gather the top N counts (when the next facet's total count is < the full priority queue's min value, we break out of the loop; that is why we go in descending order by total doc freq).
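As a rough sketch, that counting loop looks like this (illustrative only; CachedFacet and CountIntersection are made-up helpers, and the real code counts against the compressed sets directly):

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a pre-cached facet entry (not the contrib API).
public class CachedFacet
{
    public string Value;
    public int TotalDocFreq;   // number of docs with this value over the whole index
    public BitArray Docs;      // the pre-cached doc set (BitArray form only, for brevity)
}

public static class FacetCounter
{
    // facets must arrive sorted by TotalDocFreq descending: TotalDocFreq is an upper
    // bound on any facet's count for this query, so once the queue holds n entries
    // and the next upper bound cannot beat the current minimum, we can stop.
    public static List<KeyValuePair<string, int>> TopN(
        IEnumerable<CachedFacet> facetsByDescendingDocFreq, BitArray queryBits, int n)
    {
        SortedSet<Tuple<int, string>> topN = new SortedSet<Tuple<int, string>>();
        foreach (CachedFacet facet in facetsByDescendingDocFreq)
        {
            if (topN.Count == n && facet.TotalDocFreq <= topN.Min.Item1)
                break;   // early exit: nothing later can enter the top N

            int count = CountIntersection(facet.Docs, queryBits);
            if (count == 0)
                continue;

            topN.Add(Tuple.Create(count, facet.Value));
            if (topN.Count > n)
                topN.Remove(topN.Min);   // evict the current smallest count
        }
        return topN.Reverse()
                   .Select(t => new KeyValuePair<string, int>(t.Item2, t.Item1))
                   .ToList();
    }

    private static int CountIntersection(BitArray docSet, BitArray queryBits)
    {
        int count = 0;
        for (int i = 0; i < docSet.Length && i < queryBits.Length; i++)
            if (docSet[i] && queryBits[i]) count++;
        return count;
    }
}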

We also count the # of docs per day over a date range for each facet.
We also compute facets for about 10 fields during the search, and get the top 10 facets for each.

Typically a search over 100 million docs, including facet counts and per-date counts, takes about 1300 ms.

Our current solution actually works pretty well, but it is a burden on RAM, it adds time to load new snapshots, and it puts extra pressure on the GC during busy times.

Do you think your current facet implementation can work as above, and should I try to contribute what I have (it would definitely take a little refactoring)?

Thanks,
Bob


On Jun 10, 2011, at 12:37 PM, Digy wrote:

> Have you tried to use Lucene.Net as is, before working on optimizing your
> code? There are a lot of speed improvements in it since 1.9.
> There is also a Faceted Search project in contrib.
> (https://cwiki.apache.org/confluence/display/LUCENENET/Simple+Faceted+Search)
> 
> DIGY
> 
> 
> 
> -----Original Message-----
> From: Robert Stewart [mailto:Robert_Stewart@epam.com] 
> Sent: Friday, June 10, 2011 7:14 PM
> To: <lu...@lucene.apache.org>
> Subject: [Lucene.Net] Score(collector) called for each subReader - but not
> what I need
> 
> As I previously tried to explain, I have a custom query for some pre-cached
> terms, which I load into RAM in an efficient compressed form.  I need this for
> faster searching and also for much faster faceting.  So what I do is process the
> incoming query and replace certain sub-queries with my own "CachedTermQuery"
> objects, which extend Query.  Since these are not per-segment, I only want
> scorer.Score(collector) called once, not once for each segment in my index.
> Essentially what happens now when I have a search is it collects the same
> documents N times, 1 time for each segment.  Is there any way to combine
> different Scorers/Collectors such that I can control when it enumerates the
> collection by multiple sub-readers, and when not to?  This all worked in
> previous versions of Lucene because enumerating sub-indexes (segments) was
> pushed to a lower level inside the Lucene API, and now it is elevated to a
> higher level.
> 
> Thanks
> Bob
> 
> 
> On Jun 9, 2011, at 4:33 PM, Robert Stewart wrote:
> 
>> I found the problem.  The problem is that I have a custom "query
>> optimizer", which replaces certain TermQuerys within a Boolean query
>> with a custom Query, and this query has its own weight/scorer that retrieves
>> matching documents from an in-memory cache (which is not Lucene backed).
>> But it looks like my custom hit collectors are now wrapped in a
>> HitCollectorWrapper which assumes Collect() needs to be called for multiple
>> segments - so it is adding a start offset to the doc ID that comes from my
>> custom query implementation.  I looked at the new Collector class and it
>> seems to work the same way (it assumes it needs to set the next index reader
>> with some offset).  How can I make my custom query work with the new API (so
>> that there is basically a single "segment" in RAM that my query uses, but
>> other query clauses in the same boolean query still use multiple Lucene
>> segments)?  I am sure that is not clear and will try to provide more detail
>> soon.
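(For reference, the contract being described works roughly like this; a minimal sketch assuming the 2.9-era Collector signatures, not the code from this thread:)

using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Sketch: the 2.9-style Collector is fed per-segment. IndexSearcher calls
// SetNextReader() once per segment with that segment's starting offset (docBase),
// then Collect() with doc IDs relative to that segment; index-wide IDs must be
// reconstructed as docBase + doc.
public class GlobalDocIdCollector : Collector
{
    private int docBase;
    public readonly List<int> GlobalDocIds = new List<int>();

    public override void SetScorer(Scorer scorer)
    {
        // scores not needed for this example
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.docBase = docBase;   // remember the current segment's offset
    }

    public override void Collect(int doc)
    {
        GlobalDocIds.Add(docBase + doc);   // segment-relative ID + offset = index-wide ID
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}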
>> 
>> Thanks
>> Bob
>> 
>> 
>> On Jun 9, 2011, at 1:48 PM, Digy wrote:
>> 
>>> Sorry no idea. Maybe optimizing the index with 2.9.2 can help to detect
>>> the problem.
>>> DIGY
>>> 
>>> -----Original Message-----
>>> From: Robert Stewart [mailto:Robert_Stewart@epam.com] 
>>> Sent: Thursday, June 09, 2011 8:40 PM
>>> To: <lu...@lucene.apache.org>
>>> Subject: Re: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
>>> 
>>> I tried converting the index using IndexWriter as follows:
>>> 
>>> Lucene.Net.Index.IndexWriter writer = new IndexWriter(TestIndexPath + "_2.9",
>>>     new Lucene.Net.Analysis.KeywordAnalyzer());
>>> 
>>> writer.SetMaxBufferedDocs(2);
>>> writer.SetMaxMergeDocs(1000000);
>>> writer.SetMergeFactor(2);
>>> 
>>> writer.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { new
>>> Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(TestIndexPath)) });
>>> 
>>> writer.Commit();
>>> 
>>> 
>>> That seems to work (I get what looks like a valid index directory at least).
>>> 
>>> But still when I run some tests using IndexSearcher I get the same problem
>>> (I get documents in Collect() which are larger than IndexReader.MaxDoc()).
>>> Any idea what the problem could be?
>>> 
>>> BTW, this is a problem because I look up some fields (date ranges, etc.) in
>>> some custom collectors which filter out documents, and they assume I don't get
>>> any documents larger than maxDoc.
>>> 
>>> Thanks,
>>> Bob
>>> 
>>> 
>>> On Jun 9, 2011, at 12:37 PM, Digy wrote:
>>> 
>>>> One more point: some write operations using Lucene.Net 2.9.2 (add, delete,
>>>> optimize, etc.) automatically upgrade your index to 2.9.2.
>>>> But if your index is somehow corrupted (e.g., due to some bug in 1.9), this may
>>>> result in data loss.
>>>> 
>>>> DIGY
>>>> 
>>>> -----Original Message-----
>>>> From: Robert Stewart [mailto:Robert_Stewart@epam.com] 
>>>> Sent: Thursday, June 09, 2011 7:06 PM
>>>> To: lucene-net-dev@lucene.apache.org
>>>> Subject: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
>>>> 
>>>> I have a Lucene index created with Lucene.Net 1.9.  I have a multi-segment
>>>> index (non-optimized).  When I run Lucene.Net 2.9.2 on top of that index, I
>>>> get IndexOutOfRange exceptions in my collectors.  It is giving me document
>>>> IDs that are larger than maxDoc.
>>>> 
>>>> My index contains 377831 documents, and IndexReader.MaxDoc() is returning
>>>> 377831, but I get documents from Collect() with large values (for instance
>>>> 379018).  Is an index built with Lucene.Net 1.9 compatible with 2.9.2?  If
>>>> not, is there some way I can convert it (in production we have many indexes
>>>> containing about 200 million docs, so I'd rather convert existing indexes
>>>> than rebuild them).
>>>> 
>>>> Thanks
>>>> Bob
>>>> 
>>> 
>> 
> 


RE: [Lucene.Net] Faceting

Posted by Digy <di...@gmail.com>.
And yes for 1 & 2.
DIGY



RE: [Lucene.Net] Faceting

Posted by Digy <di...@gmail.com>.
For that many unique values I would recommend taking a look at the "References
to adding faceted to Lucene.Net" section on that wiki page.
Using BitSets (as in contrib) is good for a large number of search results (say
millions) with a small # of facets (say 1000-4000).
Using the "Collector" approach is good when the result set is small but the number
of facets is relatively large, as in your case.

What to use really depends on your needs (or a hybrid approach that uses both,
like in Solr).
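
For illustration, the Collector approach boils down to one array increment per hit, given a precomputed doc-to-ordinal mapping (a minimal sketch assuming the 2.9-era Collector signatures; the ordinal array is an assumption, not part of the contrib project):

using Lucene.Net.Index;
using Lucene.Net.Search;

// Sketch: count facet values while hits are collected, instead of intersecting
// per-value doc sets afterwards. valueOrdinals[globalDocId] is assumed to give the
// facet value ordinal for that document (built once per index snapshot); a
// multi-valued field would need an int[][] instead.
public class FacetCountingCollector : Collector
{
    private readonly int[] valueOrdinals;   // docId -> facet value ordinal
    public readonly int[] Counts;           // ordinal -> hit count for this query
    private int docBase;

    public FacetCountingCollector(int[] valueOrdinals, int numValues)
    {
        this.valueOrdinals = valueOrdinals;
        this.Counts = new int[numValues];
    }

    public override void SetScorer(Scorer scorer) { }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.docBase = docBase;
    }

    public override void Collect(int doc)
    {
        Counts[valueOrdinals[docBase + doc]]++;   // one increment per collected hit
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}

The work is proportional to the number of hits times the number of facet fields,
which is why it wins when the result set is small even with 150,000 distinct values.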

DIGY

PS: every contribution is always welcome.




