Posted to java-user@lucene.apache.org by Peter Keegan <pe...@gmail.com> on 2005/10/12 00:14:49 UTC

Re: "docMap" array in SegmentMergeInfo

On a multi-cpu system, this loop to build the docMap array can cause severe
thread thrashing because of the synchronized method 'isDeleted'. I have
observed this on an index with over 1 million documents (which contains a
few thousand deleted docs) when multiple threads perform a search with
either a sort field or a range query. A stack dump shows all threads here:

waiting for monitor entry [0x6d2cf000..0x6d2cfd6c] at
org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241) -
waiting to lock <0x04e40278>

The performance worsens as the number of threads increases. The searches
may take minutes to complete.
If only a single thread issues the search, it completes fairly quickly. I
also noticed from looking at the code that the docMap doesn't appear to be
used in these cases. It seems only to be used for merging segments. If the
index is in 'search/read-only' mode, is there a way around this bottleneck?
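
For reference, the loop I mean (in SegmentMergeInfo's constructor) looks
roughly like this, paraphrased from memory rather than copied verbatim:
an int[maxDoc] built with one synchronized isDeleted() call per document:

if (reader.hasDeletions()) {
  int maxDoc = reader.maxDoc();
  docMap = new int[maxDoc];
  int j = 0;
  for (int i = 0; i < maxDoc; i++)
    docMap[i] = reader.isDeleted(i) ? -1 : j++;  // synchronized call per doc
}

That int[maxDoc] is also where the 381 MB figure for 100M documents in the
older thread quoted below comes from (4 bytes per document).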

Thanks,
Peter




On 7/13/05, Doug Cutting <cu...@apache.org> wrote:
>
> Lokesh Bajaj wrote:
> > For a very large index where we might want to delete/replace some
> documents, this would require a lot of memory (for 100 million documents,
> this would need 381 MB of memory). Is there any reason why this was
> implemented this way?
>
> In practice this has not been an issue. A single index with 100M
> documents is usually quite slow to search. When collections get this
> big folks tend to instead search multiple indexes in parallel in order
> to keep response times acceptable. Also, 381 MB of RAM is often not a
> problem for folks with 100M documents. But this is not to say that it
> could never be a problem. For folks with limited RAM and/or lots of
> small documents it could indeed be an issue.
>
> > It seems like this could be implemented as a much smaller array that
> only keeps track of the deleted document numbers and it would still be very
> efficient to calculate the new document number by using this much smaller
> array. Has this been done by anyone else or been considered for change in
> the Lucene code?
>
> Please submit a patch to the java-dev list.
>
> Doug
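
(As a sketch of the smaller-array idea quoted above: keep just the sorted
deleted doc numbers and remap with a binary search. The names here are
mine, not Lucene's.)

// sortedDeleted holds the deleted doc numbers in ascending order.
// For a surviving doc, Arrays.binarySearch returns (-insertionPoint - 1),
// and insertionPoint is exactly the number of deleted docs below oldDoc.
int newDocNumber(int oldDoc, int[] sortedDeleted) {
  int idx = java.util.Arrays.binarySearch(sortedDeleted, oldDoc);
  return oldDoc - (-idx - 1);
}

For an index with a few thousand deletions that is a few KB instead of
hundreds of MB, at the cost of a log-time lookup per document.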
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: "docMap" array in SegmentMergeInfo

Posted by Peter Keegan <pe...@gmail.com>.
> If the index is in 'search/read-only' mode, is there a way around this
> bottleneck?

The obvious answer (to answer my own question) is to optimize the index.
But the question remains: why is the docMap created and never used?

Peter

Re: Frustrated with tokenized listing terms

Posted by Chris Hostetter <ho...@fucit.org>.
: I've solved this by indexing the field twice, once as author:(searchable/not
: stored/tokenized)
: and once as author_phrased:(not searchable/stored/not tokenized).

: This works, but is it the proper way to do it?

It's the most effective/efficient method I can think of.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Frustrated with tokenized listing terms

Posted by JMA <mr...@comcast.net>.
Greetings...
Quick question, perhaps I am missing something.

I have a bunch of documents where one of the indexed fields is "author". For
example:

book1, by "John Smith"
book2, by "Steve Smith"
book3, by "John Smith"

I would like to find all distinct authors in my index. I want to support
searches for author:smith, so I tokenize the author field during indexing.
However, getTerms() then returns:

John (x2)
Smith (x3)
Steve (x1)

I would like to see:
John Smith (x2)
Steve Smith (x1)

I've solved this by indexing the field twice, once as author:(searchable/not
stored/tokenized)
and once as author_phrased:(not searchable/stored/not tokenized).

Then I query using the 'author' field while listing terms using the
'author_phrased' field.
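
In 1.4-style code the whole scheme looks something like this (a sketch
using the old Field factory methods; docFreq() supplies the counts):

// indexing: tokenized for queries, a single keyword term for listing
Document doc = new Document();
doc.add(Field.UnStored("author", "John Smith"));        // indexed, tokenized, not stored
doc.add(Field.Keyword("author_phrased", "John Smith")); // indexed as one term, stored

// listing distinct authors: walk the term dictionary for the field
TermEnum terms = reader.terms(new Term("author_phrased", ""));
try {
  do {
    Term t = terms.term();
    if (t == null || !t.field().equals("author_phrased")) break;
    System.out.println(t.text() + " (x" + terms.docFreq() + ")");
  } while (terms.next());
} finally {
  terms.close();
}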

This works, but is it the proper way to do it?

Thanks in advance,

JMA



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: "docMap" array in SegmentMergeInfo

Posted by Peter Keegan <pe...@gmail.com>.
Hi Yonik,

Your patch has corrected the thread thrashing problem on multi-cpu systems.
I've tested it with both 1.4.3 and 1.9. I haven't seen a 100X performance
gain, but that's because I'm caching QueryFilters and Lucene is caching the
sort fields.

Thanks for the fast response!

btw, I had previously tried Chris's fix (replace synchronized method with
snapshot reference), but I was getting errors trying to fetch stored fields
from the Hits. I didn't chase it down, but the errors went away when I
reverted that specific patch.

Peter


On 10/12/05, Yonik Seeley <ys...@gmail.com> wrote:
>
> Here's the patch:
> http://issues.apache.org/jira/browse/LUCENE-454
>
> It resulted in quite a performance boost indeed!
>
> On 10/12/05, Yonik Seeley <ys...@gmail.com> wrote:
> >
> > Thanks for the trace Peter, and great catch!
> > It certainly does look like avoiding the construction of the docMap for
> a
> > MultiTermEnum will be a significant optimization.
> >
> >
> -Yonik
> Now hiring -- http://tinyurl.com/7m67g
>
>

Re: "docMap" array in SegmentMergeInfo

Posted by Yonik Seeley <ys...@gmail.com>.
Here's the patch:
http://issues.apache.org/jira/browse/LUCENE-454

It resulted in quite a performance boost indeed!
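
The gist, as a sketch (see the issue for the real diff): build docMap
lazily, so the merge path pays for it on demand and MultiTermEnum never
does:

// hypothetical shape of the fix, not the exact patch
int[] getDocMap() {
  if (docMap == null && reader.hasDeletions()) {
    int maxDoc = reader.maxDoc();
    docMap = new int[maxDoc];
    int j = 0;
    for (int i = 0; i < maxDoc; i++)
      docMap[i] = reader.isDeleted(i) ? -1 : j++;
  }
  return docMap;
}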

On 10/12/05, Yonik Seeley <ys...@gmail.com> wrote:
>
> Thanks for the trace Peter, and great catch!
> It certainly does look like avoiding the construction of the docMap for a
> MultiTermEnum will be a significant optimization.
>
>
-Yonik
Now hiring -- http://tinyurl.com/7m67g

Re: "docMap" array in SegmentMergeInfo

Posted by Yonik Seeley <ys...@gmail.com>.
Thanks for the trace Peter, and great catch!
It certainly does look like avoiding the construction of the docMap for a
MultiTermEnum will be a significant optimization.


-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 10/12/05, Peter Keegan <pe...@gmail.com> wrote:
>
> Here is one stack trace:
>
> Full thread dump Java HotSpot(TM) Client VM (1.5.0_03-b07 mixed mode):
>
> "Thread-6" prio=5 tid=0x6cf7a7f0 nid=0x59e50 waiting for monitor entry
> [0x6d2cf000..0x6d2cfd6c]
> at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241)
> - waiting to lock <0x04e40278> (a org.apache.lucene.index.SegmentReader)
> at org.apache.lucene.index.SegmentMergeInfo.<init>(SegmentMergeInfo.java:43)
> at org.apache.lucene.index.MultiTermEnum.<init>(MultiReader.java:277)
> at org.apache.lucene.index.MultiReader.terms(MultiReader.java:186)
> at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:75)
> at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
> at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
> at org.apache.lucene.search.Query.weight(Query.java:84)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:158)
> at org.apache.lucene.search.Searcher.search(Searcher.java:67)
> at org.apache.lucene.search.QueryFilter.bits(QueryFilter.java:62)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:121)
> at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
> at org.apache.lucene.search.Hits.<init>(Hits.java:51)
> at org.apache.lucene.search.Searcher.search(Searcher.java:49)
>
> I've also seen it happen during sorting from:
>
> FieldSortedHitQueue.comparatorAuto ->
> FieldCacheImpl.getAuto() ->
> MultiReader.terms() ->
> MultiTermEnum.<init>() ->
> SegmentMergeInfo.<init>() ->
> SegmentReader.isDeleted()
>
> Peter
>
> On 10/11/05, Yonik Seeley <ys...@gmail.com> wrote:
> >
> > > We've been using this in production for a while and it fixed the
> > > extremely slow searches when there are deleted documents.
> >
> > Who was the caller of isDeleted()? There may be an opportunity for an
> easy
> > optimization to grab the BitVector and reuse it instead of repeatedly
> > calling isDeleted() on the IndexReader.
> >
> > -Yonik
> > Now hiring -- http://tinyurl.com/7m67g
> >
>

Re: "docMap" array in SegmentMergeInfo

Posted by Peter Keegan <pe...@gmail.com>.
Here is one stack trace:

Full thread dump Java HotSpot(TM) Client VM (1.5.0_03-b07 mixed mode):

"Thread-6" prio=5 tid=0x6cf7a7f0 nid=0x59e50 waiting for monitor entry
[0x6d2cf000..0x6d2cfd6c]
at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241)
- waiting to lock <0x04e40278> (a org.apache.lucene.index.SegmentReader)
at org.apache.lucene.index.SegmentMergeInfo.<init>(SegmentMergeInfo.java:43)
at org.apache.lucene.index.MultiTermEnum.<init>(MultiReader.java:277)
at org.apache.lucene.index.MultiReader.terms(MultiReader.java:186)
at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:75)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
at org.apache.lucene.search.Query.weight(Query.java:84)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:158)
at org.apache.lucene.search.Searcher.search(Searcher.java:67)
at org.apache.lucene.search.QueryFilter.bits(QueryFilter.java:62)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:121)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
at org.apache.lucene.search.Hits.<init>(Hits.java:51)
at org.apache.lucene.search.Searcher.search(Searcher.java:49)

I've also seen it happen during sorting from:

FieldSortedHitQueue.comparatorAuto ->
FieldCacheImpl.getAuto() ->
MultiReader.terms() ->
MultiTermEnum.<init>() ->
SegmentMergeInfo.<init>() ->
SegmentReader.isDeleted()

Peter

On 10/11/05, Yonik Seeley <ys...@gmail.com> wrote:
>
> > We've been using this in production for a while and it fixed the
> > extremely slow searches when there are deleted documents.
>
> Who was the caller of isDeleted()? There may be an opportunity for an easy
> optimization to grab the BitVector and reuse it instead of repeatedly
> calling isDeleted() on the IndexReader.
>
> -Yonik
> Now hiring -- http://tinyurl.com/7m67g
>
>

Re: "docMap" array in SegmentMergeInfo

Posted by Yonik Seeley <ys...@gmail.com>.
> We've been using this in production for a while and it fixed the
> extremely slow searches when there are deleted documents.

Who was the caller of isDeleted()? There may be an opportunity for an easy
optimization to grab the BitVector and reuse it instead of repeatedly
calling isDeleted() on the IndexReader.
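
The pattern would be something like this (getDeletedDocs() is a
hypothetical accessor for the reader's deletions BitVector, not an
existing method):

// one guarded read up front, then unsynchronized BitVector.get()
// calls in the hot loop
BitVector deleted = reader.getDeletedDocs();
int live = 0;
for (int i = 0; i < reader.maxDoc(); i++)
  if (deleted == null || !deleted.get(i))
    live++;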

-Yonik
Now hiring -- http://tinyurl.com/7m67g

Re: "docMap" array in SegmentMergeInfo

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
I'm pretty sure it doesn't solve the problem in general (it isn't a
thread-safe solution, for sure; you mentioned the memory barrier, and I'd
add compiler optimizations). If it works, it must be something
application-specific: maybe synchronization isn't really needed there, or
you just don't do anything (i.e. write operations) that would cause a
crash.

D.

Yonik Seeley wrote:
> I'm not sure that looks like a safe patch.
> Synchronization does more than help prevent races... it also introduces
> memory barriers.
> Removing synchronization from objects that can change is very tricky business
> (witness the double-checked locking antipattern).
> 
> -Yonik
> Now hiring -- http://tinyurl.com/7m67g
> 
> On 10/11/05, Chris Lamprecht <cl...@gmail.com> wrote:
> 
>>Hi Peter,
>>
>>I observed the same issue on a multiprocessor machine. I included a
>>small fix for this in the NIO patch (against the 1.9 trunk) here:
>>http://issues.apache.org/jira/browse/LUCENE-414#action_12322523
>>
>>The change amounts to the following methods in SegmentReader.java, to
>>remove the need for the synchronized() block by taking a "snapshot" of the
>>variable:
>>
>>// Removed synchronized from document(int)
>>public Document document(int n) throws IOException {
>>    if (isDeleted(n))
>>      throw new IllegalArgumentException
>>              ("attempt to access a deleted document");
>>    return fieldsReader.doc(n);
>>}
>>
>>// removed synchronized from isDeleted(int)
>>public boolean isDeleted(int n) {
>>    // avoid race condition by getting a snapshot reference
>>    final BitVector snapshot = deletedDocs;
>>    return (snapshot != null && snapshot.get(n));
>>}
>>
>>We've been using this in production for a while and it fixed the
>>extremely slow searches when there are deleted documents. Maybe it
>>could be applied to the trunk, independent of the full NIO patch.
>>
>>-chris
>>
>>
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: "docMap" array in SegmentMergeInfo

Posted by Yonik Seeley <ys...@gmail.com>.
I'm not sure that looks like a safe patch.
Synchronization does more than help prevent races... it also introduces
memory barriers.
Removing synchronization from objects that can change is very tricky business
(witness the double-checked locking antipattern).
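
If the synchronized keyword really has to go, the conventional remedy
under the new (JSR-133) memory model is a volatile field, so the snapshot
read still crosses a memory barrier. A sketch, not a recommendation:

// volatile guarantees a reader sees a fully constructed BitVector,
// never a partially written one
private volatile BitVector deletedDocs;

public boolean isDeleted(int n) {
    final BitVector snapshot = deletedDocs;  // single volatile read
    return (snapshot != null && snapshot.get(n));
}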

-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 10/11/05, Chris Lamprecht <cl...@gmail.com> wrote:
>
> Hi Peter,
>
> I observed the same issue on a multiprocessor machine. I included a
> small fix for this in the NIO patch (against the 1.9 trunk) here:
> http://issues.apache.org/jira/browse/LUCENE-414#action_12322523
>
> The change amounts to the following methods in SegmentReader.java, to
> remove the need for the synchronized() block by taking a "snapshot" of the
> variable:
>
> // Removed synchronized from document(int)
> public Document document(int n) throws IOException {
>     if (isDeleted(n))
>       throw new IllegalArgumentException
>               ("attempt to access a deleted document");
>     return fieldsReader.doc(n);
> }
>
> // removed synchronized from isDeleted(int)
> public boolean isDeleted(int n) {
>     // avoid race condition by getting a snapshot reference
>     final BitVector snapshot = deletedDocs;
>     return (snapshot != null && snapshot.get(n));
> }
>
> We've been using this in production for a while and it fixed the
> extremely slow searches when there are deleted documents. Maybe it
> could be applied to the trunk, independent of the full NIO patch.
>
> -chris
>
>

Re: "docMap" array in SegmentMergeInfo

Posted by Chris Lamprecht <cl...@gmail.com>.
Hi Peter,

I observed the same issue on a multiprocessor machine.  I included a
small fix for this in the NIO patch (against the 1.9 trunk) here: 
http://issues.apache.org/jira/browse/LUCENE-414#action_12322523

The change amounts to the following methods in SegmentReader.java, to
remove the need for the synchronized() block by taking a "snapshot" of the
variable:

// Removed synchronized from document(int)
public Document document(int n) throws IOException {
    if (isDeleted(n))
      throw new IllegalArgumentException
              ("attempt to access a deleted document");
    return fieldsReader.doc(n);
}

// removed synchronized from isDeleted(int)
public boolean isDeleted(int n) {
    // avoid race condition by getting a snapshot reference
    final BitVector snapshot = deletedDocs;
    return (snapshot != null && snapshot.get(n));
}

We've been using this in production for a while and it fixed the
extremely slow searches when there are deleted documents.  Maybe it
could be applied to the trunk, independent of the full NIO patch.

-chris

On 10/11/05, Peter Keegan <pe...@gmail.com> wrote:
> On a multi-cpu system, this loop to build the docMap array can cause severe
> thread thrashing because of the synchronized method 'isDeleted'. I have
> observed this on an index with over 1 million documents (which contains a
> few thousand deleted docs) when multiple threads perform a search with
> either a sort field or a range query. A stack dump shows all threads here:
>
> waiting for monitor entry [0x6d2cf000..0x6d2cfd6c] at
> org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241) -
> waiting to lock <0x04e40278>
>
> The performance worsens as the number of threads increases. The searches
> may take minutes to complete.
> If only a single thread issues the search, it completes fairly quickly. I
> also noticed from looking at the code that the docMap doesn't appear to be
> used in these cases. It seems only to be used for merging segments. If the
> index is in 'search/read-only' mode, is there a way around this bottleneck?
>
> Thanks,
> Peter
>
>
>
>
> On 7/13/05, Doug Cutting <cu...@apache.org> wrote:
> >
> > Lokesh Bajaj wrote:
> > > For a very large index where we might want to delete/replace some
> > documents, this would require a lot of memory (for 100 million documents,
> > this would need 381 MB of memory). Is there any reason why this was
> > implemented this way?
> >
> > In practice this has not been an issue. A single index with 100M
> > documents is usually quite slow to search. When collections get this
> > big folks tend to instead search multiple indexes in parallel in order
> > to keep response times acceptable. Also, 381 MB of RAM is often not a
> > problem for folks with 100M documents. But this is not to say that it
> > could never be a problem. For folks with limited RAM and/or lots of
> > small documents it could indeed be an issue.
> >
> > > It seems like this could be implemented as a much smaller array that
> > only keeps track of the deleted document numbers and it would still be very
> > efficient to calculate the new document number by using this much smaller
> > array. Has this been done by anyone else or been considered for change in
> > the Lucene code?
> >
> > Please submit a patch to the java-dev list.
> >
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org