Posted to java-user@lucene.apache.org by Carsten Schnober <sc...@ids-mannheim.de> on 2012/12/13 16:49:10 UTC

Boolean and SpanQuery: different results

Hi,
I'm following Grant's advice on how to combine BooleanQuery and
SpanQuery
(http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3C08C90E81-1C33-487A-9E7D-2F05B2779760@apache.org%3E).

The strategy is to perform a BooleanQuery, get the document ID set and
perform a SpanQuery restricted to those documents. The purpose is that I
need to retrieve Spans for different terms in order to extract their
respective payloads separately, but a precondition is that all of the
(possibly multiple) terms occur within the same document. My code looks
like this:

/* reader and terms are class variables and have been declared final
before */
IndexReader reader = ...;
List<String> terms = ...;

/* perform the BooleanQuery and store the document IDs in a BitSet */
BitSet bits = new BitSet(reader.maxDoc());
AllDocCollector collector = new AllDocCollector();
BooleanQuery bq = new BooleanQuery();
for (String term : terms)
  bq.add(new org.apache.lucene.search.RegexpQuery(new
Term(config.getFieldname(), term)), Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(bq, collector);
for (ScoreDoc doc : collector.getHits())
  bits.set(doc.doc);

/* get the spans for each term separately */
for (String term : terms) {
  String payloads = retrieveSpans(term, bits);
  // process and print payloads for term ...
}

String retrieveSpans(String term, BitSet bits) throws IOException {
  StringBuilder payloads = new StringBuilder();
  Map<Term, TermContext> termContexts = new HashMap<>();
  Spans spans;
  SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(new
RegexpQuery(new Term("text", term))).rewrite(reader);

  /* run the SpanQuery on each index segment in turn */
  for (AtomicReaderContext atomic : reader.leaves()) {
    spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
    while (spans.next()) {
      // extract and store payloads in 'payloads' StringBuilder
    }
  }
  return payloads.toString();
}


This construction seemed to be working fine at first, but I noticed a
disturbing behaviour: for many terms, the BooleanQuery, when fed with only
one RegexpQuery, matches a larger number of documents than the SpanQuery
constructed from the same RegexpQuery.
With the BooleanQuery containing only one RegexpQuery, the number should
be identical, while with multiple Queries added to the BooleanQuery, the
SpanQuery should return an equal number or more results. This behaviour
is reproducible reliably even after re-indexing, but not for all tokens.
Does anyone have an explanation for that?
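
The expectation stated above can be illustrated with a toy model. This
is only a sketch using plain java.util collections, not Lucene: the
postings maps (document ID -> number of occurrences) and the names
mustAll and spanCount are hypothetical. It assumes that "results" for
the SpanQuery means individual span matches, of which a document may
contain several.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: maps of (document ID -> occurrence count) stand in for postings. */
class MatchCountSketch {

    /** Documents that contain every term, as an all-MUST BooleanQuery would match. */
    static Set<Integer> mustAll(List<Map<Integer, Integer>> postingsPerTerm) {
        Set<Integer> docs = new TreeSet<>(postingsPerTerm.get(0).keySet());
        for (Map<Integer, Integer> postings : postingsPerTerm) {
            docs.retainAll(postings.keySet());
        }
        return docs;
    }

    /** Number of span matches for one term, restricted to the accepted documents. */
    static int spanCount(Map<Integer, Integer> postings, Set<Integer> acceptedDocs) {
        int count = 0;
        for (int doc : acceptedDocs) {
            count += postings.getOrDefault(doc, 0);
        }
        return count;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> himmel = new HashMap<>();
        himmel.put(1, 2);
        himmel.put(2, 1);
        himmel.put(3, 1);
        Map<Integer, Integer> sonne = new HashMap<>();
        sonne.put(2, 3);
        sonne.put(3, 1);
        sonne.put(4, 1);

        Set<Integer> accepted = mustAll(Arrays.asList(himmel, sonne));
        System.out.println(accepted.size());              // 2 (documents 2 and 3)
        System.out.println(spanCount(himmel, accepted));  // 2
        System.out.println(spanCount(sonne, accepted));   // 4
    }
}
```

In this model the span count for each term can never drop below the
number of accepted documents, which is why a smaller number on the
SpanQuery side points to a problem elsewhere.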

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Boolean and SpanQuery: different results

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

On 13.12.2012 18:00, Jack Krupansky wrote:
> Can you provide some examples of terms that don't work and the index
> token stream they fail on?
> 
> Make sure that the Analyzer you are using doesn't do any magic on the
> indexed terms - your query term is unanalyzed. Maybe multiple, but
> distinct, index terms are analyzing to the same, but unexpected term.

Apart from the answer I've already given myself, here's another note
about the issue. I've been using WhitespaceAnalyzer for both indexing
and query parsing, but apparently, the query parser lowercased by
default while WhitespaceAnalyzer did not. Therefore,
QueryParser.setLowercaseExpandedTerms(false) is necessary in order to
get the same results.
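
The effect of such a case mismatch can be sketched without Lucene. The
example below is only an illustration under stated assumptions: a plain
Set of strings stands in for a case-preserving term dictionary (as
WhitespaceAnalyzer would produce), and the class and method names are
hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustration only: a Set of strings stands in for a case-preserving term dictionary. */
class LowercaseMismatchSketch {

    /** Exact-match lookup, optionally lowercasing the query term first. */
    static boolean matches(Set<String> indexedTerms, String queryTerm, boolean lowercaseQuery) {
        String term = lowercaseQuery ? queryTerm.toLowerCase(Locale.ROOT) : queryTerm;
        return indexedTerms.contains(term);
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList("Himmel", "Sonne", "wunderbar"));

        System.out.println(matches(indexed, "Himmel", true));     // false: "himmel" was never indexed
        System.out.println(matches(indexed, "Himmel", false));    // true
        System.out.println(matches(indexed, "wunderbar", true));  // true: already lowercase
    }
}
```

Capitalized terms are lost by the lowercased lookup while terms that
are already lowercase are unaffected.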

Best,
Carsten



Re: Boolean and SpanQuery: different results

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

On 13.12.2012 18:00, Jack Krupansky wrote:
> Can you provide some examples of terms that don't work and the index
> token stream they fail on?

The index I'm testing with is German Wikipedia and I've been testing
with different (arbitrarily chosen) terms. I'm listing some results; the
first number is the number of documents matched with a BooleanQuery, the
second number is the number of documents matched with a SpanQuery:

- Knacklaut	24/19
- schönes	70/70
- zufällige	71/70
- wunderbar	24/24
- Himmel	773/753
- Sonne	1190/1152


> Make sure that the Analyzer you are using doesn't do any magic on the
> indexed terms - your query term is unanalyzed. Maybe multiple, but
> distinct, index terms are analyzing to the same, but unexpected term.

I'm using a custom Analyzer during indexing. Regarding the analyzer
applied during search, I'm not sure: as I haven't defined any specific
one, what does Lucene choose? I wasn't thinking about that because I
assumed that this should make no difference regarding the BooleanQuery
vs. SpanQuery issue.
Thanks for the hint anyway, I'll have a closer look there.
Best,
Carsten


Re: Boolean and SpanQuery: different results

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

On 17.12.2012 11:54, Carsten Schnober wrote:

> Might this have to do with the docbase? I collect the document IDs from
> the BooleanQuery through a Collector, adding the actual ID to the
> current AtomicReaderContext.docbase. In the corresponding SpanQuery, I
> pass these document IDs as a DocIdBitSet as an argument to
> SpanQuery.getSpans().

Answering my own question about the document base: indeed, I should be
collecting document IDs relative to their respective AtomicReaderContext
rather than adding the context's docbase, because the subsequent
SpanQuery is performed within an AtomicReaderContext as well.
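
The ID arithmetic behind this can be sketched with plain java.util
types. This is a minimal illustration, not Lucene code: the Segment
class is a hypothetical stand-in for an AtomicReaderContext with its
docBase, and toSegmentRelative is a hypothetical helper. It assumes the
usual relation: top-level ID = docBase + segment-relative ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Sketch only: plain java.util types stand in for Lucene segments and bit sets. */
class DocBaseSketch {

    /** Hypothetical stand-in for one segment: its docBase and its number of documents. */
    static final class Segment {
        final int docBase;
        final int maxDoc;

        Segment(int docBase, int maxDoc) {
            this.docBase = docBase;
            this.maxDoc = maxDoc;
        }
    }

    /** Splits top-level hit IDs into one segment-relative BitSet per segment. */
    static List<BitSet> toSegmentRelative(int[] globalHits, List<Segment> segments) {
        List<BitSet> perSegment = new ArrayList<>();
        for (Segment seg : segments) {
            BitSet local = new BitSet(seg.maxDoc);
            for (int globalId : globalHits) {
                int localId = globalId - seg.docBase;
                // Keep only the hits that actually fall inside this segment.
                if (localId >= 0 && localId < seg.maxDoc) {
                    local.set(localId);
                }
            }
            perSegment.add(local);
        }
        return perSegment;
    }

    public static void main(String[] args) {
        // Two segments of 100 documents each: top-level IDs 0-99 and 100-199.
        List<Segment> segments = Arrays.asList(new Segment(0, 100), new Segment(100, 100));
        int[] globalHits = {5, 42, 105, 150};

        List<BitSet> filters = toSegmentRelative(globalHits, segments);
        System.out.println(filters.get(0)); // {5, 42}
        System.out.println(filters.get(1)); // {5, 50}
    }
}
```

Each per-segment bit set is then the one to use as a filter for that
segment only. Reusing a single bit set of top-level IDs for every
segment only lines up for the first segment, whose docBase is 0.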
Best,
Carsten



Re: Boolean and SpanQuery: different results

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

On 13.12.2012 18:00, Jack Krupansky wrote:
> Can you provide some examples of terms that don't work and the index
> token stream they fail on?
> 
> Make sure that the Analyzer you are using doesn't do any magic on the
> indexed terms - your query term is unanalyzed. Maybe multiple, but
> distinct, index terms are analyzing to the same, but unexpected term.

I've done some further analysis and it turns out that, for some reason,
the SpanQuery described previously returns matches only for the first
entry (of 18) in the list returned by reader.leaves().

As stated in my first post in this thread, my code builds a SpanQuery
for each AtomicReaderContext in a list retrieved through
MultiReader.leaves(). That SpanQuery is equivalent to a BooleanQuery with
TermQueries for exactly the same terms performed with
IndexSearcher.search() on that same MultiReader.

The document ids of the hits found through the SpanQuery correspond to
the ones returned by the BooleanQuery for the same term. However, the
documents returned by the BooleanQuery that do not lie within the first
AtomicReaderContext are not found by the SpanQuery.

Might this have to do with the docbase? I collect the document IDs from
the BooleanQuery through a Collector, adding the actual ID to the
current AtomicReaderContext.docbase. In the corresponding SpanQuery, I
pass these document IDs as a DocIdBitSet as an argument to
SpanQuery.getSpans().

Thanks!
Carsten



Re: Boolean and SpanQuery: different results

Posted by Jack Krupansky <ja...@basetechnology.com>.

Can you provide some examples of terms that don't work and the index token 
stream they fail on?

Make sure that the Analyzer you are using doesn't do any magic on the 
indexed terms - your query term is unanalyzed. Maybe multiple, but distinct, 
index terms are analyzing to the same, but unexpected term.

-- Jack Krupansky
