Posted to java-user@lucene.apache.org by Jason Eacott <je...@hardlight.com.au> on 2010/03/31 07:06:58 UTC

fastest way to gather simple terms that match documents?

Hi all,
    After I've run a query I need to know which terms matched each
result document (ie doc termfrequency>0).
The only way I know to do this is by calling explain() on each document,
which the documentation describes as almost the equivalent of running a
new query for each call, so I'm keen to avoid that option if possible.
Is there a quick way to discover this information? All I need is a
list of terms (simple strings would be fine).
I don't care how many were found or what position or anything else.
just which ones matched.

thoughts?

Thanks
Jason.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: fastest way to gather simple terms that match documents?

Posted by Chris Hostetter <ho...@fucit.org>.
: Alternatively index your documents with term vectors for the field enabled:
	...
: And then use IndexReader.getTermFreqVector() with the matching doc ID:

Uwe: this is an area i'm not particularly strong on, so i'm curious: do 
you expect that the TermFreqVector approach would be faster than the 
TermDocs approach for the type of use case where docs tend to be "large"
but the list of specific terms you are interested in testing for is 
"small" (ie: just the terms used in the original query)?

I ask because off the top of my head i'm not seeing how it 
would really give you much of a time savings in return -- instead of 
seeking over the handful of terms you care about, the TermVectorMapper 
will have to scan over every Term in each of the documents.  (writing your 
own TermVectorMapper that ignores the terms you don't care about will 
help, but that still doesn't sound any faster)

: > :     After I've run a query I need to know which terms matched each
: > : result document (ie doc termfrequency>0).
: > 	...
: > : I don't care how many were found or what position or anything else.
: > : just which ones matched.



-Hoss




RE: fastest way to gather simple terms that match documents?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Alternatively index your documents with term vectors for the field enabled:

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/document/Field.TermVector.html

And then use IndexReader.getTermFreqVector() with the matching doc ID:

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/index/IndexReader.html#getTermFreqVector(int, java.lang.String)
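
As a rough illustration of the lookup step, here is a minimal stand-alone sketch in plain Java. It is not Lucene code: it assumes you already have one document's terms as a sorted String[] (which is what TermFreqVector.getTerms() is documented to return), and the class name TermVectorLookupSketch, the method termsInDocument and the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TermVectorLookupSketch {

    /** Returns the query terms that are present in a document's sorted term array. */
    static List<String> termsInDocument(String[] sortedDocTerms, List<String> queryTerms) {
        List<String> matched = new ArrayList<>();
        for (String term : queryTerms) {
            // binarySearch returns a non-negative index only when the term is present
            if (Arrays.binarySearch(sortedDocTerms, term) >= 0) {
                matched.add(term);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Stand-in for the sorted terms of one document's term vector.
        String[] docTerms = {"apache", "index", "lucene", "vector"};
        List<String> queryTerms = Arrays.asList("lucene", "query", "vector");
        System.out.println(termsInDocument(docTerms, queryTerms)); // prints [lucene, vector]
    }
}
```

The binary search relies on the term array being sorted. With an unsorted array a HashSet of the document's terms would be the safer choice.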

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
> Sent: Monday, April 05, 2010 8:24 PM
> To: java-user@lucene.apache.org
> Subject: Re: fastest way to gather simple terms that match documents?
> 
> 
> :     After I've run a query I need to know which terms matched each
> : result document (ie doc termfrequency>0).
> 	...
> : I don't care how many were found or what position or anything else.
> : just which ones matched.
> 
> if all you care about is simple "which terms does it have" you can take
> your list of terms, and your list of docids, sort both lists and then
> use termDocs to loop over the terms and over the docs.  (the sorting is
> key for performance, because it allows you to always skip forward, w/o
> needing to restart the termDocs)
> 
> something like...
> 
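> TermDocs iter = indexReader.termDocs();
> for (Term t : myTerms) {
>   iter.seek(t);
>   int current = -1;  // doc the iterator is positioned on; -1 = not yet advanced
>   for (int docid : myDocs) {
>     // skipTo() always moves at least one entry forward, so only call it
>     // when the iterator is still behind the doc we want
>     if (current < docid) {
>       if (!iter.skipTo(docid)) break;  // no more docs for this term
>       current = iter.doc();
>     }
>     if (current == docid) {
>       doSomethingWith(t, docid);
>     }
>   }
> }
> iter.close();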
> 
> 
> 
> -Hoss
> 
> 





Re: fastest way to gather simple terms that match documents?

Posted by Chris Hostetter <ho...@fucit.org>.
:     After I've run a query I need to know which terms matched each
: result document (ie doc termfrequency>0).
	...
: I don't care how many were found or what position or anything else.
: just which ones matched.

if all you care about is simple "which terms does it have" you can take 
your list of terms, and your list of docids, sort both lists and then use 
termDocs to loop over the terms and over the docs.  (the sorting is key 
for performance, because it allows you to always skip forward, w/o 
needing to restart the termDocs)

something like...

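TermDocs iter = indexReader.termDocs();
for (Term t : myTerms) {
  iter.seek(t);
  int current = -1;  // doc the iterator is positioned on; -1 = not yet advanced
  for (int docid : myDocs) {
    // skipTo() always moves at least one entry forward, so only call it
    // when the iterator is still behind the doc we want
    if (current < docid) {
      if (!iter.skipTo(docid)) break;  // no more docs for this term
      current = iter.doc();
    }
    if (current == docid) {
      doSomethingWith(t, docid);
    }
  }
}
iter.close();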
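
To see why the sorting matters without needing a Lucene index at hand, here is a minimal stand-alone sketch of the same loop in plain Java. It is not Lucene code: the postings for each term are modeled as a sorted int[] of doc IDs, and the class name MatchedTermsSketch, the method matchedTerms and the sample data are all made up for illustration. The cursor into each postings list only ever moves forward, which is the property the sorted lists buy you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MatchedTermsSketch {

    /**
     * For each doc ID in sortedDocs, collects the terms whose postings contain it.
     * postings maps term -> sorted doc IDs (a stand-in for what TermDocs iterates).
     */
    static Map<Integer, List<String>> matchedTerms(Map<String, int[]> postings, int[] sortedDocs) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (int doc : sortedDocs) {
            result.put(doc, new ArrayList<>());
        }
        // Copying into a TreeMap makes the loop visit the terms in sorted order.
        for (Map.Entry<String, int[]> e : new TreeMap<>(postings).entrySet()) {
            int[] docsWithTerm = e.getValue();
            int pos = 0; // cursor into the postings; never moves backwards
            for (int doc : sortedDocs) {
                while (pos < docsWithTerm.length && docsWithTerm[pos] < doc) {
                    pos++;
                }
                if (pos == docsWithTerm.length) {
                    break; // postings exhausted; no later doc can match this term
                }
                if (docsWithTerm[pos] == doc) {
                    result.get(doc).add(e.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample postings and search hits.
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("lucene", new int[] {1, 4, 7, 9});
        postings.put("query", new int[] {4, 5, 9});
        postings.put("vector", new int[] {2, 7});
        int[] hits = {4, 7, 8}; // doc IDs returned by the search, sorted
        // prints {4=[lucene, query], 7=[lucene, vector], 8=[]}
        System.out.println(matchedTerms(postings, hits));
    }
}
```

If the doc IDs were not sorted, the cursor would have to be reset to the start of the postings whenever a smaller doc ID came up, which corresponds to restarting the termDocs.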



-Hoss

