Posted to java-user@lucene.apache.org by Sean O'Connor <se...@oconeco.com> on 2005/08/23 23:42:24 UTC

Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?

Hello,
    I am trying to work through term positions and how to get them from 
a collection of hits. Does indexing a field with 
Field.TermVector.WITH_POSITIONS_OFFSETS save the start/end offsets of 
each term in the source text? (I _think_ it does).

     If so, where would I start to make that information accessible in 
a "result set"? I believe it would mean extending a query, a scorer, a 
hit, and/or a weight object. I want to process ALL hits, so I think I 
will need to implement a HitCollector.

    As an example of what I want, if I were looking for the offset 
position of "brown" in a properly indexed field containing "the lazy 
brown fox", I would like to get:
start==10
end==15 (assuming my counting is right)
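
As a quick check on the counting, here is a small plain-Java sketch (no Lucene involved; the class and method names are made up for illustration). It assumes zero-based character offsets with an exclusive end, which is the convention Lucene's token offsets follow, so the numbers come out one lower than the one-based count above.

```java
public class OffsetDemo {
    /**
     * Returns {start, end} for the first occurrence of term in text,
     * using zero-based offsets with an exclusive end, or null if the
     * term does not occur.
     */
    public static int[] offsetsOf(String text, String term) {
        int start = text.indexOf(term);
        if (start < 0) {
            return null;
        }
        return new int[] { start, start + term.length() };
    }

    public static void main(String[] args) {
        int[] o = offsetsOf("the lazy brown fox", "brown");
        // prints "start==9 end==14"
        System.out.println("start==" + o[0] + " end==" + o[1]);
    }
}
```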

    Based on Paul Elschot's previous response to a similar question I 
had (which I am still working on), I _think_ I need to extend something 
like the ExactPhraseScorer. While debugging with my IDE (Eclipse) I can 
see that the weight object in the scorer contains a reference to the 
query. The query contains the fields:
    Vector positions (just has ints of term positions in phrase?)
    Vector terms (vector of Term, just field name and field contents?)

    The weight also seems to have an array of TermPositions, which have 
SegmentTermPositions. I thought this was what I wanted, but I don't see 
the proper start/end fields, or anything which seems to be on the right 
track.

    Can anyone point me in the right direction?
Thanks,

Sean



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?

Posted by Mikko Noromaa <mi...@noromaa.fi>.
Hi,

I create my index with TermVector.WITH_POSITIONS_OFFSETS and get the term
offsets with the following code. The code fills two arrays: HFIDs (unique
IDs stored with the documents) and Highlights (strings with offset info, in
the form "start-end,start-end,...").

Please note that this code requires the patch from bug #36292
(http://issues.apache.org/bugzilla/show_bug.cgi?id=36292) to work with
prefix queries.


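import java.util.HashSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// analyzer, querystr and reader (an open IndexReader) are defined elsewhere.
QueryParser parser = new QueryParser("text", analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query query = parser.parse(querystr);

IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(query);

//System.out.println("query.getClass()=\"" + query.getClass().toString() + "\"");

// Collect the terms the query actually matches on.
HashSet QueryTerms = new HashSet();
query.extractTerms(QueryTerms);

int NumHits = hits.length();
int[] HFIDs = new int[NumHits];
String[] Highlights = new String[NumHits];

for (int i = 0; i < NumHits; i++) {
	Document doc = hits.doc(i);
	HFIDs[i] = Integer.parseInt(doc.get("hfid"));
	String HiliString = "";

	// The cast only works if the field was indexed with positions/offsets.
	TermPositionVector tpv =
		(TermPositionVector) reader.getTermFreqVector(hits.id(i), "text");
	if (tpv == null) {
		// No term vector stored for this document's "text" field.
		Highlights[i] = HiliString;
		continue;
	}

	String[] DocTerms = tpv.getTerms();
	for (int t = 0; t < DocTerms.length; t++) {
		if (QueryTerms.contains(new Term("text", DocTerms[t]))) {
			// One offset entry per occurrence of the term in the document.
			TermVectorOffsetInfo[] offsets = tpv.getOffsets(t);

			for (int tp = 0; tp < offsets.length; tp++) {
				// Compare by length, not with != "", which tests object identity.
				HiliString += (HiliString.length() > 0 ? "," : "")
					+ offsets[tp].getStartOffset() + "-"
					+ offsets[tp].getEndOffset();
			}
		}
	}

	Highlights[i] = HiliString;
}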
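
To show how the Highlights strings might be consumed afterwards, here is a minimal plain-Java sketch that turns a "start-end,start-end" string back into the matched substrings. It is not part of the code above: the class and method names are hypothetical, and it assumes you still have the original field text at hand (for example from a stored field or the source file) and that the offsets use an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightDecoder {
    /**
     * Parses a string such as "9-14,15-18" and returns the substrings
     * of text that those offset ranges cover. An empty input string
     * yields an empty list.
     */
    public static List<String> decode(String text, String hiliString) {
        List<String> matches = new ArrayList<String>();
        if (hiliString.length() == 0) {
            return matches;
        }
        String[] ranges = hiliString.split(",");
        for (int i = 0; i < ranges.length; i++) {
            String[] parts = ranges[i].split("-");
            int start = Integer.parseInt(parts[0]);
            int end = Integer.parseInt(parts[1]);
            matches.add(text.substring(start, end));
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints "[brown, fox]"
        System.out.println(decode("the lazy brown fox", "9-14,15-18"));
    }
}
```

The same ranges could just as well be used to wrap the matches in markup instead of extracting them; the extraction here only keeps the sketch short.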


--

Mikko Noromaa (mikko@noromaa.fi) - tel. +358 40 7348034
Noromaa Solutions - see http://www.nm-sol.com/
 


