Posted to java-user@lucene.apache.org by Sean O'Connor <se...@oconeco.com> on 2005/08/23 23:42:24 UTC
Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?
Hello,
I am trying to work through term positions and how to get them from
a collection of hits. Does indexing a field with
Field.TermVector.WITH_POSITIONS_OFFSETS save the start/end offsets of
each term in the source text? (I _think_ it does).
If so, where would I start in trying to make that information
accessible in a "result set"? I believe it would mean extending a query, a
scorer, a hit, and/or a weight object. I want to process ALL
hits, so I think I will need to implement a HitCollector.
As an example of what I want, if I were looking for the offset
position of "brown" in a properly indexed field containing "the lazy
brown fox", I would like to get:
start==10
end==15 (assuming my counting is right)
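For checking the counting by hand, here is a minimal sketch in plain Java (no Lucene involved; the class and method names are made up for illustration). It assumes offsets are zero-based character positions with an exclusive end, which is how Lucene's standard tokenizers report them. Under that assumption "brown" comes out as 9 and 14; the 10 and 15 above correspond to counting from 1.

```java
public class OffsetCounting {
    /** Returns {start, end} of the first occurrence of term in text, or null if absent. */
    static int[] offsetsOf(String text, String term) {
        int start = text.indexOf(term);   // zero-based index of the first character
        if (start < 0) {
            return null;
        }
        // The end offset is exclusive: one past the last character of the term.
        return new int[] { start, start + term.length() };
    }

    public static void main(String[] args) {
        int[] o = offsetsOf("the lazy brown fox", "brown");
        System.out.println("start==" + o[0] + " end==" + o[1]); // prints "start==9 end==14"
    }
}
```

With this convention, text.substring(start, end) returns the term itself.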
Based on Paul Elschot's previous response to a similar question I
had (which I am still working on), I _think_ I need to extend something
like the ExactPhraseScorer. While debugging with my IDE (Eclipse) I can
see that the weight object in the scorer contains a reference to the
query. The query contains the fields:
Vector positions (just has ints of term positions in phrase?)
Vector terms (vector of Term, just field name and field contents?)
The weight also seems to have an array of TermPositions, which have
SegmentTermPositions. I thought this was what I wanted, but I don't see
the proper start/end fields, or anything which seems to be on the right
track.
Can anyone point me in the right direction?
Thanks,
Sean
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?
Posted by Mikko Noromaa <mi...@noromaa.fi>.
Hi,
I create my index with TermVector.WITH_POSITIONS_OFFSETS and get the term
offsets with the following code. The code fills two arrays: HFIDs (unique
IDs stored with the documents) and Highlights (strings with offset info).
Please note that this code requires the patch from bug #36292
(http://issues.apache.org/bugzilla/show_bug.cgi?id=36292) to work with
prefix queries.
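import java.util.HashSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// reader (IndexReader), analyzer and querystr are assumed to be set up elsewhere.
QueryParser parser = new QueryParser("text", analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query query = parser.parse(querystr);

IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(query);
//System.out.println("query.getClass()=\"" + query.getClass().toString() + "\"");

// Collect the terms of the query so they can be matched against each document.
HashSet QueryTerms = new HashSet();
query.extractTerms(QueryTerms);

int NumHits = hits.length();
int[] HFIDs = new int[NumHits];
String[] Highlights = new String[NumHits];

for (int i = 0; i < NumHits; i++) {
    Document doc = hits.doc(i);
    HFIDs[i] = Integer.parseInt(doc.get("hfid"));
    String HiliString = "";

    // The cast only works if the field was indexed with positions and offsets.
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(hits.id(i), "text");
    String[] DocTerms = tpv.getTerms();
    int[] freq = tpv.getTermFrequencies();

    for (int t = 0; t < freq.length; t++) {
        if (QueryTerms.contains(new Term("text", DocTerms[t]))) {
            TermVectorOffsetInfo[] offsets = tpv.getOffsets(t);
            int[] pos = tpv.getTermPositions(t);
            for (int tp = 0; tp < pos.length; tp++) {
                // Append "start-end", comma-separated. Test the length rather
                // than using != "", which compares references in Java.
                HiliString += (HiliString.length() > 0 ? "," : "")
                    + offsets[tp].getStartOffset() + "-" + offsets[tp].getEndOffset();
            }
        }
    }
    Highlights[i] = HiliString;
}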
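To show how one of the Highlights strings might be consumed afterwards, here is a small sketch in plain Java. It is only an illustration: the class name, the method names and the [ ] markers are made up and are not part of Lucene or of the code above. It assumes the "start-end,start-end" format built above, with zero-based start offsets, exclusive end offsets and non-overlapping spans. Because the string is built term by term, the spans may not be in document order, so the sketch sorts them first.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightSpans {
    /** Parses "9-14,0-3" into {start, end} pairs sorted by start; an empty string gives no pairs. */
    static List<int[]> parse(String spec) {
        List<int[]> spans = new ArrayList<int[]>();
        if (spec == null || spec.length() == 0) {
            return spans;
        }
        for (String part : spec.split(",")) {
            String[] se = part.split("-");
            spans.add(new int[] { Integer.parseInt(se[0]), Integer.parseInt(se[1]) });
        }
        // Offsets were collected term by term, so put them in document order.
        spans.sort((a, b) -> Integer.compare(a[0], b[0]));
        return spans;
    }

    /** Wraps each span of text in [ ] markers; assumes the spans do not overlap. */
    static String mark(String text, String spec) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] s : parse(spec)) {
            out.append(text, last, s[0])            // text before the match
               .append('[').append(text, s[0], s[1]).append(']');
            last = s[1];
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("the lazy brown fox", "9-14")); // prints "the lazy [brown] fox"
    }
}
```

The offsets refer to the original field text, so the text passed to mark must be the same string that was indexed.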
--
Mikko Noromaa (mikko@noromaa.fi) - tel. +358 40 7348034
Noromaa Solutions - see http://www.nm-sol.com/