Posted to dev@lucene.apache.org by "Andrew Duffy (JIRA)" <ji...@apache.org> on 2008/12/31 04:28:44 UTC
[jira] Updated: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
[ https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Duffy updated LUCENE-579:
--------------------------------
Attachment: offsets.patch
I've attached a patch to 2.4's DocInverterPerField.java that fixes this. The problem is in line 160, which stores the starting offset for the next value of the same field:
- If a field value has delimiter text after its last token, that text is ignored.
- If there is no extra delimiter text after the last token, the offsets are off by +1 for the tokens in the second value, +2 for the third value and so on.
- The problem is hidden when there is exactly one delimiter character after each value.
The patch removes the +1 completely and uses the length of the string to adjust offsets for fields with a string value. Fields with reader or token stream values can't easily be fixed, but they can't be stored either, so they are much less likely to affect anyone.
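To make the three cases above concrete, here is a minimal standalone sketch of the two ways of advancing the per-field base offset. It is not Lucene code: the class OffsetSketch, the method tokenOffsets, and the letters-only tokenizer are hypothetical stand-ins, and the "+1 rule" is modelled as "end of the last token plus one", which is an assumption based on the description above rather than a copy of DocInverterPerField.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the offset bookkeeping, not Lucene code.
class OffsetSketch {

    /**
     * Returns {start, end} offsets for every run of letters in each value,
     * shifted by a running base offset for the field.
     *
     * useValueLength == false: base advances by (end of last token + 1),
     *                          the assumed pre-patch behaviour.
     * useValueLength == true:  base advances by the full length of the value,
     *                          as the patch description suggests.
     */
    static List<int[]> tokenOffsets(String[] values, boolean useValueLength) {
        List<int[]> offsets = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0;
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++; // skip delimiter characters
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                offsets.add(new int[] {base + start, base + i});
                lastEnd = i;
            }
            base += useValueLength ? value.length() : lastEnd + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[][] cases = {{"one", "two"}, {"one.", "two"}, {"one..", "two"}};
        for (String[] values : cases) {
            int[] before = tokenOffsets(values, false).get(1);
            int[] after = tokenOffsets(values, true).get(1);
            System.out.println(String.join("|", values)
                + "  +1 rule: " + before[0] + "-" + before[1]
                + "  value length: " + after[0] + "-" + after[1]);
        }
    }
}
```

In this sketch the second token of {"one", "two"} lands at 4-7 under the +1 rule but at 3-6 when the value length is used, which is where "two" sits in the plain concatenation "onetwo". For {"one.", "two"} both rules agree (the hidden case with exactly one delimiter character), and for {"one..", "two"} only the value-length rule accounts for the second trailing delimiter.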
> TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
> ----------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-579
> URL: https://issues.apache.org/jira/browse/LUCENE-579
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9
> Reporter: Keiron McCammon
> Attachments: offsets.patch
>
>
> If you add multiple values for a field with term vector positions and offsets enabled, and one of the values ends with a non-term character, then the offsets for the terms from subsequent values are wrong. For example (note the '.' in the first value):
> IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
> Document doc = new Document();
> doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
> doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
> IndexSearcher searcher = new IndexSearcher(directory);
> Hits hits = searcher.search(new MatchAllDocsQuery());
> Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
>     new QueryScorer(new TermQuery(new Term("", "camera")), searcher.getIndexReader(), ""));
> for (int i = 0; i < hits.length(); ++i) {
>     TermPositionVector v = (TermPositionVector) searcher.getIndexReader().getTermFreqVector(
>         hits.id(i), "");
>     StringBuilder str = new StringBuilder();
>     for (String s : hits.doc(i).getValues("")) {
>         str.append(s);
>         str.append(" ");
>     }
>
>     System.out.println(str);
>     TokenStream tokenStream = TokenSources.getTokenStream(v, false);
>     String[] terms = v.getTerms();
>     int[] freq = v.getTermFrequencies();
>     for (int j = 0; j < terms.length; ++j) {
>         System.out.print(terms[j] + ":" + freq[j] + ":");
>
>         int[] pos = v.getTermPositions(j);
>
>         System.out.print(Arrays.toString(pos));
>
>         TermVectorOffsetInfo[] offset = v.getOffsets(j);
>         for (int k = 0; k < offset.length; ++k) {
>
>             System.out.print(":");
>             System.out.print(str.substring(offset[k].getStartOffset(), offset[k].getEndOffset()));
>         }
>
>         System.out.println();
>     }
> }
> searcher.close();
> If I run the above I get:
> one:1:[0]:one
> two:1:[1]: tw
> Note that the offsets for the second term are off by 1.
> It seems that the length of the stored value is not taken into account when calculating the offsets for the next value of the field.
> I noticed this problem when using the highlight contrib package, which can make use of term vectors for highlighting. I also noticed that the offset for the second string starts at the end of the previous value plus 1, so when concatenating the field's values to pass to the highlighter I had to append a ' ' character after each string. This is quite useful, but it is not documented anywhere.
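The concatenation workaround from the report can be sketched on its own, without Lucene. The class JoinSketch and the method joinValues are hypothetical names, and the offsets 4 and 7 are an assumption inferred from the reported output for the second term rather than values read from an index.

```java
// Illustrative only: mirrors the reporter's string-joining loop, not a Lucene API.
class JoinSketch {

    /** Concatenates field values, appending one space after each value. */
    static String joinValues(String... values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            sb.append(value).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed offsets of the second term, inferred from the reported output.
        int start = 4;
        int end = 7;
        // No trailing delimiter on the first value: the offsets line up.
        System.out.println("[" + joinValues("one", "two").substring(start, end) + "]");
        // Trailing '.' on the first value: the same offsets are one character early.
        System.out.println("[" + joinValues("one.", "two").substring(start, end) + "]");
    }
}
```

With the values "one" and "two" the joined text is "one two " and the offsets select "two". With "one." and "two" the joined text is "one. two " and the same offsets select " tw", matching the output shown in the report.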