Posted to java-user@lucene.apache.org by andi rexha <a_...@hotmail.com> on 2013/04/02 10:56:35 UTC

Term vector Lucene 4.2

Hi, 
I have a problem extracting a term vector's attributes (i.e. the positions of the terms). Here is what I did:

    Terms termVector = indexReader.getTermVector(docId, fieldName);
    TermsEnum reuse = null;
    TermsEnum iterator = termVector.iterator(reuse);
    PositionIncrementAttribute attribute = iterator.attributes().getAttribute(PositionIncrementAttribute.class);
    BytesRef ref = null;
    while ((ref = iterator.next()) != null) {
        System.out.println(attribute.getPositionIncrement());
    }


I get an exception:
This AttributeSource does not have the attribute 'org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute'.

From the API, I couldn't find any other way to extract this information. Could you please help me?
Thanks in advance
Best regards, Andi


P.S. I opened the index with Luke, and the term vector's attributes are stored in the index.

Re: Term vector Lucene 4.2

Posted by Adrien Grand <jp...@gmail.com>.
On Tue, Apr 2, 2013 at 12:45 PM, andi rexha <a_...@hotmail.com> wrote:
> Hi Adrien,
> Thank you very much for the reply.
>
> I have two other small questions about this:
> 1) Is "final int freq = docsAndPositions.freq();" the same as "iterator.totalTermFreq()"? In my tests it returns the same result, and from the documentation it seems that the result should be the same.

In the case of term vectors, the docs enum contains only one document, so
iterator.totalTermFreq() and docsAndPositions.freq() are equal. This
would not be true if you consumed AtomicReader.fields() (since the
docs enums would then contain several documents).
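
To illustrate why the two counts coincide for a term vector, here is a toy model in plain Java. It is not Lucene code: the class and method names are made up for this sketch. The postings of one term are modelled as a map from document ID to the positions of the term in that document, so the per-document frequency is the number of positions in one entry and the total term frequency is the sum over all entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: hypothetical names, no Lucene dependency.
class TermFreqSketch {

    // Occurrences of the term in a single document (plays the role of freq()).
    static int freq(Map<Integer, int[]> postings, int docId) {
        int[] positions = postings.get(docId);
        return positions == null ? 0 : positions.length;
    }

    // Occurrences of the term across all documents (plays the role of totalTermFreq()).
    static long totalTermFreq(Map<Integer, int[]> postings) {
        long total = 0;
        for (int[] positions : postings.values()) {
            total += positions.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // A term vector behaves like a one-document index: only doc 0 exists.
        Map<Integer, int[]> termVector = new LinkedHashMap<>();
        termVector.put(0, new int[] {2, 7, 11});
        System.out.println(freq(termVector, 0) + " " + totalTermFreq(termVector));

        // A regular index: the same term occurs in several documents.
        Map<Integer, int[]> index = new LinkedHashMap<>();
        index.put(0, new int[] {2, 7, 11});
        index.put(1, new int[] {4});
        System.out.println(freq(index, 0) + " " + totalTermFreq(index));
    }
}
```

Running main prints "3 3" for the single-document case and "3 4" for the two-document case, which is the difference described above.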

> 2) How do I get the offsets for the term vector? I have tried to iterate over the docsAndPositions but I get the following exception:
>
> Exception in thread "main" java.lang.IllegalStateException: Position enum not started

You need to call startOffset and endOffset just after nextPosition:

        for (int i = 0; i < freq; ++i) {
          final int position = docsAndPositions.nextPosition();
          // 'position' is the i-th position of the current term in the document
          final int startOffset = docsAndPositions.startOffset();
          final int endOffset = docsAndPositions.endOffset();
          // start and end offsets of the i-th occurrence of the current term
        }

Beware that these methods will return -1 if you did not index offsets
(see FieldType.setIndexOptions and
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS).
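
The calling order described above can be sketched with a small stand-in class in plain Java. This is a hypothetical toy, not Lucene's implementation: PositionCursorSketch and its constructor are invented for illustration. It only mimics the contract from this thread: offsets are readable after nextPosition() has been called, reading them earlier raises an IllegalStateException, and -1 is returned when no offsets were recorded.

```java
// Toy stand-in for a positions enum: hypothetical class, not part of Lucene.
class PositionCursorSketch {
    private final int[] positions;
    private final int[] startOffsets; // null when offsets were not recorded
    private final int[] endOffsets;   // null when offsets were not recorded
    private int current = -1;         // -1 means nextPosition() has not been called yet

    PositionCursorSketch(int[] positions, int[] startOffsets, int[] endOffsets) {
        this.positions = positions;
        this.startOffsets = startOffsets;
        this.endOffsets = endOffsets;
    }

    // Number of occurrences of the term in the document.
    int freq() {
        return positions.length;
    }

    // Advances to the next occurrence and returns its position.
    int nextPosition() {
        if (current + 1 >= positions.length) {
            throw new IllegalStateException("No more positions");
        }
        current++;
        return positions[current];
    }

    // Start offset of the current occurrence, or -1 if offsets were not recorded.
    int startOffset() {
        ensureStarted();
        return startOffsets == null ? -1 : startOffsets[current];
    }

    // End offset of the current occurrence, or -1 if offsets were not recorded.
    int endOffset() {
        ensureStarted();
        return endOffsets == null ? -1 : endOffsets[current];
    }

    private void ensureStarted() {
        if (current < 0) {
            throw new IllegalStateException("Position enum not started");
        }
    }

    public static void main(String[] args) {
        PositionCursorSketch cursor = new PositionCursorSketch(
                new int[] {0, 3}, new int[] {0, 16}, new int[] {5, 21});
        for (int i = 0; i < cursor.freq(); i++) {
            int position = cursor.nextPosition(); // must come first
            System.out.println(position + " [" + cursor.startOffset() + ", " + cursor.endOffset() + ")");
        }
    }
}
```

Constructing the cursor with null offset arrays models a field indexed without offsets: both offset methods then return -1 for every occurrence.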

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Term vector Lucene 4.2

Posted by andi rexha <a_...@hotmail.com>.
Hi Adrien, 
Thank you very much for the reply. 

I have two other small questions about this:
1) Is "final int freq = docsAndPositions.freq();" the same as "iterator.totalTermFreq()"? In my tests it returns the same result, and from the documentation it seems that the result should be the same.

2) How do I get the offsets for the term vector? I have tried to iterate over the docsAndPositions but I get the following exception: 

Exception in thread "main" java.lang.IllegalStateException: Position enum not started


Thanks in advance,
Andi


> From: jpountz@gmail.com
> Date: Tue, 2 Apr 2013 12:05:12 +0200
> Subject: Re: Term vector Lucene 4.2
> To: java-user@lucene.apache.org
> 
> Hi Andi,
> 
> Here is how you could retrieve positions from your document:
> 
>     Terms termVector = indexReader.getTermVector(docId, fieldName);
>     TermsEnum reuse = null;
>     TermsEnum iterator = termVector.iterator(reuse);
>     BytesRef ref = null;
>     DocsAndPositionsEnum docsAndPositions = null;
>     while ((ref = iterator.next()) != null) {
>         docsAndPositions = iterator.docsAndPositions(null, docsAndPositions);
>         // beware that docsAndPositions will be null if you didn't index positions
>         // you need to call nextDoc() to have the enum positioned
>         if (docsAndPositions.nextDoc() != 0) {
>           throw new AssertionError();
>         }
>         // number of occurrences of the term
>         final int freq = docsAndPositions.freq();
>         for (int i = 0; i < freq; ++i) {
>           final int position = docsAndPositions.nextPosition();
>           // 'position' is the i-th position of the current term in the document
>         }
>     }
> 
> I hope this helps.
> 
> -- 
> Adrien

Re: Term vector Lucene 4.2

Posted by Adrien Grand <jp...@gmail.com>.
Hi Andi,

Here is how you could retrieve positions from your document:

    Terms termVector = indexReader.getTermVector(docId, fieldName);
    TermsEnum reuse = null;
    TermsEnum iterator = termVector.iterator(reuse);
    BytesRef ref = null;
    DocsAndPositionsEnum docsAndPositions = null;
    while ((ref = iterator.next()) != null) {
        docsAndPositions = iterator.docsAndPositions(null, docsAndPositions);
        // beware that docsAndPositions will be null if you didn't index positions
        // you need to call nextDoc() to have the enum positioned
        if (docsAndPositions.nextDoc() != 0) {
          throw new AssertionError();
        }
        // number of occurrences of the term
        final int freq = docsAndPositions.freq();
        for (int i = 0; i < freq; ++i) {
          final int position = docsAndPositions.nextPosition();
          // 'position' is the i-th position of the current term in the document
        }
    }

I hope this helps.

-- 
Adrien
