Posted to solr-user@lucene.apache.org by MitchK <mi...@web.de> on 2010/03/12 18:46:40 UTC

How to get Term Positions?

Hello community,

Is it possible to get TermPositions without a TermVector? If yes, how can I
do so?
If such a feature is not yet implemented in Solr, it would be interesting
to know how to do so with Lucene.

I don't want to use a TermVector, because I have read somewhere that Lucene
stores the TermPosition in its inverted index, but I don't know how to
retrieve it.

Any suggestions?

Thank you!
- Mitch
-- 
View this message in context: http://old.nabble.com/How-to-get-Term-Positions--tp27880551p27880551.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to get Term Positions?

Posted by gagan_goku <ga...@gmail.com>.
I tried the same thing today and am happy to share a snippet with you:


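    // Written against the Lucene/Solr 4.x API; "req" is the current SolrQueryRequest.
    // Imports needed at the top of the file:
    //   java.util.ArrayList, java.util.List
    //   org.apache.lucene.index: AtomicReader, AtomicReaderContext,
    //     DocsAndPositionsEnum, Fields, Terms, TermsEnum
    //   org.apache.lucene.search.DocIdSetIterator
    //   org.apache.lucene.util: Bits, BytesRef, CharsRef
    //   org.apache.solr.schema.SchemaField

    SchemaField field = req.getSchema().getFields().get("field_name");

    AtomicReader ar = req.getSearcher().getAtomicReader();
    AtomicReaderContext context = ar.getContext();
    final Fields fields = context.reader().fields();
    final Terms terms = fields.terms("field_name");
    final TermsEnum termsEnum = terms.iterator(null);

    // Accept every document, deleted ones included; sized to the reader's doc count.
    Bits acceptDocs = new Bits.MatchAllBits(ar.maxDoc());

    BytesRef bytes;
    while ((bytes = termsEnum.next()) != null) {
      // Convert the indexed term bytes into a readable string.
      CharsRef chars = new CharsRef();
      field.getType().indexedToReadable(bytes, chars);

      final DocsAndPositionsEnum postings = termsEnum.docsAndPositions(
          acceptDocs, null, DocsAndPositionsEnum.FLAG_PAYLOADS);
      if (postings == null) {
        continue;  // positions were not indexed for this field
      }

      List<Integer> docIds = new ArrayList<Integer>();
      int docId;
      while ((docId = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        docIds.add(docId);
        // freq() is the number of occurrences in this document;
        // nextPosition() may be called at most that many times.
        int freq = postings.freq();
        for (int i = 0; i < freq; i++) {
          int nextPosition = postings.nextPosition();
          System.out.println(docId + "\t" + chars.toString() + "\t" + nextPosition);
        }
      }
    }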




--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-Term-Positions-tp477519p4052608.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to get Term Positions?

Posted by Grant Ingersoll <gs...@apache.org>.
If you're going to spend time working with TermPositions, you should spend that time on SpanQuery instead, as that is what I understand you to be asking about.  AIUI, you want to be able to get at the positions in the document where the query matched.  This is exactly what SpanQuery and its derivatives do.  They do all the work that you would otherwise have to do yourself with the TermPositions class.
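
To illustrate what "the positions in the document where the query matched" means, here is a minimal plain-Java sketch. It is not Lucene's SpanQuery implementation and uses no Lucene API; the class ToyNearMatcher, the method near, and the maxGap parameter are hypothetical names, and the position lists are made-up sample values for two terms in a single document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration of a span-style match: reports where two terms occur close together. */
public class ToyNearMatcher {

    /**
     * Returns {start, end} position pairs for every combination of an
     * a-position and a b-position that are at most maxGap positions apart.
     */
    public static List<int[]> near(List<Integer> a, List<Integer> b, int maxGap) {
        List<int[]> spans = new ArrayList<>();
        for (int pa : a) {
            for (int pb : b) {
                if (Math.abs(pa - pb) <= maxGap) {
                    spans.add(new int[] {Math.min(pa, pb), Math.max(pa, pb)});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Integer> foo = Arrays.asList(5, 20, 50);  // sample positions of one term
        List<Integer> bar = Arrays.asList(4, 21);      // sample positions of another term
        for (int[] span : near(foo, bar, 1)) {
            System.out.println("match from position " + span[0] + " to " + span[1]);
        }
    }
}
```

With these sample values the sketch reports two matches, one covering positions 4 to 5 and one covering positions 20 to 21. A real SpanQuery additionally reports which document each match belongs to.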


On Mar 12, 2010, at 6:38 PM, MitchK wrote:

> 
> Thank you both for your responses.
> 
> However, I am not familiar enough with Solr, or even with Lucene. So, at
> the moment, I have no real idea of what payloads are (I can't even translate
> this word...).
> The manual says something about "metadata" - but it does not say what kind
> of metadata is meant.
> Given my limited experience with Lucene and Solr, I think it would be a
> better idea to first read something like "Lucene in Action" before trying
> to customize (or contribute to) Lucene/Solr at such a level.
> 
> Is anyone currently working on the tickets? It seems that nobody has had
> time to do so.
> 
> Last but not least, I want to add something productive to my question:
> the document that may describe the solution to my problem...
> 
> http://lucene.apache.org/java/3_0_1/fileformats.html#Positions
> 
> To quote:
> PositionDelta is, if payloads are disabled for the term's field, the
> difference between the position of the current occurrence in the document
> and the previous occurrence (or zero, if this is the first occurrence in
> this document). 
> 
> If I could retrieve this information, that would be great - even if it
> forces me to iterate over the document where the term occurs. Lucene's
> TermPositions class seems to be a good place to start, doesn't it? What do
> you think? [1]
> 
> Integrating Lucene-based work into Solr is another question... I think one
> needs a map that shows which class is usually called by which class, but
> that is really another topic :).
> 
> [1]
> http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/store/instantiated/InstantiatedTermPositions.html
> 
> Thank you!
> - Mitch
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: How to get Term Positions?

Posted by MitchK <mi...@web.de>.
Thank you both for your responses.

However, I am not familiar enough with Solr, or even with Lucene. So, at
the moment, I have no real idea of what payloads are (I can't even translate
this word...).
The manual says something about "metadata" - but it does not say what kind
of metadata is meant.
Given my limited experience with Lucene and Solr, I think it would be a
better idea to first read something like "Lucene in Action" before trying
to customize (or contribute to) Lucene/Solr at such a level.

Is anyone currently working on the tickets? It seems that nobody has had
time to do so.

Last but not least, I want to add something productive to my question:
the document that may describe the solution to my problem...

http://lucene.apache.org/java/3_0_1/fileformats.html#Positions

To quote:
PositionDelta is, if payloads are disabled for the term's field, the
difference between the position of the current occurrence in the document
and the previous occurrence (or zero, if this is the first occurrence in
this document). 
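
A small worked example of the arithmetic in that definition, as a plain-Java sketch: a term at positions 5, 20 and 50 is stored as the deltas 5, 15 and 30, and a running sum restores the absolute positions. This does not read Lucene's index files and uses no Lucene API; the class PositionDeltas and its encode and decode methods are hypothetical names, and the positions are sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy delta encoding of term positions within one document. */
public class PositionDeltas {

    /** Ascending absolute positions to deltas; the first delta is measured from zero. */
    public static List<Integer> encode(List<Integer> positions) {
        List<Integer> deltas = new ArrayList<>();
        int previous = 0;
        for (int position : positions) {
            deltas.add(position - previous);
            previous = position;
        }
        return deltas;
    }

    /** Deltas back to absolute positions by keeping a running sum. */
    public static List<Integer> decode(List<Integer> deltas) {
        List<Integer> positions = new ArrayList<>();
        int current = 0;
        for (int delta : deltas) {
            current += delta;
            positions.add(current);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Integer> deltas = encode(Arrays.asList(5, 20, 50));
        System.out.println(deltas);          // prints [5, 15, 30]
        System.out.println(decode(deltas));  // prints [5, 20, 50]
    }
}
```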

If I could retrieve this information, that would be great - even if it
forces me to iterate over the document where the term occurs. Lucene's
TermPositions class seems to be a good place to start, doesn't it? What do
you think? [1]

Integrating Lucene-based work into Solr is another question... I think one
needs a map that shows which class is usually called by which class, but
that is really another topic :).

[1]
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/store/instantiated/InstantiatedTermPositions.html

Thank you!
- Mitch
-- 
View this message in context: http://old.nabble.com/How-to-get-Term-Positions--tp27880551p27884130.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to get Term Positions?

Posted by Tommy Chheng <to...@gmail.com>.
I also contributed a small reward for whoever completes this task:
http://nextsprocket.com/tasks/solr-1337-spans-and-payloads-query-support-asf-jira

Feel free to contribute to the reward if you need this done too!

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/12/10 2:14 PM, Grant Ingersoll wrote:
> OK, you need https://issues.apache.org/jira/browse/SOLR-1337 and its related item: https://issues.apache.org/jira/browse/SOLR-1485
>
> Unfortunately, not implemented yet.
>
> On Mar 12, 2010, at 1:36 PM, MitchK wrote:
>
>> Thanks for your response, Grant!
>>
>> Imagine you are searching for "foo".
>> "foo" occurs in doc1 three times. It is the 5th, the 20th, and the 50th
>> term in the document.
>> I want to get these positions.
>>
>> Of course, if I am searching for "foo bar" and "bar" occurs at the 4th and
>> the 21st position, I also want to know that. I am not sure, but I think this
>> is what you mean by "per doc basis", right?
>>
>> Since I need the TermPosition at scoring time, TermVectorComponent does not
>> seem to be an option in this case. Or do you think it could be one if I
>> create such vectors at index time?
>>
>

Re: How to get Term Positions?

Posted by Grant Ingersoll <gs...@apache.org>.
OK, you need https://issues.apache.org/jira/browse/SOLR-1337 and its related item: https://issues.apache.org/jira/browse/SOLR-1485

Unfortunately, not implemented yet.

On Mar 12, 2010, at 1:36 PM, MitchK wrote:

> 
> Thanks for your response, Grant!
> 
> Imagine you are searching for "foo".
> "foo" occurs in doc1 three times. It is the 5th, the 20th, and the 50th
> term in the document.
> I want to get these positions.
> 
> Of course, if I am searching for "foo bar" and "bar" occurs at the 4th and
> the 21st position, I also want to know that. I am not sure, but I think this
> is what you mean by "per doc basis", right?
> 
> Since I need the TermPosition at scoring time, TermVectorComponent does not
> seem to be an option in this case. Or do you think it could be one if I
> create such vectors at index time?
> 



Re: How to get Term Positions?

Posted by MitchK <mi...@web.de>.
Thanks for your response, Grant!

Imagine you are searching for "foo".
"foo" occurs in doc1 three times. It is the 5th, the 20th, and the 50th
term in the document.
I want to get these positions.

Of course, if I am searching for "foo bar" and "bar" occurs at the 4th and
the 21st position, I also want to know that. I am not sure, but I think this
is what you mean by "per doc basis", right?
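
As a minimal sketch of the lookup described above, here is a toy positional inverted index in plain Java that maps term -> document -> positions. It is not how Lucene or Solr expose positions and uses no Lucene API; the class ToyPositionalIndex, its add and positions methods, and the sample document text are hypothetical, and positions are counted from 1 to match the "5th term" wording.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> docId -> 1-based token positions. */
public class ToyPositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits the text on whitespace and records the position of every token. */
    public void add(int docId, String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            postings.computeIfAbsent(tokens[i], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(i + 1);
        }
    }

    /** Positions of the term in the given document, or an empty list if it does not occur. */
    public List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) {
            return Collections.emptyList();
        }
        List<Integer> found = byDoc.get(docId);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    public static void main(String[] args) {
        ToyPositionalIndex index = new ToyPositionalIndex();
        index.add(1, "a b c bar foo d foo");
        System.out.println("foo: " + index.positions("foo", 1));  // prints foo: [5, 7]
        System.out.println("bar: " + index.positions("bar", 1));  // prints bar: [4]
    }
}
```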

Since I need the TermPosition at scoring time, TermVectorComponent does not
seem to be an option in this case. Or do you think it could be one if I
create such vectors at index time?
-- 
View this message in context: http://old.nabble.com/How-to-get-Term-Positions--tp27880551p27881024.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to get Term Positions?

Posted by Grant Ingersoll <gs...@apache.org>.
Which TermPositions do you want?  On a per-document basis, or just in general for the index?  I think the TermsComponent could add the latter.  The former is only possible via TermVectors.

-Grant

On Mar 12, 2010, at 12:46 PM, MitchK wrote:

> 
> Hello community,
> 
> Is it possible to get TermPositions without a TermVector? If yes, how can I
> do so?
> If such a feature is not yet implemented in Solr, it would be interesting
> to know how to do so with Lucene.
> 
> I don't want to use a TermVector, because I have read somewhere that Lucene
> stores the TermPosition in its inverted index, but I don't know how to
> retrieve it.
> 
> Any suggestions?
> 
> Thank you!
> - Mitch
>