Posted to java-user@lucene.apache.org by Rafael Rossini <ra...@gmail.com> on 2007/09/04 23:01:39 UTC

Extract terms not by reader, but by documents

Hi all,

    In some custom highlighting, I often write code like this:

       Set<Term> matchedTerms = new HashSet<Term>();
       query.rewrite(reader).extractTerms(matchedTerms);

    With this code, the Term set gets populated with the terms the query
matches across your whole index. Is it possible to do this with a document
instead of the reader? Something like
query.rewrite(documentId).extractTerms(matchedTerms) ?

[]s
     Rossini

Re: Extract terms not by reader, but by documents

Posted by Karl Wettin <ka...@gmail.com>.

Rafael, are you looking for IndexReader.getTermFreqVector?

--
karl

On 5 Sep 2007, at 16:48, Rafael Rossini wrote:

> Thanks for the reply, Grant. Let me try to explain exactly what I'd like
> to do. Take these two docs:
>
> Doc1: "Microsoft is a nice software company, and Xbox seems to be a nice
> product too."
> Doc2: "Nintendo and Sony have been in the game industry for a long time,
> but now, Microsoft is trying to enter with Xbox"
>
> Now, if I have a query like "(Nintendo AND Sony AND Microsoft) OR (Xbox)"
> and perform a query.rewrite(reader).extractTerms(set), my set is going to
> have all the terms (Nintendo, Sony, Microsoft, Xbox), right? But when I
> iterate over the docs, I want to do something like
> query.rewrite(doc).extractTerms(set) [I know this method does not exist;
> it is just an example of the functionality].
>
> So, for Doc1, my set would be populated with (Xbox) only, and for Doc2
> the set would be populated with (Nintendo, Sony, Microsoft, Xbox).
> Is this possible? Is it clear now what I'm trying to achieve?
>
> []s
>      Rossini
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Extract terms not by reader, but by documents

Posted by Mike Klaas <mi...@gmail.com>.

On 6-Sep-07, at 11:48 AM, Grant Ingersoll wrote:

>
> On Sep 6, 2007, at 1:32 PM, Rafael Rossini wrote:
>
>> Karl, I'm aware of IndexReader.getTermFreqVector; with it I can get all
>> the terms of a document, but I want all the terms of a document that
>> matched a query.
>>
>> Grant,
>>
>>> Yes, I think I understand.  You want to know what terms from your
>>> query matched in a given document.
>>
>> Yep, that's what I want. In the contrib/highlighter package,
>> query.rewrite.extractTerms is used to match the terms in the
>> documents. So
>>
>
> Can you point to where this is taking place in the contrib/highlighter?
> I am not a highlighter expert, but I would like to see it.  The only
> place I see a call to extractTerms is in QueryTermExtractor.java

The document text is re-analyzed, or the token stream is reconstructed
from the term vector.  That is all there is to it.
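
As a rough illustration of the re-analysis route, here is a minimal sketch
in plain Java. It does not use Lucene at all: the class name, the regex
tokenizer, and the <B> tags are assumptions made for this example, standing
in for whatever analyzer and formatter you actually use. It re-tokenizes
the stored text and marks every token whose lowercased form is in the set
of query terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReanalyzeHighlightSketch {

    // Crude stand-in for an analyzer: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");

    /** Re-tokenizes text and wraps tokens found in queryTerms in <B>...</B>. */
    public static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        int last = 0;
        while (m.find()) {
            // Copy the untouched text between the previous token and this one.
            out.append(text, last, m.start());
            String token = m.group();
            if (queryTerms.contains(token.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> queryTerms = new HashSet<>(Arrays.asList("xbox"));
        System.out.println(highlight("Microsoft is trying to enter with Xbox", queryTerms));
        // prints: Microsoft is trying to enter with <B>Xbox</B>
    }
}
```

Note that the set of terms is flat here, so this kind of pass knows nothing
about the boolean structure of the query.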

-Mike

Re: Extract terms not by reader, but by documents

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 6, 2007, at 1:32 PM, Rafael Rossini wrote:

> Karl, I'm aware of IndexReader.getTermFreqVector; with it I can get all
> the terms of a document, but I want all the terms of a document that
> matched a query.
>
> Grant,
>
>> Yes, I think I understand.  You want to know what terms from your
>> query matched in a given document.
>
> Yep, that's what I want. In the contrib/highlighter package,
> query.rewrite.extractTerms is used to match the terms in the
> documents. So
>

Can you point to where this is taking place in the contrib/highlighter?
I am not a highlighter expert, but I would like to see it.  The only
place I see a call to extractTerms is in QueryTermExtractor.java

-Grant




Re: Extract terms not by reader, but by documents

Posted by Rafael Rossini <ra...@gmail.com>.

Karl, I'm aware of IndexReader.getTermFreqVector; with it I can get all the
terms of a document, but I want all the terms of a document that matched a
query.

Grant,

>Yes, I think I understand.  You want to know what terms from your
>query matched in a given document.

Yep, that's what I want. In the contrib/highlighter package,
query.rewrite.extractTerms is used to match the terms in the documents, so
all the highlighting "magic" is done with those terms. The problem remains,
because the "logic" of the query is lost when I get those terms. I'll take a
look at the TermVectorMapper you mentioned, but if anyone has an idea of how
to achieve this, please post it here.

[]s
    Rossini

Re: Extract terms not by reader, but by documents

Posted by Grant Ingersoll <gr...@gmail.com>.

On Sep 5, 2007, at 10:48 AM, Rafael Rossini wrote:

> Thanks for the reply, Grant. Let me try to explain exactly what I'd like
> to do. Take these two docs:
>
> Doc1: "Microsoft is a nice software company, and Xbox seems to be a nice
> product too."
> Doc2: "Nintendo and Sony have been in the game industry for a long time,
> but now, Microsoft is trying to enter with Xbox"
>
> Now, if I have a query like "(Nintendo AND Sony AND Microsoft) OR (Xbox)"
> and perform a query.rewrite(reader).extractTerms(set), my set is going to
> have all the terms (Nintendo, Sony, Microsoft, Xbox), right? But when I
> iterate over the docs, I want to do something like
> query.rewrite(doc).extractTerms(set) [I know this method does not exist;
> it is just an example of the functionality].
>
> So, for Doc1, my set would be populated with (Xbox) only, and for Doc2
> the set would be populated with (Nintendo, Sony, Microsoft, Xbox).
> Is this possible? Is it clear now what I'm trying to achieve?
>

Yes, I think I understand.  You want to know what terms from your  
query matched in a given document.

I still don't think query.rewrite.extractTerms is meaningful in the  
context of a document.  extractTerms really is just used to figure  
out the full set of terms that are going to be in a query, especially  
in light of a WildcardQuery or something similar.  Creating a new  
method to do what you want named extractTerms would only confuse what  
the other version of the method does.

If I were doing this, I would probably do a combination of a  
SpanQuery to get the positions of things that match in the document  
(i.e. Xbox, Sony, etc.) and then do some analysis on those positions  
to determine what terms matched from the query.  I think this  
analysis could be facilitated by the new TermVectorMapper  
functionality, but it isn't the only way to go.
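
A minimal sketch of the "positions, then analysis" idea, in plain Java
rather than Lucene: the class and method names are hypothetical, and the
crude lowercase-and-split tokenizer only stands in for the positional
information that spans or term vectors with positions would supply. It
records where each query term occurs in the document, and a
post-processing step then checks whether every term of the AND group was
found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TermPositionsSketch {

    /** Maps each query term found in the text to its 0-based token positions. */
    public static Map<String, List<Integer>> positionsOf(Set<String> queryTerms, String docText) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        // Crude analysis: lowercase, then split on anything that is not a letter or digit.
        String[] tokens = docText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        for (int i = 0; i < tokens.length; i++) {
            if (queryTerms.contains(tokens[i])) {
                positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> queryTerms =
            new HashSet<>(Arrays.asList("nintendo", "sony", "microsoft", "xbox"));
        String doc1 =
            "Microsoft is a nice software company, and Xbox seems to be a nice product too.";

        Map<String, List<Integer>> found = positionsOf(queryTerms, doc1);
        System.out.println(found); // {microsoft=[0], xbox=[7]}

        // Post-processing: the AND group only counts if all of its terms were found.
        boolean andGroupMatched =
            found.keySet().containsAll(Arrays.asList("nintendo", "sony", "microsoft"));
        System.out.println(andGroupMatched); // false
    }
}
```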

Also, contrib/highlighter probably already does most of this, but I  
am not an expert on Highlighter.  I would bet, however, it would tell  
you which terms are in the document and where.


-Grant

> []s
>      Rossini
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Re: Extract terms not by reader, but by documents

Posted by Rafael Rossini <ra...@gmail.com>.

Thanks for the reply, Grant. Let me try to explain exactly what I'd like to
do. Take these two docs:

Doc1: "Microsoft is a nice software company, and Xbox seems to be a nice
product too."
Doc2: "Nintendo and Sony have been in the game industry for a long time, but
now, Microsoft is trying to enter with Xbox"

Now, if I have a query like "(Nintendo AND Sony AND Microsoft) OR (Xbox)"
and perform a query.rewrite(reader).extractTerms(set), my set is going to
have all the terms (Nintendo, Sony, Microsoft, Xbox), right? But when I
iterate over the docs, I want to do something like
query.rewrite(doc).extractTerms(set) [I know this method does not exist; it
is just an example of the functionality].

So, for Doc1, my set would be populated with (Xbox) only, and for Doc2 the
set would be populated with (Nintendo, Sony, Microsoft, Xbox).
Is this possible? Is it clear now what I'm trying to achieve?
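
To pin down the behavior I am describing, here is a minimal sketch in plain
Java. Nothing in it is Lucene API: the Node interface and the term/and/or
helpers are hypothetical, and the tokenizer is a crude lowercase-and-split.
Each node reports its terms only when it actually matches the document, so
a failed AND group contributes nothing even if some of its terms are
present.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class MatchedTermsSketch {

    /** Hypothetical query node; not a Lucene type. */
    public interface Node {
        /** Returns true on a match and adds the terms responsible for it to matched. */
        boolean collect(Set<String> docTerms, Set<String> matched);
    }

    public static Node term(String t) {
        return (docTerms, matched) -> docTerms.contains(t) && matched.add(t) | true;
    }

    /** All clauses must match; nothing is reported if any clause fails. */
    public static Node and(Node... clauses) {
        return (docTerms, matched) -> {
            Set<String> local = new TreeSet<>();
            for (Node c : clauses) {
                if (!c.collect(docTerms, local)) return false;
            }
            matched.addAll(local);
            return true;
        };
    }

    /** At least one clause must match; every matching clause reports its terms. */
    public static Node or(Node... clauses) {
        return (docTerms, matched) -> {
            boolean any = false;
            for (Node c : clauses) {
                any |= c.collect(docTerms, matched); // |= always evaluates the clause
            }
            return any;
        };
    }

    public static Set<String> matchedTerms(Node query, String docText) {
        Set<String> docTerms = new HashSet<>(
            Arrays.asList(docText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        Set<String> matched = new TreeSet<>();
        query.collect(docTerms, matched);
        return matched;
    }

    public static final Node QUERY =
        or(and(term("nintendo"), term("sony"), term("microsoft")), term("xbox"));
    public static final String DOC1 =
        "Microsoft is a nice software company, and Xbox seems to be a nice product too.";
    public static final String DOC2 =
        "Nintendo and Sony have been in the game industry for a long time, but now, "
        + "Microsoft is trying to enter with Xbox";

    public static void main(String[] args) {
        System.out.println(matchedTerms(QUERY, DOC1)); // [xbox]
        System.out.println(matchedTerms(QUERY, DOC2)); // [microsoft, nintendo, sony, xbox]
    }
}
```

Doc1 contains "Microsoft", but it is not reported because the AND group as
a whole did not match.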

[]s
     Rossini

On 9/4/07, Grant Ingersoll <gs...@apache.org> wrote:
>
> Not sure if I am understanding what you are trying to do.  I think
> you are trying to find out which terms occurred in a particular
> document, correct?
>
> I also am not sure about your first example.  My understanding of
> extractTerms is that it just gives you back the set of all terms that
> occur in the _query_, not necessarily those that matched in the
> document, although it has this effect for things like WildcardQuery
> and others that get expanded using TermEnum since they are expanded
> based on what is in the index.  I think this is best seen by the
> implementation of extractTerms() in TermQuery.java in which it just
> adds the term from the query into the set.  Likewise for BooleanQuery
> which loops over the clauses and extracts the terms from each clause
> and adds them to the set.  Thus, if you had a boolean query of all
> term queries, you would get back the set of all the terms.
>
> As for the problem it sounds like you are interested in, you could
> use SpanQuery functionality with some post processing analysis or try
> using Term Vectors and the new (unreleased) TermVectorMapper (TVM)
> functionality (or possibly a combination of both).  In this case, you
> will need to write your own implementation of the TVM that takes in
> the query so it knows what terms to identify. If you go the latter
> route, know that it is new functionality and probably doesn't have a
> whole lot of users yet, so there may still be issues with it.  See
> the nightly build or nightly javadocs for info on these.
>
> The other question that might be helpful, is what custom highlighting
> are you doing that isn't covered by the contrib/highlighter?  Perhaps
> you have some suggestions that are generic enough to help improve
> it?  Just a thought.
>
> Hope this helps,
> Grant
>

Re: Extract terms not by reader, but by documents

Posted by Grant Ingersoll <gs...@apache.org>.

Not sure if I am understanding what you are trying to do.  I think  
you are trying to find out which terms occurred in a particular  
document, correct?

I also am not sure about your first example.  My understanding of  
extractTerms is that it just gives you back the set of all terms that  
occur in the _query_, not necessarily those that matched in the  
document, although it has this effect for things like WildcardQuery  
and others that get expanded using TermEnum since they are expanded  
based on what is in the index.  I think this is best seen by the  
implementation of extractTerms() in TermQuery.java in which it just  
adds the term from the query into the set.  Likewise for BooleanQuery  
which loops over the clauses and extracts the terms from each clause  
and adds them to the set.  Thus, if you had a boolean query of all  
term queries, you would get back the set of all the terms.
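
The behavior described above can be imitated with a small, self-contained
sketch. The types QueryNode, TermNode, and BoolNode are hypothetical
stand-ins written for this illustration, not the Lucene classes; they only
mirror the description of a term query adding its own term and a boolean
query delegating to each clause. No document or index is consulted at any
point.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ExtractTermsSketch {

    /** Hypothetical stand-in for a query; not a Lucene type. */
    public interface QueryNode {
        void extractTerms(Set<String> terms);
    }

    /** Like a term query: adds its own term and nothing else. */
    public static final class TermNode implements QueryNode {
        private final String term;
        public TermNode(String term) { this.term = term; }
        @Override public void extractTerms(Set<String> terms) { terms.add(term); }
    }

    /** Like a boolean query: loops over its clauses and delegates to each one. */
    public static final class BoolNode implements QueryNode {
        private final List<QueryNode> clauses;
        public BoolNode(QueryNode... clauses) { this.clauses = Arrays.asList(clauses); }
        @Override public void extractTerms(Set<String> terms) {
            for (QueryNode clause : clauses) {
                clause.extractTerms(terms);
            }
        }
    }

    /** Extracts the terms of (nintendo sony microsoft) (xbox); operators play no role. */
    public static Set<String> demo() {
        QueryNode query = new BoolNode(
            new BoolNode(new TermNode("nintendo"), new TermNode("sony"), new TermNode("microsoft")),
            new TermNode("xbox"));
        Set<String> terms = new LinkedHashSet<>();
        query.extractTerms(terms);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [nintendo, sony, microsoft, xbox]
    }
}
```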

As for the problem it sounds like you are interested in, you could  
use SpanQuery functionality with some post processing analysis or try  
using Term Vectors and the new (unreleased) TermVectorMapper (TVM)  
functionality (or possibly a combination of both).  In this case, you  
will need to write your own implementation of the TVM that takes in  
the query so it knows what terms to identify. If you go the latter  
route, know that it is new functionality and probably doesn't have a  
whole lot of users yet, so there may still be issues with it.  See  
the nightly build or nightly javadocs for info on these.

The other question that might be helpful, is what custom highlighting  
are you doing that isn't covered by the contrib/highlighter?  Perhaps  
you have some suggestions that are generic enough to help improve  
it?  Just a thought.

Hope this helps,
Grant

On Sep 4, 2007, at 5:01 PM, Rafael Rossini wrote:

> Hi all,
>
>     In some custom highlighting, I often write code like this:
>
>        Set<Term> matchedTerms = new HashSet<Term>();
>        query.rewrite(reader).extractTerms(matchedTerms);
>
>     With this code, the Term set gets populated with the terms the query
> matches across your whole index. Is it possible to do this with a document
> instead of the reader? Something like
> query.rewrite(documentId).extractTerms(matchedTerms) ?
>
> []s
>      Rossini

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
