Posted to java-user@lucene.apache.org by Sengly Heng <se...@gmail.com> on 2007/04/10 16:01:31 UTC

Get the total term frequency vector of a specific field from the hit results

Hello all,

I would like to extract the term frequency vector from the hit results as a
single total vector, not one vector per document.

I have searched the mailing list and found that many people have discussed
this issue, but I still could not find the right solution. Everyone just
suggested looking at getTermFreqVector and TermEnum.

I wonder if someone has already done this before, and what your solution
was? Would you please share?

Also, how can I get a list of the top n keywords from the hit results? I have
looked at HighFreqTerms (in the contribution repositories as well as the
one implemented by Luke), but that class is meant for getting the top n
keywords from an entire index, not from the hit results.

Thank you.

Best regards,

Sengly

Re: Get the total term frequency vector of a specific field from the hit results

Posted by Grant Ingersoll <gs...@apache.org>.

On Apr 11, 2007, at 9:07 AM, karl wettin wrote:

>
> On 11 Apr 2007, at 04:21, Grant Ingersoll wrote:
>
>> Would some sort of caching strategy work?  How big is your overall  
>> collection?
>>
>> Also, lately there have been a few threads on TV (term vector)  
>> performance.  I don't recall anyone having actively profiled or  
>> examined it for improvements, so perhaps that would be helpful.
>>
>> Another thought: could you have a stored field that contains the  
>> top X terms for a given document with their freqs and then just do  
>> a merge based on your hit results?  Part of the problem w/ TVs is  
>> that not only do you have to load them, but then you have to  
>> iterate through them to sort them by frequency.  I could see that  
>> it might be beneficial to have alternate strategies for loading  
>> them, say into a map of terms -> frequencies or terms to TVInfo  
>> (freqs, offsets, positions) or parallel arrays sorted by frequency  
>> or something like that.
>
> I personally don't like the idea of putting such information in a  
> stored field. It would require parsing or so. I'd go deeper.

Agreed, but probably not too different from manipulating the arrays.

>
> Are there any dependencies on the natural order of a term vector?
> Highlighting? What about allowing an alternative order, such as
> frequency rather than string value? Multiple term vectors per
> document? Perhaps a new file would be the best way. I can think of
> many uses for a consumer-configurable term vector. Actually, I think
> I'll look into this some day.

I was thinking something along the lines of (really thinking out loud
here):

IndexReader.getTermFreqVector(int docid, String fieldName, TermVectorLoader tvl)

where TermVectorLoader is an interface that has something like:

void loadTermVector(String term, int frequency, int offset, int position);

Then, you could implement this however you wanted.  We could provide
an initial implementation that is backed by a SortedMap.
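
A minimal sketch of the callback idea above, in plain Java with no Lucene on the classpath. TermVectorLoader and loadTermVector follow the signature proposed in this message, but they are only a proposal here, not an existing Lucene API. SortedMapTermVectorLoader and getTotals are hypothetical names for the suggested SortedMap-backed implementation, and the reader that would invoke the callback is replaced by direct calls.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical callback interface, copied from the proposal above.
// It is NOT an existing Lucene API.
interface TermVectorLoader {
    void loadTermVector(String term, int frequency, int offset, int position);
}

// Hypothetical SortedMap-backed implementation: accumulates term -> total
// frequency, so one instance can be reused across all documents of a hit list.
class SortedMapTermVectorLoader implements TermVectorLoader {
    private final SortedMap<String, Integer> totals = new TreeMap<>();

    @Override
    public void loadTermVector(String term, int frequency, int offset, int position) {
        // Offsets and positions are ignored; only frequencies are summed.
        totals.merge(term, frequency, Integer::sum);
    }

    SortedMap<String, Integer> getTotals() {
        return Collections.unmodifiableSortedMap(totals);
    }
}

class TermVectorLoaderDemo {
    public static void main(String[] args) {
        SortedMapTermVectorLoader loader = new SortedMapTermVectorLoader();
        // Stand-in for the reader calling back once per term of each document.
        loader.loadTermVector("index", 3, 0, 0);
        loader.loadTermVector("lucene", 2, 6, 1);
        loader.loadTermVector("lucene", 4, 0, 0); // same term, second document
        System.out.println(loader.getTotals()); // {index=3, lucene=6}
    }
}
```

Because the loader only sees one term at a time, the same interface could just as well fill a frequency-sorted structure; the SortedMap keeps terms in string order.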

>
> Sengly, do you use the term vectors for anything else? I'd look
> into hacking the order of the values in the term vector. That could be
> problematic if you want to use the default behaviour in the future,
> though.

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Get the total term frequency vector of a specific field from the hit results

Posted by karl wettin <ka...@gmail.com>.

On 11 Apr 2007, at 04:21, Grant Ingersoll wrote:

> Would some sort of caching strategy work?  How big is your overall  
> collection?
>
> Also, lately there have been a few threads on TV (term vector)  
> performance.  I don't recall anyone having actively profiled or  
> examined it for improvements, so perhaps that would be helpful.
>
> Another thought: could you have a stored field that contains the  
> top X terms for a given document with their freqs and then just do  
> a merge based on your hit results?  Part of the problem w/ TVs is  
> that not only do you have to load them, but then you have to  
> iterate through them to sort them by frequency.  I could see that  
> it might be beneficial to have alternate strategies for loading  
> them, say into a map of terms -> frequencies or terms to TVInfo  
> (freqs, offsets, positions) or parallel arrays sorted by frequency  
> or something like that.

I personally don't like the idea of putting such information in a
stored field. It would require parsing or the like. I'd go deeper.

Are there any dependencies on the natural order of a term vector?
Highlighting? What about allowing an alternative order, such as
frequency rather than string value? Multiple term vectors per
document? Perhaps a new file would be the best way. I can think of
many uses for a consumer-configurable term vector. Actually, I think
I'll look into this some day.

Sengly, do you use the term vectors for anything else? I'd look into
hacking the order of the values in the term vector. That could be
problematic if you want to use the default behaviour in the future,
though.


Re: Get the total term frequency vector of a specific field from the hit results

Posted by Grant Ingersoll <gs...@apache.org>.

Would some sort of caching strategy work?  How big is your overall  
collection?

Also, lately there have been a few threads on TV (term vector)  
performance.  I don't recall anyone having actively profiled or  
examined it for improvements, so perhaps that would be helpful.

Another thought: could you have a stored field that contains the top  
X terms for a given document with their freqs and then just do a  
merge based on your hit results?  Part of the problem w/ TVs is that  
not only do you have to load them, but then you have to iterate  
through them to sort them by frequency.  I could see that it might be  
beneficial to have alternate strategies for loading them, say into a  
map of terms -> frequencies or terms to TVInfo (freqs, offsets,  
positions) or parallel arrays sorted by frequency or something like  
that.
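
A rough sketch of the stored-field idea above, in plain Java. The "term:freq" encoding and the names parseTopTerms and mergeHits are assumptions made for illustration; in a real index the strings would be read from a stored field of each hit document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StoredTopTermsMerge {
    // Parses a hypothetical stored-field value such as "lucene:12 index:7".
    // Assumes every non-empty token is well formed (term, colon, integer).
    static Map<String, Integer> parseTopTerms(String stored) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String pair : stored.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // empty stored value
            }
            int sep = pair.lastIndexOf(':');
            freqs.put(pair.substring(0, sep), Integer.parseInt(pair.substring(sep + 1)));
        }
        return freqs;
    }

    // Sums the per-document frequencies of all hits into one total map.
    static Map<String, Integer> mergeHits(List<String> storedValues) {
        Map<String, Integer> totals = new HashMap<>();
        for (String stored : storedValues) {
            parseTopTerms(stored).forEach((term, freq) -> totals.merge(term, freq, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals =
                mergeHits(Arrays.asList("lucene:12 index:7", "index:3 vector:5"));
        System.out.println(totals.get("index")); // 10
    }
}
```

Since each document stores only its top X terms, the merged totals are approximate: a term that falls just below the cut-off in many documents is never counted.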

It _might_ be possible to do this in a HitCollector or FieldSelector  
style way.  This way, perhaps, you could build the TV structure you  
want as it is read from disk.  Do you have any interest in digging  
down into the Lucene code to help on such an idea?

-Grant

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


Re: Get the total term frequency vector of a specific field from the hit results

Posted by Sengly Heng <se...@gmail.com>.

Once again, thank you for your help.

> >> We don't really know what your problem is. Explaining that rather
> >> than the solution you have thought of might render a couple of
> >> alternate solutions. Perhaps something could be precalculated and
> >> stored in the documents. Perhaps feature selection (reduction) of the
> >> terms might do the trick for you. And so on.
> >
> > I have a corpus of documents indexed with different fields.
> > Approximately each document indexed has an average of 30 fields. Each
> > field has about 100 terms.
> >
> > Normally, the hit will return less than 100 documents. For each of
> > the 30 fields of the documents, I have to calculate the top 35
> > keywords from all the documents as well as the top 30 popular keywords
> > (the keywords that are distributed in many documents - something like
> > docFreq or IDF).
>
> Right, but /why/ do you need these values? Do you present them as
> they are, or do you use them for some secondary calculation? Then
> what is the result of this secondary calculation?

Yes, I just want those values as they are. No secondary calculation is to be
performed.

> > Please let me know if you still have more questions.
>
> I'll re-ask some of the questions I placed in my previous reply:
>
> >> How slow is it, and how fast did you expect it to be?

We expect to get those values sorted as fast as possible. Currently, for 100
documents, the process takes about 1 to 1.5 minutes. I believe this is because
of the loop.

> >> Can you limit the evaluation to the top n documents?

Yes, we limit it to only the top 100 documents.

Thank you.

Regards,

Sengly

Re: Get the total term frequency vector of a specific field from the hit results

Posted by karl wettin <ka...@gmail.com>.

On 10 Apr 2007, at 17:48, Sengly Heng wrote:

>> We don't really know what your problem is. Explaining that rather
>> than the solution you have thought of might render a couple of
>> alternate solutions. Perhaps something could be precalculated and
>> stored in the documents. Perhaps feature selection (reduction) of the
>> terms might do the trick for you. And so on.
>
> I have a corpus of documents indexed with different fields.
> Approximately each document indexed has an average of 30 fields. Each
> field has about 100 terms.
>
> Normally, the hit will return less than 100 documents. For each of
> the 30 fields of the documents, I have to calculate the top 35 keywords
> from all the documents as well as the top 30 popular keywords (the
> keywords that are distributed in many documents - something like
> docFreq or IDF).

Right, but /why/ do you need these values? Do you present them as
they are, or do you use them for some secondary calculation? Then
what is the result of this secondary calculation?

> Please let me know if you still have more questions.

I'll re-ask some of the questions I placed in my previous reply:

>> How slow is it, and how fast did you expect it to be?

>> Can you limit the evaluation to the top n documents?

-- 
karl


Re: Get the total term frequency vector of a specific field from the hit results

Posted by Sengly Heng <se...@gmail.com>.

Dear Karl,

Thank you for taking the time to look at my problem.


>
> We don't really know what your problem is. Explaining that rather
> than the solution you have thought of might render a couple of
> alternate solutions. Perhaps something could be precalculated and
> stored in the documents. Perhaps feature selection (reduction) of the
> terms might do the trick for you. And so on.


Here is the description of my problem:

I have a corpus of documents indexed with different fields. On average, each
indexed document has about 30 fields, and each field has about 100 terms.

Normally, a search returns fewer than 100 documents. For each of the 30
fields, I have to calculate the top 35 keywords across all the returned
documents, as well as the top 30 popular keywords (the keywords that are
distributed across many documents - something like docFreq or IDF). Time is
critical in my case, as it is an interactive system. Users cannot wait
long to get the results.

Please let me know if you still have more questions.

Thanks in advance for your time and help.

Best regards,

Sengly
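
The "popular keywords" requirement described in this message (terms spread across many of the hit documents) can be sketched as a document-frequency count restricted to the hit set, followed by a top-n selection. This is only an illustration in plain Java: HitKeywordStats, docFreqInHits and topN are hypothetical names, and the String[] arrays stand in for the per-document term lists that Lucene 2.x exposes through TermFreqVector.getTerms().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class HitKeywordStats {
    // Counts, for each term, in how many of the hit documents it occurs.
    static Map<String, Integer> docFreqInHits(List<String[]> termsPerDoc) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] terms : termsPerDoc) {
            // A set guards against a term being listed twice for one document.
            for (String term : new HashSet<>(Arrays.asList(terms))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Returns the n entries with the highest counts, highest first
    // (ties broken alphabetically). A heap bounded to n entries avoids
    // sorting the whole map.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        Comparator<Map.Entry<String, Integer>> best = (a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };
        // Reversed order puts the worst kept entry at the head of the heap.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(best.reversed());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.add(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the current worst entry
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(best);
        return result;
    }

    public static void main(String[] args) {
        List<String[]> hits = Arrays.asList(
                new String[] {"lucene", "index"},
                new String[] {"lucene", "vector"},
                new String[] {"lucene", "index", "term"});
        System.out.println(topN(docFreqInHits(hits), 2)); // [lucene=3, index=2]
    }
}
```

The same topN helper works on a map of summed term frequencies, which would cover the "top 35 keywords" half of the requirement.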

Re: Get the total term frequency vector of a specific field from the hit results

Posted by karl wettin <ka...@gmail.com>.

On 10 Apr 2007, at 16:58, Sengly Heng wrote:

> I wanted to do it this way as well, but I am a bit worried about the
> computational time, as I have many documents and each document is
> fairly large.

> I am looking for more solutions.

We don't really know what your problem is. Explaining that rather
than the solution you have thought of might render a couple of
alternate solutions. Perhaps something could be precalculated and
stored in the documents. Perhaps feature selection (reduction) of the
terms might do the trick for you. And so on.

Let me pull some questions out of nowhere that might help: How slow
is it, and how fast did you expect it to be? How many documents do
your queries normally yield? Can you limit the evaluation to the
top n documents?

> Please do contribute if you have any. Your help is highly
> appreciated.

As Lucene is primarily an inverted index, the document vector space
model is not available in any other fashion than through the term
frequency vectors, or by building them from scratch by enumerating the
whole index. The latter is of course horribly slow in most cases.

-- 
karl


Re: Get the total term frequency vector of a specific field from the hit results

Posted by Sengly Heng <se...@gmail.com>.

Thanks so much Thomas for your prompt reply.


>
> First of all you have to make sure that you create the new Fields, which
> you add to a Document, with the appropriate constructor. You have to
> specify the usage of term vectors (Field.TermVector.YES):
>
> new Field("text", "your text...", Field.Store.YES,
> Field.Index.TOKENIZED, Field.TermVector.YES);


I did set it up like this.

> Without the explicit storage of the term vectors it is not possible to
> get the term vectors during searching.
>
> Once you build the index, you can use the suggested method
> getTermFreqVector().
>
> To get the top n keywords from the hits object you can iterate over the
> first results.
> Here is an example:
>
>            for (int i = 0; i < 10; i++) {
>                int docNumber = hits.id(i);
>                // returns an array of term frequency vectors for the specified document
>                TermFreqVector[] termsV = ir.getTermFreqVectors(docNumber);
>                // loop over all term vectors in the current document
>                for (int xy = 0; xy < termsV.length; xy++) {
>                    String[] terms = termsV[xy].getTerms();
>                    for (int termsInArray = 0; termsInArray < terms.length; termsInArray++) {
>                        // TODO: count the occurrences of the terms
>                    }
>                }
>            }


I wanted to do it this way as well, but I am a bit worried about the
computational time, as I have many documents and each document is fairly large.

I am looking for more solutions.

Please do contribute if you have any. Your help is highly appreciated.

Best,

Sengly


Re: Get the total term frequency vector of a specific field from the hit results

Posted by thomas arni <ar...@zhwin.ch>.

Hello Sengly

First of all you have to make sure that you create the new Fields, which
you add to a Document, with the appropriate constructor. You have to
specify the usage of term vectors (Field.TermVector.YES):

new Field("text", "your text...", Field.Store.YES,
Field.Index.TOKENIZED, Field.TermVector.YES);

Without the explicit storage of the term vectors it is not possible to 
get the term vectors during searching.

Once you build the index, you can use the suggested method 
getTermFreqVector().

To get the top n keywords from the hits object you can iterate over the 
first results.
Here is an example:

            // Look at no more than the first 10 hits.
            int limit = Math.min(10, hits.length());
            for (int i = 0; i < limit; i++) {
                int docNumber = hits.id(i);
                // Returns an array of term frequency vectors for the specified document.
                TermFreqVector[] termsV = ir.getTermFreqVectors(docNumber);
                if (termsV == null) {
                    continue; // no term vectors stored for this document
                }
                // Loop over all term vectors in the current document.
                for (int xy = 0; xy < termsV.length; xy++) {
                    String[] terms = termsV[xy].getTerms();
                    for (int termsInArray = 0; termsInArray < terms.length; termsInArray++) {
                        // TODO: count the occurrences of the terms
                    }
                }
            }
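
One possible way to fill in the counting step left open in the loop above, sketched in plain Java so it runs without Lucene. TotalTermFrequencies and addVector are hypothetical names. The field name and the parallel terms/freqs arrays are assumed to come from TermFreqVector.getField(), getTerms() and getTermFrequencies() inside the inner loop.

```java
import java.util.HashMap;
import java.util.Map;

class TotalTermFrequencies {
    // Adds one term vector to the running totals, kept separately per field.
    // terms and freqs are parallel arrays: terms[i] occurs freqs[i] times.
    static void addVector(Map<String, Map<String, Integer>> totals,
                          String field, String[] terms, int[] freqs) {
        Map<String, Integer> fieldTotals = totals.computeIfAbsent(field, f -> new HashMap<>());
        for (int i = 0; i < terms.length; i++) {
            fieldTotals.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> totals = new HashMap<>();
        // Stand-ins for the term vectors of two hit documents.
        addVector(totals, "text", new String[] {"index", "lucene"}, new int[] {1, 4});
        addVector(totals, "text", new String[] {"lucene", "vector"}, new int[] {2, 3});
        addVector(totals, "title", new String[] {"lucene"}, new int[] {1});
        System.out.println(totals.get("text").get("lucene")); // 6
    }
}
```

Keeping one map per field matches the question, which asks for the total vector of a specific field rather than of the whole document.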

Hope this helps.
Thomas

