Posted to java-user@lucene.apache.org by Sengly Heng <se...@gmail.com> on 2007/03/28 10:36:54 UTC

TF-IDF API

Hello Luceners,

I have a collection of vectors of terms (tokens) that I extracted from files.
I am looking for a way to calculate the TF-IDF of each term.

I wanted to use Lucene for this, but Lucene is made for collections of
files, and in my case I have already reduced those files to vectors of
terms. I know this measure is not very difficult to implement, but I
guess an API for it should already exist. Does anyone know of a Java
API that handles this problem directly, or do I have to implement it from scratch?

Any idea would be highly appreciated.

Thank you in advance.

Best regards,

Sengly

Re: TF-IDF API

Posted by Sengly Heng <se...@gmail.com>.

Thank you very much for your time. Here is a sample vector of terms:

v1 = {"sad", "john", "intelligent", "news", "USA", "disneyland", "MIT",
"cambridge", "marry", ...}

I'll try out your method.

Best regards,

Sengly




Re: TF-IDF API

Posted by karl wettin <ka...@gmail.com>.

On 28 Mar 2007, at 15:24, Sengly Heng wrote:

> Thank you, but I still have no clue how to do that with Weka after
> taking a look at its API. Let me reformulate my problem:
>
> I have a collection of vectors of terms (each vector of terms is
> actually the list of tokens extracted from one file), and I do not have
> the original files. I would like to calculate the TF as well as the
> TF-IDF of each term and sort the terms by each of these values. As
> suggested by Grant Ingersoll, I could index those vectors of terms again
> using Lucene and then use its API to measure TF and TF-IDF. However, I
> guess there should be a simpler way, or an API that fits this case
> directly.

To my knowledge there is nothing in Lucene that makes this simpler for
you than what Grant suggests, and in my opinion Weka really must be the
simplest way around. However, perhaps you should give us an example of
what these vectors look like. That might change everything; perhaps we
are talking about completely different things here.

Let me reformulate my suggestion:

1. Rebuild each vector as a string.
2. Put the data in a file called myvectors.arff:

@relation termvectors
@attribute the_vector string
@data
"first term vector as a string"
"second term vector as a string"

3. Open the file in the Weka Explorer application.
4. Select the filter filters/unsupervised/attribute/StringToWordVector.
5. Set your preferences for normalization, etc.
6. Apply the filter.
7. Save the output.

All of this can be done programmatically too, with only a few lines of code.
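
Steps 1 and 2 above could be sketched in plain Java roughly as follows. This is only an illustration, not code from Weka: the class name ArffSketch and both method names are made up for the example, and the backslash escaping of quotes is an assumption about what Weka's ARFF reader accepts. The relation and attribute names are the ones from the sample file above.

```java
import java.util.Arrays;
import java.util.List;

public class ArffSketch {

    /** Joins one term vector into a single quoted ARFF string value. */
    public static String toArffString(List<String> terms) {
        String joined = String.join(" ", terms);
        // Escape backslashes and double quotes so the value stays one string.
        // (Assumption: backslash escaping is what the ARFF reader expects.)
        String escaped = joined.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    /** Builds the full ARFF text: the header plus one data row per vector. */
    public static String toArff(List<List<String>> vectors) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation termvectors\n");
        sb.append("@attribute the_vector string\n");
        sb.append("@data\n");
        for (List<String> v : vectors) {
            sb.append(toArffString(v)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> vectors = Arrays.asList(
            Arrays.asList("sad", "john", "intelligent", "news"),
            Arrays.asList("USA", "disneyland", "MIT", "cambridge"));
        // Redirect this output into myvectors.arff, then continue with step 3.
        System.out.print(toArff(vectors));
    }
}
```

Joining the terms with single spaces assumes that no term contains whitespace itself; otherwise the word splitting in step 4 would break such a term apart.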



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: TF-IDF API

Posted by Sengly Heng <se...@gmail.com>.

Thank you, but I still have no clue how to do that with Weka after
taking a look at its API. Let me reformulate my problem:

I have a collection of vectors of terms (each vector of terms is actually
the list of tokens extracted from one file), and I do not have the
original files. I would like to calculate the TF as well as the TF-IDF of
each term and sort the terms by each of these values. As suggested by Grant
Ingersoll, I could index those vectors of terms again using Lucene and then
use its API to measure TF and TF-IDF. However, I guess there should be a
simpler way, or an API that fits this case directly.

Thanks once again everyone.

Best regards,

Sengly
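
For reference, the calculation described above (TF and TF-IDF per term, sorted by value) can also be done directly on the term vectors without any library. The following is a minimal sketch under stated assumptions: it uses raw counts for TF and log(N / df) for IDF, which is only one of several common TF-IDF variants and is not what Lucene or Weka computes by default. The class name TfIdfSketch and its method names are made up for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    /** Raw term frequency: how often each term occurs in one vector. */
    public static Map<String, Integer> termFrequencies(List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    /** Document frequency: the number of vectors that contain each term. */
    public static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set, so that a term counts once per vector.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * TF-IDF of every term in one vector, sorted by descending score.
     * Assumes that doc is one of the vectors in docs, so every term has df >= 1.
     */
    public static List<Map.Entry<String, Double>> rankedTfIdf(
            List<String> doc, List<List<String>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFrequencies(doc).entrySet()) {
            double idf = Math.log((double) docs.size() / df.get(e.getKey()));
            scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() * idf));
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return scored;
    }

    public static void main(String[] args) {
        // Made-up vectors for illustration only.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("sad", "john", "news", "news"),
            Arrays.asList("john", "MIT", "cambridge"),
            Arrays.asList("john", "disneyland", "USA"));
        for (Map.Entry<String, Double> e : rankedTfIdf(docs.get(0), docs)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Sorting by TF alone works the same way: sort the entries of termFrequencies(doc) by value instead of the TF-IDF scores.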


Re: TF-IDF API

Posted by karl wettin <ka...@gmail.com>.

On 28 Mar 2007, at 10:36, Sengly Heng wrote:

> Does anyone know of a Java API that handles this problem directly,
> or do I have to implement it from scratch?

You can also try
weka.filters.unsupervised.attribute.StringToWordVector. It has many
neat features you might be interested in. And if it applies to what
you are attempting to do, the feature selection algorithms of the same
project (Weka) do a great job of reducing the data set.

http://www.cs.waikato.ac.nz/ml/weka/

It is GPL.

-- 
karl



Re: TF-IDF API

Posted by Grant Ingersoll <gs...@apache.org>.

You can pass a String or a Reader to Field when indexing. There is
nothing file-specific about Lucene when it comes to indexing. Take a
look at the Field class for the various constructors.


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




Re: TF-IDF API

Posted by Sengly Heng <se...@gmail.com>.

Thanks, but in my case I do not have the files. What I have is just a
collection of vectors of terms.

Does Lucene provide any means to index each vector of terms as if it were
a file? Or is there a better way to do this?

Thanks, everyone, once again.

Regards,

Sengly



Re: TF-IDF API

Posted by thomas arni <ar...@zhwin.ch>.

Have a look at the "TermDocs" interface in the API.

You can get the term frequency with an open IndexReader:

TermDocs termDocs = reader.termDocs(term);

where "term" represents the current Term.

Once termDocs.next() has positioned it on a document, you can call:

termDocs.freq()

to get the frequency of the term within the current document.

To calculate the idf, you can use the formula provided by
"DefaultSimilarity". To get the document frequency, which the idf
calculation needs, you can call:

reader.docFreq(term)

Hope this helps...

Thomas
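
To make the formulas concrete, here is a small plain-Java sketch of the tf and idf calculation. It assumes, as far as I recall for the DefaultSimilarity in Lucene releases of this period, that tf is the square root of the raw frequency and idf is log(numDocs / (docFreq + 1)) + 1; please check the Similarity javadoc of your own version. The class name ClassicTfIdfSketch is made up, the code does not call Lucene at all, and the plain product tf * idf is only an illustration rather than Lucene's full scoring formula.

```java
public class ClassicTfIdfSketch {

    /** Dampened term frequency: the square root of the raw count. */
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: log(numDocs / (docFreq + 1)) + 1. */
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Product of the two factors for one term in one document. */
    public static double tfIdf(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: with an index, freq would come from
        // termDocs.freq(), docFreq from reader.docFreq(term), and
        // numDocs from the reader's document count.
        System.out.println(tfIdf(4, 9, 1000));
    }
}
```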

