
Posted to solr-user@lucene.apache.org by Péter Király <ki...@gmail.com> on 2016/01/29 16:09:21 UTC

Calculating tf-idf

Dear all,

I am working on a research project in which I am building an open source
tool that tries to detect "bad" and "good" records in a metadata collection
(such as a library catalog or a museum database; you can find more
info at http://pkiraly.github.io/). This is not the first project of
its kind: there are scientific articles on the topic, and there
are some established metrics as well. One of these metrics is
"Conformance to expectation", which is more or less a variation of the
tf-idf calculation (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

The process in my case is to index the database, then iterate over the
records and calculate the tf-idf of the important fields. Since I haven't
found a method that simply retrieves this from the Solr index, I
followed this method:

1. take a field value
2. use the /analysis/field handler to extract the terms from the original value
3. use /terms with terms.limit=1, terms.sort=index, and the terms.fl and
   terms.prefix parameters to retrieve the document frequency of each
   term
4. do the calculations based on those input variables
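
To make the last step concrete, here is a minimal sketch of the calculation,
assuming the textbook weighting tf * ln(N / df). The function name, the
argument names, and the sample numbers are made up for illustration; the
metric I actually use may weight things differently.

```python
import math
from collections import Counter


def tf_idf(terms, doc_freqs, num_docs):
    """Return {term: tf-idf} for one field value.

    terms     -- analyzed tokens of the field value (from /analysis/field)
    doc_freqs -- {term: document frequency} (from /terms)
    num_docs  -- total number of documents in the index
    """
    counts = Counter(terms)  # term frequency within this field value
    scores = {}
    for term, tf in counts.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # no document frequency known; skip to avoid dividing by zero
        scores[term] = tf * math.log(num_docs / df)
    return scores


# Hypothetical input: three tokens, an index of 1000 documents.
scores = tf_idf(["solr", "index", "solr"], {"solr": 10, "index": 100}, 1000)
```

A term that is frequent in the field value but rare in the index ("solr"
above) ends up with a higher score than a common one ("index").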

My question is: is there a more direct way to extract this
information from the Solr index, either in Solr or with the Lucene
API?

Thank you very much in advance!
Péter

-- 
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly

Re: Calculating tf-idf

Posted by Péter Király <ki...@gmail.com>.

I found the solution: the TermVectorComponent
(https://wiki.apache.org/solr/TermVectorComponent). I did not know about
it before, but it is exactly what I need.
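
For anyone who finds this thread later, here is a rough sketch of how such a
request could be built. It assumes the field is indexed with
termVectors="true" and that a /tvrh request handler is configured as in the
Solr example solrconfig.xml; the host, core name, document id, and field
name are placeholders, not values from my setup.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, doc_id, field):
    """Build a TermVectorComponent request URL for one document and field."""
    params = {
        "q": f"id:{doc_id}",   # select a single document by its id
        "fl": "id",
        "tv": "true",          # switch the component on
        "tv.fl": field,        # field whose term vector is returned
        "tv.tf": "true",       # term frequency within the document
        "tv.df": "true",       # document frequency within the index
        "tv.tf_idf": "true",   # tf-idf value as computed by Solr
        "wt": "json",
    }
    return f"{base_url}/tvrh?{urlencode(params)}"


# Placeholder values; adjust to the actual Solr instance.
url = term_vector_url("http://localhost:8983/solr/mycore", "42", "title")
```

Note that tv.tf_idf needs both tv.tf and tv.df switched on, and it is worth
checking how Solr defines the value before relying on it for a metric.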

Regards,
Péter
