Posted to solr-user@lucene.apache.org by JCodina <jo...@barcelonamedia.org> on 2009/06/25 12:53:13 UTC

Top tf_idf in TermVectorComponent

For any further study of the result set, such as clustering, the
TermVectorComponent
gives the list of words with the corresponding tf and idf.
But this list can be huge for each document, and most of the terms may have
a low tf or too high a df.
It may also be useful to compare the relative DF in the query results with
the relative DF in the full collection in order to improve the facets (show
only those terms whose relative DF in the query results is higher than in
the full collection).
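
To make the relative DF comparison concrete, here is a minimal client-side
sketch in Python. It is not part of Solr: the function name, the parameter
names and the input dictionaries are all hypothetical, and it assumes the DF
counts for the result set and for the full collection have already been
collected somewhere else.

```python
def overrepresented_terms(result_df, n_results, collection_df, n_collection,
                          min_ratio=1.0):
    """Return (term, ratio) pairs, highest ratio first.

    ratio = (df in results / n_results) / (df in collection / n_collection)
    Only terms with ratio > min_ratio are kept.
    All names here are illustrative assumptions, not a Solr API.
    """
    if n_results <= 0 or n_collection <= 0:
        return []
    scored = []
    for term, df_q in result_df.items():
        df_c = collection_df.get(term, 0)
        if df_c <= 0:
            # No collection-wide DF known for this term, so skip it.
            continue
        # Same ratio as in the docstring, rearranged into a single division.
        ratio = (df_q * n_collection) / (df_c * n_results)
        if ratio > min_ratio:
            scored.append((term, ratio))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

With min_ratio=1.0 a term that is exactly as frequent in the results as in
the whole collection is dropped, which matches the "higher than in the full
collection" condition above.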

To do this, it would be useful if the TermVectorComponent could
sort the results by one of these options:
* tf
* df
* tf/df (to simplify) or tf*idf, where idf is computed as log(total_docs/df)
and truncate the list to a given number of words or at a given threshold value.
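
As an illustration of the tf*idf option, this is a rough sketch of how the
sorting and truncation could be done on the client side in Python, assuming
the tf and df of each term have already been read from the
TermVectorComponent response. The function and parameter names (top_terms,
term_stats, limit, min_score) are made up for the example and do not exist
in Solr.

```python
import math


def top_terms(term_stats, total_docs, limit=10, min_score=0.0):
    """Rank the terms of one document by tf * idf and truncate the list.

    term_stats: dict mapping term -> (tf, df)   (hypothetical input shape)
    idf is computed as log(total_docs / df), as proposed in the message.
    """
    scored = []
    for term, (tf, df) in term_stats.items():
        if df <= 0:
            # Avoid a division by zero on inconsistent statistics.
            continue
        score = tf * math.log(total_docs / df)
        if score >= min_score:
            scored.append((term, score))
    # Highest score first, then keep at most `limit` terms.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Sorting by plain tf or by tf/df instead would only change the line that
computes the score.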
 
Or maybe there is another way to do this?
Joan
-- 
View this message in context: http://www.nabble.com/Top-tf_idf-in-TermVectorComponent-tp24201076p24201076.html