Posted to java-user@lucene.apache.org by Kasun Perera <ka...@opensource.lk> on 2013/01/16 05:16:31 UTC

Calculating an IDF value for a document collection

I have a set of documents split into doc_sections (d), each of which is
further split into (n) sentences. I am using an ontology to calculate the
similarity between the definitions of ontology terms and the doc_sections.
The documents are indexed at the sentence level, so each sentence is a document
in the Lucene index. Each ontology term definition is indexed as a separate
document in the same Lucene index.

These are my use cases:
1)     I want to calculate the similarity (Okapi, cosine) between doc_section(i)
and doc_section(j), and the similarity between doc_section(j) and the ontology
definitions.
Since each sentence is itself a document in the Lucene index, I will be
calculating TF and IDF for a collection of sentences.
TF is specific to each document and will not change for a collection, but
how can I calculate IDF for a collection (a single IDF value for the
collection, not one per document)?
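
One way to read "IDF for a collection" is to count a whole doc_section as a
single unit, so that a term's document frequency is the number of sections in
which it appears in at least one sentence. The following is only a minimal
sketch under that assumption. It works on in-memory term lists rather than on
the Lucene index, the class name SectionIdf is hypothetical, and the plain
textbook formula ln(N / df) is assumed rather than any particular Lucene
similarity. For sentence-level counts, Lucene's IndexReader.numDocs() and
IndexReader.docFreq(Term) could supply N and df directly.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: IDF where the counting unit is a section, not a sentence.
class SectionIdf {

    /**
     * @param sections each section is a list of sentences,
     *                 each sentence is a list of (already analyzed) terms
     * @return term -> ln(N / df), where N is the number of sections and
     *         df is the number of sections containing the term
     */
    static Map<String, Double> idf(List<List<List<String>>> sections) {
        Map<String, Integer> df = new HashMap<>();
        for (List<List<String>> section : sections) {
            // Collect the distinct terms of all sentences in this section,
            // so a term is counted once per section.
            Set<String> seen = new HashSet<>();
            for (List<String> sentence : section) {
                seen.addAll(sentence);
            }
            for (String term : seen) {
                df.merge(term, 1, Integer::sum);
            }
        }

        int n = sections.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return idf;
    }
}
```

Grouping sentences at query time like this leaves the sentence-level index
untouched; only the grouping of sentence ids into sections changes.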

This is the reason why I indexed at the sentence level:

2)     I would select some specific sentences and treat those sentences
together as a new_document.
Then I want to calculate the similarity between the newly_created_document and
the doc_sections, and between the newly_created_document and the ontology
definitions.
Here too each sentence is a document in the Lucene index, so I want
to calculate TF-IDF for a collection (one IDF value for the collection, not
one per document). How can this be done?
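
As a rough illustration of this use case, the selected sentences can be merged
into one term-frequency vector at query time, weighted by a collection-level
IDF, and compared by cosine similarity. This is a sketch under stated
assumptions, not a Lucene API: the class MergedTfIdf and its methods are
hypothetical, raw term counts are used as TF, and the IDF map is assumed to
have been computed separately for the collection.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: treats a group of sentences as one virtual document.
class MergedTfIdf {

    /** Sums raw term counts over all sentences, then multiplies each by its IDF. */
    static Map<String, Double> tfIdf(List<List<String>> sentences,
                                     Map<String, Double> idf) {
        Map<String, Double> weights = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String term : sentence) {
                weights.merge(term, 1.0, Double::sum); // raw term frequency
            }
        }
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // Terms missing from the IDF map get weight 0 (an assumption).
            e.setValue(e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return weights;
    }

    /** Cosine similarity between two sparse weight vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty or all-zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The same tfIdf call could be applied to the sentences of a doc_section or to an
ontology definition, so all three kinds of text end up as comparable vectors
without re-creating the index.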

This could be done easily by re-creating the index with custom documents,
but I don’t want to rebuild the index every time I create new documents;
that would be too computationally intensive.

-- 
Regards

Kasun Perera