Posted to dev@hbase.apache.org by Junaid Surve <ju...@gmail.com> on 2012/05/22 12:54:16 UTC
Mahout's Text Similarity using HBase
Hello
In my project we are trying to calculate the text similarity of a set of
documents, and I am facing two issues.
1.
I do not want to recalculate the Term Frequency (TF) of documents I have
already processed. For example, say I have 10 docs and have calculated the
Term Frequency and Inverse Document Frequency (IDF) for all 10. Then I get
2 more documents. I do not want to recalculate the TF for the existing 10
documents. I only want to calculate the TF for the 2 new ones, and then use
the TFs of all 12 documents to calculate the IDF over the 12 documents as a
whole.
*How can I calculate the IDF over all the documents without recalculating
the TFs of the existing docs?*
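
To make the first issue concrete, here is a minimal sketch of the incremental
idea, not Mahout code. It assumes whitespace tokenization and the textbook
formula idf(t) = log(N / df(t)); Mahout's own weighting may differ. The class
name IncrementalTfIdf and its methods are hypothetical. The point is that only
a per-term document-frequency counter has to change when new docs arrive, so
the stored TFs of old docs are never touched.

```python
import math
from collections import Counter


class IncrementalTfIdf:
    """Hypothetical helper: keeps per-document TFs and a running DF counter."""

    def __init__(self):
        self.tf = {}         # doc_id -> Counter of raw term counts
        self.df = Counter()  # term -> number of documents containing the term

    def add_document(self, doc_id, text):
        # Skip documents whose TF is already stored.
        if doc_id in self.tf:
            return
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        counts = Counter(text.lower().split())
        self.tf[doc_id] = counts
        # Each distinct term in the new doc raises its document frequency by 1.
        self.df.update(counts.keys())

    def idf(self, term):
        # Textbook IDF over all documents seen so far; 0.0 for unseen terms.
        if self.df[term] == 0:
            return 0.0
        return math.log(len(self.tf) / self.df[term])

    def tfidf(self, doc_id):
        # Weights are derived on demand from the stored TF and the current IDF.
        return {t: c * self.idf(t) for t, c in self.tf[doc_id].items()}


if __name__ == "__main__":
    model = IncrementalTfIdf()
    model.add_document("d1", "apple banana")
    model.add_document("d2", "apple cherry")
    model.add_document("d3", "banana banana")  # new doc; d1 and d2 are not re-read
    print(model.tfidf("d3"))
```

With this shape, the IDF is always computed from the number of stored
documents and the DF counter, so adding 2 docs to an existing 10 costs only
the tokenization of the 2 new ones.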
2.
The number of documents might grow, which means the in-memory approach
(InMemoryBayesDatastore) might become cumbersome. What I want is to save the
TF of all the documents in an HBase table. When new documents arrive, I
calculate the TF of the new documents only, save them in the HBase table,
and then read the TFs of all the documents from that table to calculate the
IDF.
*How can I use HBase to provide data to Mahout's Text Similarity instead
of fetching it from the sequence file?*
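
As a rough illustration of the table layout described in the second issue,
the sketch below uses an in-memory stand-in rather than a real HBase client.
The layout (one row per document, one column qualifier per term, the cell
value holding the TF) is an assumption, as are the names TfTable and
idf_from_table. It shows only the data flow; it does not show how to wire
such a table into Mahout.

```python
import math
from collections import defaultdict


class TfTable:
    """In-memory stand-in for a table: row key -> {qualifier: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, columns):
        # Assumed layout: row key = document id, qualifier = term, value = TF.
        self._rows.setdefault(row_key, {}).update(columns)

    def scan(self):
        # Yield rows in sorted key order, as a table scan would.
        for key in sorted(self._rows):
            yield key, dict(self._rows[key])


def idf_from_table(table):
    """One pass over the stored TF rows; no document text is re-read."""
    n_docs = 0
    df = defaultdict(int)
    for _row_key, columns in table.scan():
        n_docs += 1
        for term in columns:
            df[term] += 1
    # Textbook IDF, log(N / df); an assumption, not Mahout's exact weighting.
    return {term: math.log(n_docs / count) for term, count in df.items()}


if __name__ == "__main__":
    table = TfTable()
    table.put("doc1", {"hbase": 2, "mahout": 1})
    table.put("doc2", {"hbase": 1})
    table.put("doc3", {"mahout": 3})  # newly arrived document
    print(idf_from_table(table))
```

In a real deployment the scan would run against the HBase table, and the
resulting per-document vectors would still have to be handed to Mahout in
whatever input format the similarity job expects.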
--
Regards
Junaid