Posted to dev@hbase.apache.org by Junaid Surve <ju...@gmail.com> on 2012/05/22 12:54:16 UTC

Mahout's Text Similarity using HBase

Hello

In my project we are trying to calculate the text similarity of a set of
documents, and I am facing two issues.

   1. I do not want to recalculate the Term Frequency (TF) of documents I
      have already processed. For example, I have 10 documents and have
      calculated the TF and Inverse Document Frequency (IDF) for all of
      them. Then I get 2 more documents. I want to calculate the TF only
      for the 2 new documents, then use the TFs of all 12 documents to
      calculate the IDF for the 12 documents as a whole.

      *How can I calculate the IDF over all the documents without
      recalculating the TFs of the existing documents?*

   2. The number of documents may grow, which means the in-memory approach
      (InMemoryBayesDatastore) might become cumbersome. I want to save the
      TF of all documents in an HBase table. When new documents arrive, I
      would calculate their TF, save it in the same table, and then read
      the TFs of all documents from that table to calculate the IDF.

      *How can I use HBase to provide data to Mahout's Text Similarity
      instead of fetching it from the sequence file?*
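
To make the intended workflow concrete, here is a minimal sketch of the
bookkeeping described in both points. It is not Mahout or HBase code: the
class IncrementalTfIdf and all of its methods are hypothetical names, the
two in-memory maps only stand in for an HBase table (one row per document
with its term counts, plus a per-term document-frequency counter), and the
IDF uses the textbook form ln(N / df), which may differ from the variant
Mahout applies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: stores per-document term counts and corpus-wide document frequencies. */
class IncrementalTfIdf {
    // docId -> (term -> raw count); assumption: stand-in for an HBase table keyed by document id
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    // term -> number of documents containing the term; assumption: stand-in for a stored counter
    private final Map<String, Integer> docFreq = new HashMap<>();

    /** Computes the TF of one new document only; already stored documents are not re-read. */
    void addDocument(String docId, List<String> tokens) {
        if (termCounts.containsKey(docId)) {
            throw new IllegalArgumentException("document already indexed: " + docId);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        termCounts.put(docId, counts);
        // Each distinct term in the new document raises its document frequency by one.
        for (String term : counts.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    int documentCount() {
        return termCounts.size();
    }

    int termFrequency(String docId, String term) {
        Map<String, Integer> counts = termCounts.get(docId);
        return counts == null ? 0 : counts.getOrDefault(term, 0);
    }

    int documentFrequency(String term) {
        return docFreq.getOrDefault(term, 0);
    }

    /** Textbook IDF, ln(N / df), derived from the stored counters; 0 for an unseen term. */
    double idf(String term) {
        int df = documentFrequency(term);
        return df == 0 ? 0.0 : Math.log((double) documentCount() / df);
    }

    /** TF-IDF weight computed on demand from the stored TF and the current IDF. */
    double tfIdf(String docId, String term) {
        return termFrequency(docId, term) * idf(term);
    }
}
```

With this layout, adding the 2 new documents touches only their own term
counts and the document-frequency counters of the terms they contain. The
IDF for all 12 documents then follows from the document count and the
stored document frequencies, so the TFs of the first 10 documents are
never recomputed.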


-- 
Regards
Junaid