Posted to user@mahout.apache.org by Angel Luis Scull <as...@facinf.uho.edu.cu> on 2013/11/26 21:50:58 UTC

Document vector

Hello, I'm trying to use Mahout in a Topic Detection and Tracking (TDT) system.
Currently I'm working on the tracking task of TDT, and I need to implement the
following algorithm using Mahout:

1 Th = set of training documents
2 VTd = the vector representation of Th
3 For each document D in the stream of documents (the number of documents
is unknown)
     do
         (a) Use D to update the idf statistics
         (b) Apply tf*idf to VD and to VTd (where VD is the vector
representation of document D)
         (c) Compute the similarity between VD and VTd
         and so on ...
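
To make the steps above concrete, here is a minimal sketch in plain Java. It
does not use Mahout at all: sparse vectors are modelled as HashMaps keyed by
term, and the class name TrackingSketch, the whitespace tokenizer, the
smoothed idf formula and the choice of cosine similarity are all assumptions
made for illustration, not something Mahout or TDT prescribes. It also assumes
that the training documents are counted in the idf statistics.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; sparse vectors are term -> weight maps.
class TrackingSketch {
    // Document frequency per term, and the number of documents seen so far.
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    /** Raw term counts of a lower-cased, whitespace-tokenized text (assumed tokenizer). */
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        return tf;
    }

    /** Step (a): update the idf statistics with one more document. */
    void updateIdf(Map<String, Double> tf) {
        numDocs++;
        for (String term : tf.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    /** Step (b): weight raw counts with a smoothed idf (assumed formula). */
    Map<String, Double> tfIdf(Map<String, Double> tf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0;
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    /** Step (c): cosine similarity; returns 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        TrackingSketch tracker = new TrackingSketch();

        // Steps 1-2: build VTd as the summed term counts of the training set Th.
        String[] training = {"earthquake hits city", "earthquake damages buildings"};
        Map<String, Double> topicTf = new HashMap<>();
        for (String doc : training) {
            Map<String, Double> tf = termFrequencies(doc);
            tracker.updateIdf(tf);
            tf.forEach((term, count) -> topicTf.merge(term, count, Double::sum));
        }

        // Step 3: process the stream one document at a time.
        String[] stream = {"city rebuilds after earthquake", "football final tonight"};
        for (String doc : stream) {
            Map<String, Double> docTf = termFrequencies(doc);
            tracker.updateIdf(docTf);                               // (a)
            Map<String, Double> vd = tracker.tfIdf(docTf);          // (b)
            Map<String, Double> vtd = tracker.tfIdf(topicTf);       // (b)
            System.out.println(doc + " -> " + cosine(vd, vtd));     // (c)
        }
    }
}
```

Keeping the raw term counts separate from the idf statistics means that VTd
can be re-weighted every time a new document changes the idf values, which is
what step (b) asks for. The same structure should carry over to Mahout's
sparse vectors once each term is mapped to an integer index.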

My problem comes when I try to create a RandomAccessSparseVector. I don't
know how to create that vector from a sequence file that contains the
current document in the stream.

Thanks in advance.