Posted to user@mahout.apache.org by Angel Luis Scull <as...@facinf.uho.edu.cu> on 2013/11/26 21:50:58 UTC
Document vector
Hello, I'm trying to use Mahout in a Topic Detection and Tracking (TDT) system.
Currently I'm working on the Tracking task of TDT, and I need to implement the
following algorithm using Mahout:
1 Th = set of training documents
2 VTd = the vector representation of Th
3 For each document D in the stream of documents (the number of documents
is unknown)
do
(a) Use D to update the idf statistics
(b) Apply tf*idf to VD and to VTd (where VD is the vector
representation of document D)
(c) Compute the similarity between VD and VTd
and so on ...
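
To make the steps concrete, here is a minimal sketch of the loop in plain Java. It does not use the Mahout API. The class name TrackingSketch, its methods, the tokenizer (lowercase, split on non-alphanumeric characters), the smoothed idf formula log((N+1)/(df+1)) + 1, the choice of cosine as the similarity, and the choice to build VTd by summing the term counts of all documents in Th are all assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper written with java.util only; not part of Mahout.
class TrackingSketch {
    // term -> number of documents seen so far that contain the term
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;
    // Raw term counts summed over Th (VTd before tf*idf weighting; an assumption).
    private final Map<String, Double> topicTf = new HashMap<>();

    // Steps 1 and 2: build VTd from the training documents Th.
    TrackingSketch(List<String> trainingDocs) {
        for (String doc : trainingDocs) {
            Map<String, Double> tf = termFrequencies(doc);
            updateIdf(tf);
            tf.forEach((term, count) -> topicTf.merge(term, count, Double::sum));
        }
    }

    // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        return tf;
    }

    // Step (a): update the document frequencies with the terms of one document.
    private void updateIdf(Map<String, Double> tf) {
        numDocs++;
        for (String term : tf.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    // Step (b): weight raw term counts with the current idf statistics.
    private Map<String, Double> tfIdf(Map<String, Double> tf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0; // assumed smoothing
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    // Step (c): cosine similarity between two sparse vectors stored as maps.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Step 3 for one incoming document D: (a) update idf, (b) re-weight VD and VTd, (c) compare.
    double score(String document) {
        Map<String, Double> tf = termFrequencies(document);
        updateIdf(tf);
        return cosine(tfIdf(tf), tfIdf(topicTf));
    }

    public static void main(String[] args) {
        TrackingSketch tracker = new TrackingSketch(
                java.util.Arrays.asList("mahout vector clustering", "sparse vector mahout"));
        System.out.println(tracker.score("mahout sparse vector"));
        System.out.println(tracker.score("football match result"));
    }
}
```

Because VTd is kept as raw counts and re-weighted on every call, the idf update in step (a) affects both VD and VTd, as step (b) requires. In Mahout, the maps would presumably be replaced by sparse vectors indexed through a term dictionary.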
My problem comes when I try to build a RandomAccessSparseVector. I don't
know how to create that vector from a sequence file that contains the
current document in the stream.
Thanks in advance.