Posted to dev@mahout.apache.org by Divya <di...@k2associates.com.sg> on 2010/10/26 08:10:49 UTC

generate document-document similarity matrix

Hi,

I am a new Mahout user, using Mahout 0.4 with Eclipse.

I need to generate a document similarity matrix from the vector files which I
have already created using SparseVectorsFromSequenceFiles.

That step gave me the following directory structure:

-> df-count
-> tfidf-vectors
-> tf-vectors
-> tokenized-documents
-> wordcount
-> .dictionary.file-0.crc
-> .frequency.file-0.crc
-> dictionary.file-0
-> frequency.file-0

I am now confused about which of these to use as input.

Which Mahout utility computes a document-document similarity matrix?

Can anyone help me?

Regards,

Divya


RE: generate document-document similarity matrix

Posted by Divya <di...@k2associates.com.sg>.
Right now I have only a few documents.
I just want to know what kind of similarity it generates,
as I have no idea on what basis it computes similarity.

-----Original Message-----
From: Sebastian Schelter [mailto:ssc@apache.org] 
Sent: Tuesday, October 26, 2010 2:37 PM
To: dev@mahout.apache.org
Subject: Re: generate document-document similarity matrix

Hi,

How many documents do you have, and what kind of similarity do you want to use?

--sebastian


Re: generate document-document similarity matrix

Posted by Sebastian Schelter <ss...@apache.org>.
Hi,

How many documents do you have, and what kind of similarity do you want to use?

--sebastian

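
To make the question "what kind of similarity" concrete, here is a minimal sketch of one common choice, cosine similarity, computed pairwise over a handful of sparse document vectors. It is plain Java and does not use any Mahout API. The class name CosineSimilarityMatrix, the termId -> weight map representation, and the sample weights are all hypothetical, chosen only for illustration. It also assumes that the TF-IDF weighted vectors (the tfidf-vectors output listed above) are the ones being compared, which the thread itself does not confirm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: not part of Mahout.
public class CosineSimilarityMatrix {

    // Euclidean length of a sparse vector stored as termId -> weight.
    public static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0 if either vector is empty.
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Walk the smaller map so the dot product touches fewer entries.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Symmetric document-document matrix; entry [i][j] compares docs i and j.
    public static double[][] similarityMatrix(List<Map<Integer, Double>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s; // cosine similarity is symmetric
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Made-up weights for three tiny documents.
        List<Map<Integer, Double>> docs = new ArrayList<>();
        Map<Integer, Double> d0 = new HashMap<>();
        d0.put(0, 1.0);
        d0.put(1, 2.0);
        Map<Integer, Double> d1 = new HashMap<>();
        d1.put(0, 2.0);
        d1.put(1, 4.0);
        Map<Integer, Double> d2 = new HashMap<>();
        d2.put(2, 3.0);
        docs.add(d0);
        docs.add(d1);
        docs.add(d2);

        for (double[] row : similarityMatrix(docs)) {
            StringBuilder line = new StringBuilder();
            for (double s : row) {
                line.append(String.format(Locale.ROOT, "%.3f ", s));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

This all-pairs loop is quadratic in the number of documents, so it only suits a small collection like the "few documents" mentioned above. For a large corpus a distributed job would be the more likely fit.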