Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2010/10/28 10:27:47 UTC
Re: generate similar documents
Hi Divya,
--similarityClassname should point to an implementation of
org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity.
You can pass any value from
org.apache.mahout.math.hadoop.similarity.SimilarityType to use a
predefined similarity measure, or you can point to an implementation of
your own.

--numberOfColumns is the number of columns of the input matrix. That
would be the number of unique terms, since I assume your matrix is
documents x terms.
--sebastian
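
To make the "number of unique terms" point concrete, here is a minimal, self-contained sketch. It does not use Mahout at all: the class name ColumnCount, the method buildDictionary, and the toy documents are made up for illustration. It only assumes that each distinct term gets one column of the documents x terms matrix, so the dictionary size is the value one would pass as --numberOfColumns.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: shows how the column count of a
// documents x terms matrix relates to the number of unique terms.
class ColumnCount {

    /** Assigns each distinct term a column index in order of first appearance. */
    public static Map<String, Integer> buildDictionary(List<List<String>> documents) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (List<String> document : documents) {
            for (String term : document) {
                // The next free column index equals the current dictionary size.
                dictionary.putIfAbsent(term, dictionary.size());
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // Toy corpus (assumption): three already-tokenized documents.
        List<List<String>> documents = Arrays.asList(
            Arrays.asList("mahout", "similar", "documents"),
            Arrays.asList("similar", "rows", "matrix"),
            Arrays.asList("mahout", "matrix"));

        Map<String, Integer> dictionary = buildDictionary(documents);
        System.out.println("rows (documents): " + documents.size());
        System.out.println("numberOfColumns (unique terms): " + dictionary.size());
    }
}
```

In a real pipeline the dictionary would come from the vectorization step rather than from code like this; the sketch only illustrates what the number means.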
On 28.10.2010 10:11, Divya wrote:
> Hi,
>
> I have a directory of documents from which I generated a sequence file
> using SequenceFilesFromDirectory, and then converted it into vectors
> using SparseVectorsFromSequenceFiles.
>
> I am now following the link below to generate a list of the most
> similar documents:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED.6070703@googlemail.com%3E
>
> How can I use RowSimilarityJob to generate a list of similar documents?
>
> * <ol>
> * <li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
> * SequenceFile<IntWritable,VectorWritable></li>
> * <li>-Dmapred.output.dir=(path): output path where the computations output should go
> * (a {@link DistributedRowMatrix} stored as a SequenceFile<IntWritable,VectorWritable>)</li>
> * <li>--numberOfColumns: the number of columns in the input matrix</li>
> * <li>--similarityClassname (classname): an implementation of {@link DistributedVectorSimilarity}
> * used to compute the similarity</li>
> * <li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (100)</li>
> * </ol>
>
> Which values should I pass for numberOfColumns and similarityClassname?
>
> Regards,
>
> Divya
Re: generate similar documents
Posted by Sebastian Schelter <ss...@apache.org>.
You have to supply that number. However, if the similarity computation
does not use it (only SIMILARITY_LOGLIKELIHOOD does), you can safely
pass in any number.
--sebastian
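
As a rough illustration of why a log-likelihood measure would need the column count, here is a self-contained sketch of the standard log-likelihood ratio (G-test) on a 2x2 contingency table. It is not taken from Mahout and may differ from what SIMILARITY_LOGLIKELIHOOD actually computes: the class LogLikelihoodSketch and the method rowSimilarity are hypothetical, and the sketch assumes the fourth cell of the table (columns that are zero in both rows) is derived as numberOfColumns minus the other three counts.

```java
// Hypothetical sketch, not Mahout code: log-likelihood ratio between two
// sparse rows, where the "neither" cell depends on the total column count.
class LogLikelihoodSketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: N ln N - sum(k ln k). */
    private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long count : counts) {
            result += xLogX(count);
            sum += count;
        }
        return xLogX(sum) - result;
    }

    /** G statistic for the 2x2 table [[k11, k12], [k21, k22]]. */
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /**
     * both: columns non-zero in both rows; onlyA / onlyB: columns non-zero in
     * exactly one row. The remaining cell is derived from numberOfColumns.
     */
    public static double rowSimilarity(long both, long onlyA, long onlyB, long numberOfColumns) {
        long neither = numberOfColumns - both - onlyA - onlyB;
        if (neither < 0) {
            throw new IllegalArgumentException("numberOfColumns is smaller than the observed columns");
        }
        return logLikelihoodRatio(both, onlyA, onlyB, neither);
    }

    public static void main(String[] args) {
        // Same overlap, different column counts: the score changes.
        System.out.println(rowSimilarity(5, 2, 3, 20));
        System.out.println(rowSimilarity(5, 2, 3, 1000));
    }
}
```

Under these assumptions the same overlap between two rows scores differently depending on the column count, which is consistent with the advice above that the value only matters when the log-likelihood measure is used.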
On 28.10.2010 12:02, Divya wrote:
> Hi Sebastian,
> Where can I get numberOfColumns?
> How can I calculate how many columns my matrix has, given that
> SparseVectorsFromSequenceFiles generates vectors in binary format?
>
> Regards,
> Divya
RE: generate similar documents
Posted by Divya <di...@k2associates.com.sg>.
Hi Sebastian,
Where can I get numberOfColumns?
How can I calculate how many columns my matrix has, given that
SparseVectorsFromSequenceFiles generates vectors in binary format?
Regards,
Divya