Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2010/10/28 10:27:47 UTC

Re: generate similar documents

Hi Divya,

--similarityClassname should point to an implementation of 
org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity. 
You can either use any value from 
org.apache.mahout.math.hadoop.similarity.SimilarityType to select a 
predefined similarity measure, or point to an implementation of 
your own.
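
As a rough illustration, the options could be assembled along the following lines. This is only a sketch: the option names are taken from the RowSimilarityJob Javadoc quoted further down in this thread, while the class RowSimilarityArgs, the paths, and the column count are hypothetical placeholders. How the list is then handed to the job is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: it only builds the argument list.
class RowSimilarityArgs {

    // Assembles the options named in the RowSimilarityJob Javadoc.
    static List<String> build(String inputDir, String outputDir, int numberOfColumns,
                              String similarityClassname, int maxSimilaritiesPerRow) {
        List<String> options = new ArrayList<>();
        options.add("-Dmapred.input.dir=" + inputDir);
        options.add("-Dmapred.output.dir=" + outputDir);
        options.add("--numberOfColumns");
        options.add(Integer.toString(numberOfColumns));
        // Either a SimilarityType value or the name of your own implementation.
        options.add("--similarityClassname");
        options.add(similarityClassname);
        options.add("--maxSimilaritiesPerRow");
        options.add(Integer.toString(maxSimilaritiesPerRow));
        return options;
    }

    public static void main(String[] unused) {
        // Placeholder paths and a made-up column count, for illustration only.
        List<String> options = build("/path/to/vectors", "/path/to/similarities",
                50000, "SIMILARITY_LOGLIKELIHOOD", 100);
        System.out.println(String.join(" ", options));
    }
}
```

Passing a SimilarityType name such as SIMILARITY_LOGLIKELIHOOD selects a predefined measure; a fully qualified class name would go in the same position for a custom one.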

--numberOfColumns is the number of columns of the input matrix. That 
would be the number of unique terms, as I suppose your matrix is 
documents x terms.
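
To make "number of unique terms" concrete, here is a small sketch that counts the distinct terms over a handful of already tokenized documents. The class ColumnCount and the sample documents are made up for illustration. A real vectorizer may drop or merge terms, so the number that applies to your data is the one implied by the vectors it actually produced.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
class ColumnCount {

    // In a documents x terms matrix every distinct term gets one column.
    static int numberOfColumns(List<List<String>> documents) {
        Set<String> vocabulary = new HashSet<>();
        for (List<String> terms : documents) {
            vocabulary.addAll(terms);
        }
        return vocabulary.size();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("apache", "mahout"),
                Arrays.asList("mahout", "hadoop", "mahout"));
        // Distinct terms: apache, mahout, hadoop
        System.out.println(numberOfColumns(documents)); // prints 3
    }
}
```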

--sebastian

On 28.10.2010 10:11, Divya wrote:
> Hi,
>
> I have a directory of documents from which I generated a sequence file
> using SequenceFilesFromDirectory, and then converted it into vectors
> using SparseVectorsFromSequenceFiles.
>
> I am now referring to the link below to generate a list of the most
> similar documents:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED.6070703@googlemail.com%3E
>
> How can I use RowSimilarityJob to generate a list of similar documents?
>
>
>
> <ol>
>   <li>-Dmapred.input.dir=(path): Directory containing a
>       {@link DistributedRowMatrix} as a
>       SequenceFile<IntWritable,VectorWritable></li>
>   <li>-Dmapred.output.dir=(path): output path where the computations
>       output should go (a {@link DistributedRowMatrix} stored as a
>       SequenceFile<IntWritable,VectorWritable>)</li>
>   <li>--numberOfColumns: the number of columns in the input matrix</li>
>   <li>--similarityClassname (classname): an implementation of
>       {@link DistributedVectorSimilarity} used to compute the
>       similarity</li>
>   <li>--maxSimilaritiesPerRow (integer): cap the number of similar rows
>       per row to this number (100)</li>
> </ol>
>
>
>
> Which values should I pass for numberOfColumns and similarityClassname?
>
> Regards,
> Divya

Re: generate similar documents

Posted by Sebastian Schelter <ss...@apache.org>.

You have to supply that number. However, if the similarity measure you 
use does not need it in its computation (only SIMILARITY_LOGLIKELIHOOD 
uses it), you can safely ignore it and pass in any number.
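
The following sketch suggests why a log-likelihood measure would need the column count. It assumes the usual 2x2 contingency-table formulation of the log-likelihood ratio for two rows of a binary matrix, where the number of columns set in neither row can only be derived from the total. The class LlrSketch and the sample counts are hypothetical, and this is not Mahout's own code.

```java
// Hypothetical illustration, not Mahout's implementation.
class LlrSketch {

    // x * ln(x), with the usual convention 0 * ln(0) = 0
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(k ln k)
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long count : counts) {
            parts += xLogX(count);
            sum += count;
        }
        return xLogX(sum) - parts;
    }

    // cooccurrences: columns set in both rows.
    // occurrencesA / occurrencesB: columns set in row A / row B.
    static double llr(long cooccurrences, long occurrencesA, long occurrencesB,
                      long numberOfColumns) {
        long k11 = cooccurrences;
        long k12 = occurrencesA - cooccurrences;
        long k21 = occurrencesB - cooccurrences;
        // Columns set in neither row: this is where the column count enters.
        long k22 = numberOfColumns - occurrencesA - occurrencesB + cooccurrences;
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Same two rows, different assumed column counts.
        System.out.println(llr(10, 20, 30, 100));
        System.out.println(llr(10, 20, 30, 1000));
    }
}
```

With identical overlap between the two rows, the two calls give different scores, so under this formulation an arbitrary column count would distort a log-likelihood result even though other measures never look at it.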

--sebastian

On 28.10.2010 12:02, Divya wrote:
> Hi Sebastian,
> Where can I get the numberOfColumns from?
> How can I work out how many columns my matrix has, given that
> SparseVectorsFromSequenceFiles generates vectors in binary format?
>
> Regards,
> Divya
>

RE: generate similar documents

Posted by Divya <di...@k2associates.com.sg>.

Hi Sebastian,
Where can I get the numberOfColumns from?
How can I work out how many columns my matrix has, given that
SparseVectorsFromSequenceFiles generates vectors in binary format?

Regards,
Divya 
