Posted to user@mahout.apache.org by Sören Brunk <so...@deri.org> on 2011/11/08 17:33:00 UTC
RowSimilarityJob input
Hi,
I'm trying to use RowSimilarityJob (current trunk) to calculate pairwise
similarities between feature vectors but I'm struggling a bit with the
correct input format.
I used SparseVectorsFromSequenceFiles to create a bunch of vectors from
documents. But using the tfidf vectors directly as input doesn't work,
because that job writes vectors with Text (string) keys, while
RowSimilarityJob seems to expect IntWritable keys.
I've also seen something about DistributedRowMatrix as input in some
older docs.
Any hints? Is RowSimilarityJob a good choice for that task at all?
Thanks for your help,
Sören
Re: RowSimilarityJob input
Posted by Sören Brunk <so...@deri.org>.
OK, after simply converting the vector keys from Text to IntWritable, it
worked fine for me.
It took a while, but it ran only on my local machine with default
vectorization settings and almost no preprocessing, so there's much room
for improvement.
Thanks for your help!
Sören
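For readers following along, the key conversion described above can be sketched roughly as follows. This is not the code used in this thread. It is a minimal illustration that uses plain Java collections in place of Hadoop's Text and IntWritable and Mahout's VectorWritable, and the class name RowKeyIndexer and its methods are made up for the example. The idea is to give every string-keyed row a sequential int id and to keep an index so the ids can be mapped back to document names afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: replace string row keys with sequential int ids (hypothetical helper). */
public class RowKeyIndexer {

    /** Rows keyed by int id, plus a lookup from id back to the original key. */
    public static final class Indexed {
        public final Map<Integer, double[]> rows = new LinkedHashMap<>();
        public final List<String> idToKey = new ArrayList<>();
    }

    /** Assigns ids 0, 1, 2, ... in the iteration order of the input map. */
    public static Indexed index(Map<String, double[]> rowsByName) {
        Indexed out = new Indexed();
        for (Map.Entry<String, double[]> e : rowsByName.entrySet()) {
            int id = out.idToKey.size();      // next free row id
            out.idToKey.add(e.getKey());      // remember which document this id stands for
            out.rows.put(id, e.getValue());   // same vector, new int key
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-ins for tfidf vectors keyed by document name (made-up values).
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("/docs/a.txt", new double[] {0.0, 1.2, 0.4});
        docs.put("/docs/b.txt", new double[] {0.7, 0.0, 0.4});

        Indexed indexed = index(docs);
        for (int id = 0; id < indexed.idToKey.size(); id++) {
            System.out.println(id + " <- " + indexed.idToKey.get(id));
        }
    }
}
```

In a real pipeline the same loop would presumably read (Text, VectorWritable) pairs from one SequenceFile and write (IntWritable, VectorWritable) pairs to another, with the id-to-name index stored separately so similarity results can be traced back to documents.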
On 08/11/11 16:45, Sebastian Schelter wrote:
> Hi Sören,
>
> RowSimilarityJob expects IntWritable,VectorWritable as input. It should
> be a reasonable choice for comparing the pairwise similarities between
> text documents. I suggest you throw away the 1% most frequent terms as
> described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I
> think SparseVectorsFromSequenceFiles is already doing that by default.
>
> Would be great if you let the mailing list know how it worked out for you.
>
> Greetings to Galway!
> Sebastian
Re: RowSimilarityJob input
Posted by Sebastian Schelter <ss...@apache.org>.
Hi Sören,
RowSimilarityJob expects IntWritable,VectorWritable as input. It should
be a reasonable choice for comparing the pairwise similarities between
text documents. I suggest you throw away the 1% most frequent terms as
described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I
think SparseVectorsFromSequenceFiles is already doing that by default.
Would be great if you let the mailing list know how it worked out for you.
Greetings to Galway!
Sebastian
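To make the pruning suggestion concrete, here is a small sketch of dropping the most frequent terms before computing similarities. It assumes "most frequent" means highest document frequency (the number of documents a term occurs in), and it works on plain Java sets of terms rather than Mahout vectors. The class FrequentTermPruner and its methods are hypothetical names for this example, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: drop the top fraction of terms by document frequency (hypothetical helper). */
public class FrequentTermPruner {

    /** Returns the terms with the highest document frequency, up to the given fraction of the vocabulary. */
    public static Set<String> mostFrequentTerms(List<Set<String>> docs, double fraction) {
        // Count in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        // Sort by descending document frequency; ties are broken alphabetically.
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        // Round up, so any non-zero fraction removes at least one term.
        int cut = (int) Math.ceil(terms.size() * fraction);
        return new HashSet<>(terms.subList(0, cut));
    }

    /** Returns copies of the documents with the most frequent terms removed. */
    public static List<Set<String>> prune(List<Set<String>> docs, double fraction) {
        Set<String> drop = mostFrequentTerms(docs, fraction);
        List<Set<String>> out = new ArrayList<>();
        for (Set<String> doc : docs) {
            Set<String> kept = new HashSet<>(doc);
            kept.removeAll(drop);
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = new ArrayList<>();
        docs.add(new HashSet<>(Arrays.asList("the", "mahout", "job")));
        docs.add(new HashSet<>(Arrays.asList("the", "vector")));
        docs.add(new HashSet<>(Arrays.asList("the", "mahout", "row")));
        // 0.01 matches the 1% from the suggestion above; with a tiny vocabulary it removes one term.
        System.out.println(prune(docs, 0.01));
    }
}
```

Very frequent terms appear in many documents, so they generate a large share of the candidate document pairs while contributing little to telling documents apart, which is why removing them tends to speed up a pairwise comparison considerably.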