Posted to user@mahout.apache.org by Dan Brickley <da...@danbri.org> on 2011/10/17 14:43:47 UTC

rowsimilarity and non-sparse input

I understand from https://issues.apache.org/jira/browse/MAHOUT-767
that the rowsimilarity job was recently improved, to handle
less-sparse input.

This week I had some fun using rowsimilarity with a matrix of items
(books) and librarian-assigned topic codes, to generate (and finally
prune) similarities that could be fed into Gephi for visualization.
With relatively little hacking, and fairly modest initial data (100k
items) it worked pretty fine, even on a laptop, and gave a rather
geographical 'map' of books clustered by similar topics.

See pretty pics and blather at http://danbri.org/words/2011/10/11/720
... (btw I'm quite happy with what I got back from Gephi given < 1
day's time, and encourage others to investigate the tool.)

So --- as I said in
http://www.mail-archive.com/user@mahout.apache.org/msg06602.html ---
flushed with initial success, I revisited svd/lanczos looking for a
more sophisticated analysis of the item/topic associations. Initially I
hit some boring problems with the rowid job that was needed before I
could transpose. But getting past that, I am now trying to run
'rowsimilarity' against the output of Lanczos, where my rows are items
(TV shows this time, but may as well be books) and my columns are an
SVD-reorganized view of topic space.

I started this way (after having seqdirectory'd, then rowid'd and
transposed the original input):

mahout rowsimilarity \
  --input lonclassland/postsvd/transpose-128/part-00000 \
  --output lonclassland/svdsims2 \
  -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=18 \
  --numberOfColumns 280 \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD

...before reading/realising that rowsimilarity prefers sparse data.
And indeed its progress running on a cluster seems glacial.

Looking at MAHOUT-767 and 'bin/mahout rowsimilarity --help', I don't
see any obvious way forward.

* Would choosing a different similarity measure make any big
difference? (I'd guess for cosine...)
* Should I experiment with values for --threshold ?
* Or somehow try to "re-sparsify" the input first? If I read the
output of 'mahout seqdumper --seqFile postsvd/transpose-128/part-00000'
correctly, there are many very small values; can they be
approximated to zero and discarded somehow?
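
To make the "re-sparsify" idea in the last bullet concrete, here is a minimal Python sketch. It assumes each row has already been read out of the sequence file into a plain list of floats (that step is not shown). The function name `sparsify` and the cutoff `eps` are hypothetical illustrations, not Mahout options, and the sample values are made up.

```python
def sparsify(row, eps=1e-3):
    """Return {column_index: value}, keeping only entries with |value| >= eps.

    Hypothetical helper: 'eps' is an arbitrary cutoff chosen for illustration.
    """
    return {j: v for j, v in enumerate(row) if abs(v) >= eps}


# Made-up row of post-SVD values with several near-zero entries.
row = [0.41, -0.0002, 0.0, 0.07, 0.00009, -0.35]
sparse_row = sparsify(row, eps=0.01)
print(sparse_row)  # only columns 0, 3 and 5 survive the cutoff
```

Whether discarding small values keeps the similarity structure intact would need checking on real data; the sketch only shows the mechanics of the cutoff.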

My high level goal is, for each item, to find a handful of the most
similar items, and then feed that to Gephi to generate topical maps of
the 'item landscape', grouping like items together. My intuition was
that doing this post-SVD might give a deeper insight into what the
bulk of these item/topic associations tell us, compared to doing
rowsimilarity against the raw item/topic matrix. However, the tool I
found to explore this, rowsimilarity, seems not to be the right fit
here.
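
The goal described above (a handful of nearest neighbours per item, exported as weighted edges) can be sketched in plain Python for small inputs. This is an illustration only, not how rowsimilarity works internally. It assumes cosine similarity, rows held in memory as a dict of float lists, and hypothetical helper names (`top_k_similar`, `to_edges`). The brute-force loop is still quadratic in the number of items.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_similar(rows, k=3):
    """rows: {item_id: [floats]} -> {item_id: [(score, other_id), ...]}, best first."""
    result = {}
    for i, vi in rows.items():
        scored = ((cosine(vi, vj), j) for j, vj in rows.items() if j != i)
        result[i] = heapq.nlargest(k, scored)
    return result


def to_edges(neighbours):
    """Flatten to (source, target, weight) tuples, e.g. for a CSV edge list."""
    return [(i, j, round(s, 4)) for i, nbrs in neighbours.items() for s, j in nbrs]


# Made-up toy data: three items in a two-dimensional reduced topic space.
rows = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
edges = to_edges(top_k_similar(rows, k=1))
```

Keeping only k neighbours per item also bounds the edge count at k times the number of items, which keeps the graph handed to a visualization tool small.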

Thanks for any pointers,

Dan

Re: rowsimilarity and non-sparse input

Posted by Dan Brickley <da...@danbri.org>.
On 17 October 2011 14:47, Sebastian Schelter <ss...@googlemail.com> wrote:
> RowSimilarityJob will have quadratic runtime for dense input and might
> generate large intermediate outputs. I'd argue against using it for such
> purposes.

Thanks, that's clear. I wonder if there's any way the Lanczos output
can be sparse-ified...

Or maybe if I ruthlessly prune down to only a handful of columns.
Would changing the similarity measure make any significant difference?

Dan

Re: rowsimilarity and non-sparse input

Posted by Sebastian Schelter <ss...@googlemail.com>.
RowSimilarityJob will have quadratic runtime for dense input and might
generate large intermediate outputs. I'd argue against using it for such
purposes.
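
A back-of-the-envelope sketch of why dense input hurts, under the assumption (not verified against the Mahout source) that the job pairs up every two rows sharing a non-zero entry in a column. The row count of 100,000 and the figure of 50 non-zeros per column are illustrative; only the 280 columns come from the command in this thread.

```python
def cooccurrence_pairs(nonzeros_per_column):
    """Sum of c*(c-1)/2 over columns, where c = non-zero entries in that column."""
    return sum(c * (c - 1) // 2 for c in nonzeros_per_column)


n_rows, n_cols = 100_000, 280  # n_rows is an illustrative figure

# Dense: every row has a value in every column.
dense = cooccurrence_pairs([n_rows] * n_cols)

# Sparse: suppose only 50 rows touch each column (made-up figure).
sparse = cooccurrence_pairs([50] * n_cols)

print(f"dense:  {dense:,} row pairs")
print(f"sparse: {sparse:,} row pairs")
```

Under these assumptions the dense case yields on the order of 10^12 row pairs against a few hundred thousand for the sparse one, which fits the glacial progress reported above.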

--sebastian
