Posted to user@mahout.apache.org by Joseph Turian <tu...@gmail.com> on 2010/11/06 03:11:38 UTC
Mahout to find semantically related terms over a large vocabulary (>1M)?
I'm organizing a bakeoff, if you want to show off some Mahout skills
and do a controlled comparison of Mahout to other people's approaches:
Let's say I have several hundred million documents, which are very
short (only a few words). There are several million terms in the
vocabulary. What is the fastest way to find the top-k semantically
related terms for each term in the vocabulary?
If you just want to hear the results, join this group:
http://groups.google.com/group/metaoptimize-challenge-announce
If you actually want to hack some data, read this blog post:
http://metaoptimize.com/blog/2010/11/05/nlp-challenge-find-semantically-related-terms-over-a-large-vocabulary-1m/
It would be really cool to see participation from the Mahout community
in a Mahout demo, to get a controlled comparison to other
implementations.
Best,
Joseph
Re: Mahout to find semantically related terms over a large vocabulary (>1M)?
Posted by jakobitsch juergen <ts...@yahoo.com>.
hi joseph,
i'm very much interested in stuff like that. although i'm not a
mahout guru, i'd be very glad to have a working sample, because
i can see very useful things...
i'm working with large thesauri in skos-format and am sure
i could use working solutions in a couple of projects.
keep up
wkr www.turnguard.com/turnguard