Posted to dev@mahout.apache.org by Eugen Paraschiv <ha...@gmail.com> on 2010/06/13 15:20:33 UTC

Mahout and UIMA

Hi, I'm starting to use Mahout for some text analysis work, and I was
looking at the multitude of Apache projects that are out there. I have a
question regarding the relation between Mahout and Apache UIMA, another
project that seems to be dealing with machine learning and data mining.
There may not be any explicit relation, none that I could find anyway, and I
don't know if Mahout addresses or will ever address the topic of analysis
and mining of unstructured content, or if it's outside the scope of the
project. So, is this a direction Mahout will evolve towards in the
future? Thanks. Eugen.

Re: Mahout and UIMA

Posted by Isabel Drost <is...@apache.org>.

On Sun Eugen Paraschiv <ha...@gmail.com> wrote:
> Hi, I'm starting to use Mahout for some text analysis work, and I was
> looking at the multitude of Apache projects that are out there. I
> have a question regarding the relation between Mahout and Apache
> UIMA, another project that seems to be dealing with machine learning
> and data mining.

UIMA is most suited for annotating and analysing unstructured data,
e.g. text, but also images or video content. There are two ways UIMA
and Mahout might be used together:

1) Mahout operates on vectors that represent the data points. UIMA is
well suited for document analysis and annotation. It is possible to use
UIMA for document processing, adding a document writer that writes
documents to disk in a format that can be processed by Mahout.

2) UIMA supports adding your own annotators. It should be no problem to
use Mahout models and algorithms in such annotators, e.g. for document
classification.
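
As a rough illustration of case 1), the sketch below turns a document's tokens into a sparse term-frequency vector and formats it as one text line. It is plain Java and uses neither the UIMA nor the Mahout API: the class name TermVectorSketch, the in-memory dictionary and the "docId<TAB>index:count" line format are all assumptions made for this example, and the formats Mahout actually reads are different. In a real pipeline the tokens would come from UIMA annotations rather than a hard-coded list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of UIMA or Mahout.
class TermVectorSketch {

    // Maps each term to a vector index, assigned in order of first appearance.
    private final Map<String, Integer> dictionary = new HashMap<>();

    /** Builds a sparse term-frequency vector (index -> count) for one document. */
    Map<Integer, Integer> vectorize(List<String> tokens) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            String term = token.toLowerCase(Locale.ROOT);
            Integer index = dictionary.get(term);
            if (index == null) {
                index = dictionary.size();
                dictionary.put(term, index);
            }
            vector.merge(index, 1, Integer::sum);
        }
        return vector;
    }

    /** Formats a vector as "docId<TAB>index:count index:count ..." (assumed format). */
    static String toLine(String docId, Map<Integer, Integer> vector) {
        StringBuilder sb = new StringBuilder(docId).append('\t');
        boolean first = true;
        for (Map.Entry<Integer, Integer> entry : vector.entrySet()) {
            if (!first) {
                sb.append(' ');
            }
            sb.append(entry.getKey()).append(':').append(entry.getValue());
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TermVectorSketch sketch = new TermVectorSketch();
        // Stand-in for tokens produced by a document-analysis step.
        List<String> tokens = Arrays.asList("Mahout", "and", "UIMA", "and", "Lucene");
        System.out.println(toLine("doc-1", sketch.vectorize(tokens)));
    }
}
```

The dictionary is kept across calls to vectorize so that the same term gets the same index in every document, which is what lets the resulting vectors be compared with each other.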

For the first use case Mahout devs have so far relied on Lucene's
document processing capabilities - simply because there are several
Lucene devs in our community. However I have seen several projects
using UIMA for document pre-processing instead.

So far no glue code exists - would be more than welcome though.

Isabel

Re: Mahout and UIMA

Posted by Ted Dunning <te...@gmail.com>.

Eugen,

There are several very closely related projects here (from the standpoint of
Mahout).  These include Hadoop (required for scaling several Mahout
programs), Lucene (often used to collect documents), Tika (useful in
conjunction with Lucene to extract and process text) and, as you note, UIMA.

While all of these projects have something to do with data mining and
unstructured text, the dividing line is generally fairly simple: if it
has to do with the data itself or the computing platform, it is UIMA,
Lucene or Hadoop; if it has to do with the actual mathematics involved in
the data mining, Mahout does the work.

As Isabel says, there is little explicit glue code available but integrating
software from these projects is not typically very difficult.  There is a
huge variety of ways to do this, however, so it is hard to anticipate what
use cases are really important.  If you have a use case, please talk about
it.

On Sun, Jun 13, 2010 at 6:20 AM, Eugen Paraschiv <ha...@gmail.com> wrote:

> Hi, I'm starting to use Mahout for some text analysis work, and I was
> looking at the multitude of Apache projects that are out there. I have a
> question regarding the relation between Mahout and Apache UIMA, another
> project that seems to be dealing with machine learning and data mining.
> There may not be any explicit relation, none that I could find anyway, and I
> don't know if Mahout addresses or will ever address the topic of analysis
> and mining of unstructured content, or if it's outside the scope of the
> project. So, is this a direction Mahout will evolve towards in the
> future? Thanks. Eugen.
>