Posted to user@uima.apache.org by Julien Nioche <li...@gmail.com> on 2009/11/25 14:53:05 UTC

Announcement : Behemoth available on Google Code

Dear All,

Very early days, but I would like to announce a new open source project
named Behemoth, which we have put on Google Code under the Apache License (
http://code.google.com/p/behemoth-pebble/).

Behemoth lets you deploy GATE or UIMA applications over a Hadoop cluster in
order to do very large-scale document analysis. It uses a very simple
representation format which can be used as a common ground between UIMA and
GATE-generated annotations, hence achieving compatibility between both
systems. Since it is Hadoop-based, it benefits from all of Hadoop's features
(scalability, fault tolerance, etc.) and most notably the backing of a
thriving open source community. Quite a few Apache resources already do or
will fit into it: Nutch, Tika, Mahout, HBase, etc.
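
To give a rough idea of what a common ground between GATE and UIMA
annotations can look like, here is a minimal sketch of a stand-off
annotation record. This is only an illustration, not Behemoth's actual
format: the names Document, Annotation, covered_text and the "source"
feature are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: a typed span over the text."""
    type: str
    start: int  # inclusive character offset into the document text
    end: int    # exclusive character offset into the document text
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical neutral document record (not Behemoth's real class)."""
    url: str
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        # Return the slice of the text that the annotation spans.
        return self.text[ann.start:ann.end]


# Annotations produced by either system end up in the same flat list.
doc = Document(url="http://example.org/1", text="Behemoth runs on Hadoop.")
doc.annotations.append(Annotation("Token", 0, 8, {"source": "GATE"}))
doc.annotations.append(Annotation("Name", 17, 23, {"source": "UIMA"}))
```

Keeping only offsets, a type and string features means neither system's
own type machinery has to be understood by the other.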

The documentation is virtually non-existent (apart from some basic wiki
pages) but this should hopefully be fixed at some point soon. Again, the
project is at a very early stage, so do not expect anything stable. This also
means that user feedback is more likely to influence the design or
implementation. Apart from the Google Code pages for the project, the best
place to discuss Behemoth or get updates on it is the DigitalPebble user
group on http://groups.google.com/group/digitalpebble.

We've used Behemoth on a 100K-document corpus on a small Amazon EC2 cluster
with a GATE application and found that it worked fine. If you have a cluster
available and a large corpus to process with UIMA or GATE, maybe you should
give Behemoth a try?

Best regards,

Julien Nioche
-- 
DigitalPebble Ltd
http://www.digitalpebble.com