Posted to dev@uima.apache.org by John Doe <lu...@gmail.com> on 2020/12/09 14:03:00 UTC

UIMA-AS on Apache Spark

Hello,

I'm looking to scale out my NLP pipeline across a Spark cluster and was
thinking UIMA-AS may work as a solution. However, I'm not sure how this
would work in practice: in UIMA-AS you essentially start your NLP pipeline
as a service behind a message broker, and the client sends documents to
the broker using the hostname:port of the server. I don't see how that
would map onto a Spark environment.

On my local machine, I start the broker on localhost:61616 and can then
run multiple pipelines in parallel. So, in a cluster, would each machine
have to start its own broker? And how would you configure the clients to
distribute the load? It seems like you would have to start multiple
clients independently, each specifying a subset of the documents, and
tell each one to send its load to a different server. So you would need
the host:port of each service. Or is there a way to put some manager in
between that handles the distribution for you? Ideally, a single client
would make a request and the load would be distributed automatically.
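
For reference, here is roughly how my client connects today (a minimal
sketch from memory; the queue name "myServiceQueue" and the document text
are placeholders):

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.uima.aae.client.UimaAsynchronousEngine;
  import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
  import org.apache.uima.cas.CAS;

  public class UimaAsClientSketch {
    public static void main(String[] args) throws Exception {
      UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();

      // The client is wired to one broker URL and one service queue.
      Map<String, Object> appCtx = new HashMap<>();
      appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616");
      appCtx.put(UimaAsynchronousEngine.ENDPOINT, "myServiceQueue"); // placeholder
      appCtx.put(UimaAsynchronousEngine.CasPoolSize, 2);
      engine.initialize(appCtx);

      // Send one document synchronously and wait for the analysis result.
      CAS cas = engine.getCAS();
      cas.setDocumentText("Some input document."); // placeholder
      engine.sendAndReceiveCAS(cas);
      cas.release();

      engine.stop();
    }
  }

The point being: the broker URL and queue are baked into each client,
which is why I don't see how a single client could fan out to many
services.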

Re: UIMA-AS on Apache Spark

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi,

> On 9. Dec 2020, at 15:03, John Doe <lu...@gmail.com> wrote:
> 
> I'm looking to scale out my NLP pipeline across a Spark cluster and was
> thinking UIMA-AS may work as a solution. 

you can find some resources from people who have been using UIMA with Spark
on the web, e.g. here:

- https://databricks.com/session/leveraging-uima-in-spark

- https://github.com/EDS-APHP/UimaOnSpark

- https://www.slideshare.net/DavidTalby/semantic-natural-language-understanding-with-spark-uima-machine-learned-ontologies

Maybe some of these help you. The common denominator seems to be that people
leverage uimaFIT to facilitate the creation and management of the analysis
engines in Java (as opposed to juggling XML descriptors) and then let Spark
handle the scale-out.
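
As a rough sketch of that pattern (untested; the annotator and the way
results are extracted are placeholders, not a definitive recipe), one
could build the engine once per partition with uimaFIT and let Spark do
the distribution:

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.uima.analysis_engine.AnalysisEngine;
  import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
  import org.apache.uima.fit.factory.AnalysisEngineFactory;
  import org.apache.uima.fit.factory.JCasFactory;
  import org.apache.uima.jcas.JCas;

  public class UimaOnSparkSketch {
    // Placeholder annotator; a real pipeline would add annotations here.
    public static class MyAnnotator extends JCasAnnotator_ImplBase {
      @Override
      public void process(JCas jcas) { /* analysis goes here */ }
    }

    public static JavaRDD<String> analyze(JavaRDD<String> documents) {
      return documents.mapPartitions((Iterator<String> docs) -> {
        // AnalysisEngine is not serializable, so create it on the
        // executor, once per partition, rather than on the driver.
        AnalysisEngine engine =
            AnalysisEngineFactory.createEngine(MyAnnotator.class);
        JCas jcas = JCasFactory.createJCas();

        List<String> results = new ArrayList<>();
        while (docs.hasNext()) {
          jcas.reset();
          jcas.setDocumentText(docs.next());
          engine.process(jcas);
          results.add(jcas.getDocumentText()); // extract real results here
        }
        return results.iterator();
      });
    }
  }

That way, the scale-out comes from Spark partitioning the documents, and
no broker or host:port wiring is needed at all.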

Let us know if you end up using any of these approaches, or any other
approach you find or develop.

Cheers,

-- Richard