Posted to user@uima.apache.org by Martin Wunderlich <ma...@gmx.net> on 2015/04/26 10:12:05 UTC

Using UIMA to build an NLP system

Hi all, 

I am relatively new to UIMA and I was wondering whether it would be the right choice for a project that I am currently working on. In essence, this project deals with a variety of text classification problems at different levels (document level, paragraph level, sentence level) using different methods. 

To provide a concrete scenario, would UIMA be useful in modeling the following processing pipeline, given a corpus consisting of a number of text documents (a rough sketch of how I imagine the first few steps could look in code follows the list): 

- annotate each doc with meta-data extracted from it, such as publication date
- preprocess the corpus, e.g. by stopword removal and lemmatization
- save intermediate pre-processed and annotated versions of corpus (so that pre-processing has to be done only once)
- run LDA (e.g. using Mallet) on the entire training corpus to model topics, with number of topics ranging, for instance, from 50 to 100
- convert each doc to a feature vector as per the LDA model
- train and test an SVM for supervised text classification (binary classification into „relevant“ vs. „non-relevant“) using cross-validation
- store each trained SVM
- report results of CV into CSV file for further processing
- extract paragraphs from relevant documents and use for unsupervised pre-training in a deep learning architecture (built using e.g. Deeplearning4J)
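
To make this a bit more concrete, here is a rough, untested sketch (based only on skimming the uimaFIT examples) of how I imagine the first few steps could be wired up. The DKPro Core reader and writer classes and their parameter names are just my guess, and MetadataAnnotator is a placeholder for a component I would still have to write:

import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.CollectionReaderFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;

public class PreprocessingPipeline {

    // Placeholder for the metadata step (publication date etc.).
    public static class MetadataAnnotator extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) {
            // Extract e.g. the publication date and add it to the CAS
            // as an annotation or document-level feature.
        }
    }

    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
                // Read the raw corpus from disk.
                CollectionReaderFactory.createReaderDescription(TextReader.class,
                        TextReader.PARAM_SOURCE_LOCATION, "corpus/*.txt",
                        TextReader.PARAM_LANGUAGE, "en"),
                // Metadata extraction; stopword removal and lemmatization
                // would be further engine descriptions in the same list.
                AnalysisEngineFactory.createEngineDescription(MetadataAnnotator.class),
                // Persist the annotated CASes as XMI so that pre-processing
                // only has to run once.
                AnalysisEngineFactory.createEngineDescription(XmiWriter.class,
                        XmiWriter.PARAM_TARGET_LOCATION, "corpus-preprocessed/"));
    }
}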

Would UIMA be a good choice to build and manage a project like this? 
What would be the advantages of UIMA compared to using simple shell scripts for „gluing together“ the individual components? 

Thanks a lot. 

Kind regards, 

Martin

Re: Using UIMA to build an NLP system

Posted by Martin Wunderlich <ma...@gmx.net>.
Thanks so much, Petr and Mario, for your detailed views. They confirm my initial impression that the system's learning curve is not to be underestimated. I might have a look at the DKPro project to see whether it would be a suitable starting point for my project. That said, I might decide to stick with components that are very loosely coupled via scripting to get a prototype together quickly, and then move the system to UIMA once it has stabilised. It definitely seems like a system worth getting familiar with. 

Cheers, 

Martin
 


> On 26.04.2015, at 13:44, Mario Gazzo <ma...@gmail.com> wrote:
> 
> Hej Martin,
> 
> I agree with Petr. We are in the process of migrating our existing text analysis components to UIMA, coming from an approach that more closely resembles what you would call just “gluing things together”. This works well when you are initially just experimenting with rapid prototypes; in that phase I think UIMA could even get in the way if you don’t already understand it very well. However, once you need to scale the dev team and move to production, these ad-hoc approaches become a problem. A framework like UIMA gives the whole team a systematic development approach, and once you have climbed the steep learning curve I believe it can also be a faster prototyping tool, because it makes it easier to quickly combine different components into a new pipeline. An important factor for us was therefore also the diverse ecosystem of quality analysis components like DKPro, cTAKES, ClearTK etc. You can even integrate GATE components and vice versa (see https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven’t played with this myself yet.
> 
> We are not using the distributed scale-out features of UIMA but rely on various AWS services instead; it takes a bit of tinkering to figure out how to do this, but we are gradually getting there. Generally we do the unstructured NLP processing on a document-by-document basis in UIMA, but corpus-wide structured analysis is done outside UIMA with map-reduce-style approaches. That said, we are now also moving towards stream-based approaches, since we have to ingest large amounts of data continuously; running very large MR batch jobs on a daily basis is, in our case, wasteful and impractical.
> 
> I think UIMA feels a bit “old school” with all these XML descriptors, but there is a purpose behind them once you start to understand the architecture. Luckily, this is where uimaFIT comes to the rescue. We don’t use the Eclipse tools at all but integrate JCasGen with Gradle using this nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. I do wish there was direct support for Gradle as well. We don’t want to rely on IDE-specific tools ourselves, since we use both Eclipse and IntelliJ IDEA in development and we need the code generation integrated with the automated build process. The main difference is that we only need to write the type definitions in XML; for the analysis engine and pipeline descriptions we can just use uimaFIT. However, be prepared to do some digging, since not every detail is covered as well in the uimaFIT documentation as it is for the general UIMA framework. Community responses on this mailing list are a big plus, though.
> 
> Cheers
> Mario
> 
> 
>> On 26 Apr 2015, at 11:05, Petr Baudis <pa...@ucw.cz> wrote:
>> 
>> Hi!
>> 
>> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>>> To provide a concrete scenario, would UIMA be useful in modeling the following processing pipeline, given a corpus consisting of a number of text documents: 
>>> 
>>> - annotate each doc with meta-data extracted from it, such as publication date
>>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>>> - save intermediate pre-processed and annotated versions of corpus (so that pre-processing has to be done only once)
>>> - run LDA (e.g. using Mallet) on the entire training corpus to model topics, with number of topics ranging, for instance, from 50 to 100
>>> - convert each doc to a feature vector as per the LDA model
>> +
>>> - extract paragraphs from relevant documents and use for unsupervised pre-training in a deep learning architecture (built using e.g. Deeplearning4J)
>> 
>> I think up to here, UIMA would be a good choice for you.
>> 
>>> - train and test an SVM for supervised text classification (binary classification into „relevant“ vs. „non-relevant“) using cross-validation
>>> - store each trained SVM
>>> - report results of CV into CSV file for further processing
>> 
>> The moment you stop dealing with *unstructured* data and just work with
>> feature vectors and classifier objects, it's imho easier to get out of
>> UIMA, but that may not be a big deal.
>> 
>>> Would UIMA be a good choice to build and manage a project like this? 
>>> What would be the advantages of UIMA compared to using simple shell scripts for „gluing together“ the individual components? 
>> 
>> Well, UIMA provides the gluing so you don't have to do it yourself,
>> and that's not a small amount of work:
>> 
>> (i) a common container (CAS) for annotated data
>> (ii) pipeline flow control that also supports scale out
>> (iii) the DKPro project, which lets you effortlessly perform NLP
>> annotation, interface with resources, etc. using off-the-shelf NLP
>> components
>> 
>> For me, UIMA had a rather steep learning curve.  But that was largely
>> because my pipeline is highly non-linear and I didn't use the Eclipse
>> GUI tools; I would hope things go pretty smoothly in a simpler
>> scenario with a completely linear pipeline like yours.
>> 
>> P.S.: Also, use uimaFIT to build your pipeline and ignore the annotator
>> XML descriptors you see in the UIMA User Guide.  I recommend that you
>> just look at the DKPro example suite to get started quickly.
>> 
>> -- 
>> 				Petr Baudis
>> 	If you do not work on an important problem, it's unlikely
>> 	you'll do important work.  -- R. Hamming
>> 	http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
> 


Re: Using UIMA to build an NLP system

Posted by Mario Gazzo <ma...@gmail.com>.
Hej Martin,

I agree with Petr. We are in the process of migrating our existing text analysis components to UIMA, coming from an approach that more closely resembles what you would call just “gluing things together”. This works well when you are initially just experimenting with rapid prototypes; in that phase I think UIMA could even get in the way if you don’t already understand it very well. However, once you need to scale the dev team and move to production, these ad-hoc approaches become a problem. A framework like UIMA gives the whole team a systematic development approach, and once you have climbed the steep learning curve I believe it can also be a faster prototyping tool, because it makes it easier to quickly combine different components into a new pipeline. An important factor for us was therefore also the diverse ecosystem of quality analysis components like DKPro, cTAKES, ClearTK etc. You can even integrate GATE components and vice versa (see https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven’t played with this myself yet.

We are not using the distributed scale-out features of UIMA but rely on various AWS services instead; it takes a bit of tinkering to figure out how to do this, but we are gradually getting there. Generally we do the unstructured NLP processing on a document-by-document basis in UIMA, but corpus-wide structured analysis is done outside UIMA with map-reduce-style approaches. That said, we are now also moving towards stream-based approaches, since we have to ingest large amounts of data continuously; running very large MR batch jobs on a daily basis is, in our case, wasteful and impractical.

I think UIMA feels a bit “old school” with all these XML descriptors, but there is a purpose behind them once you start to understand the architecture. Luckily, this is where uimaFIT comes to the rescue. We don’t use the Eclipse tools at all but integrate JCasGen with Gradle using this nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. I do wish there was direct support for Gradle as well. We don’t want to rely on IDE-specific tools ourselves, since we use both Eclipse and IntelliJ IDEA in development and we need the code generation integrated with the automated build process. The main difference is that we only need to write the type definitions in XML; for the analysis engine and pipeline descriptions we can just use uimaFIT. However, be prepared to do some digging, since not every detail is covered as well in the uimaFIT documentation as it is for the general UIMA framework. Community responses on this mailing list are a big plus, though.
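
To make that concrete, here is a small, untested sketch in the style I mean; the component and parameter names are invented for the example. The annotator’s configuration is injected by uimaFIT and its engine description is assembled in code instead of being maintained as descriptor XML:

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.jcas.JCas;

public class StopwordAnnotator extends JCasAnnotator_ImplBase {

    public static final String PARAM_STOPWORD_FILE = "stopwordFile";

    // uimaFIT injects this value; no <configurationParameters> XML needed.
    @ConfigurationParameter(name = PARAM_STOPWORD_FILE, mandatory = true)
    private String stopwordFile;

    @Override
    public void process(JCas jcas) {
        // Mark or remove stopword tokens in the CAS here.
    }

    // The analysis engine description is built in code instead of XML.
    public static AnalysisEngineDescription description(String stopwordFile) throws Exception {
        return AnalysisEngineFactory.createEngineDescription(StopwordAnnotator.class,
                PARAM_STOPWORD_FILE, stopwordFile);
    }
}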

Cheers
Mario


> On 26 Apr 2015, at 11:05, Petr Baudis <pa...@ucw.cz> wrote:
> 
>  Hi!
> 
> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>> To provide a concrete scenario, would UIMA be useful in modeling the following processing pipeline, given a corpus consisting of a number of text documents: 
>> 
>> - annotate each doc with meta-data extracted from it, such as publication date
>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>> - save intermediate pre-processed and annotated versions of corpus (so that pre-processing has to be done only once)
>> - run LDA (e.g. using Mallet) on the entire training corpus to model topics, with number of topics ranging, for instance, from 50 to 100
>> - convert each doc to a feature vector as per the LDA model
> +
>> - extract paragraphs from relevant documents and use for unsupervised pre-training in a deep learning architecture (built using e.g. Deeplearning4J)
> 
>  I think up to here, UIMA would be a good choice for you.
> 
>> - train and test an SVM for supervised text classification (binary classification into „relevant“ vs. „non-relevant“) using cross-validation
>> - store each trained SVM
>> - report results of CV into CSV file for further processing
> 
>  The moment you stop dealing with *unstructured* data and just work with
> feature vectors and classifier objects, it's imho easier to get out of
> UIMA, but that may not be a big deal.
> 
>> Would UIMA be a good choice to build and manage a project like this? 
>> What would be the advantages of UIMA compared to using simple shell scripts for „gluing together“ the individual components? 
> 
>  Well, UIMA provides the gluing so you don't have to do it yourself,
> and that's not a small amount of work:
> 
>  (i) a common container (CAS) for annotated data
>  (ii) pipeline flow control that also supports scale out
>  (iii) the DKPro project, which lets you effortlessly perform NLP
> annotation, interface with resources, etc. using off-the-shelf NLP
> components
> 
>  For me, UIMA had a rather steep learning curve.  But that was largely
> because my pipeline is highly non-linear and I didn't use the Eclipse
> GUI tools; I would hope things go pretty smoothly in a simpler
> scenario with a completely linear pipeline like yours.
> 
>  P.S.: Also, use uimaFIT to build your pipeline and ignore the annotator
> XML descriptors you see in the UIMA User Guide.  I recommend that you
> just look at the DKPro example suite to get started quickly.
> 
> -- 
> 				Petr Baudis
> 	If you do not work on an important problem, it's unlikely
> 	you'll do important work.  -- R. Hamming
> 	http://www.cs.virginia.edu/~robins/YouAndYourResearch.html


Re: Using UIMA to build an NLP system

Posted by Petr Baudis <pa...@ucw.cz>.
  Hi!

On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
> To provide a concrete scenario, would UIMA be useful in modeling the following processing pipeline, given a corpus consisting of a number of text documents: 
> 
> - annotate each doc with meta-data extracted from it, such as publication date
> - preprocess the corpus, e.g. by stopword removal and lemmatization
> - save intermediate pre-processed and annotated versions of corpus (so that pre-processing has to be done only once)
> - run LDA (e.g. using Mallet) on the entire training corpus to model topics, with number of topics ranging, for instance, from 50 to 100
> - convert each doc to a feature vector as per the LDA model
+
> - extract paragraphs from relevant documents and use for unsupervised pre-training in a deep learning architecture (built using e.g. Deeplearning4J)

  I think up to here, UIMA would be a good choice for you.

> - train and test an SVM for supervised text classification (binary classification into „relevant“ vs. „non-relevant“) using cross-validation
> - store each trained SVM
> - report results of CV into CSV file for further processing

  The moment you stop dealing with *unstructured* data and just work with
feature vectors and classifier objects, it's imho easier to get out of
UIMA, but that may not be a big deal.

> Would UIMA be a good choice to build and manage a project like this? 
> What would be the advantages of UIMA compared to using simple shell scripts for „gluing together“ the individual components? 

  Well, UIMA provides the gluing so you don't have to do it yourself,
and that's not a small amount of work (a small sketch of what (i) looks
like in code follows the list):

  (i) a common container (CAS) for annotated data
  (ii) pipeline flow control that also supports scale out
  (iii) the DKPro project, which lets you effortlessly perform NLP
annotation, interface with resources, etc. using off-the-shelf NLP
components
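
  To give a flavour of (i): a tiny, untested sketch of a component once
everything lives in the CAS.  Token here stands in for whatever annotation
type your own type system (or an existing component's) defines, so this
won't compile as-is:

import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.Level;

public class TokenCounter extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jcas) {
        // Every component reads and writes the same CAS, so annotations
        // added by earlier pipeline steps are simply queried here.
        int n = JCasUtil.select(jcas, Token.class).size();
        getContext().getLogger().log(Level.INFO, "Document has " + n + " tokens");
    }
}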

  For me, UIMA had a rather steep learning curve.  But that was largely
because my pipeline is highly non-linear and I didn't use the Eclipse
GUI tools; I would hope things go pretty smoothly in a simpler
scenario with a completely linear pipeline like yours.

  P.S.: Also, use uimaFIT to build your pipeline and ignore the annotator
XML descriptors you see in the UIMA User Guide.  I recommend that you
just look at the DKPro example suite to get started quickly.
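
  P.P.S.: For illustration, the DKPro-style quick start looks roughly like
this with uimaFIT (untested; class, type and parameter names are from
memory, so double-check them against the DKPro examples, and the segmenter
also needs its model artifact on the classpath):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
import static org.apache.uima.fit.util.JCasUtil.select;

import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence;
import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class DkproQuickStart {
    public static void main(String[] args) throws Exception {
        // Off-the-shelf DKPro components are dropped in with one line each;
        // no annotator XML descriptors needed.
        for (JCas jcas : SimplePipeline.iteratePipeline(
                createReaderDescription(TextReader.class,
                        TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                        TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(OpenNlpSegmenter.class))) {
            System.out.println(select(jcas, Sentence.class).size() + " sentences");
        }
    }
}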

-- 
				Petr Baudis
	If you do not work on an important problem, it's unlikely
	you'll do important work.  -- R. Hamming
	http://www.cs.virginia.edu/~robins/YouAndYourResearch.html