Posted to user@uima.apache.org by Nicolas Hernandez <ni...@gmail.com> on 2008/01/31 18:00:59 UTC

Processing collections as a set of documents

Hi,

Making my first CPE, I am wondering how to deal with NLP tasks that
aim at processing several documents at a time (i.e. a pair or a
collection of documents considered as a single entity). I am thinking
of applications such as (multilingual) text alignment, term
extraction based on measures over a corpus, or text clustering (how to
compare one document with a set of documents)... Such applications
require handling a CAS over a kind of "collection artefact".

As far as I can see, UIMA only provides the concepts of Annotation
(a within-document description) and DocumentAnnotation.
I can imagine that some solutions are possible with CAS Consumers or
CAS Multipliers, but that feels like hacking UIMA.

Does anyone have experience with similar goals in UIMA? How do you
handle them? Is there something dedicated in UIMA for working with a
"collection artefact"?

Thanks

/Nicolas

-- 
Nicolas.Hernandez@univ-nantes.fr
--
# Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
# Institut Universitaire de Technologie de Nantes - Département Informatique
tel. +33 (0)2 40 30 60 67

Re: Processing collections as a set of documents

Posted by David Buttler <bu...@llnl.gov>.
In terms of clustering, we have only done incremental clustering in 
UIMA. Essentially, the clustering component keeps track of all of the 
clusters, and as new documents come in, updates the appropriate cluster. 
Other types of clustering we do externally. 
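A minimal sketch of that stateful pattern in plain Java (all class and
method names below are invented for illustration; in a UIMA annotator
this state would live in the component instance across its process()
calls):

```java
import java.util.*;

// Sketch of incremental clustering: keep cluster centroids in memory,
// assign each incoming document to the nearest one, or open a new
// cluster if nothing is close enough. All names are illustrative.
public class IncrementalClusterer {
    private final List<Map<String, Double>> centroids = new ArrayList<>();
    private final List<Integer> sizes = new ArrayList<>();
    private final double threshold;

    public IncrementalClusterer(double threshold) {
        this.threshold = threshold;
    }

    // Called once per document, e.g. from an annotator's process() method.
    // Returns the index of the cluster the document was assigned to.
    public int add(Map<String, Double> docVector) {
        int best = -1;
        double bestSim = threshold;
        for (int i = 0; i < centroids.size(); i++) {
            double sim = cosine(centroids.get(i), docVector);
            if (sim >= bestSim) { bestSim = sim; best = i; }
        }
        if (best < 0) {                    // no cluster close enough: open a new one
            centroids.add(new HashMap<>(docVector));
            sizes.add(1);
            return centroids.size() - 1;
        }
        // Update the chosen centroid as a running mean over its members.
        Map<String, Double> c = centroids.get(best);
        int n = sizes.get(best);
        Set<String> keys = new HashSet<>(c.keySet());
        keys.addAll(docVector.keySet());
        for (String k : keys) {
            double v = (c.getOrDefault(k, 0.0) * n
                        + docVector.getOrDefault(k, 0.0)) / (n + 1);
            c.put(k, v);
        }
        sizes.set(best, n + 1);
        return best;
    }

    public int clusterCount() { return centroids.size(); }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}
```

The CAS consumer (or annotator) would build the term vector from the
annotations in each CAS and call add(); only the clusterer's internal
state spans the collection.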

We are still stuck on UIMA 1.4, so the hacks we use are probably not 
appropriate for the more recent versions. 

Some ideas: if you are processing a parallel corpus (e.g. for machine 
translation), the reader could create a single CAS for each pair of 
documents, with the different languages going into different sofas. Your 
subsequent components would then have to know how to deal with the 
different sofas (e.g. by only running an English POS tagger on the 
English sofa).
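A UIMA-free sketch of that idea, runnable as plain Java (the class is
a stand-in for illustration only; in real UIMA the unit would be a CAS,
and the operations below correspond to cas.createView(name) /
cas.getView(name) plus setDocumentText(...) and setDocumentLanguage(...)
on each view):

```java
import java.util.*;

// Stand-in for "one CAS per document pair": the reader bundles both
// languages into a single unit keyed by view name, and a downstream
// component only touches the view whose language it understands.
public class PairedDocument {
    private final Map<String, String> viewText = new LinkedHashMap<>();
    private final Map<String, String> viewLanguage = new LinkedHashMap<>();

    // Reader side: one call per language of the aligned pair
    // (analogous to cas.createView(name) + setDocumentText/Language).
    public void createView(String name, String text, String language) {
        viewText.put(name, text);
        viewLanguage.put(name, language);
    }

    public String getText(String name) { return viewText.get(name); }
    public String getLanguage(String name) { return viewLanguage.get(name); }
    public Set<String> viewNames() { return viewText.keySet(); }

    // Component side: e.g. an English-only tokenizer that leaves
    // views in other languages untouched.
    public List<String> tokenizeView(String name, String wantedLanguage) {
        if (!wantedLanguage.equals(viewLanguage.get(name))) {
            return Collections.emptyList();   // not our language, skip it
        }
        return Arrays.asList(viewText.get(name).split("\\s+"));
    }
}
```

The point of the design is that the pair travels through the pipeline
as one unit, so alignment components can see both sides at once while
monolingual components simply select their view.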

For multi pass algorithms, I would suggest different pipelines: one to 
collect corpus statistics, and the next pipeline to use them.  If 
incremental statistics are sufficient, then you can just create an 
internal data structure to manage collection statistics as you stream 
through your collection.
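The incremental-statistics variant can be sketched like this (plain
Java, names invented for illustration, not a UIMA API): a shared data
structure is updated once per document as the collection streams
through, and components can then query document frequency or IDF.

```java
import java.util.*;

// Shared corpus statistics accumulated over a streaming pass.
// In a UIMA pipeline this object would be held by the component
// instance (or shared between a first-pass and second-pass pipeline).
public class CorpusStats {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int docCount = 0;

    // Called once per document with its terms; duplicates within a
    // document are collapsed so document frequency counts documents.
    public void addDocument(Collection<String> terms) {
        docCount++;
        for (String t : new HashSet<>(terms)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    public int documentFrequency(String term) {
        return docFreq.getOrDefault(term, 0);
    }

    // Smoothed IDF so unseen terms do not divide by zero.
    public double idf(String term) {
        return Math.log((1.0 + docCount) / (1.0 + documentFrequency(term)));
    }

    public int documents() { return docCount; }
}
```

For the two-pipeline variant, the first pipeline's CAS consumer would
serialize this structure to disk, and the second pipeline's components
would load it as a resource before processing.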


Dave

Nicolas Hernandez wrote:
> [...]