Posted to dev@uima.apache.org by Eddie Epstein <ea...@gmail.com> on 2010/11/15 04:49:17 UTC

Re: Collection processing

Is the analysis of each document to be done independently of
the others? For example, should annotation offsets be relative
to the beginning of each document? If not, the documents can be
concatenated together and analyzed at the same time.
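
To illustrate the offset question, here is a minimal plain-Java sketch of
the concatenation approach. It is not UIMA code: the class name
ConcatenatedDocuments and its methods are hypothetical, and the newline
separator between documents is an assumption. It only shows how an offset
into the concatenated text could be mapped back to a document and a
document-relative offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: joins documents and remembers where each one starts. */
class ConcatenatedDocuments {
    private final StringBuilder text = new StringBuilder();
    private final List<Integer> starts = new ArrayList<>();

    /** Appends a document, followed by a newline separator (an assumption). */
    void add(String document) {
        starts.add(text.length());
        text.append(document).append('\n');
    }

    /** Returns the full concatenated text that would be analyzed in one pass. */
    String text() {
        return text.toString();
    }

    /** Returns the index of the document that contains the given global offset. */
    int documentIndexAt(int globalOffset) {
        if (globalOffset < 0 || globalOffset >= text.length()) {
            throw new IndexOutOfBoundsException("offset " + globalOffset);
        }
        int index = 0;
        // Start offsets are ascending, so the last start <= offset wins.
        for (int i = 0; i < starts.size(); i++) {
            if (starts.get(i) <= globalOffset) {
                index = i;
            }
        }
        return index;
    }

    /** Converts a global offset into an offset relative to its own document. */
    int localOffset(int globalOffset) {
        return globalOffset - starts.get(documentIndexAt(globalOffset));
    }
}
```

With this layout a separator character is attributed to the document that
precedes it; whether that is acceptable depends on the analysis.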

If the documents are to be considered independently, the
annotator has to process each separately. One could
create a view for each document and let the annotator
iterate over all views. Of course, since the CAS is
memory-resident, there is a natural limit to the total size
of all documents that can be processed in this way.

On Sun, Nov 14, 2010 at 10:10 AM, Drenski <mi...@yahoo.com> wrote:
> Hi,
> I am new to UIMA and I have been struggling for some time
> with the following problem.
> I have some documents, which I need to process simultaneously.
> So I implemented a collection reader, which reads all the files
> from a directory and annotates them as Documents. But how can
> I put all these files in an array, for example, so that I can
> iterate over them and do my further processing? Basically I
> just want to fetch the files from the directory and put
> them in an array so that I can process them.
> Is a CAS consumer what I need? I saw in the docs that
> it is now deprecated. Or should I use some index like Lucene?
> But I guess this will be too complex for my simple task.
> I would appreciate any suggestions.
> Regards,
> Drenski
>
>

Re: Collection processing

Posted by Drenski <mi...@yahoo.com>.

Thank you very much, Eddie, I will try your approach.
Milen

Re: Collection processing

Posted by Eddie Epstein <ea...@gmail.com>.

On Mon, Nov 15, 2010 at 5:36 AM, Drenski <mi...@yahoo.com> wrote:
> My goal is to do some clustering of those
> documents. As input for this clustering
> I need a list of feature vectors, where each
> feature vector represents a single
> document. I implemented the clustering as
> an annotator. So my first guess was to use
> a collection reader to read these documents
> and put each document in a list which I can
> use for the clustering. But I can't figure out
> where and how to store those documents so that
> I can use them after all of them are read,
> because the collection reader reads one document
> and then sends it to the annotator.
> Regards,
> Drenski
>

One way to do this would be to have the collection reader put
a single document into each CAS; an annotator would then
process the document into a feature vector and put it into
the CAS; a final annotator (a CAS consumer) would read
the feature vector from each CAS and store it in a local
array. Once all documents have been processed,
collectionProcessComplete() would tell the final annotator
to do the clustering step.
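
The accumulate-then-cluster pattern described above can be sketched in
plain Java as follows. This is only an illustration under stated
assumptions, not UIMA code: the class FeatureVectorCollector is
hypothetical, its two methods merely mirror the names of the UIMA
callbacks process() and collectionProcessComplete(), and the "clustering"
is a trivial stand-in that assigns each vector to the index of its
largest component.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical final component: gathers one vector per document, clusters at the end. */
class FeatureVectorCollector {
    private final List<double[]> vectors = new ArrayList<>();
    private final List<Integer> assignments = new ArrayList<>();

    /** Called once per document: keep a copy of that document's feature vector. */
    void process(double[] featureVector) {
        vectors.add(featureVector.clone());
    }

    /** Called once after the last document: run the clustering step on all vectors. */
    void collectionProcessComplete() {
        assignments.clear();
        for (double[] vector : vectors) {
            assignments.add(argMax(vector));
        }
    }

    /** Stand-in for a real clustering algorithm: index of the largest component. */
    private static int argMax(double[] vector) {
        int best = 0;
        for (int i = 1; i < vector.length; i++) {
            if (vector[i] > vector[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Number of feature vectors collected so far. */
    int size() {
        return vectors.size();
    }

    /** Cluster assignment per document; empty until collectionProcessComplete() has run. */
    List<Integer> assignments() {
        return Collections.unmodifiableList(assignments);
    }
}
```

In a real pipeline the vector would be read from the CAS inside the
annotator's process method rather than passed in directly, and a real
clustering algorithm would replace argMax.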

Eddie

Re: Collection processing

Posted by Drenski <mi...@yahoo.com>.

Thank you for your reply!
My goal is to do some clustering of those
documents. As input for this clustering
I need a list of feature vectors, where each
feature vector represents a single
document. I implemented the clustering as
an annotator. So my first guess was to use
a collection reader to read these documents
and put each document in a list which I can
use for the clustering. But I can't figure out
where and how to store those documents so that
I can use them after all of them are read,
because the collection reader reads one document
and then sends it to the annotator.
Regards,
Drenski