Posted to user@uima.apache.org by Radwen ANIBA <ar...@gmail.com> on 2009/06/25 11:35:53 UTC

Using the Cas to compare documents

Hi everyone,

The example applications that come with UIMA help us understand how
each component of the UIMA framework works. That's great. But one question
a developer may ask is how to use the CAS to compare analyzed
documents.

The CAS is common to every document, and when analyzing one of them we have
access to the CAS for writing or updating.
Let's imagine we have 3 documents to analyze. We write metadata about each
of them to the CAS, but to go further in the analysis of the documents
it could be very interesting to compare these documents using the CAS,
either all together or pairwise.

To illustrate what I'm saying, let's imagine we are looking for email
addresses inside three big documents using UIMA's regexp capabilities.
A result might look like this:

Document 1: Number of unique emails: 9
            | Number of emails in common with Document 2: 10
            | Number of emails in common with Document 3: 6
Document 2: Number of unique emails: 5
            | Number of emails in common with Document 1: 20
            | Number of emails in common with Document 3: 1
Document 3: Number of unique emails: 4
            | Number of emails in common with Document 1: 15
            | Number of emails in common with Document 2: 3

This is a simple pairwise cross-comparison of documents using the CAS. My
question is: how can this be achieved?
Do we need to create an additional type system for the common information?
Would we have to do it dynamically, on the fly?

Thanks

Rad

Re: Using the Cas to compare documents

Posted by Radwen ANIBA <ar...@gmail.com>.
Thank you Thilo,

Well, I will investigate this idea.

Regards

Rad

2009/6/25 Thilo Goetz <tw...@gmx.de>

> Hi Rad,
>
> Using the CAS to do this will get expensive very quickly.  You will
> not want to keep every document in its own CAS because of the memory
> overhead.  I would probably write the information you're interested
> in to an external datastore (e.g., a DB such as Derby) and do the
> comparison there.
>
> --Thilo
>

Re: Using the Cas to compare documents

Posted by Thilo Goetz <tw...@gmx.de>.
Hi Rad,

Using the CAS to do this will get expensive very quickly.  You will
not want to keep every document in its own CAS because of the memory
overhead.  I would probably write the information you're interested
in to an external datastore (e.g., a DB such as Derby) and do the
comparison there.

--Thilo
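
The suggestion above (keep only the extracted values outside the CAS and
compare them there) can be sketched roughly as follows. This is a minimal
illustration, not UIMA code: an in-memory map stands in for the external
datastore (Derby or otherwise), the class name EmailOverlap, the helper
methods, the sample texts and the simplified e-mail pattern are all
assumptions made for the example. In a real pipeline the per-document sets
would presumably be filled from the annotations an annotator produces, one
document at a time, so that no CAS has to be kept in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects e-mail addresses per document and
// counts how many addresses each pair of documents has in common.
class EmailOverlap {

    // Deliberately simplified pattern (assumption); real e-mail syntax is broader.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns the distinct, lower-cased addresses found in the text.
    static Set<String> extractEmails(String text) {
        Set<String> emails = new TreeSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            emails.add(m.group().toLowerCase(Locale.ROOT));
        }
        return emails;
    }

    // Size of the intersection of two address sets; inputs are not modified.
    static int countCommon(Set<String> a, Set<String> b) {
        Set<String> common = new TreeSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        // Stand-in for the external store: document id -> addresses found in it.
        Map<String, Set<String>> store = new LinkedHashMap<>();
        store.put("doc1", extractEmails("Contact alice@example.org or bob@example.org."));
        store.put("doc2", extractEmails("bob@example.org wrote to carol@example.org."));
        store.put("doc3", extractEmails("No overlap here: dave@example.org."));

        // Compare every unordered pair of documents exactly once.
        List<String> ids = new ArrayList<>(store.keySet());
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                int common = countCommon(store.get(ids.get(i)), store.get(ids.get(j)));
                System.out.println(ids.get(i) + " vs " + ids.get(j) + ": " + common + " in common");
            }
        }
    }
}
```

With a database instead of the map, the same comparison would likely become
a join on a (document id, address) table; the set intersection here is only
the simplest way to show the idea.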