Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2011/10/01 00:36:33 UTC

Re: CAS Id

slight generalization...

For texts, the CAS == unit of work flowing in UIMA == (typically) a "document"

But, UIMA is used for other kinds of unstructured data, such as audio, video,
image, etc.  In this case the CAS == unit of work flowing in UIMA != a "document"...

We might want to consider more generic naming, because of this, like Jörn's
"CasId".  So in the following a name like CAS.setId() or CAS.setIdUri() might be
better (dropping "Document").

-Marshall


On 9/30/2011 10:59 AM, Richard Eckart de Castilho wrote:
> I always thought that a CAS.setDocumentUri() would have been helpful. In the beginning I mistook setSofaDataUri() for such a thing and was quite surprised to find that once I set it, I could no longer set the document text.
>
> So how about adding a setDocumentUri() method to CAS?
>
> From the experience with our own type system which supports such things, we find that it is also very useful to have a documentBaseUri for cases where recursive processing is taking place. I find a simple ID is not enough in many cases, e.g. when recursively reading files from one directory and writing them to another one while preserving the relative hierarchy.
>
> So a setDocumentBaseUri() in my opinion would also be desirable.
>
> Cheers,
>
> -- Richard
>
> On 30.09.2011 at 16:53, Jörn Kottmann wrote:
>
>> On 9/30/11 4:38 PM, Marshall Schor wrote:
>>> Can you say a bit more what this is?
>>>
>> Sure. The intent of the ID field is to link a CAS instance to an entity
>> in another system.
>>
>> Let's say we have an application where a UIMA analysis pipeline processes
>> documents that are stored in a database. There you need to write the IDs
>> of the documents into the CAS; otherwise it is not possible to write the
>> analysis results back to the database.
>>
>> So typically your collection reader or the first AE in the pipeline will
>> set the ID, and the last AE in the pipeline will use it again to save the
>> analysis results.
>>
>> Currently you always need to define an FS which holds your custom ID, but
>> I guess a generic string ID field would be just fine for almost any use
>> case.
>>
>> Jörn
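
The two use cases quoted above (writing results back under an external ID, and preserving a relative directory hierarchy via a base URI) can be sketched in plain Java. This is only an illustration under stated assumptions: ToyCas, read, writeBack and targetFor are hypothetical names invented for this sketch and are not part of the UIMA API, and the "analysis" is just a whitespace token count. Only java.net.URI and java.util are used.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class CasIdSketch {

    /** Hypothetical stand-in for a CAS. This is NOT the UIMA CAS interface. */
    static final class ToyCas {
        final URI id;       // assumed counterpart of the proposed setId()/setIdUri()
        final URI baseUri;  // assumed counterpart of the proposed setDocumentBaseUri()
        final String text;  // the data being analysed

        ToyCas(URI id, URI baseUri, String text) {
            this.id = id;
            this.baseUri = baseUri;
            this.text = text;
        }
    }

    /** Reader step: the first component attaches the external ID and base URI. */
    static ToyCas read(URI baseUri, URI id, String text) {
        return new ToyCas(id, baseUri, text);
    }

    /** Last step: store a toy result (whitespace token count) under the ID. */
    static void writeBack(ToyCas cas, Map<URI, Integer> store) {
        store.put(cas.id, cas.text.trim().split("\\s+").length);
    }

    /** Maps the ID into another tree, keeping its path relative to the base URI. */
    static URI targetFor(ToyCas cas, URI targetBase) {
        URI relative = cas.baseUri.relativize(cas.id);
        return targetBase.resolve(relative);
    }

    public static void main(String[] args) {
        ToyCas cas = read(URI.create("file:/data/in/"),
                          URI.create("file:/data/in/sub/a.txt"),
                          "hello world");

        // The map plays the role of the external database keyed by document ID.
        Map<URI, Integer> store = new HashMap<>();
        writeBack(cas, store);

        System.out.println(store);
        System.out.println(targetFor(cas, URI.create("file:/data/out/")));
    }
}
```

With only an opaque string ID, writeBack would still work, but targetFor would have nothing to relativize against, which is the gap the base URI is meant to fill.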

Re: CAS Id

Posted by Richard Eckart de Castilho <ec...@tk.informatik.tu-darmstadt.de>.
On 01.10.2011 at 00:36, Marshall Schor wrote:

> slight generalization...
> 
> For texts, the CAS == unit of work flowing in UIMA == (typically) a "document"
> 
> But, UIMA is used for other kinds of unstructured data, such as audio, video,
> image, etc.  In this case the CAS == unit of work flowing in UIMA != a "document"...

I am not a native speaker of English, but I suppose one could argue that there are "video documents" and "audio documents".

> We might want to consider more generic naming, because of this, like Jörn's
> "CasId".  So in the following a name like CAS.setId() or CAS.setIdUri() might be
> better (dropping "Document").
> 
> -Marshall

I would like to know if the ID should be per Sofa (per view) or really per (Base?)CAS, i.e. identifying the actual overall underlying data container.

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckartde@tk.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------